HOW TO DELETE IN KEY-VALUE STORE - Programming版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Programming版 - HOW TO DELETE IN KEY-VALUE STORE

相关主题
● 问个C++中重复删除指针的问题	● 好热闹啊
● 为什么Cache LRU多用doubly linked list而不是single linked list来实现呢?	● Indiana大学的牛人
● 没有data store的model有什么用吗？	● 说说12306需要多少台机器
● 小公司的网站也要用memcached之类的cache吗？	● mongo,dynamo,cassandra,hbase ,谁会是赢家，谁会落寞？
● 我的一个客户案例（high traffic），请大家批判分析指点	● 分布式系统back end需要哪些知识？
● git的官方文档真叫一个烂	● Spark 和 Dynamodb 之间如何连接
● Android memory leak	● node- mongoose的思考
● linux 能查到 deleted file list 吗	● 有多少人的公司再用卡三爪？跟呆拿魔比有什么优势么？

相关话题的讨论汇总
话题: delete话题: log话题: value话题: key话题: store

进入Programming版参与讨论

(共1页)

k****r
发帖数: 807

To DANIU men,
I am trying to study distributed key-value store. And have a question.
Like Dynamo and Voldemort, the data is append-only written, and the index is
caches. I understand the write and retrieve procedures. However, how to
deal with delete in these systems?
Thanks,

g*****g
发帖数: 34805

Isn't delete just a write? Delete operation is appended in commit log,
during compaction, the row is removed.

is

【在 k****r 的大作中提到】

: To DANIU men,
: I am trying to study distributed key-value store. And have a question.
: Like Dynamo and Voldemort, the data is append-only written, and the index is
: caches. I understand the write and retrieve procedures. However, how to
: deal with delete in these systems?
: Thanks,

w***g
发帖数: 5958

我不知道具体系统怎么实现的。log structure storage的惯用做法应该是
先把对应的记录定位了，打上叉叉表示删掉了。（能search就能定位记录)
然后用打叉叉的记录多了以后用garbage collection来回收空间。

is

【在 k****r 的大作中提到】

k****r
发帖数: 807

re: Isn't delete just a write? Delete operation is appended in commit log,
during compaction, the row is removed.
If delete is one special type of write, can I understand the it as a rewrite
on a key?
BTW, how to swap the page cache for keys? Lets say, the older data file and
index file are like keyA-valueA1, keyB-valueB1, keyC-valueC1, and the new
coming ones are: keyA-valueA2, keyC-valueC2. Then, what is swap doing on
cache? it should still be like keyA...keyB...keyC, but only value of A and B
are updated, right? Then, if the new operation is delB, the keyB value is
set as a special value?
re: 我不知道具体系统怎么实现的。log structure storage的惯用做法应该是
先把对应的记录定位了，打上叉叉表示删掉了。（能search就能定位记录)
然后用打叉叉的记录多了以后用garbage collection来回收空间。
garbage collection is operated during SWAP to new version of data?

g*****g
发帖数: 34805

Every row/column has a flag, you mark it as invalid, that's a delete. All
write operations will also invalidate and/or update the cache for given row.

rewrite
and
B

【在 k****r 的大作中提到】

: re: Isn't delete just a write? Delete operation is appended in commit log,
: during compaction, the row is removed.
: If delete is one special type of write, can I understand the it as a rewrite
: on a key?
: BTW, how to swap the page cache for keys? Lets say, the older data file and
: index file are like keyA-valueA1, keyB-valueB1, keyC-valueC1, and the new
: coming ones are: keyA-valueA2, keyC-valueC2. Then, what is swap doing on
: cache? it should still be like keyA...keyB...keyC, but only value of A and B
: are updated, right? Then, if the new operation is delB, the keyB value is
: set as a special value?

w***g
发帖数: 5958

按goodbug说的，日志里就是
keyA-valueA1
keyB-valueB1
keyC-valueC1
keyA-valueA2,
keyC-valueC2.
delB
读的时候往回找，先找到啥是啥。看到delB就表示B已经没了。
这就真是纯log structure了。
不过我没想明白on-disk索引怎么做，所以说了打叉叉那个办法。
log structure当时被想出来的假设是大内存能够保证足够高的cache hit rate，
所以读磁盘的效率差并不重要，而按log的顺序写入能保证最大化写入吞吐量。
cache在内存中，更新和log顺序没有关系。如果索引在内存中，也和log没关系。
其实以后SSD多了，顺序写的优势也就不再那么重要了。

rewrite
and
B

【在 k****r 的大作中提到】

k****r
发帖数: 807

Thank you for your reply. And it should make sense, especially for the
system with the LRU cache.
However, I don't remember there is related field in Dynamo/Voldemort systems
. Also, I though its caching for versions is not similar with LRU....

row.

【在 g*****g 的大作中提到】

: Every row/column has a flag, you mark it as invalid, that's a delete. All
: write operations will also invalidate and/or update the cache for given row.
:
: rewrite
: and
: B

w***g
发帖数: 5958

log-structured file system是1990年左右出现的，在HDD时代是一个很牛B的发现。
其实现在SSD盛行后已经没啥优势了。之所有目前盛行，我觉得主要是industry还没缓
过劲来，
还有就是实现比较容易。类似的，event-vs-thread以及nginx的牛B是前multi-core时
代的事情。
现在核越来越多了，单核计算能力也越来越强，未来的方向是明显利于thread的。
但是industry还没缓过劲来，所以会阶段性地出现continuation passing style大行其
道的局面。
node.js之类的应该火不了几年了。

【在 w***g 的大作中提到】

: 按goodbug说的，日志里就是
: keyA-valueA1
: keyB-valueB1
: keyC-valueC1
: keyA-valueA2,
: keyC-valueC2.
: delB
: 读的时候往回找，先找到啥是啥。看到delB就表示B已经没了。
: 这就真是纯log structure了。
: 不过我没想明白on-disk索引怎么做，所以说了打叉叉那个办法。

k****r
发帖数: 807

Also, what is the relations between the log, and the data in disk/memory.'
Thanks,

g*****g
发帖数: 34805

分布式系统里删除特殊的地方在于不是所有节点都会收到这条指令，可能会丢。如果你
真的删除了数据，你怎么确定是一个写操作这个节点没收到，还是这个删除别的节点没
收到？
所以 Cassandra一类的系统是写入一个特殊值叫 tombstone，只在一定时间之后做
compaction，这样出错的概率就很低。

【在 w***g 的大作中提到】

相关主题
● git的官方文档真叫一个烂	● 好热闹啊
● Android memory leak	● Indiana大学的牛人
● linux 能查到 deleted file list 吗	● 说说12306需要多少台机器
进入Programming版参与讨论

g*****g
发帖数: 34805

Every system is different, my knowledge is on Cassandra and it's not
necessary accurate for other system. Think of it as one possible solution.

systems

【在 k****r 的大作中提到】

: Thank you for your reply. And it should make sense, especially for the
: system with the LRU cache.
: However, I don't remember there is related field in Dynamo/Voldemort systems
: . Also, I though its caching for versions is not similar with LRU....
:
: row.

w***g
发帖数: 5958

传统数据库的数据主要存在一个磁盘上的B+树(或者hash)表中，log是一个为了保证数据
完整性的机制。更新B+树之前先把对应的操作写入log中。这样如果B+树更新到一半系统
断电导致数据结构损坏，可以通过replay log的办法重建B+树。后来人们发现其实所有
的数据都在log里，要用的时候去log里也能找出数据来，就觉得可以把B+树扔掉了。
然后log就变成了磁盘上唯一的数据结构。数据库的log和一般程序的日志不一样。
数据库的log存的是数据本身，所以必须存在于磁盘上。内存中只能是索引或者cache。

【在 k****r 的大作中提到】

: Also, what is the relations between the log, and the data in disk/memory.'
: Thanks,

k****r
发帖数: 807

Many thanks for DANIUMEN!!! I think I still have a lot of things to learn:)

w**z
发帖数: 8232

In case of C*, make sure to run repair at least once within GC grade period.
Otherwise, tombstones may come back to live. As goodbug said, deletes might
not get to all the nodes because of the eventual consistency, and repair
will fix that.
Read this blog if you want to know more, delete in distributes system is
tricky.
http://thelastpickle.com/blog/2011/05/15/Deletes-and-Tombstones

【在 g*****g 的大作中提到】

: 分布式系统里删除特殊的地方在于不是所有节点都会收到这条指令，可能会丢。如果你
: 真的删除了数据，你怎么确定是一个写操作这个节点没收到，还是这个删除别的节点没
: 收到？
: 所以 Cassandra一类的系统是写入一个特殊值叫 tombstone，只在一定时间之后做
: compaction，这样出错的概率就很低。

w**z
发帖数: 8232

commit log is for durability. C* writes to memtable along with commit log.
In case node crashes before the memtable is flushed to disk, it will recover
from commit log.
For reads, it doesn't go to commit log, it goes to memtable and sstable and
merge them.

【在 w***g 的大作中提到】

k****r
发帖数: 807

Nice information. Let me study first :P

period.
might

【在 w**z 的大作中提到】

: In case of C*, make sure to run repair at least once within GC grade period.
: Otherwise, tombstones may come back to live. As goodbug said, deletes might
: not get to all the nodes because of the eventual consistency, and repair
: will fix that.
: Read this blog if you want to know more, delete in distributes system is
: tricky.
: http://thelastpickle.com/blog/2011/05/15/Deletes-and-Tombstones

(共1页)

进入Programming版参与讨论

相关主题
● 有多少人的公司再用卡三爪？跟呆拿魔比有什么优势么？	● 我的一个客户案例（high traffic），请大家批判分析指点
● 为什么大牛说hbase是strong consistency的？	● git的官方文档真叫一个烂
● 关于aws问goodbug老师一个问题	● Android memory leak
● AWS早上有挂了	● linux 能查到 deleted file list 吗
● 问个C++中重复删除指针的问题	● 好热闹啊
● 为什么Cache LRU多用doubly linked list而不是single linked list来实现呢?	● Indiana大学的牛人
● 没有data store的model有什么用吗？	● 说说12306需要多少台机器
● 小公司的网站也要用memcached之类的cache吗？	● mongo,dynamo,cassandra,hbase ,谁会是赢家，谁会落寞？

相关话题的讨论汇总
话题: delete话题: log话题: value话题: key话题: store

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天