k****r 发帖数: 807 | 1 To DANIU men,
I am trying to study distributed key-value store. And have a question.
Like Dynamo and Voldemort, the data is append-only written, and the index is
caches. I understand the write and retrieve procedures. However, how to
deal with delete in these systems?
Thanks, | g*****g 发帖数: 34805 | 2 Isn't delete just a write? Delete operation is appended in commit log,
during compaction, the row is removed.
is
【在 k****r 的大作中提到】 : To DANIU men, : I am trying to study distributed key-value store. And have a question. : Like Dynamo and Voldemort, the data is append-only written, and the index is : caches. I understand the write and retrieve procedures. However, how to : deal with delete in these systems? : Thanks,
| w***g 发帖数: 5958 | 3 我不知道具体系统怎么实现的。log structure storage的惯用做法应该是
先把对应的记录定位了,打上叉叉表示删掉了。(能search就能定位记录)
然后用打叉叉的记录多了以后用garbage collection来回收空间。
is
【在 k****r 的大作中提到】 : To DANIU men, : I am trying to study distributed key-value store. And have a question. : Like Dynamo and Voldemort, the data is append-only written, and the index is : caches. I understand the write and retrieve procedures. However, how to : deal with delete in these systems? : Thanks,
| k****r 发帖数: 807 | 4 re: Isn't delete just a write? Delete operation is appended in commit log,
during compaction, the row is removed.
If delete is one special type of write, can I understand the it as a rewrite
on a key?
BTW, how to swap the page cache for keys? Lets say, the older data file and
index file are like keyA-valueA1, keyB-valueB1, keyC-valueC1, and the new
coming ones are: keyA-valueA2, keyC-valueC2. Then, what is swap doing on
cache? it should still be like keyA...keyB...keyC, but only value of A and B
are updated, right? Then, if the new operation is delB, the keyB value is
set as a special value?
re: 我不知道具体系统怎么实现的。log structure storage的惯用做法应该是
先把对应的记录定位了,打上叉叉表示删掉了。(能search就能定位记录)
然后用打叉叉的记录多了以后用garbage collection来回收空间。
garbage collection is operated during SWAP to new version of data? | g*****g 发帖数: 34805 | 5 Every row/column has a flag, you mark it as invalid, that's a delete. All
write operations will also invalidate and/or update the cache for given row.
rewrite
and
B
【在 k****r 的大作中提到】 : re: Isn't delete just a write? Delete operation is appended in commit log, : during compaction, the row is removed. : If delete is one special type of write, can I understand the it as a rewrite : on a key? : BTW, how to swap the page cache for keys? Lets say, the older data file and : index file are like keyA-valueA1, keyB-valueB1, keyC-valueC1, and the new : coming ones are: keyA-valueA2, keyC-valueC2. Then, what is swap doing on : cache? it should still be like keyA...keyB...keyC, but only value of A and B : are updated, right? Then, if the new operation is delB, the keyB value is : set as a special value?
| w***g 发帖数: 5958 | 6 按goodbug说的,日志里就是
keyA-valueA1
keyB-valueB1
keyC-valueC1
keyA-valueA2,
keyC-valueC2.
delB
读的时候往回找,先找到啥是啥。看到delB就表示B已经没了。
这就真是纯log structure了。
不过我没想明白on-disk索引怎么做,所以说了打叉叉那个办法。
log structure当时被想出来的假设是大内存能够保证足够高的cache hit rate,
所以读磁盘的效率差并不重要,而按log的顺序写入能保证最大化写入吞吐量。
cache在内存中,更新和log顺序没有关系。如果索引在内存中,也和log没关系。
其实以后SSD多了,顺序写的优势也就不再那么重要了。
rewrite
and
B
【在 k****r 的大作中提到】 : re: Isn't delete just a write? Delete operation is appended in commit log, : during compaction, the row is removed. : If delete is one special type of write, can I understand the it as a rewrite : on a key? : BTW, how to swap the page cache for keys? Lets say, the older data file and : index file are like keyA-valueA1, keyB-valueB1, keyC-valueC1, and the new : coming ones are: keyA-valueA2, keyC-valueC2. Then, what is swap doing on : cache? it should still be like keyA...keyB...keyC, but only value of A and B : are updated, right? Then, if the new operation is delB, the keyB value is : set as a special value?
| k****r 发帖数: 807 | 7 Thank you for your reply. And it should make sense, especially for the
system with the LRU cache.
However, I don't remember there is related field in Dynamo/Voldemort systems
. Also, I though its caching for versions is not similar with LRU....
row.
【在 g*****g 的大作中提到】 : Every row/column has a flag, you mark it as invalid, that's a delete. All : write operations will also invalidate and/or update the cache for given row. : : rewrite : and : B
| w***g 发帖数: 5958 | 8 log-structured file system是1990年左右出现的,在HDD时代是一个很牛B的发现。
其实现在SSD盛行后已经没啥优势了。之所有目前盛行,我觉得主要是industry还没缓
过劲来,
还有就是实现比较容易。类似的,event-vs-thread以及nginx的牛B是前multi-core时
代的事情。
现在核越来越多了,单核计算能力也越来越强,未来的方向是明显利于thread的。
但是industry还没缓过劲来,所以会阶段性地出现continuation passing style大行其
道的局面。
node.js之类的应该火不了几年了。
【在 w***g 的大作中提到】 : 按goodbug说的,日志里就是 : keyA-valueA1 : keyB-valueB1 : keyC-valueC1 : keyA-valueA2, : keyC-valueC2. : delB : 读的时候往回找,先找到啥是啥。看到delB就表示B已经没了。 : 这就真是纯log structure了。 : 不过我没想明白on-disk索引怎么做,所以说了打叉叉那个办法。
| k****r 发帖数: 807 | 9 Also, what is the relations between the log, and the data in disk/memory.'
Thanks, | g*****g 发帖数: 34805 | 10 分布式系统里删除特殊的地方在于不是所有节点都会收到这条指令,可能会丢。如果你
真的删除了数据,你怎么确定是一个写操作这个节点没收到,还是这个删除别的节点没
收到?
所以 Cassandra一类的系统是写入一个特殊值叫 tombstone,只在一定时间之后做
compaction,这样出错的概率就很低。
【在 w***g 的大作中提到】 : 按goodbug说的,日志里就是 : keyA-valueA1 : keyB-valueB1 : keyC-valueC1 : keyA-valueA2, : keyC-valueC2. : delB : 读的时候往回找,先找到啥是啥。看到delB就表示B已经没了。 : 这就真是纯log structure了。 : 不过我没想明白on-disk索引怎么做,所以说了打叉叉那个办法。
| | | g*****g 发帖数: 34805 | 11 Every system is different, my knowledge is on Cassandra and it's not
necessary accurate for other system. Think of it as one possible solution.
systems
【在 k****r 的大作中提到】 : Thank you for your reply. And it should make sense, especially for the : system with the LRU cache. : However, I don't remember there is related field in Dynamo/Voldemort systems : . Also, I though its caching for versions is not similar with LRU.... : : row.
| w***g 发帖数: 5958 | 12 传统数据库的数据主要存在一个磁盘上的B+树(或者hash)表中,log是一个为了保证数据
完整性的机制。更新B+树之前先把对应的操作写入log中。这样如果B+树更新到一半系统
断电导致数据结构损坏,可以通过replay log的办法重建B+树。后来人们发现其实所有
的数据都在log里,要用的时候去log里也能找出数据来,就觉得可以把B+树扔掉了。
然后log就变成了磁盘上唯一的数据结构。数据库的log和一般程序的日志不一样。
数据库的log存的是数据本身,所以必须存在于磁盘上。内存中只能是索引或者cache。
【在 k****r 的大作中提到】 : Also, what is the relations between the log, and the data in disk/memory.' : Thanks,
| k****r 发帖数: 807 | 13 Many thanks for DANIUMEN!!! I think I still have a lot of things to learn:) | w**z 发帖数: 8232 | 14 In case of C*, make sure to run repair at least once within GC grade period.
Otherwise, tombstones may come back to live. As goodbug said, deletes might
not get to all the nodes because of the eventual consistency, and repair
will fix that.
Read this blog if you want to know more, delete in distributes system is
tricky.
http://thelastpickle.com/blog/2011/05/15/Deletes-and-Tombstones
【在 g*****g 的大作中提到】 : 分布式系统里删除特殊的地方在于不是所有节点都会收到这条指令,可能会丢。如果你 : 真的删除了数据,你怎么确定是一个写操作这个节点没收到,还是这个删除别的节点没 : 收到? : 所以 Cassandra一类的系统是写入一个特殊值叫 tombstone,只在一定时间之后做 : compaction,这样出错的概率就很低。
| w**z 发帖数: 8232 | 15 commit log is for durability. C* writes to memtable along with commit log.
In case node crashes before the memtable is flushed to disk, it will recover
from commit log.
For reads, it doesn't go to commit log, it goes to memtable and sstable and
merge them.
【在 w***g 的大作中提到】 : 按goodbug说的,日志里就是 : keyA-valueA1 : keyB-valueB1 : keyC-valueC1 : keyA-valueA2, : keyC-valueC2. : delB : 读的时候往回找,先找到啥是啥。看到delB就表示B已经没了。 : 这就真是纯log structure了。 : 不过我没想明白on-disk索引怎么做,所以说了打叉叉那个办法。
| k****r 发帖数: 807 | 16 Nice information. Let me study first :P
period.
might
【在 w**z 的大作中提到】 : In case of C*, make sure to run repair at least once within GC grade period. : Otherwise, tombstones may come back to live. As goodbug said, deletes might : not get to all the nodes because of the eventual consistency, and repair : will fix that. : Read this blog if you want to know more, delete in distributes system is : tricky. : http://thelastpickle.com/blog/2011/05/15/Deletes-and-Tombstones
|
|