项目有用到leveldb,打算最近看看源码,整理个系列文章,这是第5篇sstable文件。源码注释地址
代码 从上一篇的流程中可以看到这里会出现压缩的场景:
1 2 3 4 5 6 MakeRoomForWrite MaybeScheduleCompaction BackgroundCompaction CompactMemTable WriteLevel0Table BuildTable
BuildTable中open一个新的文件名,然后执行finish
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Status TableBuilder::Finish () { Flush (); WriteRawBlock (r->filter_block->Finish (), kNoCompression, &filter_block_handle); WriteBlock (&meta_index_block, &metaindex_block_handle); WriteBlock (&r->index_block, &index_block_handle); r->status = r->file->Append (footer_encoding); }
WriteRawBlock/WriteBlock会调用Append写,posix下优先写buf,buf不够直接调用write写文件。
分别看看用处:
BlockBuilder BlockBuilder的成员变量如下:
1 2 3 4 5 6 const Options* options_;std::string buffer_; std::vector<uint32_t > restarts_; int counter_; bool finished_; std::string last_key_;
buffer_指向的内容就是下图的第2行,其中每一个entry的格式是第一行的表示。restarts_记录了每一个前缀压缩新的开始点,最后会把这些值存入buffer_后面。其他几个字段没那么重要了。
data_block data_block
的类型是BlockBuilder ,add的kv(一个entry)是(key, value);
filter_block filter_block的类型是FilterBlockBuilder ,成员变量如下:
1 2 3 4 5 6 const FilterPolicy* policy_;std::string keys_; std::vector<size_t > start_; std::string result_; std::vector<Slice> tmp_keys_; std::vector<uint32_t > filter_offsets_;
keys_和start_在AddKey的时候赋值,如下:
1 2 3 4 5 void FilterBlockBuilder::AddKey (const Slice& key) { Slice k = key; start_.push_back (keys_.size ()); keys_.append (k.data (), k.size ()); }
看下StartBlock
的实现:
1 2 3 4 5 6 7 8 void FilterBlockBuilder::StartBlock (uint64_t block_offset) { uint64_t filter_index = (block_offset / kFilterBase); assert (filter_index >= filter_offsets_.size ()); while (filter_index > filter_offsets_.size ()) { GenerateFilter (); } }
StartBlock
被调用的地方有:
TableBuilder构造函数
TableBuilder::Flush()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 TableBuilder::TableBuilder (const Options& options, WritableFile* file) : rep_ (new Rep (options, file)) { if (rep_->filter_block != nullptr ) { rep_->filter_block->StartBlock (0 ); } } void TableBuilder::Flush () { Rep* r = rep_; assert (!r->closed); if (!ok ()) return ; if (r->data_block.empty ()) return ; assert (!r->pending_index_entry); WriteBlock (&r->data_block, &r->pending_handle); if (ok ()) { r->pending_index_entry = true ; r->status = r->file->Flush (); } if (r->filter_block != nullptr ) { r->filter_block->StartBlock (r->offset); } }
filter_offsets_中存了每个filter
的offset,每个filter记录sstable的data_block的2kb的kv的特征值(默认是bloom filter算的Bits)。所以StartBlock函数的入参拿到新的sstable中data_block的offset/2kb,如果比filter_offsets_原来的值大,就不断GenerateFilter生成新的filter,没生成一个filter_offsets_的size就加一。
看下Finish的实现:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Slice FilterBlockBuilder::Finish () { if (!start_.empty ()) { GenerateFilter (); } const uint32_t array_offset = result_.size (); for (size_t i = 0 ; i < filter_offsets_.size (); i++) { PutFixed32 (&result_, filter_offsets_[i]); } PutFixed32 (&result_, array_offset); result_.push_back (kFilterBaseLg); return Slice (result_); }
注意,每执行一次GenerateFilter,keys_和start_都会clear.但filter_offsets_的值是一直在的。然后这个过程中生成的filter内容都保存在result_.最后Finish的时候filter_offsets_的值以及原来result_的size都会写到result_的后面,kFilterBaseLg=11(1 << 11, 表示2kb)也会写到后面。也就是说最后Finish完就生成了一个新的filter_block。其中result_指向的内容如下:
这就是一个filter_block的内容。后面数据读取单独整理一篇再分析读取的逻辑。
meta_index_block
的类型是BlockBuilder , add的kv(一个entry)是(filter.leveldb.BuiltinBloomFilter2, string(filter_block_handle)); filter_block_handle记录了filter_block的位置信息
index_block index_block
的类型是BlockBuilder , add的kv(一个entry)是(last_key, string(pending_handle));last_key记录了data_block的最后一个key, pending_handle记录了最后一个data_block的位置信息。
last_key更新的位置:
1 2 3 4 5 void TableBuilder::Add (const Slice& key, const Slice& value) { ... r->last_key.assign (key.data (), key.size ()); ... }
pending_handle更新的位置:
1 2 3 4 5 6 7 8 9 10 11 12 TableBuilder::Flush () WriteBlock (&r->data_block, &r->pending_handle) WriteRawBlock (block_contents, type, handle); void TableBuilder::WriteRawBlock (const Slice& block_contents, CompressionType type, BlockHandle* handle) {Rep* r = rep_; handle->set_offset (r->offset); handle->set_size (block_contents.size ()); ... }
Footer有2个数据成员, 记录了metaindex_handle_和index_handle_,也即meta_index_block和index_block的位置信息.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 class Footer { public : ... private : BlockHandle metaindex_handle_; BlockHandle index_handle_; }; class BlockHandle { public : ... private : uint64_t offset_; uint64_t size_; };
reference https://web.archive.org/web/20120131105110/http://rdc.taobao.com/blog/cs/wp-content/plugins/leveldb%E5%AE%9E%E7%8E%B0%E8%A7%A3%E6%9E%90.pdf
http://catkang.github.io/2017/01/17/leveldb-data.html
https://www.bookstack.cn/read/Leveldb-handbook/spilt.2.b308da3c3d01f3cd.md
https://izualzhy.cn/leveldb-block