TheRiver | blog

You have reached the world's edge, none but devils play past here

0%

leveldb源码分析(5) sstable文件

项目有用到leveldb,打算最近看看源码,整理个系列文章,这是第5篇sstable文件。源码注释地址

代码

从上一篇的流程中可以看到这里会出现压缩的场景:

1
2
3
4
5
6
MakeRoomForWrite
MaybeScheduleCompaction
BackgroundCompaction
CompactMemTable
WriteLevel0Table
BuildTable

BuildTable中open一个新的文件名,然后执行finish

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Status TableBuilder::Finish() {

// 写 data_block
Flush();

// 写 filter_block
WriteRawBlock(r->filter_block->Finish(), kNoCompression,
&filter_block_handle);

// 写 meta_index_block
WriteBlock(&meta_index_block, &metaindex_block_handle);

// 写 index_block
WriteBlock(&r->index_block, &index_block_handle);

// 写 footer
r->status = r->file->Append(footer_encoding);
}

WriteRawBlock/WriteBlock会调用Append写,posix下优先写buf,buf不够直接调用write写文件。

分别看看用处:

BlockBuilder

BlockBuilder的成员变量如下:

1
2
3
4
5
6
const Options* options_;
std::string buffer_; // Destination buffer
std::vector<uint32_t> restarts_; // Restart points
int counter_; // Number of entries emitted since restart
bool finished_; // Has Finish() been called?
std::string last_key_;

buffer_指向的内容就是下图的第2行,其中每一个entry的格式是第一行的表示。restarts_记录了每一个前缀压缩新的开始点,最后会把这些值存入buffer_后面。其他几个字段没那么重要了。

data_block

data_block的类型是BlockBuilder,add的kv(一个entry)是(key, value);

filter_block

filter_block的类型是FilterBlockBuilder,成员变量如下:

1
2
3
4
5
6
const FilterPolicy* policy_;
std::string keys_; // Flattened(扁平) key contents
std::vector<size_t> start_; // Starting index in keys_ of each key
std::string result_; // Filter data computed so far
std::vector<Slice> tmp_keys_; // policy_->CreateFilter() argument
std::vector<uint32_t> filter_offsets_;

keys_和start_在AddKey的时候赋值,如下:

1
2
3
4
5
void FilterBlockBuilder::AddKey(const Slice& key) {
Slice k = key;
start_.push_back(keys_.size());
keys_.append(k.data(), k.size());
}

看下StartBlock的实现:

1
2
3
4
5
6
7
8
void FilterBlockBuilder::StartBlock(uint64_t block_offset) {
// kFilterBase = 2kb
uint64_t filter_index = (block_offset / kFilterBase);
assert(filter_index >= filter_offsets_.size());
while (filter_index > filter_offsets_.size()) {
GenerateFilter();
}
}

StartBlock被调用的地方有:

  • TableBuilder构造函数
  • TableBuilder::Flush()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
TableBuilder::TableBuilder(const Options& options, WritableFile* file)
: rep_(new Rep(options, file)) {
if (rep_->filter_block != nullptr) {
rep_->filter_block->StartBlock(0);
}
}

void TableBuilder::Flush() {
Rep* r = rep_;
assert(!r->closed);
if (!ok()) return;
if (r->data_block.empty()) return;
assert(!r->pending_index_entry);
WriteBlock(&r->data_block, &r->pending_handle); // data_block写到buf或文件
if (ok()) {
r->pending_index_entry = true;
r->status = r->file->Flush(); // 缓存buf刷新到文件
}
if (r->filter_block != nullptr) {
r->filter_block->StartBlock(r->offset); // ==
}
}

filter_offsets_中存了每个filter的offset,每个filter记录sstable的data_block的2kb的kv的特征值(默认是bloom filter算的Bits)。所以StartBlock函数的入参拿到新的sstable中data_block的offset/2kb,如果比filter_offsets_原来的值大,就不断GenerateFilter生成新的filter,没生成一个filter_offsets_的size就加一。

看下Finish的实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Slice FilterBlockBuilder::Finish() {
if (!start_.empty()) {
GenerateFilter();
}

// Append array of per-filter offsets
const uint32_t array_offset = result_.size();
for (size_t i = 0; i < filter_offsets_.size(); i++) {
PutFixed32(&result_, filter_offsets_[i]);
}

PutFixed32(&result_, array_offset);
// kFilterBaseLg = 11
result_.push_back(kFilterBaseLg); // Save encoding parameter in result
return Slice(result_);
}

注意,每执行一次GenerateFilter,keys_和start_都会clear.但filter_offsets_的值是一直在的。然后这个过程中生成的filter内容都保存在result_.最后Finish的时候filter_offsets_的值以及原来result_的size都会写到result_的后面,kFilterBaseLg=11(1 << 11, 表示2kb)也会写到后面。也就是说最后Finish完就生成了一个新的filter_block。其中result_指向的内容如下:

这就是一个filter_block的内容。后面数据读取单独整理一篇再分析读取的逻辑。

meta_index_block

meta_index_block的类型是BlockBuilder, add的kv(一个entry)是(filter.leveldb.BuiltinBloomFilter2, string(filter_block_handle)); filter_block_handle记录了filter_block的位置信息

index_block

index_block的类型是BlockBuilder, add的kv(一个entry)是(last_key, string(pending_handle));last_key记录了data_block的最后一个key, pending_handle记录了最后一个data_block的位置信息。

last_key更新的位置:

1
2
3
4
5
void TableBuilder::Add(const Slice& key, const Slice& value) {
...
r->last_key.assign(key.data(), key.size());
...
}

pending_handle更新的位置:

1
2
3
4
5
6
7
8
9
10
11
12
TableBuilder::Flush()
WriteBlock(&r->data_block, &r->pending_handle)
WriteRawBlock(block_contents, type, handle);

void TableBuilder::WriteRawBlock(const Slice& block_contents,
CompressionType type, BlockHandle* handle) {
Rep* r = rep_;
handle->set_offset(r->offset);
// block_contents是data_block压缩后的值
handle->set_size(block_contents.size());
...
}

Footer有2个数据成员, 记录了metaindex_handle_和index_handle_,也即meta_index_block和index_block的位置信息.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// Footer encapsulates the fixed information stored at the tail
// end of every table file.
class Footer {
public:
...

private:
BlockHandle metaindex_handle_;
BlockHandle index_handle_;
};

// BlockHandle is a pointer to the extent of a file that stores a data
// block or a meta block.
class BlockHandle {
public:
...

private:
uint64_t offset_;
uint64_t size_;
};

reference

https://web.archive.org/web/20120131105110/http://rdc.taobao.com/blog/cs/wp-content/plugins/leveldb%E5%AE%9E%E7%8E%B0%E8%A7%A3%E6%9E%90.pdf

http://catkang.github.io/2017/01/17/leveldb-data.html

https://www.bookstack.cn/read/Leveldb-handbook/spilt.2.b308da3c3d01f3cd.md

https://izualzhy.cn/leveldb-block

----------- ending -----------