Commit 8c0bb119, authored by Thomas Mueller

MVStore file format documentation.

Parent 831e6937
......@@ -478,15 +478,56 @@ it is recommended to use it together with the MVCC mode
<h2 id="fileFormat">File Format</h2>
<p>
The data is stored in one file. The file contains two file headers (to be safe),
and a number of chunks. The file headers are one block each; a block is 4096 bytes.
The data is stored in one file.
The file contains two file headers (for safety), and a number of chunks.
The file headers are one block each; a block is 4096 bytes.
Each chunk is at least one block, but typically 200 blocks or more.
There might be a number of free blocks in front of every chunk.
Data is stored in the chunks in the form of
<a href="https://en.wikipedia.org/wiki/Log-structured_file_system">log-structured storage</a>.
There is one chunk for every version.
</p>
<pre>
[ file header 1 ] [ file header 2 ] [ chunk ] [ chunk ] ... [ chunk ]
</pre>
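<p>
The block arithmetic implied by this layout can be sketched as follows.
This is only an illustration: the class and method names are hypothetical,
and it assumes the two file headers occupy the first two blocks, as the diagram suggests.
</p>

```java
// Illustrative sketch only, not MVStore code; all names are hypothetical.
final class BlockLayout {
    // Block size stated in the text above.
    static final int BLOCK_SIZE = 4096;

    // Byte offset at which the given block starts (block 0 is the first block).
    static long blockOffset(long block) {
        return block * BLOCK_SIZE;
    }

    // Number of whole blocks needed to hold the given number of bytes.
    static long blocksFor(long bytes) {
        return (bytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }
}
```

<p>
Under these assumptions, <code>BlockLayout.blockOffset(2)</code> is the earliest position
at which a chunk could start, and <code>BlockLayout.blocksFor(4097)</code> returns 2,
because a chunk always occupies whole blocks.
</p>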
<p>
Each chunk contains a number of B-tree pages.
As an example, the following code:
</p>
<pre>
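// Open (or create) the store and a map named "data".
MVStore s = MVStore.open(fileName);
MVMap&lt;Integer, String&gt; map = s.openMap("data");
// First version: insert 400 entries.
for (int i = 0; i &lt; 400; i++) {
    map.put(i, "Hello");
}
s.commit();
// Second version: overwrite the entry for key 0 a hundred times.
for (int i = 0; i &lt; 100; i++) {
    map.put(0, "Hi");
}
s.commit();
s.close();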
</pre>
<p>
will result in the following two chunks (excluding metadata):
</p>
<p>
<b>Chunk 1:</b>
</p>
<ul><li>Page 1: leaf with 140 entries (keys 0 - 139)
</li><li>Page 2: leaf with 260 entries (keys 140 - 399)
</li><li>Page 3: node with 2 entries pointing to pages 1 and 2 (the root)
</li></ul>
<p>
<b>Chunk 2:</b>
</p>
<ul><li>Page 4: leaf with 140 entries (keys 0 - 139)
</li><li>Page 5: node with 2 entries pointing to pages 4 and 2 (the root)
</li></ul>
<p>
That means each chunk contains the changes of one version:
the new versions of the changed pages and of their parent pages,
recursively, up to the root page. Pages in later chunks can refer to
pages in earlier chunks.
</p>
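<p>
The following toy model sketches this behavior for a two-level tree.
It is not MVStore code: the class, its fields, and its methods are hypothetical,
and pages are reduced to an id, a key range, and a list of child page ids.
Changing one key writes a new copy of the affected leaf and a new root;
the new root refers to the new leaf and to the unchanged leaf from the first chunk.
</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of copy-on-write page updates; all names are hypothetical.
final class CopyOnWriteSketch {
    // An immutable page: a leaf covering a key range, or a node listing child page ids.
    static final class Page {
        final int id;
        final int firstKey;
        final int lastKey;
        final List<Integer> children; // empty for leaf pages

        Page(int id, int firstKey, int lastKey, List<Integer> children) {
            this.id = id;
            this.firstKey = firstKey;
            this.lastKey = lastKey;
            this.children = children;
        }
    }

    final List<Page> pages = new ArrayList<>(); // page id n is stored at index n - 1
    int rootId;

    // Appends a new page; existing pages are never modified.
    private Page write(int firstKey, int lastKey, List<Integer> children) {
        Page p = new Page(pages.size() + 1, firstKey, lastKey, List.copyOf(children));
        pages.add(p);
        return p;
    }

    Page page(int id) {
        return pages.get(id - 1);
    }

    // First version: two leaves and a root. Returns the ids written to the chunk.
    List<Integer> commitInitial() {
        Page a = write(0, 139, List.of());
        Page b = write(140, 399, List.of());
        Page root = write(0, 399, List.of(a.id, b.id));
        rootId = root.id;
        return List.of(a.id, b.id, root.id);
    }

    // Later version that changes one key: copy the affected leaf, then the root.
    List<Integer> commitUpdate(int key) {
        Page oldRoot = page(rootId);
        List<Integer> newChildren = new ArrayList<>();
        List<Integer> written = new ArrayList<>();
        for (int childId : oldRoot.children) {
            Page child = page(childId);
            if (key >= child.firstKey && key <= child.lastKey) {
                // The changed leaf is written again under a new page id.
                Page copy = write(child.firstKey, child.lastKey, List.of());
                written.add(copy.id);
                newChildren.add(copy.id);
            } else {
                // Unchanged leaves are referenced in place, in their old chunk.
                newChildren.add(childId);
            }
        }
        Page newRoot = write(oldRoot.firstKey, oldRoot.lastKey, newChildren);
        rootId = newRoot.id;
        written.add(newRoot.id);
        return written;
    }
}
```

<p>
Calling <code>commitInitial()</code> and then <code>commitUpdate(0)</code> on a new instance
writes three pages in the first step and only two in the second,
which mirrors the two chunks described above.
</p>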
<h3>File Header</h3>
<p>
......@@ -573,11 +614,11 @@ to be stored in the next chunk, and the number of live pages in the old chunk is
This mechanism is called copy-on-write, and is similar to how the
<a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works.
Chunks without live pages are marked as free, so the space can be re-used by more recent chunks.
Because not all chunks are of the same size, there can be some unused space in front of a chunk
Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk
for some time (until a small chunk is written or the chunks are compacted).
There is a <a href="http://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed">
delay of 45 seconds</a> (by default) before a free chunk is overwritten,
to ensure new versions are persisted first, as hard disks sometimes re-order write operations.
to ensure new versions are persisted first.
</p>
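<p>
The delay rule can be sketched as a simple time comparison.
The names below are hypothetical and only illustrate the check;
the 45-second default is the value given in the text above.
</p>

```java
// Illustrative sketch only, not MVStore code; all names are hypothetical.
final class ChunkReuse {
    // Default delay named in the text above, in milliseconds.
    static final long DEFAULT_RETENTION_MILLIS = 45_000;

    // A free chunk may be overwritten once the delay has fully elapsed.
    static boolean canOverwrite(long freedAtMillis, long nowMillis, long retentionMillis) {
        return nowMillis - freedAtMillis >= retentionMillis;
    }
}
```

<p>
With the default delay, a chunk freed at time 0 is still protected at 44,999 ms
and may be overwritten from 45,000 ms on.
</p>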
<p>
How the newest chunk is located when opening a store:
......@@ -613,10 +654,10 @@ and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable si
</li><li>len (variable size int): The number of keys in the page.
</li><li>type (byte): The page type (0 for leaf page, 1 for internal node;
plus 2 if the page data is compressed).
</li><li>keys (byte array): All keys, stored depending on the data type.
</li><li>children (array of long; internal nodes only): The position of the children.
</li><li>childCounts (array of variable size long; internal nodes only):
The total number of entries for the given child page.
</li><li>keys (byte array): All keys, stored depending on the data type.
</li><li>values (byte array; leaf pages only): All values, stored depending on the data type.
</li></ul>
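<p>
As a minimal sketch, the type byte described in the list above could be decoded as follows.
The class and constant names are hypothetical; only the values
(0 for leaf, 1 for internal node, plus 2 if compressed) come from the text.
</p>

```java
// Illustrative sketch only; class and constant names are hypothetical.
final class PageType {
    static final int LEAF = 0;       // leaf page
    static final int NODE = 1;       // internal node
    static final int COMPRESSED = 2; // added to the type if the page data is compressed

    // The lowest bit distinguishes leaf pages from internal nodes.
    static boolean isLeaf(int type) {
        return (type & 1) == LEAF;
    }

    // The second bit tells whether the page data is compressed.
    static boolean isCompressed(int type) {
        return (type & COMPRESSED) != 0;
    }
}
```

<p>
For example, a type value of 3 denotes a compressed internal node:
<code>PageType.isLeaf(3)</code> is false and <code>PageType.isCompressed(3)</code> is true.
</p>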
<p>
......