Commit b9726376 authored by Thomas Mueller

MVStore file format documentation.

Parent 29f54eb0
...@@ -493,7 +493,8 @@ There is one chunk for every version. ...@@ -493,7 +493,8 @@ There is one chunk for every version.
There are two file headers, which normally contain the exact same data. There are two file headers, which normally contain the exact same data.
But once in a while, the file headers are updated, and writing could partially fail, But once in a while, the file headers are updated, and writing could partially fail,
which could corrupt a header. That's why there is a second header. which could corrupt a header. That's why there is a second header.
Only the file headers are updated in-place. They contain the following data: Only the file headers are updated in this way (called "in-place update").
The headers contain the following data:
</p> </p>
<pre> <pre>
H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,format:1,version:7,fletcher:3044e6cc H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,format:1,version:7,fletcher:3044e6cc
...@@ -503,18 +504,20 @@ The data is stored in the form of a key-value pair. ...@@ -503,18 +504,20 @@ The data is stored in the form of a key-value pair.
Each value is stored as a hexadecimal number. The entries are: Each value is stored as a hexadecimal number. The entries are:
</p> </p>
<ul><li>H: The entry "H:2" stands for the H2 database. <ul><li>H: The entry "H:2" stands for the H2 database.
</li><li>block: The block number where one of the latest chunks starts. </li><li>block: The block number where one of the newest chunks starts
</li><li>blockSize: The block size of the file; currently always hex 1000, which is decimal 4096. (but not necessarily the newest).
</li><li>blockSize: The block size of the file; currently always hex 1000, which is decimal 4096,
to match the <a href="https://en.wikipedia.org/wiki/Disk_sector">disk sector</a>
length of modern hard disks.
</li><li>chunk: The chunk id, which is normally the same value as the version; </li><li>chunk: The chunk id, which is normally the same value as the version;
however, the chunk id might roll over to 0, while the version doesn't. however, the chunk id might roll over to 0, while the version doesn't.
</li><li>created: The number of milliseconds since 1970 when the file was created. </li><li>created: The number of milliseconds since 1970 when the file was created.
</li><li>format: The file format number. Currently 1. </li><li>format: The file format number. Currently 1.
</li><li>version: The version number of the chunk. </li><li>version: The version number of the chunk.
</li><li>fletcher: The Fletcher-32 checksum of the header. </li><li>fletcher: The <a href="https://en.wikipedia.org/wiki/Fletcher's_checksum">
Fletcher-32 checksum</a> of the header.
</li></ul> </li></ul>
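The header line shown above can be read back into a map with a few lines of code. The following is a minimal sketch in Python, not H2's actual parser: the function name `parse_header` is hypothetical, and it assumes that every value is a hexadecimal number, as the text states. It does not verify the checksum.

```python
def parse_header(line):
    """Split a 'key:value,key:value' header line into a dict of ints.

    Assumption: all values are hexadecimal, as described in the text.
    """
    fields = {}
    for entry in line.strip().split(","):
        key, _, value = entry.partition(":")
        fields[key] = int(value, 16)
    return fields


# The sample header from the documentation.
header = parse_header(
    "H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,"
    "format:1,version:7,fletcher:3044e6cc"
)
print(header["blockSize"])  # hex 1000 is decimal 4096
```

With two parsed headers, a reader could then compare their `version` fields to pick the newer one, as the text describes for the case where both checksums are valid.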
<p> <p>
</p>
<p>
When opening the file, both headers are read and the checksum is verified. When opening the file, both headers are read and the checksum is verified.
If both headers are valid, the one with the newer version is used. If both headers are valid, the one with the newer version is used.
The chunk with the latest version is then detected (details about this see below), The chunk with the latest version is then detected (details about this see below),
...@@ -522,6 +525,8 @@ and the rest of the metadata is read from there. ...@@ -522,6 +525,8 @@ and the rest of the metadata is read from there.
If the chunk id, block and version are not stored in the file header, If the chunk id, block and version are not stored in the file header,
then the latest chunk lookup starts with the last chunk in the file. then the latest chunk lookup starts with the last chunk in the file.
</p> </p>
<p>
</p>
<h3>Chunk Format</h3> <h3>Chunk Format</h3>
<p> <p>
...@@ -536,9 +541,8 @@ The footer is stored in the last 128 bytes of the chunk. ...@@ -536,9 +541,8 @@ The footer is stored in the last 128 bytes of the chunk.
[ header ] [ page ] [ page ] ... [ page ] [ footer ] [ header ] [ page ] [ page ] ... [ page ] [ footer ]
</pre> </pre>
<p> <p>
The footer is stored so that a reader can verify the chunk is completely written The footer allows to verify that the chunk is completely written (a chunk is written as one write operation),
(a chunk is written as one write operation), and allows to find the start position of the very last chunk in the file.
and to find the start position of the very last chunk in the file.
The chunk header and footer contain the following data: The chunk header and footer contain the following data:
</p> </p>
<pre> <pre>
...@@ -564,15 +568,16 @@ The fields of the chunk header and footer are: ...@@ -564,15 +568,16 @@ The fields of the chunk header and footer are:
Chunks are never updated in-place. Each chunk contains the pages that were Chunks are never updated in-place. Each chunk contains the pages that were
changed in that version (there is one chunk per version, see above), changed in that version (there is one chunk per version, see above),
plus all the parent nodes of those pages, recursively, up to the root page. plus all the parent nodes of those pages, recursively, up to the root page.
If an entry in a map is changed, removed, or added, then the respective page is copied If an entry in a map is changed, removed, or added, then the respective page is copied
(copy-on-write) to be stored in the next chunk, to be stored in the next chunk, and the number of live pages in the old chunk is decremented.
and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the
<a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works.
Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks.
Because not all chunks are of the same size, there can be some unused space in front of a chunk Because not all chunks are of the same size, there can be some unused space in front of a chunk
for some time (until a small chunk is written or the chunks are compacted). for some time (until a small chunk is written or the chunks are compacted).
There is a delay of 45 seconds (by default) There is a <a href="http://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed">
before a free chunk is overwritten, to ensure new versions are persisted first delay of 45 seconds</a> (by default) before a free chunk is overwritten,
(as hard disks sometimes re-order write operations). to ensure new versions are persisted first, as hard disks sometimes re-order write operations.
</p> </p>
<p> <p>
How the newest chunk is located when opening a store: How the newest chunk is located when opening a store:
...@@ -592,14 +597,15 @@ In any case, the file header is updated if the next chain gets longer than 20 ho ...@@ -592,14 +597,15 @@ In any case, the file header is updated if the next chain gets longer than 20 ho
<h3>Page Format</h3> <h3>Page Format</h3>
<p> <p>
Each map is a B-tree, and the map data is stored in (B-tree-) pages. Each map is a <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>,
and the map data is stored in (B-tree-) pages.
There are leaf pages that contain the key-value pairs of the map, There are leaf pages that contain the key-value pairs of the map,
and internal nodes, which only contain keys and pointers to leaf pages. and internal nodes, which only contain keys and pointers to leaf pages.
The root of a tree is either a leaf or an internal node. The root of a tree is either a leaf or an internal node.
Unlike file header and chunk header and footer, the page data is not human readable. Unlike file header and chunk header and footer, the page data is not human readable.
Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes),
and variable size int (1 to 5 bytes) and variable size long (1 to 10 bytes). and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a>
The page format is: (1 to 5 / 10 bytes). The page format is:
</p> </p>
<ul><li>length (int): Length of the page in bytes. <ul><li>length (int): Length of the page in bytes.
</li><li>checksum (short): Checksum (chunk id xor offset within the chunk xor page length). </li><li>checksum (short): Checksum (chunk id xor offset within the chunk xor page length).
...@@ -618,29 +624,27 @@ Even though this is not required by the file format, each B-tree is stored ...@@ -618,29 +624,27 @@ Even though this is not required by the file format, each B-tree is stored
"upside down", that means the leaf pages first, then the internal nodes, and lastly the root page. "upside down", that means the leaf pages first, then the internal nodes, and lastly the root page.
</p> </p>
<p> <p>
Variable size values are stored as follows: as long as the value has any bits Pointers to pages are stored as a long, using a special format:
above bit 7 set, the lower 7 bits plus 128 are stored, and then the value 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code,
is shifted to the right by 7 bits. 1 bit for the page type (leaf or internal node).
</p> The page type is encoded so that when clearing or removing a map, leaf pages don't
<p> have to be read (internal nodes do have to be read in order to know where all the pages are;
Pointers to pages are stored as a long. They have a special format: but in a typical B-tree the vast majority of the pages are leaf pages).
the chunk id (shifted 38 bits to the left), plus the offset within the chunk (shifted 6 bits to the left), The absolute file position is not included so that chunks can be
plus the length code (shifted 1 bit to the left), plus the page type (0 for leaf, 1 for internal node).
The page type is encoded so that when clearing or
removing a map, leaf pages don't have to be read (internal nodes do have to be read
in order to know where all the pages are; but in a typical B-tree the vast majority
of the pages are leaf pages). The absolute file position is not included so that chunks can be
moved within the file without having to change page pointers; moved within the file without having to change page pointers;
only the chunk metadata needs to be changed. only the chunk metadata needs to be changed.
The length code is a number between 0 and 31 (inclusive), where 0 means the maximum length The length code is a number from 0 to 31, where 0 means the maximum length
of the page is 32 bytes, 1 means 48 bytes, 2: 64, 3: 96, 4: 128, 5: 192, and so on until 30 which of the page is 32 bytes, 1 means 48 bytes, 2: 64, 3: 96, 4: 128, 5: 192, and so on until 31 which
means 1048576 bytes, and 31 means longer. That way, reading a page only requires one means longer than 1 MB. That way, reading a page only requires one
read operation (except for very large pages, in which case two read operations might be needed). read operation (except for very large pages, in which case two read operations might be needed).
The sum of those maximum length of all pages in a chunk The sum of the maximum length of all pages is stored in the chunk metadata (field "max"),
is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted.
and when a page is marked as removed, the maximum length of that page This allows to estimate the amount of free space within a block, in addition to the number of free pages.
is subtracted from the live maximum length. That way we know not just how many pages in a chunk </p>
are live, but we also have an estimate on the live number of bytes. <p>
The total number of entries in the child nodes is kept to allow efficient range counting,
lookup by index, and skip operations.
The pages form a <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
</p> </p>
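The page pointer layout described above (26 bits chunk id, 32 bits offset, 5 bits length code, 1 bit page type) can be sketched as plain bit operations. This is an illustrative Python sketch, not H2's Java code: the function names are hypothetical, and the formula in `max_page_length` is an assumption chosen because it reproduces the listed values (32, 48, 64, 96, 128, 192, ... 1048576).

```python
def encode_page_pointer(chunk_id, offset, length_code, is_internal):
    """Pack the four fields into one 64-bit value."""
    if not 0 <= chunk_id < (1 << 26):
        raise ValueError("chunk id must fit in 26 bits")
    if not 0 <= offset < (1 << 32):
        raise ValueError("offset must fit in 32 bits")
    if not 0 <= length_code < (1 << 5):
        raise ValueError("length code must be 0..31")
    page_type = 1 if is_internal else 0  # 0 = leaf, 1 = internal node
    return (chunk_id << 38) | (offset << 6) | (length_code << 1) | page_type


def decode_page_pointer(pos):
    """Return (chunk_id, offset, length_code, is_internal)."""
    chunk_id = (pos >> 38) & 0x3FFFFFF
    offset = (pos >> 6) & 0xFFFFFFFF
    length_code = (pos >> 1) & 0x1F
    return chunk_id, offset, length_code, bool(pos & 1)


def max_page_length(length_code):
    """Maximum page length in bytes, or None for code 31 (longer than 1 MB).

    Assumed formula: even codes give powers of two starting at 32,
    odd codes give 1.5 times the preceding even code's value.
    """
    if length_code == 31:
        return None
    return (2 + (length_code & 1)) << ((length_code >> 1) + 4)


pointer = encode_page_pointer(chunk_id=7, offset=4096, length_code=5, is_internal=True)
print(decode_page_pointer(pointer))
print([max_page_length(code) for code in range(6)])
```

Because the pointer holds only the chunk id and the offset within the chunk, moving a chunk within the file leaves every pointer unchanged, which is the design point the text makes.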
<p> <p>
Data compression: The data after the page type are optionally compressed using the LZF algorithm. Data compression: The data after the page type are optionally compressed using the LZF algorithm.
......