Commit f6fbb356 authored by Thomas Mueller


MVStore: support two compression levels; measure read and written bytes; ascending iteration over entry set
Parent ae3db101
......@@ -103,7 +103,7 @@ Example usage:
MVStore s = new MVStore.Builder().
fileName(fileName).
encryptionKey("007".toCharArray()).
compressData().
compress().
open();
</pre>
<p>
......@@ -114,7 +114,10 @@ The list of available options is:
</li><li>backgroundExceptionHandler: specify a handler for
exceptions that could occur while writing in the background.
</li><li>cacheSize: the cache size in MB.
</li><li>compressData: compress the data when storing.
</li><li>compress: compress the data when storing
using a fast algorithm (LZF).
</li><li>compressHigh: compress the data when storing
using a slower algorithm (Deflate).
</li><li>encryptionKey: the encryption key for file encryption.
</li><li>fileName: the name of the file, for file based stores.
</li><li>fileStore: the storage implementation to use.
......
......@@ -7337,450 +7337,453 @@ backgroundExceptionHandler: specify a handler for exceptions that could occur wh
cacheSize: the cache size in MB.
@mvstore_1043_li
compressData: compress the data when storing.
compress: compress the data when storing using a fast algorithm (LZF).
@mvstore_1044_li
encryptionKey: the encryption key for file encryption.
compressHigh: compress the data when storing using a slower algorithm (Deflate).
@mvstore_1045_li
fileName: the name of the file, for file based stores.
encryptionKey: the encryption key for file encryption.
@mvstore_1046_li
fileStore: the storage implementation to use.
fileName: the name of the file, for file based stores.
@mvstore_1047_li
pageSplitSize: the point where pages are split.
fileStore: the storage implementation to use.
@mvstore_1048_li
pageSplitSize: the point where pages are split.
@mvstore_1049_li
readOnly: open the file in read-only mode.
@mvstore_1049_h2
@mvstore_1050_h2
R-Tree
@mvstore_1050_p
@mvstore_1051_p
The <code>MVRTreeMap</code> is an R-tree implementation that supports fast spatial queries. It can be used as follows:
@mvstore_1051_p
@mvstore_1052_p
The default number of dimensions is 2. To use a different number of dimensions, call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>. The minimum number of dimensions is 1, the maximum is 255.
@mvstore_1052_h2
@mvstore_1053_h2
Features
@mvstore_1053_h3
@mvstore_1054_h3
Maps
@mvstore_1054_p
@mvstore_1055_p
Each store contains a set of named maps. A map is sorted by key, and supports the common lookup operations, including access to the first and last key, iterate over some or all keys, and so on.
@mvstore_1055_p
@mvstore_1056_p
Also supported, and very uncommon for maps, is fast index lookup: the entries of the map can be efficiently accessed like a random-access list (get the entry at the given index), and the index of a key can be calculated efficiently. That also means getting the median of two keys is very fast, and a range of keys can be counted very quickly. The iterator supports fast skipping. This is possible because internally, each map is organized in the form of a counted B+-tree.
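As an illustration of how a counted tree enables index lookup, here is a minimal, hypothetical sketch (not MVStore's actual classes): each internal node stores the entry count of each child, so the entry at a given index is found in O(height) steps instead of iterating from the start.

```java
public class CountedBTreeDemo {
    interface Node { int count(); }
    static final class Leaf implements Node {
        final String[] keys;
        Leaf(String... keys) { this.keys = keys; }
        public int count() { return keys.length; }
    }
    static final class Internal implements Node {
        final Node[] children;
        final long[] childCounts; // total number of entries per child subtree
        Internal(Node... children) {
            this.children = children;
            this.childCounts = new long[children.length];
            for (int i = 0; i < children.length; i++) {
                childCounts[i] = children[i].count();
            }
        }
        public int count() {
            long total = 0;
            for (long c : childCounts) total += c;
            return (int) total;
        }
    }
    // Descend using the per-child counts to find the entry at the given index.
    static String getByIndex(Node n, long index) {
        while (n instanceof Internal) {
            Internal in = (Internal) n;
            int i = 0;
            while (index >= in.childCounts[i]) {
                index -= in.childCounts[i];
                i++;
            }
            n = in.children[i];
        }
        return ((Leaf) n).keys[(int) index];
    }
    public static void main(String[] args) {
        Node root = new Internal(new Leaf("a", "b"), new Leaf("c", "d", "e"));
        System.out.println(getByIndex(root, 3)); // entry at index 3 is "d"
    }
}
```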
@mvstore_1056_p
@mvstore_1057_p
In database terms, a map can be used like a table, where the key of the map is the primary key of the table, and the value is the row. A map can also represent an index, where the key of the map is the key of the index, and the value of the map is the primary key of the table (for non-unique indexes, the key of the map must also contain the primary key).
@mvstore_1057_h3
@mvstore_1058_h3
Versions
@mvstore_1058_p
@mvstore_1059_p
A version is a snapshot of all the data of all maps at a given point in time. Creating a snapshot is fast: only those pages that are changed after a snapshot are copied. This behavior is also called COW (copy on write). Rollback to an old version is supported. Old versions are readable until old data is purged.
@mvstore_1059_p
@mvstore_1060_p
The following sample code shows how to create a store, open a map, add some data, and access the current and an old version:
@mvstore_1060_h3
@mvstore_1061_h3
Transactions
@mvstore_1061_p
@mvstore_1062_p
To support multiple concurrent open transactions, a transaction utility is included, the <code>TransactionStore</code>. The tool supports PostgreSQL style "read committed" transaction isolation with savepoints, two-phase commit, and other features typically available in a database. There is no limit on the size of a transaction (the log is written to disk for large or long running transactions).
@mvstore_1062_p
@mvstore_1063_p
Internally, this utility stores the old versions of changed entries in a separate map, similar to a transaction log (except that entries of a closed transaction are removed, and the log is usually not stored for short transactions). For common use cases, the storage overhead of this utility is very small compared to the overhead of a regular transaction log.
@mvstore_1063_h3
@mvstore_1064_h3
In-Memory Performance and Usage
@mvstore_1064_p
@mvstore_1065_p
Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>, but usually slower than <code>java.util.HashMap</code>.
@mvstore_1065_p
@mvstore_1066_p
The memory overhead for large maps is slightly better than for the regular map implementations, but there is a higher overhead per map. For maps with fewer than about 25 entries, the regular map implementations need less memory.
@mvstore_1066_p
@mvstore_1067_p
If no file name is specified, the store operates purely in memory. Except for persisting data, all features are supported in this mode (multi-versioning, index lookup, R-tree and so on). If a file name is specified, all operations occur in memory (with the same performance characteristics) until data is persisted.
@mvstore_1067_p
@mvstore_1068_p
As in all map implementations, keys need to be immutable; that is, changing the key object after an entry has been added is not allowed. If a file name is specified, the value may also not be changed after adding an entry, because it might be serialized (which could happen at any time when autocommit is enabled).
@mvstore_1068_h3
@mvstore_1069_h3
Pluggable Data Types
@mvstore_1069_p
@mvstore_1070_p
Serialization is pluggable. The default serialization currently supports many common data types, and uses Java serialization for other objects. The following classes are currently directly supported: <code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal, String, UUID, Date</code> and arrays (both primitive arrays and object arrays). For serialized objects, the size estimate is adjusted using an exponential moving average.
@mvstore_1070_p
@mvstore_1071_p
Parameterized data types are supported (for example one could build a string data type that limits the length).
@mvstore_1071_p
@mvstore_1072_p
The storage engine itself does not have any length limits, so that keys, values, pages, and chunks can be very big (as big as fits in memory). Also, there is no inherent limit to the number of maps and chunks. Due to using a log structured storage, there is no special case handling for large keys or pages.
@mvstore_1072_h3
@mvstore_1073_h3
BLOB Support
@mvstore_1073_p
@mvstore_1074_p
There is a mechanism that stores large binary objects by splitting them into smaller blocks. This makes it possible to store objects that don't fit in memory. Streaming as well as random access reads on such objects are supported. This tool is written on top of the store, using only the map interface.
@mvstore_1074_h3
@mvstore_1075_h3
R-Tree and Pluggable Map Implementations
@mvstore_1075_p
@mvstore_1076_p
The map implementation is pluggable. In addition to the default <code>MVMap</code> (multi-version map), there is a map that supports concurrent write operations, and a multi-version R-tree map implementation for spatial operations.
@mvstore_1076_h3
@mvstore_1077_h3
Concurrent Operations and Caching
@mvstore_1077_p
@mvstore_1078_p
The default map implementation supports concurrent reads on old versions of the data. All such read operations can occur in parallel. Concurrent reads from the page cache, as well as concurrent reads from the file system are supported. Writing changes to the file can occur concurrently to modifying the data, as writing operates on a snapshot.
@mvstore_1078_p
@mvstore_1079_p
Caching is done on the page level. The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
@mvstore_1079_p
@mvstore_1080_p
The default map implementation does not support concurrent modification operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>). Similar to those classes, the map tries to detect concurrent modification.
@mvstore_1080_p
@mvstore_1081_p
With the <code>MVMapConcurrent</code> implementation, read operations even on the newest version can happen concurrently with all other operations, without risk of corruption. This comes at the cost of slightly reduced speed in single-threaded mode, as with other concurrent map implementations such as <code>ConcurrentHashMap</code>. Write operations first read the relevant area from disk to memory (this can happen concurrently), and only then modify the data. The in-memory part of write operations is synchronized.
@mvstore_1081_p
@mvstore_1082_p
For fully scalable concurrent write operations to a map (in-memory and to disk), the map could be split into multiple maps in different stores ('sharding'). The plan is to add such a mechanism later when needed.
@mvstore_1082_h3
@mvstore_1083_h3
Log Structured Storage
@mvstore_1083_p
@mvstore_1084_p
Internally, changes are buffered in memory, and once enough changes have accumulated, they are written in one continuous disk write operation. Compared to traditional database storage engines, this should improve write performance for file systems and storage systems that do not efficiently support small random writes, such as Btrfs, as well as SSDs. (According to a test, write throughput of a common SSD increases with the write block size, up to a block size of 2 MB, and then does not increase further.) By default, changes are automatically written once more than a certain number of pages have been modified, and once every second in a background thread, even if only a little data was changed. Changes can also be written explicitly by calling <code>commit()</code>.
@mvstore_1084_p
@mvstore_1085_p
When storing, all changed pages are serialized, optionally compressed using the LZF algorithm, and written sequentially to a free area of the file. Each such change set is called a chunk. All parent pages of the changed B-trees are stored in this chunk as well, so that each chunk also contains the root of each changed map (which is the entry point for reading this version of the data). There is no separate index: all data is stored as a list of pages. Per store, there is one additional map that contains the metadata (the list of maps, where the root page of each map is stored, and the list of chunks).
@mvstore_1085_p
@mvstore_1086_p
There are usually two write operations per chunk: one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk). If the chunk is appended at the end of the file, the file header is only written at the end of the chunk. There is no transaction log, no undo log, and there are no in-place updates (however, unused chunks are overwritten by default).
@mvstore_1086_p
@mvstore_1087_p
Old data is kept for at least 45 seconds (configurable), so that there are no explicit sync operations required to guarantee data consistency. An application can also sync explicitly when needed. To reuse disk space, the chunks with the lowest amount of live data are compacted (the live data is stored again in the next chunk). To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
@mvstore_1087_p
@mvstore_1088_p
Compared to traditional storage engines (that use a transaction log, undo log, and main storage area), the log structured storage is simpler, more flexible, and typically needs fewer disk operations per change, as data is only written once instead of two or three times, and because the B-tree pages are always full (they are stored next to each other) and can be easily compressed. But temporarily, disk space usage might actually be a bit higher than for a regular database, as disk space is not immediately re-used (there are no in-place updates).
@mvstore_1088_h3
@mvstore_1089_h3
Off-Heap and Pluggable Storage
@mvstore_1089_p
@mvstore_1090_p
Storage is pluggable. The default storage is to a single file (unless pure in-memory operation is used).
@mvstore_1090_p
@mvstore_1091_p
An off-heap storage implementation is available. This storage keeps the data in off-heap memory, meaning outside of the regular garbage collected heap. This makes it possible to use very large in-memory stores without having to increase the JVM heap (which would increase Java garbage collection pauses a lot). Memory is allocated using <code>ByteBuffer.allocateDirect</code>. One chunk is allocated at a time (each chunk is usually a few MB large), so that the allocation cost is low. To use the off-heap storage, call:
@mvstore_1091_h3
@mvstore_1092_h3
File System Abstraction, File Locking and Online Backup
@mvstore_1092_p
@mvstore_1093_p
The file system is pluggable (the same file system abstraction is used as H2 uses). The file can be encrypted using an encrypting file system wrapper. Other file system implementations support reading from a compressed zip or jar file. The file system abstraction closely matches the Java 7 file system API.
@mvstore_1093_p
@mvstore_1094_p
Each store may only be opened once within a JVM. When opening a store, the file is locked in exclusive mode, so that the file can only be changed from within one process. Files can be opened in read-only mode, in which case a shared lock is used.
@mvstore_1094_p
@mvstore_1095_p
The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse needs to be disabled first, so that new data is always appended at the end of the file. Then, the file can be copied (the file handle is available to the application). It is recommended to use the utility class <code>FileChannelInputStream</code> to do this. For encrypted databases, both the encrypted (raw) file content, as well as the clear text content, can be backed up.
@mvstore_1095_h3
@mvstore_1096_h3
Encrypted Files
@mvstore_1096_p
@mvstore_1097_p
File encryption ensures the data can only be read with the correct password. Data can be encrypted as follows:
@mvstore_1097_p
@mvstore_1098_p
The following algorithms and settings are used:
@mvstore_1098_li
@mvstore_1099_li
The password char array is cleared after use, to reduce the risk that the password is stolen even if the attacker has access to the main memory.
@mvstore_1099_li
@mvstore_1100_li
The password is hashed according to the PBKDF2 standard, using the SHA-256 hash algorithm.
@mvstore_1100_li
@mvstore_1101_li
The length of the salt is 64 bits, so that an attacker cannot use a pre-calculated password hash table (rainbow table). It is generated using a cryptographically secure random number generator.
@mvstore_1101_li
@mvstore_1102_li
To speed up opening an encrypted store on Android, the number of PBKDF2 iterations is 10. The higher the value, the better the protection against brute-force password cracking attacks, but the slower it is to open a file.
@mvstore_1102_li
@mvstore_1103_li
The file itself is encrypted using the standardized disk encryption mode XTS-AES. Only a little more than one AES-128 operation per block is needed.
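The password hashing step in the list above can be sketched with the JDK's built-in PBKDF2 support. The parameters follow the list (HmacSHA256, 64-bit salt, 10 iterations), but the method names are illustrative, not MVStore's internal API:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHashDemo {
    // PBKDF2 with SHA-256, as described in the text; returns a 256-bit key.
    static byte[] hash(char[] password, byte[] salt, int iterations) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, iterations, 256);
        byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        Arrays.fill(password, (char) 0); // clear the password char array after use
        return key;
    }
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[8]; // 64-bit salt from a secure random source
        new SecureRandom().nextBytes(salt);
        byte[] key = hash("007".toCharArray(), salt, 10); // 10 iterations (Android tradeoff)
        System.out.println(key.length); // 32 bytes = 256 bits
    }
}
```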
@mvstore_1103_h3
@mvstore_1104_h3
Tools
@mvstore_1104_p
@mvstore_1105_p
There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
@mvstore_1105_h3
@mvstore_1106_h3
Exception Handling
@mvstore_1106_p
@mvstore_1107_p
This tool does not throw checked exceptions. Instead, unchecked exceptions are thrown if needed. The error message always contains the version of the tool. The following exceptions can occur:
@mvstore_1107_code
@mvstore_1108_code
IllegalStateException
@mvstore_1108_li
@mvstore_1109_li
if a map was already closed or an IO exception occurred, for example if the file was locked, is already closed, could not be opened or closed, if reading or writing failed, if the file is corrupt, or if there is an internal error in the tool. For such exceptions, an error code is added to the exception so that the application can distinguish between different error cases.
@mvstore_1109_code
@mvstore_1110_code
IllegalArgumentException
@mvstore_1110_li
@mvstore_1111_li
if a method was called with an illegal argument.
@mvstore_1111_code
@mvstore_1112_code
UnsupportedOperationException
@mvstore_1112_li
@mvstore_1113_li
if a method was called that is not supported, for example trying to modify a read-only map.
@mvstore_1113_code
@mvstore_1114_code
ConcurrentModificationException
@mvstore_1114_li
@mvstore_1115_li
if a map is modified concurrently.
@mvstore_1115_h3
@mvstore_1116_h3
Storage Engine for H2
@mvstore_1116_p
@mvstore_1117_p
The plan is to use the MVStore as the default storage engine for the H2 database in the future (supporting SQL, JDBC, transactions, MVCC, and so on). This is work in progress. To try it out, append <code>;MV_STORE=TRUE</code> to the database URL. In general, performance should be similar to the current default storage engine (the page store). Even though it can be used with the default table level locking, it is recommended to use it together with the MVCC mode (to do that, append <code>;MVCC=TRUE</code> to the database URL).
@mvstore_1117_h2
@mvstore_1118_h2
File Format
@mvstore_1118_p
@mvstore_1119_p
The data is stored in one file. The file contains two file headers (for safety), and a number of chunks. The file headers are one block each; a block is 4096 bytes. Each chunk is at least one block, but typically 200 blocks or more. Data is stored in the chunks in the form of a <a href="https://en.wikipedia.org/wiki/Log-structured_file_system">log structured storage</a>. There is one chunk for every version.
@mvstore_1119_p
@mvstore_1120_p
Each chunk contains a number of B-tree pages. As an example, the following code:
@mvstore_1120_p
@mvstore_1121_p
will result in the following two chunks (excluding metadata):
@mvstore_1121_b
@mvstore_1122_b
Chunk 1:
@mvstore_1122_p
@mvstore_1123_p
- Page 1: (root) node with 2 entries pointing to page 2 and 3
@mvstore_1123_p
@mvstore_1124_p
- Page 2: leaf with 140 entries (keys 0 - 139)
@mvstore_1124_p
@mvstore_1125_p
- Page 3: leaf with 260 entries (keys 140 - 399)
@mvstore_1125_b
@mvstore_1126_b
Chunk 2:
@mvstore_1126_p
@mvstore_1127_p
- Page 4: (root) node with 2 entries pointing to page 3 and 5
@mvstore_1127_p
@mvstore_1128_p
- Page 5: leaf with 140 entries (keys 0 - 139)
@mvstore_1128_p
@mvstore_1129_p
That means each chunk contains the changes of one version: the new version of the changed pages and the parent pages, recursively, up to the root page. Pages in subsequent chunks refer to pages in earlier chunks.
@mvstore_1129_h3
@mvstore_1130_h3
File Header
@mvstore_1130_p
@mvstore_1131_p
There are two file headers, which normally contain the exact same data. But once in a while, the file headers are updated, and writing could partially fail, which could corrupt a header. That's why there is a second header. Only the file headers are updated in this way (called "in-place update"). The headers contain the following data:
@mvstore_1131_p
@mvstore_1132_p
The data is stored in the form of key-value pairs. Each value is stored as a hexadecimal number. The entries are:
@mvstore_1132_li
@mvstore_1133_li
H: The entry "H:2" stands for the H2 database.
@mvstore_1133_li
@mvstore_1134_li
block: The block number where one of the newest chunks starts (but not necessarily the newest).
@mvstore_1134_li
@mvstore_1135_li
blockSize: The block size of the file; currently always hex 1000, which is decimal 4096, to match the <a href="https://en.wikipedia.org/wiki/Disk_sector">disk sector</a> length of modern hard disks.
@mvstore_1135_li
@mvstore_1136_li
chunk: The chunk id, which is normally the same value as the version; however, the chunk id might roll over to 0, while the version doesn't.
@mvstore_1136_li
@mvstore_1137_li
created: The number of milliseconds since 1970 when the file was created.
@mvstore_1137_li
@mvstore_1138_li
format: The file format number. Currently 1.
@mvstore_1138_li
@mvstore_1139_li
version: The version number of the chunk.
@mvstore_1139_li
@mvstore_1140_li
fletcher: The <a href="https://en.wikipedia.org/wiki/Fletcher's_checksum"> Fletcher-32 checksum</a> of the header.
@mvstore_1140_p
@mvstore_1141_p
When opening the file, both headers are read and the checksum is verified. If both headers are valid, the one with the newer version is used. The chunk with the latest version is then detected (see below for details), and the rest of the metadata is read from there. If the chunk id, block and version are not stored in the file header, then the latest chunk lookup starts with the last chunk in the file.
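A minimal sketch of reading such a key-value header (the sample header string is invented for illustration; real headers contain more fields, and values are hexadecimal):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderParseDemo {
    // Split a comma-separated list of key:value pairs into a map.
    static Map<String, String> parse(String header) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : header.split(",")) {
            int i = pair.indexOf(':');
            map.put(pair.substring(0, i), pair.substring(i + 1));
        }
        return map;
    }
    public static void main(String[] args) {
        Map<String, String> h = parse("H:2,blockSize:1000,format:1,version:7");
        // Values are hexadecimal: blockSize 0x1000 = 4096 bytes, one disk sector
        System.out.println(Integer.parseInt(h.get("blockSize"), 16));
    }
}
```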
@mvstore_1141_h3
@mvstore_1142_h3
Chunk Format
@mvstore_1142_p
@mvstore_1143_p
There is one chunk per version. Each chunk consists of a header, the pages that were modified in this version, and a footer. The pages contain the actual data of the maps. The pages inside a chunk are stored right after the header, next to each other (unaligned). The size of a chunk is a multiple of the block size. The footer is stored in the last 128 bytes of the chunk.
@mvstore_1143_p
@mvstore_1144_p
The footer makes it possible to verify that the chunk is completely written (a chunk is written as one write operation), and to find the start position of the very last chunk in the file. The chunk header and footer contain the following data:
@mvstore_1144_p
@mvstore_1145_p
The fields of the chunk header and footer are:
@mvstore_1145_li
@mvstore_1146_li
chunk: The chunk id.
@mvstore_1146_li
@mvstore_1147_li
block: The first block of the chunk (multiply by the block size to get the position in the file).
@mvstore_1147_li
@mvstore_1148_li
len: The size of the chunk in number of blocks.
@mvstore_1148_li
@mvstore_1149_li
map: The id of the newest map; incremented when a new map is created.
@mvstore_1149_li
@mvstore_1150_li
max: The sum of all maximum page sizes (see page format).
@mvstore_1150_li
@mvstore_1151_li
next: The predicted start block of the next chunk.
@mvstore_1151_li
@mvstore_1152_li
pages: The number of pages in the chunk.
@mvstore_1152_li
@mvstore_1153_li
root: The position of the metadata root page (see page format).
@mvstore_1153_li
@mvstore_1154_li
time: The time the chunk was written, in milliseconds after the file was created.
@mvstore_1154_li
@mvstore_1155_li
version: The version this chunk represents.
@mvstore_1155_li
@mvstore_1156_li
fletcher: The checksum of the footer.
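For illustration of the "fletcher" field, here is a textbook Fletcher-32 over little-endian 16-bit words; H2's internal variant may differ in details such as word order or seeding.

```java
public class Fletcher32Demo {
    static long fletcher32(byte[] data) {
        int sum1 = 0, sum2 = 0;
        for (int i = 0; i < data.length; i += 2) {
            int lo = data[i] & 0xff;
            // zero-pad the last word if the length is odd
            int hi = (i + 1 < data.length) ? data[i + 1] & 0xff : 0;
            sum1 = (sum1 + (lo | (hi << 8))) % 65535;
            sum2 = (sum2 + sum1) % 65535;
        }
        return ((long) sum2 << 16) | sum1;
    }
    public static void main(String[] args) {
        // Published test vector: Fletcher-32 of "abcde" is 0xF04FC729
        System.out.println(Long.toHexString(fletcher32("abcde".getBytes())).toUpperCase());
    }
}
```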
@mvstore_1156_p
@mvstore_1157_p
Chunks are never updated in-place. Each chunk contains the pages that were changed in that version (there is one chunk per version, see above), plus all the parent nodes of those pages, recursively, up to the root page. If an entry in a map is changed, removed, or added, then the respective page is copied to be stored in the next chunk, and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the <a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk for some time (until a small chunk is written or the chunks are compacted). There is a <a href="http://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed"> delay of 45 seconds</a> (by default) before a free chunk is overwritten, to ensure new versions are persisted first.
@mvstore_1157_p
@mvstore_1158_p
How the newest chunk is located when opening a store: the file header contains the position of a recent chunk, but not always the newest one. This is to reduce the number of file header updates. After opening the file, the file headers, and the chunk footer of the very last chunk (at the end of the file), are read. From those candidates, the header of the most recent chunk is read. If it contains a "next" pointer (see above), that chunk's header and footer are read as well. If it turns out to be a newer valid chunk, this is repeated until the newest chunk is found. Before writing a chunk, the position of the next chunk is predicted based on the assumption that the next chunk will be of the same size as the current one. When the next chunk is written and the prediction turns out to be incorrect, the file header is updated as well. In any case, the file header is updated if the "next" chain gets longer than 20 hops.
@mvstore_1158_h3
@mvstore_1159_h3
Page Format
@mvstore_1159_p
@mvstore_1160_p
Each map is a <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>, and the map data is stored in (B-tree-) pages. There are leaf pages that contain the key-value pairs of the map, and internal nodes, which only contain keys and pointers to leaf pages. The root of a tree is either a leaf or an internal node. Unlike the file header and the chunk header and footer, the page data is not human readable. Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a> (1 to 5 / 10 bytes). The page format is:
@mvstore_1160_li
@mvstore_1161_li
length (int): Length of the page in bytes.
@mvstore_1161_li
@mvstore_1162_li
checksum (short): Checksum (chunk id xor offset within the chunk xor page length).
@mvstore_1162_li
@mvstore_1163_li
mapId (variable size int): The id of the map this page belongs to.
@mvstore_1163_li
@mvstore_1164_li
len (variable size int): The number of keys in the page.
@mvstore_1164_li
@mvstore_1165_li
type (byte): The page type (0 for leaf page, 1 for internal node; plus 2 if the keys and values are compressed).
@mvstore_1165_li
@mvstore_1166_li
children (array of long; internal nodes only): The position of the children.
@mvstore_1166_li
@mvstore_1167_li
childCounts (array of variable size long; internal nodes only): The total number of entries for the given child page.
@mvstore_1167_li
@mvstore_1168_li
keys (byte array): All keys, stored depending on the data type.
@mvstore_1168_li
@mvstore_1169_li
values (byte array; leaf pages only): All values, stored depending on the data type.
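The checksum field from the list above follows directly from its formula (chunk id xor offset within the chunk xor page length, kept as a short); a sketch with an illustrative helper, not MVStore's code:

```java
public class PageChecksumDemo {
    // checksum = chunk id XOR offset within the chunk XOR page length, as a short
    static short checksum(int chunkId, int offset, int pageLength) {
        return (short) (chunkId ^ offset ^ pageLength);
    }
    public static void main(String[] args) {
        // 3, 4096 and 128 have disjoint bits, so the XOR equals their sum: 4227
        System.out.println(checksum(3, 4096, 128));
    }
}
```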
@mvstore_1169_p
@mvstore_1170_p
Even though this is not required by the file format, pages are stored in the following order: For each map, the root page is stored first, then the internal nodes (if there are any), and then the leaf pages. This should speed up reads for media where sequential reads are faster than random access reads. The metadata map is stored at the end of a chunk.
@mvstore_1170_p
@mvstore_1171_p
Pointers to pages are stored as a long, using a special format: 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code, 1 bit for the page type (leaf or internal node). The page type is encoded so that when clearing or removing a map, leaf pages don't have to be read (internal nodes do have to be read in order to know where all the pages are; but in a typical B-tree the vast majority of the pages are leaf pages). The absolute file position is not included so that chunks can be moved within the file without having to change page pointers; only the chunk metadata needs to be changed. The length code is a number from 0 to 31, where 0 means the maximum length of the page is 32 bytes, 1 means 48 bytes, 2 means 64, 3 means 96, 4 means 128, 5 means 192, and so on up to 31, which means longer than 1 MB. That way, reading a page only requires one read operation (except for very large pages). The sum of the maximum length of all pages is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted. This makes it possible to estimate the amount of free space within a chunk, in addition to the number of free pages.
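The bit layout described above can be sketched as follows. The bit assignments follow the text (26 + 32 + 5 + 1 bits); the helper names and the exact packing order are illustrative, not necessarily MVStore's internal representation:

```java
public class PagePointerDemo {
    // Pack: [26 bits chunk id][32 bits offset][5 bits length code][1 bit type]
    static long pack(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38)
                | ((offset & 0xffffffffL) << 6)
                | ((long) lengthCode << 1)
                | type;
    }
    static int chunkId(long pos)    { return (int) (pos >>> 38); }
    static int offset(long pos)     { return (int) (pos >>> 6); } // low 32 bits after shift
    static int lengthCode(long pos) { return (int) ((pos >>> 1) & 31); }
    static int type(long pos)       { return (int) (pos & 1); }
    // Length code -> maximum page length: 32, 48, 64, 96, 128, 192, ...
    // (doubles every two codes, matching the sequence in the text)
    static int maxLength(int code) {
        return (2 + (code & 1)) << ((code >> 1) + 4);
    }
    public static void main(String[] args) {
        long pos = pack(3, 4096, 5, 1);
        System.out.println(chunkId(pos) + " " + offset(pos) + " "
                + lengthCode(pos) + " " + type(pos));
        System.out.println(maxLength(5)); // code 5 -> 192 bytes
    }
}
```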
@mvstore_1171_p
@mvstore_1172_p
The total number of entries in child nodes is kept to allow efficient range counting, lookup by index, and skip operations. The pages form a <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
@mvstore_1172_p
@mvstore_1173_p
Data compression: The data after the page type are optionally compressed using the LZF algorithm.
@mvstore_1173_h3
@mvstore_1174_h3
Metadata Map
@mvstore_1174_p
@mvstore_1175_p
In addition to the user maps, there is one metadata map that contains names and positions of user maps, and chunk metadata. The very last page of a chunk contains the root page of that metadata map. The exact position of this root page is stored in the chunk header. This page (directly or indirectly) points to the root pages of all other maps. The metadata map of a store with a map named "data", and one chunk, contains the following entries:
@mvstore_1175_li
@mvstore_1176_li
chunk.1: The metadata of chunk 1. This is the same data as the chunk header, plus the number of live pages, and the maximum live length.
@mvstore_1176_li
@mvstore_1177_li
map.1: The metadata of map 1. The entries are: name, createVersion, and type.
@mvstore_1177_li
@mvstore_1178_li
name.data: The map id of the map named "data". The value is "1".
@mvstore_1178_li
@mvstore_1179_li
root.1: The root position of map 1.
@mvstore_1179_li
@mvstore_1180_li
setting.storeVersion: The store version (a user defined value).
@mvstore_1180_h2
@mvstore_1181_h2
Similar Projects and Differences to Other Storage Engines
@mvstore_1181_p
@mvstore_1182_p
Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java and can easily be embedded in Java and Android applications.
@mvstore_1182_p
@mvstore_1183_p
The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, and is also a log structured storage, but the H2 license is more liberal.
@mvstore_1183_p
@mvstore_1184_p
Like SQLite 3, the MVStore keeps all data in one file. Unlike SQLite 3, the MVStore uses a log structured storage. The plan is to make the MVStore both easier to use and faster than SQLite 3. In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.
@mvstore_1184_p
@mvstore_1185_p
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek, and some code is shared between MVStore and MapDB. However, unlike MapDB, the MVStore uses a log structured storage. The MVStore does not have a record size limit.
@mvstore_1185_h2
@mvstore_1186_h2
Current State
@mvstore_1186_p
@mvstore_1187_p
The code is still experimental at this stage. The API as well as the behavior may partially change. Features may be added and removed (even though the main features will stay).
@mvstore_1187_h2
@mvstore_1188_h2
Requirements
@mvstore_1188_p
@mvstore_1189_p
The MVStore is included in the latest H2 jar file.
@mvstore_1189_p
@mvstore_1190_p
There are no special requirements to use it. The MVStore should run on any JVM as well as on Android.
@mvstore_1190_p
@mvstore_1191_p
To build just the MVStore (without the database engine), run:
@mvstore_1191_p
@mvstore_1192_p
This will create the file <code>bin/h2mvstore-1.3.175.jar</code> (about 130 KB).
@performance_1000_h1
......
......@@ -7337,450 +7337,453 @@ H2 データベース エンジン
#cacheSize: the cache size in MB.
@mvstore_1043_li
#compressData: compress the data when storing.
#compress: compress the data when storing using a fast algorithm (LZF).
@mvstore_1044_li
#encryptionKey: the encryption key for file encryption.
#compressHigh: compress the data when storing using a slower algorithm (Deflate).
@mvstore_1045_li
#fileName: the name of the file, for file based stores.
#encryptionKey: the encryption key for file encryption.
@mvstore_1046_li
#fileStore: the storage implementation to use.
#fileName: the name of the file, for file based stores.
@mvstore_1047_li
#pageSplitSize: the point where pages are split.
#fileStore: the storage implementation to use.
@mvstore_1048_li
#pageSplitSize: the point where pages are split.
@mvstore_1049_li
#readOnly: open the file in read-only mode.
@mvstore_1049_h2
@mvstore_1050_h2
#R-Tree
@mvstore_1050_p
@mvstore_1051_p
# The <code>MVRTreeMap</code> is an R-tree implementation that supports fast spatial queries. It can be used as follows:
@mvstore_1051_p
@mvstore_1052_p
# The default number of dimensions is 2. To use a different number of dimensions, call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>. The minimum number of dimensions is 1, the maximum is 255.
@mvstore_1052_h2
@mvstore_1053_h2
特徴
@mvstore_1053_h3
@mvstore_1054_h3
#Maps
@mvstore_1054_p
@mvstore_1055_p
# Each store contains a set of named maps. A map is sorted by key, and supports the common lookup operations, including access to the first and last key, iterating over some or all keys, and so on.
@mvstore_1055_p
@mvstore_1056_p
# Also supported, and very uncommon for maps, is fast index lookup: the entries of the map can be efficiently accessed like a random-access list (get the entry at the given index), and the index of a key can be calculated efficiently. That also means getting the median of two keys is very fast, and a range of keys can be counted very quickly. The iterator supports fast skipping. This is possible because internally, each map is organized in the form of a counted B+-tree.
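A minimal sketch of this index lookup, assuming the H2 jar is on the classpath; <code>getKey</code> and <code>getKeyIndex</code> are the MVMap index-access methods of this era of the API, while the wrapper class name is hypothetical:

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class IndexLookupSketch {
    // Fills a map with keys 0, 2, ..., 198 and demonstrates
    // list-style access by index and the inverse lookup.
    static long[] lookup() {
        MVStore s = MVStore.open(null); // in-memory store
        MVMap<Integer, String> map = s.openMap("data");
        for (int i = 0; i < 100; i++) {
            map.put(i * 2, "v" + i);
        }
        Integer keyAtIndex = map.getKey(10);   // the 11th smallest key
        long indexOfKey = map.getKeyIndex(40); // index of a given key
        s.close();
        return new long[] { keyAtIndex, indexOfKey };
    }
}
```

Both operations are O(log n) because every internal node stores the entry counts of its children.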
@mvstore_1056_p
@mvstore_1057_p
# In database terms, a map can be used like a table, where the key of the map is the primary key of the table, and the value is the row. A map can also represent an index, where the key of the map is the key of the index, and the value of the map is the primary key of the table (for non-unique indexes, the key of the map must also contain the primary key).
@mvstore_1057_h3
@mvstore_1058_h3
#Versions
@mvstore_1058_p
@mvstore_1059_p
# A version is a snapshot of all the data of all maps at a given point in time. Creating a snapshot is fast: only those pages that are changed after a snapshot are copied. This behavior is also called COW (copy on write). Rollback to an old version is supported. Old versions are readable until old data is purged.
@mvstore_1059_p
@mvstore_1060_p
# The following sample code shows how to create a store, open a map, add some data, and access the current and an old version:
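The listing itself is not included in this excerpt; the following is a hedged reconstruction, with method names following this era of the MVStore API and a hypothetical class name:

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class VersionSketch {
    // Returns {value of key 1 in the old version, value in the new version}.
    static String[] demo() {
        MVStore s = MVStore.open(null); // in-memory for the example
        MVMap<Integer, String> map = s.openMap("data");
        map.put(1, "Hello");
        map.put(2, "World");
        // remember the current version; after commit it is read-only
        long oldVersion = s.getCurrentVersion();
        s.commit();
        // more changes, in the new version
        map.put(1, "Hi");
        map.remove(2);
        // access the old data (before the commit)
        MVMap<Integer, String> oldMap = map.openVersion(oldVersion);
        String[] result = { oldMap.get(1), map.get(1) };
        s.close();
        return result;
    }
}
```

Reading the old version and writing the new one can happen in parallel, as only changed pages are copied.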
@mvstore_1060_h3
@mvstore_1061_h3
#Transactions
@mvstore_1061_p
@mvstore_1062_p
# To support multiple concurrent open transactions, a transaction utility is included, the <code>TransactionStore</code>. The tool supports PostgreSQL style "read committed" transaction isolation with savepoints, two-phase commit, and other features typically available in a database. There is no limit on the size of a transaction (the log is written to disk for large or long running transactions).
@mvstore_1062_p
@mvstore_1063_p
# Internally, this utility stores the old versions of changed entries in a separate map, similar to a transaction log (except that entries of a closed transaction are removed, and the log is usually not stored for short transactions). For common use cases, the storage overhead of this utility is very small compared to the overhead of a regular transaction log.
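A sketch of the transaction utility, under the assumption that <code>TransactionStore</code>, <code>Transaction</code> and <code>TransactionMap</code> are available as in this era's <code>org.h2.mvstore.db</code> package; the wrapper class is hypothetical:

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.db.TransactionStore;
import org.h2.mvstore.db.TransactionStore.Transaction;
import org.h2.mvstore.db.TransactionStore.TransactionMap;

public class TxSketch {
    static String demo() {
        MVStore s = MVStore.open(null);
        TransactionStore ts = new TransactionStore(s);
        Transaction tx = ts.begin();
        TransactionMap<String, String> map = tx.openMap("data");
        map.put("key", "value"); // visible to this transaction only
        tx.commit();             // now visible to new transactions
        Transaction tx2 = ts.begin();
        String v = tx2.<String, String>openMap("data").get("key");
        tx2.commit();
        s.close();
        return v;
    }
}
```

Until <code>commit()</code>, the old values of changed entries are kept in the separate undo map described above, so a rollback only has to replay that map.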
@mvstore_1063_h3
@mvstore_1064_h3
#In-Memory Performance and Usage
@mvstore_1064_p
@mvstore_1065_p
# Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>, but usually slower than <code>java.util.HashMap</code>.
@mvstore_1065_p
@mvstore_1066_p
# The memory overhead for large maps is slightly better than for the regular map implementations, but there is a higher overhead per map. For maps with fewer than about 25 entries, the regular map implementations need less memory.
@mvstore_1066_p
@mvstore_1067_p
# If no file name is specified, the store operates purely in memory. Except for persisting data, all features are supported in this mode (multi-versioning, index lookup, R-tree and so on). If a file name is specified, all operations occur in memory (with the same performance characteristics) until data is persisted.
@mvstore_1067_p
@mvstore_1068_p
# As in all map implementations, keys need to be immutable, that means changing the key object after an entry has been added is not allowed. If a file name is specified, the value may also not be changed after adding an entry, because it might be serialized (which could happen at any time when autocommit is enabled).
@mvstore_1068_h3
@mvstore_1069_h3
#Pluggable Data Types
@mvstore_1069_p
@mvstore_1070_p
# Serialization is pluggable. The default serialization currently supports many common data types, and uses Java serialization for other objects. The following classes are currently directly supported: <code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal, String, UUID, Date</code> and arrays (both primitive arrays and object arrays). For serialized objects, the size estimate is adjusted using an exponential moving average.
@mvstore_1070_p
@mvstore_1071_p
# Parameterized data types are supported (for example one could build a string data type that limits the length).
@mvstore_1071_p
@mvstore_1072_p
# The storage engine itself does not have any length limits, so that keys, values, pages, and chunks can be very big (as big as fits in memory). Also, there is no inherent limit to the number of maps and chunks. Due to using a log structured storage, there is no special case handling for large keys or pages.
@mvstore_1072_h3
@mvstore_1073_h3
#BLOB Support
@mvstore_1073_p
@mvstore_1074_p
# There is a mechanism that stores large binary objects by splitting them into smaller blocks. This makes it possible to store objects that don't fit in memory. Streaming as well as random access reads on such objects are supported. This tool is written on top of the store, using only the map interface.
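A sketch of this BLOB mechanism, assuming it is exposed as the <code>StreamStore</code> class backed by a plain map of blocks; the wrapper class is hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;
import org.h2.mvstore.StreamStore;

public class BlobSketch {
    // Stores a 1 MB object and returns its length as reported by the store.
    static long demo() throws IOException {
        MVStore s = MVStore.open(null);
        MVMap<Long, byte[]> blocks = s.openMap("blobBlocks");
        StreamStore streamStore = new StreamStore(blocks);
        // the object is split into smaller blocks internally;
        // the returned id identifies the stored stream
        byte[] id = streamStore.put(new ByteArrayInputStream(new byte[1024 * 1024]));
        long length = streamStore.length(id);
        InputStream in = streamStore.get(id); // streaming read
        in.close();
        s.close();
        return length;
    }
}
```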
@mvstore_1074_h3
@mvstore_1075_h3
#R-Tree and Pluggable Map Implementations
@mvstore_1075_p
@mvstore_1076_p
# The map implementation is pluggable. In addition to the default <code>MVMap</code> (multi-version map), there is a map that supports concurrent write operations, and a multi-version R-tree map implementation for spatial operations.
@mvstore_1076_h3
@mvstore_1077_h3
#Concurrent Operations and Caching
@mvstore_1077_p
@mvstore_1078_p
# The default map implementation supports concurrent reads on old versions of the data. All such read operations can occur in parallel. Concurrent reads from the page cache, as well as concurrent reads from the file system are supported. Writing changes to the file can occur concurrently to modifying the data, as writing operates on a snapshot.
@mvstore_1078_p
@mvstore_1079_p
# Caching is done on the page level. The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
@mvstore_1079_p
@mvstore_1080_p
# The default map implementation does not support concurrent modification operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>). Similar to those classes, the map tries to detect concurrent modification.
@mvstore_1080_p
@mvstore_1081_p
# With the <code>MVMapConcurrent</code> implementation, read operations even on the newest version can happen concurrently with all other operations, without risk of corruption. This comes with slightly reduced speed in single threaded mode, as with other concurrent map implementations such as <code>ConcurrentHashMap</code>. Write operations first read the relevant area from disk to memory (this can happen concurrently), and only then modify the data. The in-memory part of write operations is synchronized.
@mvstore_1081_p
@mvstore_1082_p
# For fully scalable concurrent write operations to a map (in-memory and to disk), the map could be split into multiple maps in different stores ('sharding'). The plan is to add such a mechanism later when needed.
@mvstore_1082_h3
@mvstore_1083_h3
#Log Structured Storage
@mvstore_1083_p
@mvstore_1084_p
# Internally, changes are buffered in memory, and once enough changes have accumulated, they are written in one continuous disk write operation. Compared to traditional database storage engines, this should improve write performance for file systems and storage systems that do not efficiently support small random writes, such as Btrfs, as well as SSDs. (According to a test, write throughput of a common SSD increases with the write block size up to a block size of 2 MB, and then does not increase further.) By default, changes are automatically written when more than a certain number of pages have been modified, and once every second in a background thread, even if only a little data was changed. Changes can also be written explicitly by calling <code>commit()</code>.
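A sketch of explicit commits with auto-commit disabled, assuming the H2 jar is on the classpath; the temporary-file handling and class name are hypothetical:

```java
import java.io.File;
import java.io.IOException;

import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class CommitSketch {
    // With auto-commit disabled, changes accumulate in memory until
    // commit() writes them as one chunk; returns the new version.
    static long demo() throws IOException {
        File f = File.createTempFile("mvstore", ".mv");
        f.delete(); // the store creates the file itself
        MVStore s = new MVStore.Builder().
                fileName(f.getAbsolutePath()).
                autoCommitDisabled().
                open();
        MVMap<Integer, String> map = s.openMap("data");
        map.put(1, "Hello");
        map.put(2, "World");
        long version = s.commit(); // one continuous write of all changes
        s.close();
        f.delete();
        return version;
    }
}
```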
@mvstore_1084_p
@mvstore_1085_p
# When storing, all changed pages are serialized, optionally compressed using the LZF algorithm, and written sequentially to a free area of the file. Each such change set is called a chunk. All parent pages of the changed B-trees are stored in this chunk as well, so that each chunk also contains the root of each changed map (which is the entry point for reading this version of the data). There is no separate index: all data is stored as a list of pages. Per store, there is one additional map that contains the metadata (the list of maps, where the root page of each map is stored, and the list of chunks).
@mvstore_1085_p
@mvstore_1086_p
# There are usually two write operations per chunk: one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk). If the chunk is appended at the end of the file, the file header is only written at the end of the chunk. There is no transaction log, no undo log, and there are no in-place updates (however, unused chunks are overwritten by default).
@mvstore_1086_p
@mvstore_1087_p
# Old data is kept for at least 45 seconds (configurable), so that there are no explicit sync operations required to guarantee data consistency. An application can also sync explicitly when needed. To reuse disk space, the chunks with the lowest amount of live data are compacted (the live data is stored again in the next chunk). To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
@mvstore_1087_p
@mvstore_1088_p
# Compared to traditional storage engines (that use a transaction log, undo log, and main storage area), the log structured storage is simpler, more flexible, and typically needs less disk operations per change, as data is only written once instead of twice or 3 times, and because the B-tree pages are always full (they are stored next to each other) and can be easily compressed. But temporarily, disk space usage might actually be a bit higher than for a regular database, as disk space is not immediately re-used (there are no in-place updates).
@mvstore_1088_h3
@mvstore_1089_h3
#Off-Heap and Pluggable Storage
@mvstore_1089_p
@mvstore_1090_p
# Storage is pluggable. The default storage is to a single file (unless pure in-memory operation is used).
@mvstore_1090_p
@mvstore_1091_p
# An off-heap storage implementation is available. This storage keeps the data in off-heap memory, that is, outside of the regular garbage collected heap. This makes it possible to use very large in-memory stores without having to increase the JVM heap (which would significantly increase Java garbage collection pauses). Memory is allocated using <code>ByteBuffer.allocateDirect</code>. One chunk is allocated at a time (each chunk is usually a few MB in size), so that the allocation cost is low. To use the off-heap storage, call:
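The call referred to above is not included in this excerpt; a hedged sketch, assuming <code>OffHeapStore</code> lives in the <code>org.h2.mvstore</code> package (the wrapper class is hypothetical):

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;
import org.h2.mvstore.OffHeapStore;

public class OffHeapSketch {
    static String demo() {
        // plug the off-heap implementation in as the file store
        OffHeapStore offHeap = new OffHeapStore();
        MVStore s = new MVStore.Builder().
                fileStore(offHeap).
                open();
        MVMap<Integer, String> map = s.openMap("data");
        map.put(1, "Hello");
        s.commit(); // chunks are written to off-heap memory, not to a file
        String v = map.get(1);
        s.close();
        return v;
    }
}
```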
@mvstore_1091_h3
@mvstore_1092_h3
#File System Abstraction, File Locking and Online Backup
@mvstore_1092_p
@mvstore_1093_p
# The file system is pluggable (the same file system abstraction that H2 uses). The file can be encrypted using an encrypting file system wrapper. Other file system implementations support reading from a compressed zip or jar file. The file system abstraction closely matches the Java 7 file system API.
@mvstore_1093_p
@mvstore_1094_p
# Each store may only be opened once within a JVM. When opening a store, the file is locked in exclusive mode, so that the file can only be changed from within one process. Files can be opened in read-only mode, in which case a shared lock is used.
@mvstore_1094_p
@mvstore_1095_p
# The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse first needs to be disabled, so that new data is always appended at the end of the file. Then, the file can be copied (the file handle is available to the application). It is recommended to use the utility class <code>FileChannelInputStream</code> to do this. For encrypted databases, both the encrypted (raw) file content and the clear text content can be backed up.
@mvstore_1095_h3
@mvstore_1096_h3
#Encrypted Files
@mvstore_1096_p
@mvstore_1097_p
# File encryption ensures the data can only be read with the correct password. Data can be encrypted as follows:
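The encryption listing is not included in this excerpt; the following hedged sketch uses the builder options shown earlier in this document (temporary-file handling and class name are hypothetical):

```java
import java.io.File;
import java.io.IOException;

import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class EncryptionSketch {
    // Writes one entry to an encrypted store, then reopens it
    // with the same password and reads the entry back.
    static String demo() throws IOException {
        File f = File.createTempFile("mvstore", ".mv");
        f.delete();
        MVStore s = new MVStore.Builder().
                fileName(f.getAbsolutePath()).
                encryptionKey("007".toCharArray()).
                open();
        MVMap<String, String> map = s.openMap("data");
        map.put("key", "value");
        s.close();
        // reopening requires the same password
        s = new MVStore.Builder().
                fileName(f.getAbsolutePath()).
                encryptionKey("007".toCharArray()).
                open();
        String v = s.<String, String>openMap("data").get("key");
        s.close();
        f.delete();
        return v;
    }
}
```

Note that the password char array is cleared by the store after use, as described below.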
@mvstore_1097_p
@mvstore_1098_p
# The following algorithms and settings are used:
@mvstore_1098_li
@mvstore_1099_li
#The password char array is cleared after use, to reduce the risk that the password is stolen even if the attacker has access to the main memory.
@mvstore_1099_li
@mvstore_1100_li
#The password is hashed according to the PBKDF2 standard, using the SHA-256 hash algorithm.
@mvstore_1100_li
@mvstore_1101_li
#The length of the salt is 64 bits, so that an attacker can not use a pre-calculated password hash table (rainbow table). It is generated using a cryptographically secure random number generator.
@mvstore_1101_li
@mvstore_1102_li
#To speed up opening encrypted stores on Android, the number of PBKDF2 iterations is 10. The higher the value, the better the protection against brute-force password cracking attacks, but the slower opening a file becomes.
@mvstore_1102_li
@mvstore_1103_li
#The file itself is encrypted using the standardized disk encryption mode XTS-AES. Only slightly more than one AES-128 round per block is needed.
@mvstore_1103_h3
@mvstore_1104_h3
#Tools
@mvstore_1104_p
@mvstore_1105_p
# There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
@mvstore_1105_h3
@mvstore_1106_h3
#Exception Handling
@mvstore_1106_p
@mvstore_1107_p
# This tool does not throw checked exceptions. Instead, unchecked exceptions are thrown if needed. The error message always contains the version of the tool. The following exceptions can occur:
@mvstore_1107_code
@mvstore_1108_code
#IllegalStateException
@mvstore_1108_li
@mvstore_1109_li
# if a map was already closed or an IO exception occurred, for example if the file was locked, is already closed, could not be opened or closed, if reading or writing failed, if the file is corrupt, or if there is an internal error in the tool. For such exceptions, an error code is added to the exception so that the application can distinguish between different error cases.
@mvstore_1109_code
@mvstore_1110_code
#IllegalArgumentException
@mvstore_1110_li
@mvstore_1111_li
# if a method was called with an illegal argument.
@mvstore_1111_code
@mvstore_1112_code
#UnsupportedOperationException
@mvstore_1112_li
@mvstore_1113_li
# if a method was called that is not supported, for example trying to modify a read-only map.
@mvstore_1113_code
@mvstore_1114_code
#ConcurrentModificationException
@mvstore_1114_li
@mvstore_1115_li
# if a map is modified concurrently.
@mvstore_1115_h3
@mvstore_1116_h3
#Storage Engine for H2
@mvstore_1116_p
@mvstore_1117_p
# The plan is to use the MVStore as the default storage engine for the H2 database in the future (supporting SQL, JDBC, transactions, MVCC, and so on). This is work in progress. To try it out, append <code>;MV_STORE=TRUE</code> to the database URL. In general, performance should be similar to that of the current default storage engine (the page store). Even though it can be used with the default table level locking, it is recommended to use it together with the MVCC mode (to do that, append <code>;MVCC=TRUE</code> to the database URL).
@mvstore_1117_h2
@mvstore_1118_h2
#File Format
@mvstore_1118_p
@mvstore_1119_p
# The data is stored in one file. The file contains two file headers (for safety), and a number of chunks. The file headers are one block each; a block is 4096 bytes. Each chunk is at least one block, but typically 200 blocks or more. Data is stored in the chunks in the form of a <a href="https://en.wikipedia.org/wiki/Log-structured_file_system">log structured storage</a>. There is one chunk for every version.
@mvstore_1119_p
@mvstore_1120_p
# Each chunk contains a number of B-tree pages. As an example, the following code:
@mvstore_1120_p
@mvstore_1121_p
# will result in the following two chunks (excluding metadata):
@mvstore_1121_b
@mvstore_1122_b
#Chunk 1:
@mvstore_1122_p
@mvstore_1123_p
# - Page 1: (root) node with 2 entries pointing to page 2 and 3
@mvstore_1123_p
@mvstore_1124_p
# - Page 2: leaf with 140 entries (keys 0 - 139)
@mvstore_1124_p
@mvstore_1125_p
# - Page 3: leaf with 260 entries (keys 140 - 399)
@mvstore_1125_b
@mvstore_1126_b
#Chunk 2:
@mvstore_1126_p
@mvstore_1127_p
# - Page 4: (root) node with 2 entries pointing to page 3 and 5
@mvstore_1127_p
@mvstore_1128_p
# - Page 5: leaf with 140 entries (keys 0 - 139)
@mvstore_1128_p
@mvstore_1129_p
# That means each chunk contains the changes of one version: the new version of the changed pages and the parent pages, recursively, up to the root page. Pages in subsequent chunks refer to pages in earlier chunks.
@mvstore_1129_h3
@mvstore_1130_h3
#File Header
@mvstore_1130_p
@mvstore_1131_p
# There are two file headers, which normally contain the exact same data. But once in a while, the file headers are updated, and writing could partially fail, which could corrupt a header. That's why there is a second header. Only the file headers are updated in this way (called "in-place update"). The headers contain the following data:
@mvstore_1131_p
@mvstore_1132_p
# The data is stored in the form of a key-value pair. Each value is stored as a hexadecimal number. The entries are:
@mvstore_1132_li
@mvstore_1133_li
#H: The entry "H:2" stands for the H2 database.
@mvstore_1133_li
@mvstore_1134_li
#block: The block number where one of the newest chunks starts (but not necessarily the newest).
@mvstore_1134_li
@mvstore_1135_li
#blockSize: The block size of the file; currently always hex 1000, which is decimal 4096, to match the <a href="https://en.wikipedia.org/wiki/Disk_sector">disk sector</a> length of modern hard disks.
@mvstore_1135_li
@mvstore_1136_li
#chunk: The chunk id, which is normally the same value as the version; however, the chunk id might roll over to 0, while the version doesn't.
@mvstore_1136_li
@mvstore_1137_li
#created: The number of milliseconds since 1970 when the file was created.
@mvstore_1137_li
@mvstore_1138_li
#format: The file format number. Currently 1.
@mvstore_1138_li
@mvstore_1139_li
#version: The version number of the chunk.
@mvstore_1139_li
@mvstore_1140_li
#fletcher: The <a href="https://en.wikipedia.org/wiki/Fletcher's_checksum"> Fletcher-32 checksum</a> of the header.
@mvstore_1140_p
@mvstore_1141_p
# When opening the file, both headers are read and the checksum is verified. If both headers are valid, the one with the newer version is used. The chunk with the latest version is then detected (see below for details), and the rest of the metadata is read from there. If the chunk id, block and version are not stored in the file header, then the latest chunk lookup starts with the last chunk in the file.
@mvstore_1141_h3
@mvstore_1142_h3
#Chunk Format
@mvstore_1142_p
@mvstore_1143_p
# There is one chunk per version. Each chunk consists of a header, the pages that were modified in this version, and a footer. The pages contain the actual data of the maps. The pages inside a chunk are stored right after the header, next to each other (unaligned). The size of a chunk is a multiple of the block size. The footer is stored in the last 128 bytes of the chunk.
@mvstore_1143_p
@mvstore_1144_p
# The footer makes it possible to verify that the chunk is completely written (a chunk is written as one write operation), and to find the start position of the very last chunk in the file. The chunk header and footer contain the following data:
@mvstore_1144_p
@mvstore_1145_p
# The fields of the chunk header and footer are:
@mvstore_1145_li
@mvstore_1146_li
#chunk: The chunk id.
@mvstore_1146_li
@mvstore_1147_li
#block: The first block of the chunk (multiply by the block size to get the position in the file).
@mvstore_1147_li
@mvstore_1148_li
#len: The size of the chunk in number of blocks.
@mvstore_1148_li
@mvstore_1149_li
#map: The id of the newest map; incremented when a new map is created.
@mvstore_1149_li
@mvstore_1150_li
#max: The sum of all maximum page sizes (see page format).
@mvstore_1150_li
@mvstore_1151_li
#next: The predicted start block of the next chunk.
@mvstore_1151_li
@mvstore_1152_li
#pages: The number of pages in the chunk.
@mvstore_1152_li
@mvstore_1153_li
#root: The position of the metadata root page (see page format).
@mvstore_1153_li
@mvstore_1154_li
#time: The time the chunk was written, in milliseconds after the file was created.
@mvstore_1154_li
@mvstore_1155_li
#version: The version this chunk represents.
@mvstore_1155_li
@mvstore_1156_li
#fletcher: The checksum of the footer.
@mvstore_1156_p
@mvstore_1157_p
# Chunks are never updated in-place. Each chunk contains the pages that were changed in that version (there is one chunk per version, see above), plus all the parent nodes of those pages, recursively, up to the root page. If an entry in a map is changed, removed, or added, then the respective page is copied to be stored in the next chunk, and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the <a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk for some time (until a small chunk is written or the chunks are compacted). There is a <a href="http://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed"> delay of 45 seconds</a> (by default) before a free chunk is overwritten, to ensure new versions are persisted first.
@mvstore_1157_p
@mvstore_1158_p
# How the newest chunk is located when opening a store: The file header contains the position of a recent chunk, but not always the newest one. This is to reduce the number of file header updates. After opening the file, the file headers and the chunk footer of the very last chunk (at the end of the file) are read. From those candidates, the header of the most recent chunk is read. If it contains a "next" pointer (see above), that chunk's header and footer are read as well. If it turns out to be a newer valid chunk, this is repeated until the newest chunk is found. Before writing a chunk, the position of the next chunk is predicted based on the assumption that the next chunk will be of the same size as the current one. When the next chunk is written and the prediction turns out to have been incorrect, the file header is updated as well. In any case, the file header is updated if the "next" chain gets longer than 20 hops.
@mvstore_1158_h3
@mvstore_1159_h3
#Page Format
@mvstore_1159_p
@mvstore_1160_p
# Each map is a <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>, and the map data is stored in (B-tree-) pages. There are leaf pages that contain the key-value pairs of the map, and internal nodes, which only contain keys and pointers to leaf pages. The root of a tree is either a leaf or an internal node. Unlike file header and chunk header and footer, the page data is not human readable. Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a> (1 to 5 / 10 bytes). The page format is:
@mvstore_1160_li
@mvstore_1161_li
#length (int): Length of the page in bytes.
@mvstore_1161_li
@mvstore_1162_li
#checksum (short): Checksum (chunk id xor offset within the chunk xor page length).
@mvstore_1162_li
@mvstore_1163_li
#mapId (variable size int): The id of the map this page belongs to.
@mvstore_1163_li
@mvstore_1164_li
#len (variable size int): The number of keys in the page.
@mvstore_1164_li
@mvstore_1165_li
#type (byte): The page type (0 for leaf page, 1 for internal node; plus 2 if the keys and values are compressed).
@mvstore_1165_li
@mvstore_1166_li
#children (array of long; internal nodes only): The position of the children.
@mvstore_1166_li
@mvstore_1167_li
#childCounts (array of variable size long; internal nodes only): The total number of entries for the given child page.
@mvstore_1167_li
@mvstore_1168_li
#keys (byte array): All keys, stored depending on the data type.
@mvstore_1168_li
@mvstore_1169_li
#values (byte array; leaf pages only): All values, stored depending on the data type.
@mvstore_1169_p
@mvstore_1170_p
# Even though this is not required by the file format, pages are stored in the following order: For each map, the root page is stored first, then the internal nodes (if there are any), and then the leaf pages. This should speed up reads for media where sequential reads are faster than random access reads. The metadata map is stored at the end of a chunk.
@mvstore_1170_p
@mvstore_1171_p
# Pointers to pages are stored as a long, using a special format: 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code, 1 bit for the page type (leaf or internal node). The page type is encoded so that when clearing or removing a map, leaf pages don't have to be read (internal nodes do have to be read in order to know where all the pages are; but in a typical B-tree the vast majority of the pages are leaf pages). The absolute file position is not included so that chunks can be moved within the file without having to change page pointers; only the chunk metadata needs to be changed. The length code is a number from 0 to 31, where 0 means the maximum length of the page is 32 bytes, 1 means 48 bytes, 2: 64, 3: 96, 4: 128, 5: 192, and so on until 31, which means longer than 1 MB. That way, reading a page only requires one read operation (except for very large pages). The sum of the maximum length of all pages is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted. This makes it possible to estimate the amount of free space within a block, in addition to the number of free pages.
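The pointer layout above can be sketched in plain Java. Only the field widths and the length-code table are taken from the text; the exact bit order is an assumption for illustration:

```java
// Packing and unpacking a 64-bit page pointer: 26-bit chunk id,
// 32-bit offset, 5-bit length code, 1-bit page type.
public class PagePointer {
    static long pack(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38)
                | ((offset & 0xffffffffL) << 6)
                | ((long) lengthCode << 1)
                | type;
    }
    static int chunkId(long pos)    { return (int) (pos >>> 38); }
    static int offset(long pos)     { return (int) (pos >>> 6); }
    static int lengthCode(long pos) { return (int) ((pos >>> 1) & 31); }
    static int type(long pos)       { return (int) (pos & 1); }

    // Maximum page length for a length code, following the table in
    // the text: 32, 48, 64, 96, 128, 192, ... (alternating doubling).
    static int maxLength(int code) {
        return ((code & 1) == 0 ? 32 : 48) << (code / 2);
    }
}
```

Because the length code gives an upper bound rather than the exact size, a page can always be fetched with a single read of <code>maxLength(code)</code> bytes.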
@mvstore_1171_p
@mvstore_1172_p
# The total number of entries in child nodes are kept to allow efficient range counting, lookup by index, and skip operations. The pages form a <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
@mvstore_1172_p
@mvstore_1173_p
# Data compression: The data after the page type are optionally compressed using the LZF algorithm.
@mvstore_1173_h3
@mvstore_1174_h3
#Metadata Map
@mvstore_1174_p
@mvstore_1175_p
# In addition to the user maps, there is one metadata map that contains names and positions of user maps, and chunk metadata. The very last page of a chunk contains the root page of that metadata map. The exact position of this root page is stored in the chunk header. This page (directly or indirectly) points to the root pages of all other maps. The metadata map of a store with a map named "data", and one chunk, contains the following entries:
@mvstore_1175_li
@mvstore_1176_li
#chunk.1: The metadata of chunk 1. This is the same data as the chunk header, plus the number of live pages, and the maximum live length.
@mvstore_1176_li
@mvstore_1177_li
#map.1: The metadata of map 1. The entries are: name, createVersion, and type.
@mvstore_1177_li
@mvstore_1178_li
#name.data: The map id of the map named "data". The value is "1".
@mvstore_1178_li
@mvstore_1179_li
#root.1: The root position of map 1.
@mvstore_1179_li
@mvstore_1180_li
#setting.storeVersion: The store version (a user defined value).
@mvstore_1180_h2
@mvstore_1181_h2
#Similar Projects and Differences to Other Storage Engines
@mvstore_1181_p
@mvstore_1182_p
# Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java and can easily be embedded in a Java and Android application.
@mvstore_1182_p
@mvstore_1183_p
# The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, and is also a log structured storage, but the H2 license is more liberal.
@mvstore_1183_p
@mvstore_1184_p
# Like SQLite 3, the MVStore keeps all data in one file. Unlike SQLite 3, the MVStore uses a log structured storage. The plan is to make the MVStore both easier to use and faster than SQLite 3. In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.
@mvstore_1184_p
@mvstore_1185_p
# The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek, and some code is shared between MVStore and MapDB. However, unlike MapDB, the MVStore uses a log structured storage. The MVStore does not have a record size limit.
@mvstore_1185_h2
@mvstore_1186_h2
#Current State
@mvstore_1186_p
@mvstore_1187_p
# The code is still experimental at this stage. The API as well as the behavior may partially change. Features may be added and removed (even though the main features will stay).
@mvstore_1187_h2
@mvstore_1188_h2
必要条件
@mvstore_1188_p
@mvstore_1189_p
# The MVStore is included in the latest H2 jar file.
@mvstore_1189_p
@mvstore_1190_p
# There are no special requirements to use it. The MVStore should run on any JVM as well as on Android.
@mvstore_1190_p
@mvstore_1191_p
# To build just the MVStore (without the database engine), run:
@mvstore_1191_p
@mvstore_1192_p
# This will create the file <code>bin/h2mvstore-1.3.175.jar</code> (about 130 KB).
@performance_1000_h1
......
......@@ -2444,155 +2444,156 @@ mvstore_1039_li=autoCommitBufferSize\: the size of the write buffer.
mvstore_1040_li=autoCommitDisabled\: to disable auto-commit.
mvstore_1041_li=backgroundExceptionHandler\: specify a handler for exceptions that could occur while writing in the background.
mvstore_1042_li=cacheSize\: the cache size in MB.
mvstore_1043_li=compressData\: compress the data when storing.
mvstore_1044_li=encryptionKey\: the encryption key for file encryption.
mvstore_1045_li=fileName\: the name of the file, for file based stores.
mvstore_1046_li=fileStore\: the storage implementation to use.
mvstore_1047_li=pageSplitSize\: the point where pages are split.
mvstore_1048_li=readOnly\: open the file in read-only mode.
mvstore_1049_h2=R-Tree
mvstore_1050_p=\ The <code>MVRTreeMap</code> is an R-tree implementation that supports fast spatial queries. It can be used as follows\:
mvstore_1051_p=\ The default number of dimensions is 2. To use a different number of dimensions, call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>. The minimum number of dimensions is 1, the maximum is 255.
mvstore_1052_h2=Features
mvstore_1053_h3=Maps
mvstore_1054_p=\ Each store contains a set of named maps. A map is sorted by key, and supports the common lookup operations, including access to the first and last key, iterating over some or all keys, and so on.
mvstore_1055_p=\ Also supported, and very uncommon for maps, is fast index lookup\: the entries of the map can be efficiently accessed like a random-access list (get the entry at the given index), and the index of a key can be calculated efficiently. That also means getting the median of two keys is very fast, and a range of keys can be counted very quickly. The iterator supports fast skipping. This is possible because internally, each map is organized in the form of a counted B+-tree.
mvstore_1056_p=\ In database terms, a map can be used like a table, where the key of the map is the primary key of the table, and the value is the row. A map can also represent an index, where the key of the map is the key of the index, and the value of the map is the primary key of the table (for non-unique indexes, the key of the map must also contain the primary key).
mvstore_1057_h3=Versions
mvstore_1058_p=\ A version is a snapshot of all the data of all maps at a given point in time. Creating a snapshot is fast\: only those pages that are changed after a snapshot are copied. This behavior is also called COW (copy on write). Rollback to an old version is supported. Old versions are readable until old data is purged.
mvstore_1059_p=\ The following sample code shows how to create a store, open a map, add some data, and access the current and an old version\:
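The sample code referred to above is elided in this properties file; the following is a sketch of the pattern it describes, assuming the `MVStore.getCurrentVersion` and `MVMap.openVersion` methods of this H2 version:

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class VersionExample {
    public static void main(String[] args) {
        // in-memory store; pass a file name instead of null to persist
        MVStore s = MVStore.open(null);
        MVMap<Integer, String> map = s.openMap("data");
        map.put(1, "Hello");
        map.put(2, "World");
        // remember the current version, then freeze it with a commit
        long oldVersion = s.getCurrentVersion();
        s.commit();
        // changes in the new version
        map.put(1, "Hi");
        map.remove(2);
        // current data
        System.out.println(map.get(1));
        // the old snapshot is still readable
        MVMap<Integer, String> oldMap = map.openVersion(oldVersion);
        System.out.println(oldMap.get(1));
        s.close();
    }
}
```

The old version stays readable until its chunks are purged, as described in the Versions section above.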
mvstore_1060_h3=Transactions
mvstore_1061_p=\ To support multiple concurrent open transactions, a transaction utility is included, the <code>TransactionStore</code>. The tool supports PostgreSQL style "read committed" transaction isolation with savepoints, two-phase commit, and other features typically available in a database. There is no limit on the size of a transaction (the log is written to disk for large or long running transactions).
mvstore_1062_p=\ Internally, this utility stores the old versions of changed entries in a separate map, similar to a transaction log (except that entries of a closed transaction are removed, and the log is usually not stored for short transactions). For common use cases, the storage overhead of this utility is very small compared to the overhead of a regular transaction log.
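A sketch of using the transaction utility, assuming the `TransactionStore` API from the 1.3.x source tree (package `org.h2.mvstore.db`); class locations and method names may differ in other H2 versions:

```java
import org.h2.mvstore.MVStore;
import org.h2.mvstore.db.TransactionStore;
import org.h2.mvstore.db.TransactionStore.Transaction;
import org.h2.mvstore.db.TransactionStore.TransactionMap;

public class TxExample {
    public static void main(String[] args) {
        MVStore s = MVStore.open(null);
        TransactionStore ts = new TransactionStore(s);
        // begin a transaction and change data through a transactional map
        Transaction tx = ts.begin();
        TransactionMap<String, String> map = tx.openMap("data");
        map.put("key", "value");
        // changes become visible to other transactions on commit;
        // tx.rollback() would undo them instead
        tx.commit();
        s.close();
    }
}
```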
mvstore_1063_h3=In-Memory Performance and Usage
mvstore_1064_p=\ Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>, but usually slower than <code>java.util.HashMap</code>.
mvstore_1065_p=\ The memory overhead for large maps is slightly better than for the regular map implementations, but there is a higher overhead per map. For maps with less than about 25 entries, the regular map implementations need less memory.
mvstore_1066_p=\ If no file name is specified, the store operates purely in memory. Except for persisting data, all features are supported in this mode (multi-versioning, index lookup, R-tree and so on). If a file name is specified, all operations occur in memory (with the same performance characteristics) until data is persisted.
mvstore_1067_p=\ As in all map implementations, keys need to be immutable, meaning that changing the key object after an entry has been added is not allowed. If a file name is specified, the value may also not be changed after adding an entry, because it might be serialized (which could happen at any time when autocommit is enabled).
mvstore_1068_h3=Pluggable Data Types
mvstore_1069_p=\ Serialization is pluggable. The default serialization currently supports many common data types, and uses Java serialization for other objects. The following classes are currently directly supported\: <code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal, String, UUID, Date</code> and arrays (both primitive arrays and object arrays). For serialized objects, the size estimate is adjusted using an exponential moving average.
mvstore_1070_p=\ Parameterized data types are supported (for example one could build a string data type that limits the length).
mvstore_1071_p=\ The storage engine itself does not have any length limits, so that keys, values, pages, and chunks can be very big (as big as fits in memory). Also, there is no inherent limit to the number of maps and chunks. Due to using a log structured storage, there is no special case handling for large keys or pages.
mvstore_1072_h3=BLOB Support
mvstore_1073_p=\ There is a mechanism that stores large binary objects by splitting them into smaller blocks. This makes it possible to store objects that don't fit in memory. Streaming as well as random access reads on such objects are supported. This tool is written on top of the store, using only the map interface.
mvstore_1074_h3=R-Tree and Pluggable Map Implementations
mvstore_1075_p=\ The map implementation is pluggable. In addition to the default <code>MVMap</code> (multi-version map), there is a map that supports concurrent write operations, and a multi-version R-tree map implementation for spatial operations.
mvstore_1076_h3=Concurrent Operations and Caching
mvstore_1077_p=\ The default map implementation supports concurrent reads on old versions of the data. All such read operations can occur in parallel. Concurrent reads from the page cache, as well as concurrent reads from the file system are supported. Writing changes to the file can occur concurrently to modifying the data, as writing operates on a snapshot.
mvstore_1078_p=\ Caching is done on the page level. The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
mvstore_1079_p=\ The default map implementation does not support concurrent modification operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>). Similar to those classes, the map tries to detect concurrent modification.
mvstore_1080_p=\ With the <code>MVMapConcurrent</code> implementation, read operations even on the newest version can happen concurrently with all other operations, without risk of corruption. This comes with slightly reduced speed in single threaded mode, the same as with other <code>ConcurrentHashMap</code> implementations. Write operations first read the relevant area from disk to memory (this can happen concurrently), and only then modify the data. The in-memory part of write operations is synchronized.
mvstore_1081_p=\ For fully scalable concurrent write operations to a map (in-memory and to disk), the map could be split into multiple maps in different stores ('sharding'). The plan is to add such a mechanism later when needed.
mvstore_1082_h3=Log Structured Storage
mvstore_1083_p=\ Internally, changes are buffered in memory, and once enough changes have accumulated, they are written in one continuous disk write operation. Compared to traditional database storage engines, this should improve write performance for file systems and storage systems that do not efficiently support small random writes, such as Btrfs, as well as SSDs. (According to a test, write throughput of a common SSD increases with write block size, up to a block size of 2 MB, and then does not increase further.) By default, changes are automatically written once enough pages have been modified, and once every second in a background thread, even if only a little data was changed. Changes can also be written explicitly by calling <code>commit()</code>.
mvstore_1084_p=\ When storing, all changed pages are serialized, optionally compressed using the LZF algorithm, and written sequentially to a free area of the file. Each such change set is called a chunk. All parent pages of the changed B-trees are stored in this chunk as well, so that each chunk also contains the root of each changed map (which is the entry point for reading this version of the data). There is no separate index\: all data is stored as a list of pages. Per store, there is one additional map that contains the metadata (the list of maps, where the root page of each map is stored, and the list of chunks).
mvstore_1085_p=\ There are usually two write operations per chunk\: one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk). If the chunk is appended at the end of the file, the file header is only written at the end of the chunk. There is no transaction log, no undo log, and there are no in-place updates (however, unused chunks are overwritten by default).
mvstore_1086_p=\ Old data is kept for at least 45 seconds (configurable), so that there are no explicit sync operations required to guarantee data consistency. An application can also sync explicitly when needed. To reuse disk space, the chunks with the lowest amount of live data are compacted (the live data is stored again in the next chunk). To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
mvstore_1087_p=\ Compared to traditional storage engines (that use a transaction log, undo log, and main storage area), the log structured storage is simpler, more flexible, and typically needs fewer disk operations per change, as data is only written once instead of two or three times, and because the B-tree pages are always full (they are stored next to each other) and can be easily compressed. But temporarily, disk space usage might actually be a bit higher than for a regular database, as disk space is not immediately re-used (there are no in-place updates).
mvstore_1088_h3=Off-Heap and Pluggable Storage
mvstore_1089_p=\ Storage is pluggable. The default storage is to a single file (unless pure in-memory operation is used).
mvstore_1090_p=\ An off-heap storage implementation is available. This storage keeps the data in the off-heap memory, meaning outside of the regular garbage collected heap. This makes it possible to use very large in-memory stores without having to increase the JVM heap (which would increase Java garbage collection pauses a lot). Memory is allocated using <code>ByteBuffer.allocateDirect</code>. One chunk is allocated at a time (each chunk is usually a few MB large), so that allocation cost is low. To use the off-heap storage, call\:
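The call elided above is, in this sketch, a builder configured with an `OffHeapStore` (assuming the `org.h2.mvstore.OffHeapStore` class of this H2 version):

```java
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;
import org.h2.mvstore.OffHeapStore;

public class OffHeapExample {
    public static void main(String[] args) {
        // chunks are allocated with ByteBuffer.allocateDirect,
        // outside the garbage collected heap
        OffHeapStore offHeap = new OffHeapStore();
        MVStore s = new MVStore.Builder().
                fileStore(offHeap).
                open();
        MVMap<Integer, String> map = s.openMap("data");
        map.put(1, "Hello");
        System.out.println(map.get(1));
        s.close();
    }
}
```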
mvstore_1091_h3=File System Abstraction, File Locking and Online Backup
mvstore_1092_p=\ The file system is pluggable (the same file system abstraction is used as H2 uses). The file can be encrypted using an encrypting file system wrapper. Other file system implementations support reading from a compressed zip or jar file. The file system abstraction closely matches the Java 7 file system API.
mvstore_1093_p=\ Each store may only be opened once within a JVM. When opening a store, the file is locked in exclusive mode, so that the file can only be changed from within one process. Files can be opened in read-only mode, in which case a shared lock is used.
mvstore_1094_p=\ The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse first needs to be disabled, so that new data is always appended at the end of the file. Then, the file can be copied (the file handle is available to the application). It is recommended to use the utility class <code>FileChannelInputStream</code> to do this. For encrypted databases, both the encrypted (raw) file content, as well as the clear text content, can be backed up.
mvstore_1095_h3=Encrypted Files
mvstore_1096_p=\ File encryption ensures the data can only be read with the correct password. Data can be encrypted as follows\:
mvstore_1097_p=\ The following algorithms and settings are used\:
mvstore_1098_li=The password char array is cleared after use, to reduce the risk that the password is stolen even if the attacker has access to the main memory.
mvstore_1099_li=The password is hashed according to the PBKDF2 standard, using the SHA-256 hash algorithm.
mvstore_1100_li=The length of the salt is 64 bits, so that an attacker can not use a pre-calculated password hash table (rainbow table). It is generated using a cryptographically secure random number generator.
mvstore_1101_li=To speed up opening an encrypted store on Android, the number of PBKDF2 iterations is 10. The higher the value, the better the protection against brute-force password cracking attacks, but the slower opening a file becomes.
mvstore_1102_li=The file itself is encrypted using the standardized disk encryption mode XTS-AES. Only slightly more than one AES-128 round per block is needed.
mvstore_1103_h3=Tools
mvstore_1104_p=\ There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
mvstore_1105_h3=Exception Handling
mvstore_1106_p=\ This tool does not throw checked exceptions. Instead, unchecked exceptions are thrown if needed. The error message always contains the version of the tool. The following exceptions can occur\:
mvstore_1107_code=IllegalStateException
mvstore_1108_li=\ if a map was already closed or an IO exception occurred, for example if the file was locked or already closed, could not be opened or closed, if reading or writing failed, if the file is corrupt, or if there is an internal error in the tool. For such exceptions, an error code is added to the exception so that the application can distinguish between different error cases.
mvstore_1109_code=IllegalArgumentException
mvstore_1110_li=\ if a method was called with an illegal argument.
mvstore_1111_code=UnsupportedOperationException
mvstore_1112_li=\ if a method was called that is not supported, for example trying to modify a read-only map.
mvstore_1113_code=ConcurrentModificationException
mvstore_1114_li=\ if a map is modified concurrently.
mvstore_1115_h3=Storage Engine for H2
mvstore_1116_p=\ The plan is to use the MVStore as the default storage engine for the H2 database in the future (supporting SQL, JDBC, transactions, MVCC, and so on). This is work in progress. To try it out, append <code>;MV_STORE\=TRUE</code> to the database URL. In general, performance should be similar to that of the current default storage engine (the page store). Even though it can be used with the default table level locking, it is recommended to use it together with the MVCC mode (to do that, append <code>;MVCC\=TRUE</code> to the database URL).
mvstore_1117_h2=File Format
mvstore_1118_p=\ The data is stored in one file. The file contains two file headers (for safety), and a number of chunks. The file headers are one block each; a block is 4096 bytes. Each chunk is at least one block, but typically 200 blocks or more. Data is stored in the chunks in the form of a <a href\="https\://en.wikipedia.org/wiki/Log-structured_file_system">log structured storage</a>. There is one chunk for every version.
mvstore_1119_p=\ Each chunk contains a number of B-tree pages. As an example, the following code\:
mvstore_1120_p=\ will result in the following two chunks (excluding metadata)\:
mvstore_1121_b=Chunk 1\:
mvstore_1122_p=\ - Page 1\: (root) node with 2 entries pointing to page 2 and 3
mvstore_1123_p=\ - Page 2\: leaf with 140 entries (keys 0 - 139)
mvstore_1124_p=\ - Page 3\: leaf with 260 entries (keys 140 - 399)
mvstore_1125_b=Chunk 2\:
mvstore_1126_p=\ - Page 4\: (root) node with 2 entries pointing to page 3 and 5
mvstore_1127_p=\ - Page 5\: leaf with 140 entries (keys 0 - 139)
mvstore_1128_p=\ That means each chunk contains the changes of one version\: the new version of the changed pages and the parent pages, recursively, up to the root page. Pages in subsequent chunks refer to pages in earlier chunks.
mvstore_1129_h3=File Header
mvstore_1130_p=\ There are two file headers, which normally contain the exact same data. But once in a while, the file headers are updated, and writing could partially fail, which could corrupt a header. That's why there is a second header. Only the file headers are updated in this way (called "in-place update"). The headers contain the following data\:
mvstore_1131_p=\ The data is stored in the form of a key-value pair. Each value is stored as a hexadecimal number. The entries are\:
mvstore_1132_li=H\: The entry "H\:2" stands for the H2 database.
mvstore_1133_li=block\: The block number where one of the newest chunks starts (but not necessarily the newest).
mvstore_1134_li=blockSize\: The block size of the file; currently always hex 1000, which is decimal 4096, to match the <a href\="https\://en.wikipedia.org/wiki/Disk_sector">disk sector</a> length of modern hard disks.
mvstore_1135_li=chunk\: The chunk id, which is normally the same value as the version; however, the chunk id might roll over to 0, while the version doesn't.
mvstore_1136_li=created\: The number of milliseconds since 1970 when the file was created.
mvstore_1137_li=format\: The file format number. Currently 1.
mvstore_1138_li=version\: The version number of the chunk.
mvstore_1139_li=fletcher\: The <a href\="https\://en.wikipedia.org/wiki/Fletcher's_checksum"> Fletcher-32 checksum</a> of the header.
mvstore_1140_p=\ When opening the file, both headers are read and the checksum is verified. If both headers are valid, the one with the newer version is used. The chunk with the latest version is then detected (see below for details), and the rest of the metadata is read from there. If the chunk id, block and version are not stored in the file header, then the latest chunk lookup starts with the last chunk in the file.
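To illustrate the key-value header layout, here is a small parser sketch. The header string and its values are hypothetical examples, not taken from a real file, and escaping (which H2's own parser handles) is omitted:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileHeaderParser {

    // Split "key:value,key:value,..." into an ordered map.
    // Values stay as strings here; numeric values are hexadecimal.
    static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : header.trim().split(",")) {
            int i = pair.indexOf(':');
            fields.put(pair.substring(0, i), pair.substring(i + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // a hypothetical header in the format described above
        String header = "H:2,block:2,blockSize:1000,chunk:7,format:1,version:7";
        Map<String, String> h = parse(header);
        // blockSize is hex 1000, i.e. 4096 bytes
        System.out.println(Integer.parseInt(h.get("blockSize"), 16)); // 4096
        System.out.println(h.get("H")); // 2
    }
}
```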
mvstore_1141_h3=Chunk Format
mvstore_1142_p=\ There is one chunk per version. Each chunk consists of a header, the pages that were modified in this version, and a footer. The pages contain the actual data of the maps. The pages inside a chunk are stored right after the header, next to each other (unaligned). The size of a chunk is a multiple of the block size. The footer is stored in the last 128 bytes of the chunk.
mvstore_1143_p=\ The footer makes it possible to verify that the chunk is completely written (a chunk is written as one write operation), and to find the start position of the very last chunk in the file. The chunk header and footer contain the following data\:
mvstore_1144_p=\ The fields of the chunk header and footer are\:
mvstore_1145_li=chunk\: The chunk id.
mvstore_1146_li=block\: The first block of the chunk (multiply by the block size to get the position in the file).
mvstore_1147_li=len\: The size of the chunk in number of blocks.
mvstore_1148_li=map\: The id of the newest map; incremented when a new map is created.
mvstore_1149_li=max\: The sum of all maximum page sizes (see page format).
mvstore_1150_li=next\: The predicted start block of the next chunk.
mvstore_1151_li=pages\: The number of pages in the chunk.
mvstore_1152_li=root\: The position of the metadata root page (see page format).
mvstore_1153_li=time\: The time the chunk was written, in milliseconds after the file was created.
mvstore_1154_li=version\: The version this chunk represents.
mvstore_1155_li=fletcher\: The checksum of the footer.
mvstore_1156_p=\ Chunks are never updated in-place. Each chunk contains the pages that were changed in that version (there is one chunk per version, see above), plus all the parent nodes of those pages, recursively, up to the root page. If an entry in a map is changed, removed, or added, then the respective page is copied to be stored in the next chunk, and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the <a href\="https\://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk for some time (until a small chunk is written or the chunks are compacted). There is a <a href\="http\://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed"> delay of 45 seconds</a> (by default) before a free chunk is overwritten, to ensure new versions are persisted first.
mvstore_1157_p=\ How the newest chunk is located when opening a store\: The file header contains the position of a recent chunk, but not always the newest one. This is to reduce the number of file header updates. After opening the file, the file headers and the chunk footer of the very last chunk (at the end of the file) are read. From those candidates, the header of the most recent chunk is read. If it contains a "next" pointer (see above), that chunk's header and footer are read as well. If it turns out to be a newer valid chunk, this is repeated until the newest chunk is found. Before writing a chunk, the position of the next chunk is predicted based on the assumption that the next chunk will be of the same size as the current one. When the next chunk is written and the previous prediction turns out to have been incorrect, the file header is updated as well. In any case, the file header is updated if the next chain gets longer than 20 hops.
mvstore_1158_h3=Page Format
mvstore_1159_p=\ Each map is a <a href\="https\://en.wikipedia.org/wiki/B-tree">B-tree</a>, and the map data is stored in (B-tree-) pages. There are leaf pages that contain the key-value pairs of the map, and internal nodes, which only contain keys and pointers to leaf pages. The root of a tree is either a leaf or an internal node. Unlike file header and chunk header and footer, the page data is not human readable. Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), and <a href\="https\://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a> (1 to 5 / 10 bytes). The page format is\:
mvstore_1160_li=length (int)\: Length of the page in bytes.
mvstore_1161_li=checksum (short)\: Checksum (chunk id xor offset within the chunk xor page length).
mvstore_1162_li=mapId (variable size int)\: The id of the map this page belongs to.
mvstore_1163_li=len (variable size int)\: The number of keys in the page.
mvstore_1164_li=type (byte)\: The page type (0 for leaf page, 1 for internal node; plus 2 if the keys and values are compressed).
mvstore_1165_li=children (array of long; internal nodes only)\: The position of the children.
mvstore_1166_li=childCounts (array of variable size long; internal nodes only)\: The total number of entries for the given child page.
mvstore_1167_li=keys (byte array)\: All keys, stored depending on the data type.
mvstore_1168_li=values (byte array; leaf pages only)\: All values, stored depending on the data type.
mvstore_1169_p=\ Even though this is not required by the file format, pages are stored in the following order\: For each map, the root page is stored first, then the internal nodes (if there are any), and then the leaf pages. This should speed up reads for media where sequential reads are faster than random access reads. The metadata map is stored at the end of a chunk.
mvstore_1170_p=\ Pointers to pages are stored as a long, using a special format\: 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code, 1 bit for the page type (leaf or internal node). The page type is encoded so that when clearing or removing a map, leaf pages don't have to be read (internal nodes do have to be read in order to know where all the pages are; but in a typical B-tree the vast majority of the pages are leaf pages). The absolute file position is not included so that chunks can be moved within the file without having to change page pointers; only the chunk metadata needs to be changed. The length code is a number from 0 to 31, where 0 means the maximum length of the page is 32 bytes, 1 means 48 bytes, 2\: 64, 3\: 96, 4\: 128, 5\: 192, and so on until 31 which means longer than 1 MB. That way, reading a page only requires one read operation (except for very large pages). The sum of the maximum length of all pages is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted. This makes it possible to estimate the amount of free space within a block, in addition to the number of free pages.
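The bit layout described above can be sketched in a few lines. The bit order, with the chunk id in the highest bits and the page type in the lowest bit, is an assumption matching the field order given in the text:

```java
public class PagePos {

    // Encode a page pointer: 26 bits chunk id, 32 bits offset within
    // the chunk, 5 bits length code, 1 bit page type (0 leaf, 1 node).
    static long encode(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38)
                | ((offset & 0xffffffffL) << 6)
                | ((long) lengthCode << 1)
                | type;
    }

    static int chunkId(long pos)    { return (int) (pos >>> 38); }
    static int offset(long pos)     { return (int) (pos >>> 6); }
    static int lengthCode(long pos) { return (int) ((pos >>> 1) & 31); }
    static int type(long pos)       { return (int) (pos & 1); }

    // Maximum page length for a length code: 32, 48, 64, 96, 128, 192, ...
    static int maxLength(int code) {
        return (code & 1) == 0 ? 32 << (code / 2) : 48 << (code / 2);
    }

    public static void main(String[] args) {
        long pos = encode(3, 8192, 4, 1);
        System.out.println(chunkId(pos));   // 3
        System.out.println(offset(pos));    // 8192
        System.out.println(maxLength(4));   // 128
    }
}
```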
mvstore_1171_p=\ The total number of entries in child nodes are kept to allow efficient range counting, lookup by index, and skip operations. The pages form a <a href\="http\://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
mvstore_1172_p=\ Data compression\: The data after the page type are optionally compressed using the LZF algorithm.
mvstore_1173_h3=Metadata Map
mvstore_1174_p=\ In addition to the user maps, there is one metadata map that contains names and positions of user maps, and chunk metadata. The very last page of a chunk contains the root page of that metadata map. The exact position of this root page is stored in the chunk header. This page (directly or indirectly) points to the root pages of all other maps. The metadata map of a store with a map named "data", and one chunk, contains the following entries\:
mvstore_1175_li=chunk.1\: The metadata of chunk 1. This is the same data as the chunk header, plus the number of live pages, and the maximum live length.
mvstore_1176_li=map.1\: The metadata of map 1. The entries are\: name, createVersion, and type.
mvstore_1177_li=name.data\: The map id of the map named "data". The value is "1".
mvstore_1178_li=root.1\: The root position of map 1.
mvstore_1179_li=setting.storeVersion\: The store version (a user defined value).
mvstore_1180_h2=Similar Projects and Differences to Other Storage Engines
mvstore_1181_p=\ Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java and can easily be embedded in a Java and Android application.
mvstore_1182_p=\ The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, and is also a log structured storage, but the H2 license is more liberal.
mvstore_1183_p=\ Like SQLite 3, the MVStore keeps all data in one file. Unlike SQLite 3, the MVStore uses a log structured storage. The plan is to make the MVStore both easier to use and faster than SQLite 3. In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.
mvstore_1184_p=\ The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek, and some code is shared between MVStore and MapDB. However, unlike MapDB, the MVStore uses a log structured storage. The MVStore does not have a record size limit.
mvstore_1185_h2=Current State
mvstore_1186_p=\ The code is still experimental at this stage. The API as well as the behavior may partially change. Features may be added and removed (even though the main features will stay).
mvstore_1187_h2=Requirements
mvstore_1188_p=\ The MVStore is included in the latest H2 jar file.
mvstore_1189_p=\ There are no special requirements to use it. The MVStore should run on any JVM as well as on Android.
mvstore_1190_p=\ To build just the MVStore (without the database engine), run\:
mvstore_1191_p=\ This will create the file <code>bin/h2mvstore-1.3.175.jar</code> (about 130 KB).
mvstore_1043_li=compress\: compress the data when storing using a fast algorithm (LZF).
mvstore_1044_li=compressHigh\: compress the data when storing using a slower algorithm (Deflate).
mvstore_1045_li=encryptionKey\: the encryption key for file encryption.
mvstore_1046_li=fileName\: the name of the file, for file based stores.
mvstore_1047_li=fileStore\: the storage implementation to use.
mvstore_1048_li=pageSplitSize\: the point where pages are split.
mvstore_1049_li=readOnly\: open the file in read-only mode.
mvstore_1050_h2=R-Tree
mvstore_1051_p=\ The <code>MVRTreeMap</code> is an R-tree implementation that supports fast spatial queries. It can be used as follows\:
mvstore_1052_p=\ The default number of dimensions is 2. To use a different number of dimensions, call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>. The minimum number of dimensions is 1, the maximum is 255.
mvstore_1053_h2=Features
mvstore_1054_h3=Maps
mvstore_1055_p=\ Each store contains a set of named maps. A map is sorted by key, and supports the common lookup operations, including access to the first and last key, iteration over some or all keys, and so on.
mvstore_1056_p=\ Also supported, and very uncommon for maps, is fast index lookup\: the entries of the map can be efficiently accessed like a random-access list (get the entry at the given index), and the index of a key can be calculated efficiently. That also means getting the median of two keys is very fast, and a range of keys can be counted very quickly. The iterator supports fast skipping. This is possible because internally, each map is organized in the form of a counted B+-tree.
mvstore_1057_p=\ In database terms, a map can be used like a table, where the key of the map is the primary key of the table, and the value is the row. A map can also represent an index, where the key of the map is the key of the index, and the value of the map is the primary key of the table (for non-unique indexes, the key of the map must also contain the primary key).
mvstore_1058_h3=Versions
mvstore_1059_p=\ A version is a snapshot of all the data of all maps at a given point in time. Creating a snapshot is fast\: only those pages that are changed after a snapshot are copied. This behavior is also called COW (copy on write). Rollback to an old version is supported. Old versions are readable until old data is purged.
mvstore_1060_p=\ The following sample code shows how to create a store, open a map, add some data, and access the current and an old version\:
mvstore_1061_h3=Transactions
mvstore_1062_p=\ To support multiple concurrent open transactions, a transaction utility is included, the <code>TransactionStore</code>. The tool supports PostgreSQL style "read committed" transaction isolation with savepoints, two-phase commit, and other features typically available in a database. There is no limit on the size of a transaction (the log is written to disk for large or long running transactions).
mvstore_1063_p=\ Internally, this utility stores the old versions of changed entries in a separate map, similar to a transaction log (except that entries of a closed transaction are removed, and the log is usually not stored for short transactions). For common use cases, the storage overhead of this utility is very small compared to the overhead of a regular transaction log.
mvstore_1064_h3=In-Memory Performance and Usage
mvstore_1065_p=\ Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>, but usually slower than <code>java.util.HashMap</code>.
mvstore_1066_p=\ The memory overhead for large maps is slightly better than for the regular map implementations, but there is a higher overhead per map. For maps with less than about 25 entries, the regular map implementations need less memory.
mvstore_1067_p=\ If no file name is specified, the store operates purely in memory. Except for persisting data, all features are supported in this mode (multi-versioning, index lookup, R-tree and so on). If a file name is specified, all operations occur in memory (with the same performance characteristics) until data is persisted.
mvstore_1068_p=\ As in all map implementations, keys need to be immutable, meaning that changing the key object after an entry has been added is not allowed. If a file name is specified, the value may also not be changed after adding an entry, because it might be serialized (which could happen at any time when autocommit is enabled).
mvstore_1069_h3=Pluggable Data Types
mvstore_1070_p=\ Serialization is pluggable. The default serialization currently supports many common data types, and uses Java serialization for other objects. The following classes are currently directly supported\: <code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal, String, UUID, Date</code> and arrays (both primitive arrays and object arrays). For serialized objects, the size estimate is adjusted using an exponential moving average.
mvstore_1071_p=\ Parameterized data types are supported (for example one could build a string data type that limits the length).
mvstore_1072_p=\ The storage engine itself does not have any length limits, so that keys, values, pages, and chunks can be very big (as big as fits in memory). Also, there is no inherent limit to the number of maps and chunks. Due to using a log structured storage, there is no special case handling for large keys or pages.
mvstore_1073_h3=BLOB Support
mvstore_1074_p=\ There is a mechanism that stores large binary objects by splitting them into smaller blocks. This makes it possible to store objects that don't fit in memory. Streaming as well as random access reads on such objects are supported. This tool is written on top of the store, using only the map interface.
mvstore_1075_h3=R-Tree and Pluggable Map Implementations
mvstore_1076_p=\ The map implementation is pluggable. In addition to the default <code>MVMap</code> (multi-version map), there is a map that supports concurrent write operations, and a multi-version R-tree map implementation for spatial operations.
mvstore_1077_h3=Concurrent Operations and Caching
mvstore_1078_p=\ The default map implementation supports concurrent reads on old versions of the data. All such read operations can occur in parallel. Concurrent reads from the page cache, as well as concurrent reads from the file system, are supported. Writing changes to the file can occur concurrently with modifying the data, as writing operates on a snapshot.
mvstore_1079_p=\ Caching is done on the page level. The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
mvstore_1080_p=\ The default map implementation does not support concurrent modification operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>). Similar to those classes, the map tries to detect concurrent modification.
mvstore_1081_p=\ With the <code>MVMapConcurrent</code> implementation, read operations even on the newest version can happen concurrently with all other operations, without risk of corruption. This comes with slightly reduced speed in single threaded mode, the same as with other <code>ConcurrentHashMap</code> implementations. Write operations first read the relevant area from disk to memory (this can happen concurrently), and only then modify the data. The in-memory part of write operations is synchronized.
mvstore_1082_p=\ For fully scalable concurrent write operations to a map (in-memory and to disk), the map could be split into multiple maps in different stores ('sharding'). The plan is to add such a mechanism later when needed.
mvstore_1083_h3=Log Structured Storage
mvstore_1084_p=\ Internally, changes are buffered in memory, and once enough changes have accumulated, they are written in one continuous disk write operation. Compared to traditional database storage engines, this should improve write performance for file systems and storage systems that do not efficiently support small random writes, such as Btrfs, as well as SSDs. (According to a test, write throughput of a common SSD increases with the write block size up to a block size of 2 MB, and then does not increase further.) By default, changes are automatically written when more than a certain number of pages have been modified, and once every second in a background thread, even if only a little data was changed. Changes can also be written explicitly by calling <code>commit()</code>.
mvstore_1085_p=\ When storing, all changed pages are serialized, optionally compressed using the LZF or Deflate algorithm, and written sequentially to a free area of the file. Each such change set is called a chunk. All parent pages of the changed B-trees are stored in this chunk as well, so that each chunk also contains the root of each changed map (which is the entry point for reading this version of the data). There is no separate index\: all data is stored as a list of pages. Per store, there is one additional map that contains the metadata (the list of maps, where the root page of each map is stored, and the list of chunks).
mvstore_1086_p=\ There are usually two write operations per chunk\: one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk). If the chunk is appended at the end of the file, the file header is only written at the end of the chunk. There is no transaction log, no undo log, and there are no in-place updates (however, unused chunks are overwritten by default).
mvstore_1087_p=\ Old data is kept for at least 45 seconds (configurable), so that there are no explicit sync operations required to guarantee data consistency. An application can also sync explicitly when needed. To reuse disk space, the chunks with the lowest amount of live data are compacted (the live data is stored again in the next chunk). To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
mvstore_1088_p=\ Compared to traditional storage engines (that use a transaction log, undo log, and main storage area), the log structured storage is simpler, more flexible, and typically needs fewer disk operations per change, as data is only written once instead of two or three times, and because the B-tree pages are always full (they are stored next to each other) and can be easily compressed. But temporarily, disk space usage might actually be a bit higher than for a regular database, as disk space is not immediately re-used (there are no in-place updates).
mvstore_1089_h3=Off-Heap and Pluggable Storage
mvstore_1090_p=\ Storage is pluggable. The default storage is to a single file (unless pure in-memory operation is used).
mvstore_1091_p=\ An off-heap storage implementation is available. This storage keeps the data in the off-heap memory, meaning outside of the regular garbage collected heap. This makes it possible to use very large in-memory stores without having to increase the JVM heap (which would greatly increase Java garbage collection pauses). Memory is allocated using <code>ByteBuffer.allocateDirect</code>. One chunk is allocated at a time (each chunk is usually a few MB in size), so the allocation cost is low. To use the off-heap storage, call\:
mvstore_1092_h3=File System Abstraction, File Locking and Online Backup
mvstore_1093_p=\ The file system is pluggable (the same file system abstraction that H2 uses). The file can be encrypted using an encrypting file system wrapper. Other file system implementations support reading from a compressed zip or jar file. The file system abstraction closely matches the Java 7 file system API.
mvstore_1094_p=\ Each store may only be opened once within a JVM. When opening a store, the file is locked in exclusive mode, so that the file can only be changed from within one process. Files can be opened in read-only mode, in which case a shared lock is used.
mvstore_1095_p=\ The persisted data can be backed up at any time, even during write operations (online backup). To do that, automatic disk space reuse first needs to be disabled, so that new data is always appended at the end of the file. Then, the file can be copied (the file handle is available to the application). It is recommended to use the utility class <code>FileChannelInputStream</code> to do this. For encrypted databases, both the encrypted (raw) file content, as well as the clear text content, can be backed up.
mvstore_1096_h3=Encrypted Files
mvstore_1097_p=\ File encryption ensures the data can only be read with the correct password. Data can be encrypted as follows\:
mvstore_1098_p=\ The following algorithms and settings are used\:
mvstore_1099_li=The password char array is cleared after use, to reduce the risk that the password is stolen even if the attacker has access to the main memory.
mvstore_1100_li=The password is hashed according to the PBKDF2 standard, using the SHA-256 hash algorithm.
mvstore_1101_li=The length of the salt is 64 bits, so that an attacker cannot use a pre-calculated password hash table (rainbow table). It is generated using a cryptographically secure random number generator.
mvstore_1102_li=To speed up opening encrypted stores on Android, the number of PBKDF2 iterations is 10. The higher the value, the better the protection against brute-force password cracking attacks, but the slower opening a file becomes.
mvstore_1103_li=The file itself is encrypted using the standardized disk encryption mode XTS-AES. Only slightly more than one AES-128 round per block is needed.
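The key derivation settings listed above (PBKDF2 with SHA-256, a 64-bit random salt, 10 iterations, and clearing the password after use) can be sketched with the standard javax.crypto API; this is an illustrative sketch only, not MVStore's actual implementation, which uses its own crypto code:

```java
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

public class KeyDerivationSketch {

    // Derive a 256-bit file key: PBKDF2 with HmacSHA256,
    // 64-bit salt, 10 iterations (as described above).
    public static byte[] deriveKey(char[] password, byte[] salt) {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 10, 256);
        try {
            SecretKeyFactory f =
                    SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            return f.generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new RuntimeException(e);
        } finally {
            spec.clearPassword();               // clear the password after use
            Arrays.fill(password, (char) 0);    // also clear the caller's copy
        }
    }

    // Generate a 64-bit salt using a cryptographically secure RNG.
    public static byte[] newSalt() {
        byte[] salt = new byte[8];
        new SecureRandom().nextBytes(salt);
        return salt;
    }
}
```

The derived key would then feed the XTS-AES file encryption layer; that part is omitted here.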
mvstore_1104_h3=Tools
mvstore_1105_p=\ There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
mvstore_1106_h3=Exception Handling
mvstore_1107_p=\ This tool does not throw checked exceptions. Instead, unchecked exceptions are thrown if needed. The error message always contains the version of the tool. The following exceptions can occur\:
mvstore_1108_code=IllegalStateException
mvstore_1109_li=\ if a map was already closed or an IO exception occurred, for example if the file was locked, is already closed, could not be opened or closed, if reading or writing failed, if the file is corrupt, or if there is an internal error in the tool. For such exceptions, an error code is added to the exception so that the application can distinguish between different error cases.
mvstore_1110_code=IllegalArgumentException
mvstore_1111_li=\ if a method was called with an illegal argument.
mvstore_1112_code=UnsupportedOperationException
mvstore_1113_li=\ if a method was called that is not supported, for example trying to modify a read-only map.
mvstore_1114_code=ConcurrentModificationException
mvstore_1115_li=\ if a map is modified concurrently.
mvstore_1116_h3=Storage Engine for H2
mvstore_1117_p=\ The plan is to use the MVStore as the default storage engine for the H2 database in the future (supporting SQL, JDBC, transactions, MVCC, and so on). This is work in progress. To try it out, append <code>;MV_STORE\=TRUE</code> to the database URL. In general, performance should be similar to that of the current default storage engine (the page store). Even though it can be used with the default table level locking, it is recommended to use it together with the MVCC mode (to do that, append <code>;MVCC\=TRUE</code> to the database URL).
mvstore_1118_h2=File Format
mvstore_1119_p=\ The data is stored in one file. The file contains two file headers (for safety), and a number of chunks. The file headers are one block each; a block is 4096 bytes. Each chunk is at least one block, but typically 200 blocks or more. Data is stored in the chunks in the form of a <a href\="https\://en.wikipedia.org/wiki/Log-structured_file_system">log structured storage</a>. There is one chunk for every version.
mvstore_1120_p=\ Each chunk contains a number of B-tree pages. As an example, the following code\:
mvstore_1121_p=\ will result in the following two chunks (excluding metadata)\:
mvstore_1122_b=Chunk 1\:
mvstore_1123_p=\ - Page 1\: (root) node with 2 entries pointing to page 2 and 3
mvstore_1124_p=\ - Page 2\: leaf with 140 entries (keys 0 - 139)
mvstore_1125_p=\ - Page 3\: leaf with 260 entries (keys 140 - 399)
mvstore_1126_b=Chunk 2\:
mvstore_1127_p=\ - Page 4\: (root) node with 2 entries pointing to page 3 and 5
mvstore_1128_p=\ - Page 5\: leaf with 140 entries (keys 0 - 139)
mvstore_1129_p=\ That means each chunk contains the changes of one version\: the new version of the changed pages and the parent pages, recursively, up to the root page. Pages in subsequent chunks refer to pages in earlier chunks.
mvstore_1130_h3=File Header
mvstore_1131_p=\ There are two file headers, which normally contain the exact same data. But once in a while, the file headers are updated, and writing could partially fail, which could corrupt a header. That's why there is a second header. Only the file headers are updated in this way (called "in-place update"). The headers contain the following data\:
mvstore_1132_p=\ The data is stored in the form of a key-value pair. Each value is stored as a hexadecimal number. The entries are\:
mvstore_1133_li=H\: The entry "H\:2" stands for the H2 database.
mvstore_1134_li=block\: The block number where one of the newest chunks starts (but not necessarily the newest).
mvstore_1135_li=blockSize\: The block size of the file; currently always hex 1000, which is decimal 4096, to match the <a href\="https\://en.wikipedia.org/wiki/Disk_sector">disk sector</a> length of modern hard disks.
mvstore_1136_li=chunk\: The chunk id, which is normally the same value as the version; however, the chunk id might roll over to 0, while the version doesn't.
mvstore_1137_li=created\: The number of milliseconds since 1970 when the file was created.
mvstore_1138_li=format\: The file format number. Currently 1.
mvstore_1139_li=version\: The version number of the chunk.
mvstore_1140_li=fletcher\: The <a href\="https\://en.wikipedia.org/wiki/Fletcher's_checksum"> Fletcher-32 checksum</a> of the header.
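As an illustration of the human-readable header format described above, a toy parser for the comma-separated "key:value" pairs with hexadecimal values (the real code uses DataUtils.parseMap, which additionally handles quoted values; this sketch ignores quoting):

```java
import java.util.HashMap;
import java.util.Map;

public class HeaderSketch {

    // Parse a header line of the form "key:value,key:value,...".
    // Quoted values (which may contain commas) are NOT handled here.
    public static Map<String, String> parse(String header) {
        Map<String, String> map = new HashMap<>();
        for (String pair : header.split(",")) {
            int idx = pair.indexOf(':');
            map.put(pair.substring(0, idx), pair.substring(idx + 1));
        }
        return map;
    }

    // Values are stored as hexadecimal numbers.
    public static long hex(Map<String, String> map, String key) {
        return Long.parseLong(map.get(key), 16);
    }
}
```

For example, the documented blockSize entry "blockSize:1000" decodes to the decimal block size 4096.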
mvstore_1141_p=\ When opening the file, both headers are read and the checksum is verified. If both headers are valid, the one with the newer version is used. The chunk with the latest version is then detected (see below for details), and the rest of the metadata is read from there. If the chunk id, block and version are not stored in the file header, then the latest chunk lookup starts with the last chunk in the file.
mvstore_1142_h3=Chunk Format
mvstore_1143_p=\ There is one chunk per version. Each chunk consists of a header, the pages that were modified in this version, and a footer. The pages contain the actual data of the maps. The pages inside a chunk are stored right after the header, next to each other (unaligned). The size of a chunk is a multiple of the block size. The footer is stored in the last 128 bytes of the chunk.
mvstore_1144_p=\ The footer makes it possible to verify that the chunk was completely written (a chunk is written as one write operation), and to find the start position of the very last chunk in the file. The chunk header and footer contain the following data\:
mvstore_1145_p=\ The fields of the chunk header and footer are\:
mvstore_1146_li=chunk\: The chunk id.
mvstore_1147_li=block\: The first block of the chunk (multiply by the block size to get the position in the file).
mvstore_1148_li=len\: The size of the chunk in number of blocks.
mvstore_1149_li=map\: The id of the newest map; incremented when a new map is created.
mvstore_1150_li=max\: The sum of all maximum page sizes (see page format).
mvstore_1151_li=next\: The predicted start block of the next chunk.
mvstore_1152_li=pages\: The number of pages in the chunk.
mvstore_1153_li=root\: The position of the metadata root page (see page format).
mvstore_1154_li=time\: The time the chunk was written, in milliseconds after the file was created.
mvstore_1155_li=version\: The version this chunk represents.
mvstore_1156_li=fletcher\: The checksum of the footer.
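The Fletcher-32 variant used for these header and footer checksums (as updated in this commit to treat odd-length input as if a 0 byte were appended) mirrors DataUtils.getFletcher32 in the diff below:

```java
public class FletcherSketch {

    // Fletcher-32 over big-endian 16-bit words, with periodic
    // reduction; an odd trailing byte is padded with a 0 byte.
    public static int getFletcher32(byte[] bytes, int length) {
        int s1 = 0xffff, s2 = 0xffff;
        int i = 0, evenLength = length / 2 * 2;
        while (i < evenLength) {
            // reduce after 360 words (each word is two bytes)
            for (int end = Math.min(i + 720, evenLength); i < end;) {
                int x = ((bytes[i++] & 0xff) << 8) | (bytes[i++] & 0xff);
                s2 += s1 += x;
            }
            s1 = (s1 & 0xffff) + (s1 >>> 16);
            s2 = (s2 & 0xffff) + (s2 >>> 16);
        }
        if (i < length) {
            // odd length: append 0
            int x = (bytes[i] & 0xff) << 8;
            s2 += s1 += x;
        }
        s1 = (s1 & 0xffff) + (s1 >>> 16);
        s2 = (s2 & 0xffff) + (s2 >>> 16);
        return (s2 << 16) | s1;
    }
}
```

Because of the zero padding, an odd-length message produces the same checksum as the same message with an explicit trailing 0 byte, which is why the callers no longer need to truncate to an even length.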
mvstore_1157_p=\ Chunks are never updated in-place. Each chunk contains the pages that were changed in that version (there is one chunk per version, see above), plus all the parent nodes of those pages, recursively, up to the root page. If an entry in a map is changed, removed, or added, then the respective page is copied to be stored in the next chunk, and the number of live pages in the old chunk is decremented. This mechanism is called copy-on-write, and is similar to how the <a href\="https\://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works. Chunks without live pages are marked as free, so the space can be re-used by more recent chunks. Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk for some time (until a small chunk is written or the chunks are compacted). There is a <a href\="http\://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed"> delay of 45 seconds</a> (by default) before a free chunk is overwritten, to ensure new versions are persisted first.
mvstore_1158_p=\ How the newest chunk is located when opening a store\: The file header contains the position of a recent chunk, but not always the newest one. This is to reduce the number of file header updates. After opening the file, the file headers, and the chunk footer of the very last chunk (at the end of the file) are read. From those candidates, the header of the most recent chunk is read. If it contains a "next" pointer (see above), that chunk's header and footer are read as well. If it turns out to be a newer valid chunk, this is repeated until the newest chunk is found. Before writing a chunk, the position of the next chunk is predicted based on the assumption that the next chunk will be of the same size as the current one. When the next chunk is written and the previous prediction turns out to be incorrect, the file header is updated as well. In any case, the file header is updated if the next chain gets longer than 20 hops.
mvstore_1159_h3=Page Format
mvstore_1160_p=\ Each map is a <a href\="https\://en.wikipedia.org/wiki/B-tree">B-tree</a>, and the map data is stored in (B-tree-) pages. There are leaf pages that contain the key-value pairs of the map, and internal nodes, which only contain keys and pointers to leaf pages. The root of a tree is either a leaf or an internal node. Unlike file header and chunk header and footer, the page data is not human readable. Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes), and <a href\="https\://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a> (1 to 5 / 10 bytes). The page format is\:
mvstore_1161_li=length (int)\: Length of the page in bytes.
mvstore_1162_li=checksum (short)\: Checksum (chunk id xor offset within the chunk xor page length).
mvstore_1163_li=mapId (variable size int)\: The id of the map this page belongs to.
mvstore_1164_li=len (variable size int)\: The number of keys in the page.
mvstore_1165_li=type (byte)\: The page type (0 for leaf page, 1 for internal node; plus 2 if the keys and values are compressed).
mvstore_1166_li=children (array of long; internal nodes only)\: The position of the children.
mvstore_1167_li=childCounts (array of variable size long; internal nodes only)\: The total number of entries for the given child page.
mvstore_1168_li=keys (byte array)\: All keys, stored depending on the data type.
mvstore_1169_li=values (byte array; leaf pages only)\: All values, stored depending on the data type.
mvstore_1170_p=\ Even though this is not required by the file format, pages are stored in the following order\: For each map, the root page is stored first, then the internal nodes (if there are any), and then the leaf pages. This should speed up reads for media where sequential reads are faster than random access reads. The metadata map is stored at the end of a chunk.
mvstore_1171_p=\ Pointers to pages are stored as a long, using a special format\: 26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code, 1 bit for the page type (leaf or internal node). The page type is encoded so that when clearing or removing a map, leaf pages don't have to be read (internal nodes do have to be read in order to know where all the pages are; but in a typical B-tree the vast majority of the pages are leaf pages). The absolute file position is not included so that chunks can be moved within the file without having to change page pointers; only the chunk metadata needs to be changed. The length code is a number from 0 to 31, where 0 means the maximum length of the page is 32 bytes, 1 means 48 bytes, 2\: 64, 3\: 96, 4\: 128, 5\: 192, and so on until 31 which means longer than 1 MB. That way, reading a page only requires one read operation (except for very large pages). The sum of the maximum length of all pages is stored in the chunk metadata (field "max"), and when a page is marked as removed, the live maximum length is adjusted. This makes it possible to estimate the amount of free space within a block, in addition to the number of free pages.
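The 64-bit page pointer layout described above can be sketched as follows; the exact bit packing and length-code table live in DataUtils, so the field order here is an assumption based on the description (26 + 32 + 5 + 1 = 64 bits):

```java
public class PagePosSketch {

    // Pack: 26-bit chunk id | 32-bit offset | 5-bit length code | 1-bit type.
    static long encode(int chunkId, int offset, int lengthCode, int type) {
        return ((long) chunkId << 38)
                | ((offset & 0xffffffffL) << 6)
                | ((long) lengthCode << 1)
                | type;
    }

    static int chunkId(long pos)    { return (int) (pos >>> 38); }
    static int offset(long pos)     { return (int) (pos >>> 6); }
    static int lengthCode(long pos) { return (int) ((pos >>> 1) & 31); }
    static int type(long pos)       { return (int) (pos & 1); }

    // Maximum page length for a length code: 32, 48, 64, 96, 128, 192, ...
    // (each pair of codes doubles, matching the sequence in the text).
    static int maxLength(int code) {
        return (code % 2 == 0 ? 32 : 48) << (code / 2);
    }
}
```

Knowing the maximum length up front is what lets a page be fetched with a single read, slightly over-reading instead of first reading a length field.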
mvstore_1172_p=\ The total number of entries in child nodes is kept to allow efficient range counting, lookup by index, and skip operations. The pages form a <a href\="http\://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
mvstore_1173_p=\ Data compression\: The data after the page type is optionally compressed, using either the LZF algorithm (compression level fast) or the Deflate algorithm (compression level high).
mvstore_1174_h3=Metadata Map
mvstore_1175_p=\ In addition to the user maps, there is one metadata map that contains names and positions of user maps, and chunk metadata. The very last page of a chunk contains the root page of that metadata map. The exact position of this root page is stored in the chunk header. This page (directly or indirectly) points to the root pages of all other maps. The metadata map of a store with a map named "data", and one chunk, contains the following entries\:
mvstore_1176_li=chunk.1\: The metadata of chunk 1. This is the same data as the chunk header, plus the number of live pages, and the maximum live length.
mvstore_1177_li=map.1\: The metadata of map 1. The entries are\: name, createVersion, and type.
mvstore_1178_li=name.data\: The map id of the map named "data". The value is "1".
mvstore_1179_li=root.1\: The root position of map 1.
mvstore_1180_li=setting.storeVersion\: The store version (a user defined value).
mvstore_1181_h2=Similar Projects and Differences to Other Storage Engines
mvstore_1182_p=\ Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java and can easily be embedded in Java and Android applications.
mvstore_1183_p=\ The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, and is also a log structured storage, but the H2 license is more liberal.
mvstore_1184_p=\ Like SQLite 3, the MVStore keeps all data in one file. Unlike SQLite 3, the MVStore uses a log structured storage. The plan is to make the MVStore both easier to use and faster than SQLite 3. In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.
mvstore_1185_p=\ The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek, and some code is shared between MVStore and MapDB. However, unlike MapDB, the MVStore uses a log structured storage. The MVStore does not have a record size limit.
mvstore_1186_h2=Current State
mvstore_1187_p=\ The code is still experimental at this stage. The API as well as the behavior may partially change. Features may be added and removed (even though the main features will stay).
mvstore_1188_h2=Requirements
mvstore_1189_p=\ The MVStore is included in the latest H2 jar file.
mvstore_1190_p=\ There are no special requirements to use it. The MVStore should run on any JVM as well as on Android.
mvstore_1191_p=\ To build just the MVStore (without the database engine), run\:
mvstore_1192_p=\ This will create the file <code>bin/h2mvstore-1.3.175.jar</code> (about 130 KB).
performance_1000_h1=Performance
performance_1001_a=\ Performance Comparison
performance_1002_a=\ PolePosition Benchmark
......
......@@ -229,7 +229,7 @@ public class Chunk {
DataUtils.appendMap(buff, "block", block);
DataUtils.appendMap(buff, "version", version);
byte[] bytes = buff.toString().getBytes(DataUtils.LATIN);
int checksum = DataUtils.getFletcher32(bytes, bytes.length / 2 * 2);
int checksum = DataUtils.getFletcher32(bytes, bytes.length);
DataUtils.appendMap(buff, "fletcher", checksum);
while (buff.length() < Chunk.FOOTER_LENGTH - 1) {
buff.append(' ');
......
......@@ -19,7 +19,7 @@ public class Cursor<K, V> implements Iterator<K> {
private final MVMap<K, ?> map;
private final K from;
private CursorPos pos;
private K current;
private K current, last;
private V currentValue, lastValue;
private final Page root;
private boolean initialized;
......@@ -44,11 +44,21 @@ public class Cursor<K, V> implements Iterator<K> {
public K next() {
hasNext();
K c = current;
last = current;
lastValue = currentValue;
fetchNext();
return c;
}
/**
* Get the last read key if there was one.
*
* @return the key or null
*/
public K getKey() {
return last;
}
/**
* Get the last read value if there was one.
*
......
......@@ -94,10 +94,15 @@ public class DataUtils {
public static final int PAGE_TYPE_NODE = 1;
/**
* The bit mask for compressed pages.
* The bit mask for compressed pages (compression level fast).
*/
public static final int PAGE_COMPRESSED = 2;
/**
* The bit mask for compressed pages (compression level high).
*/
public static final int PAGE_COMPRESSED_HIGH = 2 + 4;
/**
* The maximum length of a variable size int.
*/
......@@ -394,12 +399,13 @@ public class DataUtils {
}
/**
* Read from a file channel until the buffer is full, or end-of-file
* has been reached. The buffer is rewind after reading.
* Read from a file channel until the buffer is full.
* The buffer is rewound after reading.
*
* @param file the file channel
* @param pos the absolute position within the file
* @param dst the byte buffer
* @throws IllegalStateException if some data could not be read
*/
public static void readFully(FileChannel file, long pos, ByteBuffer dst) {
try {
......@@ -662,20 +668,26 @@ public class DataUtils {
* Calculate the Fletcher32 checksum.
*
* @param bytes the bytes
* @param length the message length (must be a multiple of 2)
* @param length the message length (if odd, 0 is appended)
* @return the checksum
*/
public static int getFletcher32(byte[] bytes, int length) {
int s1 = 0xffff, s2 = 0xffff;
for (int i = 0; i < length;) {
int i = 0, evenLength = length / 2 * 2;
while (i < evenLength) {
// reduce after 360 words (each word is two bytes)
for (int end = Math.min(i + 720, length); i < end;) {
for (int end = Math.min(i + 720, evenLength); i < end;) {
int x = ((bytes[i++] & 0xff) << 8) | (bytes[i++] & 0xff);
s2 += s1 += x;
}
s1 = (s1 & 0xffff) + (s1 >>> 16);
s2 = (s2 & 0xffff) + (s2 >>> 16);
}
if (i < length) {
// odd length: append 0
int x = (bytes[i] & 0xff) << 8;
s2 += s1 += x;
}
s1 = (s1 & 0xffff) + (s1 >>> 16);
s2 = (s2 & 0xffff) + (s2 >>> 16);
return (s2 << 16) | s1;
......
......@@ -30,11 +30,21 @@ public class FileStore {
*/
protected long readCount;
/**
* The number of read bytes.
*/
protected long readBytes;
/**
* The number of write operations.
*/
protected long writeCount;
/**
* The number of written bytes.
*/
protected long writeBytes;
/**
* The free spaces between the chunks. The first block to use is block 2
* (the first two blocks are the store header).
......@@ -85,9 +95,10 @@ public class FileStore {
* @return the byte buffer
*/
public ByteBuffer readFully(long pos, int len) {
readCount++;
ByteBuffer dst = ByteBuffer.allocate(len);
DataUtils.readFully(file, pos, dst);
readCount++;
readBytes += len;
return dst;
}
......@@ -98,9 +109,11 @@ public class FileStore {
* @param src the source buffer
*/
public void writeFully(long pos, ByteBuffer src) {
writeCount++;
fileSize = Math.max(fileSize, pos + src.remaining());
int len = src.remaining();
fileSize = Math.max(fileSize, pos + len);
DataUtils.writeFully(file, pos, src);
writeCount++;
writeBytes += len;
}
/**
......@@ -258,6 +271,15 @@ public class FileStore {
return writeCount;
}
/**
* Get the number of written bytes since this store was opened.
*
* @return the number of written bytes
*/
public long getWriteBytes() {
return writeBytes;
}
/**
* Get the number of read operations since this store was opened.
* For file based stores, this is the number of file read operations.
......@@ -268,6 +290,15 @@ public class FileStore {
return readCount;
}
/**
* Get the number of read bytes since this store was opened.
*
* @return the number of read bytes
*/
public long getReadBytes() {
return readBytes;
}
public boolean isReadOnly() {
return readOnly;
}
......
......@@ -16,6 +16,7 @@ import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import org.h2.mvstore.type.DataType;
import org.h2.mvstore.type.ObjectDataType;
import org.h2.util.New;
......@@ -765,11 +766,47 @@ public class MVMap<K, V> extends AbstractMap<K, V>
@Override
public Set<Map.Entry<K, V>> entrySet() {
HashMap<K, V> map = new HashMap<K, V>();
for (Cursor<K, V> cursor = cursor(null); cursor.hasNext();) {
map.put(cursor.next(), cursor.getValue());
}
return map.entrySet();
final MVMap<K, V> map = this;
final Page root = this.root;
return new AbstractSet<Entry<K, V>>() {
@Override
public Iterator<Entry<K, V>> iterator() {
final Cursor<K, V> cursor = new Cursor<K, V>(map, root, null);
return new Iterator<Entry<K, V>>() {
@Override
public boolean hasNext() {
return cursor.hasNext();
}
@Override
public Entry<K, V> next() {
K k = cursor.next();
return new DataUtils.MapEntry<K, V>(k, cursor.getValue());
}
@Override
public void remove() {
throw DataUtils.newUnsupportedOperationException(
"Removing is not supported");
}
};
}
@Override
public int size() {
return MVMap.this.size();
}
@Override
public boolean contains(Object o) {
return MVMap.this.containsKey(o);
}
};
}
@Override
......
......@@ -20,6 +20,7 @@ import java.util.Map.Entry;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.h2.compress.CompressDeflate;
import org.h2.compress.CompressLZF;
import org.h2.compress.Compressor;
import org.h2.mvstore.cache.CacheLongKeyLIRS;
......@@ -186,12 +187,14 @@ public class MVStore {
private int versionsToKeep = 5;
/**
* Whether to compress new pages. Even if disabled, the store may contain
* (old) compressed pages.
* The compression level for new pages (0 for disabled, 1 for fast, 2 for
* high). Even if disabled, the store may contain (old) compressed pages.
*/
private final boolean compress;
private final int compressionLevel;
private final Compressor compressor = new CompressLZF();
private Compressor compressorFast;
private Compressor compressorHigh;
private final UncaughtExceptionHandler backgroundExceptionHandler;
......@@ -247,9 +250,10 @@ public class MVStore {
* @throws IllegalArgumentException if the directory does not exist
*/
MVStore(HashMap<String, Object> config) {
this.compress = config.containsKey("compress");
Object o = config.get("compress");
this.compressionLevel = o == null ? 0 : (Integer) o;
String fileName = (String) config.get("fileName");
Object o = config.get("pageSplitSize");
o = config.get("pageSplitSize");
if (o == null) {
pageSplitSize = fileName == null ? 4 * 1024 : 16 * 1024;
} else {
......@@ -525,7 +529,7 @@ public class MVStore {
s = s.substring(0, s.lastIndexOf("fletcher") - 1);
byte[] bytes = s.getBytes(DataUtils.LATIN);
int checksum = DataUtils.getFletcher32(bytes,
bytes.length / 2 * 2);
bytes.length);
if (check != checksum) {
continue;
}
......@@ -683,7 +687,7 @@ public class MVStore {
m.remove("fletcher");
s = s.substring(0, s.lastIndexOf("fletcher") - 1);
byte[] bytes = s.getBytes(DataUtils.LATIN);
int checksum = DataUtils.getFletcher32(bytes, bytes.length / 2 * 2);
int checksum = DataUtils.getFletcher32(bytes, bytes.length);
if (check == checksum) {
int chunk = DataUtils.readHexInt(m, "chunk", 0);
Chunk c = new Chunk(chunk);
......@@ -706,7 +710,7 @@ public class MVStore {
}
DataUtils.appendMap(buff, fileHeader);
byte[] bytes = buff.toString().getBytes(DataUtils.LATIN);
int checksum = DataUtils.getFletcher32(bytes, bytes.length / 2 * 2);
int checksum = DataUtils.getFletcher32(bytes, bytes.length);
DataUtils.appendMap(buff, "fletcher", checksum);
buff.append("\n");
bytes = buff.toString().getBytes(DataUtils.LATIN);
......@@ -1676,12 +1680,22 @@ public class MVStore {
}
}
Compressor getCompressor() {
return compressor;
Compressor getCompressorFast() {
if (compressorFast == null) {
compressorFast = new CompressLZF();
}
return compressorFast;
}
Compressor getCompressorHigh() {
if (compressorHigh == null) {
compressorHigh = new CompressDeflate();
}
return compressorHigh;
}
boolean getCompress() {
return compress;
int getCompressionLevel() {
return compressionLevel;
}
public int getPageSplitSize() {
......@@ -2247,6 +2261,15 @@ public class MVStore {
return (int) (cache.getMaxMemory() / 1024 / 1024);
}
/**
* Get the cache.
*
* @return the cache
*/
public CacheLongKeyLIRS<Page> getCache() {
return cache;
}
/**
* A background writer thread to automatically store changes from time to
* time.
@@ -2391,10 +2414,25 @@ public class MVStore {
*
* @return this
*/
public Builder compress() {
return set("compress", 1);
}
/**
* Compress data before writing using the Deflate algorithm. This will
* save more disk space, but will slow down read and write operations
* quite a bit.
* <p>
* This setting only affects writes; it is not necessary to enable
* compression when reading, even if compression was enabled when
* writing.
*
* @return this
*/
public Builder compressHigh() {
return set("compress", 2);
}
/**
* Set the amount of memory a page should contain at most, in bytes,
* before it is split. The default is 16 KB for persistent stores and 4
* KB for in-memory stores.
@@ -39,6 +39,7 @@ public class OffHeapStore extends FileStore {
"Could not read from position {0}", pos);
}
readCount++;
readBytes += len;
ByteBuffer buff = memEntry.getValue();
ByteBuffer read = buff.duplicate();
int offset = (int) (pos - memEntry.getKey());
@@ -81,6 +82,7 @@ public class OffHeapStore extends FileStore {
"partial overwrite is not supported", pos);
}
writeCount++;
writeBytes += length;
buff.rewind();
buff.put(src);
return;
@@ -95,8 +97,9 @@ public class OffHeapStore extends FileStore {
}
private void writeNewEntry(long pos, ByteBuffer src) {
int length = src.remaining();
writeCount++;
writeBytes += length;
ByteBuffer buff = ByteBuffer.allocateDirect(length);
buff.put(src);
buff.rewind();
@@ -777,7 +777,13 @@ public class Page {
}
boolean compressed = (type & DataUtils.PAGE_COMPRESSED) != 0;
if (compressed) {
Compressor compressor;
if ((type & DataUtils.PAGE_COMPRESSED_HIGH) ==
DataUtils.PAGE_COMPRESSED_HIGH) {
compressor = map.getStore().getCompressorHigh();
} else {
compressor = map.getStore().getCompressorFast();
}
int lenAdd = DataUtils.readVarInt(buff);
int compLen = pageLength + start - buff.position();
byte[] comp = DataUtils.newBytes(compLen);
@@ -827,18 +833,30 @@ public class Page {
}
MVStore store = map.getStore();
int expLen = buff.position() - compressStart;
if (expLen > 16) {
int compressionLevel = store.getCompressionLevel();
if (compressionLevel > 0) {
Compressor compressor;
int compressType;
if (compressionLevel == 1) {
compressor = map.getStore().getCompressorFast();
compressType = DataUtils.PAGE_COMPRESSED;
} else {
compressor = map.getStore().getCompressorHigh();
compressType = DataUtils.PAGE_COMPRESSED_HIGH;
}
byte[] exp = new byte[expLen];
buff.position(compressStart).get(exp);
byte[] comp = new byte[expLen * 2];
int compLen = compressor.compress(exp, expLen, comp, 0);
int plus = DataUtils.getVarIntLen(compLen - expLen);
if (compLen + plus < expLen) {
buff.position(typePos).
put((byte) (type + compressType));
buff.position(compressStart).
putVarInt(expLen - compLen).
put(comp, 0, compLen);
}
}
}
int pageLength = buff.position() - start;
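The rewritten write path above only attempts compression for pages whose serialized form exceeds 16 bytes, picks the fast (level 1) or high (level 2) compressor, and keeps the compressed form only when it is strictly smaller than the expanded data plus the varint overhead. The decision can be sketched in isolation using java.util.zip.Deflater as a stand-in for H2's Compressor implementations (the level constants and the omitted PAGE_COMPRESSED type bits are simplifications, not H2's actual encoding):

```java
import java.util.zip.Deflater;

// "Compress only if it pays off": return the compressed bytes when they
// are strictly smaller than the input, otherwise the original bytes.
class CompressIfSmaller {
    static byte[] maybeCompress(byte[] exp, int compressionLevel) {
        if (exp.length <= 16 || compressionLevel == 0) {
            return exp; // too small to bother, or compression disabled
        }
        // level 1 -> fast, level 2 -> best ratio (stand-ins for LZF/Deflate)
        Deflater deflater = new Deflater(
                compressionLevel == 1 ? Deflater.BEST_SPEED
                                      : Deflater.BEST_COMPRESSION);
        deflater.setInput(exp);
        deflater.finish();
        // 2x the input is ample for a single finished deflate pass here
        byte[] comp = new byte[exp.length * 2];
        int compLen = deflater.deflate(comp);
        deflater.end();
        if (compLen > 0 && compLen < exp.length) {
            byte[] result = new byte[compLen];
            System.arraycopy(comp, 0, result, 0, compLen);
            return result;
        }
        return exp; // compression did not help; store uncompressed
    }
}
```

Storing the page uncompressed when compression does not shrink it means readers never pay a decompression cost without a size benefit, which is why the page type bit (rather than a store-wide flag) decides decompression on read.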
@@ -301,7 +301,10 @@ public class WriteBuffer {
private void grow(int len) {
ByteBuffer temp = buff;
int needed = len - temp.remaining();
int grow = Math.max(needed, MIN_GROW);
// grow at least 50% of the current size
grow = Math.max(temp.capacity() / 2, grow);
int newCapacity = temp.capacity() + grow;
buff = ByteBuffer.allocate(newCapacity);
temp.flip();
buff.put(temp);
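The grow() change above switches from a fixed minimum increment to growing by at least 50% of the current capacity, turning a long series of appends from quadratic copying into amortized constant time per byte. The sizing rule in isolation (MIN_GROW of 1 MB is a placeholder value, not necessarily H2's constant):

```java
// Buffer growth policy: grow by whichever is largest of the bytes actually
// needed, a fixed minimum increment, and half the current capacity.
class GrowPolicy {
    static final int MIN_GROW = 1024 * 1024; // placeholder value

    /** New capacity when 'len' bytes must fit but only 'remaining' are free. */
    static int newCapacity(int capacity, int remaining, int len) {
        int needed = len - remaining;
        int grow = Math.max(needed, MIN_GROW);
        // grow at least 50% of the current size
        grow = Math.max(capacity / 2, grow);
        return capacity + grow;
    }
}
```

For example, a 4 MB buffer that needs 100 more bytes grows to 6 MB (the capacity/2 term dominates), while a tiny buffer asked to hold a single huge write grows by exactly the needed amount.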
@@ -79,7 +79,7 @@ public class MVTableEngine implements TableEngine {
builder.encryptionKey(password);
}
if (db.getSettings().compressData) {
builder.compress();
// use a larger page split size to improve the compression ratio
builder.pageSplitSize(64 * 1024);
}
@@ -61,6 +61,21 @@ public class TestDataUtils extends TestBase {
for (int i = 0; i < 10000; i += 1000) {
assertEquals(-1, DataUtils.getFletcher32(data, i));
}
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 255; j++) {
Arrays.fill(data, 0, i, (byte) j);
data[i] = 0;
int a = DataUtils.getFletcher32(data, i);
if (i % 2 == 1) {
// add length: same as appending a 0
int b = DataUtils.getFletcher32(data, i + 1);
assertEquals(a, b);
}
data[i] = 10;
int c = DataUtils.getFletcher32(data, i);
assertEquals(a, c);
}
}
long last = 0;
for (int i = 1; i < 255; i++) {
Arrays.fill(data, (byte) i);
@@ -11,6 +11,7 @@ import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Random;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
@@ -51,6 +52,7 @@ public class TestMVStore extends TestBase {
public void test() throws Exception {
FileUtils.deleteRecursive(getBaseDir(), true);
FileUtils.createDirectories(getBaseDir());
testEntrySet();
testCompressEmptyPage();
testCompressed();
testFileFormatExample();
@@ -108,11 +110,26 @@ public class TestMVStore extends TestBase {
testLargerThan2G();
}
private void testEntrySet() {
MVStore s = new MVStore.Builder().open();
MVMap<Integer, Integer> map;
map = s.openMap("data");
for (int i = 0; i < 20; i++) {
map.put(i, i * 10);
}
int next = 0;
for (Entry<Integer, Integer> e : map.entrySet()) {
assertEquals(next, e.getKey().intValue());
assertEquals(next * 10, e.getValue().intValue());
next++;
}
}
private void testCompressEmptyPage() {
String fileName = getBaseDir() + "/testDeletedMap.h3";
MVStore store = new MVStore.Builder().
cacheSize(100).fileName(fileName).
compress().
autoCommitBufferSize(10 * 1024).
open();
MVMap<String, String> map = store.openMap("test");
@@ -120,26 +137,41 @@ public class TestMVStore extends TestBase {
store.commit();
store.close();
store = new MVStore.Builder().
compress().
open();
store.close();
}
private void testCompressed() {
String fileName = getBaseDir() + "/testCompressed.h3";
long lastSize = 0;
for (int level = 0; level <= 2; level++) {
FileUtils.delete(fileName);
MVStore.Builder builder = new MVStore.Builder().fileName(fileName);
if (level == 1) {
builder.compress();
} else if (level == 2) {
builder.compressHigh();
}
MVStore s = builder.open();
MVMap<String, String> map = s.openMap("data");
String data = new String(new char[1000]).replace((char) 0, 'x');
for (int i = 0; i < 400; i++) {
map.put(data + i, data);
}
s.close();
long size = FileUtils.size(fileName);
if (level > 0) {
assertTrue(size < lastSize);
}
lastSize = size;
s = new MVStore.Builder().fileName(fileName).open();
map = s.openMap("data");
for (int i = 0; i < 400; i++) {
assertEquals(data, map.get(data + i));
}
s.close();
}
}
private void testFileFormatExample() {
......@@ -707,7 +739,7 @@ public class TestMVStore extends TestBase {
s = new MVStore.Builder().
fileName(fileName).
autoCommitDisabled().
compress().open();
map = s.openMap("test");
// add 10 MB of data
for (int i = 0; i < 1024; i++) {