mvstore.html 33.2 KB
Newer Older
1 2
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Thomas Mueller's avatar
Thomas Mueller committed
3
Copyright 2004-2013 H2 Group. Multiple-Licensed under the H2 License, Version 1.0,
4 5 6 7 8 9
and under the Eclipse Public License, Version 1.0
(http://h2database.com/html/license.html).
Initial Developer: H2 Group
-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>
10
MVStore
11 12 13 14 15 16 17 18 19 20 21 22
</title><link rel="stylesheet" type="text/css" href="stylesheet.css" />
<!-- [search] { -->
<script type="text/javascript" src="navigation.js"></script>
</head><body onload="frameMe();">
<table class="content"><tr class="content"><td class="content"><div class="contentDiv">
<!-- } -->

<h1>MVStore</h1>
<a href="#overview">
    Overview</a><br />
<a href="#example_code">
    Example Code</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
23 24 25 26
<a href="#store_builder">
    Store Builder</a><br />
<a href="#r_tree">
    R-Tree</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
27

28 29
<a href="#features">
    Features</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
30 31 32 33 34 35 36 37 38
<a href="#maps">- Maps</a><br />
<a href="#versions">- Versions</a><br />
<a href="#transactions">- Transactions</a><br />
<a href="#inMemory">- In-Memory Performance and Usage</a><br />
<a href="#dataTypes">- Pluggable Data Types</a><br />
<a href="#blob">- BLOB Support</a><br />
<a href="#pluggableMap">- R-Tree and Pluggable Map Implementations</a><br />
<a href="#caching">- Concurrent Operations and Caching</a><br />
<a href="#logStructured">- Log Structured Storage</a><br />
39
<a href="#offHeap">- Off-Heap and Pluggable Storage</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
40 41 42 43
<a href="#fileSystem">- File System Abstraction, File Locking and Online Backup</a><br />
<a href="#encryption">- Encrypted Files</a><br />
<a href="#tools">- Tools</a><br />
<a href="#exceptionHandling">- Exception Handling</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
44
<a href="#storageEngine">- Storage Engine for H2</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
45

46 47 48
<a href="#fileFormat">
    File Format</a><br />

49 50 51 52 53 54 55 56
<a href="#differences">
    Similar Projects and Differences to Other Storage Engines</a><br />
<a href="#current_state">
    Current State</a><br />
<a href="#requirements">
    Requirements</a><br />

<h2 id="overview">Overview</h2>
57 58
<p>
The MVStore is work in progress, and is planned to be the next storage subsystem of H2.
Thomas Mueller's avatar
Thomas Mueller committed
59
But it can also be used directly within an application, without using JDBC or SQL.
60
</p>
61
<ul><li>MVStore stands for "multi-version store".
Thomas Mueller's avatar
Thomas Mueller committed
62
</li><li>Each store contains a number of maps that can be accessed using the <code>java.util.Map</code> interface.
63
</li><li>Both file-based persistence and in-memory operation are supported.
64 65
</li><li>It is intended to be fast, simple to use, and small.
</li><li>Old versions of the data can be read concurrently with all other operations.
Thomas Mueller's avatar
Thomas Mueller committed
66
</li><li>Transactions are supported (including concurrent transactions and 2-phase commit).
Thomas Mueller's avatar
Thomas Mueller committed
67
</li><li>The tool is very modular.
Thomas Mueller's avatar
Thomas Mueller committed
68
    It supports pluggable data types and serialization,
Thomas Mueller's avatar
Thomas Mueller committed
69
    pluggable storage (to a file, to off-heap memory),
Thomas Mueller's avatar
Thomas Mueller committed
70
    pluggable map implementations (B-tree, R-tree, concurrent B-tree currently),
Thomas Mueller's avatar
Thomas Mueller committed
71
    BLOB storage,
72
    and a file system abstraction to support encrypted files and zip files.
73 74 75
</li></ul>

<h2 id="example_code">Example Code</h2>
76
<p>
Thomas Mueller's avatar
Thomas Mueller committed
77
The following sample code shows how to use the tool:
78
</p>
79
<pre>
Thomas Mueller's avatar
Thomas Mueller committed
80 81
import org.h2.mvstore.*;

82 83 84
// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);

85
// create/get the map named "data"
86
MVMap&lt;Integer, String&gt; map = s.openMap("data");
87

Thomas Mueller's avatar
Thomas Mueller committed
88 89 90
// add and read some data
map.put(1, "Hello World");
System.out.println(map.get(1));
91

Thomas Mueller's avatar
Thomas Mueller committed
92
// close the store (this will persist changes)
93 94 95
s.close();
</pre>

Thomas Mueller's avatar
Thomas Mueller committed
96
<h2 id="store_builder">Store Builder</h2>
97
<p>
98
The <code>MVStore.Builder</code> provides a fluid interface
Thomas Mueller's avatar
Thomas Mueller committed
99
to build a store if configuration options are needed.
Thomas Mueller's avatar
Thomas Mueller committed
100
Example usage:
101 102
</p>
<pre>
103
MVStore s = new MVStore.Builder().
Thomas Mueller's avatar
Thomas Mueller committed
104
    fileName(fileName).
Thomas Mueller's avatar
Thomas Mueller committed
105
    encryptionKey("007".toCharArray()).
106
    compress().
107 108
    open();
</pre>
Thomas Mueller's avatar
Thomas Mueller committed
109 110 111 112 113
<p>
The list of available options is:
</p>
<ul><li>autoCommitBufferSize: the size of the write buffer.
</li><li>autoCommitDisabled: to disable auto-commit.
Thomas Mueller's avatar
Thomas Mueller committed
114
</li><li>backgroundExceptionHandler: a handler for
Thomas Mueller's avatar
Thomas Mueller committed
115 116
    exceptions that could occur while writing in the background.
</li><li>cacheSize: the cache size in MB.
117 118 119 120
</li><li>compress: compress the data when storing
    using a fast algorithm (LZF).
</li><li>compressHigh: compress the data when storing
    using a slower algorithm (Deflate).
Thomas Mueller's avatar
Thomas Mueller committed
121
</li><li>encryptionKey: the key for file encryption.
Thomas Mueller's avatar
Thomas Mueller committed
122
</li><li>fileName: the name of the file, for file based stores.
Thomas Mueller's avatar
Thomas Mueller committed
123
</li><li>fileStore: the storage implementation to use.
Thomas Mueller's avatar
Thomas Mueller committed
124
</li><li>pageSplitSize: the point where pages are split.
Thomas Mueller's avatar
Thomas Mueller committed
125 126
</li><li>readOnly: open the file in read-only mode.
</li></ul>
127

Thomas Mueller's avatar
Thomas Mueller committed
128
<h2 id="r_tree">R-Tree</h2>
129 130
<p>
The <code>MVRTreeMap</code> is an R-tree implementation
131
that supports fast spatial queries. It can be used as follows:
132 133 134 135 136
</p>
<pre>
// create an in-memory store
MVStore s = MVStore.open(null);

137
// open an R-tree map
138 139
MVRTreeMap&lt;String&gt; r = s.openMap("data",
        new MVRTreeMap.Builder&lt;String&gt;());
140 141 142 143 144 145 146 147

// add two key-value pairs
// the first value is the key id (to make the key unique)
// then the min x, max x, min y, max y
r.add(new SpatialKey(0, -3f, -2f, 2f, 3f), "left");
r.add(new SpatialKey(1, 3f, 4f, 4f, 5f), "right");

// iterate over the intersecting keys
148
Iterator&lt;SpatialKey&gt; it =
149 150 151 152 153 154 155
        r.findIntersectingKeys(new SpatialKey(0, 0f, 9f, 3f, 6f));
for (SpatialKey k; it.hasNext();) {
    k = it.next();
    System.out.println(k + ": " + r.get(k));
}
s.close();
</pre>
Thomas Mueller's avatar
Thomas Mueller committed
156 157
<p>
The default number of dimensions is 2. To use a different number of dimensions,
Thomas Mueller's avatar
Thomas Mueller committed
158
call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>.
Thomas Mueller's avatar
Thomas Mueller committed
159
The minimum number of dimensions is 1, the maximum is 32.
Thomas Mueller's avatar
Thomas Mueller committed
160
</p>
161 162 163

<h2 id="features">Features</h2>

164
<h3 id="maps">Maps</h3>
165
<p>
Thomas Mueller's avatar
Thomas Mueller committed
166
Each store contains a set of named maps.
167
A map is sorted by key, and supports the common lookup operations,
168 169
including access to the first and last key, iterate over some or all keys, and so on.
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
170
Also supported, and very uncommon for maps, is fast index lookup:
Thomas Mueller's avatar
Thomas Mueller committed
171 172 173 174
the entries of the map can be be efficiently accessed like a random-access list
(get the entry at the given index), and the index of a key can be calculated efficiently.
That also means getting the median of two keys is very fast,
and a range of keys can be counted very quickly.
175 176
The iterator supports fast skipping.
This is possible because internally, each map is organized in the form of a counted B+-tree.
177
</p><p>
178
In database terms, a map can be used like a table, where the key of the map is the primary key of the table,
179
and the value is the row. A map can also represent an index, where the key of the map is the key
Thomas Mueller's avatar
Thomas Mueller committed
180
of the index, and the value of the map is the primary key of the table (for non-unique indexes,
181
the key of the map must also contain the primary key).
182 183
</p>

184
<h3 id="versions">Versions</h3>
185 186
<p>
A version is a snapshot of all the data of all maps at a given point in time.
Thomas Mueller's avatar
Thomas Mueller committed
187
Creating a snapshot is fast: only those pages that are changed after a snapshot are copied.
Thomas Mueller's avatar
Thomas Mueller committed
188
This behavior is also called COW (copy on write).
Thomas Mueller's avatar
Thomas Mueller committed
189
Rollback to an old version is supported.
Thomas Mueller's avatar
Thomas Mueller committed
190
Old versions are readable until old data is purged.
Thomas Mueller's avatar
Thomas Mueller committed
191 192 193
</p><p>
The following sample code show how to create a store, open a map, add some data,
and access the current and an old version:
194
</p>
Thomas Mueller's avatar
Thomas Mueller committed
195 196 197 198 199 200 201 202 203 204 205 206
<pre>
// create/get the map named "data"
MVMap&lt;Integer, String&gt; map = s.openMap("data");

// add some data
map.put(1, "Hello");
map.put(2, "World");

// get the current version, for later use
long oldVersion = s.getCurrentVersion();

// from now on, the old version is read-only
Thomas Mueller's avatar
Thomas Mueller committed
207
s.commit();
Thomas Mueller's avatar
Thomas Mueller committed
208 209 210 211 212 213 214

// more changes, in the new version
// changes can be rolled back if required
// changes always go into "head" (the newest version)
map.put(1, "Hi");
map.remove(2);

Thomas Mueller's avatar
Thomas Mueller committed
215
// access the old data (before the commit)
Thomas Mueller's avatar
Thomas Mueller committed
216 217 218 219 220 221 222 223 224 225 226 227
MVMap&lt;Integer, String&gt; oldMap =
        map.openVersion(oldVersion);

// print the old version (can be done
// concurrently with further modifications)
// this will print "Hello" and "World":
System.out.println(oldMap.get(1));
System.out.println(oldMap.get(2));

// print the newest version ("Hi")
System.out.println(map.get(1));
</pre>
228

229 230 231 232
<h3 id="transactions">Transactions</h3>
<p>
To support multiple concurrent open transactions, a transaction utility is included,
the <code>TransactionStore</code>.
Thomas Mueller's avatar
Thomas Mueller committed
233 234 235 236 237
The tool supports PostgreSQL style "read committed" transaction isolation
with savepoints, two-phase commit, and other features typically available in a database.
There is no limit on the size of a transaction
(the log is written to disk for large or long running transactions).
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
238 239
Internally, this utility stores the old versions of changed entries in a separate map, similar to a transaction log,
except that entries of a closed transaction are removed, and the log is usually not stored for short transactions.
Thomas Mueller's avatar
Thomas Mueller committed
240
For common use cases, the storage overhead of this utility is very small compared to the overhead of a regular transaction log.
241 242 243
</p>

<h3 id="inMemory">In-Memory Performance and Usage</h3>
244
<p>
Thomas Mueller's avatar
Thomas Mueller committed
245 246
Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>,
but usually slower than <code>java.util.HashMap</code>.
247 248
</p><p>
The memory overhead for large maps is slightly better than for the regular
249
map implementations, but there is a higher overhead per map.
Thomas Mueller's avatar
Thomas Mueller committed
250
For maps with less than about 25 entries, the regular map implementations need less memory.
Thomas Mueller's avatar
Thomas Mueller committed
251 252 253 254 255 256
</p><p>
If no file name is specified, the store operates purely in memory.
Except for persisting data, all features are supported in this mode
(multi-versioning, index lookup, R-tree and so on).
If a file name is specified, all operations occur in memory (with the same
performance characteristics) until data is persisted.
Thomas Mueller's avatar
Thomas Mueller committed
257
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
258 259
As in all map implementations, keys need to be immutable, that means
changing the key object after an entry has been added is not allowed.
Thomas Mueller's avatar
Thomas Mueller committed
260
If a file name is specified, the value may also not be changed after
Thomas Mueller's avatar
Thomas Mueller committed
261
adding an entry, because it might be serialized
Thomas Mueller's avatar
Thomas Mueller committed
262
(which could happen at any time when autocommit is enabled).
263 264
</p>

265
<h3 id="dataTypes">Pluggable Data Types</h3>
266
<p>
267
Serialization is pluggable. The default serialization currently supports many common data types,
268 269
and uses Java serialization for other objects. The following classes are currently directly supported:
<code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal,
Thomas Mueller's avatar
Thomas Mueller committed
270
String, UUID, Date</code> and arrays (both primitive arrays and object arrays).
271
For serialized objects, the size estimate is adjusted using an exponential moving average.
272 273
</p><p>
Parameterized data types are supported
Thomas Mueller's avatar
Thomas Mueller committed
274
(for example one could build a string data type that limits the length).
275
</p><p>
276
The storage engine itself does not have any length limits, so that keys, values,
277 278 279 280 281
pages, and chunks can be very big (as big as fits in memory).
Also, there is no inherent limit to the number of maps and chunks.
Due to using a log structured storage, there is no special case handling for large keys or pages.
</p>

282
<h3 id="blob">BLOB Support</h3>
283
<p>
284
There is a mechanism that stores large binary objects by splitting them into smaller blocks.
285 286
This allows to store objects that don't fit in memory.
Streaming as well as random access reads on such objects are supported.
Thomas Mueller's avatar
Thomas Mueller committed
287
This tool is written on top of the store, using only the map interface.
288 289
</p>

290
<h3 id="pluggableMap">R-Tree and Pluggable Map Implementations</h3>
291 292
<p>
The map implementation is pluggable.
Thomas Mueller's avatar
Thomas Mueller committed
293
In addition to the default <code>MVMap</code> (multi-version map),
Thomas Mueller's avatar
Thomas Mueller committed
294 295
there is a map that supports concurrent write operations,
and a multi-version R-tree map implementation for spatial operations.
296 297
</p>

298
<h3 id="caching">Concurrent Operations and Caching</h3>
299
<p>
300
The default map implementation supports concurrent reads on old versions of the data.
301
All such read operations can occur in parallel. Concurrent reads from the page cache,
302
as well as concurrent reads from the file system are supported.
Thomas Mueller's avatar
Thomas Mueller committed
303 304
Writing changes to the file can occur concurrently to modifying the data,
as writing operates on a snapshot.
305
</p><p>
306
Caching is done on the page level.
307
The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
308
</p><p>
309
The default map implementation does not support concurrent modification
310
operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>).
Thomas Mueller's avatar
Thomas Mueller committed
311
Similar to those classes, the map tries to detect concurrent modifications.
312
</p><p>
313
With the <code>MVMapConcurrent</code> implementation,
314 315
read operations even on the newest version can happen concurrently with all other
operations, without risk of corruption.
316
This comes with slightly reduced speed in single threaded mode,
317
the same as with other <code>ConcurrentHashMap</code> implementations.
Thomas Mueller's avatar
Thomas Mueller committed
318
Write operations first read the relevant pages from disk to memory
Thomas Mueller's avatar
Thomas Mueller committed
319
(this can happen concurrently), and only then modify the data.
Thomas Mueller's avatar
Thomas Mueller committed
320
The in-memory parts of write operations are synchronized.
321
</p><p>
322 323
For fully scalable concurrent write operations to a map (in-memory and to disk),
the map could be split into multiple maps in different stores ('sharding').
324
The plan is to add such a mechanism later when needed.
325 326
</p>

327
<h3 id="logStructured">Log Structured Storage</h3>
Thomas Mueller's avatar
Thomas Mueller committed
328
<p>
Thomas Mueller's avatar
Thomas Mueller committed
329
Internally, changes are buffered in memory, and once enough changes have accumulated,
Thomas Mueller's avatar
Thomas Mueller committed
330
they are written in one continuous disk write operation.
Thomas Mueller's avatar
Thomas Mueller committed
331 332 333
Compared to traditional database storage engines,
this should improve write performance for file systems and storage systems
that do not efficiently support small random writes, such as Btrfs, as well as SSDs.
Thomas Mueller's avatar
Thomas Mueller committed
334
(According to a test, write throughput of a common SSD increases with write block size,
335
until a block size of 2 MB, and then does not further increase.)
Thomas Mueller's avatar
Thomas Mueller committed
336
By default, changes are automatically written when more than a number of pages are modified,
Thomas Mueller's avatar
Thomas Mueller committed
337 338
and once every second in a background thread, even if only little data was changed.
Changes can also be written explicitly by calling <code>commit()</code>.
Thomas Mueller's avatar
Thomas Mueller committed
339 340
</p><p>
When storing, all changed pages are serialized,
341
optionally compressed using the LZF algorithm,
Thomas Mueller's avatar
Thomas Mueller committed
342 343 344 345
and written sequentially to a free area of the file.
Each such change set is called a chunk.
All parent pages of the changed B-trees are stored in this chunk as well,
so that each chunk also contains the root of each changed map
Thomas Mueller's avatar
Thomas Mueller committed
346
(which is the entry point for reading this version of the data).
Thomas Mueller's avatar
Thomas Mueller committed
347
There is no separate index: all data is stored as a list of pages.
Thomas Mueller's avatar
Thomas Mueller committed
348
Per store, there is one additional map that contains the metadata (the list of
Thomas Mueller's avatar
Thomas Mueller committed
349 350
maps, where the root page of each map is stored, and the list of chunks).
</p><p>
351 352 353
There are usually two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk).
If the chunk is appended at the end of the file, the file header is only written at the end of the chunk.
Thomas Mueller's avatar
Thomas Mueller committed
354
There is no transaction log, no undo log,
Thomas Mueller's avatar
Thomas Mueller committed
355
and there are no in-place updates (however, unused chunks are overwritten by default).
Thomas Mueller's avatar
Thomas Mueller committed
356
</p><p>
357
Old data is kept for at least 45 seconds (configurable),
Thomas Mueller's avatar
Thomas Mueller committed
358 359
so that there are no explicit sync operations required to guarantee data consistency.
An application can also sync explicitly when needed.
Thomas Mueller's avatar
Thomas Mueller committed
360
To reuse disk space, the chunks with the lowest amount of live data are compacted
Thomas Mueller's avatar
Thomas Mueller committed
361
(the live data is stored again in the next chunk).
Thomas Mueller's avatar
Thomas Mueller committed
362 363
To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
364
Compared to traditional storage engines (that use a transaction log, undo log, and main storage area),
Thomas Mueller's avatar
Thomas Mueller committed
365 366 367 368 369 370 371
the log structured storage is simpler, more flexible, and typically needs less disk operations per change,
as data is only written once instead of twice or 3 times, and because the B-tree pages are
always full (they are stored next to each other) and can be easily compressed.
But temporarily, disk space usage might actually be a bit higher than for a regular database,
as disk space is not immediately re-used (there are no in-place updates).
</p>

372 373
<h3 id="offHeap">Off-Heap and Pluggable Storage</h3>
<p>
Thomas Mueller's avatar
Thomas Mueller committed
374
Storage is pluggable. Unless pure in-memory operation is used, the default storage is to a single file.
375 376 377 378
</p>
<p>
An off-heap storage implementation is available. This storage keeps the data in the off-heap memory,
meaning outside of the regular garbage collected heap. This allows to use very large in-memory
Thomas Mueller's avatar
Thomas Mueller committed
379 380
stores without having to increase the JVM heap, which would increase Java garbage collection
pauses a lot. Memory is allocated using <code>ByteBuffer.allocateDirect</code>.
381
One chunk is allocated at a time (each chunk is usually a few MB large), so that
382
allocation cost is low. To use the off-heap storage, call:
383
</p>
384 385 386 387 388
<pre>
OffHeapStore offHeap = new OffHeapStore();
MVStore s = new MVStore.Builder().
        fileStore(offHeap).open();
</pre>
389

390
<h3 id="fileSystem">File System Abstraction, File Locking and Online Backup</h3>
391
<p>
Thomas Mueller's avatar
Thomas Mueller committed
392
The file system is pluggable. The same file system abstraction is used as H2 uses.
393
The file can be encrypted using a encrypting file system wrapper.
Thomas Mueller's avatar
Thomas Mueller committed
394
Other file system implementations support reading from a compressed zip or jar file.
395
The file system abstraction closely matches the Java 7 file system API.
396 397 398 399 400 401 402 403
</p>
<p>
Each store may only be opened once within a JVM.
When opening a store, the file is locked in exclusive mode, so that
the file can only be changed from within one process.
Files can be opened in read-only mode, in which case a shared lock is used.
</p>
<p>
404
The persisted data can be backed up at any time,
405
even during write operations (online backup).
406
To do that, automatic disk space reuse needs to be first disabled, so that
407
new data is always appended at the end of the file.
Thomas Mueller's avatar
Thomas Mueller committed
408
Then, the file can be copied. The file handle is available to the application.
409
It is recommended to use the utility class <code>FileChannelInputStream</code> to do this.
Thomas Mueller's avatar
Thomas Mueller committed
410
For encrypted databases, both the encrypted (raw) file content,
411
as well as the clear text content, can be backed up.
412 413
</p>

414
<h3 id="encryption">Encrypted Files</h3>
415
<p>
416
File encryption ensures the data can only be read with the correct password.
417 418 419 420 421 422 423 424 425
Data can be encrypted as follows:
</p>
<pre>
MVStore s = new MVStore.Builder().
    fileName(fileName).
    encryptionKey("007".toCharArray()).
    open();
</pre>
<p>
426
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
427
The following algorithms and settings are used:
428
</p>
429
<ul><li>The password char array is cleared after use,
430 431
    to reduce the risk that the password is stolen
    even if the attacker has access to the main memory.
432
</li><li>The password is hashed according to the PBKDF2 standard,
433
    using the SHA-256 hash algorithm.
Thomas Mueller's avatar
Thomas Mueller committed
434
</li><li>The length of the salt is 64 bits,
435
    so that an attacker can not use a pre-calculated password hash table (rainbow table).
Thomas Mueller's avatar
Thomas Mueller committed
436 437 438 439 440 441 442
    It is generated using a cryptographically secure random number generator.
</li><li>To speed up opening an encrypted stores on Android,
    the number of PBKDF2 iterations is 10.
    The higher the value, the better the protection against brute-force password cracking attacks,
    but the slower is opening a file.
</li><li>The file itself is encrypted using the standardized disk encryption mode XTS-AES.
    Only little more than one AES-128 round per block is needed.
443
</li></ul>
444

445
<h3 id="tools">Tools</h3>
446
<p>
Thomas Mueller's avatar
Thomas Mueller committed
447
There is a tool, the <code>MVStoreTool</code>, to dump the contents of a file.
448
</p>
449

450
<h3 id="exceptionHandling">Exception Handling</h3>
451
<p>
452 453
This tool does not throw checked exceptions.
Instead, unchecked exceptions are thrown if needed.
454 455 456
The error message always contains the version of the tool.
The following exceptions can occur:
</p>
457
<ul><li><code>IllegalStateException</code> if a map was already closed or
458 459 460
    an IO exception occurred, for example if the file was locked, is already closed,
    could not be opened or closed, if reading or writing failed,
    if the file is corrupt, or if there is an internal error in the tool.
Thomas Mueller's avatar
Thomas Mueller committed
461
    For such exceptions, an error code is added
Thomas Mueller's avatar
Thomas Mueller committed
462
    so that the application can distinguish between different error cases.
463 464
</li><li><code>IllegalArgumentException</code> if a method was called with an illegal argument.
</li><li><code>UnsupportedOperationException</code> if a method was called that is not supported,
Thomas Mueller's avatar
Thomas Mueller committed
465 466
    for example trying to modify a read-only map.
</li><li><code>ConcurrentModificationException</code> if a map is modified concurrently.
467
</li></ul>
468

Thomas Mueller's avatar
Thomas Mueller committed
469
<h3 id="storageEngine">Storage Engine for H2</h3>
Thomas Mueller's avatar
Thomas Mueller committed
470
<p>
Thomas Mueller's avatar
Thomas Mueller committed
471
For H2 version 1.4 and newer, the MVStore is the default storage engine
Thomas Mueller's avatar
Thomas Mueller committed
472 473
(supporting SQL, JDBC, transactions, MVCC, and so on).
For older versions, append
Thomas Mueller's avatar
Thomas Mueller committed
474
<code>;MV_STORE=TRUE</code>
Thomas Mueller's avatar
Thomas Mueller committed
475
to the database URL.
noelgrandin's avatar
noelgrandin committed
476
Even though it can be used with the default table level locking,
Thomas Mueller's avatar
Thomas Mueller committed
477
by default the MVCC mode is enabled when using the MVStore.
Thomas Mueller's avatar
Thomas Mueller committed
478 479
</p>

480 481
<h2 id="fileFormat">File Format</h2>
<p>
482 483
The data is stored in one file.
The file contains two file headers (for safety), and a number of chunks.
484
The file headers are one block each; a block is 4096 bytes.
485
Each chunk is at least one block, but typically 200 blocks or more.
486
Data is stored in the chunks in the form of a
487
<a href="https://en.wikipedia.org/wiki/Log-structured_file_system">log structured storage</a>.
488 489 490
There is one chunk for every version.
</p>
<pre>
491
[ file header 1 ] [ file header 2 ] [ chunk ] [ chunk ] ... [ chunk ]
492
</pre>
493 494 495 496 497 498
<p>
Each chunk contains a number of B-tree pages.
As an example, the following code:
</p>
<pre>
MVStore s = MVStore.open(fileName);
499 500
MVMap&lt;Integer, String&gt; map = s.openMap("data");
for (int i = 0; i &lt; 400; i++) {
501 502 503
    map.put(i, "Hello");
}
s.commit();
504
for (int i = 0; i &lt; 100; i++) {
505 506 507 508 509 510 511 512 513
    map.put(0, "Hi");
}
s.commit();
s.close();
</pre>
<p>
will result in the following two chunks (excluding metadata):
</p>
<p>
514
<b>Chunk 1:</b><br />
515 516 517
- Page 1: (root) node with 2 entries pointing to page 2 and 3<br />
- Page 2: leaf with 140 entries (keys 0 - 139)<br />
- Page 3: leaf with 260 entries (keys 140 - 399)<br />
518 519
</p>
<p>
520
<b>Chunk 2:</b><br />
521 522
- Page 4: (root) node with 2 entries pointing to page 3 and 5<br />
- Page 5: leaf with 140 entries (keys 0 - 139)<br />
523 524
</p>
<p>
525
That means each chunk contains the changes of one version:
526
the new version of the changed pages and the parent pages, recursively, up to the root page.
527
Pages in subsequent chunks refer to pages in earlier chunks.
528
</p>
529 530 531 532

<h3>File Header</h3>
<p>
There are two file headers, which normally contain the exact same data.
533 534 535
But once in a while, the file headers are updated, and writing could partially fail,
which could corrupt a header. That's why there is a second header.
Only the file headers are updated in this way (called "in-place update").
536
The headers contain the following data:
537 538 539 540 541
</p>
<pre>
H:2,block:2,blockSize:1000,chunk:7,created:1441235ef73,format:1,version:7,fletcher:3044e6cc
</pre>
<p>
542
The data is stored in the form of a key-value pair.
543 544
Each value is stored as a hexadecimal number. The entries are:
</p>
545
<ul><li>H: The entry "H:2" stands for the the H2 database.
546 547 548 549 550
</li><li>block: The block number where one of the newest chunks starts
    (but not necessarily the newest).
</li><li>blockSize: The block size of the file; currently always hex 1000, which is decimal 4096,
    to match the <a href="https://en.wikipedia.org/wiki/Disk_sector">disk sector</a>
    length of modern hard disks.
551
</li><li>chunk: The chunk id, which is normally the same value as the version;
552
    however, the chunk id might roll over to 0, while the version doesn't.
553 554 555
</li><li>created: The number of milliseconds since 1970 when the file was created.
</li><li>format: The file format number. Currently 1.
</li><li>version: The version number of the chunk.
556 557
</li><li>fletcher: The <a href="https://en.wikipedia.org/wiki/Fletcher's_checksum">
    Fletcher-32 checksum</a> of the header.
558 559 560
</li></ul>
<p>
When opening the file, both headers are read and the checksum is verified.
561 562 563
If both headers are valid, the one with the newer version is used.
The chunk with the latest version is then detected (details about this see below),
and the rest of the metadata is read from there.
564
If the chunk id, block and version are not stored in the file header,
565
then the latest chunk lookup starts with the last chunk in the file.
566
</p>
567 568
<p>
</p>
569 570 571 572

<h3>Chunk Format</h3>
<p>
There is one chunk per version.
573 574 575 576 577 578 579 580 581 582
Each chunk consists of a header, the pages that were modified in this version, and a footer.
The pages contain the actual data of the maps.
The pages inside a chunk are stored right after the header, next to each other (unaligned).
The size of a chunk is a multiple of the block size.
The footer is stored in the last 128 bytes of the chunk.
</p>
<pre>
[ header ] [ page ] [ page ] ... [ page ] [ footer ]
</pre>
<p>
583
The footer allows to verify that the chunk is completely written (a chunk is written as one write operation),
584
and allows to find the start position of the very last chunk in the file.
585
The chunk header and footer contain the following data:
586 587
</p>
<pre>
588 589
chunk:1,block:2,len:1,map:6,max:1c0,next:3,pages:2,root:4000004f8c,time:1fc,version:1
chunk:1,block:2,version:1,fletcher:aed9a4f6
590 591
</pre>
<p>
592 593 594 595 596 597
The fields of the chunk header and footer are:
</p>
<ul><li>chunk: The chunk id.
</li><li>block: The first block of the chunk (multiply by the block size to get the position in the file).
</li><li>len: The size of the chunk in number of blocks.
</li><li>map: The id of the newest map; incremented when a new map is created.
598
</li><li>max: The sum of all maximum page sizes (see page format).
599 600 601 602 603 604 605 606
</li><li>next: The predicted start block of the next chunk.
</li><li>pages: The number of pages in the chunk.
</li><li>root: The position of the metadata root page (see page format).
</li><li>time: The time the chunk was written, in milliseconds after the file was created.
</li><li>version: The version this chunk represents.
</li><li>fletcher: The checksum of the footer.
</li></ul>
<p>
607 608
Chunks are never updated in-place. Each chunk contains the pages that were
changed in that version (there is one chunk per version, see above),
609
plus all the parent nodes of those pages, recursively, up to the root page.
610 611
If an entry in a map is changed, removed, or added, then the respective page is copied,
modified, and stored in the next chunk, and the number of live pages in the old chunk is decremented.
612 613
This mechanism is called copy-on-write, and is similar to how the
<a href="https://en.wikipedia.org/wiki/Btrfs">Btrfs</a> file system works.
614
Chunks without live pages are marked as free, so the space can be re-used by more recent chunks.
615
Because not all chunks are of the same size, there can be a number of free blocks in front of a chunk
616
for some time (until a small chunk is written or the chunks are compacted).
617
There is a <a href="http://stackoverflow.com/questions/13650134/after-how-many-seconds-are-file-system-write-buffers-typically-flushed">
618 619
delay of 45 seconds</a> (by default) before a free chunk is overwritten,
to ensure new versions are persisted first.
620 621 622 623 624
</p>
<p>
How the newest chunk is located when opening a store:
The file header contains the position of a recent chunk, but not always the newest one.
This is to reduce the number of file header updates.
625
After opening the file, the file headers, and the chunk footer of the very last chunk
626 627 628 629 630
(at the end of the file) are read.
From those candidates, the header of the most recent chunk is read.
If it contains a "next" pointer (see above), those chunk's header and footer are read as well.
If it turned out to be a newer valid chunk, this is repeated, until the newest chunk was found.
Before writing a chunk, the position of the next chunk is predicted based on the assumption
631
that the next chunk will be of the same size as the current one.
632 633 634 635 636 637 638
When the next chunk is written, and the previous
prediction turned out to be incorrect, the file header is updated as well.
In any case, the file header is updated if the next chain gets longer than 20 hops.
</p>

<h3>Page Format</h3>
<p>
639
Each map is a <a href="https://en.wikipedia.org/wiki/B-tree">B-tree</a>,
640
and the map data is stored in (B-tree-) pages.
641
There are leaf pages that contain the key-value pairs of the map,
642 643 644
and internal nodes, which only contain keys and pointers to leaf pages.
The root of a tree is either a leaf or an internal node.
Unlike file header and chunk header and footer, the page data is not human readable.
645 646
Instead, it is stored as byte arrays, with long (8 bytes), int (4 bytes), short (2 bytes),
and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity">variable size int and long</a>
647
(1 to 5 / 10 bytes). The page format is:
648 649 650 651 652 653
</p>
<ul><li>length (int): Length of the page in bytes.
</li><li>checksum (short): Checksum (chunk id xor offset within the chunk xor page length).
</li><li>mapId (variable size int): The id of the map this page belongs to.
</li><li>len (variable size int): The number of keys in the page.
</li><li>type (byte): The page type (0 for leaf page, 1 for internal node;
654 655
    plus 2 if the keys and values are compressed with the LZF algorithm, or
    plus 6 if the keys and values are compressed with the Deflate algorithm).
656
</li><li>children (array of long; internal nodes only): The position of the children.
657
</li><li>childCounts (array of variable size long; internal nodes only):
658
    The total number of entries for the given child page.
659
</li><li>keys (byte array): All keys, stored depending on the data type.
660 661 662
</li><li>values (byte array; leaf pages only): All values, stored depending on the data type.
</li></ul>
<p>
663
Even though this is not required by the file format, pages are stored in the following order:
Thomas Mueller's avatar
Thomas Mueller committed
664
For each map, the root page is stored first, then the internal nodes (if there are any),
665 666 667
and then the leaf pages.
This should speed up reads for media where sequential reads are faster than random access reads.
The metadata map is stored at the end of a chunk.
668 669
</p>
<p>
670
Pointers to pages are stored as a long, using a special format:
671
26 bits for the chunk id, 32 bits for the offset within the chunk, 5 bits for the length code,
672
1 bit for the page type (leaf or internal node).
673 674 675 676 677
The page type is encoded so that when clearing or removing a map, leaf pages don't
have to be read (internal nodes do have to be read in order to know where all the pages are;
but in a typical B-tree the vast majority of the pages are leaf pages).
The absolute file position is not included so that chunks can be
moved within the file without having to change page pointers;
678
only the chunk metadata needs to be changed.
679 680 681
The length code is a number from 0 to 31, where 0 means the maximum length
of the page is 32 bytes, 1 means 48 bytes, 2: 64, 3: 96, 4: 128, 5: 192, and so on until 31 which
means longer than 1 MB. That way, reading a page only requires one
Thomas Mueller's avatar
Thomas Mueller committed
682
read operation (except for very large pages).
683 684
The sum of the maximum length of all pages is stored in the chunk metadata (field "max"),
and when a page is marked as removed, the live maximum length is adjusted.
685 686 687
This allows to estimate the amount of free space within a block, in addition to the number of free pages.
</p>
<p>
688
The total number of entries in child pages are kept to allow efficient range counting,
689
lookup by index, and skip operations.
690
The pages form a <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html">counted B-tree</a>.
691 692 693
</p>
<p>
Data compression: The data after the page type are optionally compressed using the LZF algorithm.
694 695
</p>

696 697
<h3>Metadata Map</h3>
<p>
698
In addition to the user maps, there is one metadata map that contains names and
699 700
positions of user maps, and chunk metadata.
The very last page of a chunk contains the root page of that metadata map.
701
The exact position of this root page is stored in the chunk header.
702 703 704 705 706 707 708 709 710
This page (directly or indirectly) points to the root pages of all other maps.
The metadata map of a store with a map named "data", and one chunk,
contains the following entries:
</p>
<ul><li>chunk.1: The metadata of chunk 1. This is the same data as the chunk header,
    plus the number of live pages, and the maximum live length.
</li><li>map.1: The metadata of map 1. The entries are: name, createVersion, and type.
</li><li>name.data: The map id of the map named "data". The value is "1".
</li><li>root.1: The root position of map 1.
Thomas Mueller's avatar
Thomas Mueller committed
711
</li><li>setting.storeVersion: The store version (a user defined value).
712
</li></ul>
713

714 715
<h2 id="differences">Similar Projects and Differences to Other Storage Engines</h2>
<p>
716
Unlike similar storage engines like LevelDB and Kyoto Cabinet,
Thomas Mueller's avatar
Thomas Mueller committed
717 718
the MVStore is written in Java
and can easily be embedded in a Java and Android application.
719
</p><p>
720
The MVStore is somewhat similar to the Berkeley DB Java Edition
Thomas Mueller's avatar
Thomas Mueller committed
721
because it is also written in Java,
722
and is also a log structured storage, but the H2 license is more liberal.
723
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
724 725 726 727
Like SQLite 3, the MVStore keeps all data in one file.
Unlike SQLite 3, the MVStore uses is a log structured storage.
The plan is to make the MVStore both easier to use as well as faster than SQLite 3.
In a recent (very simple) test, the MVStore was about twice as fast as SQLite 3 on Android.
728 729
</p><p>
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
Thomas Mueller's avatar
Thomas Mueller committed
730
and some code is shared between MVStore and MapDB.
731
However, unlike MapDB, the MVStore uses is a log structured storage.
Thomas Mueller's avatar
Thomas Mueller committed
732
The MVStore does not have a record size limit.
733 734 735 736
</p>

<h2 id="current_state">Current State</h2>
<p>
Thomas Mueller's avatar
Thomas Mueller committed
737 738
The code is still experimental at this stage.
The API as well as the behavior may partially change.
noelgrandin's avatar
noelgrandin committed
739
Features may be added and removed (even though the main features will stay).
740 741 742 743
</p>

<h2 id="requirements">Requirements</h2>
<p>
Thomas Mueller's avatar
Thomas Mueller committed
744 745 746
The MVStore is included in the latest H2 jar file.
</p><p>
There are no special requirements to use it.
Thomas Mueller's avatar
Thomas Mueller committed
747
The MVStore should run on any JVM as well as on Android.
748 749 750 751 752 753 754 755
</p><p>
To build just the MVStore (without the database engine), run:
</p>
<pre>
./build.sh jarMVStore
</pre>
<p>
This will create the file <code>bin/h2mvstore-${version}.jar</code> (about 130 KB).
756 757 758
</p>

<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>