mvstore.html 20.7 KB
Newer Older
1 2
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Thomas Mueller's avatar
Thomas Mueller committed
3
Copyright 2004-2013 H2 Group. Multiple-Licensed under the H2 License, Version 1.0,
4 5 6 7 8 9
and under the Eclipse Public License, Version 1.0
(http://h2database.com/html/license.html).
Initial Developer: H2 Group
-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>
10
MVStore
11 12 13 14 15 16 17 18 19 20 21 22
</title><link rel="stylesheet" type="text/css" href="stylesheet.css" />
<!-- [search] { -->
<script type="text/javascript" src="navigation.js"></script>
</head><body onload="frameMe();">
<table class="content"><tr class="content"><td class="content"><div class="contentDiv">
<!-- } -->

<h1>MVStore</h1>
<a href="#overview">
    Overview</a><br />
<a href="#example_code">
    Example Code</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
23 24 25 26
<a href="#store_builder">
    Store Builder</a><br />
<a href="#r_tree">
    R-Tree</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
27

28 29
<a href="#features">
    Features</a><br />
Thomas Mueller's avatar
Thomas Mueller committed
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
<a href="#maps">- Maps</a><br />
<a href="#versions">- Versions</a><br />
<a href="#transactions">- Transactions</a><br />
<a href="#inMemory">- In-Memory Performance and Usage</a><br />
<a href="#dataTypes">- Pluggable Data Types</a><br />
<a href="#blob">- BLOB Support</a><br />
<a href="#pluggableMap">- R-Tree and Pluggable Map Implementations</a><br />
<a href="#caching">- Concurrent Operations and Caching</a><br />
<a href="#logStructured">- Log Structured Storage</a><br />
<a href="#fileSystem">- File System Abstraction, File Locking and Online Backup</a><br />
<a href="#encryption">- Encrypted Files</a><br />
<a href="#tools">- Tools</a><br />
<a href="#exceptionHandling">- Exception Handling</a><br />
<a href="#tableEngine">- Table Engine for H2</a><br />

45 46 47 48 49 50 51 52
<a href="#differences">
    Similar Projects and Differences to Other Storage Engines</a><br />
<a href="#current_state">
    Current State</a><br />
<a href="#requirements">
    Requirements</a><br />

<h2 id="overview">Overview</h2>
53 54
<p>
The MVStore is work in progress, and is planned to be the next storage subsystem of H2.
Thomas Mueller's avatar
Thomas Mueller committed
55
But it can be also directly within an application, without using JDBC or SQL.
56
</p>
57
<ul><li>MVStore stands for "multi-version store".
58
</li><li>Each store contains a number of maps (using the <code>java.util.Map</code> interface).
59
</li><li>Both file-based persistence and in-memory operation are supported.
60 61
</li><li>It is intended to be fast, simple to use, and small.
</li><li>Old versions of the data can be read concurrently with all other operations.
62
</li><li>Transaction are supported (including concurrent transactions and 2-phase commit).
63
</li><li>The tool is very modular. It supports pluggable data types / serialization,
64
pluggable map implementations (B-tree, R-tree, concurrent B-tree currently), BLOB storage,
65
and a file system abstraction to support encrypted files and zip files.
66 67 68
</li></ul>

<h2 id="example_code">Example Code</h2>
69 70
<p>
The following sample code show how to create a store,
71
open a map, add some data, and access the current and an old version:
72
</p>
73
<pre>
Thomas Mueller's avatar
Thomas Mueller committed
74 75
import org.h2.mvstore.*;

76 77 78
// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);

79
// create/get the map named "data"
80
MVMap&lt;Integer, String&gt; map = s.openMap("data");
81 82

// add some data
83 84
map.put(1, "Hello");
map.put(2, "World");
85 86 87 88 89 90 91 92 93

// get the current version, for later use
long oldVersion = s.getCurrentVersion();

// from now on, the old version is read-only
s.incrementVersion();

// more changes, in the new version
// changes can be rolled back if required
94
// changes always go into "head" (the newest version)
95 96
map.put(1, "Hi");
map.remove(2);
97 98

// access the old data (before incrementVersion)
99
MVMap&lt;Integer, String&gt; oldMap =
100 101
        map.openVersion(oldVersion);

Thomas Mueller's avatar
Thomas Mueller committed
102 103
// mark the changes as committed
s.commit();
104 105 106

// print the old version (can be done
// concurrently with further modifications)
107
// this will print "Hello" and "World":
108 109 110
System.out.println(oldMap.get(1));
System.out.println(oldMap.get(2));
oldMap.close();
111 112

// print the newest version ("Hi")
113
System.out.println(map.get(1));
114 115 116 117 118

// close the store - this doesn't write to disk
s.close();
</pre>

Thomas Mueller's avatar
Thomas Mueller committed
119
<h2 id="store_builder">Store Builder</h2>
120
<p>
121
The <code>MVStore.Builder</code> provides a fluid interface
Thomas Mueller's avatar
Thomas Mueller committed
122 123
to build a store if more complex configuration options are used.
The following code contains all supported configuration options:
124 125
</p>
<pre>
126
MVStore s = new MVStore.Builder().
Thomas Mueller's avatar
Thomas Mueller committed
127 128
    cacheSize(10).
    compressData().
129
    encryptionKey("007".toCharArray()).
Thomas Mueller's avatar
Thomas Mueller committed
130
    fileName(fileName).
131
    readOnly().
Thomas Mueller's avatar
Thomas Mueller committed
132 133
    writeBufferSize(8).
    writeDelay(100).
134 135
    open();
</pre>
Thomas Mueller's avatar
Thomas Mueller committed
136 137 138 139 140 141 142 143 144
<ul><li>cacheSizeMB: the cache size in MB.
</li><li>compressData: compress the data when storing.
</li><li>encryptionKey: the encryption key for file encryption.
</li><li>fileName: the name of the file, for file based stores.
</li><li>readOnly: open the file in read-only mode.
</li><li>writeBufferSize: the size of the write buffer in MB.
</li><li>writeDelay: the maximum delay until committed
    changes are stored (unless stored explicitly).
</li></ul>
145

Thomas Mueller's avatar
Thomas Mueller committed
146
<h2 id="r_tree">R-Tree</h2>
147 148
<p>
The <code>MVRTreeMap</code> is an R-tree implementation
149
that supports fast spatial queries. It can be used as follows:
150 151 152 153 154
</p>
<pre>
// create an in-memory store
MVStore s = MVStore.open(null);

155
// open an R-tree map
156 157
MVRTreeMap&lt;String&gt; r = s.openMap("data",
        new MVRTreeMap.Builder&lt;String&gt;());
158 159 160 161 162 163 164 165

// add two key-value pairs
// the first value is the key id (to make the key unique)
// then the min x, max x, min y, max y
r.add(new SpatialKey(0, -3f, -2f, 2f, 3f), "left");
r.add(new SpatialKey(1, 3f, 4f, 4f, 5f), "right");

// iterate over the intersecting keys
166
Iterator&lt;SpatialKey&gt; it =
167 168 169 170 171 172 173
        r.findIntersectingKeys(new SpatialKey(0, 0f, 9f, 3f, 6f));
for (SpatialKey k; it.hasNext();) {
    k = it.next();
    System.out.println(k + ": " + r.get(k));
}
s.close();
</pre>
Thomas Mueller's avatar
Thomas Mueller committed
174 175
<p>
The default number of dimensions is 2. To use a different number of dimensions,
Thomas Mueller's avatar
Thomas Mueller committed
176
call <code>new MVRTreeMap.Builder&lt;String&gt;().dimensions(3)</code>.
Thomas Mueller's avatar
Thomas Mueller committed
177 178
The minimum number of dimensions is 1, the maximum is 255.
</p>
179 180 181

<h2 id="features">Features</h2>

182
<h3 id="maps">Maps</h3>
183 184
<p>
Each store supports a set of named maps.
185
A map is sorted by key, and supports the common lookup operations,
186 187
including access to the first and last key, iterate over some or all keys, and so on.
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
188 189
Also supported, and very uncommon for maps, is fast index lookup:
the keys of the map can be accessed like a list
190 191
(get the key at the given index, get the index of a certain key).
That means getting the median of two keys is trivial,
Thomas Mueller's avatar
Thomas Mueller committed
192
and range of keys can be counted very quickly.
193 194
The iterator supports fast skipping.
This is possible because internally, each map is organized in the form of a counted B+-tree.
195
</p><p>
196
In database terms, a map can be used like a table, where the key of the map is the primary key of the table,
197
and the value is the row. A map can also represent an index, where the key of the map is the key
Thomas Mueller's avatar
Thomas Mueller committed
198
of the index, and the value of the map is the primary key of the table (for non-unique indexes,
199
the key of the map must also contain the primary key).
200 201
</p>

202
<h3 id="versions">Versions</h3>
203 204 205 206 207
<p>
Multiple versions are supported.
A version is a snapshot of all the data of all maps at a given point in time.
A transaction is a number of actions between two versions.
</p><p>
208
Versions are not immediately persisted; instead, only the version counter is incremented.
209
If there is a change after switching to a new version, a snapshot of the old version is kept in memory,
Thomas Mueller's avatar
Thomas Mueller committed
210
so that it can still be read.
211 212
</p><p>
Old persisted versions are readable until the old data was explicitly overwritten.
Thomas Mueller's avatar
Thomas Mueller committed
213
Creating a snapshot is fast: only the pages that are changed after a snapshot are copied.
Thomas Mueller's avatar
Thomas Mueller committed
214
This behavior is also called COW (copy on write).
215 216 217 218
</p><p>
Rollback is supported (rollback to any old in-memory version or an old persisted version).
</p>

219 220 221 222 223 224 225 226 227 228 229 230 231 232
<h3 id="transactions">Transactions</h3>
<p>
The multi-version support is the basis for the transaction support.
In the simple case, when only one transaction is open at a time,
rolling back the transaction only requires to revert to an old version.
</p><p>
To support multiple concurrent open transactions, a transaction utility is included,
the <code>TransactionStore</code>.
This utility stores the changed entries in a separate map, similar to a transaction log
(except that only the key of a changed row is stored,
and the entries of a transaction are removed when the transaction is committed).
The storage overhead of this utility is very small compared to the overhead of a regular transaction log.
The tool supports PostgreSQL style "read committed" transaction isolation.
There is no limit on the size of a transaction (the log is not kept in memory).
233
The tool supports savepoints, two-phase commit, and other features typically available in a database.
234 235 236
</p>

<h3 id="inMemory">In-Memory Performance and Usage</h3>
237 238 239 240 241
<p>
Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>
(many operations are actually faster), but usually slower than <code>java.util.HashMap</code>.
</p><p>
The memory overhead for large maps is slightly better than for the regular
242 243
map implementations, but there is a higher overhead per map.
For maps with less than 25 entries, the regular map implementations
244
use less memory on average.
Thomas Mueller's avatar
Thomas Mueller committed
245 246 247 248 249 250
</p><p>
If no file name is specified, the store operates purely in memory.
Except for persisting data, all features are supported in this mode
(multi-versioning, index lookup, R-tree and so on).
If a file name is specified, all operations occur in memory (with the same
performance characteristics) until data is persisted.
251 252
</p>

253
<h3 id="dataTypes">Pluggable Data Types</h3>
254
<p>
255
Serialization is pluggable. The default serialization currently supports many common data types,
256 257
and uses Java serialization for other objects. The following classes are currently directly supported:
<code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal,
Thomas Mueller's avatar
Thomas Mueller committed
258
String, UUID, Date</code> and arrays (both primitive arrays and object arrays).
259 260 261 262
</p><p>
Parameterized data types are supported
(for example one could build a string data type that limits the length for some reason).
</p><p>
263
The storage engine itself does not have any length limits, so that keys, values,
264 265 266 267 268
pages, and chunks can be very big (as big as fits in memory).
Also, there is no inherent limit to the number of maps and chunks.
Due to using a log structured storage, there is no special case handling for large keys or pages.
</p>

269
<h3 id="blob">BLOB Support</h3>
270
<p>
271
There is a mechanism that stores large binary objects by splitting them into smaller blocks.
272 273 274 275 276
This allows to store objects that don't fit in memory.
Streaming as well as random access reads on such objects are supported.
This tool is written on top of the store (only using the map interface).
</p>

277
<h3 id="pluggableMap">R-Tree and Pluggable Map Implementations</h3>
278 279
<p>
The map implementation is pluggable.
Thomas Mueller's avatar
Thomas Mueller committed
280
In addition to the default <code>MVMap</code> (multi-version map),
281 282 283 284
there is a multi-version R-tree map implementation
for spatial operations (contain and intersection; nearest neighbor is not yet implemented).
</p>

285
<h3 id="caching">Concurrent Operations and Caching</h3>
286
<p>
287
The default map implementation supports concurrent reads on old versions of the data.
288
All such read operations can occur in parallel. Concurrent reads from the page cache,
289 290
as well as concurrent reads from the file system are supported.
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
291
Storing changes can occur concurrently to modifying the data, as it operates on a snapshot.
292
</p><p>
293
Caching is done on the page level.
294
The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
295
</p><p>
296
The default map implementation does not support concurrent modification
297
operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>).
298
Similar to those classes, the map tries to detect concurrent modification.
299
</p><p>
300
With the <code>MVMapConcurrent</code> implementation,
301 302
read operations even on the newest version can happen concurrently with all other
operations, without risk of corruption.
303
This comes with slightly reduced speed in single threaded mode,
304 305 306 307 308
the same as with other <code>ConcurrentHashMap</code> implementations.
Write operations first read the relevant area from disk to memory
(this can happen concurrently), and only then modify the data. The in-memory part of write
operations is synchronized.
</p><p>
309 310
For fully scalable concurrent write operations to a map (in-memory and to disk),
the map could be split into multiple maps in different stores ('sharding').
311
The plan is to add such a mechanism later when needed.
312 313
</p>

314
<h3 id="logStructured">Log Structured Storage</h3>
Thomas Mueller's avatar
Thomas Mueller committed
315
<p>
316
Changes are buffered in memory, and once enough changes have accumulated,
Thomas Mueller's avatar
Thomas Mueller committed
317
they are written in one continuous disk write operation.
318
(According to a test, write throughput of a common SSD gets higher the larger the block size,
319
until a block size of 2 MB, and then does not further increase.)
320
By default, committed changes are automatically written once every second
Thomas Mueller's avatar
Thomas Mueller committed
321 322
in a background thread, even if only little data was changed.
Changes can also be written explicitly by calling <code>store()</code>.
323
To avoid out of memory, uncommitted changes are also written when needed,
Thomas Mueller's avatar
Thomas Mueller committed
324 325
however they are rolled back when closing the store,
or at the latest (when the store was not correctly closed) when opening the store.
Thomas Mueller's avatar
Thomas Mueller committed
326 327
</p><p>
When storing, all changed pages are serialized,
328
optionally compressed using the LZF algorithm,
Thomas Mueller's avatar
Thomas Mueller committed
329 330 331 332
and written sequentially to a free area of the file.
Each such change set is called a chunk.
All parent pages of the changed B-trees are stored in this chunk as well,
so that each chunk also contains the root of each changed map
333
(which is the entry point to read this version of the data).
Thomas Mueller's avatar
Thomas Mueller committed
334
There is no separate index: all data is stored as a list of pages.
Thomas Mueller's avatar
Thomas Mueller committed
335
Per store, there is one additional map that contains the metadata (the list of
Thomas Mueller's avatar
Thomas Mueller committed
336 337
maps, where the root page of each map is stored, and the list of chunks).
</p><p>
338 339 340
There are usually two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk).
If the chunk is appended at the end of the file, the file header is only written at the end of the chunk.
Thomas Mueller's avatar
Thomas Mueller committed
341
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
342 343
There is no transaction log, no undo log,
and there are no in-place updates (however unused chunks are overwritten by default).
Thomas Mueller's avatar
Thomas Mueller committed
344
</p><p>
345 346 347
Old data is kept for at least 45 seconds (configurable),
so that there are no explicit sync operations required to guarantee data consistency,
but an application can also sync explicitly when needed.
Thomas Mueller's avatar
Thomas Mueller committed
348 349 350 351
To reuse disk space, the chunks with the lowest amount of live data are compacted
(the live data is simply stored again in the next chunk).
To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
352
Compared to traditional storage engines (that use a transaction log, undo log, and main storage area),
Thomas Mueller's avatar
Thomas Mueller committed
353 354 355 356 357 358 359
the log structured storage is simpler, more flexible, and typically needs less disk operations per change,
as data is only written once instead of twice or 3 times, and because the B-tree pages are
always full (they are stored next to each other) and can be easily compressed.
But temporarily, disk space usage might actually be a bit higher than for a regular database,
as disk space is not immediately re-used (there are no in-place updates).
</p>

360
<h3 id="fileSystem">File System Abstraction, File Locking and Online Backup</h3>
361 362
<p>
The file system is pluggable (the same file system abstraction is used as H2 uses).
Thomas Mueller's avatar
Thomas Mueller committed
363 364
The file can be encrypted using an encrypting file system.
Other file system implementations support reading from a compressed zip or jar file.
365 366 367 368 369 370 371 372 373
</p>
<p>
Each store may only be opened once within a JVM.
When opening a store, the file is locked in exclusive mode, so that
the file can only be changed from within one process.
Files can be opened in read-only mode, in which case a shared lock is used.
</p>
<p>
The persisted data can be backed up to a different file at any time,
374
even during write operations (online backup).
375
To do that, automatic disk space reuse needs to be first disabled, so that
376
new data is always appended at the end of the file.
377
Then, the file can be copied (the file handle is available to the application).
378 379
</p>

380
<h3 id="encryption">Encrypted Files</h3>
381
<p>
382
File encryption ensures the data can only be read with the correct password.
383 384 385 386 387 388 389 390 391
Data can be encrypted as follows:
</p>
<pre>
MVStore s = new MVStore.Builder().
    fileName(fileName).
    encryptionKey("007".toCharArray()).
    open();
</pre>
<p>
392
</p><p>
Thomas Mueller's avatar
Thomas Mueller committed
393
The following algorithms and settings are used:
394
</p>
395
<ul><li>The password char array is cleared after use,
396 397
    to reduce the risk that the password is stolen
    even if the attacker has access to the main memory.
398
</li><li>The password is hashed according to the PBKDF2 standard,
399
    using the SHA-256 hash algorithm.
Thomas Mueller's avatar
Thomas Mueller committed
400
</li><li>The length of the salt is 64 bits,
401
    so that an attacker can not use a pre-calculated password hash table (rainbow table).
Thomas Mueller's avatar
Thomas Mueller committed
402 403 404 405 406 407 408
    It is generated using a cryptographically secure random number generator.
</li><li>To speed up opening an encrypted stores on Android,
    the number of PBKDF2 iterations is 10.
    The higher the value, the better the protection against brute-force password cracking attacks,
    but the slower is opening a file.
</li><li>The file itself is encrypted using the standardized disk encryption mode XTS-AES.
    Only little more than one AES-128 round per block is needed.
409
</li></ul>
410

411
<h3 id="tools">Tools</h3>
412 413 414
<p>
There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
</p>
415

416
<h3 id="exceptionHandling">Exception Handling</h3>
417
<p>
418 419
This tool does not throw checked exceptions.
Instead, unchecked exceptions are thrown if needed.
420 421 422
The error message always contains the version of the tool.
The following exceptions can occur:
</p>
423
<ul><li><code>IllegalStateException</code> if a map was already closed or
424 425 426
    an IO exception occurred, for example if the file was locked, is already closed,
    could not be opened or closed, if reading or writing failed,
    if the file is corrupt, or if there is an internal error in the tool.
427 428 429
</li><li><code>IllegalArgumentException</code> if a method was called with an illegal argument.
</li><li><code>UnsupportedOperationException</code> if a method was called that is not supported,
    for example trying to modify a read-only map or view.
430
</li><li><code>ConcurrentModificationException</code> if the object is modified concurrently.
431
</li></ul>
432

Thomas Mueller's avatar
Thomas Mueller committed
433 434 435 436 437
<h3 id="tableEngine">Table Engine for H2</h3>
<p>
The plan is to use the MVStore as the default storage engine for the H2 database
in the future (supporting SQL, JDBC, transactions, MVCC, and so on).
This is work in progress. To try it out, append
Thomas Mueller's avatar
Thomas Mueller committed
438
<code>;MV_STORE=TRUE</code>
Thomas Mueller's avatar
Thomas Mueller committed
439 440 441 442
to the database URL. In general, functionality and performance should be
similar than the current default storage engine (the page store).
There are a few features that have not been implemented yet or are not complete:
</p>
Thomas Mueller's avatar
Thomas Mueller committed
443
<ul><li>There is still a file <code>.h2.db</code>,
444 445
    and the <code>.lock.db</code> file is still used to lock a database
    (long term, the plan is to no longer use those files).
Thomas Mueller's avatar
Thomas Mueller committed
446 447 448
</li><li>The database file(s) sometimes do not shrink as expected.
</li></ul>

449 450
<h2 id="differences">Similar Projects and Differences to Other Storage Engines</h2>
<p>
451
Unlike similar storage engines like LevelDB and Kyoto Cabinet,
Thomas Mueller's avatar
Thomas Mueller committed
452 453
the MVStore is written in Java
and can easily be embedded in a Java and Android application.
454
</p><p>
455
The MVStore is somewhat similar to the Berkeley DB Java Edition
Thomas Mueller's avatar
Thomas Mueller committed
456
because it is also written in Java,
457
and is also a log structured storage, but the H2 license is more liberal.
458 459
</p><p>
Like SQLite, the MVStore keeps all data in one file.
Thomas Mueller's avatar
Thomas Mueller committed
460 461 462
Unlike SQLite, the MVStore uses is a log structured storage.
The plan is to make the MVStore both easier to use as well as faster than SQLite.
In a recent (very simple) test, the MVStore was about twice as fast as SQLite on Android.
463 464
</p><p>
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
465
and some code is shared between MapDB and JDBM.
466
However, unlike MapDB, the MVStore uses is a log structured storage.
Thomas Mueller's avatar
Thomas Mueller committed
467
The MVStore does not have a record size limit.
468 469 470 471
</p>

<h2 id="current_state">Current State</h2>
<p>
Thomas Mueller's avatar
Thomas Mueller committed
472 473
The code is still experimental at this stage.
The API as well as the behavior may partially change.
474 475 476 477 478
Features may be added and removed (even thought the main features will stay).
</p>

<h2 id="requirements">Requirements</h2>
<p>
Thomas Mueller's avatar
Thomas Mueller committed
479 480 481
The MVStore is included in the latest H2 jar file.
</p><p>
There are no special requirements to use it.
Thomas Mueller's avatar
Thomas Mueller committed
482
The MVStore should run on any JVM as well as on Android.
483 484 485 486 487 488 489 490
</p><p>
To build just the MVStore (without the database engine), run:
</p>
<pre>
./build.sh jarMVStore
</pre>
<p>
This will create the file <code>bin/h2mvstore-${version}.jar</code> (about 130 KB).
491 492 493
</p>

<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>