1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Copyright 2004-2013 H2 Group. Multiple-Licensed under the H2 License, Version 1.0,
and under the Eclipse Public License, Version 1.0
(http://h2database.com/html/license.html).
Initial Developer: H2 Group
-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>
MVStore
</title><link rel="stylesheet" type="text/css" href="stylesheet.css" />
<!-- [search] { -->
<script type="text/javascript" src="navigation.js"></script>
</head><body onload="frameMe();">
<table class="content"><tr class="content"><td class="content"><div class="contentDiv">
<!-- } -->
<h1>MVStore</h1>
<a href="#overview">
Overview</a><br />
<a href="#example_code">
Example Code</a><br />
<a href="#store_builder">
Store Builder</a><br />
<a href="#r_tree">
R-Tree</a><br />
<a href="#features">
Features</a><br />
- <a href="#maps">Maps</a><br />
- <a href="#versions">Versions</a><br />
- <a href="#transactions">Transactions</a><br />
- <a href="#inMemory">In-Memory Performance and Usage</a><br />
- <a href="#dataTypes">Pluggable Data Types</a><br />
- <a href="#blob">BLOB Support</a><br />
- <a href="#pluggableMap">R-Tree and Pluggable Map Implementations</a><br />
- <a href="#caching">Concurrent Operations and Caching</a><br />
- <a href="#logStructured">Log Structured Storage</a><br />
- <a href="#fileSystem">File System Abstraction, File Locking and Online Backup</a><br />
- <a href="#encryption">Encrypted Files</a><br />
- <a href="#tools">Tools</a><br />
- <a href="#exceptionHandling">Exception Handling</a><br />
- <a href="#tableEngine">Table Engine for H2</a><br />
<a href="#differences">
Similar Projects and Differences to Other Storage Engines</a><br />
<a href="#current_state">
Current State</a><br />
<a href="#requirements">
Requirements</a><br />
<h2 id="overview">Overview</h2>
<p>
The MVStore is work in progress, and is planned to be the next storage subsystem of H2.
But it can be also directly within an application, without using JDBC or SQL.
</p>
<ul><li>MVStore stands for "multi-version store".
</li><li>Each store contains a number of maps (using the <code>java.util.Map</code> interface).
</li><li>Both file-based persistence and in-memory operation are supported.
</li><li>It is intended to be fast, simple to use, and small.
</li><li>Old versions of the data can be read concurrently with all other operations.
</li><li>Transaction are supported (including concurrent transactions and 2-phase commit).
</li><li>The tool is very modular. It supports pluggable data types / serialization,
pluggable map implementations (B-tree, R-tree, concurrent B-tree currently), BLOB storage,
and a file system abstraction to support encrypted files and zip files.
</li></ul>
<h2 id="example_code">Example Code</h2>
<p>
The following sample code show how to create a store,
open a map, add some data, and access the current and an old version:
</p>
<pre>
import org.h2.mvstore.*;
// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);
// create/get the map named "data"
MVMap<Integer, String> map = s.openMap("data");
// add some data
map.put(1, "Hello");
map.put(2, "World");
// get the current version, for later use
long oldVersion = s.getCurrentVersion();
// from now on, the old version is read-only
s.incrementVersion();
// more changes, in the new version
// changes can be rolled back if required
// changes always go into "head" (the newest version)
map.put(1, "Hi");
map.remove(2);
// access the old data (before incrementVersion)
MVMap<Integer, String> oldMap =
map.openVersion(oldVersion);
// mark the changes as committed
s.commit();
// print the old version (can be done
// concurrently with further modifications)
// this will print "Hello" and "World":
System.out.println(oldMap.get(1));
System.out.println(oldMap.get(2));
oldMap.close();
// print the newest version ("Hi")
System.out.println(map.get(1));
// close the store - this doesn't write to disk
s.close();
</pre>
<h2 id="store_builder">Store Builder</h2>
<p>
The <code>MVStore.Builder</code> provides a fluid interface
to build a store if more complex configuration options are used.
The following code contains all supported configuration options:
</p>
<pre>
MVStore s = new MVStore.Builder().
cacheSize(10).
compressData().
encryptionKey("007".toCharArray()).
fileName(fileName).
readOnly().
writeBufferSize(8).
writeDelay(100).
open();
</pre>
<ul><li>cacheSizeMB: the cache size in MB.
</li><li>compressData: compress the data when storing.
</li><li>encryptionKey: the encryption key for file encryption.
</li><li>fileName: the name of the file, for file based stores.
</li><li>readOnly: open the file in read-only mode.
</li><li>writeBufferSize: the size of the write buffer in MB.
</li><li>writeDelay: the maximum delay until committed
changes are stored (unless stored explicitly).
</li></ul>
<h2 id="r_tree">R-Tree</h2>
<p>
The <code>MVRTreeMap</code> is an R-tree implementation
that supports fast spatial queries. It can be used as follows:
</p>
<pre>
// create an in-memory store
MVStore s = MVStore.open(null);
// open an R-tree map
MVRTreeMap<String> r = s.openMap("data",
new MVRTreeMap.Builder<String>());
// add two key-value pairs
// the first value is the key id (to make the key unique)
// then the min x, max x, min y, max y
r.add(new SpatialKey(0, -3f, -2f, 2f, 3f), "left");
r.add(new SpatialKey(1, 3f, 4f, 4f, 5f), "right");
// iterate over the intersecting keys
Iterator<SpatialKey> it =
r.findIntersectingKeys(new SpatialKey(0, 0f, 9f, 3f, 6f));
for (SpatialKey k; it.hasNext();) {
k = it.next();
System.out.println(k + ": " + r.get(k));
}
s.close();
</pre>
<p>
The default number of dimensions is 2. To use a different number of dimensions,
call <code>new MVRTreeMap.Builder<String>().dimensions(3)</code>.
The minimum number of dimensions is 1, the maximum is 255.
</p>
<h2 id="features">Features</h2>
<h3 id="maps">Maps</h3>
<p>
Each store supports a set of named maps.
A map is sorted by key, and supports the common lookup operations,
including access to the first and last key, iterate over some or all keys, and so on.
</p><p>
Also supported, and very uncommon for maps, is fast index lookup:
the keys of the map can be accessed like a list
(get the key at the given index, get the index of a certain key).
That means getting the median of two keys is trivial,
and range of keys can be counted very quickly.
The iterator supports fast skipping.
This is possible because internally, each map is organized in the form of a counted B+-tree.
</p><p>
In database terms, a map can be used like a table, where the key of the map is the primary key of the table,
and the value is the row. A map can also represent an index, where the key of the map is the key
of the index, and the value of the map is the primary key of the table (for non-unique indexes,
the key of the map must also contain the primary key).
</p>
<h3 id="versions">Versions</h3>
<p>
Multiple versions are supported.
A version is a snapshot of all the data of all maps at a given point in time.
A transaction is a number of actions between two versions.
</p><p>
Versions are not immediately persisted; instead, only the version counter is incremented.
If there is a change after switching to a new version, a snapshot of the old version is kept in memory,
so that it can still be read.
</p><p>
Old persisted versions are readable until the old data was explicitly overwritten.
Creating a snapshot is fast: only the pages that are changed after a snapshot are copied.
This behavior is also called COW (copy on write).
</p><p>
Rollback is supported (rollback to any old in-memory version or an old persisted version).
</p>
<h3 id="transactions">Transactions</h3>
<p>
The multi-version support is the basis for the transaction support.
In the simple case, when only one transaction is open at a time,
rolling back the transaction only requires to revert to an old version.
</p><p>
To support multiple concurrent open transactions, a transaction utility is included,
the <code>TransactionStore</code>.
This utility stores the changed entries in a separate map, similar to a transaction log
(except that only the key of a changed row is stored,
and the entries of a transaction are removed when the transaction is committed).
The storage overhead of this utility is very small compared to the overhead of a regular transaction log.
The tool supports PostgreSQL style "read committed" transaction isolation.
There is no limit on the size of a transaction (the log is not kept in memory).
The tool supports savepoints, two-phase commit, and other features typically available in a database.
</p>
<h3 id="inMemory">In-Memory Performance and Usage</h3>
<p>
Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>
(many operations are actually faster), but usually slower than <code>java.util.HashMap</code>.
</p><p>
The memory overhead for large maps is slightly better than for the regular
map implementations, but there is a higher overhead per map.
For maps with less than 25 entries, the regular map implementations
use less memory on average.
</p><p>
If no file name is specified, the store operates purely in memory.
Except for persisting data, all features are supported in this mode
(multi-versioning, index lookup, R-tree and so on).
If a file name is specified, all operations occur in memory (with the same
performance characteristics) until data is persisted.
</p>
<h3 id="dataTypes">Pluggable Data Types</h3>
<p>
Serialization is pluggable. The default serialization currently supports many common data types,
and uses Java serialization for other objects. The following classes are currently directly supported:
<code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal,
String, UUID, Date</code> and arrays (both primitive arrays and object arrays).
</p><p>
Parameterized data types are supported
(for example one could build a string data type that limits the length for some reason).
</p><p>
The storage engine itself does not have any length limits, so that keys, values,
pages, and chunks can be very big (as big as fits in memory).
Also, there is no inherent limit to the number of maps and chunks.
Due to using a log structured storage, there is no special case handling for large keys or pages.
</p>
<h3 id="blob">BLOB Support</h3>
<p>
There is a mechanism that stores large binary objects by splitting them into smaller blocks.
This allows to store objects that don't fit in memory.
Streaming as well as random access reads on such objects are supported.
This tool is written on top of the store (only using the map interface).
</p>
<h3 id="pluggableMap">R-Tree and Pluggable Map Implementations</h3>
<p>
The map implementation is pluggable.
In addition to the default <code>MVMap</code> (multi-version map),
there is a multi-version R-tree map implementation
for spatial operations (contain and intersection; nearest neighbor is not yet implemented).
</p>
<h3 id="caching">Concurrent Operations and Caching</h3>
<p>
The default map implementation supports concurrent reads on old versions of the data.
All such read operations can occur in parallel. Concurrent reads from the page cache,
as well as concurrent reads from the file system are supported.
</p><p>
Storing changes can occur concurrently to modifying the data, as it operates on a snapshot.
</p><p>
Caching is done on the page level.
The page cache is a concurrent LIRS cache, which should be resistant against scan operations.
</p><p>
The default map implementation does not support concurrent modification
operations on a map (the same as <code>HashMap</code> and <code>TreeMap</code>).
Similar to those classes, the map tries to detect concurrent modification.
</p><p>
With the <code>MVMapConcurrent</code> implementation,
read operations even on the newest version can happen concurrently with all other
operations, without risk of corruption.
This comes with slightly reduced speed in single threaded mode,
the same as with other <code>ConcurrentHashMap</code> implementations.
Write operations first read the relevant area from disk to memory
(this can happen concurrently), and only then modify the data. The in-memory part of write
operations is synchronized.
</p><p>
For fully scalable concurrent write operations to a map (in-memory and to disk),
the map could be split into multiple maps in different stores ('sharding').
The plan is to add such a mechanism later when needed.
</p>
<h3 id="logStructured">Log Structured Storage</h3>
<p>
Changes are buffered in memory, and once enough changes have accumulated,
they are written in one continuous disk write operation.
(According to a test, write throughput of a common SSD gets higher the larger the block size,
until a block size of 2 MB, and then does not further increase.)
By default, committed changes are automatically written once every second
in a background thread, even if only little data was changed.
Changes can also be written explicitly by calling <code>store()</code>.
To avoid out of memory, uncommitted changes are also written when needed,
however they are rolled back when closing the store,
or at the latest (when the store was not correctly closed) when opening the store.
</p><p>
When storing, all changed pages are serialized,
optionally compressed using the LZF algorithm,
and written sequentially to a free area of the file.
Each such change set is called a chunk.
All parent pages of the changed B-trees are stored in this chunk as well,
so that each chunk also contains the root of each changed map
(which is the entry point to read this version of the data).
There is no separate index: all data is stored as a list of pages.
Per store, there is one additional map that contains the metadata (the list of
maps, where the root page of each map is stored, and the list of chunks).
</p><p>
There are usually two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk).
If the chunk is appended at the end of the file, the file header is only written at the end of the chunk.
</p><p>
There is no transaction log, no undo log,
and there are no in-place updates (however unused chunks are overwritten by default).
</p><p>
Old data is kept for at least 45 seconds (configurable),
so that there are no explicit sync operations required to guarantee data consistency,
but an application can also sync explicitly when needed.
To reuse disk space, the chunks with the lowest amount of live data are compacted
(the live data is simply stored again in the next chunk).
To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
</p><p>
Compared to traditional storage engines (that use a transaction log, undo log, and main storage area),
the log structured storage is simpler, more flexible, and typically needs less disk operations per change,
as data is only written once instead of twice or 3 times, and because the B-tree pages are
always full (they are stored next to each other) and can be easily compressed.
But temporarily, disk space usage might actually be a bit higher than for a regular database,
as disk space is not immediately re-used (there are no in-place updates).
</p>
<h3 id="fileSystem">File System Abstraction, File Locking and Online Backup</h3>
<p>
The file system is pluggable (the same file system abstraction is used as H2 uses).
The file can be encrypted using an encrypting file system.
Other file system implementations support reading from a compressed zip or jar file.
</p>
<p>
Each store may only be opened once within a JVM.
When opening a store, the file is locked in exclusive mode, so that
the file can only be changed from within one process.
Files can be opened in read-only mode, in which case a shared lock is used.
</p>
<p>
The persisted data can be backed up to a different file at any time,
even during write operations (online backup).
To do that, automatic disk space reuse needs to be first disabled, so that
new data is always appended at the end of the file.
Then, the file can be copied (the file handle is available to the application).
</p>
<h3 id="encryption">Encrypted Files</h3>
<p>
File encryption ensures the data can only be read with the correct password.
Data can be encrypted as follows:
</p>
<pre>
MVStore s = new MVStore.Builder().
fileName(fileName).
encryptionKey("007".toCharArray()).
open();
</pre>
<p>
</p><p>
The following algorithms and settings are used:
</p>
<ul><li>The password char array is cleared after use,
to reduce the risk that the password is stolen
even if the attacker has access to the main memory.
</li><li>The password is hashed according to the PBKDF2 standard,
using the SHA-256 hash algorithm.
</li><li>The length of the salt is 64 bits,
so that an attacker can not use a pre-calculated password hash table (rainbow table).
It is generated using a cryptographically secure random number generator.
</li><li>To speed up opening an encrypted stores on Android,
the number of PBKDF2 iterations is 10.
The higher the value, the better the protection against brute-force password cracking attacks,
but the slower is opening a file.
</li><li>The file itself is encrypted using the standardized disk encryption mode XTS-AES.
Only little more than one AES-128 round per block is needed.
</li></ul>
<h3 id="tools">Tools</h3>
<p>
There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
</p>
<h3 id="exceptionHandling">Exception Handling</h3>
<p>
This tool does not throw checked exceptions.
Instead, unchecked exceptions are thrown if needed.
The error message always contains the version of the tool.
The following exceptions can occur:
</p>
<ul><li><code>IllegalStateException</code> if a map was already closed or
an IO exception occurred, for example if the file was locked, is already closed,
could not be opened or closed, if reading or writing failed,
if the file is corrupt, or if there is an internal error in the tool.
</li><li><code>IllegalArgumentException</code> if a method was called with an illegal argument.
</li><li><code>UnsupportedOperationException</code> if a method was called that is not supported,
for example trying to modify a read-only map or view.
</li><li><code>ConcurrentModificationException</code> if the object is modified concurrently.
</li></ul>
<h3 id="tableEngine">Table Engine for H2</h3>
<p>
The plan is to use the MVStore as the default storage engine for the H2 database
in the future (supporting SQL, JDBC, transactions, MVCC, and so on).
This is work in progress. To try it out, append
<code>;DEFAULT_TABLE_ENGINE=org.h2.mvstore.db.MVTableEngine</code>
to the database URL. In general, functionality and performance should be
similar than the current default storage engine (the page store).
There are a few features that have not been implemented yet or are not complete:
</p>
<ul><li>Changing the cache size.
</li><li>Two-phase commit.
</li><li>The database metadata is still stored in a <code>.h2.db</code> file.
</li><li>The database file(s) sometimes do not shrink as expected.
</li></ul>
<h2 id="differences">Similar Projects and Differences to Other Storage Engines</h2>
<p>
Unlike similar storage engines like LevelDB and Kyoto Cabinet,
the MVStore is written in Java
and can easily be embedded in a Java and Android application.
</p><p>
The MVStore is somewhat similar to the Berkeley DB Java Edition
because it is also written in Java,
and is also a log structured storage, but the H2 license is more liberal.
</p><p>
Like SQLite, the MVStore keeps all data in one file.
Unlike SQLite, the MVStore uses is a log structured storage.
The plan is to make the MVStore both easier to use as well as faster than SQLite.
In a recent (very simple) test, the MVStore was about twice as fast as SQLite on Android.
</p><p>
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
and some code is shared between MapDB and JDBM.
However, unlike MapDB, the MVStore uses is a log structured storage.
The MVStore does not have a record size limit.
</p>
<h2 id="current_state">Current State</h2>
<p>
The code is still experimental at this stage.
The API as well as the behavior may partially change.
Features may be added and removed (even thought the main features will stay).
</p>
<h2 id="requirements">Requirements</h2>
<p>
The MVStore is included in the latest H2 jar file.
</p><p>
There are no special requirements to use it.
The MVStore should run on any JVM as well as on Android.
</p><p>
To build just the MVStore (without the database engine), run:
</p>
<pre>
./build.sh jarMVStore
</pre>
<p>
This will create the file <code>bin/h2mvstore-${version}.jar</code> (about 130 KB).
</p>
<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>