mvstore.html 14.3 KB
Newer Older
1 2 3 4 5 6 7 8 9
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Copyright 2004-2011 H2 Group. Multiple-Licensed under the H2 License, Version 1.0,
and under the Eclipse Public License, Version 1.0
(http://h2database.com/html/license.html).
Initial Developer: H2 Group
-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>
10
MVStore
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
</title><link rel="stylesheet" type="text/css" href="stylesheet.css" />
<!-- [search] { -->
<script type="text/javascript" src="navigation.js"></script>
</head><body onload="frameMe();">
<table class="content"><tr class="content"><td class="content"><div class="contentDiv">
<!-- } -->

<h1>MVStore</h1>
<a href="#overview">
    Overview</a><br />
<a href="#example_code">
    Example Code</a><br />
<a href="#features">
    Features</a><br />
<a href="#differences">
    Similar Projects and Differences to Other Storage Engines</a><br />
<a href="#current_state">
    Current State</a><br />
<a href="#building_mvstore">
    Building the MVStore Library</a><br />
<a href="#requirements">
    Requirements</a><br />
<a href="#future_plans">
    Future Plans</a><br />

<h2 id="overview">Overview</h2>
37 38 39
<p>
The MVStore is work in progress, and is planned to be the next storage subsystem of H2.
</p>
40 41 42 43 44 45
<ul><li>MVStore stands for multi-version store.
</li><li>Each store contains a number of maps (using the <code>java.util.Map</code> interface).
</li><li>The data can be persisted to disk (like a key-value store or a database).
</li><li>Fully in-memory operation is supported.
</li><li>It is intended to be fast, simple to use, and small.
</li><li>Old versions of the data can be read concurrently with all other operations.
46
</li><li>Transaction are supported (currently only one transaction at a time).
47 48 49
</li><li>Transactions (even if they are persisted) can be rolled back.
</li><li>The tool is very modular. It supports pluggable data types / serialization,
pluggable map implementations (B-tree and R-tree currently), BLOB storage,
50
and a file system abstraction to support encryption and compressed read-only files.
51 52 53
</li></ul>

<h2 id="example_code">Example Code</h2>
54 55 56 57 58
<h3>Map Operations and Versioning</h3>
<p>
The following sample code show how to create a store,
open a map, add some data, and access the current and an old version.
</p>
59 60 61 62
<pre>
// open the store (in-memory if fileName is null)
MVStore s = MVStore.open(fileName);

63 64
// create/get the map named "data"
MVMap<Integer, String> map = s.openMap("data");
65 66

// add some data
67 68
map.put(1, "Hello");
map.put(2, "World");
69 70 71 72 73 74 75 76 77 78

// get the current version, for later use
long oldVersion = s.getCurrentVersion();

// from now on, the old version is read-only
s.incrementVersion();

// more changes, in the new version
// changes can be rolled back if required
// changes always go into 'head' (the newest version)
79 80
map.put(1, "Hi");
map.remove(2);
81 82

// access the old data (before incrementVersion)
83
MVMap<Integer, String> oldMap =
84 85 86 87 88 89 90 91
        map.openVersion(oldVersion);

// store the newest data to disk
s.store();

// print the old version (can be done
// concurrently with further modifications)
// this will print Hello World
92 93 94
System.out.println(oldMap.get(1));
System.out.println(oldMap.get(2));
oldMap.close();
95 96

// print the newest version ("Hi")
97
System.out.println(map.get(1));
98 99 100 101 102

// close the store - this doesn't write to disk
s.close();
</pre>

103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
<h3>Store Builder</h3>
<p>
The <code>MVStoreBuilder</code> provides a fluid interface
to build a store if more complex configuration options are used.
</p>
<pre>
MVStore s = MVStoreBuilder.
    fileBased(fileName).
    cacheSizeMB(10).
    readOnly().
    open();
</pre>

<h3>R-Tree</h3>
<p>
The <code>MVRTreeMap</code> is an R-tree implementation
that supports fast spatial queries.
</p>
<pre>
// create an in-memory store
MVStore s = MVStore.open(null);

// create an R-tree map
// the key has 2 dimensions, the value is a string
MVRTreeMap<String> r = MVRTreeMap.create(2, new ObjectType());

// open the map
r = s.openMap("data", r);

// add two key-value pairs
// the first value is the key id (to make the key unique)
// then the min x, max x, min y, max y
r.add(new SpatialKey(0, -3f, -2f, 2f, 3f), "left");
r.add(new SpatialKey(1, 3f, 4f, 4f, 5f), "right");

// iterate over the intersecting keys
Iterator<SpatialKey> it =
        r.findIntersectingKeys(new SpatialKey(0, 0f, 9f, 3f, 6f));
for (SpatialKey k; it.hasNext();) {
    k = it.next();
    System.out.println(k + ": " + r.get(k));
}
s.close();
</pre>

148 149 150 151 152 153 154 155 156

<h2 id="features">Features</h2>

<h3>Maps</h3>
<p>
Each store supports a set of named maps.
A map is sorted by key, and supports the common lookup operations, 
including access to the first and last key, iterate over some or all keys, and so on.
</p><p>
157 158 159 160 161 162 163
Also supported, and very uncommon for maps, is fast index lookup.
The keys of the map can be accessed like a list
(get the key at the given index, get the index of a certain key).
That means getting the median of two keys is trivial,
and it allows to very quickly count ranges.
The iterator supports fast skipping.
This is possible because internally, each map is organized in the form of a counted B+-tree.
164
</p><p>
165
In database terms, a map can be used like a table, where the key of the map is the primary key of the table, 
166
and the value is the row. A map can also represent an index, where the key of the map is the key
167 168
of the index, and the value of the map is the primary key of the table (for non-unique indexes
the key of the map must also contain the primary key).
169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247
</p>

<h3>Versions / Transactions</h3>
<p>
Multiple versions are supported.
A version is a snapshot of all the data of all maps at a given point in time.
A transaction is a number of actions between two versions.
</p><p>
Versions / transactions are not immediately persisted; instead, only the version counter is incremented.
If there is a change after switching to a new version, a snapshot of the old version is kept in memory,
so that the old version can still be read.
</p><p>
Old persisted versions are readable until the old data was explicitly overwritten.
Creating the snapshot is fast: only the pages that are changed after a snapshot are copied.
This behavior also called COW (copy on write).
</p><p>
Rollback is supported (rollback to any old in-memory version or an old persisted version).
</p>

<h3>Log Structured Storage</h3>
<p>
Currently, store() needs to be called explicitly to save changes.
Changes are buffered in memory, and once enough changes have accumulated
(for example 2 MB), all changes are written in one continuous disk write operation.
But of course, if needed, changes can also be persisted if only little data was changed.
The estimated amount of unsaved changes is tracked.
The plan is to automatically store in a background thread once there are enough changes.
</p><p>
When storing, all changed pages are serialized, 
compressed using the LZF algorithm (this can be disabled), 
and written sequentially to a free area of the file. 
Each such change set is called a chunk.
All parent pages of the changed B-trees are stored in this chunk as well, 
so that each chunk also contains the root of each changed map 
(which is the entry point to read old data).
There is no separate index: all data is stored as a list of pages.
</p><p>
There are currently two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header 
(so it points to the chunk head), but the plan is to write the file header only 
once in a while, so it does not slow down opening the store too much.
</p><p>
There is currently no transaction log, no undo log, 
and there are no in-place updates (however unused chunks are overwritten).
To efficiently persist very small transactions, the plan is to support a transaction log
where only the deltas is stored, until enough changes have accumulated to persist a chunk.
Old versions are kept and are readable until they are no longer needed.
</p><p>
The plan is to keep all old data for at least one or two minutes (configurable), 
so that there are no explicit sync operations required to guarantee data consistency.
To reuse disk space, the chunks with the lowest amount of live data are compacted
(the live data is simply stored again in the next chunk).
To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
</p><p>
Compared to regular databases that use a transaction log, undo log, and main storage area,
the log structured storage is simpler, more flexible, and typically needs less disk operations per change, 
as data is only written once instead of twice or 3 times, and because the B-tree pages are
always full (they are stored next to each other) and can be easily compressed.
But temporarily, disk space usage might actually be a bit higher than for a regular database,
as disk space is not immediately re-used (there are no in-place updates).
</p>

<h3>In-Memory Performance and Usage</h3>
<p>
Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>
(many operations are actually faster), but usually slower than <code>java.util.HashMap</code>.
</p><p>
If no file name is specified, the store operates purely in memory.
Except for persisting data, all features are supported in this mode 
(multi-versioning, index lookup, R-tree and so on).
</p><p>
The memory overhead for large maps is slightly better than for the regular
map implementations, but there is a higher overhead per map. 
For maps with less than 25 entries, the regular map implementations 
use less memory on average.
</p>

<h3>Pluggable Data Types</h3>
<p>
248 249 250 251 252
Serialization is pluggable. The default serialization currently supports many common data types, 
and uses Java serialization for other objects. The following classes are currently directly supported:
<code>Boolean, Byte, Short, Character, Integer, Long, Float, Double, BigInteger, BigDecimal,
byte[], char[], int[], long[], String, UUID</code>.
The plan is to add more common classes (date, time, timestamp, object array).
253 254 255 256 257 258 259 260 261 262 263 264
</p><p>
Parameterized data types are supported
(for example one could build a string data type that limits the length for some reason).
</p><p>
The storage engine itself does not have any length limits, so that keys, values, 
pages, and chunks can be very big (as big as fits in memory).
Also, there is no inherent limit to the number of maps and chunks.
Due to using a log structured storage, there is no special case handling for large keys or pages.
</p>

<h3>BLOB Support</h3>
<p>
265
There is a mechanism that stores large binary objects by splitting them into smaller blocks. 
266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291
This allows to store objects that don't fit in memory.
Streaming as well as random access reads on such objects are supported.
This tool is written on top of the store (only using the map interface).
</p>

<h3>R-Tree and Pluggable Map Implementations</h3>
<p>
The map implementation is pluggable.
In addition to the default MVMap (multi-version map), 
there is a multi-version R-tree map implementation
for spatial operations (contain and intersection; nearest neighbor is not yet implemented).
</p>

<h3>Concurrent Operations and Caching</h3>
<p>
At the moment, concurrent read on old versions of the data is supported.
All such read operations can occur in parallel. Concurrent reads from the page cache, 
as well as concurrent reads from the file system are supported.
</p><p>
Caching is done on the page level.
The page cache is a concurrent LIRS cache, 
which should be resistant against scan operations.
</p><p>
Concurrent modification operations on the maps are currently not supported, 
however it is planned to support an additional map implementation 
that supports concurrent writes
292
(at the cost of speed if used in a single thread, same as <code>ConcurrentHashMap</code>).
293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310
</p>

<h3>File System Abstraction, File Locking and Online Backup</h3>
<p>
The file system is pluggable (the same file system abstraction is used as H2 uses).
Support for encryption is planned using an encrypting file system.
Other file system implementations support reading from a compressed zip or tar file.
</p>
<p>
Each store may only be opened once within a JVM.
When opening a store, the file is locked in exclusive mode, so that
the file can only be changed from within one process.
Files can be opened in read-only mode, in which case a shared lock is used.
</p>
<p>
The persisted data can be backed up to a different file at any time,
even during write operations (online backup). 
To do that, automatic disk space reuse needs to be first disabled, so that
311 312
new data is always appended at the end of the file. 
Then, the file can be copied (the file handle is available to the application).
313 314
</p>

315 316 317 318 319 320 321 322
<h3>Tools</h3>
<p>
There is a builder for store instances (<code>MVStoreBuilder</code>) 
with a fluent API to simplify building a store instance.
</p>
<p>
There is a tool (<code>MVStoreTool</code>) to dump the contents of a file.
</p>
323 324 325 326 327 328 329

<h2 id="differences">Similar Projects and Differences to Other Storage Engines</h2>
<p>
Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java
and can easily be embedded in a Java application.
</p><p>
The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, 
330
and is also a log structured storage, but the H2 license is more liberal.
331 332 333 334 335 336 337
</p><p>
Like SQLite, the MVStore keeps all data in one file.
The plan is to make the MVStore easier to use and faster than SQLite on Android
(this was not recently tested, however an initial test was successful).
</p><p>
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
and some code is shared between MapDB and JDBM. 
338
However, unlike MapDB, the MVStore uses is a log structured storage.
339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362
</p>

<h2 id="current_state">Current State</h2>
<p>
The code is still very experimental at this stage.
The API as well as the behavior will probably change.
Features may be added and removed (even thought the main features will stay).
</p>

<h2 id="building_mvstore">Building the MVStore Library</h2>
<p>
There is currently no build script.
To test it, run the test within the H2 project in Eclipse or any other IDE.
</p>

<h2 id="requirements">Requirements</h2>
<p>
There are no special requirements.
The MVStore should run on any JVM as well as on Android
(even thought this was not tested recently).
</p>

<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>