A persistent multi-version map: initial documentation

115ef83b · Thomas Mueller · f0beaad4 · 115ef83b
--- a/h2/src/docsrc/html/mvstore.html
+++ b/h2/src/docsrc/html/mvstore.html
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<!--
+Copyright 2004-2011 H2 Group. Multiple-Licensed under the H2 License, Version 1.0,
+and under the Eclipse Public License, Version 1.0
+(http://h2database.com/html/license.html).
+Initial Developer: H2 Group
+-->
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
+<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>
+JaQu
+</title><link rel="stylesheet" type="text/css" href="stylesheet.css" />
+<!-- [search] { -->
+<script type="text/javascript" src="navigation.js"></script>
+</head><body onload="frameMe();">
+<table class="content"><tr class="content"><td class="content"><div class="contentDiv">
+<!-- } -->
+
+<h1>MVStore</h1>
+<a href="#overview">
+    Overview</a><br />
+<a href="#example_code">
+    Example Code</a><br />
+<a href="#features">
+    Features</a><br />
+<a href="#differences">
+    Similar Projects and Differences to Other Storage Engines</a><br />
+<a href="#current_state">
+    Current State</a><br />
+<a href="#building_mvstore">
+    Building the MVStore Library</a><br />
+<a href="#requirements">
+    Requirements</a><br />
+<a href="#future_plans">
+    Future Plans</a><br />
+
+<h2 id="overview">Overview</h2>
+<ul><li>MVStore stands for multi-version store.
+</li><li>Each store contains a number of maps (using the <code>java.util.Map</code> interface).
+</li><li>The data can be persisted to disk (like a key-value store or a database).
+</li><li>Fully in-memory operation is supported.
+</li><li>It is intended to be fast, simple to use, and small.
+</li><li>Old versions of the data can be read concurrently with all other operations.
+</li><li>Transaction are supported (currently only one transaction can be open at any time).
+</li><li>Transactions (even if they are persisted) can be rolled back.
+</li><li>The tool is very modular. It supports pluggable data types / serialization,
+pluggable map implementations (B-tree and R-tree currently), BLOB storage,
+and a file system abstraction that supports encryption and compressed
+read-only files.
+</li></ul>
+
+<h2 id="example_code">Example Code</h2>
+<pre>
+// open the store (in-memory if fileName is null)
+MVStore s = MVStore.open(fileName);
+
+// create/get the map "data"
+// the String.class, String.class will be optional later
+MVMap<String, String> map = s.openMap("data",
+        String.class, String.class);
+
+// add some data
+map.put("1", "Hello");
+map.put("2", "World");
+
+// get the current version, for later use
+long oldVersion = s.getCurrentVersion();
+
+// from now on, the old version is read-only
+s.incrementVersion();
+
+// more changes, in the new version
+// changes can be rolled back if required
+// changes always go into 'head' (the newest version)
+map.put("1", "Hi");
+map.remove("2");
+
+// access the old data (before incrementVersion)
+MVMap<String, String> oldMap =
+        map.openVersion(oldVersion);
+
+// store the newest data to disk
+s.store();
+
+// print the old version (can be done
+// concurrently with further modifications)
+// this will print Hello World
+System.out.println(oldMap.get("1"));
+System.out.println(oldMap.get("2"));
+
+// print the newest version ("Hi")
+System.out.println(map.get("1"));
+
+// close the store - this doesn't write to disk
+s.close();
+</pre>
+
+
+<h2 id="features">Features</h2>
+
+<h3>Maps</h3>
+<p>
+Each store supports a set of named maps.
+A map is sorted by key, and supports the common lookup operations, 
+including access to the first and last key, iterate over some or all keys, and so on.
+</p><p>
+Also supported, and unusual for maps, is fast index lookup, as if the map would be a list
+(get the key at the given index, get the index of a certain key), and fast skipping within an iterator.
+Internally, each map is organized in the form of a counted B+-tree.
+</p><p>
+In database terms, a map can be a table, where the key of the map is the primary key of the table, 
+and the value is the row. A map can also represent an index, where the key of the map is the key
+of the index, and the value of the map is the primary key of the table.
+</p>
+
+<h3>Versions / Transactions</h3>
+<p>
+Multiple versions are supported.
+A version is a snapshot of all the data of all maps at a given point in time.
+A transaction is a number of actions between two versions.
+</p><p>
+Versions / transactions are not immediately persisted; instead, only the version counter is incremented.
+If there is a change after switching to a new version, a snapshot of the old version is kept in memory,
+so that the old version can still be read.
+</p><p>
+Old persisted versions are readable until the old data was explicitly overwritten.
+Creating the snapshot is fast: only the pages that are changed after a snapshot are copied.
+This behavior also called COW (copy on write).
+</p><p>
+Rollback is supported (rollback to any old in-memory version or an old persisted version).
+</p>
+
+<h3>Log Structured Storage</h3>
+<p>
+Currently, store() needs to be called explicitly to save changes.
+Changes are buffered in memory, and once enough changes have accumulated
+(for example 2 MB), all changes are written in one continuous disk write operation.
+But of course, if needed, changes can also be persisted if only little data was changed.
+The estimated amount of unsaved changes is tracked.
+The plan is to automatically store in a background thread once there are enough changes.
+</p><p>
+When storing, all changed pages are serialized, 
+compressed using the LZF algorithm (this can be disabled), 
+and written sequentially to a free area of the file. 
+Each such change set is called a chunk.
+All parent pages of the changed B-trees are stored in this chunk as well, 
+so that each chunk also contains the root of each changed map 
+(which is the entry point to read old data).
+There is no separate index: all data is stored as a list of pages.
+</p><p>
+There are currently two write operations per chunk:
+one to store the chunk data (the pages), and one to update the file header 
+(so it points to the chunk head), but the plan is to write the file header only 
+once in a while, so it does not slow down opening the store too much.
+</p><p>
+There is currently no transaction log, no undo log, 
+and there are no in-place updates (however unused chunks are overwritten).
+To efficiently persist very small transactions, the plan is to support a transaction log
+where only the deltas is stored, until enough changes have accumulated to persist a chunk.
+Old versions are kept and are readable until they are no longer needed.
+</p><p>
+The plan is to keep all old data for at least one or two minutes (configurable), 
+so that there are no explicit sync operations required to guarantee data consistency.
+To reuse disk space, the chunks with the lowest amount of live data are compacted
+(the live data is simply stored again in the next chunk).
+To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
+</p><p>
+Compared to regular databases that use a transaction log, undo log, and main storage area,
+the log structured storage is simpler, more flexible, and typically needs less disk operations per change, 
+as data is only written once instead of twice or 3 times, and because the B-tree pages are
+always full (they are stored next to each other) and can be easily compressed.
+But temporarily, disk space usage might actually be a bit higher than for a regular database,
+as disk space is not immediately re-used (there are no in-place updates).
+</p>
+
+<h3>In-Memory Performance and Usage</h3>
+<p>
+Performance of in-memory operations is comparable with <code>java.util.TreeMap</code>
+(many operations are actually faster), but usually slower than <code>java.util.HashMap</code>.
+</p><p>
+If no file name is specified, the store operates purely in memory.
+Except for persisting data, all features are supported in this mode 
+(multi-versioning, index lookup, R-tree and so on).
+</p><p>
+The memory overhead for large maps is slightly better than for the regular
+map implementations, but there is a higher overhead per map. 
+For maps with less than 25 entries, the regular map implementations 
+use less memory on average.
+</p>
+
+<h3>Pluggable Data Types</h3>
+<p>
+Serialization is pluggable. The default serialization currently only supports string values and keys,
+but there is a sample plugin that supports many common data types, 
+and uses Java serialization for other objects.
+</p><p>
+Parameterized data types are supported
+(for example one could build a string data type that limits the length for some reason).
+</p><p>
+The storage engine itself does not have any length limits, so that keys, values, 
+pages, and chunks can be very big (as big as fits in memory).
+Also, there is no inherent limit to the number of maps and chunks.
+Due to using a log structured storage, there is no special case handling for large keys or pages.
+</p>
+
+<h3>BLOB Support</h3>
+<p>
+There is a mechanism that stores BLOBs (large binary objects) 
+by splitting them into smaller blocks. 
+This allows to store objects that don't fit in memory.
+Streaming as well as random access reads on such objects are supported.
+This tool is written on top of the store (only using the map interface).
+</p>
+
+<h3>R-Tree and Pluggable Map Implementations</h3>
+<p>
+The map implementation is pluggable.
+In addition to the default MVMap (multi-version map), 
+there is a multi-version R-tree map implementation
+for spatial operations (contain and intersection; nearest neighbor is not yet implemented).
+</p>
+
+<h3>Concurrent Operations and Caching</h3>
+<p>
+At the moment, concurrent read on old versions of the data is supported.
+All such read operations can occur in parallel. Concurrent reads from the page cache, 
+as well as concurrent reads from the file system are supported.
+</p><p>
+Caching is done on the page level.
+The page cache is a concurrent LIRS cache, 
+which should be resistant against scan operations.
+</p><p>
+Concurrent modification operations on the maps are currently not supported, 
+however it is planned to support an additional map implementation 
+that supports concurrent writes
+(at the cost of speed if used in a single thread, same as ConcurrentHashMap).
+</p>
+
+<h3>File System Abstraction, File Locking and Online Backup</h3>
+<p>
+The file system is pluggable (the same file system abstraction is used as H2 uses).
+Support for encryption is planned using an encrypting file system.
+Other file system implementations support reading from a compressed zip or tar file.
+</p>
+<p>
+Each store may only be opened once within a JVM.
+When opening a store, the file is locked in exclusive mode, so that
+the file can only be changed from within one process.
+Files can be opened in read-only mode, in which case a shared lock is used.
+</p>
+<p>
+The persisted data can be backed up to a different file at any time,
+even during write operations (online backup). 
+To do that, automatic disk space reuse needs to be first disabled, so that
+new data is always appended at the end of the file. Then, the file
+</p>
+
+
+<h2 id="differences">Similar Projects and Differences to Other Storage Engines</h2>
+<p>
+Unlike similar storage engines like LevelDB and Kyoto Cabinet, the MVStore is written in Java
+and can easily be embedded in a Java application.
+</p><p>
+The MVStore is somewhat similar to the Berkeley DB Java Edition because it is also written in Java, 
+and also a log structured storage, but the MVStore licensing is more liberal.
+</p><p>
+Like SQLite, the MVStore keeps all data in one file.
+The plan is to make the MVStore easier to use and faster than SQLite on Android
+(this was not recently tested, however an initial test was successful).
+</p><p>
+The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
+and some code is shared between MapDB and JDBM. 
+However, the MVStore is a log structured storage.
+</p>
+
+<h2 id="current_state">Current State</h2>
+<p>
+The code is still very experimental at this stage.
+The API as well as the behavior will probably change.
+Features may be added and removed (even thought the main features will stay).
+</p>
+
+<h2 id="building_mvstore">Building the MVStore Library</h2>
+<p>
+There is currently no build script.
+To test it, run the test within the H2 project in Eclipse or any other IDE.
+</p>
+
+<h2 id="requirements">Requirements</h2>
+<p>
+There are no special requirements.
+The MVStore should run on any JVM as well as on Android
+(even thought this was not tested recently).
+</p>
+
+<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>
+