Commit 249d5f2c authored by Thomas Mueller

MVStore: store the file header also at the end of each chunk, which means even fewer write operations
Parent 8846c1bb
......@@ -18,7 +18,9 @@ Change Log
<h1>Change Log</h1>
<h2>Next Version (unreleased)</h2>
<ul><li>MVStore: a map implementation that supports concurrent operations.
<ul><li>MVStore: store the file header also at the end of each chunk,
which further reduces the number of write operations.
</li><li>MVStore: a map implementation that supports concurrent operations.
</li><li>MVStore: unified exception handling; the version is included in the messages.
</li><li>MVStore: old data is now retained for 45 seconds by default.
</li><li>MVStore: compress is now disabled by default, and can be enabled on request.
......
......@@ -267,34 +267,35 @@ The plan is to add such a mechanism later when needed.
Currently, <code>store()</code> needs to be called explicitly to save changes.
Changes are buffered in memory, and once enough changes have accumulated
(for example 2 MB), all changes are written in one continuous disk write operation.
(According to a test, the write throughput of a common SSD increases with
the block size up to about 2 MB, and then does not increase further.)
But of course, changes can also be persisted if needed, even when only a small amount of data was changed.
The estimated amount of unsaved changes is tracked.
The plan is to automatically store in a background thread once there are enough changes.
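A minimal usage sketch of this model (the builder and map calls follow the tests in this commit; the exact builder API is still in flux and may differ):
<pre>
MVStore s = MVStoreBuilder.fileBased(fileName).open();
MVMap&lt;Integer, String&gt; map = s.openMap("data", Integer.class, String.class);
map.put(1, "Hello");  // buffered in memory only
s.store();            // persists all pending changes as one new chunk
s.close();
</pre>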
</p><p>
When storing, all changed pages are serialized,
compressed using the LZF algorithm (this can be disabled),
optionally compressed using the LZF algorithm,
and written sequentially to a free area of the file.
Each such change set is called a chunk.
All parent pages of the changed B-trees are stored in this chunk as well,
so that each chunk also contains the root of each changed map
(which is the entry point to read old data).
(which is the entry point to read this version of the data).
There is no separate index: all data is stored as a list of pages.
Per store, there is one additional map that contains the metadata (the list of
maps, where the root page of each map is stored, and the list of chunks).
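A conceptual sketch of the resulting chunk layout (simplified; the field order is illustrative, and the trailing header copy is the part added by this commit, see below):
<pre>
[ chunk header | page | page | ... | root pages | meta root | copy of file header ]
</pre>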
</p><p>
There are currently two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header
(so it points to the latest chunk), but the plan is to write the file header only
once in a while, in a way that still allows to open a store very quickly.
There are usually two write operations per chunk:
one to store the chunk data (the pages), and one to update the file header (so it points to the latest chunk).
If the chunk is appended at the end of the file, the file header is only written at the end of the chunk.
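A condensed sketch of that decision, mirroring the store() change in this commit (names simplified):
<pre>
long filePos = reuseSpace ? allocateChunk(length) : fileLength;
boolean atEnd = filePos + length >= fileLength;
// the serialized chunk already ends with a copy of the file header
DataUtils.writeFully(file, filePos, chunkBuffer);
if (!atEnd) {
    // the chunk landed in a reused gap, so the header copy at the
    // end of the file is stale and the front header must be rewritten
    writeFileHeader();
}
</pre>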
</p><p>
There is currently no transaction log, no undo log,
and there are no in-place updates (however unused chunks are overwritten).
To efficiently persist very small transactions, the plan is to support a transaction log
where only the deltas is stored, until enough changes have accumulated to persist a chunk.
Old versions are kept and are readable until they are no longer needed.
To save space when persisting very small transactions, the plan is to use a transaction log
where only the deltas are stored, until enough changes have accumulated to persist a chunk.
</p><p>
The plan is to keep all old data for at least one or two minutes (configurable),
so that there are no explicit sync operations required to guarantee data consistency.
Old data is kept for at least 45 seconds (configurable),
so that there are no explicit sync operations required to guarantee data consistency,
but an application can also sync explicitly when needed.
To reuse disk space, the chunks with the lowest amount of live data are compacted
(the live data is simply stored again in the next chunk).
To improve data locality and disk space usage, the plan is to automatically defragment and compact data.
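One way the "lowest amount of live data" rule could be applied, using the per-chunk maxLength and maxLengthLive counters that the dump tool in this commit prints (the selection heuristic below is an assumption, not the shipped code):
<pre>
Chunk candidate = null;
double best = 1.0;
for (Chunk c : chunks.values()) {
    double live = (double) c.maxLengthLive / c.maxLength;
    if (live < best) {
        best = live;
        candidate = c;
    }
}
// re-writing the live pages of 'candidate' moves them into the
// next chunk, after which the old chunk's space can be reused
</pre>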
......@@ -343,8 +344,8 @@ Instead, unchecked exceptions are thrown if needed.
The error message always contains the version of the tool.
The following exceptions can occur:
</p>
<ul><li><code>IllegalStateException</code> if a map was already closed,
in IO exception occurred, for example if the file was locked, is already closed,
<ul><li><code>IllegalStateException</code> if a map was already closed or
an IO exception occurred, for example if the file was locked, is already closed,
could not be opened or closed, if reading or writing failed,
if the file is corrupt, or if there is an internal error in the tool.
</li><li><code>IllegalArgumentException</code> if a method was called with an illegal argument.
......@@ -361,7 +362,7 @@ The MVStore is somewhat similar to the Berkeley DB Java Edition because it is al
and is also a log structured storage, but the H2 license is more liberal.
</p><p>
Like SQLite, the MVStore keeps all data in one file.
The plan is to make the MVStore easier to use and faster than SQLite on Android
The plan is to make the MVStore easier to use than SQLite, and faster, including on Android
(this was not recently tested, however an initial test was successful).
</p><p>
The API of the MVStore is similar to MapDB (previously known as JDBM) from Jan Kotek,
......@@ -387,4 +388,3 @@ The MVStore should run on any JVM as well as on Android
</p>
<!-- [close] { --></div></td></tr></table><!-- } --><!-- analytics --></body></html>
......@@ -299,7 +299,7 @@ public class DataUtils {
/**
* Read from a file channel until the buffer is full, or end-of-file
* has been reached. The buffer is rewind at the end.
has been reached. The buffer is rewound after reading.
*
* @param file the file channel
* @param pos the absolute position within the file
......
......@@ -27,6 +27,7 @@ import org.h2.mvstore.type.ObjectDataTypeFactory;
import org.h2.mvstore.type.StringDataType;
import org.h2.store.fs.FilePath;
import org.h2.store.fs.FileUtils;
import org.h2.util.MathUtils;
import org.h2.util.New;
import org.h2.util.StringUtils;
......@@ -36,20 +37,18 @@ File format:
header: (blockSize) bytes
header: (blockSize) bytes
[ chunk ] *
(there are two headers for security)
(there are two headers for security at the beginning of the file,
and there is a header after each chunk)
header:
H:3,...
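An example of a complete header block, reconstructed from the fields
that readFileHeader parses (values and key order are illustrative):
H:3,creationTime:1356000000000,lastMapId:7,rootChunk:40960,version:12,fletcher:9f2a41be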
TODO:
- concurrent map; avoid locking during IO (pre-load pages)
- maybe split database into multiple files, to speed up compact
- automated 'kill process' and 'power failure' test
- implement table engine for H2
- maybe split database into multiple files, to speed up compact
- auto-compact from time to time and on close
- test and possibly improve compact operation (for large dbs)
- support background writes (concurrent modification & store)
- limited support for writing to old versions (branches)
- support concurrent operations (including file I/O)
- on insert, if the child page is already full, don't load and modify it
-- split directly (for leaves with 1 entry)
- performance test with encrypting file system
......@@ -93,6 +92,9 @@ TODO:
-- and for MVStore on Windows, auto-detect renamed file
- ensure data is overwritten eventually if the system doesn't have a timer
- SSD-friendly write (always in blocks of 128 or 256 KB?)
- close the file on out of memory or disk write error (out of disk space or so)
- implement a sharded map (in one store, multiple stores)
-- to support concurrent updates and writes, and very large maps
*/
......@@ -448,29 +450,52 @@ public class MVStore {
}
private void readFileHeader() {
byte[] headers = new byte[2 * BLOCK_SIZE];
// we don't have a valid header yet
currentVersion = -1;
// read the last block of the file, and then the first two blocks
ByteBuffer buff = ByteBuffer.allocate(3 * BLOCK_SIZE);
buff.limit(BLOCK_SIZE);
fileReadCount++;
DataUtils.readFully(file, 0, ByteBuffer.wrap(headers));
for (int i = 0; i <= BLOCK_SIZE; i += BLOCK_SIZE) {
String s = StringUtils.utf8Decode(headers, i, BLOCK_SIZE).trim();
fileHeader = DataUtils.parseMap(s);
rootChunkStart = Long.parseLong(fileHeader.get("rootChunk"));
creationTime = Long.parseLong(fileHeader.get("creationTime"));
currentVersion = Long.parseLong(fileHeader.get("version"));
lastMapId = Integer.parseInt(fileHeader.get("lastMapId"));
int check = (int) Long.parseLong(fileHeader.get("fletcher"), 16);
DataUtils.readFully(file, fileSize - BLOCK_SIZE, buff);
buff.limit(3 * BLOCK_SIZE);
buff.position(BLOCK_SIZE);
fileReadCount++;
DataUtils.readFully(file, 0, buff);
for (int i = 0; i < 3 * BLOCK_SIZE; i += BLOCK_SIZE) {
String s = StringUtils.utf8Decode(buff.array(), i, BLOCK_SIZE)
.trim();
HashMap<String, String> m = DataUtils.parseMap(s);
String f = m.remove("fletcher");
if (f == null) {
continue;
}
int check;
try {
check = (int) Long.parseLong(f, 16);
} catch (NumberFormatException e) {
check = -1;
}
s = s.substring(0, s.lastIndexOf("fletcher") - 1) + " ";
byte[] bytes = StringUtils.utf8Encode(s);
int checksum = DataUtils.getFletcher32(bytes,
bytes.length / 2 * 2);
if (check == checksum) {
return;
int checksum = DataUtils.getFletcher32(bytes, bytes.length / 2 * 2);
if (check != checksum) {
continue;
}
long version = Long.parseLong(m.get("version"));
if (version > currentVersion) {
fileHeader = m;
rootChunkStart = Long.parseLong(m.get("rootChunk"));
creationTime = Long.parseLong(m.get("creationTime"));
currentVersion = version;
lastMapId = Integer.parseInt(m.get("lastMapId"));
}
}
throw DataUtils.illegalStateException("File header is corrupt");
if (currentVersion < 0) {
throw DataUtils.illegalStateException("File header is corrupt");
}
}
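// For reference, a sketch of a plain Fletcher-32 over big-endian 16-bit
// words, which is what the header check above assumes (whether this
// matches DataUtils.getFletcher32 exactly is an assumption; note the
// caller always passes an even length, bytes.length / 2 * 2):
private static int fletcher32Sketch(byte[] bytes, int length) {
    int s1 = 0xffff, s2 = 0xffff;
    int i = 0;
    while (i < length) {
        // fold the running sums often enough that they cannot overflow
        for (int end = Math.min(i + 718, length); i < end;) {
            int x = ((bytes[i++] & 0xff) << 8) | (bytes[i++] & 0xff);
            s1 += x;
            s2 += s1;
        }
        s1 = (s1 & 0xffff) + (s1 >>> 16);
        s2 = (s2 & 0xffff) + (s2 >>> 16);
    }
    s1 = (s1 & 0xffff) + (s1 >>> 16);
    s2 = (s2 & 0xffff) + (s2 >>> 16);
    return (s2 << 16) | s1;
}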
private void writeFileHeader() {
private byte[] getFileHeaderBytes() {
StringBuilder buff = new StringBuilder();
fileHeader.put("lastMapId", "" + lastMapId);
fileHeader.put("rootChunk", "" + rootChunkStart);
......@@ -483,6 +508,11 @@ public class MVStore {
if (bytes.length > BLOCK_SIZE) {
throw DataUtils.illegalArgumentException("File header too large: " + buff);
}
return bytes;
}
private void writeFileHeader() {
byte[] bytes = getFileHeaderBytes();
ByteBuffer header = ByteBuffer.allocate(2 * BLOCK_SIZE);
header.put(bytes);
header.position(BLOCK_SIZE);
......@@ -614,7 +644,7 @@ public class MVStore {
buff = writeBuffer;
buff.clear();
} else {
writeBuffer = buff = ByteBuffer.allocate(maxLength + 1024 * 1024);
writeBuffer = buff = ByteBuffer.allocate(maxLength + 128 * 1024);
}
}
// need to patch the header later
......@@ -644,9 +674,14 @@ public class MVStore {
// the correct value is written in the chunk header
meta.getRoot().writeUnsavedRecursive(c, buff);
buff.flip();
int length = buff.limit();
long filePos = allocateChunk(length);
int chunkLength = buff.position();
int length = MathUtils.roundUpInt(chunkLength, BLOCK_SIZE) + BLOCK_SIZE;
buff.limit(length);
long fileLength = getFileLengthUsed();
long filePos = reuseSpace ? allocateChunk(length) : fileLength;
boolean atEnd = filePos + length >= fileLength;
// need to keep old chunks
// until they are no longer referenced
......@@ -656,22 +691,30 @@ public class MVStore {
chunks.remove(x);
}
buff.rewind();
c.start = filePos;
c.length = length;
c.length = chunkLength;
c.metaRootPos = meta.getRoot().getPos();
buff.position(0);
c.writeHeader(buff);
buff.rewind();
rootChunkStart = filePos;
revertTemp();
buff.position(buff.limit() - BLOCK_SIZE);
byte[] header = getFileHeaderBytes();
buff.put(header);
// fill the header with zeroes
buff.put(new byte[BLOCK_SIZE - header.length]);
buff.position(0);
fileWriteCount++;
DataUtils.writeFully(file, filePos, buff);
fileSize = Math.max(fileSize, filePos + buff.position());
rootChunkStart = filePos;
revertTemp();
// write the new version (after the commit)
writeFileHeader();
shrinkFileIfPossible(1);
// overwrite the header if required
if (!atEnd) {
writeFileHeader();
shrinkFileIfPossible(1);
}
// some pages might have been changed in the meantime (in the newest version)
unsavedPageCount = Math.max(0, unsavedPageCount - currentUnsavedPageCount);
return version;
......@@ -725,21 +768,18 @@ public class MVStore {
}
private long getFileLengthUsed() {
int min = 2;
long size = 2 * BLOCK_SIZE;
for (Chunk c : chunks.values()) {
if (c.start == Long.MAX_VALUE) {
continue;
}
int last = (int) ((c.start + c.length) / BLOCK_SIZE);
min = Math.max(min, last + 1);
long x = c.start + c.length;
size = Math.max(size, MathUtils.roundUpLong(x, BLOCK_SIZE) + BLOCK_SIZE);
}
return min * BLOCK_SIZE;
return size;
}
private long allocateChunk(long length) {
if (!reuseSpace) {
return getFileLengthUsed();
}
BitSet set = new BitSet();
set.set(0);
set.set(1);
......@@ -749,7 +789,7 @@ public class MVStore {
}
int first = (int) (c.start / BLOCK_SIZE);
int last = (int) ((c.start + c.length) / BLOCK_SIZE);
set.set(first, last +1);
set.set(first, last + 2);
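// (the reserved range now extends one block past the chunk data,
// since a copy of the file header is written after each chunk)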
}
int required = (int) (length / BLOCK_SIZE) + 1;
for (int i = 0; i < set.size(); i++) {
......@@ -1197,6 +1237,15 @@ public class MVStore {
} while (last.version > version && chunks.size() > 0);
rootChunkStart = last.start;
writeFileHeader();
// need to write the header at the end of the file as well,
// so that the old end header is not used
byte[] bytes = getFileHeaderBytes();
ByteBuffer header = ByteBuffer.allocate(BLOCK_SIZE);
header.put(bytes);
header.rewind();
fileWriteCount++;
DataUtils.writeFully(file, fileSize, header);
fileSize += BLOCK_SIZE;
readFileHeader();
readMeta();
}
......
......@@ -6,13 +6,11 @@
*/
package org.h2.mvstore;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Properties;
import org.h2.store.fs.FilePath;
import org.h2.store.fs.FileUtils;
......@@ -47,9 +45,10 @@ public class MVStoreTool {
* @param fileName the name of the file
* @param writer the print writer
*/
public static void dump(String fileName, PrintWriter writer) throws IOException {
public static void dump(String fileName, Writer writer) throws IOException {
PrintWriter pw = new PrintWriter(writer, true);
if (!FileUtils.exists(fileName)) {
writer.println("File not found: " + fileName);
pw.println("File not found: " + fileName);
return;
}
FileChannel file = null;
......@@ -57,20 +56,21 @@ public class MVStoreTool {
try {
file = FilePath.get(fileName).open("r");
long fileLength = file.size();
byte[] header = new byte[blockSize];
file.read(ByteBuffer.wrap(header), 0);
Properties prop = new Properties();
prop.load(new ByteArrayInputStream(header));
prop.load(new StringReader(new String(header, "UTF-8")));
writer.println("file " + fileName);
writer.println(" length " + fileLength);
writer.println(" " + prop);
ByteBuffer block = ByteBuffer.allocate(40);
pw.println("file " + fileName);
pw.println(" length " + fileLength);
ByteBuffer block = ByteBuffer.allocate(4096);
for (long pos = 0; pos < fileLength;) {
block.rewind();
DataUtils.readFully(file, pos, block);
block.rewind();
if (block.get() != 'c') {
int tag = block.get();
if (tag == 'H') {
pw.println(" header at " + pos);
pw.println(" " + new String(block.array(), "UTF-8").trim());
pos += blockSize;
continue;
}
if (tag != 'c') {
pos += blockSize;
continue;
}
......@@ -80,7 +80,7 @@ public class MVStoreTool {
long metaRootPos = block.getLong();
long maxLength = block.getLong();
long maxLengthLive = block.getLong();
writer.println(" chunk " + chunkId +
pw.println(" chunk " + chunkId +
" at " + pos +
" length " + chunkLength +
" pageCount " + pageCount +
......@@ -102,7 +102,7 @@ public class MVStoreTool {
int type = chunk.get();
boolean compressed = (type & 2) != 0;
boolean node = (type & 1) != 0;
writer.println(" map " + mapId + " at " + p + " " +
pw.println(" map " + mapId + " at " + p + " " +
(node ? "node" : "leaf") + " " +
(compressed ? "compressed " : "") +
"len: " + pageLength + " entries: " + len);
......@@ -111,7 +111,7 @@ public class MVStoreTool {
}
}
} catch (IOException e) {
writer.println("ERROR: " + e);
pw.println("ERROR: " + e);
throw e;
} finally {
if (file != null) {
......@@ -122,8 +122,8 @@ public class MVStoreTool {
}
}
}
writer.println();
writer.flush();
pw.println();
pw.flush();
}
}
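A sketch of invoking the tool with the new Writer-based signature (a PrintWriter also works, since it extends Writer):

MVStoreTool.dump("test.h3", new PrintWriter(System.out));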
......@@ -7,10 +7,9 @@ package org.h2.test.store;
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
......@@ -39,7 +38,7 @@ public class TestConcurrent extends TestMVStore {
public void test() throws Exception {
FileUtils.deleteRecursive(getBaseDir(), true);
// testConcurrentOnlineBackup();
testConcurrentOnlineBackup();
testConcurrentMap();
testConcurrentIterate();
testConcurrentWrite();
......@@ -100,9 +99,7 @@ public class TestConcurrent extends TestMVStore {
}
private void testConcurrentOnlineBackup() throws Exception {
// because absolute and relative reads are mixed, this currently
// only works when using FileChannel directly
String fileName = "nio:" + getBaseDir() + "/onlineBackup.h3";
String fileName = getBaseDir() + "/onlineBackup.h3";
String fileNameRestore = getBaseDir() + "/onlineRestore.h3";
final MVStore s = openStore(fileName);
final MVMap<Integer, byte[]> map = s.openMap("test");
......@@ -111,49 +108,28 @@ public class TestConcurrent extends TestMVStore {
@Override
public void call() throws Exception {
while (!stop) {
for (int i = 0; i < 100; i++) {
for (int i = 0; i < 20; i++) {
map.put(i, new byte[100 * r.nextInt(100)]);
}
s.store();
map.clear();
s.store();
long len = s.getFile().size();
if (len > 1024 * 100) {
if (len > 1024 * 1024) {
// slow down writing a lot
Thread.sleep(200);
} else if (len > 1024 * 100) {
// slow down writing
Thread.sleep(20);
} else if (len > 1024 * 200) {
// slow down writing
Thread.sleep(200);
}
}
}
};
t.execute();
// the wrong way to back up
try {
for (int i = 0; i < 100; i++) {
byte[] buff = readFileSlowly(s.getFile());
FileOutputStream out = new FileOutputStream(fileNameRestore);
out.write(buff);
MVStore s2 = openStore(fileNameRestore);
try {
MVMap<Integer, byte[]> test = s2.openMap("test");
for (Integer k : test.keySet()) {
test.get(k);
}
} finally {
s2.close();
}
}
fail();
} catch (Exception e) {
// expected
}
// the right way to back up
for (int i = 0; i < 10; i++) {
// System.out.println("test " + i);
s.setReuseSpace(false);
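// while space reuse is disabled, new chunks are only appended,
// so a concurrent sequential copy of the file stays consistent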
byte[] buff = readFileSlowly(s.getFile());
byte[] buff = readFileSlowly(fileName, s.getFile().size());
s.setReuseSpace(true);
FileOutputStream out = new FileOutputStream(fileNameRestore);
out.write(buff);
......@@ -170,22 +146,17 @@ public class TestConcurrent extends TestMVStore {
s.close();
}
private static byte[] readFileSlowly(FileChannel file) throws Exception {
file.position(0);
InputStream in = new BufferedInputStream(Channels.newInputStream(file));
private static byte[] readFileSlowly(String fileName, long length) throws Exception {
InputStream in = new BufferedInputStream(new FileInputStream(fileName));
ByteArrayOutputStream buff = new ByteArrayOutputStream();
for (int j = 0;; j++) {
for (int j = 0; j < length; j++) {
int x = in.read();
if (x < 0) {
break;
}
buff.write(x);
if (j % 4096 == 0) {
Thread.sleep(1);
}
}
// in.close() could close the stream
// in.close();
in.close();
return buff.toByteArray();
}
......
......@@ -265,10 +265,16 @@ public class TestMVRTree extends TestMVStore {
}
private void testRandom() {
testRandom(true);
testRandom(false);
}
private void testRandom(boolean quadraticSplit) {
String fileName = getBaseDir() + "/testRandom.h3";
FileUtils.delete(fileName);
MVStore s = openStore(fileName);
MVRTreeMap<String> m = MVRTreeMap.create(2, StringDataType.INSTANCE);
m.setQuadraticSplit(quadraticSplit);
m = s.openMap("data", m);
HashMap<SpatialKey, String> map = new HashMap<SpatialKey, String>();
Random rand = new Random(1);
......
......@@ -103,7 +103,7 @@ public class TestMVStore extends TestBase {
s.store();
s.close();
int[] expectedReadsForCacheSize = {
3407, 2590, 1924, 1440, 1101, 956, 918
3408, 2590, 1924, 1440, 1102, 956, 918
};
for (int cacheSize = 0; cacheSize <= 6; cacheSize += 4) {
s = MVStoreBuilder.fileBased(fileName).
......@@ -172,6 +172,10 @@ public class TestMVStore extends TestBase {
// test corrupt file headers
for (int i = 0; i <= blockSize; i += blockSize) {
FileChannel fc = f.open("rw");
if (i == 0) {
// corrupt the last block (the end header)
fc.truncate(fc.size() - 4096);
}
ByteBuffer buff = ByteBuffer.allocate(4 * 1024);
fc.read(buff, i);
String h = new String(buff.array(), "UTF-8").trim();
......@@ -577,7 +581,7 @@ public class TestMVStore extends TestBase {
assertEquals(1000, m.size());
assertEquals(284, s.getUnsavedPageCount());
s.store();
assertEquals(3, s.getFileWriteCount());
assertEquals(2, s.getFileWriteCount());
s.close();
s = openStore(fileName);
......@@ -586,8 +590,8 @@ public class TestMVStore extends TestBase {
assertEquals(0, m.size());
s.store();
// ensure only nodes are read, but not leaves
assertEquals(41, s.getFileReadCount());
assertEquals(2, s.getFileWriteCount());
assertEquals(42, s.getFileReadCount());
assertEquals(1, s.getFileWriteCount());
s.close();
}
......@@ -832,6 +836,7 @@ public class TestMVStore extends TestBase {
FileUtils.delete(fileName);
MVStore s = openStore(fileName);
s.close();
s = openStore(fileName);
MVMap<Integer, String> m = s.openMap("data", Integer.class, String.class);
int count = 2000;
......@@ -857,6 +862,7 @@ public class TestMVStore extends TestBase {
}
s.store();
s.close();
s = openStore(fileName);
m = s.openMap("data", Integer.class, String.class);
assertNull(m.get(0));
......