Garbage collection in an append-only B+tree stored on disk

Hi there, this post is kind of long, but as far as I know it's a previously untackled problem (I couldn't find any research papers on it), so it might be interesting. The problem is doing garbage collection on an append-only file. Pages are written to this file sequentially, and every now and then a previous page becomes obsolete, so in theory it could be reused. I have a database implementation that is stored as a B+tree on disk. It's append-only, so existing data is never overwritten, which makes it less susceptible to corruption and provides automatic versioning.

A few things about the data structures (a rough page-layout sketch follows the list):
1) File writes are carried out in pages of size X (4 KB by default), depending on the disk block size.
2) There are 4 types of pages: head, leaf, branch and meta. Leaf and branch pages are data pages. The head page is written only once, when the file is first created.
3) All data pages have a page number, which gives easy access: page 4 lives at offset 4 * PageSize in the file on disk.
4) A meta page is written after every data commit. The meta page contains the number of the root data page.
5) Since the file is append-only, whenever a write happens, all affected pages have to be copied and the copies appended.
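
For concreteness, here's a minimal sketch of what such a page layout could look like. The struct and field names (PageHeader, MetaPage, kPageSize, and so on) are just for illustration, not my actual code:

```cpp
#include <cstdint>

// Hypothetical on-disk layout, assuming 4 KB pages.
constexpr uint32_t kPageSize = 4096;

enum class PageType : uint8_t { Head, Leaf, Branch, Meta };

// Common header at the start of every page.
struct PageHeader {
    PageType type;    // head, leaf, branch or meta
    uint64_t pageNo;  // byte offset in the file = pageNo * kPageSize
};

// Appended after every commit; records which page is the current root.
struct MetaPage {
    PageHeader header;    // header.type == PageType::Meta
    uint64_t   rootPage;  // page number of the root data page for this commit
};
```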

Typical process of writing (a rough code sketch of this flow follows the steps):
a) create file: write header page (that's 4k gone, even though header is about 64 bytes or something)
b) write first key/value pair (i.e. node): create a new leaf page and store the node in it. Then create a meta page and set its root page to the new leaf page.
** so now we have 3 pages after one write - head page 0, leaf page 1, meta page 2.
c) write another node: find where it goes in the tree. There's space in page 1, so copy page 1 and add the node to the copy. Then copy the old meta page and set its root page to the new copy of page 1.
** Now we have 5 pages: head page 0, old leaf page 1, old meta page 2, new leaf page 3 and new meta page 4 (which points to root page 3).
d) write a node that causes a split: find where it goes in the tree, start at the root (page 3), see that there is no space, so split.
** page 3 turns into pages 5 and 6.
** write the nodes to pages 5 and 6.
** create branch page 7, and create nodes in it that point to pages 5 and 6.
** create new meta page 8 and set its root page to branch page 7.
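
And here's that commit flow as a rough sketch, just to make the copy-on-write idea concrete. The Pager class and all the names in it are made up and heavily simplified:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical append-only pager: pages are never overwritten, only appended.
class Pager {
public:
    // Appends a page image at the end of the file and returns its page number.
    uint64_t appendPage(const std::vector<uint8_t>& page) {
        uint64_t pageNo = nextPageNo_++;
        // ... write `page` at offset pageNo * kPageSize (e.g. with pwrite) ...
        (void)page;
        return pageNo;
    }

    // Copy-on-write update of one leaf: copy the old page, modify the copy,
    // append it, then append a meta page whose root points at the new leaf.
    uint64_t commitLeafUpdate(const std::vector<uint8_t>& oldLeaf) {
        std::vector<uint8_t> newLeaf = oldLeaf;    // copy the affected page
        // ... insert the new key/value pair into newLeaf ...
        uint64_t newLeafNo = appendPage(newLeaf);  // e.g. page 3 in step c)
        return appendMeta(newLeafNo);              // e.g. page 4, root = page 3
    }

private:
    uint64_t appendMeta(uint64_t rootPageNo) {
        std::vector<uint8_t> meta(kPageSize, 0);
        // ... serialize rootPageNo into the meta page ...
        (void)rootPageNo;
        return appendPage(meta);
    }

    static constexpr uint32_t kPageSize = 4096;
    uint64_t nextPageNo_ = 0;
};
```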

So as you can see, there's a lot of space wastage. As meta pages are added and the root page shifts, old pages become reusable, so I'm trying to think of ways to flag them as such. One way is to mark a page as garbage at the point where it gets copied. But the problem with that comes when you try to reuse those pages: while the newest meta page may be at page 75 on disk, a previously opened read transaction might still point at meta page 32, which has a path to leaf page 3 or something. And since older meta pages are never written again, I can't record on them that "all read transactions on you are now closed".
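
For in-process readers, one way I can picture the bookkeeping: record which pages each commit makes obsolete, and only reuse pages whose freeing commit is older than the oldest meta page any open read transaction still holds. A rough sketch (all names made up), which of course doesn't help with the cross-process case below:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical in-memory bookkeeping kept by the single writer process.
struct FreeList {
    // Pages made obsolete by each commit, keyed by the meta page number
    // that the commit produced.
    std::map<uint64_t, std::vector<uint64_t>> freedByMeta;

    // Meta page numbers currently pinned by open read transactions.
    std::multiset<uint64_t> activeReaders;

    // Pages are only reusable if they were freed by a commit strictly older
    // than the oldest meta page any live reader is still looking at.
    std::vector<uint64_t> reusablePages() const {
        uint64_t oldest = activeReaders.empty() ? UINT64_MAX
                                                : *activeReaders.begin();
        std::vector<uint64_t> result;
        for (const auto& [metaNo, pages] : freedByMeta) {
            if (metaNo >= oldest) break;  // still visible to some reader
            result.insert(result.end(), pages.begin(), pages.end());
        }
        return result;
    }
};
```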

One other big gotcha: while the main process has write access to the file, another process can open it read-only and choose which meta page to start from, so there's no way to know which meta page the other process is accessing, unless I let the read process set markers in the file (but that would violate its read-only access).

I haven't looked into it yet, but there's a possibility that Linux file systems detach a read-only file if another process starts writing to it. If that behaviour is guaranteed, then I guess the other process doesn't matter. But that's on my to-do list; for now I thought I'd pick your brains to see if anyone has dealt with anything similar or has ideas about how to mark pages and so on.

I hope I explained it well enough, and thanks a lot for any input!
Cheers.
aliak.net
D. J. Bernstein's cdb is a bit like this, and redis works somewhat similarly with its append-only journal file too (and quite possibly a few other programs do as well).

When the constant-db/append-only-journal/whatever needs to be modified/advanced/invalidated, these tools create a new optimized/packed/garbage-collected copy, and atomically swap files when done.

It sounds primitive, and it is. But it works well. Simple is good.
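
Roughly, the swap part looks like this (assuming POSIX rename() semantics, where rename atomically replaces the target on the same filesystem; copyLivePages is just a placeholder for the rewrite step):

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Placeholder: walk the newest meta page's tree in srcPath and append every
// reachable page into dstPath, dropping everything unreachable.
void copyLivePages(const std::string& srcPath, const std::string& dstPath) {
    // ... open srcPath, find the latest meta page, copy the live pages ...
    (void)srcPath; (void)dstPath;
}

// Compact by rewriting live pages into a temporary file, then atomically
// swapping it over the original with rename(2). A reader that already has
// the old file open keeps seeing it through its file descriptor even after
// the rename; only new opens see the compacted file.
void compact(const std::string& dbPath) {
    const std::string tmpPath = dbPath + ".compact";
    copyLivePages(dbPath, tmpPath);
    if (std::rename(tmpPath.c_str(), dbPath.c_str()) != 0) {
        throw std::runtime_error("atomic swap failed");
    }
}
```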
Thanks for those mentions, I hadn't heard of them before. So they compact the file in a separate thread. That is one option I am looking at, but I would like to avoid it due to DRAM constraints.
aliak.net
