DB Design: AVL and B-Trees

Started by
4 comments, last by Emmanuel Deloget 17 years, 7 months ago
I've been reading up on AVL and B-Trees on Wikipedia, and I still haven't read anything that would make me think one is better than the other for DB operations. B-Trees seem to require less computations when inserting/deleting, while AVL seems to plainly be faster in the searches. Anyone of you guys has any experience with DB design and these data structures? What's your take on this?
Advertisement
Well, usually databases are quite big and so can't be loaded into computer memory all at once, but rather have to be stored on some external storage device (HD drive or similar) and be loaded by parts. Since access to external storage is usually very expensive (=slow) it is very important for DB's performance that there are as few accesses to that storage as possible.

Now, while searching a B-Tree may be a bit slower than searching an AVL tree if both are in stored in RAM memory, the opposite is true when you have to deal with slow external storage.


Let's see why:

Since an AVL tree is just a (balanced) binary tree, you have to make on average "log2 n" comparisons to find an element in the tree. Since the structure is such that you can't predict to which node a single comparison will send you to in advance, you have to access the external storage device "log2 n" times aswell.

Now, in a B-Tree a single node may have up to k children instead of just 2, and therefore the depth of a balanced B-Tree is "logk n". Since you have to find the correct child node to go to when searching for a key, you usually have to make up to "log2 k" comparisons at each node you visit (if you use binary search). The overall comparison complexity of searching a B-Tree is therefore "logk n * log2 k" = "log2 n" (ie. the same as that of an AVL Tree). But this time it is not necessary to also make the same number of accesses to external storage. Since the "link-table" of a single node is usually written sequentially on the storage device, you only have to access external storage at unpredictable (different) places the number of times you access a unique node - and that is on average "logk n" times (=the depth of the tree).

Since the binary search (or similar) through a single node's "link-table" in a B-Tree can be performed in memory, which is much faster than accessing a HD drive, the overall time complexity of the search operation becomes bounded only by external storage access. This in turn means that using a B-Tree instead of an AVL tree, when considering slow external storage, effectively reduces time complexity from "O(log2 n)" to "O(logk n)", where k can be chosen arbitrarily!


Just to illustrate:

Imagine you have find a certain string among n = 10.000 different keys that are stored in a DB. If you were to use a perfectly balanced binary tree to search for it you would have to access the database 13 times on average (an AVL tree can also produce approximately 1.44 times that, so around 19 times in the worst case).

But if you were to use a B-Tree with, say, k = 26 (the number of letters in the alphabet, for example) you would only have to access the database 3 times! This means that access to this database using a B-Tree would be between 4 to 6 times faster than if you were to use an AVL tree. And the difference only grows with the number of keys (=n) and child nodes (=k).

[Edited by - Devilogic on September 20, 2006 9:11:58 AM]
Just to clarify, B trees are optimized for the case where some or most of the tree resides on disk rather than in RAM. Its design allows you to adapt to the sector size of the hard drive and things like that, to provide optimal IO performance.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Woohoo! Smart people! I love it!

Me: What does the B in b-tree stand for?

Coworkers: binary?

Me:
Wiki says:
Quote:The B-tree's creators, Rudolf Bayer and Ed McCreight, have not explained what the B stands for. The most common belief is that B stands for balanced, as all the leaf nodes are at the same level in the tree. B may also stand for Bayer, or for Boeing, because they were working for Boeing Scientific Research Labs at the time.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Wiki says:
Quote:The B-tree's creators, Rudolf Bayer and Ed McCreight, have not explained what the B stands for. The most common belief is that B stands for balanced, as all the leaf nodes are at the same level in the tree. B may also stand for Bayer, or for Boeing, because they were working for Boeing Scientific Research Labs at the time.


Next time I'll see a crashed Boeing in a forest, I'll remember this [grin]

This topic is closed to new replies.

Advertisement