
high-concurrency-btree
NOTE: as of January 2014 Google has stopped supporting new downloads. Please see the GitHub repository for download sources at https://github.com/malbrain/Btree-source-code. The latest memory-mapped B-tree code there is threadskv10h.c; the latest file I/O code is btree2s.c.
Also note that all of the versions on Google Code have broken deletekey algorithms that are fixed on the GitHub site.
Implement Lehman and Yao's high concurrency B-Tree in the C language with Jaluta's balanced B-link tree operations, including a locking enhancement for increased concurrency.
The methods presented by Jaluta compromise concurrency. As an update search proceeds down the B-tree, U latches on the child nodes are obtained, the child nodes are read into memory from disk, and structure modifications are possibly issued, all while the parent node's U latch is held. Because that latch is held across disk accesses, only one update scan at a time is supported into any area of the B-tree under the parent node. The latch protocol and B-tree methods presented here address this compromise. Also, deferring structure modifications to the next update search requires extra complications in the method implementations. For comparison, an exact reference implementation of Jaluta's balanced B-tree algorithms is included on the download page; it is about 50% more lines of code.
Instead of the hierarchical S-U-X BTree latch modes, there are three independent page latches. This is accomplished using advisory file locking from the underlying O/S over three independent sub-segments of each node’s address space:
- The first segment holds a sharable AccessIntent lock and an exclusive NodeDelete lock.
- The second segment holds the classic sharable ReadLock and exclusive WriteLock for the node.
- The third segment holds an exclusive ParentModification lock, which is held while a node's fence value is posted to or removed from its parent node (a structure modification operation, SMO) during a split or consolidation. It is obtained with lock-coupling over the page's WriteLock. This ensures that these structure modifications are serialized with respect to one another, while allowing release of the WriteLock on the node for increased concurrency when there is only one SMO in progress.
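As a concrete illustration of the three sub-segment scheme, the sketch below shows one way the latch sets could be mapped onto POSIX advisory record locks (fcntl), reserving one byte of the node's file range per lock set. The offsets, names, and enum are assumptions made for illustration, not the offsets used by the actual file I/O code.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

typedef enum { LockSetAccess = 0, LockSetRead = 1, LockSetParent = 2 } BtLockSet;

/* acquire a latch: shared = F_RDLCK (AccessIntent, ReadLock),
 * exclusive = F_WRLCK (NodeDelete, WriteLock, ParentModification) */
static int bt_lockpage (int fd, off_t page_off, BtLockSet set, int exclusive)
{
    struct flock lk = {0};

    lk.l_type = exclusive ? F_WRLCK : F_RDLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = page_off + set;    /* one byte per lock set within the page */
    lk.l_len = 1;

    return fcntl (fd, F_SETLKW, &lk);   /* wait until the latch is granted */
}

static int bt_unlockpage (int fd, off_t page_off, BtLockSet set)
{
    struct flock lk = {0};

    lk.l_type = F_UNLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = page_off + set;
    lk.l_len = 1;

    return fcntl (fd, F_SETLK, &lk);
}
```

Because the three byte ranges never overlap, a holder of one lock set never blocks requests against the other two sets on the same node.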
This enhancement allows symmetric Blink node consolidation and reuse with relative simplicity compared to previous approaches.
An additional improvement over Jaluta's BLink node layout is the introduction of a delete bit on each key value. This permits deleting a node's fence key without having to physically remove it from the node, thus obviating the need to maintain a second copy of fence values in the nodes.
Node Consolidation
Empty nodes are removed in three stages. First, a half-merge is performed by moving the right node's contents into the empty left node; the link pointer of the right node is set to point to the left node and its delete bit is set. Second, the left node's fence key is removed from its parent node and the right node's fence key is updated with the left node's pointer. Finally, any remaining readers of the right node are drained by obtaining its NodeDelete and WriteLock latches, and the deleted right node is placed on the free list as an available node.
The desired effect is enabling and simplifying empty node consolidation and reuse at minimal cost while maintaining the level of concurrency of the original BLink method.
Key Size and Node Layout
The node size is specifiable in bits, from 9 (512) thru 20 (1 MB) and stores variable sized keys, each from 1 thru 255 bytes in length. The node has a delete bit which redirects searches to the link (left) node. Each key has a deleted bit which allows keeping the fence key value on the page after it's deleted. The space in the node for deleted keys is reclaimed during node cleanup.
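For orientation, a layout along these lines might look as follows. The field names and widths are illustrative assumptions, not the structures from the source files.

```c
typedef unsigned char uchar;

/* variable sized key entry, 1..255 bytes of key material */
typedef struct {
    uchar len;          /* key length in bytes */
    uchar dead:1;       /* per-key delete bit: the key (possibly the fence) is
                           logically deleted but still physically present */
    uchar key[];        /* key bytes follow */
} BtKeySketch;

/* page header for a node whose total size is 1 << bits bytes */
typedef struct {
    unsigned int cnt;           /* number of key slots in the node */
    unsigned int act;           /* number of slots not marked dead */
    unsigned int min;           /* offset of the lowest key byte (reclaimed
                                   for dead keys during node cleanup) */
    unsigned char bits;         /* page size in bits, 9 (512B) .. 20 (1MB) */
    unsigned char dead:1;       /* node delete bit: searches follow the link
                                   pointer (the left node after a half-merge) */
    unsigned char lvl;          /* level in the tree, 0 = leaf */
    unsigned long long right;   /* link (sibling) page number */
} BtPageSketch;
```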
Buffer Pool Manager
A buffer pool manager for memory mapped segments of pages is included that utilizes the underlying O/S file mapping interface. Up to 65536 segments are managed (this is a linux and WIN32 system limitation). The Buffer Manager will also run without a pool or file mapping, using standard file system reads and writes when the pool size is set to zero segments. Segment size is set to the number of desired pages in bits (a setting of 4 means 16 pages). With a large enough segment size it is possible to entirely map the Btree into virtual memory. Note that file mapping is not supported over network file systems.
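The sketch below shows the core of that arrangement: mapping the segment that covers a requested page, given the page size and segment size in bits described above. It is a minimal illustration with assumed names, not the project's buffer manager.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdint.h>

typedef struct {
    uint64_t base_page;   /* first page number covered by this segment */
    char *map;            /* mmap'ed address, or NULL when file I/O is used */
} BtPoolSlot;

/* map the pool segment containing page_no and return the page's address;
 * page_bits = page size in bits, seg_bits = pages per segment in bits */
static char *pool_map_segment (int fd, uint64_t page_no,
                               unsigned page_bits, unsigned seg_bits,
                               BtPoolSlot *slot)
{
    uint64_t base = page_no >> seg_bits << seg_bits;     /* segment start page  */
    size_t size = (size_t)1 << (page_bits + seg_bits);   /* bytes per segment   */
    off_t off = (off_t)base << page_bits;                /* byte offset in file */

    slot->map = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    if (slot->map == MAP_FAILED)
        return NULL;

    slot->base_page = base;
    return slot->map + ((page_no - base) << page_bits);  /* address of the page */
}
```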
In the single process/multi-threaded implementations the buffer pool manager is shared by multiple threads and utilizes a combination of "lock-striping" and GCLOCK in the buffer pool manager's hash table to reduce contention for the manager's resources.
Note that an optimized-for-linux version of bt_newpage from threads2i.c can be retro-fitted into the other program versions for a 50% improvement in some tests.
Recommended Parameter Values
From experimental trials, when there is enough memory to fit the Btree into physical memory, the program runs best with the buffer pool manager set to 32768 segments, and the page size set to 16 bits. 10 million 32 byte random hex keys (330 MB) index in 97 seconds on a linux64 system or 131 seconds on a linux32 system.
The 28.7 million key, 304 MB distinct_1 dataset from www.naskitis.com, indexes in 303 seconds on a linux64 system with a 2GB page buffer pool (32768 segments with one page per segment at 65536 bytes each). A search of the distinct_1 dataset with the skew1_1 dataset, 177 million keys in a 1GB file, requires 1184 seconds. For comparison, the jaluta program requires 255 seconds to build, and 849 seconds to search, when run with the same parameters. Of these times, approximately 1/2 is taken by the lock and unlock linux system calls when creating the Btree, and 3/4 when finding keys.
After the size of the Btree exceeds the size of physical memory, the time spent accessing the disk for leaf nodes dwarfs these considerations.
64 Bit Operational Notes
When run on a 64 bit operating system it is possible to configure the buffer pool manager to bring the entire B-tree into the address space, thus obviating the need to unmap and remap file segments to cover all of the Btree pages. The buffer pool segment size is configured by the number of pages per segment in bits. Thus a setting of 14 for this parameter together with a page size of 16 bits means that each of the 32768 buffer pool segments will be 1GB in size, so a Btree of up to 32TB can be managed without the cost of repeatedly mapping and remapping segments.
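The arithmetic behind those figures can be checked directly; the variable names below simply mirror the parameters described above.

```c
#include <stdio.h>

int main (void)
{
    unsigned page_bits = 16, seg_bits = 14;       /* 64KB pages, 16K pages/segment */
    unsigned long long segs = 32768;              /* buffer pool segments */
    unsigned long long seg_bytes  = 1ULL << (page_bits + seg_bits);
    unsigned long long tree_bytes = segs * seg_bytes;

    /* prints: segment = 1GB, mappable btree = 32TB */
    printf ("segment = %lluGB, mappable btree = %lluTB\n",
            seg_bytes >> 30, tree_bytes >> 40);
    return 0;
}
```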
Multiple Process/Multiple Thread Version
A multi-process/multi-thread version is available on the download page. The B-tree page latch operating-system advisory locking calls have been replaced by much faster test-and-set built-in gcc calls (__sync_fetch_and_add, __sync_fetch_and_or) on linux and similar calls on Windows. The -D STANDALONE command line interface has been re-ordered to allow for multiple input file names which are processed by independent threads in parallel. The page latch sets have been located on the pages themselves to permit multi-process applications to share access to the B-tree. The resulting lock manager does not guarantee fairness to the contending threads.
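To make the replacement concrete, here is a minimal sketch of a page-resident reader/writer spin latch built from those builtins (plus __sync_fetch_and_and and __sync_fetch_and_sub for release). It illustrates the approach rather than reproducing the exact latch code in the download files, and like the real lock manager it makes no fairness guarantee.

```c
#include <sched.h>

/* one latch word per lock set, resident in the btree page itself */
typedef struct {
    volatile unsigned int rw;   /* bit 0 = writer bit, bits 1.. = reader count */
} BtSpinLatch;

#define XCL 1u                  /* exclusive (writer) bit */
#define RDR 2u                  /* one reader increment   */

void bt_spinwritelock (BtSpinLatch *latch)
{
    while (1) {
        /* try to claim the writer bit */
        if (!(__sync_fetch_and_or (&latch->rw, XCL) & XCL)) {
            while (latch->rw & ~XCL)        /* wait for readers to drain */
                sched_yield ();
            return;
        }
        sched_yield ();                     /* another writer holds the latch */
    }
}

void bt_spinreleasewrite (BtSpinLatch *latch)
{
    __sync_fetch_and_and (&latch->rw, ~XCL);
}

void bt_spinreadlock (BtSpinLatch *latch)
{
    while (1) {
        /* optimistically register as a reader */
        if (!(__sync_fetch_and_add (&latch->rw, RDR) & XCL))
            return;                               /* no writer: latch granted */
        __sync_fetch_and_sub (&latch->rw, RDR);   /* back off and retry */
        sched_yield ();
    }
}

void bt_spinreleaseread (BtSpinLatch *latch)
{
    __sync_fetch_and_sub (&latch->rw, RDR);
}
```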
Note: linux version 2.6 processes mmap calls serially among threads, hobbling the buffer pool manager in multi-threaded applications with inadequate resources. Ensure the buffer pool manager is configured with a page size, pool segment size, and enough pool slots so that remapping is not required during multi-threaded operation.
Foster B-trees
An experimental implementation that increases concurrency by utilizing Foster child relationships between siblings was inspired by Goetz Graefe's work. The basic Foster B-tree mechanism has been enhanced by allowing more than one foster child to accumulate in a queue in a sibling node while they are being posted into permanent parent nodes, instead of the single foster child relationship in a single chain as described by Graefe. Also, the sibling right pointers are maintained to facilitate enumerating keys in order from the B-tree.
Fosterbtreee.c on the download page utilizes pthread rwlock programming (Slim Reader/Writer locks on WIN32) in latches managed by a latch manager that utilizes the first few pages of the B-tree to host a fixed number of latches. Fosterbtreef.c is similar but has test-and-set spin latches hosted in the latch manager's B-tree pages. The default number of latches is 128.
The test scenarios below give foster btrees an advantage: newly created pages of keys do not need to be mapped immediately after creation, as they do for parent modification in the threads2h code. Foster btrees issue their parent modification from the left sibling rather than the right (new-page) sibling.
Detailed Benchmarks
On a 64 bit linux 2.6 system with 4 cpu cores and 8 GB RAM, the following benchmark results were obtained using the skew1_1 thru skew5_1 datasets from www.naskitis.com. These files are comprised of 752495240 keys, of which 1401774 are unique.
```
time threads2g tst.db w 14 32768 4 0 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1

program            buffer params  btree size  elapsed/user/system    notes
threads2j          14 32768 4     52MB          8:06/25:13/04:19     5 threads
threads2i          14 32768 4     52MB          9:14/30:36/03:28     5 threads
threads2h          14 32768 4     52MB          6:13/23:03/00:11     5 threads
fosterbtreee       14 32768 4     52MB          6:59/22:36/03:06     5 threads
fosterbtreef       14 32768 4     52MB          6:31/21:39/02:08     5 threads
fosterbtreeg       14 32768 4     52MB          8:18/24:43/05:16     5 threads
btree2r (file-io)                 52MB        125:58/31:00/87:00     1 process
threads2i          14 32768 4     52MB          7:14/25:24/01:20     5 processes
threads2h          14 32768 4     52MB          4:59/15:20/00:07     5 processes
fosterbtreee       14 32768 4     52MB         19:44/74:33/01:43     5 processes
btree2r            14 32768 4     52MB        365:03/41:00/835:00    5 processes
jaluta2            14 32768 0     51MB         88:00/50:00/225:00    5 processes
```
Reverse line numbers were appended to the skew1_1 dataset lines and the resulting file of 177999203 distinct keys was inserted 5 times in parallel. Note that the 4 duplicate inserts for each key had no effect, and that the total btree size exceeds memory capacity. The first 6 tests involved 5 threads, then 10 and 50:
```
time threads2g tst.dbmax w 14 16384 6 1 skew1_1 skew1_1 skew1_1 skew1_1 skew1_1

program        buffer params  btree    elapsed/user /system    pool/page/mmap/threads
threads2j      14 16384 6     10.8GB    17:15/ 53:29/ 07:10    16GB/16K /1MB /5
threads2i      14 16384 6     10.8GB    20:36/ 58:51/ 12:17    16GB/16K /1MB /5
threads2h      14 16384 6     10.8GB    18:24/ 62:28/ 05:31    16GB/16K /1MB /5
fosterbtreee   14 16384 6     10.8GB    16:36/ 50:29/ 06:25    16GB/16K /1MB /5
fosterbtreef   14 16384 6     10.8GB    16:10/ 50:37/ 06:18    16GB/16K /1MB /5
fosterbtreeg   14 16384 6     10.8GB    18:36/ 54:00/ 08:49    16GB/16K /1MB /5
threads2h      12 16384 8     11.2GB    20:19/ 70:44/ 02:21    16GB/ 4K /1MB /5
fosterbtreee   12 16384 8     11.2GB    15:23/ 50:39/ 04:01    16GB/ 4K /1MB /5
threads2i      12 16384 8     11.2GB    36:10/114:45/ 26:06    16GB/ 4K /1MB /10
threads2h      12 16384 8     11.2GB    53:41/207:15/ 05:09    16GB/ 4K /1MB /10
fosterbtreee   12 16384 8     11.2GB    24:41/ 87:25/ 09:15    16GB/ 4K /1MB /10
threads2i      12 16384 8     11.2GB   187:32/594:21/139:30    16GB/ 4K /1MB /50
threads2h      12 16384 8     11.2GB   158:32/518:21/101:30    16GB/ 4K /1MB /50
fosterbtreee   12 16384 8     11.2GB   189:16/528:37/208:58    16GB/ 4K /1MB /50
fosterbtreef   12 16384 8     11.2GB   197:35/537:39/237:33    16GB/ 4K /1MB /50
```
Next, the 28772169 distinct keys in the distinct_1 dataset, with unique line numbers appended across 5 inserts for a total of 143860845 unique keys, were inserted in parallel by 5 threads with the following results:
```
time threads2h tst.dis w 14 8192 6 1000000000 distinct_1 distinct_1 distinct_1 distinct_1 distinct_1

program        buffer params  btree size  elapsed/user /system   notes
threads2j      14 8192 6      7.5GB       13:25/13:06/04:33      8GB pool, 16K pages
threads2i      14 8192 6      7.5GB       11:31/14:46/04:54      8GB pool, 16K pages
threads2h      14 8192 6      7.5GB       27:50/17:33/13:28      8GB pool, 16K pages
fosterbtreee   14 8192 6      7.5GB       12:31/10:57/03:46      8GB pool, 16K pages
fosterbtreef   14 8192 6      7.5GB       12:26/12:35/15:25      8GB pool, 16K pages
fosterbtreeg   14 8192 6      7.5GB       11:07/13:31/04:52      8GB pool, 16K pages
threads2i      12 8192 8      7.5GB       10:17/16:30/04:31      8GB pool, 4K pages
threads2h      12 8192 8      7.7GB       23:23/24:43/22:13      8GB pool, 4K pages
fosterbtreef   12 8192 8      7.7GB        9:26/11:40/14:11      8GB pool, 4K pages
fosterbtreeg   12 8192 8      7.7GB        9:23/12:47/03:55      8GB pool, 4K pages
```
To test the raw reading metrics, the distinct_1 data set was indexed with no key modifications, and the resulting btree was searched with the skew data sets. Construction times are given here:
```
time fosterbtreef tst.dis w 14 8192 4 0 distinct_1

program        buffer params  btree   elapsed/user /system   pool/page/mmap/threads
fosterbtreeg   14 8192 4      1.1GB   1:56/1:47/00:04        2GB/16K /262K/1
fosterbtreef   14 8192 4      1.1GB   1:47/1:36/00:04        2GB/16K /262K/1
fosterbtreee   14 8192 4      1.1GB   1:48/1:39/00:04        2GB/16K /262K/1
threads2j      14 8192 4      1.1GB   1:54/1:43/00:04        2GB/16K /262K/1
threads2i      14 8192 4      1.1GB   1:57/1:52/00:04        2GB/16K /262K/1
threads2h      14 8192 4      1.1GB   1:52/1:46/00:04        2GB/16K /262K/1
```
with search times given here:
```
time fosterbtreef tst.dis f 14 8192 4 0 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1

program        buffer params  btree   elapsed/user/system   pool/page/mmap/threads
fosterbtreeg   14 8192 4      1.1GB    8:55/28:17/3:20      2GB/16K /262K/5
fosterbtreef   14 8192 4      1.1GB    7:20/25:08/1:09      2GB/16K /262K/5
fosterbtreee   14 8192 4      1.1GB    8:14/27:25/2:04      2GB/16K /262K/5
threads2j      14 8192 4      1.1GB    8:40/28:52/3:12      2GB/16K /262K/5
threads2i      14 8192 4      1.1GB    9:54/31:25/4:49      2GB/16K /262K/5
threads2h      14 8192 4      1.1GB   10:32/38:34/0:05      2GB/16K /262K/5
```
Finally, the 3.9GB genome dataset with 77589282 50-character lines from http://hgdownload.cse.ucsc.edu/goldenPath/dasNov1/bigZips/scaffold.fa.gz was split into 5 equal pieces and indexed by 5 threads:
```
time threads2h tst.dbgenome w 14 32768 4 0 genaa genab genac genad genae

program        buffer params  btree   elapsed/user/system   pool/page/mmap/threads
threads2j      14 32768 4     4.2GB   7:37/7:11/2:02        8GB/16K /262K/5
threads2i      14 32768 4     4.2GB   6:48/7:46/2:25        8GB/16K /262K/5
threads2h      14 32768 4     4.2GB   6:22/7:58/1:55        8GB/16K /262K/5
fosterbtreee   14 32768 4     4.2GB   7:05/6:48/2:01        8GB/16K /262K/5
fosterbtreef   14 32768 4     4.2GB   7:50/6:48/2:14        8GB/16K /262K/5
fosterbtreeg   14 32768 4     4.2GB   6:53/7:07/1:56        8GB/16K /262K/5
```
Note that threads2h and threads2i have three test-and-set single-word latches hosted in each btree page for multi-process shared access. threads2h uses sched_yield to implement contended access, while threads2i uses a linux futex to implement each latch. The buffer manager for seg_bits < 3 was fixed in threads2h.
Fosterbtreee uses pthreads rwlock latches hosted by a latch manager utilizing the first few pages of the btree. Fosterbtreef uses a latch manager with test and set latches. Fosterbtreeg uses the same latch mechanism as threads2h. Other than the location and implementation of the latches, all of these programs share the same buffer manager code.
Contact Information
Please address any bugs, suggestions, or other concerns and feedback to the author, Karl Malbrain, malbrain@cal.berkeley.edu.
Acknowledgement
Goetz Graefe reviewed earlier versions of this work and made several observations which are incorporated.
Implementation Details
The basic idea is that there are five lock types for each node in three independent sets:
- (set 1) AccessIntent: Sharable. The node address has been obtained and will be read. Incompatible with NodeDelete.
- (set 1) NodeDelete: Exclusive. The node no longer has a pointer to it. Wait until all readers have relinquished AccessIntent. Incompatible with AccessIntent.
- (set 2) ReadLock: Sharable. Read the node. Incompatible with WriteLock.
- (set 2) WriteLock: Exclusive. Modify the node. Incompatible with ReadLock and other WriteLock requests.
- (set 3) ParentModification: Exclusive. Change the parent's keys. Incompatible with another ParentModification.
For Search:
A ReadLock is obtained for the root node. While descending the tree, the ReadLock on a parent is latch-coupled with obtaining AccessIntent on the child node. A ReadLock (or WriteLock for insert/delete at the target level) on the child node is then latch-coupled with the AccessIntent. The search ends with either a ReadLock or WriteLock held on the target node.
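A pseudocode-level sketch of that descent is shown below. The helper functions, the lock-mode enum, and the root page constant are assumptions made for illustration (declared extern rather than implemented), and the root is assumed to lie above the target level.

```c
typedef unsigned long long uid;
typedef unsigned int uint;
typedef unsigned char uchar;
typedef struct BtDb BtDb;
typedef struct { uint lvl; /* ... */ } *BtPage;

enum { LockAccess, LockRead, LockWrite, LockParent, LockDelete };

#define ROOT_page 1ULL                               /* assumed root page number */

extern void bt_lockpage (BtDb *bt, uid page_no, uint mode);
extern void bt_unlockpage (BtDb *bt, uid page_no, uint mode);
extern BtPage bt_mappage (BtDb *bt, uid page_no);            /* pin/map a page */
extern uid bt_findchild (BtPage page, uchar *key, uint len); /* separator or right link */

/* descend to the node at level lvl covering key, returning with either a
 * ReadLock or the requested lock held on it */
uid bt_descend (BtDb *bt, uchar *key, uint len, uint lvl, uint lock)
{
    uid page_no = ROOT_page;
    uint mode = LockRead;

    bt_lockpage (bt, page_no, mode);                 /* ReadLock the root */

    while (1) {
        BtPage page = bt_mappage (bt, page_no);

        if (page->lvl == lvl)
            return page_no;                          /* target level: hold mode */

        /* WriteLock only when the child is at the target level */
        uint childmode = (page->lvl == lvl + 1) ? lock : LockRead;
        uid child = bt_findchild (page, key, len);

        bt_lockpage (bt, child, LockAccess);         /* latch-couple: AccessIntent... */
        bt_unlockpage (bt, page_no, mode);           /* ...then release the parent    */

        bt_lockpage (bt, child, childmode);          /* Read/WriteLock the child      */
        bt_unlockpage (bt, child, LockAccess);       /* AccessIntent no longer needed */

        page_no = child;
        mode = childmode;
    }
}
```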
For Insert:
A key is added to a node under a WriteLock obtained during Search. The same insert module is used for both leaf and branch nodes.
For Split:
The ParentModification lock is obtained over the existing WriteLock (which stalls if a previous split/consolidation for this node is still underway), half the contents are split into a new node (which is not locked), and the WriteLock on the node is released. After the fence keys for the two siblings have been posted to the parent node(s), the ParentModification lock is released.
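Expressed with the assumed helpers and lock names from the descent sketch above (bt_splitcontents and bt_postfence are additional illustrative placeholders), the split protocol looks roughly like this:

```c
/* illustrative placeholders, not functions from the source files */
extern uid  bt_newpage (BtDb *bt);                 /* allocate the right sibling */
extern void bt_splitcontents (BtDb *bt, uid page_no, uid new_no);
extern void bt_postfence (BtDb *bt, uid page_no);  /* post fence key to parent   */

void bt_splitpage (BtDb *bt, uid page_no)
{
    /* caller already holds the WriteLock obtained during Search */

    bt_lockpage (bt, page_no, LockParent);    /* stalls if a previous SMO for
                                                 this node is still posting */

    uid new_no = bt_newpage (bt);             /* new right sibling, not locked */
    bt_splitcontents (bt, page_no, new_no);   /* move the upper half of the keys */

    bt_unlockpage (bt, page_no, LockWrite);   /* node is usable again while the
                                                 fence keys are being posted */

    bt_postfence (bt, page_no);               /* may recurse into upper levels */
    bt_postfence (bt, new_no);

    bt_unlockpage (bt, page_no, LockParent);  /* SMO complete */
}
```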
For Delete:
A key is removed from a node under a WriteLock obtained during Search. The same delete module is used for both leaf and branch nodes.
For Consolidation of an empty node:
A WriteLock is obtained for the right sibling node, its contents are copied into the empty left node under the existing WriteLock, and the right sibling's delete bit is set. The right sibling's link field is set to point to the left node, which will direct subsequent searches to the left node. A ParentModification lock is latch-coupled with the WriteLock for the left node (which will stall if another ParentModification is underway), and the WriteLock on the right node is also released. The old left node fence key is removed from its parent, and the old right node fence key is updated in its parent to point to the left node. The ParentModification lock on the left node is released. A NodeDelete and WriteLock are obtained for the right node. There are now no pointers and no readers to the right node, and it is placed onto a free-list for use in subsequent splits. The NodeDelete and WriteLock are then released on the now-free node.
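The same consolidation steps, again using the assumed helpers from the sketches above (the half-merge, fence, and free-list calls are illustrative placeholders):

```c
/* illustrative placeholders, not functions from the source files */
extern void bt_halfmerge (BtDb *bt, uid left, uid right);
extern void bt_removefence (BtDb *bt, uid left);
extern void bt_updatefence (BtDb *bt, uid right, uid left);
extern void bt_freepage (BtDb *bt, uid page_no);

void bt_consolidate (BtDb *bt, uid left, uid right)
{
    /* caller holds the WriteLock on the empty left node from Search */

    bt_lockpage (bt, right, LockWrite);

    bt_halfmerge (bt, left, right);          /* stage 1: copy right into left, set */
                                             /* right's delete bit, and point its  */
                                             /* link field at the left node        */

    bt_lockpage (bt, left, LockParent);      /* latch-couple PM over the WriteLock */
    bt_unlockpage (bt, left, LockWrite);
    bt_unlockpage (bt, right, LockWrite);

    bt_removefence (bt, left);               /* stage 2: drop old left fence key     */
    bt_updatefence (bt, right, left);        /* repoint right fence at the left node */

    bt_unlockpage (bt, left, LockParent);

    bt_lockpage (bt, right, LockDelete);     /* stage 3: drain remaining readers */
    bt_lockpage (bt, right, LockWrite);
    bt_freepage (bt, right);                 /* no pointers or readers remain    */
    bt_unlockpage (bt, right, LockWrite);
    bt_unlockpage (bt, right, LockDelete);
}
```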
The bottom line is that structure modifications to the upper tree levels are serialized, while the node itself remains unlocked for subsequent reads and writes as the structure modification proceeds. Except for Consolidation (and during latch-coupling), only one node is locked at a time.
Latch Coupling
At several points in the implementation, latch coupling is used to acquire a new lock on a node and release another. During splits/consolidations this overlap is extended to include additional node processing before the previous lock is released.
Lock Sets
The three lock sets each cover only a part of each node. Thus it is possible to obtain a ParentModification lock over a WriteLock, and release the WriteLock later while continuing to hold the ParentModification. The locks in each set are independent from the locks in the other two sets on each node. They are not exactly equivalent to traditional "lock modes" because it is possible to hold two locks on a node simultaneously, one from each set.
Compatibility Matrix
A “y” (yes) entry means that two locks of the types specified by the row and column labels for that entry can be simultaneously held on a data item by two different btree operations, i.e., the two lock types do not conflict. An “n” (no) entry means that the two lock types cannot be concurrently held on a data item by distinct operations, i.e., the lock types conflict.
                     AI  ND  RL  WL  PM
AccessIntent          Y   N   Y   Y   Y
NodeDelete            N   N   Y   Y   Y
ReadLock              Y   Y   Y   N   Y
WriteLock             Y   Y   N   N   Y
ParentModification    Y   Y   Y   Y   N
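For reference, the same matrix can be written down as a lookup table in C; the enum and function below are illustrative only and are not taken from the source files.

```c
/* the compatibility matrix above: 1 = compatible ("y"), 0 = conflicting ("n") */
enum { AI, ND, RL, WL, PM, NLOCKS };

static const int compat[NLOCKS][NLOCKS] = {
    /*          AI  ND  RL  WL  PM */
    /* AI */  {  1,  0,  1,  1,  1 },
    /* ND */  {  0,  0,  1,  1,  1 },
    /* RL */  {  1,  1,  1,  0,  1 },
    /* WL */  {  1,  1,  0,  0,  1 },
    /* PM */  {  1,  1,  1,  1,  0 },
};

/* a requested lock can be granted if it is compatible with every lock
 * currently held on the node by other operations */
static int lock_compatible (int held, int requested)
{
    return compat[held][requested];
}
```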
Costs and Mitigations
The extra cost in the protocol is the need to obtain two node locks at each level after the root during Searches instead of one in Jaluta's method. This is mitigated by not having to perform latch upgrades during inserts or deletes. Note that the multiple-thread version reduces latch cost to virtually zero.
Code API
BtDb *bt_open (char *name, uint mode, uint bits, uint map_pool_segments, uint pool_segment_size)
Create the btree object (BtDb) and open the file under the given access mode (BT_ro -- read-only, BT_rw -- read/write, BT_fl -- flush writes to disk). bits is the btree page size in bits (usually 12 to 16). map_pool_segments is the number of segments to be cached under memory mapping, or zero if file I/O is to be used instead of mmap. pool_segment_size is the number of pages in a pool segment, given in bits (powers of 2). Note that bits will be taken from an existing btree file, with the passed parameter ignored.
void bt_close (BtDb *bt)
Close the btree file and release the btree object.
BTERR bt_deletekey (BtDb *bt, uchar *key, uint len, uint lvl)
Delete the specified key from the btree at the specified level (lvl = 0 for a leaf key). This function is called recursively to remove keys from levels higher in the tree than the leaves. Returns BTERR_ok for success, or else an error code.
uint bt_findkey (BtDb *bt, uchar *key, uint len)
Find the given key in the btree, cache the btree page into bt->cursor, and return the page slot number of the key, or zero if it is not in the btree. Zero is also returned if an error occurs, with the error code in bt->err.
BTERR bt_insertkey (BtDb *bt, uchar *key, uint len, uint lvl, uid id, uint tod)
Insert/Replace the given key in the btree at the given lvl (normally zero) with the given row-id and time-of-day. This function is called recursively to insert fence values into upper levels of the btree.
uint bt_startkey (BtDb *bt, uchar *key, uint len)
Find the first key greater than or equal to the given key. Cache the btree page into bt->cursor and return the slot number, or zero if an error occurs.
uint bt_nextkey (BtDb *bt, uint slot)
Return the next slot in the cached page at bt->cursor, or move to the next page to the right. Return zero if an error occurs. Note that the last key in the btree is the two-byte stopper value 0xffff.
BtKey bt_key (BtDb *bt, uint slot)
Return the BtKey structure pointer for the given key slot on the cached bt->cursor page.
uid bt_uid (BtDb *bt, uint slot)
Return the 48 bit row-id for the given key slot on the cached bt->cursor page.
uint bt_tod (BtDb *bt, uint slot)
Return the tod value for the given key slot on the cached bt->cursor page.
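A hedged end-to-end usage sketch of this API follows. The header name, the assumption that BTERR_ok is zero, and the use of an empty low key to start the scan are illustrative guesses from the descriptions above rather than details verified against a particular source file.

```c
#include <stdio.h>
#include <string.h>
#include "btree.h"              /* assumed header exposing the API above */

int main (void)
{
    /* 16-bit (64KB) pages, 8192 mapped segments of 2^4 = 16 pages each */
    BtDb *bt = bt_open ("tst.db", BT_rw, 16, 8192, 4);
    uchar key[] = "sample-key";
    uint slot;

    if (!bt)
        return 1;

    if (bt_insertkey (bt, key, sizeof(key) - 1, 0, 1234, 0))
        fprintf (stderr, "insert error %d\n", bt->err);

    if ((slot = bt_findkey (bt, key, sizeof(key) - 1)))
        printf ("found at slot %u, row-id %llu\n",
                slot, (unsigned long long)bt_uid (bt, slot));

    /* enumerate keys in ascending order; the scan ends at the two-byte
       0xffff stopper key or on error (slot == 0) */
    for (slot = bt_startkey (bt, (uchar *)"", 0); slot; slot = bt_nextkey (bt, slot)) {
        BtKey ptr = bt_key (bt, slot);      /* key bytes for this slot */
        uint tod = bt_tod (bt, slot);       /* time-of-day stored with the key */
        (void)ptr; (void)tod;               /* ... process the entry ... */
    }

    bt_close (bt);
    return 0;
}
```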
Project Information
- License: New BSD License
- svn-based source control
Labels:
btree
concurrency
database
nodeconsolidation
blink
jaluta
foster