|
MemcachedSlabAllocator
Memcached Slab AllocatorThe code for memcached's famous slab allocator, or slabber, lives in the files slabs.h and slabs.c. The slab allocator does all the memory-management heavy lifting for item data in memcached, and I assume you know what a slab allocator is at a high handwave level, but just wanted to learn more about memcached's implementation. (More links about slab allocators are at the bottom of this writeup.) When you start memcached with a max memory parameter ("-m" on the command-line), for example, it's the slab allocator that knows and tracks when you've hit that max memory limit. Memcached doesn't use the slab allocator for all its memory needs, just for item data (the key/value/length/flags/expiration/cas data that's tracked in the 'item' data structure). Throughout the memcached codebase, you'll see plenty of calls to malloc/calloc/realloc, not to the slab allocator. Memcached will use good old malloc() for lots of the miscellaneous stuff such as for the hashtable buckets or connection tracking data structures. Although the slab allocator is tuned and exists mostly for memcached's item data, it's actually 99% general-purpose. The 1% is because you'll find a handful of references in the slabs.c file to the higher-level "item" application-specific data structure. This dependency can be easily refactored away so that the slab allocator could become truly general purpose, perhaps another day. Slab Allocator InterfaceThe main interface to the slab allocator at runtime are with these functions... unsigned int slabs_clsid(const size_t size); void *slabs_alloc(size_t size, unsigned int id); void slabs_free(void *ptr, size_t size, unsigned int id); ASIDE: Actually, when you look at slabs_alloc() and slabs_free(), you'll see that they just do some locking and call these respective functions in slabs.c, which do the real work... void *do_slabs_alloc(const size_t size, unsigned int id); void do_slabs_free(void *ptr, const size_t size, unsigned int id); This foo_bar() and do_foo_bar() naming and coding pattern is pretty common in memcached, where a foo_bar() function merely acquires a lock, invokes do_foo_bar(), and then releases the lock. This naming pattern is an artifact from an earlier point in memcached's history when you could choose to have multithreading support or just be single-threaded. Nowadays, memcached always runs multithreaded. See the thread.c file for more of this pattern in usage. Anyways, when other parts of code in memcached want to allocate memory for an item that has a certain item_size, they call... unsigned int clsid; item *it; clsid = slabs_clsid(item_size); it = slabs_alloc(item_size, clsid); And, to free an item, they call... slabs_free(it, item_size, clsid); Not quite as simple as malloc() and free() but good enough. ANOTHER ASIDE: if you want memcached to use malloc() instead of the usual slab allocator, just define USE_SYSTEM_MALLOC and recompile. When you do that, do_slabs_alloc() and do_slabs_free() will be compiled with code paths that just uses malloc() and free(), respectively. Perhaps you're itching to prove that the hoard, jemalloc or other memory allocator scheme works better with memcached, and this would be your simple pathway to do so when you have a malloc-signature-compatible alternative. What's a clsid?It's a 'slab class id', so let's talk about slabclass, slabs, chunks, chunk size growth factors, and calling malloc() as little as possible. Because normal malloc() is slow and prone to memory fragmentation, we don't want to use malloc() to get new memory every time memcached wants to remember an item. Instead, the strategy is to invoke malloc() as little as possible, and that's what the slab allocator does by initially calling malloc() to allocate a big hunk (or 'slab') of memory. The slab allocator then divvies up that slab into smaller 'chunks', where these chunks are all the same size in a slab in order to keep things simple. When memcached needs memory to remember an item, the slab allocator can just peel off the next unused chunk from a slab. Since all the chunks are the same size, this is just simple and fast pointer math to get the next chunk. Eventually, the slab allocator will run out of unused chunks and will need to call malloc() to allocate another big slab, but for the most part the slab allocator can avoid calling malloc() so much. You might have realized, of course, that the size of a chunk might actually be bigger than the item data size that we're trying to remember, since the chunk size is fixed for a slab. In short, the slab allocator is trading off some wasted space to get better performance. Every slab, too, is the same size, 1MB. (Actually, memcached has a nice optimization here that makes this not quite true, but let's keep it simple for now.) It's the chunk size that requires an extra layer to track, and that layer is called a 'slabclass'. Here's a high-level containment hierarchy that we'll use to talk about this... slab-allocatorslabclassslabchunk A slab-allocator manages a fixed number of slabclass structs. Each slabclass struct is assigned to manage a unique chunk size. A slabclass struct can track zero or more slabs, where all the slabs in that slabclass will have the same, exact chunk size. And a slab, finally, is divvied up into 1 or more chunks of the same chunk size. (Sometimes, you'll see the codebase use the word 'page' instead of slab. Sometimes, it mixes up the word 'chunk' for other meanings. Sometimes, it uses 'bucket' instead of slabclass. Just look at the code/comment context to figure it out.) Why are there only a fixed number of slabclass structs? Because the slab allocator uses an array for quick O(1) lookup. The array of slabclass structs is sorted from smallest chunk size to largest chunk size. The array index is the same as the slabclass's id, or the 'clsid'. When memcached starts up, it initializes the slab allocator by calling slabs_init() with a max memory limit (the "-m" command-line parameter) and a chunk size growth factor (the "-f" command-line parameter). The default chunk size growth factor is 1.25. Ths slab allocator just runs through its array of slabclass structs, assigning chunk sizes using the chunk size growth factor. To get the most out of memcached, you'll want to tune this chunk size growth factor to avoid wasting memory. How is slabs_free() handled? Each slabclass also has an array of pointers to free, unused chunks. The free array (actually a stack, although the code calls it a list) is consulted first and popped during slabs_alloc(). If there are no free, unused chunks, a new slab is allocated. By default, the slab allocator only creates new slabs lazily, on an as-needed basis, and does not free them. But...EXCEPT 1: there's a DONT_PREALLOC_SLABS flag, which is set by default. If you unset it, the slab allocator will create and assign a single slab to each slabclass on startup. EXCEPT 2: there's an interesting -L command-line flag that makes memcached and the slab allocator preallocate the max amount of memory (specified with the '-m' parameter) right away on startup, by doing a single gigantic malloc() call. This can be a good thing because it reduces internal malloc() overhead (malloc needs data structures to manage all of the 1M requests done otherwise), and grabbing all the resources "up front" means not getting any failing enomem errors later on. Currently (2009/01/20 - rewritten/bin branch), the -L feature only appears to work on platforms that support HAVE_GETPAGES_SIZES and HAVE_MEMCNTL, such as Solaris. With Solaris's support of large, multiple page sizes, the -L flag can help ("not much but a tiny little bit") by reducing TLB misses. Trond Norbye, who worked on these Solaris improvements, is working to make the preallocate part of the -L flag available for all systems. Finally, earlier, I wrote that slabs can be simply thought of just being 1MB in size. In reality, since not all chunk sizes fit perfectly into 1MB, the slab allocator has an optimization that will allocate the closest multiple of chunk size to 1MB, so there's no wasted slop. For example, if the chunk size is 400K, the slab allocator will only create a slab of size 800K, which can hold two chunks (400K + 400K), since 1200K goes over the 1MB limit. The previous chunk-size-multiple optimization is the default behavior. You can switch it off, though, as a tradeoff in order to gain an interesting, underdocumented feature: the ability to reassign slab memory from one slabclass to another slabclass at runtime. To do so, recompile with the ALLOW_SLABS_REASSIGN flag. After that, you can send memcached messages like 'slabs reassign (src_slabclass) (dest_slabclass)'. This feature can be useful, as long running memcached instances will eventually allocate its max memory into slabs. Unfortunately, slabs allocated early in memcached's process life might over time be effectively in the wrong slabclass. Imagine, for example, that you store session data in memcached, and memcached has been up and running for months. Finally, you deploy a new version of your application code, which stores more interesting information in your sessions -- so your session sizes have grown. Suddenly, memcached starts thrashing with huge amounts of evictions. What might have happened is that since the session size grew, the slab allocator needs to use a different slabclass. Most of the slabs are now sitting idle and unused in the old slabclass. The usual solution is to just restart memcached, unless you've turned on ALLOW_SLABS_REASSIGN. Back to: MemcachedInternals To learn more about slab allocators...
Author: steve.yen, with thanks to trond for the feedback, improvements and pointers. |
Sign in to add a comment