IntroductionxvMalloc summary: (pdf) xvMalloc is memory allocator designed specifically for compcache project. It is based on TLSF allocator so we naturally get all its goodness :) - O(1) Alloc/Free - Except for cases where we have to go to system page allocator to get additional memory.
- Very low fragmentation - In all tests, xvMalloc memory usage is within 10% of “Ideal” allocator.
xvMalloc is not derived from TLSF implementation (as provided on its project page). Its written from scratch to cleanly incorporate changes described below. xvMalloc vs TLSFTLSF has some problems when used as-is: - It assumes that its dealing with virtual memory.
- It is designed to handle allocation for large size range 2^32) while our need is to handle very small size range, say, [32, 3/4*PAGE_SIZE. So, we can have very fine grained size lists for entire size range.
- Due to nature of TLSF, size difference between free lists increases rapidly as size increases (for objects > 1024b, free lists are separated by 128 bytes!)
- We have equally spaced (just 8 bytes!) free lists for entire size range. We still use two level bitmap to locate appropriate free list.
xvMalloc overcomes above problems and also makes some assumptions to improve performance and avoid unnecessary complexity. Assumptions: - Allocation size is always in range MAX and range (MAX – MIN) is small. For compcache, this range is, say, 3/4*PAGE_SIZE.
- This allows large number of freelists with small size difference resulting in lesser fragmentation.
- Memory being managed is not backed by any VA. Allocator records <pageNum, offset> for each object. So, we need to provide implementation for map/unmap functions that provide dereferenceable pointer given this <pageNum, offset> pair (kmap_high()).
- xvMalloc stores exact object size in object header (TLSF stores rounded-off size). This saves additional 2bytes per object since; otherwise, user has to store this value in beginning of data section (specific to compcache) and further store all compressed data with slow unaligned access.
Implementation Details- It uses same two level bitmap scheme as used by TLSF to manage freelists with changes to alloc/free as described above. TLSF details are well explained in this paper (pdf).
- Code:
- Checkout from repository: hg clone https://compcache.googlecode.com/hg/ compcache compcache (xvMalloc code in trunk/sub-projects/allocators/xvmalloc-kmod)
- Browse online here.
- Currently, its implemented in userspace for easy debugging and exp. Should be easy to port to kernel.
Limitations- No concern for scalability.
- Its used only from swap path, so it this really a problem?
- No concern for locality. Page selection for any allocation is function of only request size.
EvaluationAdditional performance numbers are here.
- Category 1: Tests with randomly generated workloads with each executing 1 million mallocs and frees.
| Data Summary | | | Ideal | xvMalloc | Diff | | Peak Memory Used (KB) | 2604 | 2824 | 8.4% |
- Category 2: With real workloads, we tend to have majority of allocations within a small range.
Following shows traces collected from actual desktop system with "generic" workload.
In above trace, compressed length 2600 is most common and also sizes < 64. We approximate this using random workload generator with bias for this size range.
Test result with above workload
| Data Summary | | | Ideal | xvMalloc | Diff | | Peak Memory Used (KB) | 2421 | 2548 | 5.2% |
Additional performance numbers are here.Links- xvMalloc: trunk/sub-projects/allocators/xvmalloc-kmod (webview)
- Allocator stress framework (used to generate above data): trunk/sub-projects/alloc_stress (webview)
- TLSF Allocator homepage.
BTW, "xv" vaguely stands for "x"->no, "v"->virtual - we don't need tons of virtual address space :)
|
How does its performance compares to TLSF? Is it faster? Is there any gain using it instead of TLSF? Are there any benchmarks comparing both allocators? When will it come to compcache-0.4? Will we have benchmarks comparing performance of before/after this new allocator?
I haven't yet done comparison with TLSF. Also, don't know when I will integrate this with compcache. I'll try to get some benchmarks once this is ported to kernel.
Will you try to get compcache merged in 2.6.28?
Not yet done speed tests, but xvMalloc is almost same as TLSF wrt space efficiency (almost always 1-2% better than TLSF). I plan to do speed tests soon (there shouldn't be any difference).
Not sure when I will push it for mainline.
A novice question, as I understand, xvMalloc is a pool based allocator. Can xvMalloc be used as a general purpose allocator replacing kmalloc in case my kernel driver does many smaller but mostly similar sized allocations? I have gone through the doc which compares xvMalloc with SLOB and SLUB, I think it fits the bill in case.
I am going through the code currently, does xvMalloc take care of adjusting memory allocations from appropriate zones depending on the memory pressure? E.g. allocate from zone high mem only if zone normal is depleted etc?
As curiosity, I was just trying to compare Jemalloc Linux port and xvMalloc user port. Jemalloc claims very good performance in most cases and guarantees very less fragmentation. Did you evaluate it earlier?
> As curiosity, I was just trying to compare Jemalloc Linux port and xvMalloc user port. Jemalloc claims very good performance in most cases and guarantees very less fragmentation. Did you evaluate it earlier?
The case against jemalloc was clearly explained by John. Copying his response here: ---
AFAICT, Out-of-the-box jemalloc would be useless. From the excerpt below, jemalloc is optimized for allocations less than 512 bytes, while compcache will often want to allocate ~2k. Under compcache, jemalloc would essentially act as a power of 2 slab allocator. It wasn't obvious from skimming the paper whether it would be practical to modify jemalloc to have "quantum-spaced" allocations up to 4k, and I didn't check the source code.
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf page 5 --
Figure 4 shows the size classes for all allocation sizes.
Figure 4: Default size classes, assuming runtime defaults, 4 kB pages and a 16 byte quantum.
It would be simpler to have no subcategories for small allocations by doing away with the quantum- spaced size classes. However, most applications primarily allocate objects that are smaller than 512 bytes, and quantum spacing of size classes substantially reduces average internal fragmentation.