|
Project Information
Featured
Downloads
Links
|
It is faster (4x!) and scales better than FastMM. Also faster (3x) than TopMM, but equal scaling. Because FastMM does not scale at all, you could use TopMM (which scales very good). But TopMM is slower than FastMM (in a simple string test). I made a proof-of-concept to see if I could make a simple and very small and compact MM, which is not as bloated (or difficult to understand) as FastMM and TopMM, and also faster than these two. Of course it must scale on multi core CPU's. Release candidate, ready for testingThe v2 version is ready for testing:
20-03-2012: Version 2.12 released, more stable, interthread memory also works now. If anybody can optimize it further: please help! :-) Simple Benchmark12-10-2011: simple benchmark added, see results.txt ScaleMM2 is fastest, but TCmalloc performs good too (but does not release memory to Windows?). MSVCRT of Windows 7 is not bad. JeMalloc and Hoard are disappointing. Extended Benchmarkhttp://scalemm.googlecode.com/svn/trunk/Challenge/Results/MMBench_all.htm Old BenchmarkDemo test (# of thread with alloc + realloc + free) on a quad core PC (Win7).
As you can see, TopMM and ScaleMM1 have a horizontal line till 4 threads, so it scales perfectly because we have 4 cores! Note: benchmark done with initial ScaleMM1, it must be updated with the newest ScaleMM2 version! Programming considerationsI had the following coding priorities:
Some code parts are not nicely coded or logical ordered, but some randomizations gave better results (see optimizations below) OptimizationsOptimized with Intel VTune and some "trial and error" changes to see which approach gives fastest result. Current speed is the fastest I could get with Delphi code right now, more speed can be gained through pure assembly optimizations. I tried to apply the following optimizations:
I did not check the alignment, but adding some fillers only gives speed decrease so I assume it is the best the way it is now. Internal workingShort description how it works:
It has 2 fixed arrays of block lists, each block list contains blocks with fixed number of a fixed item size, e.g. 50 items of 32 bytes. It has mini blocks (32-256 bytes, with 32 byte step size) and small blocks (256-2048 bytes, 256 byte step size). Larger blocks are not handled yet. Calculation is simple: “requested memory size” div “item size”. For example: Memory reuse is efficient because we have 50 items in a block, so it is more common to have a freed item: a new item is only allocated once, and reused many times after that. Blocks are double linked to each other, so when one block is full, we can get fast and easily the next block. Also a block can be removed easily by linking the previous block to the next block (dynamic arrays are slower with adding (resizing) and removing (shifting items)). A fixed array of 50 items is faster than a linked list Benchmark 2FastCode MM Benchmark with D2010 verses ScaleMM 2.03: If someone can help to improve single thread speed: please! |
