My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
HugeCollections  
Library to support collections with millions or billions of entries.
Updated Sep 21, 2011 by robausti...@gmail.com

When to use this library

If you want to efficiently store large collections of data in memory. This library can dramatically reduce Full GC times and reduce memory consumption as well.

When you have a data type which can be represented by an interface and you want a List for this type.

List<InterfaceType> list = new HugeArrayBuilder<InterfaceType>() {}.create();

The type needs to be described using an interface so its represention can be changed. (Using generated byte code)

The HugeArrayBuilder builds generated classes for the InterfaceType on demand. The builder can be configured to change how the HugeArrayList is created.

A more complex example is

HugeArrayList<MutableTypes> hugeList = new HugeArrayBuilder<MutableTypes>() {{
        allocationSize = 1024*1024;
        classLoader = myClassLoader;
    }}.create();

The source

The best way to get the whole source, including tests is to use subversion.

svn co http://vanilla-java.googlecode.com/svn/trunk/collections/ vanilla-java-collections
cd vanilla-java-collections
mvn test

How does the library differ from ArrayList

  • Uses long for sizes and indecies.
  • Uses column based data making the per element overhead minimal and speed up scans over a single or small set of attributes. This reduces memory usages by 2x or more.
  • Stores attributes in direct memory as much as possible, reducing heap usage dramatically, 10x or more.
  • Allocates blocks of data for consistent add() times, rather than exponentially growing an underlying store (with exponentially increasing delays on a grow of capacity)

Measures

Encoded and Primitive Fields

This uses the Sample Class, MutableTypes (see below). The class has 12 fields of various types and can be converted to primitives in different ways. By storing these primitives (or objects encoded as primitives), the time to perform GCs is almost eliminated.

Object type Average size Average write+read time Full GC Time for 250M elements
List of JavaBean 67 bytes 123 ns 23.3 seconds
HugeArrayList 34 bytes 129 ns 0.0089 seconds
Proxied 34 bytes 1,420 ns 0.0102 seconds

The same data is stored in all cases. The size of data in HugeArrayList is smaller as the data is stored in a more compact manner.

"Generated" uses ObjectWeb ASM to generate byte code.

"Proxied" uses the Proxy class and does not require any additional libraries.

Plain Object Fields

Data Types with reference Objects which cannot be encoded as a primitive are also supported, however the benefit is reduced. If none of the fields can be encoded.

Object type Average size Average write+read time Full GC Time for 250M elements
List of JavaBean 36.8 bytes 33 ns 24.3 seconds
HugeArrayList 16.2 bytes 48 ns 9.96 seconds

The same data is stored in both cases. The size of data in HugeArrayList is smaller as the data is stored in a more compact manner.

Simple example

interface MutableBoolean {
    public void setFlag(boolean b);
    public boolean getFlag();
}

// create a huge array of MutableBoolean
HugeArrayList<MutableBoolean> hugeList = new HugeArrayBuilder<MutableBoolean>() {}.create();
List<MutableBoolean> list = hugeList;
assertEquals(0, list.size());

final long length = 128 * 1000 * 1000 * 1000L; // uses 16 GB
hugeList.setSize(length);

assertEquals(Integer.MAX_VALUE, list.size());
assertEquals(length, hugeList.longSize());

boolean b = false;
long count = 0;
for (MutableBoolean mb : list) {
    mb.setFlag(b = !b);
    if ((int) count++ == 0)
        System.out.println("set " + count);
}

b = false;
count = 0;
for (MutableBoolean mb : list) {
    boolean b2 = mb.getFlag();
    boolean expected = b = !b;
    if (b2 != expected)
        assertEquals(expected, b2);
    if ((int) count++ == 0)
        System.out.println("get " + count);
}

Sample Data Type

public interface MutableTypes {
    public void setBoolean(boolean b);
    public boolean getBoolean();

    public void setBoolean2(Boolean b);
    public Boolean getBoolean2();

    public void setByte(byte b);
    public byte getByte();

    public void setByte2(Byte b);
    public Byte getByte2();

    public void setChar(char ch);
    public char getChar();

    public void setShort(short s);
    public short getShort();

    public void setInt(int i);
    public int getInt();

    public void setFloat(float f);
    public float getFloat();

    public void setLong(long l);
    public long getLong();

    public void setDouble(double d);
    public double getDouble();

    public void setElementType(ElementType elementType);
    public ElementType getElementType();

    public void setString(String text);
    public String getString();
}

Creating an ArrayList

HugeArrayList<MutableTypes> hugeList = new HugeArrayBuilder<MutableTypes>() {}.create();
List<MutableTypes> list = hugeList;

hugeList.setSize(500*1000*1000); // increase the capacity to 500M

// give all the elements values.
int i = 0;
for (MutableTypes mb : list) {
    mb.setBoolean(i % 2 == 0);
    mb.setBoolean2(i % 3 == 0 ? null : i % 3 == 1);
    mb.setByte((byte) i);
    mb.setByte2(i % 31 == 0 ? null : (byte) i);
    mb.setChar((char) i);
    mb.setShort((short) i);
    mb.setInt(i);
    mb.setFloat(i);
    mb.setLong(i);
    mb.setDouble(i);
    mb.setElementType(elementTypes[i % elementTypes.length]);
    mb.setString(strings[i % strings.length]);
    i++;
}

// retrieve all the values.
for (MutableTypes mb : list) {
    boolean b1 = mb.getBoolean();
    Boolean b2 = mb.getBoolean2();
    byte b3 = mb.getByte();
    Byte b4 = mb.getByte2();
    char ch = mb.getChar();
    short s = mb.getShort();
    int i = mb.getInt();
    float f = mb.getFloat();
    long l = mb.getLong();
    double d = mb.getDouble();
    ElementType et = mb.getElementType();
    String text = mb.getString();
}
Comment by zcapeh...@gmail.com, Aug 19, 2011

nice

Comment by dwithers...@gmail.com, Sep 2, 2011

I've been looking for an implementation of Map that uses memory-mapped IO as the backing store, but I don't want to persist it like a cache for reloading later. Trying to figure out if I get that with this library...

Comment by ashwin.jayaprakash, Sep 2, 2011

Does this support random deletes? Or is this immutable?

Comment by project member c.ce...@gmail.com, Sep 8, 2011

Could you provide an example with a Map, for example with an Integer as key and a Double as value? I looked at the code, but I could not understand the example well. Thanks!

Comment by project member c.ce...@gmail.com, Sep 9, 2011

I've just tested your library. Congratulations! It's quite impressive! I appreciate particularly the memory mapped files support.


Sign in to add a comment
Powered by Google Project Hosting