airhead-research - issue #96

Optimization SparseBinary-Loading


Posted on Jun 21, 2011 by Massive Kangaroo

Hi,

When loading large SparseBinaryVector files, the profiler shows a large number of calls to RandomAccessFile.read(). The reason is that each RandomAccessFile.readInt() results in 4 native read() calls, and each RandomAccessFile.readDouble() in 8.

In OnDiskSemanticSpace.loadSparseBinaryOffsets(), both read methods are used just to skip over each vector.

In loadSparseBinaryVector(), each dimension is loaded individually.

I attached a patch that optimizes this using a ByteBuffer.
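The general idea can be sketched as follows. This is only an illustration, not the attached patch: the class name, the method names, and the record layout (an int count of non-zero entries followed by that many int index / double value pairs) are assumptions made for the example. Each record is fetched with two bulk reads and decoded from a ByteBuffer, and skipping a record needs only the 4-byte count plus a seek.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class BufferedSparseRead {
    // Assumed record layout (hypothetical, not taken from the patch):
    // int nonZeroCount, then nonZeroCount pairs of (int index, double value).
    static final int PAIR_BYTES = Integer.BYTES + Double.BYTES;

    /** Reads the 4-byte entry count with a single bulk read. */
    static int readCount(RandomAccessFile raf) throws IOException {
        byte[] header = new byte[Integer.BYTES];
        raf.readFully(header);
        // ByteBuffer defaults to big-endian, matching RandomAccessFile.writeInt().
        return ByteBuffer.wrap(header).getInt();
    }

    /** Reads one record at the current file position into a dense array. */
    static double[] readSparseRecord(RandomAccessFile raf, int dimensions)
            throws IOException {
        int nonZero = readCount(raf);
        // One bulk read for all pairs instead of 12 single-byte reads per pair.
        byte[] body = new byte[nonZero * PAIR_BYTES];
        raf.readFully(body);
        ByteBuffer buf = ByteBuffer.wrap(body);
        double[] dense = new double[dimensions];
        for (int i = 0; i < nonZero; i++) {
            int index = buf.getInt();
            dense[index] = buf.getDouble();
        }
        return dense;
    }

    /** Skips one record without decoding any of its values. */
    static void skipSparseRecord(RandomAccessFile raf) throws IOException {
        int nonZero = readCount(raf);
        raf.seek(raf.getFilePointer() + (long) nonZero * PAIR_BYTES);
    }
}
```

When building an offset table, skipSparseRecord() would be called once per vector, so the cost per vector no longer depends on the number of non-zero entries.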

Additionally, I attached a patch for build.xml to avoid this warning when building sspace on Windows:

"warning: unmappable character for encoding Cp1252"

Comment #1

Posted on Jun 22, 2011 by Happy Rhino

This looks great! Any idea how much speed up you see from switching to using a ByteBuffer?

Also, I'm having trouble telling what the difference is in the build.xml. There's no patch information. That Windows build warning certainly is annoying, so I would definitely like to have it suppressed.

Comment #2

Posted on Jun 22, 2011 by Massive Kangaroo

Benchmarking loadSparseBinaryOffsets():

new CachingOnDiskSemanticSpace("700mb.sspace");

786747 ms -> 6797 ms (about 115 times faster)

Benchmarking loadSparseBinaryVector(), excluding the loading time, with the same file:

for (Iterator<String> iterator = sspace.getWords().iterator(); iterator.hasNext();) {
    String word = iterator.next();
    sspace.getVector(word);
}

813002 ms -> 72718 ms (about 11 times faster)

Status: New

Labels:
Type-Defect Priority-Medium