Hi,
when loading large SparseBinaryVector files, the profiler shows a lot of calls to RandomAccessFile.read(). The reason is that each RandomAccessFile.readInt() results in 4 native read() calls and each RandomAccessFile.readDouble() in 8.
In OnDiskSemanticSpace.loadSparseBinaryOffsets() both read methods are used to skip over each vector.
In loadSparseBinaryVector() each dimension is read individually.
I attached a patch which optimizes this using a ByteBuffer.
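For readers without the attachment, the general idea can be sketched as follows. This is not the attached patch, and the record layout used here (an int count followed by int index / double value pairs) is a hypothetical one chosen for illustration, not necessarily the actual sparse binary format. The class and method names are made up as well.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferedSparseRead {

    // Hypothetical layout, assumed for illustration only:
    // int count, then count x (int index, double value), all big-endian.
    static final int ENTRY_BYTES = 4 + 8;

    // Slow path: each readInt()/readDouble() issues several single-byte reads.
    static double[] readNaive(RandomAccessFile raf, int dimensions) throws IOException {
        double[] vector = new double[dimensions];
        int nonZero = raf.readInt();
        for (int i = 0; i < nonZero; i++) {
            int index = raf.readInt();
            vector[index] = raf.readDouble();
        }
        return vector;
    }

    // Fast path: one readFully() for all entries, then decode from memory.
    static double[] readBuffered(RandomAccessFile raf, int dimensions) throws IOException {
        double[] vector = new double[dimensions];
        int nonZero = raf.readInt();
        byte[] raw = new byte[nonZero * ENTRY_BYTES];
        raf.readFully(raw);
        // ByteBuffer is big-endian by default, matching writeInt()/writeDouble().
        ByteBuffer buf = ByteBuffer.wrap(raw);
        for (int i = 0; i < nonZero; i++) {
            int index = buf.getInt();
            vector[index] = buf.getDouble();
        }
        return vector;
    }

    // Writes a small sample file and checks that both paths decode the same vector.
    static boolean pathsAgree() {
        try {
            File f = File.createTempFile("sparse", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.writeInt(2);                          // two non-zero entries
                raf.writeInt(1); raf.writeDouble(0.5);
                raf.writeInt(3); raf.writeDouble(-2.0);
                raf.seek(0);
                double[] naive = readNaive(raf, 4);
                raf.seek(0);
                double[] buffered = readBuffered(raf, 4);
                return Arrays.equals(naive, buffered) && buffered[3] == -2.0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pathsAgree());
    }
}
```

The buffered path still uses one readInt() for the count, but replaces the per-entry reads with a single bulk read, which is where most of the native calls go away.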
Additionally I attached a patch for build.xml to avoid this warning while building sspace on Windows.
"warning: unmappable character for encoding Cp1252"
- OnDiskSemanticSpace.patch 3.35KB
Comment #1
Posted on Jun 22, 2011 by Happy Rhino
This looks great! Any idea how much of a speedup you see from switching to a ByteBuffer?
Also, I'm having trouble telling what the difference is in the build.xml. There's no patch information. That Windows build warning certainly is annoying, so I would definitely like to have it suppressed.
Comment #2
Posted on Jun 22, 2011 by Massive Kangaroo
Benchmarking loadSparseBinaryOffsets():
new CachingOnDiskSemanticSpace("700mb.sspace");
786747 ms -> 6797 ms (about 115 times faster)
Benchmarking loadSparseBinaryVector() on the same file, not counting the initial load:
for (Iterator<String> iterator = sspace.getWords().iterator(); iterator.hasNext();) {
    String word = iterator.next();
    sspace.getVector(word);
}
813002 ms -> 72718 ms (about 11 times faster)
- build.patch 948
Status: New
Labels:
Type-Defect
Priority-Medium