
SnappyLoader takes minutes to initialize #39

Closed
amichair opened this issue Jun 30, 2013 · 9 comments

@amichair

Sometimes SnappyLoader takes several minutes to initialize (in my case it happens when connecting via a remote debugger to a process which uses snappy via some transitive dependency).

Specifically, it's SnappyLoader.md5sum() that takes ages to complete. I suspect the problem might be with calling digestInputStream.read(), i.e. reading the stream and updating the digest one byte at a time - it is far more efficient to be working with a buffer (even a small 4K buffer will do) and reading it in one fell swoop.
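Not the actual SnappyLoader code, but a minimal sketch of what a buffered md5sum could look like (class and method names here are hypothetical): instead of one digest update per single-byte read(), the digest is fed from a 4K buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BufferedMd5 {
    // Read the stream in 4K chunks and feed each chunk to the digest,
    // rather than updating the digest one byte at a time.
    static byte[] md5sum(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[100_000];
        byte[] sum = md5sum(new ByteArrayInputStream(data));
        System.out.println(sum.length); // an MD5 digest is always 16 bytes
    }
}
```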

Or, at a higher level, there's actually no reason to use an md5 digest to compare two streams - it would be more efficient and straightforward to just compare the content of the streams directly for equality, with no digest or other calculations (but here too it should be done using a buffer, not reading them byte by byte and comparing). There are plenty of such stream equality utility methods to be found, so no need to write it from scratch either.
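A sketch of the digest-free approach described above (this is an illustration, not the library's code): compare the two streams directly, with buffering so the per-byte comparisons are cheap, and bail out at the first mismatch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamEquals {
    // Compare two streams for equal content. BufferedInputStream makes the
    // per-byte read() calls cheap (each underlying read pulls in a 4K block),
    // and we can return false as soon as the streams diverge.
    static boolean contentEquals(InputStream in1, InputStream in2) throws IOException {
        InputStream a = new BufferedInputStream(in1, 4096);
        InputStream b = new BufferedInputStream(in2, 4096);
        int ch;
        while ((ch = a.read()) != -1) {
            if (ch != b.read()) {
                return false;
            }
        }
        // a is exhausted; the streams are equal only if b is exhausted too
        return b.read() == -1;
    }

    public static void main(String[] args) throws Exception {
        byte[] x = {1, 2, 3};
        byte[] y = {1, 2, 4};
        System.out.println(contentEquals(new ByteArrayInputStream(x), new ByteArrayInputStream(x))); // true
        System.out.println(contentEquals(new ByteArrayInputStream(x), new ByteArrayInputStream(y))); // false
    }
}
```

Unlike the md5 approach, this can short-circuit on the first differing byte, so a mismatch near the start of the streams costs almost nothing.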

@xerial
Owner

xerial commented Jul 1, 2013

Any suggestion to a utility method for comparing two streams?

@amichair
Author

amichair commented Jul 1, 2013

@xerial
Owner

xerial commented Jul 1, 2013

Thanks. I will try an NIO-based implementation.

@bokken
Contributor

bokken commented Jul 1, 2013

The NIO implementation is not a very good one, as it requires moving data back and forth between the JVM and native memory multiple times. It would be better to use a standard (heap) ByteBuffer rather than a DirectByteBuffer, as that avoids crossing the JVM/native memory boundary.
You could also use IOUtils from commons-io.
http://commons.apache.org/proper/commons-io/javadocs/api-2.4/org/apache/commons/io/IOUtils.html#contentEquals(java.io.InputStream,%20java.io.InputStream)

@amichair
Author

amichair commented Jul 1, 2013

I think it's the other way around: using direct ByteBuffers lets the OS perform the I/O directly into the same memory that the Java code then accesses, whereas with non-direct buffers the I/O is first done into some other native buffer and then copied into a Java array. http://stackoverflow.com/questions/5670862/bytebuffer-allocate-vs-bytebuffer-allocatedirect is the first link I found on this, and it confirms that. You can also make the buffer a bit larger than 1K; it probably won't make much difference in performance, but most systems nowadays read from disk in at least 4K blocks anyway, so you might as well pass it through like that.

In any case, the performance difference between the two buffer types in this case is likely negligible, and both should be several orders of magnitude better than the current state of affairs.
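For reference, the two buffer kinds being discussed can be sketched in a few lines; a heap buffer is backed by an ordinary byte[] inside the JVM, while a direct buffer lives in native memory outside the heap.

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: wraps a byte[] on the Java heap; cheap to read from Java code.
        ByteBuffer heap = ByteBuffer.allocate(4096);
        // Direct buffer: native memory; the OS can do I/O into it without an extra copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);

        System.out.println(heap.hasArray());  // true: backed by an accessible byte[]
        System.out.println(direct.isDirect()); // true: allocated outside the heap
    }
}
```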

@bokken
Contributor

bokken commented Jul 1, 2013

That is only true if you are reading directly from a FileInputStream/FileChannel. If you are reading from nearly any other source (such as a ZipInputStream used to read resources in jars), all of the bytes are pulled into the JVM, then pushed back into native memory, then pulled back into the JVM one byte at a time to do the equality comparison.

@bokken
Contributor

bokken commented Jul 1, 2013

Basically there are 2 really good uses for direct byte buffers and several edge uses.

  1. Transferring data between files, when none of the data is actually looked at in the JVM.
  2. Interacting with JNI.

The edge uses are usually around addressing very large chunks of memory.

If all of the data is examined within the JVM (and there is no later JNI involved), it is better to read directly into a byte[].

@xerial
Owner

xerial commented Jul 2, 2013

I just created a snapshot version that simply compares two InputStreams.
https://oss.sonatype.org/content/repositories/snapshots/org/xerial/snappy/snappy-java/1.0.5.1-SNAPSHOT/

@amichair
Could you test this version on your platform?

@amichair
Author

amichair commented Jul 9, 2013

@bokken You're right, as in our case one of the streams is not directly read from a file...

@xerial I tried to recreate the issue using 1.0.5.1-SNAPSHOT but it did not occur, so looks like it's solved. Thanks!

@xerial xerial closed this as completed in d7263cc Aug 13, 2013