
SnappyLoader takes minutes to initialize #39

Closed
amichair opened this issue Jun 30, 2013 · 9 comments

@amichair

Sometimes SnappyLoader takes several minutes to initialize (in my case it happens when connecting via a remote debugger to a process which uses snappy via some transitive dependency).

Specifically, it's SnappyLoader.md5sum() that takes ages to complete. I suspect the problem might be with calling digestInputStream.read(), i.e. reading the stream and updating the digest one byte at a time - it is far more efficient to be working with a buffer (even a small 4K buffer will do) and reading it in one fell swoop.
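Not the actual SnappyLoader code, but a minimal sketch of what a buffered md5sum could look like (class and method names here are hypothetical): instead of one digest update per single-byte read(), the digest is fed from a 4K buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BufferedMd5 {
    // Read the stream in 4K chunks and feed each chunk to the digest,
    // rather than updating the digest one byte at a time.
    static byte[] md5sum(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[100_000];
        byte[] sum = md5sum(new ByteArrayInputStream(data));
        System.out.println(sum.length); // an MD5 digest is always 16 bytes
    }
}
```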

Or, at a higher level, there's actually no reason to use an md5 digest to compare two streams - it would be more efficient and straightforward to just compare the content of the streams directly for equality, with no digest or other calculations (but here too it should be done using a buffer, not reading them byte by byte and comparing). There are plenty of such stream equality utility methods to be found, so no need to write it from scratch either.
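A sketch of the digest-free approach described above (this is an illustration, not the library's code): compare the two streams directly, with buffering so the per-byte comparisons are cheap, and bail out at the first mismatch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamEquals {
    // Compare two streams for equal content. BufferedInputStream makes the
    // per-byte read() calls cheap (each underlying read pulls in a 4K block),
    // and we can return false as soon as the streams diverge.
    static boolean contentEquals(InputStream in1, InputStream in2) throws IOException {
        InputStream a = new BufferedInputStream(in1, 4096);
        InputStream b = new BufferedInputStream(in2, 4096);
        int ch;
        while ((ch = a.read()) != -1) {
            if (ch != b.read()) {
                return false;
            }
        }
        // a is exhausted; the streams are equal only if b is exhausted too
        return b.read() == -1;
    }

    public static void main(String[] args) throws Exception {
        byte[] x = {1, 2, 3};
        byte[] y = {1, 2, 4};
        System.out.println(contentEquals(new ByteArrayInputStream(x), new ByteArrayInputStream(x))); // true
        System.out.println(contentEquals(new ByteArrayInputStream(x), new ByteArrayInputStream(y))); // false
    }
}
```

Unlike the md5 approach, this can short-circuit on the first differing byte, so a mismatch near the start of the streams costs almost nothing.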

@xerial
Owner

xerial commented Jul 1, 2013

Any suggestion to a utility method for comparing two streams?

@amichair
Author

amichair commented Jul 1, 2013

@xerial
Owner

xerial commented Jul 1, 2013

Thanks. I will try an NIO-based implementation.

@bokken
Contributor

bokken commented Jul 1, 2013

The NIO implementation is not a very good one, as it requires moving data back and forth between the JVM and native memory multiple times. It would be better to use a standard (heap) ByteBuffer rather than a DirectByteBuffer, as that avoids crossing the JVM/native memory boundary.
You could also use IOUtils from commons-io.
http://commons.apache.org/proper/commons-io/javadocs/api-2.4/org/apache/commons/io/IOUtils.html#contentEquals(java.io.InputStream,%20java.io.InputStream)

@amichair
Author

amichair commented Jul 1, 2013

I think it's the other way around: using direct ByteBuffers lets the OS perform the I/O directly into the same memory that the Java code then accesses, whereas with non-direct buffers the I/O is first done into some other native buffer and then copied into a Java array. http://stackoverflow.com/questions/5670862/bytebuffer-allocate-vs-bytebuffer-allocatedirect is the first link I found on this, and it confirms that. You can also make the buffer a bit larger than 1K; it probably won't make much difference in performance, but most systems nowadays read from disk in at least 4K blocks anyway, so you might as well pass it through like that.

In any case, the performance difference between the two buffer types in this case is likely negligible, and both should be several orders of magnitude better than the current state of affairs.
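For reference, the two buffer kinds being discussed can be sketched in a few lines; a heap buffer is backed by an ordinary byte[] inside the JVM, while a direct buffer lives in native memory outside the heap.

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: wraps a byte[] on the Java heap; cheap to read from Java code.
        ByteBuffer heap = ByteBuffer.allocate(4096);
        // Direct buffer: native memory; the OS can do I/O into it without an extra copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);

        System.out.println(heap.hasArray());  // true: backed by an accessible byte[]
        System.out.println(direct.isDirect()); // true: allocated outside the heap
    }
}
```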

@bokken
Contributor

bokken commented Jul 1, 2013

That is only true if you are reading directly from a FileInputStream/FileChannel. If you are reading from nearly any other source (such as a ZipInputStream used to read resources in jars), all of the bytes are pulled into the JVM, then pushed back into native memory, then pulled back into the JVM one byte at a time to do the equality comparison.

@bokken
Contributor

bokken commented Jul 1, 2013

Basically there are 2 really good uses for direct byte buffers and several edge uses.

  1. Transferring data between files, when none of the data is actually looked at in the JVM.
  2. Interacting with JNI.

The edge uses are usually around addressing very large chunks of memory.

If all of the data is examined within the JVM (and there is no later JNI involved), it is better to read directly into a byte[].

@xerial
Owner

xerial commented Jul 2, 2013

I just created a snapshot version that simply compares two InputStreams.
https://oss.sonatype.org/content/repositories/snapshots/org/xerial/snappy/snappy-java/1.0.5.1-SNAPSHOT/

@amichair
Could you test this version on your platform?

@amichair
Author

amichair commented Jul 9, 2013

@bokken You're right, as in our case one of the streams is not directly read from a file...

@xerial I tried to recreate the issue using 1.0.5.1-SNAPSHOT but it did not occur, so looks like it's solved. Thanks!

@xerial xerial closed this as completed in d7263cc Aug 13, 2013