Posted on Sep 12, 2014 by Swift Dog

This is a question/feature request. In my testing, the main bottleneck seems to be building the GlobalIndex, rather than running the FeatureBuilder classes that do the counts. However, while there is a multithreaded version of the FeatureBuilder classes, there is none for the GlobalIndex builders. Are there plans to implement parallel versions of these builders? I am not very experienced with Java, but I might try to implement them if it is feasible to do so.

Matt

Comment #1

Posted on Dec 11, 2014 by Happy Dog

Hi Matthew, I am really sorry for replying so late. You have a valid point, and I will look into this for the next version.

However, the current issue with this project is that I have almost no time to dedicate to jate regularly due to work commitments. I can only work on this in my spare time, so I really cannot guarantee when this will be done. But yes, I will definitely look into this.

Comment #2

Posted on Dec 11, 2014 by Swift Dog

Thanks Ziqi. I was actually able to implement a multithreaded GlobalIndexMem using ConcurrentHashMap and modifications of the GlobalIndexBuilderMem class. I am still facing a bottleneck with disk I/O, so I tend to build the NP lists sequentially and then distribute the index building from those lists across threads. For future versions, it might make sense to read the documents into memory to allow for parallel reading (although, of course, the application becomes significantly more RAM-intensive).
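
To give a rough idea of the approach, here is a minimal sketch, not the actual code and not JATE's API. The class ParallelIndexSketch, its fields, and the build method are hypothetical names made up for illustration. It assumes the candidate lists have already been read sequentially into memory (one list per document), and only the index building is spread over a thread pool.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a thread-safe in-memory index; NOT JATE's GlobalIndexMem.
class ParallelIndexSketch {
    // term -> ids of the documents that contain it
    final Map<String, Set<Integer>> termToDocs = new ConcurrentHashMap<>();
    // document id -> distinct terms found in that document
    final Map<Integer, Set<String>> docToTerms = new ConcurrentHashMap<>();

    // Safe to call from many threads: both maps and the sets stored in them are concurrent.
    void indexDocument(int docId, List<String> candidates) {
        Set<String> terms = docToTerms.computeIfAbsent(docId, k -> ConcurrentHashMap.newKeySet());
        for (String term : candidates) {
            terms.add(term);
            termToDocs.computeIfAbsent(term, k -> ConcurrentHashMap.newKeySet()).add(docId);
        }
    }

    // candidateLists.get(i) holds the candidates of document i, already loaded
    // sequentially (assumption: the disk I/O happened before this call).
    static ParallelIndexSketch build(List<List<String>> candidateLists, int threads) {
        ParallelIndexSketch index = new ParallelIndexSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < candidateLists.size(); i++) {
            final int docId = i;
            pool.execute(() -> index.indexDocument(docId, candidateLists.get(docId)));
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("index building timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while building index", e);
        }
        return index;
    }
}
```

Because each document is one task, the threads only contend when two documents share a term, and computeIfAbsent on ConcurrentHashMap handles that case atomically.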

Another area where parallelism is super useful is the variant updater, because there are so many combinations. A multithreaded version of that functionality was easier to implement, since it's all in-memory data structures.

My code is currently kind of a mess, but I can share it when I get some time.

Matt

jatetoolkit - issue #7
