This is a question/feature request. In my testing, the main bottleneck seems to be building the GlobalIndex rather than running the FeatureBuilder classes that do the counts. However, while there is a multithreaded version of the FeatureBuilder classes, there is none for the GlobalIndex builders. Are there plans to implement parallel versions of these builders? I am not very experienced with Java, but I might try to implement them if it is feasible to do so.
Matt
Comment #1
Posted on Dec 11, 2014 by Happy Dog
Hi Matthew, I am really sorry for replying so late. You have a valid point, and I will look into this for the next version.
However, the current issue with this project is that I have almost no time to dedicate to jate regularly due to work commitments. I can only work on it in my spare time, so I really cannot guarantee when this will be done. But yes, I will definitely look into this.
Comment #2
Posted on Dec 11, 2014 by Swift Dog
Thanks Ziqi. I was actually able to implement a multithreaded GlobalIndexMem using ConcurrentHashMap and modifications of the GlobalIndexBuilderMem class. I am still facing a bottleneck with disk I/O, so I tend to build the NP lists sequentially and then distribute the index building from those lists. For future versions, it might make sense to read the documents into memory to allow for parallel reading (although, of course, the application becomes significantly more RAM intensive).
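To give a rough idea of the approach, here is a minimal sketch, not the actual modified classes. It assumes the NP lists have already been read sequentially into a map of document id to terms. The class ParallelIndexSketch and its methods index, docsFor, and build are hypothetical names for illustration and are not part of the jate API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a thread-safe in-memory index; NOT jate's GlobalIndexMem.
class ParallelIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> termToDocs = new ConcurrentHashMap<>();

    // Safe to call from many threads at once: both the map and the sets are concurrent.
    public void index(int docId, List<String> terms) {
        for (String term : terms) {
            termToDocs.computeIfAbsent(term, k -> ConcurrentHashMap.newKeySet()).add(docId);
        }
    }

    public Set<Integer> docsFor(String term) {
        return termToDocs.getOrDefault(term, Collections.emptySet());
    }

    // Assumes the term lists were extracted beforehand (sequentially, to avoid
    // contending on disk I/O); only the index building is spread over the pool.
    public static ParallelIndexSketch build(Map<Integer, List<String>> termsByDoc, int threads)
            throws InterruptedException {
        ParallelIndexSketch index = new ParallelIndexSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Map.Entry<Integer, List<String>> e : termsByDoc.entrySet()) {
            pool.submit(() -> index.index(e.getKey(), e.getValue()));
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("indexing did not finish in time");
        }
        return index;
    }
}
```

The point of computeIfAbsent is that the per-term set is created atomically, so two threads seeing the same new term at once do not overwrite each other's entries.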
Another area where parallelism is super useful is the variant updater, because there are so many combinations. A multithreaded version of that functionality was easier to implement, since it's all in-memory data structures.
My code is currently kind of a mess, but I can share it when I get some time.
Matt
Status: New
Labels:
Type-Defect
Priority-Medium