# Silk - Frequently Asked Questions

## Why is Silk taking so long to link? How can I improve this?

The most common causes and their solutions are:

### Use local SPARQL endpoints

Querying remote, and especially public, SPARQL endpoints tends to be very slow. Even if the server itself is fast, network latency can significantly reduce Silk's performance. I would normally advise keeping the datasets to be linked local unless they are fairly small. If you do need to link remotely, see the next point.

### Turn on SPARQL query caching

This is especially important if you are not using local datasets. Be sure to have DoCache activated for remote SPARQL endpoints. Set this option to 1 to boost performance a lot. This setting caches the results of all SPARQL queries to a given server so that they won't have to be run again.

Note: The cached data is preserved between runs in the "cache" subdirectory. If you ever run into caching inconsistencies, you might want to delete the files there.

### Use an appropriate output verbosity level

Are you running your main linking run with verbosity level 4 (-vvvv)? This outputs every single comparison and slows things down a lot. It is only really there for testing and tuning, not for the real linking run. For normal linking usage, I would recommend "-vv" (verbosity level 2), which periodically (every 100 comparisons) outputs only a timestamp and the number of comparisons already performed. By multiplying the number of source resources by the number of target resources, you can calculate how many comparisons will be needed in total (sorry that this isn't output automatically yet).

### Performance can be slow at first, then picks up

Don't be discouraged if the first X comparisons run slowly, where X is the number of target resources: every target resource's properties have to be fetched remotely once, when they are compared to the first source resource. When the next source resource is compared to them, their data should already be in the cache (provided you turned DoCache on), and things will progress much faster from that point on. I just tried starting to run the DBpedia vs. DrugBank example. The DrugBank server seems really slow, so do have patience with it initially, or host the datasets locally.

### Use index-based string prematching, if applicable

Comparing M source resources to N target resources for linking takes O(M*N) time, which is often impractical for huge datasets. In some cases, Silk's indexing feature can reduce the time complexity drastically, to O(M+N). Note that this requires the Xapian library (with Python bindings) to be installed. See http://www4.wiwiss.fu-berlin.de/bizer/silk/spec/#prematching and section 2.4 of the Silk paper http://juliusv.com/silk_iswc_2009.pdf for details about this advanced feature.
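
To get a feel for why index-based prematching helps, here is a minimal sketch of the general idea. It is not Silk's implementation and does not use Xapian: a plain Python dictionary stands in for the index, and the resource ids, labels, and function names are all made up for illustration. Each target label is indexed once by its tokens, and each source resource is then compared only against the targets that share at least one token with it, instead of against all N targets.

```python
from collections import defaultdict


def tokenize(label):
    """Lowercase a label and split it into word tokens."""
    return label.lower().split()


def build_index(targets):
    """Map each token to the set of target ids whose label contains it."""
    index = defaultdict(set)
    for target_id, label in targets.items():
        for token in tokenize(label):
            index[token].add(target_id)
    return index


def candidate_pairs(sources, targets):
    """Yield only (source, target) pairs that share at least one token."""
    index = build_index(targets)
    for source_id, label in sources.items():
        matches = set()
        for token in tokenize(label):
            # .get() avoids creating empty entries in the defaultdict
            matches |= index.get(token, set())
        for target_id in sorted(matches):
            yield source_id, target_id


# Hypothetical toy data, not taken from DBpedia or DrugBank
sources = {"s1": "Aspirin", "s2": "Ibuprofen tablets", "s3": "Paracetamol"}
targets = {"t1": "aspirin", "t2": "Ibuprofen", "t3": "Caffeine", "t4": "Morphine"}

# Without prematching, every source is compared to every target: M * N
full_comparisons = len(sources) * len(targets)
pairs = list(candidate_pairs(sources, targets))

print("full comparison:", full_comparisons, "pairs")
print("after prematching:", len(pairs), "pairs", pairs)
```

On this toy data the full run would need 3 * 4 = 12 comparisons, while only 2 candidate pairs survive prematching and go on to the detailed comparison. How much you save on real datasets depends on how selective the indexed property is.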
## I have heard that there is a newer version of Silk, written in Scala. Where can I find it?