Updated Feb 10, 2010 by julius.volz

Silk - Frequently Asked Questions

Why is Silk taking so long to link? How can I improve this?

The most common causes of slow linking, and their fixes, are:

Use local SPARQL endpoints

Querying remote SPARQL endpoints, especially public ones, tends to be very slow. Even if the server itself is fast, network latency can significantly reduce Silk's performance. I would normally advise hosting the datasets to be linked locally unless they are fairly small. If you do need to link remotely, see the next point.

Turn on SPARQL query caching

This is especially important if you are not using local datasets. Make sure DoCache is activated for remote SPARQL endpoints: setting this option to 1 can improve performance considerably. It caches the results of all SPARQL queries sent to a given server so that they do not have to be run again. Note: the cached data is preserved between runs in the "cache" subdirectory. If you ever run into caching inconsistencies, you might want to delete the files there.
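To illustrate why a persistent query cache helps, here is a minimal sketch of the general idea in Python. It is not Silk's actual implementation: the `QueryCache` class, the one-JSON-file-per-query layout, and the `fake_endpoint` function standing in for a real SPARQL request are all assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile


class QueryCache:
    """Hypothetical disk cache: one JSON file per (endpoint, query) pair."""

    def __init__(self, cache_dir, run_query):
        self.cache_dir = cache_dir
        # run_query(endpoint, query) must return a JSON-serializable result.
        self.run_query = run_query
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, endpoint, query):
        # Hash endpoint and query text together to get a stable file name.
        key = hashlib.sha1((endpoint + "\n" + query).encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".json")

    def query(self, endpoint, query):
        path = self._path(endpoint, query)
        if os.path.exists(path):
            # Cache hit: no network round trip is needed.
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        # Cache miss: run the query once and store the result on disk.
        result = self.run_query(endpoint, query)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(result, f)
        return result


calls = []


def fake_endpoint(endpoint, query):
    # Stand-in for a slow remote SPARQL request; records every real call.
    calls.append(query)
    return [{"s": "http://example.org/a"}]


cache_dir = tempfile.mkdtemp()
cache = QueryCache(cache_dir, fake_endpoint)
sparql = "SELECT ?s WHERE { ?s ?p ?o }"
first = cache.query("http://example.org/sparql", sparql)
second = cache.query("http://example.org/sparql", sparql)
```

In this sketch the second call is answered from disk, so `fake_endpoint` runs only once. Because the files outlive the process, a later run pointed at the same directory would also skip the request, which is also why stale files can cause the inconsistencies mentioned above.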

Use an appropriate output verbosity level

Are you running your main linking run with verbosity level 4 (-vvvv)? This outputs every single comparison and slows things down a lot. It is really only there for testing and tuning, not for the real linking run. For normal linking, I would recommend "-vv" (verbosity level 2), which periodically (every 100 comparisons) outputs only a timestamp and the number of comparisons already performed. By multiplying the number of source resources by the number of target resources, you can calculate how many comparisons will be needed in total (sorry that the code does not output this automatically yet).
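Until Silk prints the total itself, the arithmetic can be done by hand from the "-vv" output. The helper below is a small sketch, not part of Silk; the function names and the example figures (2,000 source resources, 5,000 target resources, 100,000 comparisons done after 50 seconds) are made up for illustration.

```python
def total_comparisons(num_source, num_target):
    """Every source resource is compared to every target resource."""
    return num_source * num_target


def estimate_remaining_seconds(done, total, elapsed_seconds):
    """Extrapolate the remaining time from the average rate so far."""
    if done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need at least one finished comparison and some elapsed time")
    rate = done / elapsed_seconds  # comparisons per second
    return (total - done) / rate


total = total_comparisons(2000, 5000)
remaining = estimate_remaining_seconds(done=100000, total=total, elapsed_seconds=50)
print(total)      # 10000000
print(remaining)  # 4950.0
```

The estimate assumes a constant rate, so it will be pessimistic early in a run when data is still being fetched remotely.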

Performance can be slow at first, then picks up

Don't be discouraged if the first X comparisons run slowly, where X is the number of target resources: every target resource's properties have to be fetched remotely once, when they are compared to the first source resource. When the next source resource is compared to them, their data should already be in the cache (provided you turned DoCache on), and things will progress much faster from that point on. I just tried starting to run the DBpedia vs. DrugBank example. The DrugBank server seems really slow, so be patient initially or host the datasets locally.

Use index-based string prematching, if applicable

Comparing M source resources to N target resources for linking takes O(M*N) time, which is often impractical for huge datasets. In some cases, Silk's indexing feature can reduce this drastically, to O(M+N). Note that this requires the Xapian library (with Python bindings) to be installed. See http://www4.wiwiss.fu-berlin.de/bizer/silk/spec/#prematching and section 2.4 of the Silk paper http://juliusv.com/silk_iswc_2009.pdf for details about this advanced feature.
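The idea behind prematching can be sketched without Xapian. The example below uses a plain in-memory inverted index over label tokens instead, so it is a simplified stand-in and not how Silk itself does it; the function names, the whitespace tokenizer, and the sample resources are all assumptions for illustration.

```python
from collections import defaultdict


def tokenize(label):
    # Naive tokenizer: lowercase and split on whitespace.
    return set(label.lower().split())


def build_index(targets):
    """Map each token to the set of target URIs whose label contains it."""
    index = defaultdict(set)
    for uri, label in targets.items():
        for token in tokenize(label):
            index[token].add(uri)
    return index


def candidates(index, label):
    """Return only the targets that share at least one token with the label."""
    found = set()
    for token in tokenize(label):
        found |= index.get(token, set())
    return found


targets = {
    "t:aspirin": "Aspirin",
    "t:ibuprofen": "Ibuprofen",
    "t:acetylsalicylic": "Acetylsalicylic acid",
}
sources = {"s:1": "aspirin tablet", "s:2": "Folic acid"}

index = build_index(targets)  # one pass over the N targets
pairs = {s: candidates(index, label) for s, label in sources.items()}
num_detailed = sum(len(c) for c in pairs.values())
```

Here only 2 candidate pairs go on to the detailed comparison instead of all 6 source/target combinations. The saving depends on how selective the indexed tokens are: a token shared by most targets would bring the cost back toward M*N.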

Comment by kon...@gmx.at, Jun 24, 2010

I have heard that there is a newer version of Silk now, which is written in Scala. Where can I find this version?

