Wikipedia Redirect

Functionality of Wikipedia Redirect:

  • Extracts pairs of a title and its redirect target (e.g. "USA" -> "United States") from a Wikipedia dump in any language. The implementation is language-independent and has been tested on Japanese and English Wikipedia.
  • Serializes and deserializes the extracted redirect data.
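
To make the first bullet concrete, here is a minimal sketch of pulling (title, redirect target) pairs out of dump XML. It is not the project's WikipediaRedirectExtractor, whose parsing strategy is not described here. The class name RedirectPairSketch and the inline sample are hypothetical, and the sketch assumes that each redirect page in the dump carries a <redirect title="..."/> element next to its <title>.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not the project's WikipediaRedirectExtractor.
class RedirectPairSketch {

  // Streams over the XML and maps each redirecting page title to its target.
  // Assumption: redirect pages contain <redirect title="..."/> after <title>.
  static Map<String, String> extract(Reader xml) {
    Map<String, String> redirects = new LinkedHashMap<>();
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
      String title = null;
      while (reader.hasNext()) {
        if (reader.next() != XMLStreamConstants.START_ELEMENT) {
          continue;
        }
        String name = reader.getLocalName();
        if (name.equals("page")) {
          title = null; // a new page starts; forget the previous title
        } else if (name.equals("title")) {
          title = reader.getElementText();
        } else if (name.equals("redirect") && title != null) {
          String target = reader.getAttributeValue(null, "title");
          if (target != null) {
            redirects.put(title, target);
          }
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Could not parse the dump XML", e);
    }
    return redirects;
  }

  public static void main(String[] args) {
    // Tiny hand-written sample standing in for a real dump.
    String sample = "<mediawiki>"
        + "<page><title>USA</title><redirect title=\"United States\"/></page>"
        + "<page><title>United States</title></page>"
        + "</mediawiki>";
    System.out.println(extract(new StringReader(sample)));
  }
}
```

Run as is, the sample prints {USA=United States}. A streaming parser is used because a full dump is far too large to load as a DOM tree; the sketch has not been tuned for full-size dumps.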

Requirement

Memory: the Japanese Wikipedia dump has been processed successfully with a heap of only 256 MB (JVM option: -Xmx256m). Processing the English Wikipedia dump requires 4 GB (JVM option: -Xmx4g).

Extracting redirect data from a Wikipedia dump

1. Check out the code base and build it.

$ svn co http://wikipedia-redirect.googlecode.com/svn/trunk/edu.cmu.lti.wikipedia_redirect
$ cd edu.cmu.lti.wikipedia_redirect
$ javac src/edu/cmu/lti/wikipedia_redirect/*.java

2. Get a Wikipedia dump (pages-articles).

$ wget 'http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2'
$ bunzip2 jawiki-latest-pages-articles.xml.bz2

3. Run WikipediaRedirectExtractor to extract the Wikipedia redirects (takes about 1 minute).

$ java -cp src edu.cmu.lti.wikipedia_redirect.WikipediaRedirectExtractor ./jawiki-latest-pages-articles.xml

4. Confirm that the redirect file (a serialized WikipediaRedirect object) was created. A tab-separated .txt file is written alongside it for convenience.

$ ls -lh target/*
-rw-r--r-- 1 hideki users  21M 2011-10-11 01:32 target/wikipedia_redirect.txt
-rw-r--r-- 1 hideki users  23M 2011-10-11 01:32 target/wikipedia_redirect.ser
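
If you would rather not depend on Java serialization, the tab-separated file can be loaded directly. The sketch below is an illustration only: the class RedirectTsvReader is hypothetical, and it assumes that each line has the form "source title<TAB>redirected title" and that the file is UTF-8 encoded. Check both against your own wikipedia_redirect.txt before relying on it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader; the column order and encoding are assumptions.
class RedirectTsvReader {

  // Splits a line at the first tab; returns null when either field is missing.
  static String[] parseLine(String line) {
    int tab = line.indexOf('\t');
    if (tab <= 0 || tab == line.length() - 1) {
      return null;
    }
    return new String[] { line.substring(0, tab), line.substring(tab + 1) };
  }

  // Loads the whole file into a source-title -> target-title map.
  static Map<String, String> load(Path file) {
    Map<String, String> redirects = new HashMap<>();
    try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] pair = parseLine(line);
        if (pair != null) {
          redirects.put(pair[0], pair[1]);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return redirects;
  }

  public static void main(String[] args) {
    // Expects the path to the .txt file as the first argument.
    Map<String, String> redirects = load(Paths.get(args[0]));
    System.out.println("Loaded " + redirects.size() + " redirects");
  }
}
```

Malformed lines are skipped silently here; a stricter reader might count or log them instead.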

Running a demo

$ cat src/edu/cmu/lti/wikipedia_redirect/Demo.java 
package edu.cmu.lti.wikipedia_redirect;
import java.io.File;
import java.util.Set;

public class Demo {
  public static void main(String[] args) throws Exception {
    // Initialization
    System.out.print("Deserializing Wikipedia Redirect ...");
    long t0 = System.nanoTime();
    WikipediaRedirect wr = IOUtil.loadWikipediaRedirect(new File(args[0]));
    long t1 = System.nanoTime();
    System.out.println(" done in " + (t1 - t0) / 1e9 + " sec.\n");
    
    // Let's find a redirection given a source word.
    String[] srcTerms = {"オサマビンラディン", "オサマ・ビンラーディン",
            "東日本大地震","東日本太平洋沖地震" ,"慶大", "NACSIS", 
            "ダイアモンド", "アボガド", "バイオリン", "平成12年", "3.14"};
    StringBuilder sb = new StringBuilder();
    for ( String src : srcTerms ) {
      sb.append("\""+wr.get(src)+"\" was redirected from \""+src+"\"\n");
    }
    System.out.println(sb.toString()+"--\n");

    // Let's find which source words redirect to the given target word.
    String target = "東北地方太平洋沖地震";
    Set<String> keys = wr.getKeysByValue(target);
    System.out.println("All of the following redirect to \""+target+"\":\n"+keys);
  }
}
$ java -cp src edu.cmu.lti.wikipedia_redirect.Demo ./target/wikipedia_redirect.ser
Deserializing Wikipedia Redirect ... done in 15.183415261 sec.

"ウサーマ・ビン・ラーディン" was redirected from "オサマビンラディン"
"ウサーマ・ビン・ラーディン" was redirected from "オサマ・ビンラーディン"
"東北地方太平洋沖地震" was redirected from "東日本大地震"
"東北地方太平洋沖地震" was redirected from "東日本太平洋沖地震"
"慶應義塾大学" was redirected from "慶大"
"国立情報学研究所" was redirected from "NACSIS"
"ダイヤモンド" was redirected from "ダイアモンド"
"アボカド" was redirected from "アボガド"
"ヴァイオリン" was redirected from "バイオリン"
"2000年" was redirected from "平成12年"
"円周率" was redirected from "3.14"
--

All of the following redirect to "東北地方太平洋沖地震":
[Tohoku Region Pacific Coast Earthquake, 2011年三陸沖地震, 2011年東北地方・太平洋沖地震, 3.11, 平成三陸沖地震, 東北地方・太平洋沖地震, 東北大地震, 平成23年東北地方太平洋沖地震, 東北太平洋沖地震, 東北・関東大地震, 東北沖大地震, 東北地方太平洋地震, 東北沖太平洋地震, 東日本太平洋沖地震, 東北地方太平洋沖地震 (2011年), 東日本大地震, 東北地方太平洋岸地震, 2011年三陸地震, 2011年東北地震, 東北関東大地震, 2011年東北地方太平洋沖地震, 東日本巨大地震, 平成三陸地震, 平成23年東北地方・太平洋沖地震, 2011年太平洋沖地震]
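
The reverse lookup in the demo (getKeysByValue) can be pictured as a scan over the source-to-target map. The sketch below shows that idea on a plain Map; it is not the project's code, which may well keep a precomputed reverse index instead. The class name ReverseLookupSketch and the sample entries other than "USA" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the project's getKeysByValue may be implemented differently.
class ReverseLookupSketch {

  // Collects every source title whose redirect target equals the given title.
  // This is a linear scan, so each call costs O(number of redirects).
  static Set<String> keysByValue(Map<String, String> redirects, String target) {
    Set<String> keys = new TreeSet<>();
    for (Map.Entry<String, String> entry : redirects.entrySet()) {
      if (entry.getValue().equals(target)) {
        keys.add(entry.getKey());
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    // Illustrative data only; not taken from a real dump.
    Map<String, String> redirects = new HashMap<>();
    redirects.put("USA", "United States");
    redirects.put("US", "United States");
    redirects.put("UK", "United Kingdom");
    System.out.println(keysByValue(redirects, "United States"));
  }
}
```

If reverse lookups are frequent, building a Map from target to a set of sources once, right after loading, trades memory for constant-time queries.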

Note: A Wikipedia redirect (a pair of a source title and its redirected title) does not always connect alternative forms of the same entity. Filtering out such noisy pairs can reduce disk usage and improve speed; what counts as "noise" depends on your application.
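
As one possible way to apply such filtering, the sketch below passes every pair through a pluggable rule. Everything in it is hypothetical: the class RedirectFilterSketch, the sample titles, and in particular the rule itself, which drops pairs whose source title contains ":" (often a namespace prefix) or whose target contains "#" (a pointer to a section rather than a whole article). These are crude heuristics chosen for illustration, and some legitimate article titles contain a colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical filter; the rule below is only one possible definition of "noise".
class RedirectFilterSketch {

  // Returns a new map holding only the pairs accepted by the rule.
  static Map<String, String> filter(Map<String, String> redirects,
                                    BiPredicate<String, String> keep) {
    Map<String, String> kept = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : redirects.entrySet()) {
      if (keep.test(entry.getKey(), entry.getValue())) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  // Example rule (an assumption, not the project's): reject sources that look
  // namespace-prefixed and targets that point into a section of an article.
  static boolean looksLikeAlias(String source, String target) {
    return !source.contains(":") && !target.contains("#");
  }

  public static void main(String[] args) {
    // Illustrative data only; not taken from a real dump.
    Map<String, String> redirects = new LinkedHashMap<>();
    redirects.put("USA", "United States");
    redirects.put("Template:Foo", "Template:Bar");
    redirects.put("Minor topic", "Main article#Minor topic");
    System.out.println(filter(redirects, RedirectFilterSketch::looksLikeAlias));
  }
}
```

Keeping the rule as a BiPredicate means an application can swap in its own notion of noise without touching the filtering loop.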
