Updated Jul 25, 2009 by axelclk
Labels: Featured, Phase-Implementation
Mediawiki2HTML  
How to convert Mediawiki text to HTML

The general idea behind the WikiModel wiki-to-HTML conversion is that the rendering of common wiki syntax is hidden in the internal WikipediaParser. Users of the API should derive a class from WikiModel or AbstractWikiModel and handle any special behaviour there.

A simple wiki text to HTML conversion looks like this:

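// package name as used in the bliki sources; adjust it if your version differs
import info.bliki.wiki.model.WikiModel;

public class SimpleRenderExample {
	public static void main(String[] args) {
		// the ${image} and ${title} placeholders are filled in by the model
		WikiModel wikiModel = new WikiModel(
				"http://www.mywiki.com/wiki/${image}",
				"http://www.mywiki.com/wiki/${title}");
		String htmlStr = wikiModel.render("This is a simple [[Hello World]] wiki tag");
		System.out.print(htmlStr);
	}
}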

and creates the following HTML snippet:

<p>This is a simple <a href="http://www.mywiki.com/wiki/Hello_World" title="Hello World">Hello World</a> wiki tag</p>

As you can see, the ${title} variable is replaced by the text of the wikilink, encoded according to the rules specified in the MediaWiki Help:Link article.
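
The substitution can be pictured with a small sketch that does not depend on the library. The class and method names below (LinkTemplateSketch, toUrl) are hypothetical, and only the space-to-underscore rule visible in the output above is modelled; the library's own encoder covers more of the Help:Link rules.

```java
public class LinkTemplateSketch {
    // Hypothetical helper: fills the ${title} placeholder of a link template.
    // Assumption: only spaces are encoded (as underscores); nothing else is escaped.
    public static String toUrl(String linkTemplate, String wikiTitle) {
        String encoded = wikiTitle.trim().replace(' ', '_');
        return linkTemplate.replace("${title}", encoded);
    }

    public static void main(String[] args) {
        // prints http://www.mywiki.com/wiki/Hello_World
        System.out.println(toUrl("http://www.mywiki.com/wiki/${title}", "Hello World"));
    }
}
```

The same template string can be reused for every link on a page, which is why the constructor takes a pattern rather than a fixed URL.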

For example, you can override the WikiModel#parseInternalImageLink() method to change the default rendering behaviour of the [[Image:...]] tag.

public class WikiTestModel extends WikiModel {
  public WikiTestModel(String imageBaseURL, String linkBaseURL) {
    super(imageBaseURL, linkBaseURL);
  }
  public void parseInternalImageLink(String imageNamespace, String rawImageLink) {
    ...

    ...
  }
}

By default, to avoid cross-site scripting risks, the rendering engine doesn't allow the style attribute. You can define the style attribute as allowed in a static block of your WikiModel implementation.

  static {
    TagNode.addAllowedAttribute("style");  
    ...
  }
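
The idea behind such an allow-list can be sketched without the library. The following is only an illustration of the concept, not the TagNode implementation: the class AttributeAllowList, its default entries, and its methods are all assumptions made for this example, and it uses an instance instead of static state to keep the sketch simple.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AttributeAllowList {
    // Hypothetical defaults; the library's real default set may differ.
    private final Set<String> allowed = new TreeSet<>(Arrays.asList("class", "href", "title"));

    // Adds one attribute name to the allow-list (names are compared in lower case).
    public void allow(String name) {
        allowed.add(name.toLowerCase(Locale.ROOT));
    }

    // Returns a copy of the attributes that keeps only allow-listed names.
    public Map<String, String> filter(Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (allowed.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "infobox");
        attrs.put("style", "color:red");

        AttributeAllowList list = new AttributeAllowList();
        System.out.println(list.filter(attrs).keySet()); // style is dropped
        list.allow("style");
        System.out.println(list.filter(attrs).keySet()); // style is kept
    }
}
```

Anything that is not explicitly listed is dropped, so enabling style is a deliberate opt-in that should only be made for trusted wiki text.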

Look in the WikiModel.java and AbstractWikiModel.java sources for an example.

A more advanced example can be found in the HTMLCreatorTest.java file. The first time you run this example, the Tom Hanks wiki source is downloaded from Wikipedia through the Wikipedia API. The downloaded wiki texts and templates are stored in an Apache Derby database, and the associated images are downloaded into the image directory C:\temp\WikiImages, which must already exist. The first run creates a new Derby database in the directory C:\temp\WikiDB. Every subsequent run of this code snippet downloads only the Tom Hanks wiki source, because the associated templates and images are already cached in the Derby database and in the image directory:

public static void testWikipediaENAPI(String title) {
		String[] listOfTitleStrings = {
			title
		};
		String titleURL = Encoder.encodeTitleLocalUrl(title);
		User user = new User("", "", "http://en.wikipedia.org/w/api.php");
		user.login();
		String mainDirectory = "c:/temp/";
		// the following subdirectory should not exist if you would like to create a
		// new database
		String databaseSubdirectory = "WikiDB";
		// the following directory must exist for image downloads
		String imageDirectory = "c:/temp/WikiImages";
		// the generated HTML will be stored in this file name:
		String generatedHTMLFilename = mainDirectory + titleURL + ".html";
		
		WikiDB db = null;

		try {
			db = new WikiDB(mainDirectory, databaseSubdirectory);
			APIWikiModel wikiModel = new APIWikiModel(user, db, "${image}", "${title}", imageDirectory);
			DocumentCreator creator = new DocumentCreator(wikiModel, user, listOfTitleStrings);
			creator.setHeader(HTMLConstants.HTML_HEADER1 + HTMLConstants.CSS_SCREEN_STYLE + HTMLConstants.HTML_HEADER2);
			creator.setFooter(HTMLConstants.HTML_FOOTER);
			wikiModel.setUp();
			creator.renderToFile(generatedHTMLFilename);

		} catch (IOException e) {
			e.printStackTrace();
		} catch (Exception e1) {
			e1.printStackTrace();
		} finally {
			if (db != null) {
				try {
					db.tearDown();
				} catch (Exception e) {
					e.printStackTrace();
				}
			}
		}
	}

Comment by mail2rajja, Jun 01, 2009

Please provide detailed steps for using this with a MediaWiki website.

Comment by mmwalczak, Jul 25, 2009

Hi, I have a question about WikiParser and Chinese or Japanese text. If the wiki text is in Latin script, everything is fine, but when there is Chinese or Japanese text in a link after "|", the parser takes as link text not only the text up to "]]" but everything up to the first period, comma, or other character that is not Japanese or Chinese. Can you comment on or explain that?

Comment by axelclk, Jul 25, 2009

I think you are using the feature described in this section: http://meta.wikimedia.org/wiki/Help:Link#Syntax

[[a|b]]c 'c' is appended to the end of the link text -> "bc"

Example for plurals: Dolphins are [[aquatic mammal]]s that are closely related to [[whale]]s and [[porpoise]]s.

I don't know whether this is always suitable for Chinese or Japanese.
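
To make the link-trail rule concrete, here is a minimal sketch that is independent of the parser. The class LinkTrailSketch and both regular expressions are assumptions for illustration and are not taken from the library. The first pattern accepts only ASCII lowercase letters as a trail; the second accepts any Unicode letter, which reproduces the behaviour described in the question when Chinese or Japanese characters directly follow a link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTrailSketch {
    // [[target]] or [[target|label]], followed by a trail of ASCII lowercase letters
    public static final Pattern ASCII_TRAIL =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]*))?\\]\\]([a-z]*)");
    // Same link syntax, but any Unicode letter counts as part of the trail
    public static final Pattern UNICODE_TRAIL =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]*))?\\]\\](\\p{L}*)");

    // Returns the visible link text (label or target, plus trail) of the first link.
    public static String linkText(Pattern pattern, String wikiText) {
        Matcher m = pattern.matcher(wikiText);
        if (!m.find()) {
            return null;
        }
        String label = m.group(2) != null ? m.group(2) : m.group(1);
        return label + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(linkText(ASCII_TRAIL, "[[a|b]]c"));       // prints bc
        System.out.println(linkText(ASCII_TRAIL, "[[whale]]s and")); // prints whales

        // Japanese sample written with Unicode escapes: a piped link immediately
        // followed by three more letters and then an ideographic full stop.
        String japanese = "[[\u6771\u4eac|\u6771\u4eac\u90fd]]\u306b\u884c\u304f\u3002";
        System.out.println(linkText(ASCII_TRAIL, japanese).length());   // label only
        System.out.println(linkText(UNICODE_TRAIL, japanese).length()); // label plus trail
    }
}
```

With the Unicode-aware pattern the trail runs until the first non-letter character, so in text without spaces between words the link text grows up to the next punctuation mark.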

Comment by mmwalczak, Jul 26, 2009

Thanks for your answer. Quite often in Chinese or Japanese, some glyphs come right after a piped link, without a space or anything else in between. That is when it happens. This means I have to look for a wiki setting that covers such situations. Thanks again.

Comment by seun.osewa, Nov 03, 2009

You didn't indicate the imports needed to make the above examples work.

