Mediawiki2HTML
How to convert Mediawiki text to HTML
The general idea behind the wiki-to-HTML WikiModel is that the rendering of common wiki syntax is hidden in the internal WikipediaParser. Users of the API should derive a class from WikiModel or AbstractWikiModel, where special cases can be handled. A simple wiki text to HTML conversion looks like this:

import info.bliki.wiki.model.WikiModel;

// The wrapper class name is arbitrary.
public class WikiToHtmlExample {
    public static void main(String[] args) {
        WikiModel wikiModel =
            new WikiModel("http://www.mywiki.com/wiki/${image}",
                          "http://www.mywiki.com/wiki/${title}");
        String htmlStr = wikiModel.render("This is a simple [[Hello World]] wiki tag");
        System.out.print(htmlStr);
    }
}

It creates the following HTML snippet:

<p>This is a simple <a href="http://www.mywiki.com/wiki/Hello_World" title="Hello World">Hello World</a> wiki tag</p>

As you can see, the ${title} variable is replaced by the text of the wikilink according to the rules specified in the Mediawiki Help:Link article. You can, for example, override the WikiModel#parseInternalImageLink() method to change the default rendering behaviour of the [[Image:...]] tag.

public class WikiTestModel extends WikiModel {
    public WikiTestModel(String imageBaseURL, String linkBaseURL) {
        super(imageBaseURL, linkBaseURL);
    }

    public void parseInternalImageLink(String imageNamespace, String rawImageLink) {
        ...
        ...
    }
}

By default, the rendering engine does not allow the style attribute, in order to avoid cross-site scripting risks. You can allow it in a static block of your WikiModel implementation:

static {
    TagNode.addAllowedAttribute("style");
    ...
}

Look in the WikiModel.java and AbstractWikiModel.java sources for an example.
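
To illustrate the allow-list idea behind addAllowedAttribute, here is a minimal, self-contained sketch in plain Java. It is not taken from the library sources: the class AttributeAllowListSketch, its default attribute set, and the filter method are hypothetical names chosen for illustration, and the real TagNode may work differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an attribute allow-list; not the library's TagNode code.
final class AttributeAllowListSketch {
    // Assumed default set for the example; the real defaults may differ.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("class", "id", "title"));

    // Registers an additional attribute name as allowed (case-insensitive).
    static void addAllowedAttribute(String name) {
        ALLOWED.add(name.toLowerCase(Locale.ROOT));
    }

    // Returns a copy of attrs that contains only allowed attribute names.
    static Map<String, String> filter(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (ALLOWED.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "note");
        attrs.put("style", "color:red");
        System.out.println(filter(attrs).keySet()); // style is dropped here
        addAllowedAttribute("style");
        System.out.println(filter(attrs).keySet()); // style is kept here
    }
}
```

Because the allowed set in this sketch is static, a change made in one place affects every later call to filter, which mirrors why the wiki text above places the call in a static block.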

A more advanced example can be found in the HTMLCreatorTest.java file. When you run this example for the first time, the Tom Hanks wiki source from Wikipedia is downloaded through the Wikipedia API. The downloaded wiki texts and templates are stored in an Apache Derby database, and the associated images are downloaded into an existing image directory, C:\temp\WikiImages. After the first run, a new Derby database has been created in the directory C:\temp\WikiDB. Every subsequent run of this code snippet downloads only the Tom Hanks wiki source, because the associated templates and images are already cached in the Derby database and in the images directory:

public static void testWikipediaENAPI(String title) {
    String[] listOfTitleStrings = {
        title
    };
    String titleURL = Encoder.encodeTitleLocalUrl(title);
    User user = new User("", "", "http://en.wikipedia.org/w/api.php");
    user.login();
    String mainDirectory = "c:/temp/";
    // the following subdirectory should not exist if you would like to create a
    // new database
    String databaseSubdirectory = "WikiDB";
    // the following directory must exist for image downloads
    String imageDirectory = "c:/temp/WikiImages";
    // the generated HTML will be stored in this file name:
    String generatedHTMLFilename = mainDirectory + titleURL + ".html";
    WikiDB db = null;
    try {
        db = new WikiDB(mainDirectory, databaseSubdirectory);
        APIWikiModel wikiModel = new APIWikiModel(user, db, "${image}", "${title}", imageDirectory);
        DocumentCreator creator = new DocumentCreator(wikiModel, user, listOfTitleStrings);
        creator.setHeader(HTMLConstants.HTML_HEADER1 + HTMLConstants.CSS_SCREEN_STYLE + HTMLConstants.HTML_HEADER2);
        creator.setFooter(HTMLConstants.HTML_FOOTER);
        wikiModel.setUp();
        creator.renderToFile(generatedHTMLFilename);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (Exception e1) {
        e1.printStackTrace();
    } finally {
        if (db != null) {
            try {
                db.tearDown();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
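
The "download once, reuse afterwards" behaviour described for templates and images can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: an in-memory map stands in for the Derby database, and the class TemplateCacheSketch, the downloader function, and the title "Template:Infobox" are hypothetical names that do not come from the library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an in-memory map stands in for the Derby database.
final class TemplateCacheSketch {
    private final Map<String, String> store = new HashMap<>();
    private final Function<String, String> downloader;
    private int downloads = 0;

    TemplateCacheSketch(Function<String, String> downloader) {
        this.downloader = downloader;
    }

    // Returns the cached text if present; otherwise downloads and stores it.
    String get(String title) {
        String cached = store.get(title);
        if (cached != null) {
            return cached;
        }
        String text = downloader.apply(title);
        downloads++;
        store.put(title, text);
        return text;
    }

    // Number of times the downloader was actually invoked.
    int downloadCount() {
        return downloads;
    }

    public static void main(String[] args) {
        // Stand-in downloader; a real one would call the wiki API.
        TemplateCacheSketch cache = new TemplateCacheSketch(t -> "wiki text of " + t);
        cache.get("Template:Infobox");
        cache.get("Template:Infobox");
        System.out.println(cache.downloadCount()); // prints 1
    }
}
```

A persistent store such as the Derby database in the example above extends the same idea across program runs, which is why only the first run needs to fetch the templates and images.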
Please provide detailed steps for using this with a MediaWiki website.
Hi, I have a question about WikiParser and Chinese or Japanese text. If the text in the wiki is in Latin script, everything is fine, but when there is Chinese or Japanese text in a link after "|", the parser takes as link text not only the text up to "]]" but everything up to the first period, comma, or other character that is not Japanese or Chinese. Can you comment on or explain that?
I think you are using the feature described in this section: http://meta.wikimedia.org/wiki/Help:Link#Syntax
[[a|b]]c 'c' is appended to the end of the link text -> "bc"
Example for plurals: Dolphins are [[aquatic mammal]]s that are closely related to [[whale]]s and [[porpoise]]s.
I don't know whether this is always suitable for Chinese or Japanese.
Thanks for your answer. Quite often in Chinese or Japanese, some glyphs come right after a piped link, without a space. That is when it happens. That means I have to look for a setting in the wiki for such situations. Thanks again.
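
The link-trail behaviour discussed in the comments above can be reproduced with a small, self-contained Java sketch. It is not the parser's actual code: the class LinkTrailSketch and both patterns are assumptions made for illustration. One pattern limits the trail to the letters a to z, the other accepts any Unicode letter, which would also absorb Chinese or Japanese characters that directly follow a piped link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of link-trail handling; not the parser's implementation.
final class LinkTrailSketch {
    // [[target|text]] followed by a trail of lowercase ASCII letters.
    static final Pattern ASCII_TRAIL =
            Pattern.compile("\\[\\[([^\\]|]+)\\|([^\\]]+)\\]\\]([a-z]*)");
    // Same link syntax, but the trail may consist of any Unicode letters.
    static final Pattern UNICODE_TRAIL =
            Pattern.compile("\\[\\[([^\\]|]+)\\|([^\\]]+)\\]\\](\\p{L}*)");

    // Returns the link text plus its trail for the first piped link, or null.
    static String linkText(Pattern pattern, String wikiText) {
        Matcher m = pattern.matcher(wikiText);
        if (!m.find()) {
            return null;
        }
        return m.group(2) + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(linkText(ASCII_TRAIL, "[[a|b]]c")); // prints bc
        // Japanese sample written with Unicode escapes: a piped link followed
        // directly by three more letters and an ideographic full stop.
        String ja = "[[\u6771\u4eac\u90fd|\u6771\u4eac]]\u306b\u884c\u304f\u3002";
        // The ASCII trail stops at "]]"; the Unicode trail runs to the full stop.
        System.out.println(linkText(ASCII_TRAIL, ja).length());   // prints 2
        System.out.println(linkText(UNICODE_TRAIL, ja).length()); // prints 5
    }
}
```

Under these assumptions, inserting a separator between the link and the following text (or restricting the trail pattern) keeps the extra characters out of the link text.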
You didn't indicate the imports needed to make the above examples work.