| Title | Better Unicode compliance in TeX |
|---|---|
| Student | Arthur Reutenauer |
| Mentor | Eric Muller |
| Abstract | |
|
The well-known program TeX and its extensions have a long tradition of producing beautiful documents, but they have paid less attention to the other end, the input stream. Direct UTF-8 input became possible only fairly recently, through two extensions of the TeX engine, LuaTeX and XeTeX (TeX could already process UTF-8-encoded text with appropriate macros). General support for Unicode properties remains rather poor.
My project is to investigate the current state of Unicode support in TeX and to implement improvements for several aspects of Unicode compliance, as defined in chapter 3 of the Unicode Standard. Since complete compliance is probably too much for such a project, I have selected the points that seem most interesting to me: handling of combining characters (in conjunction with normalization, see UAX #15, http://www.unicode.org/reports/tr15/), the bidi algorithm (UAX #9), and issues related to hyphenation (UAX #14, line breaking properties, and the section of UAX #29 on word boundaries). I have already run some experiments with normalization in LuaTeX with ConTeXt (link below), but I will do my best to implement the same algorithms in XeTeX as well. Supporting both engines seems essential, if only because each implies a particular philosophy and perspective on TeX programming: support for LuaTeX could be done entirely at the macro level, mostly with Lua code, whereas XeTeX would probably require adding new primitives and modifying some of the libraries it uses, in particular ICU. |
|
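
To illustrate the combining-character issue that normalization (UAX #15) addresses, here is a minimal sketch using Python's standard `unicodedata` module. It is not the LuaTeX or XeTeX implementation discussed in the abstract; the helper `codepoints` and the variable names are assumptions made for this example only.

```python
import unicodedata

# The letter "é" can be encoded in two ways:
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def codepoints(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Without normalization the two spellings are different strings,
# even though they are meant to render identically.
same_before = precomposed == decomposed

# NFC composes base letter + combining mark; NFD decomposes the precomposed letter.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(same_before)
print(codepoints(nfc))
print(codepoints(nfd))
```

An engine that normalizes its input stream to a single form can treat both spellings alike in later stages such as font lookup or hyphenation pattern matching.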