Code license: Apache License 2.0
Labels: cjk, unicode
Project owners:
  gavingrover

Note: The vy-language beta downloads are now all deprecated because the dependency file jparsec-1.2.jar is no longer available from Codehaus. This site now only supplies the CJK character decomposition data.

CJK Decomposition File

The CJK Decomposition File is a graphical analysis of the 20,934 most common Chinese/Japanese characters in Unicode (the 20,922 characters in the Unicode CJK Unified Ideographs block, plus the 12 unique characters from the CJK Compatibility Ideographs block).

For each character, I've recorded one or two constituent components and a decomposition type. Only pictorial configurations are used, not semantic ones. Where a character differs between typefaces, I've used the form shown in the Unicode spec reference listing. Where more than one configuration is possible, I've selected just one. I've "created" a few thousand characters to cater for decomposition components that are not themselves among the collected characters. (Although many of them are in the CJK Extension A and B blocks, I kept those blocks out of scope.) To represent these extra characters in the data, I've sometimes used a multi-character sequence and sometimes a user-defined glyph.

Downloads

The download file is a zip containing 2 files:

(1) The CSV-format data file, with 4 fields: the character, first component, second component (or -), type of decomposition.

(2) The TrueType font file, which makes viewing the data file easier.
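
As a rough illustration of the four-field layout described above, here is a minimal Python sketch for reading the data file. It assumes comma-separated UTF-8 text with no header row; the names Decomposition and read_decompositions are hypothetical, and the sample rows and type codes are invented for the example rather than copied from the file.

```python
import csv
import io
from typing import NamedTuple, Optional


class Decomposition(NamedTuple):
    character: str
    first: str
    second: Optional[str]  # None when the file holds "-"
    kind: str              # the decomposition type, kept as the raw string


def read_decompositions(lines):
    """Parse rows of: character, first component, second component (or -), type."""
    table = {}
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        character, first, second, kind = (field.strip() for field in row)
        table[character] = Decomposition(
            character, first, None if second == "-" else second, kind
        )
    return table


# Invented sample rows; "a" and "b" are placeholder type codes.
sample = io.StringIO("好,女,子,a\n一,一,-,b\n")
table = read_decompositions(sample)
```

To read the real file, passing an open file object in place of the sample should work, for example open(path, encoding="utf-8", newline=""), provided the encoding assumption holds.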

Licence

The CSV-format file is entirely my own work, and is distributed under the Apache License 2.0. Although I used many internet-based listings to help create the data, there's no trace of them in the data file itself.

The font file is based on some pre-existing proprietary or copyleft font, and inherits that licence.

If you need to use the decomposition data in some BSD-style-licensed work, you can write a quick script to replace each user-defined glyph with a unique multi-character identifying sequence.
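
One possible shape for such a script is sketched below in Python. It assumes the user-defined glyphs are encoded in the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF), which should be checked against the actual data; the function name replace_private_use and the &PUA-XXXX; sequence format are hypothetical choices, picked so that the replacement contains no commas.

```python
import re

# Assumption: user-defined glyphs sit in the BMP Private Use Area.
PRIVATE_USE = re.compile("[\uE000-\uF8FF]")


def replace_private_use(text):
    """Swap each private-use character for a sequence built from its code point."""
    return PRIVATE_USE.sub(lambda m: "&PUA-%04X;" % ord(m.group()), text)
```

Because each sequence is derived from the code point, two different glyphs never map to the same identifier. Applying the function to each line of the data file and writing the result to a new file would give a copy that no longer depends on the font.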

See ongoing progress on creating a CJK-character-based programming language at http://gavingrover.blogspot.com.









Hosted by Google Code