Code license: Apache License 2.0
Labels: cjk, unicode
Project owners:
  gavingrover

CJK Decomposition File

The CJK Decomposition File is an analysis of the graphical structure of the 20,934 most common Chinese/Japanese characters in Unicode (the 20,922 characters in the Unicode CJK common ideograph block, plus the 12 unique characters from the CJK compatibility block).

For each character, I've recorded one or two constituent components and a decomposition type. Only pictorial configurations are used, not semantic ones. Where a character's shape differs between typefaces, I've used the form shown in the Unicode spec reference listing. Where more than one configuration is possible, I've selected just one. I've "created" a few thousand characters to cater for decomposition components that are not themselves among the collected characters. (Although many of these are in the CJK extension A and B blocks, I kept those blocks out of scope.) To represent these extra characters in the data, I've sometimes used a multi-character sequence and sometimes a user-defined glyph.

Downloads

The download file is a zip containing 2 files:

(1) The CSV-format data file, with 4 fields: the character, the first component, the second component (or - if there is none), and the type of decomposition.

(2) A TrueType font file that makes the data file easier to view.
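
As a rough illustration of the 4-field layout described in (1), here is a minimal Python sketch that reads such records. The names `Decomposition` and `parse_decompositions` are made up for this example, the sample rows and the type codes `TYPE_A` and `TYPE_B` are placeholders rather than values taken from the real file, and the file is assumed to be plain comma-separated text with no header row.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class Decomposition(NamedTuple):
    """One record of the data file (hypothetical name, for illustration)."""
    character: str
    first: str
    second: Optional[str]  # None when the file has "-" in this field
    kind: str              # the decomposition type, kept as an opaque string


def parse_decompositions(text: str) -> List[Decomposition]:
    """Parse CSV text with 4 fields per row; skip blank or malformed rows."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 4:
            continue
        character, first, second, kind = (field.strip() for field in record)
        rows.append(
            Decomposition(character, first, None if second == "-" else second, kind)
        )
    return rows


# Placeholder rows: the type codes are invented, not the file's real codes.
sample = "好,女,子,TYPE_A\n一,一,-,TYPE_B\n"
entries = parse_decompositions(sample)
for entry in entries:
    print(entry)
```

To read the real file, the text could be loaded with `open(path, encoding="utf-8").read()`, assuming the file is UTF-8 encoded; the encoding is not stated above.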

Licence

The CSV-format file is entirely my own work, and is distributed under both the Apache software licence 2.0 and the LGPL. Although I used many internet-based listings to help create the data, there's no trace of them in the data file itself.

The font file is based on a pre-existing proprietary or copyleft font, and inherits that font's licence.

If you need to use the decomposition data in some BSD-style-licenced work, you can write a quick script to replace the user-defined glyphs with a unique multi-character identifying sequence.
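
One possible shape for such a script is sketched below in Python. It assumes the user-defined glyphs are stored as code points in the Unicode Private Use Area (U+E000 to U+F8FF), which is not stated above and should be checked against the actual file. The function name `replace_user_glyphs` and the `{pua-XXXX}` identifier format are arbitrary choices for this example.

```python
import re

# Assumption: user-defined glyphs live in the Basic Multilingual Plane
# Private Use Area, U+E000..U+F8FF. Adjust the range if the file differs.
PRIVATE_USE = re.compile("[\uE000-\uF8FF]")


def replace_user_glyphs(text: str) -> str:
    """Replace each private-use character with an identifier like {pua-E000}.

    The identifier embeds the code point in hex, so it is unique per glyph,
    and it contains no comma, so the CSV field layout is left intact.
    """
    return PRIVATE_USE.sub(lambda m: "{pua-%04X}" % ord(m.group()), text)


# Placeholder row: the type code is invented for illustration.
print(replace_user_glyphs("好,\ue000,子,TYPE_A"))
```

Because every identifier is derived from the code point, the same glyph always maps to the same sequence, whether it appears as a character being decomposed or as a component.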

Notes

(1) This data has been used in Christoph Burgmer's CJKLib software.

(2) See my own ongoing progress on creating a CJK-character-based programming language at http://gavingrover.blogspot.com and http://code.google.com/p/groovyscript.

(3) The vy-language beta downloads are now all deprecated because development of that language has ceased.








