Intelligent Chinese Input Method

Unlike Western languages, which use a small alphabet, Chinese is written with several thousand distinct characters (Han Zi). To enter Chinese text with a standard keyboard originally designed for English, a coding scheme that maps a sequence of Latin characters to Chinese text is required. An input method is a piece of software that sits between the normal user input interface and the low-level system I/O routines, transparently mapping key sequences to Chinese text.

There are several widely used coding schemes for entering Chinese text. The Pin Yin method is the most popular input method among Mandarin speakers. It maps a phonetic (Pin Yin) code to one or more Chinese characters. For instance, the following sample illustrates entering a sentence with the Pin Yin method:

                              Pin Yin Decoding
zhong hua ren min gong he guo ----------------> The People's Republic of China

One major deficiency of the Pin Yin method is translation ambiguity: one phonetic code can map to as many as 100 different Chinese characters. This is not surprising, since the coding scheme has to represent thousands of different Chinese characters with only 417 distinct phonetic codes. Hence a user must select the intended character while typing, which greatly slows down input. A good input method should be able to resolve translation ambiguity in some intelligent way and minimize the need for human intervention.
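To make the ambiguity concrete, here is a minimal Python sketch. The `CANDIDATES` table is a hand-picked, incomplete illustration (a real dictionary is far larger), and the function name `count_readings` is made up for this example. It only counts how many character strings a sequence of phonetic codes could stand for if every candidate were equally acceptable.

```python
from math import prod

# Hypothetical, heavily truncated candidate table: each Pin Yin code
# maps to some of the characters that share that pronunciation.
CANDIDATES = {
    "zhong": ["中", "种", "重", "钟", "众", "终", "忠", "肿"],
    "hua": ["华", "花", "话", "化", "画", "滑"],
}

def count_readings(codes):
    """Number of distinct character strings the code sequence could denote."""
    return prod(len(CANDIDATES[code]) for code in codes)

print(count_readings(["zhong", "hua"]))
```

Even with this tiny table, two codes already yield dozens of possible readings, which is why asking the user to pick every character by hand is slow.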

Methodology

A good Chinese input method should have the following properties:

Robust: the phonetic input may contain incomplete phonetic words, punctuation, and non-phonetic letters. A good system must deal with these situations gracefully.
Efficient: efficiency is another major consideration when choosing an input method. Since a single sentence may generate hundreds of segmentation results, the cost of selecting the best sentence should be as small as possible.

Typically, the process of phonetic-to-Chinese conversion consists of two stages:

segment a phonetic sequence into separate phonetic words
convert a series of phonetic words into corresponding Chinese text in a given context
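The first stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SYLLABLES` is a tiny made-up subset of the valid phonetic codes, and `segmentations` is a hypothetical helper that enumerates every way to split an unspaced key sequence into known codes. It is not the segmentation method used by this project.

```python
from functools import lru_cache

# Tiny illustrative subset of valid Pin Yin codes (assumption, not the full table).
SYLLABLES = frozenset({"xi", "an", "xian", "zhong", "hua", "ren", "min"})
MAX_LEN = max(len(s) for s in SYLLABLES)

def segmentations(text):
    """Return every split of `text` into codes from SYLLABLES."""

    @lru_cache(maxsize=None)
    def from_pos(start):
        # Reaching the end of the input yields one (empty) continuation.
        if start == len(text):
            return ((),)
        results = []
        for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
            piece = text[start:end]
            if piece in SYLLABLES:
                for rest in from_pos(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(seg) for seg in from_pos(0)]

print(segmentations("xian"))      # two competing splits
print(segmentations("zhonghua"))  # a single split
```

The input "xian" can be read either as one code or as "xi" followed by "an", which shows why segmentation alone can produce several hypotheses that the second stage has to rank.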

The first problem, segmentation, can be solved using standard Chinese segmentation methods. The second problem can be attacked with a Statistical Language Model (SLM). The system works in much the same way as the lattice scoring component of a speech recognition system: a character lattice is built from the input phonetic sequence, and then an SLM is employed to choose the best sentence that can be generated from the lattice.
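The lattice search can be sketched as a Viterbi pass over one column of candidate characters per phonetic word. To keep the example short it uses a bigram model rather than a trigram one, and every name and number in it (`CANDIDATES`, `BIGRAM_LOGP`, `UNSEEN_LOGP`, `decode`) is invented for illustration; none of it comes from the actual system.

```python
import math

# Hypothetical candidate characters for each phonetic word (lattice columns).
CANDIDATES = {
    "zhong": ["中", "种", "重"],
    "guo": ["国", "果", "过"],
}

# Made-up bigram log-probabilities; "<s>" marks the sentence start.
BIGRAM_LOGP = {
    ("<s>", "中"): math.log(0.5),
    ("<s>", "种"): math.log(0.3),
    ("<s>", "重"): math.log(0.2),
    ("中", "国"): math.log(0.6),
    ("种", "果"): math.log(0.3),
    ("重", "过"): math.log(0.1),
}
UNSEEN_LOGP = math.log(1e-4)  # crude floor for unseen pairs (assumption)

def bigram_logp(prev, cur):
    return BIGRAM_LOGP.get((prev, cur), UNSEEN_LOGP)

def decode(syllables):
    """Return the highest-scoring character string for the phonetic words."""
    # best[ch] = (log-probability of the best path ending in ch, that path)
    best = {"<s>": (0.0, "")}
    for syl in syllables:
        column = {}
        for cur in CANDIDATES[syl]:
            score, path = max(
                ((s + bigram_logp(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            column[cur] = (score, path + cur)
        best = column
    return max(best.values(), key=lambda item: item[0])[1]

print(decode(["zhong", "guo"]))
```

Because only the best path into each character is kept, the cost grows with the number of candidates per column rather than with the number of complete sentences in the lattice. A phonetic word missing from `CANDIDATES` raises `KeyError` in this sketch; a robust system would need a fallback.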

Currently, a trigram SLM is used to select the best path through the lattice. The adaptive part of the system is implemented with a memory-based learner that adjusts the model's parameters on-line according to the user's preferences. Both Pin Yin and Wu Bi are supported in whole-sentence input mode. The input method conforms to the XIM protocol and works as a standalone XIM server under the X Window System (Linux and FreeBSD). This software is still at an early stage and no code is available yet. However, you can view some screenshots:

graphics/wubi.png

Wu Bi input method in action

graphics/pinyin.png

Pin Yin input method in action

TODO List

Use a more sophisticated Whole Sentence Exponential Language Model to re-score the N-best list
Port to other major OSs: win32, OS X...
Use XMLRPC to do client/server communication
Embed into the console Chinese environment zhcon

This project has been suspended as of 2004 and will probably not be developed for a long time. The main reason is that I have become an experienced WuBi user and am satisfied with my current input speed: 35 - 60 characters per minute. I have therefore lost interest in developing a PinYin solution that is actually much slower than WuBi (at least for me). If you are a PinYin user and have not used WuBi before, I recommend giving it a try; you will be well rewarded in the end.

Last change: 16-Dec-2004. Please send any questions to Zhang Le.