[XeTeX] Polyglossia: Support for romanization of CJK

Thu Jun 16 01:16:49 CEST 2011

On 6/15/2011 11:44 AM, Gerrit wrote:
> Hello again, everyone,
>
> I am currently writing an article, in which I also have some 
> romanization of Japanese. Until now, I have to define the hyphenation 
> manually, which I think is a little bit of a nuisance.
>
> [snip]
>
> What do you think about that?

Since phonetic guide texts for CJKV are tied to characters, I would 
consider the most logical split one where the guide text is dictated by 
the character boundaries and the language used. Hyphenation for guide 
text would be strongly tied to the original text splits, as 
pronunciation guide text does not significantly run past the character 
boundary. (More creative uses of top text, such as the common Japanese 
practice of treating it as a 'thinking space', using the real text to 
express what is said and the guide text to express what is thought, 
wouldn't be covered by this, of course. Nor should they be, probably.)

To my knowledge, this is already automatically the case for (Mandarin) 
Chinese, as every character has only a single-syllable pronunciation, so 
hyphenation is unlikely to even matter; whether it's romanised or written 
in bopomofo, the guide text won't run past the character.

For Japanese this is also true for the most part, with a very small 
number of special words that consist of multiple characters that only 
have a single-syllable pronunciation (like 所為, romanised as "sei", 
which cannot be decomposed as [se]-[i]; in Japanese, the furigana for 
this is never split up over multiple lines either). Aside from these 
words, there are some "ateji" readings, where an originally 
character-less word has been assigned a set of characters that do not 
normally "spell" that word. For these, you would also need special 
hyphenation rules. However, the vast majority of Japanese words follow 
the rules of compositional reading, so 天国 (tengoku) would split up as 
天(ten-)//国(-goku) and 腹切り (harakiri) would split up as 
腹(hara-)//切り(-kiri), with optional guide text over the syllable 
り (ri) depending on the target audience.
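
As a rough illustration of the compositional case, here is a minimal 
Python sketch that allows a break in the romanisation only at the 
boundaries between segments of the original text. The segmentations are 
entered by hand; the names `Segment` and `romanise` are made up for this 
example and are not part of polyglossia or any existing package. Passing 
`\-` as the break mark produces TeX discretionary hyphens.

```python
from typing import List, Tuple

# Each segment pairs a chunk of base text with its romanised reading.
# The segmentations below are hand-entered assumptions for illustration.
Segment = Tuple[str, str]


def romanise(segments: List[Segment], break_mark: str = "-") -> str:
    """Join the readings, allowing a break only between segments."""
    return break_mark.join(reading for _base, reading in segments)


tengoku: List[Segment] = [("天", "ten"), ("国", "goku")]
harakiri: List[Segment] = [("腹", "hara"), ("切り", "kiri")]
sei: List[Segment] = [("所為", "sei")]  # one indivisible segment, so no break

for word in (tengoku, harakiri, sei):
    # Second column uses TeX's discretionary hyphen as the break mark.
    print(romanise(word), romanise(word, break_mark="\\-"))
# prints:
# ten-goku ten\-goku
# hara-kiri hara\-kiri
# sei sei
```

Treating 所為 as a single segment is what keeps "sei" from ever being 
split; the same trick would cover ateji readings.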

I do not know about character guide texts in other Asian languages that 
borrowed Chinese characters.

The main challenge would be to build the "which character maps to which 
reading in which word" dataset, which will be quite vast. For western 
languages, grammars can be constructed that fairly accurately describe 
where a word may be split, based on its written form. For 
CJK languages that approach goes straight out the window, because you 
can break anywhere in a sentence. This means that there is no concept of 
"hyphenation" for the text itself; it only applies to western guide text, 
which for Chinese character words requires knowing the pronunciation of 
these words (or taking a really good guess and allowing the author to 
override guesses). Particularly for Chinese and Japanese this leads to 
huge datasets. For Chinese, even though most characters are complete 
words and typically have only one pronunciation, there are easily ten 
thousand characters in daily use (although of course not all equally 
frequent). For Japanese, even though there are fewer characters to 
contend with, some 3500, the actual pronunciations depend on the words 
the characters are used in, and unlike Chinese most Japanese words are 
compound character words. That still leaves you with over ten thousand 
distinct combinations for which you can't really abstract pronunciation 
rules, because most characters in Japanese have three or four readings 
(at least). To get automated hyphenation right, you first need to tackle 
automatic guessing of pronunciation (even lexical analysers for Japanese 
like MeCab, ChaSen or YamCha can't get around this), and you'll end up 
with quite a few MB of data just to hyphenate guide text, and then only 
when it's western guide text.
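
To make the "dataset plus author override" idea concrete, here is a 
small Python sketch. Everything in it is hypothetical: `READINGS` is a 
three-entry stand-in for what would really be a multi-megabyte 
dictionary, and `hyphenated_reading` is a name invented for this 
example. A word that is not in the data yields `None` rather than a 
guess, so the author has to supply the reading.

```python
from typing import Dict, List, Optional, Tuple

# (base text chunk, romanised reading) pairs, in order.
Segments = List[Tuple[str, str]]

# Toy stand-in for the real "which character maps to which reading in
# which word" dataset; a usable one would be vastly larger.
READINGS: Dict[str, Segments] = {
    "天国": [("天", "ten"), ("国", "goku")],
    "腹切り": [("腹", "hara"), ("切り", "kiri")],
    "所為": [("所為", "sei")],
}


def hyphenated_reading(
    word: str,
    overrides: Optional[Dict[str, Segments]] = None,
    break_mark: str = "-",
) -> Optional[str]:
    """Return the reading with break marks, or None if the word is unknown.

    Author-supplied overrides always win over the dataset.
    """
    segments = (overrides or {}).get(word) or READINGS.get(word)
    if segments is None:
        return None
    return break_mark.join(reading for _base, reading in segments)


# 今日 read as "kyou" cannot be split per character, so the author
# supplies it as a single segment.
author_overrides: Dict[str, Segments] = {"今日": [("今日", "kyou")]}

print(hyphenated_reading("天国"))                    # ten-goku
print(hyphenated_reading("今日"))                    # None
print(hyphenated_reading("今日", author_overrides))  # kyou
```

The lookup itself is trivial; the mountain of work is in filling 
`READINGS` and in guessing readings for words that are not in it.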

That's not to discourage anyone from taking a stab at it; it's just 
quite a mountain of work.

- Mike "Pomax" Kamermans
nihongoresources.com