[XeTeX] anti-xunicode ;-)

Fri Jul 21 02:40:45 CEST 2006

Firmicus wrote:
> (1) Is this a sound way to solve the problem I face? If yes, would it 
> make sense to extend my package to cover a range of characters with 
> diacritics that many OT fonts are likely to lack, but which are easily 
> composable by TeX macros? Note that if a glyph is actually present in 
> the current font, it will be used instead of the TeX composition. What 
> I propose may in some way be a little bit sinful (from a pure 
> Unicode/OpenType perspective), but at least it tries to minimize those 
> little sins... ;-)
You can make your approach standard-compliant if you make sure that the 
characters that you input are in the Unicode canonical decomposition order.

For example, the sequences \u0045\u0323\u0301, \u0045\u0301\u0323, 
\u00C9\u0323, \u1EB8\u0301 are all valid Unicode representations of 
LATIN CAPITAL LETTER E WITH DOT BELOW AND COMBINING ACUTE ACCENT, which 
does not have a precomposed Unicode codepoint. Even for a common 
character such as LATIN CAPITAL LETTER E WITH ACUTE (\u00C9), the valid 
representations are both \u00C9 and \u0045\u0301.
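
The equivalence of these sequences can be checked with a short sketch using Python's standard unicodedata module (the variable names are only illustrative). It normalizes each of the four sequences and confirms that they collapse to a single canonical form:

```python
import unicodedata

# The four representations listed above.
sequences = [
    "\u0045\u0323\u0301",  # E, dot below, acute
    "\u0045\u0301\u0323",  # E, acute, dot below
    "\u00C9\u0323",        # E-acute, dot below
    "\u1EB8\u0301",        # E-dot-below, acute
]

# Canonical decomposition (NFD) puts the combining marks in canonical order.
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences}

# Canonical composition (NFC) composes as far as precomposed characters exist.
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences}

# One canonical form each: all four inputs are canonically equivalent.
print([f"U+{ord(c):04X}" for c in next(iter(nfd_forms))])
print([f"U+{ord(c):04X}" for c in next(iter(nfc_forms))])

# The simpler case: E + combining acute composes to a single codepoint.
e_acute = unicodedata.normalize("NFC", "\u0045\u0301")
```

Since there is no precomposed codepoint for the full combination, NFC stops at \u1EB8\u0301, while NFD always yields \u0045\u0323\u0301.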

Whenever a typesetting application finds a sequence of encoded 
characters that involves combining accents, it has several options for 
producing the final rendered glyph. For example, if the sequence 
\u0045\u0301 is in the stream, the application might:

(a) apply the Unicode canonical composition to the string, thus arriving 
at the codepoint \u00C9, and then attempt to render the glyph directly 
from the font, using the "cmap" table mapping for \u00C9 -> the "Eacute" 
glyph.
(b) apply the "ccmp" OpenType Layout feature onto the string 
\u0045\u0301. The two separate Unicode codepoints are then converted to 
glyphs using the "cmap" table, typically ending up with two glyphs "E" 
"acutecomb" (the "acutecomb" glyph may or may not be present). If both 
glyphs are present, the application applies the "ccmp" feature found in 
the font, which may include the substitution "E" "acutecomb" -> "Eacute".
(c) apply the "mark" OpenType Layout feature onto the string 
\u0045\u0301. The two separate Unicode codepoints are then converted to 
glyphs using the "cmap" table, typically ending up with two glyphs "E" 
"acutecomb" (the "acutecomb" glyph may or may not be present). If both 
glyphs are present, the application applies the "mark" feature found in 
the font, which may include an appropriate positioning of the glyph 
"acutecomb" over the glyph "E".
(d) apply "heuristic positioning" in other ways.
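
To make the four options concrete, here is a rough sketch of how an application could choose between them. Everything in it is hypothetical: the font is modelled as plain dictionaries (cmap, ccmp, mark_pairs) rather than real OpenType tables, the function name render_plan is made up, and the order in which the options are tried is only one possible policy, not something Unicode or OpenType prescribes.

```python
import unicodedata

def render_plan(text, cmap, ccmp=None, mark_pairs=None):
    """Pick one of the options (a)-(d) for a base + combining mark string.

    cmap:       hypothetical dict, codepoint -> glyph name
    ccmp:       hypothetical dict, tuple of glyph names -> composed glyph name
    mark_pairs: hypothetical set of (base, mark) glyph pairs with anchors
    """
    ccmp = ccmp or {}
    mark_pairs = mark_pairs or set()

    # (a) canonical composition, then a direct cmap lookup
    composed = unicodedata.normalize("NFC", text)
    if len(composed) == 1 and ord(composed) in cmap:
        return ("a", [cmap[ord(composed)]])

    # Map each codepoint to a glyph; None marks a missing glyph.
    glyphs = [cmap.get(ord(ch)) for ch in text]
    if None not in glyphs:
        # (b) glyph substitution via "ccmp"
        if tuple(glyphs) in ccmp:
            return ("b", [ccmp[tuple(glyphs)]])
        # (c) anchor-based positioning via "mark"
        if len(glyphs) == 2 and tuple(glyphs) in mark_pairs:
            return ("c", glyphs)

    # (d) nothing in the font helps: fall back to heuristics
    return ("d", glyphs)
```

With a cmap that contains \u00C9 this returns option (a); without it, the result depends on whether the toy font carries the "ccmp" substitution or the "mark" pair, and a font lacking "acutecomb" altogether ends up in (d).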

The last option (d) could involve trying to locate the glyph associated 
with \u0301 in the font ("acutecomb") and, if that’s not present, falling 
back to its spacing variant (\u00B4, "acute"). Then, the application might 
position either "acutecomb" or "acute" by simply centering it over the 
base glyph, or perhaps by doing some "smart" shifting around, for example 
shifting the accent up by the difference between the font’s cap height 
and x-height if the accent is placed over an uppercase letter.
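
A crude sketch of such a fallback might look like the following. All names and data structures are hypothetical (cmap and advances are plain dictionaries, not real font tables), the vertical shift is read as the difference between cap height and x-height, and the accent is assumed to be drawn for x-height letters with centering done on advance widths. A real implementation would more likely work with glyph bounding boxes.

```python
def fallback_accent(base_char, cmap, advances, cap_height, x_height):
    """Return (accent_glyph, dx, dy) for an acute over base_char, or None.

    cmap:     hypothetical dict, codepoint -> glyph name
    advances: hypothetical dict, glyph name -> advance width in font units
    """
    # Prefer the combining glyph, then its spacing variant.
    accent = cmap.get(0x0301) or cmap.get(0x00B4)
    if accent is None:
        return None  # nothing usable in the font

    base = cmap[ord(base_char)]

    # Horizontal: centre the accent's advance over the base's advance.
    dx = (advances[base] - advances[accent]) / 2

    # Vertical: raise the accent over uppercase letters only.
    dy = cap_height - x_height if base_char.isupper() else 0

    return accent, dx, dy
```

For a font with only the spacing "acute", an uppercase base gets the accent shifted up by cap height minus x-height, while a lowercase base leaves it at its designed height.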

None of this heuristic positioning is specified by either Unicode or 
OpenType, but the application is free to try to optimize the final 
appearance of the text that way.

I’ve included some discussions of the OpenType implementation of such 
situations here:
http://groups.msn.com/fontlab/tipsandtricks.msnw?action=get_message&mview=0&ID_Message=3403

Regards,
Adam Twardoch
Fontlab Ltd. / MyFonts / Silesian Letters

-- 

Adam Twardoch
http://www.twardoch.com/