[XeTeX] anti-xunicode ;-)
Adam Twardoch
list.adam at twardoch.com
Fri Jul 21 02:40:45 CEST 2006
Firmicus wrote:
> (1) Is this a sound way to solve the problem I face? If yes, would it
> make sense to extend my package to cover a range of characters with
> diacritics that many OT fonts are likely to lack, but which are easily
> composable by TeX macros? Note that if a glyph is actually present in
> the current font, it will be used instead of the TeX composition. What
> I propose may in some way be a little bit sinful (from a pure
> Unicode/OpenType perspective), but at least it tries to minimize those
> little sins... ;-)
You can make your approach standard-compliant if you make sure that the
characters that you input are in the Unicode canonical decomposition order.
For example, the sequences \u0045\u0323\u0301, \u0045\u0301\u0323,
\u00C9\u0323, \u1EB8\u0301 are all valid, canonically equivalent Unicode
representations of LATIN CAPITAL LETTER E WITH DOT BELOW AND COMBINING
ACUTE ACCENT, which does not have a precomposed Unicode codepoint. Even
for a common
character such as LATIN CAPITAL LETTER E WITH ACUTE (\u00C9), the valid
representations are both \u00C9 and \u0045\u0301.
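The equivalence of these sequences can be checked with any Unicode
normalization implementation. Below is a minimal sketch using Python's
standard unicodedata module; the variable and function names are only
illustrative. It normalizes the four sequences and shows that they
collapse to a single form.

```python
import unicodedata

# The four sequences from the message, written as Python string escapes.
variants = [
    "\u0045\u0323\u0301",  # E, dot below, acute
    "\u0045\u0301\u0323",  # E, acute, dot below
    "\u00C9\u0323",        # E with acute, dot below
    "\u1EB8\u0301",        # E with dot below, acute
]

def codepoints(s):
    """Format a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Canonical decomposition (NFD) also applies canonical reordering,
# so the dot below (combining class 220) sorts before the acute (230).
nfd_forms = {unicodedata.normalize("NFD", s) for s in variants}

# Canonical composition (NFC) composes as far as precomposed characters exist.
nfc_forms = {unicodedata.normalize("NFC", s) for s in variants}

for form in sorted(nfd_forms):
    print("NFD:", codepoints(form))
for form in sorted(nfc_forms):
    print("NFC:", codepoints(form))
```

Each set ends up with exactly one member, which is what canonical
equivalence means in practice: input in any of the four spellings can
be brought to the same canonically ordered sequence before the macros
look at it.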
Whenever a typesetting application finds a sequence of encoded
characters that involves combining accents, it has several options for
producing the final rendered glyph. For example, if the sequence
\u0045\u0301 occurs in the stream, the application might:
(a) apply the Unicode canonical composition to the string, thus arriving
at the codepoint \u00C9, and then attempt to render the glyph directly
from the font, using the "cmap" table mapping for \u00C9 -> the "Eacute"
glyph.
(b) apply the "ccmp" OpenType Layout feature onto the string
\u0045\u0301. The two separate Unicode codepoints are then converted to
glyphs using the "cmap" table, typically ending up with two glyphs "E"
"acutecomb" (the "acutecomb" glyph may or may not be present). If both
glyphs are present, the application applies the "ccmp" feature found in
the font, which may include the substitution "E" "acutecomb" -> "Eacute".
(c) apply the "mark" OpenType Layout feature onto the string
\u0045\u0301. The two separate Unicode codepoints are then converted to
glyphs using the "cmap" table, typically ending up with two glyphs "E"
"acutecomb" (the "acutecomb" glyph may or may not be present). If both
glyphs are present, the application applies the "mark" feature found in
the font, which may include an appropriate positioning of the glyph
"acutecomb" over the glyph "E".
(d) apply "heuristic positioning" in other ways.
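Options (a) to (d) can be read as a fallback cascade. The sketch below
is only an illustration under stated assumptions: the dictionaries
stand in for a font's "cmap" table and its "ccmp" and "mark" lookups,
they are not read from a real font, and the order in which the options
are tried is my choice here, not something Unicode or OpenType
prescribes. The function name render_plan is hypothetical.

```python
import unicodedata

def render_plan(text, cmap, ccmp, mark):
    """Pick a rendering strategy for one base letter plus its accents.

    cmap: code point -> glyph name (toy stand-in for the "cmap" table)
    ccmp: tuple of glyph names -> composed glyph name (toy "ccmp" lookup)
    mark: tuple of glyph names -> (dx, dy) attachment (toy "mark" lookup)
    Returns (strategy label, list of glyph names; None marks a missing glyph).
    """
    # (a) canonical composition, then a direct cmap lookup
    composed = unicodedata.normalize("NFC", text)
    if len(composed) == 1 and ord(composed) in cmap:
        return ("a: precomposed", [cmap[ord(composed)]])

    # Otherwise map each code point of the decomposed string to a glyph.
    decomposed = unicodedata.normalize("NFD", text)
    glyphs = [cmap.get(ord(ch)) for ch in decomposed]

    if None not in glyphs:
        key = tuple(glyphs)
        # (b) glyph substitution through "ccmp"
        if key in ccmp:
            return ("b: ccmp", [ccmp[key]])
        # (c) mark positioning through "mark"
        if key in mark:
            return ("c: mark", glyphs)

    # (d) nothing in the font helps, fall back to heuristics
    return ("d: heuristic", glyphs)

# Toy fonts (assumed data) exercising each branch for the sequence E + U+0301.
seq = "\u0045\u0301"
font_a = {0x0045: "E", 0x00C9: "Eacute"}
font_bc = {0x0045: "E", 0x0301: "acutecomb"}
font_d = {0x0045: "E", 0x00B4: "acute"}
pair = ("E", "acutecomb")

print(render_plan(seq, font_a, {}, {}))
print(render_plan(seq, font_bc, {pair: "Eacute"}, {}))
print(render_plan(seq, font_bc, {}, {pair: (0, 0)}))
print(render_plan(seq, font_d, {}, {}))
```

A real layout engine works on whole glyph runs and applies features
through the font's GSUB and GPOS tables, so this only shows the
decision structure, not an actual shaper.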
The last option (d) could involve trying to locate the glyph associated
with \u0301 in the font ("acutecomb") and, if that’s not present, trying
its spacing variant (\u00B4, "acute"). Then, the application might
position either "acutecomb" or "acute" by simply centering it over the
base glyph, or perhaps by doing some "smart" shifting around, for
example shifting the accent up by the difference between the font’s cap
height and x-height if the accent is placed over an uppercase letter.
None of this heuristic positioning is specified by either Unicode or
OpenType, but the application is free to try to optimize the final
appearance of the text that way.
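The arithmetic behind such a heuristic is simple. Here is a rough
sketch, assuming the accent glyph is drawn to sit just above the
x-height and that all metrics are in font units; the function names,
the parameters and the sample numbers are hypothetical and not taken
from any particular application or font.

```python
def pick_accent_glyph(cmap):
    """Prefer the combining acute (U+0301), else its spacing variant (U+00B4)."""
    return cmap.get(0x0301) or cmap.get(0x00B4)

def heuristic_accent_offset(base_width, accent_width, base_is_uppercase,
                            cap_height, x_height):
    """Return (dx, dy) to move the accent relative to the base glyph origin.

    dx centers the accent's advance width over the base's advance width.
    dy lifts the accent by (cap height - x-height) over uppercase letters,
    on the assumption that the accent was designed for lowercase height.
    """
    dx = (base_width - accent_width) / 2
    dy = (cap_height - x_height) if base_is_uppercase else 0
    return dx, dy

# Assumed sample metrics: a font with only the spacing acute available.
cmap = {0x0045: "E", 0x00B4: "acute"}
accent = pick_accent_glyph(cmap)
dx, dy = heuristic_accent_offset(base_width=600, accent_width=200,
                                 base_is_uppercase=True,
                                 cap_height=700, x_height=500)
print(accent, dx, dy)
```

Centering on advance widths ignores side bearings and italic slant, and
combining glyphs often have zero advance width, so a real
implementation would more likely center on the glyph bounding boxes.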
I’ve included some discussions of the OpenType implementation of such
situations here:
http://groups.msn.com/fontlab/tipsandtricks.msnw?action=get_message&mview=0&ID_Message=3403
Regards,
Adam Twardoch
Fontlab Ltd. / MyFonts / Silesian Letters
--
Adam Twardoch
http://www.twardoch.com/