[XeTeX] xunicode.sty bug
Jonathan Kew
jonathan_kew at sil.org
Tue Jul 18 12:36:17 CEST 2006
On 18 Jul 2006, at 11:03 am, Ralf Stubner wrote:
> Jonathan Kew <jonathan_kew at sil.org> writes:
>
>>> Ux00AD soft hyphen
>>
>> This is the Unicode character that means essentially the same as
>> TeX's "\-". A non-printing layout control that indicates a potential
>> break point, not a visible character in its own right. If the line
>> actually breaks there, the appropriate visible manifestation is
>> script/language-dependent; a common default would be to insert U+2010
>> before the break, but this is not universally correct.
>
> I vaguely remember that there are some discussions concerning soft
> hyphen being nonprinting or not. Might have been on
> <URL:http://www.cs.tut.fi/~jkorpela/shy.html>. I don't have a clear
> opinion here at the moment.
Yes, this is an interesting and informative discussion (and it's a
messy situation!).
It seems to me that the ISO-8859-1 code 0xAD was closer to being a
presentational glyph than a character, in terms of the Unicode/WG2
character/glyph model (but the model was not clearly articulated at
that time), while Unicode itself defines U+00AD more clearly as a
layout control character.
> It is a printing character in fonts like
> MinionPro or Charis SIL.
Right; many (most) fonts map this character to a visible hyphen
glyph. However, the Standard (p.388) says:
<quote src="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf">
Hyphenation. U+00AD SOFT HYPHEN (SHY) indicates an intraword break
point, where a line break is preferred if a word must be hyphenated
or otherwise broken across lines. Such break points are generally
determined by an automatic hyphenator. The use of SHY is generally
limited to situations where users need to override the behavior of
such a hyphenator.
The visible rendering of a line break at an intraword break point,
whether automatically determined or indicated by a SHY, depends on
the surrounding characters, the language, and, at times, the meaning
of the word. The precise rules are outside the scope of this
standard, but see Unicode Standard Annex #14, “Line Breaking
Properties,” for additional information. A common default rendering
is to insert a hyphen before the line break, but this is incorrect in
many situations.
</quote>
As such, U+00AD should not normally be rendered directly by a text
display system, and so it is irrelevant what glyph is in the font. If
the potential break position indicated by U+00AD is not used, it
should have no visible result at all; and if the position is used, it
should be rendered as appropriate depending on the surrounding
characters, language, etc.
Having a visible glyph for U+00AD in a font may be useful if text is
displayed by a "dumb" system that does not handle its Unicode
semantics. But in this case, it may be a bad idea for the glyph to
look like a "normal" hyphen, as this could mislead people into using
it thinking that it will always be a visible character. Using a
specially-marked glyph (e.g., with a dashed box around it) might be a
better choice. (This can also be used by editors that want to support
a "show invisibles" mode.)
In the case of XeTeX, I think a sensible default (to handle the
situation where U+00AD occurs in the input text) would be to say:
\catcode"AD=\active
\let^^ad=\-
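A minimal test file along these lines might look like the following.
This is only a sketch, assuming plain XeTeX (run with "xetex"); the
narrow \hsize, the \tolerance setting and the sample word are
arbitrary choices for illustration, and ^^ad is used in the text so
that the otherwise invisible U+00AD can be seen in the source.

```tex
% Sketch of a plain XeTeX test file; layout values are illustrative only.
\catcode"AD=\active   % make U+00AD (soft hyphen) an active character
\let^^ad=\-           % ... and give it the meaning of TeX's \-
\hsize=1.5in          % assumed narrow measure, to provoke line breaks
\tolerance=10000      % accept loose lines so the narrow measure is usable
\parindent=0pt
% Each ^^ad below is read as the active character U+00AD, i.e. as \-.
man^^adu^^adscript man^^adu^^adscript man^^adu^^adscript
man^^adu^^adscript man^^adu^^adscript man^^adu^^adscript
\bye
```

Where a line happens to break at one of the marked positions, TeX
inserts the current font's \hyphenchar before the break; where the
position is not used, the U+00AD leaves no visible trace. Note that
this gives only the "common default" rendering, not the
language-dependent behaviour the Standard describes.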
JK