[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data
Jonathan Kew
jfkthame at gmail.com
Wed May 6 16:09:55 CEST 2015
On 6/5/15 14:14, Joseph Wright wrote:
> Based on the current files, we have a block to set \XeTeXcharclass,
> which only applies to XeTeX. The logic followed in that code is that
> characters in the file LineBreak.txt which have class "ID" (ideographs)
> not only set the \XeTeXcharclass class to 1 but also set the \catcode of
> the code point to 11. That leads to a difference between the two Unicode
> engines. My current feeling is that the data file should split this
> process such that the category code change applies to both XeTeX and
> LuaTeX, with the XeTeX-specific code separate. Does this make sense and
> indeed does the current assignment make sense?
>
ISTM that the most appropriate (default) \catcode for characters with
class ID is clearly letter (11), and would suggest that LuaTeX should
follow XeTeX in this.
So yes, splitting out the XeTeX-specific code and having LuaTeX share
the catcode assignments makes sense.
After all, if users can write control sequences such as
\hello
\halló
\Здравствуйте
\ሰላም
\सलाम
they should equally well be able to write
\你好
\こんにちわ
and have each of these treated as single control sequences, too. This
will not work if characters of line-break class ID are given catcode 12
(other).
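
A minimal sketch of the split discussed above, not the actual contents of the data file. It assumes plain TeX under XeTeX or LuaTeX, UTF-8 input, and that the range "4E00-"9FFF is listed with class ID in LineBreak.txt; the range is only one illustrative block, and the use of \count255 as a scratch register is my own choice:

```tex
% Sketch only: one illustrative range, assumed to have class ID in LineBreak.txt.
% \count255 is used as a scratch register at the top level.
\count255="4E00
\loop
  % Shared by XeTeX and LuaTeX: make the character a letter.
  \catcode\count255=11
  % XeTeX only: \XeTeXcharclass does not exist in LuaTeX.
  \ifx\XeTeXcharclass\undefined\else
    \XeTeXcharclass\count255=1
  \fi
  \ifnum\count255<"9FFF
  \advance\count255 by 1
\repeat

% With catcode 11 in place, this defines one control sequence named \你好.
\def\你好{hello}
```

If the two characters had catcode 12 instead, TeX would read \你 as a single-character control symbol followed by 好, so the definition would not create the intended name.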
If you're making improvements to unicode-letters.def, I would suggest
also adding a section that assigns catcode 15 (invalid) to the code
values "D800 - "DFFF (i.e. the UTF-16 surrogates, which should never be
used in isolation as characters).
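
Something along these lines might do for the surrogates. This is an untested sketch in plain TeX, again using \count255 as a scratch register by assumption, and it presumes the engine accepts \catcode assignments for these code points:

```tex
% Sketch only: mark the UTF-16 surrogate code points "D800-"DFFF as invalid.
\count255="D800
\loop
  \catcode\count255=15
  \ifnum\count255<"DFFF
  \advance\count255 by 1
\repeat
```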
JK