[XeTeX] XeTeX and code page 65001
d fulano
donfulanito at hotmail.com
Thu Feb 9 17:11:09 CET 2012
This seems to be the way the command prompt behaves with an
invalid (incomplete) UTF-8 sequence. Even other command-prompt
programs, e.g. ftp, seem to behave strangely with the 65001 code page
if random accented characters are typed which correspond to
invalid UTF-8.
In UTF-8, to typeset é (e-acute, Unicode U+00E9), you need to type 2 bytes:
195 169 in decimal
C3 A9 in hex
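
The two byte values above can be checked with a few lines of Python (a minimal sketch; the variable names are only illustrative):

```python
# Encode U+00E9 (é) as UTF-8 and list the resulting bytes.
e_acute = "\u00e9"
utf8_bytes = e_acute.encode("utf-8")

decimal_values = list(utf8_bytes)                    # byte values in decimal
hex_values = [format(b, "02X") for b in utf8_bytes]  # same bytes in hex

print(decimal_values)  # [195, 169]
print(hex_values)      # ['C3', 'A9']
```
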
In contrast, when you type é as the single byte E9 (as in a one-byte
code page), UTF-8 takes this as the first of 3 bytes, which actually
encode Chinese characters in the range U+9000 to U+9FFF.
So something fails in interactive TeX, as there are no other
valid bytes following the é as required in UTF-8.
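
A short Python sketch of the same point: E9 followed by two continuation bytes decodes to a character in the U+9000 block, while E9 on its own is rejected as incomplete. This only illustrates how a strict UTF-8 decoder treats the byte; it is an assumption that the command prompt or XeTeX fails for exactly this reason.

```python
# E9 is a UTF-8 lead byte that announces a 3-byte sequence.
first = bytes([0xE9, 0x80, 0x80]).decode("utf-8")  # lowest code point with lead byte E9
last = bytes([0xE9, 0xBF, 0xBF]).decode("utf-8")   # highest code point with lead byte E9
print(hex(ord(first)), hex(ord(last)))  # 0x9000 0x9fff

# E9 alone is an incomplete sequence, so a strict decoder refuses it.
lone_byte = bytes([0xE9])
try:
    lone_byte.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError as err:
    lone_is_valid = False
    print("invalid UTF-8:", err.reason)
```
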
You can test the above as follows:
-a- create a text file with the line:
\font\arial="Arial Unicode MS" at 12pt\arial é\bye
The two characters are created with Character Map or
on the numeric keypad: alt-195 alt-169.
Running this through xetex produces just one character, é (e-acute).
XeTeX expects UTF-8 input.
(You need to make sure, though, that the editor you use doesn't
try to be 'helpful' by re-encoding the two strange characters as UTF-8,
resulting in 4 bytes, so don't choose 'save as UTF-8 format'.)
-b- if you create \é (backslash e-acute) in a file and run
it, the program stops with 'undefined control sequence \é';
the two characters displayed after the backslash are the
UTF-8 encoding of é (e-acute). So XeTeX outputs UTF-8 text.
At least this is what I get on my pc.
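
If you would rather not depend on how an editor saves the file, the test file from step -a- can be written as raw bytes. The sketch below does that in Python and also shows the 4-byte result of the 'helpful' re-encoding. The file name and the use of cp1252 as the editor's code page are assumptions made for illustration only.

```python
import os
import tempfile

# The TeX line from step -a-. The é is written as the escape \u00e9 so the
# script does not depend on the encoding of its own source file.
tex_line = '\\font\\arial="Arial Unicode MS" at 12pt\\arial \u00e9\\bye\n'

good_bytes = tex_line.encode("utf-8")  # é becomes the two bytes C3 A9

# What a 'helpful' editor does: it reads C3 A9 as two separate characters
# (cp1252 assumed here) and encodes each of them as UTF-8 again.
double_bytes = good_bytes.decode("cp1252").encode("utf-8")

# Hypothetical file name; binary mode so nothing is re-encoded on write.
path = os.path.join(tempfile.gettempdir(), "eacute-test.tex")
with open(path, "wb") as f:
    f.write(good_bytes)

print(len(double_bytes) - len(good_bytes))  # 2 extra bytes: é now takes 4
```
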
UTF-8 is not the same as Unicode; it's an encoding for Unicode, which
takes good Unicode characters and translates them into multi-byte 'garbage'.
Only the 128 ASCII characters (codes 0 to 127) stay the same under UTF-8, and
the rest convert into multi-byte sequences.
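
As a quick check of the ASCII claim, this Python sketch confirms that codes 0 to 127 encode to the same single byte and shows how many bytes a few other code points need (the sample code points are arbitrary picks):

```python
# Every ASCII code point encodes to one byte with the same value.
ascii_unchanged = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128)
)
print(ascii_unchanged)  # True

# Number of UTF-8 bytes needed by a few sample code points.
samples = (0x7F, 0x80, 0xE9, 0x7FF, 0x800, 0x9000)
lengths = {hex(cp): len(chr(cp).encode("utf-8")) for cp in samples}
print(lengths)
```
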
But why do you want to change your *keyboard* input to UTF-8 anyway?
It's not as if you can do the UTF-8 conversion in your head and type the
converted characters in.
XeTeX expects UTF-8 input by default, so you could simply 'type' in UTF-8-encoded
characters, e.g. é, and it would work. (It is easier to use a UTF-8-enabled text
editor, though...) So there is no need for special translation.
XeTeX also produces UTF-8 output on the screen by default; at least this is what
I see when there is a problem with accented characters (their UTF-8 encodings are shown).
It's just that the command window doesn't translate these UTF-8 characters into
nice glyphs. And that is the case regardless of the cp 65001 setting.
The switch for a Unicode-enabled command prompt, "cmd /u", doesn't help
with this either.
In all the above cases, however (whatever the chcp setting and whether or not
the /u switch is used), if I open the TeX .log file with a UTF-8 text editor,
I see the correct symbols anyway.
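
The difference between the screen and the log file comes down to how the same two bytes are read. A small Python sketch, assuming for illustration that the display side uses a one-byte code page such as cp1252 or cp437 (which code page a given console actually uses will vary):

```python
raw = "\u00e9".encode("utf-8")  # the two bytes C3 A9 written for é

as_cp1252 = raw.decode("cp1252")  # read byte by byte: two unrelated glyphs
as_cp437 = raw.decode("cp437")    # another one-byte code page, also two glyphs
as_utf8 = raw.decode("utf-8")     # read as UTF-8: the single character é

print(len(as_cp1252), len(as_cp437), len(as_utf8))  # 2 2 1
```
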
>
> F:\>xetex
> This is XeTeX, Version 3.1415926-2.2-0.9997.4 (Web2C 2010)
> restricted \write18 enabled.
> **\relax
> entering extended mode
>
> *é
>
> ! Emergency stop.
> <*> \relax
>
> No pages of output.
> Transcript written on texput.log.
As you can see, the é, when entered in code page
65001, is interpreted as a Ctrl-z.