[XeTeX] XeTeX and code page 65001

Thu Feb 9 17:11:09 CET 2012

It seems to me that this is how the command prompt behaves with an
invalid (incomplete) UTF-8 sequence. Other command-prompt programs,
e.g. ftp, also seem to behave strangely under code page 65001 if
random accented characters are typed that correspond to invalid UTF-8.

In UTF-8, to typeset é (e-acute, Unicode U+00E9), you need to type 2 bytes:

195 169 in decimal
C3 A9 in hex

In contrast, when you type é as a single byte (E9, as in Latin-1), UTF-8
reads it as the first of 3 bytes, which together encode Chinese
characters in the range U+9000 to U+9FFF.

So something fails in interactive TeX, as there are no other valid
bytes following the é, as UTF-8 requires.
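
The byte arithmetic above can be checked with a short Python sketch. This is only an illustration of the encoding rules, not of what the Windows console or XeTeX actually do internally. The variable names are made up for the example.

```python
# Illustration only: how UTF-8 treats é and a lone 0xE9 byte.
e_acute = "\u00e9"                     # é, Unicode U+00E9
utf8_bytes = e_acute.encode("utf-8")   # the two bytes C3 A9
print(utf8_bytes.hex(" "), list(utf8_bytes))

# A lone 0xE9 byte is the lead byte of a 3-byte sequence, so on its
# own it is incomplete and a strict decoder rejects it.
try:
    b"\xe9".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
print("lone E9 decodes:", lone_byte_ok)

# Followed by two continuation bytes (80..BF), the same lead byte
# lands somewhere in U+9000..U+9FFF.
lowest = ord(b"\xe9\x80\x80".decode("utf-8"))
highest = ord(b"\xe9\xbf\xbf".decode("utf-8"))
print(hex(lowest), hex(highest))
```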

You can test the above as:

-a- create a text file with the line:

\font\arial="Arial Unicode MS" at 12pt\arial Ã©\bye

The two characters are created with the Character Map or
on the numeric keypad: alt-195 alt-169.

Running this through xetex produces just one character, é (e with acute).
XeTeX expects UTF-8 input.

(You need to make sure, though, that the editor you use doesn't
try to be 'helpful' by re-encoding the two strange characters as UTF-8,
resulting in 4 bytes, so don't choose 'save as UTF-8 format'.)
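
A small Python sketch of the 'helpful editor' problem, assuming the editor displays the two bytes as Latin-1 characters and then re-saves them as UTF-8. The variable names are hypothetical.

```python
# Hypothetical illustration of double encoding by an editor.
intended = b"\xc3\xa9"                  # the two bytes XeTeX should see
as_typed = intended.decode("latin-1")   # shown in the editor as two characters
resaved = as_typed.encode("utf-8")      # each character re-encoded separately

print(len(intended), "bytes intended:", intended.hex(" "))
print(len(resaved), "bytes after re-saving:", resaved.hex(" "))

# Read back as UTF-8, the intended bytes give one character,
# the re-saved bytes give two.
print(len(intended.decode("utf-8")), len(resaved.decode("utf-8")))
```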

-b- If you put \é (backslash e-acute) in a file and run
it, the program stops with undefined control sequence \Ã©

The two characters displayed after the backslash are the
UTF-8 encoding of é (e-acute). So XeTeX outputs UTF-8 text.

At least this is what I get on my PC.
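
To see why the same bytes can show up as different glyphs, here is a rough Python sketch that decodes the UTF-8 bytes of \é under a few code pages. Which code page a given console actually uses is an assumption here. cp1252 and cp850 are just common Western European examples.

```python
# Sketch: the same bytes interpreted under different code pages.
log_bytes = "\\\u00e9".encode("utf-8")   # backslash + é, as UTF-8 bytes
print(log_bytes.hex(" "))

for codec in ("utf-8", "cp1252", "cp850"):
    # Only the interpretation changes, never the bytes themselves.
    print(codec, "->", log_bytes.decode(codec))

shown_cp1252 = log_bytes.decode("cp1252")
```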

UTF-8 is not the same as Unicode; it's an encoding for Unicode, which
takes good Unicode characters and translates them into multi-byte 'garbage'.
Only the 128 ASCII characters (codes 0 to 127) stay the same under UTF-8;
the rest convert into multi-byte sequences.
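
As a quick check of the ASCII claim, a minimal Python sketch. The sample characters are chosen arbitrarily for illustration.

```python
# Sketch: byte length of a few sample characters in UTF-8.
samples = ["A", "\u007f", "\u00e9", "\u20ac", "\u9000"]
lengths = {hex(ord(ch)): len(ch.encode("utf-8")) for ch in samples}
print(lengths)

# Every code from 0 to 127 is a single byte with the same value in UTF-8.
ascii_unchanged = all(chr(n).encode("utf-8") == bytes([n]) for n in range(128))
print("ASCII unchanged:", ascii_unchanged)
```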

But why do you want to change your *keyboard* input to UTF-8 anyway?
It's not as if you can do the UTF-8 conversion in your head and type the
converted characters in.

XeTeX expects UTF-8 input by default, so you could simply 'type' in
UTF-8-encoded characters, e.g. Ã©, and it would work. (Easier to use a
UTF-8-enabled text editor, though...) So there is no need for special
translation.

XeTeX also produces UTF-8 output on the screen by default; at least this
is what I see when there is a problem with accented characters (their
UTF-8 encodings).

It's just that the command window doesn't translate these UTF-8 characters
into nice glyphs. And that is the case regardless of the cp 65001 setting.
The switch for a Unicode-enabled command prompt, "cmd /u", doesn't help
with this either.

In all the above cases, however (whatever the chcp setting and whether or
not the /u switch is used), if I open the TeX .log file with a UTF-8 text
editor, I see the correct symbols anyway.

>
> F:\>xetex
> This is XeTeX, Version 3.1415926-2.2-0.9997.4 (Web2C 2010)
>  restricted \write18 enabled.
> **\relax
> entering extended mode
>
> *é
>
> ! Emergency stop.
> <*> \relax
>
> No pages of output.
> Transcript written on texput.log.

As you can see, the é, when entered in code page
65001, is interpreted as a Ctrl-z.