[XeTeX] XeTeX and code page 65001

Zdenek Wagner zdenek.wagner at gmail.com
Thu Feb 9 17:42:47 CET 2012


2012/2/9 d fulano <donfulanito at hotmail.com>:
>
> It seems that this is the way the command prompt behaves with an
>
> invalid (incomplete) utf8 sequence. Even other command prompt
>
> programs, e.g. ftp, seem to behave strangely with the 65001 codepage
>
> if random accented characters are typed which correspond to
>
> invalid utf8.
>
>
>
>
>
> In utf-8, to typeset é (e acute, Unicode U+00E9), you need to type 2 bytes:
>
> 195 169 in decimal
>
> C3 A9 in hex
>
>
>
> In contrast, if the single byte E9 (é in Latin-1) arrives where UTF-8
>
> is expected, it signifies the first of 3 bytes, which together encode
>
> Chinese characters in the range U+9000 to U+9FFF.
>
> So something fails in interactive TeX, as no valid continuation
>
> bytes follow the é as utf-8 requires.
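
The byte arithmetic above can be checked with a short Python sketch. Python is used here only for illustration and is not part of the original thread; the variable names are hypothetical.

```python
# é is U+00E9; its UTF-8 encoding is the two bytes C3 A9 (195 169).
e_acute = "\u00e9"
utf8_bytes = e_acute.encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8_bytes), list(utf8_bytes))

# A lone byte E9 (é in Latin-1) is not valid UTF-8:
# 0xE9 = 1110 1001 announces a three-byte sequence.
lone = bytes([0xE9])
try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# Completing the sequence with the smallest and largest continuation
# bytes gives the range of code points that a leading E9 can start.
lowest = bytes([0xE9, 0x80, 0x80]).decode("utf-8")
highest = bytes([0xE9, 0xBF, 0xBF]).decode("utf-8")
print(lone_is_valid, f"U+{ord(lowest):04X}", f"U+{ord(highest):04X}")
```

The decoder rejects the lone E9 because the two continuation bytes it announces never arrive, which matches the incomplete-sequence explanation given in the quoted text.
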
>
>
>
> You can test the above as:
>
>
>
> -a- create a text file with the line:
>
>
>
> \font\arial="Arial Unicode MS" at 12pt\arial é\bye
>
This must be some Windows misfeature. My bash shell in Linux works
correctly: locales are set to UTF-8, and I can type Czech accented
characters as well as Devanagari directly on the keyboard. The
following test works as expected:

This is XeTeX, Version 3.1415926-2.3-0.9997.5 (TeX Live 2011)
**\relax
entering extended mode

*\font\a="Nakula" \a ěšč करना \bye

>
>
> The two characters are created by character map or
>
> on the numeric keyboard: alt-195 alt-169.
>
> Running this through xetex produces just one character, é (the e with acute).
>
> Xetex expects utf8 input.
>
> (You need to make sure, though, that the editor you use doesn't
>
> try to be 'helpful' by re-encoding the two strange characters as utf8,
>
> resulting in 4 bytes, so don't choose 'save as utf8 format'.)
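
The "4 bytes" effect described above can be reproduced in a few lines of Python. This is only a sketch: it assumes the editor reads the file as Windows-1252 (the thread does not say which legacy code page is involved), and the variable names are hypothetical.

```python
# The correct UTF-8 encoding of é: two bytes, C3 A9.
original = "\u00e9".encode("utf-8")

# Assumption: the editor interprets the file as Windows-1252, so it sees
# two separate characters, U+00C3 and U+00A9, instead of one é.
misread = original.decode("cp1252")

# Saving those two characters "as UTF-8" encodes each of them again.
double_encoded = misread.encode("utf-8")

print(len(original), len(double_encoded))
print(" ".join(f"{b:02X}" for b in double_encoded))
```

Each of the two misread characters lies above the ASCII range, so each becomes two bytes on its own, which is where the four bytes come from.
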
>
>
>
> -b- if you create \é (backslash e-acute) in a file and run
>
> it, the program stops with undefined control sequence \é.
>
> The two characters displayed after the backslash are the
>
> utf8 encoding of é (e-acute). So Xetex outputs utf8 text.
>
>
>
> At least this is what I get on my pc.
>
>
>
> utf8 is not the same as unicode, it's an encoding for unicode, which
>
> takes good unicode characters and translates into multi-byte 'garbage'.
>
> Only the 128 ASCII characters (codes 0 to 127) stay the same under UTF8, and
>
> the rest convert into multi-byte sequences.
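
The claim that ASCII survives unchanged while everything else becomes multi-byte can be checked directly. The following Python sketch is an illustration only; the sample characters are taken from this thread, and the variable names are hypothetical.

```python
# Characters taken from this thread: A (ASCII), é (U+00E9), and the
# Devanagari letter ka (U+0915).
samples = ["A", "\u00e9", "\u0915"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s) in UTF-8")

# Every code point from 0 to 127 encodes to the identical single byte.
ascii_unchanged = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))

# Code point 128 is the first one that needs two bytes.
first_multibyte = len(chr(128).encode("utf-8"))
print(ascii_unchanged, first_multibyte)
```
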
>
>
>
> But, why do you want to change your *keyboard* input to utf8 anyway?
>
> It's not as if you can do the utf conversion in your head and type the
>
> converted characters in.
>
>
>
> Xetex expects utf-8 input by default, so you could simply 'type' in utf-encoded
>
> characters eg é and it would work. (Easier to use a utf8 enabled text editor
>
> though...). So there is no need for special translation.
>
>
>
> Xetex also produces utf-8 output on the screen by default, at least this is what
>
> I see when there is a problem with accented characters (their utf encodings).
>
> It's just that the command window doesn't translate these utf8 characters into
>
> nice glyphs. And that is the case regardless of the cp 65001 setting.
>
> Also, the switch for a Unicode-enabled command prompt, "cmd /u", doesn't help
>
> with this either.
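
One plausible way to picture what the command window shows is to decode the UTF-8 bytes with a legacy console code page. The Python sketch below assumes code page 850, a common Western European OEM code page; the thread does not state which code page was active, and this does not explain the behaviour under chcp 65001. The variable names are hypothetical.

```python
# The two bytes that a UTF-8 program writes for é.
utf8_output = "\u00e9".encode("utf-8")

# Assumption: the console interprets each byte as one cp850 character,
# so the single é shows up as two unrelated glyphs.
shown = utf8_output.decode("cp850")
print(len(shown), [f"U+{ord(c):04X}" for c in shown])

# The bytes themselves are intact: reversing the misinterpretation
# recovers é, which is why a UTF-8 editor shows the .log file correctly.
recovered = shown.encode("cp850").decode("utf-8")
print(f"U+{ord(recovered):04X}")
```
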
>
>
>
> In all the above cases however (whatever the chcp and whatever /u switch is used)
>
> if I open the tex .log file with a text (utf8) editor, I see the correct
>
> symbols anyway.
>
>>
>> F:\>xetex
>> This is XeTeX, Version 3.1415926-2.2-0.9997.4 (Web2C 2010)
>>  restricted \write18 enabled.
>> **\relax
>> entering extended mode
>>
>>>>
>> ! Emergency stop.
>> <*> \relax
>>
>> No pages of output.
>> Transcript written on texput.log.
>
> As you can see, the é, when entered in code page
> 65001, is interpreted as a Ctrl-z.
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
>  http://tug.org/mailman/listinfo/xetex



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


