[texhax] unicode

pierre.mackay pierre.mackay at comcast.net
Wed Aug 3 19:01:32 CEST 2005


Alexander Grahn wrote:

>Thank you William and Pierre for your answers,
>
>I will give the ucs-package a try. My computer has not been generally unicode
>enabled. Therefore, original strings are ordinary (8 bit?)-ASCII.
>
>I found an example PDF-file which suits my needs. At one position therein
>the string appears as
>
>  /T (somestring)
>
>and at another as
>
>  [(^@s^@o^@m^@e^@s^@t^@r^@i^@n^@g) 10 0 R ...]
>
>  
>
Actually, both of them can be considered Unicode. The difference is that 
one is stored in UTF-8, which for these characters is byte-for-byte the 
same as 7-bit ASCII, and the second is stored as 16-bit wide characters 
(UTF-16, high byte first), whose zero high bytes your editor displays 
as ^@. As we slowly creep toward Unicode compatibility, it is going to 
be important to keep the code, expressed as code points in the Unicode 
Standard, distinct from the various expressions of it in software. Here 
I might make a plug for the latest Open Office Writer, which does a 
splendid job of converting the Macintosh 16-bit + 8-bit convention to 
true UTF-8. (I don't know what Macintosh does about 24-bit pages in the 
standard. It's not a problem I ever expect to have.

Anyway, until you get to characters above octal 177, or decimal 127, 
you really can't say that ASCII is not Unicode. The early idea was that 
everything should be converted to wide-char, at the price of doubling 
the size of all text files, but a glance over source code that one 
would think would be affected by this idea indicates that very few 
developers are willing to take the hit. UTF-8 is really quite 
remarkable as a solution.)
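
For anyone who wants to see the two storage forms side by side, here is 
a minimal sketch in Python. It is only an illustration: the helper name 
caret_notation is made up for this example, and it merely imitates the 
way Vim shows a null byte as ^@.

```python
# The sample string from the quoted PDF fragment.
text = "somestring"

ascii_bytes = text.encode("ascii")      # 7-bit ASCII, one byte per character
utf8_bytes = text.encode("utf-8")       # same bytes for code points 0..127
utf16_bytes = text.encode("utf-16-be")  # two bytes per character, high byte first


def caret_notation(data: bytes) -> str:
    """Render null bytes as ^@ (hypothetical helper, imitating Vim's display)."""
    return "".join("^@" if b == 0 else chr(b) for b in data)


# Above code point 127, UTF-8 no longer coincides with ASCII.
e_acute_utf8 = "\u00e9".encode("utf-8")

print(utf8_bytes == ascii_bytes)            # the two encodings agree here
print(caret_notation(utf16_bytes))          # the wide form as an editor might show it
print(len(utf8_bytes), len(utf16_bytes))    # the wide form is twice as long
print(len(e_acute_utf8))                    # a single accented letter takes two bytes
```

The doubling in the third line of output is the "hit" mentioned above: 
for text that is mostly ASCII, the 16-bit form costs twice the space.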

>if I open the file in the Vim editor.
>
>The first occurrence is obviously plain ASCII and the second one Unicode
>(the PDF-specification 1.6 is saying this).
>
>If, for example, I define a string
>as
>
>  \def\mystring{^@C^@a^@r^@l}
>
>  
>
\def\anotherstring{^^@A^^@B^^@C}
works, but when you set it with ABC\anotherstring DEF
the result seems to strip out the nulls.

A hex dump of the dvi file shows the sequence ABCABCDEF, with the null 
bytes omitted.
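
The same check can be done without reading a hex dump by eye. The 
following is a rough Python sketch, not a DVI parser: both byte strings 
are hypothetical stand-ins (what one might have hoped to find, and the 
sequence I actually saw), and null_offsets is a name invented for the 
example.

```python
def null_offsets(data: bytes) -> list[int]:
    """Return the positions of all null bytes in data (hypothetical helper)."""
    return [i for i, b in enumerate(data) if b == 0]


# Hypothetical: the bytes one might have hoped for, with a null before A, B, C.
expected = b"ABC" + "ABC".encode("utf-16-be") + b"DEF"

# The character sequence reported from the hex dump.
observed = b"ABCABCDEF"

print(null_offsets(expected))   # where the nulls would have been
print(null_offsets(observed))   # empty if the nulls were dropped

# Removing the nulls from the hoped-for bytes gives exactly what was observed.
print(expected.replace(b"\x00", b"") == observed)
```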

Curiouser and curiouser

Pierre


