[tex-live] Re: UTF-8 support

Petr Olsak olsak@math.feld.cvut.cz
Sat, 25 Jan 2003 07:26:52 +0100 (CET)


On Fri, 24 Jan 2003, Vladimir Volovich wrote:

> that is, in my opinion, not a major limitation (one can always use
> braces to delimit macro arguments). also, in encTeX, the "dirty
> tricks" with e.g. \catcode will not always work: when some UCS
> character is defined as a \macro, \catcode will produce an error.
> so compatibility with old documents is not preserved in all cases.
>
> e.g., if i use
> \chardef\ruYa="code
> \mubyte \ruYa ....\endmubyte
> and then the input file contains:
> \catcode`\<ruYa>=11
> then encTeX will fail badly, as it will translate it to
> \catcode`\\ruYa=11
> am i wrong?

If your old document uses \catcode`Ya=11, then it maps the Cyrillic
characters to one of the 256 internal slots in TeX. You can do the same in
encTeX, but you cannot map more than 256 UTF-8 codes to these slots without
information loss. Our old documents did not map more than 256 codes either.
Where is the incompatibility?

Note that my suggestion about \chardef\ruYa="code was inspired by another
of your questions: how to manage more than 256 codes. This is not a problem
of compatibility with old documents.

> you are saying that encTeX is capable of storing a table of
> \warn...byte definitions for 2^31 characters in memory? :)

Yes. The memory needs are optimized.

> as far as i understand (see above), encTeX does not provide full
> compatibility with old macros;

You are wrong. You asked me about another thing (how to manage more
than 256 codes), and from my answer you concluded something about
incompatibility. This is not fair. You can set up the encTeX table so
that 100% backward compatibility is provided.

> more important, this functionality is
> already available in Omega: you can use
>
> \ocp\someOCP=sometranslation
> \InputTranslation currentfile \someOCP

If you need to manage more than 256 UTF-8 codes (on input and on
\write output) and you are using only 8-bit fonts (because 16-bit metrics
are not available, for example), then you need to map some UTF-8 codes
to control sequences and implement their typesetting as macros
(which build a composite, switch temporarily to another font,
etc.).

encTeX provides the mapping of UTF-8 codes to control sequences on
input and the reverse mapping of control sequences to UTF-8 codes on
output. This means that the output of encTeX's input processor is a mix
of bytes and tokens (tokenized as control sequences), and that encTeX's
output processor can convert the selected control sequences in \write
arguments to UTF-8 codes without expansion.
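
The two directions described above can be modelled roughly as follows. This
is only a toy sketch in Python, not encTeX's actual implementation: the
table contents, the control sequence names \ruYa and \euro, and the function
names input_processor and output_processor are all assumptions made for
illustration.

```python
# Toy model (NOT encTeX's real implementation): map multi-byte UTF-8
# sequences to control-sequence tokens on input and back on output.

# Hypothetical table: UTF-8 byte sequence -> control sequence name.
TABLE = {
    b"\xd0\xaf": "\\ruYa",      # U+042F CYRILLIC CAPITAL LETTER YA
    b"\xe2\x82\xac": "\\euro",  # U+20AC EURO SIGN
}
REVERSE = {cs: seq for seq, cs in TABLE.items()}
MAX_LEN = max(len(seq) for seq in TABLE)


def input_processor(data: bytes) -> list:
    """Return a mix of plain bytes (int) and control-sequence tokens (str)."""
    out = []
    i = 0
    while i < len(data):
        # Try the longest registered byte sequence first.
        for n in range(MAX_LEN, 0, -1):
            seq = data[i:i + n]
            cs = TABLE.get(seq)
            if cs is not None:
                out.append(cs)
                i += len(seq)
                break
        else:
            out.append(data[i])  # unmapped byte passes through unchanged
            i += 1
    return out


def output_processor(tokens: list) -> bytes:
    """Write registered control sequences as UTF-8, without expanding them."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)
        elif tok in REVERSE:
            out += REVERSE[tok]
        else:
            out += tok.encode("ascii")  # unregistered names are written as-is
    return bytes(out)
```

In this model the round trip is lossless: bytes that went in as UTF-8 come
out as the same UTF-8, no matter how many sequences the table holds.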

I know too little about Omega, but I know that an OTP is a state automaton
which works only with bytes, does not cooperate with the token
processor, and is not able to output a mix of bytes and tokenized
control sequences. This means that verbatim mode cannot be implemented
in Omega with more than 256 UTF-8 codes and 8-bit internal fonts. An OTP
cannot suppress the expansion of control sequences in \write arguments.
My question is: how can the problem of more than 256 UTF-8 codes with
8-bit fonts be solved in Omega?

Of course, Omega can be inspired by the encTeX idea in its future versions :-).

> so why use yet another non-standard TeX extension instead of existing
> (and more powerful) Omega?

Where is pdfOmega?

>  PO> It is possible but you cannot \write the \'A sequences into such
>  PO> files.  If you are working with more than 256 codes, then it is
>  PO> not possible in purely 8bit TeX but it is possible in encTeX.
>
> you are wrong... LaTeX + UCS packags perfectly provides such
> functionality: you can have UTF-8 encoded input file with cyrillic
> french czech etc characters and writing to files will correctly
> convert all these characters to in variant internal representation
> (\'A, \CYRYA, etc). would you like me to make a sample file?

You are wrong. This concept is not so perfect, because:

1. \write files can be processed by other programs which understand only
   UTF-8 encoding.
2. The \write file can be re-input in verbatim mode in TeX.

I only repeat my arguments because it seems that you have not noticed them.

Best regards

Petr Olsak