[XeTeX] handling malformed UTF-8 input
Jonathan Kew
jonathan_kew at sil.org
Tue Feb 19 19:19:19 CET 2008
Hi Marcin,
On 19 Feb 2008, at 4:27 pm, Marcin Woliński wrote:
> Hi,
>
> I'd like to report a funny problem with (mis)interpretation of malformed
> UTF-8 input files. A few days ago a user of my document classes mwcls
> (e.g., http://www.tug.org/texlive/devsrc/Master/texmf-dist/tex/latex/mwcls/mwart.cls)
> reported being unable to process a document with XeTeX. A quick
> examination revealed that the source of the problem is the following
> comment, which makes XeTeX not see the following line with \fi-s:
>
> \else\ifnum#1<\previous@toc@level
> \addpenalty\@secpenalty % czy to dobra wartość?
> \fi\fi
>
> The file is ISO Latin-2 encoded (that is: comments include a few Latin-2
> characters, the code proper is pure ASCII) and XeTeX tries to interpret
> it as UTF-8. The character ć (cacute) is encoded as byte 1110 0110, so
> XeTeX considers it the start of a 3-byte sequence and ignores the two
> following bytes, the second of which is an endline, so the next line
> gets commented out.
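To make the mechanism concrete, here is a minimal Python sketch of a decoder that trusts the lead byte's length and never checks the bytes that follow. It is only an illustration of the behaviour described above, not XeTeX's actual reading code; the function name naive_utf8_decode and the sample string are made up for this example.

```python
def naive_utf8_decode(data: bytes) -> str:
    """Decode by trusting the lead byte's length; continuation bytes are never checked."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            length, cp = 1, b
        elif b >> 5 == 0b110:
            length, cp = 2, b & 0x1F
        elif b >> 4 == 0b1110:
            length, cp = 3, b & 0x0F
        elif b >> 3 == 0b11110:
            length, cp = 4, b & 0x07
        else:
            length, cp = 1, 0xFFFD  # stray continuation byte or invalid lead byte
        for k in range(1, length):
            if i + k < len(data):
                # The flaw: any byte is accepted here, including '?' and the newline.
                cp = (cp << 6) | (data[i + k] & 0x3F)
        is_scalar = cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
        out.append(chr(cp) if is_scalar else "\ufffd")
        i += length
    return "".join(out)


# The comment line and the following line, encoded as ISO Latin-2.
source = "\\addpenalty\\@secpenalty % czy to dobra wartość?\n\\fi\\fi\n".encode("iso-8859-2")
decoded = naive_utf8_decode(source)

print(source.count(b"\n"), decoded.count("\n"))
```

With this input the byte 0xE6 (ć in Latin-2) is taken as a 3-byte lead, so the "?" and the newline after it are folded into a single bogus character and \fi\fi ends up on the comment line.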
>
> Other instances of this mechanism are illustrated in the attached file
> (try running it with and without \XeTeXinputencoding set).
>
> This of course could be considered a bug in mwart.cls and obviously I'm
> going to correct it there.
Well, it's not really a bug, but it does lead to an unnecessary
incompatibility with programs (like xetex) that try to process it as
UTF-8. And unfortunately there's no real standard for tagging plain-
text files with encoding information; there are various conventions
but none of them are universal. So keeping "code" such as TeX macros
in plain ASCII wherever possible is the safest and most portable
option, I think.
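For anyone who wants to check their own macro files, a small Python sketch along these lines would list the lines that contain non-ASCII bytes. The function name non_ascii_lines and the sample data are hypothetical, chosen only for this illustration.

```python
def non_ascii_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of the lines that contain a byte >= 0x80."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if any(byte >= 0x80 for byte in line)
    ]


# Sample: the three lines from the report, encoded as ISO Latin-2.
sample = (
    "\\else\\ifnum#1<\\previous@toc@level\n"
    "\\addpenalty\\@secpenalty % czy to dobra wartość?\n"
    "\\fi\\fi\n"
).encode("iso-8859-2")

flagged = non_ascii_lines(sample)
print(flagged)
```

Working on raw bytes rather than decoded text means the check does not depend on guessing the file's encoding first.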
>
> I think however, that XeTeX could be more careful when reading malformed
> UTF-8 files. Since continuation bytes in UTF-8 sequences have to be of
> the form 10xxxxxx it would be safer to gobble only such bytes or at
> least not to treat ASCII characters as parts of UTF-8 sequences. That
> way the endline would be always interpreted as an endline and comments
> would always end where they should.
> Is that a change worth introducing?
Yes, you're right; XeTeX is not careful about this, and should be
made more robust. This is something that's been nagging at my mind,
as I know it's a potential problem, so this gives an added incentive
to fix it.
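As a rough idea of what "more robust" could mean, the following Python sketch only consumes bytes of the form 10xxxxxx as continuation bytes and emits U+FFFD for anything malformed. It is an assumption about one reasonable approach, not the change that will go into XeTeX; the name careful_utf8_decode and the choice of U+FFFD as the substitute are mine.

```python
REPLACEMENT = "\ufffd"
# Smallest code point that legitimately needs 1, 2 or 3 continuation bytes.
MIN_CODE_POINT = {1: 0x80, 2: 0x800, 3: 0x10000}


def careful_utf8_decode(data: bytes) -> str:
    """Decode UTF-8, consuming only genuine continuation bytes (10xxxxxx)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        i += 1
        if b < 0x80:
            out.append(chr(b))  # ASCII always stands for itself
            continue
        if b >> 5 == 0b110:
            need, cp = 1, b & 0x1F
        elif b >> 4 == 0b1110:
            need, cp = 2, b & 0x0F
        elif b >> 3 == 0b11110:
            need, cp = 3, b & 0x07
        else:
            out.append(REPLACEMENT)  # stray continuation or invalid lead byte
            continue
        got = 0
        # Stop at the first byte that is not 10xxxxxx and leave it unconsumed.
        while got < need and i < n and data[i] >> 6 == 0b10:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        valid = (
            got == need
            and MIN_CODE_POINT[need] <= cp <= 0x10FFFF  # rejects overlong forms
            and not 0xD800 <= cp <= 0xDFFF              # rejects surrogates
        )
        out.append(chr(cp) if valid else REPLACEMENT)
    return "".join(out)


# The comment line and the following line, encoded as ISO Latin-2.
source = "\\addpenalty\\@secpenalty % czy to dobra wartość?\n\\fi\\fi\n".encode("iso-8859-2")
decoded = careful_utf8_decode(source)
print(decoded.splitlines())
```

Here the Latin-2 bytes still come out as replacement characters, but the "?" and the newline survive, so the comment ends where it should and \fi\fi stays on its own line.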
Thanks,
JK