[XeTeX] handling malformed UTF-8 input
Jonathan Kew
jonathan_kew at sil.org
Thu Feb 21 11:12:50 CET 2008
On 21 Feb 2008, at 8:26 am, Mojca Miklavec wrote:
> On Thu, Feb 21, 2008 at 1:48 AM, Akira Kakuto wrote:
>>> For those who are happy rebuilding xetex from source, I'd appreciate
>>> knowing of any problems with your (real) UTF-8 files after this
>>> patch
>>> is applied. As far as I know, valid files should be processed
>>> unchanged.
>>
>> The new one fails to create ConTeXt format:
>> It stops when it is reading 'lang-cz.pat' with an
>> error message '!Nonletter'. Probably 'lang-cz.pat'
>> is not a utf-8 file.
>
> The content is valid UTF-8, but there are a few latin2 (I guess)
> characters in comments at the beginning of file.
Yes, it looks that way.
>
> That file is autogenerated (comments taken out of some other non-utf
> file). The content/patterns should be OK, but thanks for the warning -
> that can/should be fixed.
Indeed it should. Mixing encodings in a "plain text" file is a no-no:
there's no reliable way for processes to know how to interpret
the bytes they find. You may get away with it in TeX files if the
misinterpreted garbage happens to follow a '%' byte, but that doesn't
make it acceptable. Suppose someone tries to print a verbatim listing
of the file...
(Try opening lang-cz.pat in a text editor. Either it'll be read as
UTF-8, giving you garbage in the "samples", or as some other
encoding, in which case the patterns themselves will appear as
garbage. Sorry to sound harsh, but the file is fundamentally broken.
That needs to be fixed in ConTeXt, not worked around in XeTeX.)
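
For anyone wanting to locate such stray bytes before feeding a file to
TeX, a rough Python sketch along these lines might help. It is only an
illustration, not anything XeTeX itself does; the function name and the
sample bytes are made up for the example (0xE8 is c-caron in ISO 8859-2,
but is not a complete sequence in UTF-8 when followed by an ASCII letter).

```python
def find_invalid_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: a comment line containing one Latin-2 byte,
# followed by a pattern-like line in valid UTF-8.
sample = b"% \xe8esky\n" + ".a\u010d3".encode("utf-8") + b"\n"
print(find_invalid_utf8_lines(sample))  # -> [1]
```

Reporting line numbers rather than just pass/fail makes it easier to see
whether the damage is confined to comments, as it appears to be here.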
>
> (Ulrike's
(I think that was Ross, actually.)
> suggestion to recognise end of lines seems OK to me as that
> would tolerate problems in comments, while rest would be intact, but
> in any case: garbage in->garbage out.
No, I can't agree with this. If the file is broken w.r.t. encoding,
we should either do a one-time switch to "raw bytes" mode, so as to
try and continue processing in a default "simplified" mode (which may
or may not lead to subsequent errors, of course), or stop immediately.
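
To make the two options concrete, here is a minimal Python sketch of
the policy, under some assumptions of mine: "raw bytes" mode is modelled
by decoding as Latin-1 (each byte becomes one character code 0-255), and
the function name and `strict` flag are hypothetical. It is not the
actual XeTeX code.

```python
def decode_lines(data: bytes, strict: bool = False) -> list[str]:
    """Decode lines as UTF-8 until the first invalid sequence.

    After that, either stop with an error (strict=True) or switch
    once, permanently, to a byte-per-character fallback.
    """
    lines = []
    raw_mode = False
    for number, line in enumerate(data.split(b"\n"), start=1):
        if not raw_mode:
            try:
                lines.append(line.decode("utf-8"))
                continue
            except UnicodeDecodeError:
                if strict:
                    raise ValueError(
                        f"invalid UTF-8 byte sequence on line {number}"
                    ) from None
                raw_mode = True  # one-time switch; no return to UTF-8
        # Assumption: Latin-1 stands in for "raw bytes" mode here.
        lines.append(line.decode("latin-1"))
    return lines


# Hypothetical input: a broken comment line, then a valid UTF-8 line.
data = b"% \xe8\n" + "\u010d".encode("utf-8") + b"\n"
for text in decode_lines(data):
    print(repr(text))
```

Note that in the fallback case the second line, although valid UTF-8 on
its own, is still read byte by byte and comes out as two characters
rather than one. That is the "may or may not lead to subsequent errors"
part, and it is the price of not resynchronising at each line end.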
> Stopping processing the file
> with an error if it's not a valid UTF-8 would be just as OK to me,
> even though it might sound a bit radical.
I wondered about making it an error rather than a warning; maybe that
would be better.
> There are *lots* of warnings
> in TeX files, and one can easily miss that one and miss the fact that
> there are some broken characters somewhere in the last pages of a book
> because of some problematic comments in the middle.)
What do others think about this -- should "invalid UTF-8 byte
sequence" be an error rather than a warning and fallback?
JK