[XeTeX] handling malformed UTF-8 input
Ross Moore
ross at ics.mq.edu.au
Sat Feb 23 06:13:04 CET 2008
Hi Mike,
On 23/02/2008, at 1:09 PM, Mike Maxwell wrote:
> Ross Moore wrote:
>> If there was to be malformed data in the name field,
>> this should *not* cause correctly formed UTF8 data in the
>> subsequent address field to be displayed in a "bytes" mode.
>
> Can you reliably recover from such an error in UTF-8 data? That is,
> assume that there is a mal-formed byte where you're expecting the
> first byte of a UTF-8 character. How do you know where the next (and
> possibly correct, possibly incorrect) UTF-8 character should begin?
Absolutely; UTF8 is designed as follows.
Any byte starting:
  with 0 is an ASCII 7-bit character;
  with 10 is a data-byte for a 2+ byte sequence;
  with 11...10 is a header byte for a 2+ byte
     sequence, where the number of consecutive 1s
     tells how many bytes are involved.
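
As a rough illustration of the three cases above, here is a small Python
sketch. The function name classify_byte is just a hypothetical helper,
and it looks only at the leading bits; it does not check the extra
restrictions in the current UTF-8 definition (overlong forms, surrogates,
values beyond U+10FFFF).

```python
def classify_byte(b: int) -> str:
    """Classify a single byte (0..255) by its leading bits only."""
    if b & 0x80 == 0x00:      # 0xxxxxxx
        return "ascii"
    if b & 0xC0 == 0x80:      # 10xxxxxx
        return "continuation"
    if b & 0xE0 == 0xC0:      # 110xxxxx: header of a 2-byte sequence
        return "lead2"
    if b & 0xF0 == 0xE0:      # 1110xxxx: header of a 3-byte sequence
        return "lead3"
    if b & 0xF8 == 0xF0:      # 11110xxx: header of a 4-byte sequence
        return "lead4"
    return "invalid"          # 11111xxx never starts a sequence here


# The two bytes of U+00E9 (e-acute) are a header and one data-byte.
kinds = [classify_byte(b) for b in b"\xc3\xa9"]
```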
Thus it is possible to tell where there is an error,
and which bytes are involved in that error.
More specifically, a character byte-sequence *must*
start with either 0.... (1-byte) or 11.... (2+ bytes).
If the latter, the number of trailing bytes is known,
and each must start with 10.... .
If this does not happen, then you know that there is
an error, and you can tell where a valid sequence
might (!) restart.
(In fact the proposal was to restart UTF8 after the
next line-end character, either U+000A or U+000D.
In reality, it could restart at the next ASCII character
or try to restart at any 11...10.. byte.)
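
To make the line-end restart concrete, here is a minimal Python sketch
of the idea, not of anything XeTeX actually does. The name
decode_per_line is hypothetical, and showing a bad line with \xNN
escapes is only my stand-in for a "bytes" mode.

```python
from typing import List


def decode_per_line(data: bytes) -> List[str]:
    """Decode each line as UTF-8; fall back to escaped bytes per line."""
    out = []
    # For bytes, splitlines() breaks at \n, \r and \r\n only.
    for line in data.splitlines(keepends=True):
        try:
            out.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # "Bytes" mode for this line only: every non-ASCII byte
            # is shown as a \xNN escape.
            out.append(line.decode("ascii", errors="backslashreplace"))
    return out


# First line is malformed, second line is valid UTF-8.
lines = decode_per_line(b"na\xefve\ncaf\xc3\xa9\n")
```

The malformed first line does not affect how the second line is
decoded, which is the behaviour I was asking for with the name and
address fields.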
Thus all bytes that don't fit validly into a UTF8
sequence can be identified and marked as being bad.
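
A minimal sketch of such marking, again in Python with hypothetical
names (expected_length, find_bad_bytes). It assumes the simplest
recovery rule: when a sequence fails, mark its first byte as bad and
try again at the very next byte. As before, only the bit patterns are
checked, not overlong forms or surrogates.

```python
from typing import List


def expected_length(b: int) -> int:
    """Sequence length announced by a first byte, or 0 if it cannot start one."""
    if b < 0x80:
        return 1
    if b & 0xE0 == 0xC0:
        return 2
    if b & 0xF0 == 0xE0:
        return 3
    if b & 0xF8 == 0xF0:
        return 4
    return 0  # data-byte (10xxxxxx) or 11111xxx


def find_bad_bytes(data: bytes) -> List[int]:
    """Return the offsets of bytes that do not fit into a valid sequence."""
    bad = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        tail = data[i + 1 : i + n]
        # Valid if all n-1 trailing bytes are present and start with 10.
        if n and len(tail) == n - 1 and all(t & 0xC0 == 0x80 for t in tail):
            i += n
        else:
            bad.append(i)  # mark this byte, then retry at the next one
            i += 1
    return bad


# 0xE9 announces a 3-byte sequence but is followed by ASCII "b".
offsets = find_bad_bytes(b"a\xe9b\xc3\xa9")
```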
Of course if the encoding was not intended to be UTF8
then there could be lots of bytes marked as being bad.
So an expert needs to try to identify what encoding
was intended, by trying out different ones to see what
gives a sensible character string for some language.
The context in which the data was obtained should be
sufficient to allow this, in any practical situation.
Feedback from whoever provided that data helps
in being confident that the correct interpretation
has been obtained.
This is not mathematical certainty --- but it should
not need to be.
> --
> Mike Maxwell
> What good is a universe without somebody around to look at it?
> --Robert Dicke, Princeton physicist
Hope this helps,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 +2 9850 8955
Sydney, Australia 2109 fax: +61 +2 9850 8114
------------------------------------------------------------------------