[XeTeX] search arabic text in pdf using adobe reader 7.0
François Charette
firmicus at ankabut.net
Wed Feb 6 15:09:47 CET 2008
Jonathan Kew wrote:
> On 6 Feb 2008, at 9:00 am, François Charette wrote:
>
>> This seems to be an issue (not only for copying but also for
>> searching) with the font Scheherazade, which also occurs when it is
>> typeset with plain xetex (and so is not related to your operating
>> system or your PDF viewer). In fact, only *isolated* characters can
>> be correctly copied or searched, the other characters come out, as
>> you say, as "garbage" (actually as characters with code-points
>> above U+100000, in the so-called "Supplementary Private Use Area B"
>> of Unicode). I suppose Jonathan should be able to tell us more
>> about this...
>>
>
> It's actually an issue with xdvipdfmx, I think. I have just (two
> minutes ago) fixed a bug that prevented the proper ToUnicode mappings
> being generated for unencoded glyphs (such as contextual forms).
>
> The Linotype font worked differently because (I assume) it encodes
> all the contextual forms in the Arabic Presentation Forms blocks, and
> then Adobe Reader probably "knows" to map these back to the Basic
> Arabic letters. But that whole approach is flawed, as not all
> characters have a full set of Presentation Form codepoints; this is
> even more obvious in the case of complex calligraphic fonts with many
> variants. So relying on the glyphs having direct Unicode mappings in
> the 'cmap' is inherently inadequate.
>
> xdvipdfmx tries to deal with this by generating additional ToUnicode
> mappings from the glyph names, wherever possible, but there was a bug
> in that code. It should work better now.
>
>
That makes perfect sense. Thanks for that informative report and for the
bugfix in xdvipdfmx! I'll compile the new version from svn later on.
I'll let you know if I encounter further problems.
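
To make the glyph-name idea concrete, here is a minimal sketch of how a Unicode value can be derived from a glyph name. It follows the "uniXXXX" / "uXXXX" naming convention from Adobe's glyph naming guidelines and is not xdvipdfmx's actual code. The function name and the sample glyph names (such as "uni0628.init") are hypothetical; I have not checked which names Scheherazade really uses, and ordinary AGL names like "space" are deliberately not handled.

```python
import re

# "uni" followed by one or more groups of four uppercase hex digits
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)$")
# "u" followed by four to six uppercase hex digits
_U = re.compile(r"u([0-9A-F]{4,6})$")


def _valid(cp):
    """True for a Unicode scalar value (in range and not a surrogate)."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def glyph_name_to_text(name):
    """Return the text a glyph name stands for, or None if underivable.

    Hypothetical helper: only the uniXXXX and uXXXX forms are handled.
    """
    base = name.split(".", 1)[0]      # drop a suffix such as ".init"
    if not base:
        return None                   # e.g. ".notdef"
    chars = []
    for part in base.split("_"):      # "_" joins ligature components
        m = _UNI.match(part)
        if m:
            digits = m.group(1)
            cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        else:
            m = _U.match(part)
            if not m:
                return None           # name carries no usable code point
            cps = [int(m.group(1), 16)]
        if not all(_valid(cp) for cp in cps):
            return None
        chars.extend(chr(cp) for cp in cps)
    return "".join(chars)
```

With names of this kind, an initial form "uni0628.init" maps back to plain U+0628 even though the glyph itself has no entry in the 'cmap', which is exactly the case that contextual forms need.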
> Another issue, though, is directionality (and character reordering,
> in the case of Indic scripts); I doubt this is handled properly yet.
> In principle, I think the only robust solution would be the use of
> the ActualText feature in PDF, but that is not yet supported.
>
I guess this is probably not handled correctly now. I had never heard of
the ActualText feature, but I just consulted §10.8.3 of the PDF
Reference v1.7. It is still not entirely clear to me how that relates to
directionality... Perhaps together with /ReversedChars? Well, I
obviously know too little about PDF internals :)
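
For what it is worth, here is a rough sketch of what ActualText looks like in a content stream, as far as I understand §10.8.3: the glyph-showing operators are wrapped in a /Span marked-content sequence whose ActualText entry holds the replacement text. The connection to directionality would then be that the glyphs can stay in visual order while ActualText carries the characters in logical order. The helper name and the glyph ID in the usage example are made up, and whether a given viewer honours ActualText when searching or copying is something I have not verified.

```python
def actual_text_span(text, show_ops):
    """Wrap content-stream operators in a /Span carrying ActualText.

    Hypothetical helper. `text` is the replacement text in logical order;
    `show_ops` is the already-built operator string that paints the glyphs.
    """
    # PDF text strings may be UTF-16BE with a leading byte-order mark,
    # written here as a hexadecimal string between angle brackets.
    payload = b"\xfe\xff" + text.encode("utf-16-be")
    hex_string = payload.hex().upper()
    return f"/Span << /ActualText <{hex_string}> >> BDC\n{show_ops}\nEMC"


# Usage: one glyph (ID 0x0012 is invented) standing for U+0628
print(actual_text_span("\u0628", "<0012> Tj"))
```

The same wrapper could in principle cover a whole word or line, so that reordered or contextually shaped glyphs map back to the original character sequence in a single step.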
FC