[XeTeX] search arabic text in pdf using adobe reader 7.0
Jonathan Kew
jonathan_kew at sil.org
Wed Feb 6 10:54:48 CET 2008
On 6 Feb 2008, at 9:00 am, François Charette wrote:
>
> This seems to be an issue (not only for copying but also for
> searching) with the font Scheherazade, which also occurs when it is
> typeset with plain xetex (and so is not related to your operating
> system or your PDF viewer). In fact, only *isolated* characters can
> be correctly copied or searched, the other characters come out, as
> you say, as "garbage" (actually as characters with code-points
> above U+100000, in the so-called "Supplementary Private Use Area B"
> of Unicode). I suppose Jonathan should be able to tell us more
> about this...
It's actually an issue with xdvipdfmx, I think. I have just (two
minutes ago) fixed a bug that prevented the proper ToUnicode mappings
from being generated for unencoded glyphs (such as contextual forms).
The Linotype font worked differently because (I assume) it encodes
all the contextual forms in the Arabic Presentation Forms blocks, and
then Adobe Reader probably "knows" to map these back to the Basic
Arabic letters. But that whole approach is flawed, as not all
characters have a full set of Presentation Form codepoints; this is
even more obvious in the case of complex calligraphic fonts with many
variants. So relying on the glyphs having direct Unicode mappings in
the 'cmap' is inherently inadequate.
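To illustrate the kind of mapping back to the basic letters that is meant here, a minimal Python sketch follows. It relies on the compatibility decompositions that Unicode defines for the Arabic Presentation Forms blocks; whether Adobe Reader does anything like this internally is only an assumption, and the function name fold_presentation_forms is made up for the example.

```python
import unicodedata

def fold_presentation_forms(text):
    """Map Arabic Presentation Form characters back to basic letters.

    Hypothetical helper: only characters in Arabic Presentation Forms-A
    (U+FB50..U+FDFF) and Forms-B (U+FE70..U+FEFF) are touched, using their
    Unicode compatibility decompositions via NFKC.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

# BEH in initial, medial and final form all fold to U+0628.
print([hex(ord(c)) for c in fold_presentation_forms("\uFE91\uFE92\uFE90")])
```

A glyph that has no Presentation Form code point at all never reaches such a function, which is exactly the gap described above.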
xdvipdfmx tries to deal with this by generating additional ToUnicode
mappings from the glyph names, wherever possible, but there was a bug
in that code. It should work better now.
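As a rough illustration of the glyph-name idea (not the actual xdvipdfmx code, which is not shown here), the Python sketch below handles only the "uniXXXX" and "uXXXXXX" naming conventions, drops any suffix after the first ".", and treats "_" as a ligature separator. Names such as "beh.medi" would additionally need a glyph-list lookup, which is left out. The function names are hypothetical.

```python
import re

_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")   # uni0628, uni06440627
_U = re.compile(r"^u([0-9A-F]{4,6})$")          # u0628, u1EE00

def glyph_name_to_text(name):
    """Return the Unicode text implied by a glyph name, or None."""
    base = name.split(".", 1)[0]          # "uni0628.init" -> "uni0628"
    chars = []
    for part in base.split("_"):          # ligature components
        m = _UNI.match(part)
        if m:
            digits = m.group(1)
            for i in range(0, len(digits), 4):
                chars.append(chr(int(digits[i:i + 4], 16)))
            continue
        m = _U.match(part)
        if m and int(m.group(1), 16) <= 0x10FFFF:
            chars.append(chr(int(m.group(1), 16)))
            continue
        return None                       # name not covered by this sketch
    return "".join(chars) if chars else None

def bfchar_entry(glyph_id, text):
    """Format one ToUnicode CMap bfchar line: <glyph id> <UTF-16BE text>."""
    return "<%04X> <%s>" % (glyph_id, text.encode("utf-16-be").hex().upper())

print(bfchar_entry(0x0123, glyph_name_to_text("uni0628.init")))
```

The glyph ID 0x0123 is arbitrary; in a real font it would come from the font's glyph order.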
Another issue, though, is directionality (and character reordering,
in the case of Indic scripts); I doubt this is handled properly yet.
In principle, I think the only robust solution would be the use of
the ActualText feature in PDF, but that is not yet supported.
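For reference, here is a small Python sketch of what an ActualText wrapper looks like at the content-stream level: a marked-content sequence tagged /Span whose property list carries the replacement text as a UTF-16BE hex string with a leading byte-order mark. This only shows the PDF syntax; it is not something xdvipdfmx emits, and the function name and the glyph code in the usage line are invented for the example.

```python
def actualtext_span(text, show_ops):
    """Wrap text-showing operators in a /Span with an ActualText entry.

    text     -- the logical-order Unicode string a reader should extract
    show_ops -- content-stream operators that paint the glyphs (assumed
                to be already valid PDF syntax)
    """
    # PDF text strings in UTF-16BE start with the byte-order mark FE FF.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return "/Span << /ActualText <%s> >> BDC\n%s\nEMC" % (hex_text, show_ops)

print(actualtext_span("\u0628", "<0123> Tj"))
```

Because the ActualText string is stored in logical order, it sidesteps both the contextual-form and the directionality problems for any viewer that honours it.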
JK