[XeTeX] search arabic text in pdf using adobe reader 7.0

Wed Feb 6 10:54:48 CET 2008

On 6 Feb 2008, at 9:00 am, François Charette wrote:
>
> This seems to be an issue (not only for copying but also for  
> searching) with the font Scheherazade, which also occurs when it is  
> typeset with plain xetex (and so is not related to your operating  
> system or your PDF viewer). In fact, only *isolated* characters can
> be correctly copied or searched; the other characters come out, as
> you say, as "garbage" (actually as characters with code-points
> above U+100000, in the so-called "Supplementary Private Use Area-B"
> of Unicode). I suppose Jonathan should be able to tell us more
> about this...

It's actually an issue with xdvipdfmx, I think. I have just (two
minutes ago) fixed a bug that prevented the proper ToUnicode mappings
from being generated for unencoded glyphs (such as contextual forms).

The Linotype font worked differently because (I assume) it encodes  
all the contextual forms in the Arabic Presentation Forms blocks, and  
then Adobe Reader probably "knows" to map these back to the Basic  
Arabic letters. But that whole approach is flawed, as not all  
characters have a full set of Presentation Form codepoints; this is  
even more obvious in the case of complex calligraphic fonts with many  
variants. So relying on the glyphs having direct Unicode mappings in  
the 'cmap' is inherently inadequate.
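
To illustrate the kind of mapping involved, here is a minimal Python
sketch that folds Presentation Form codepoints back to the basic
letters using Unicode compatibility normalization (NFKC). This is only
an assumption about what such a mapping could look like, not a
description of what Adobe Reader does internally; the function name
to_basic_letters is made up for the example.

```python
import unicodedata


def to_basic_letters(text: str) -> str:
    """Fold Arabic Presentation Forms to basic letters via NFKC.

    Hypothetical helper for illustration only; it relies on the
    compatibility decompositions defined in the Unicode Character Database.
    """
    return unicodedata.normalize("NFKC", text)


# U+FE91 ARABIC LETTER BEH INITIAL FORM folds to U+0628 ARABIC LETTER BEH.
initial_beh = "\uFE91"
print(f"U+{ord(to_basic_letters(initial_beh)):04X}")
```

This only helps for glyphs that have a Presentation Form codepoint in
the first place, which is exactly the limitation described above.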

xdvipdfmx tries to deal with this by generating additional ToUnicode  
mappings from the glyph names, wherever possible, but there was a bug  
in that code. It should work better now.
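
As a rough sketch of the idea (not the actual xdvipdfmx code, which is
written in C and handles more cases), the following Python function
derives a Unicode string from a glyph name using the "uniXXXX" and
"uXXXX" naming conventions: the suffix after the first period is
dropped, and underscore-separated components are treated as a
ligature. The function name is hypothetical, and names that would need
a glyph list lookup (such as "beh") are deliberately left unmapped.

```python
import re
from typing import Optional

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
# "u" followed by four to six uppercase hex digits.
_U = re.compile(r"^u([0-9A-F]{4,6})$")


def glyph_name_to_unicode(name: str) -> Optional[str]:
    """Return the Unicode string implied by a glyph name, or None.

    Hypothetical helper: only the uniXXXX / uXXXX conventions are handled.
    """
    base = name.split(".", 1)[0]  # drop a suffix such as ".init" or ".medi"
    if not base:
        return None  # e.g. ".notdef"
    chars = []
    for component in base.split("_"):  # ligature components
        m = _UNI.match(component)
        if m:
            digits = m.group(1)
            chars.extend(
                chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
            )
            continue
        m = _U.match(component)
        if m and int(m.group(1), 16) <= 0x10FFFF:
            chars.append(chr(int(m.group(1), 16)))
            continue
        return None  # unknown component: no mapping can be derived
    return "".join(chars)


# An unencoded contextual form still maps back to the basic letter.
print(glyph_name_to_unicode("uni0628.init") == "\u0628")
```

A glyph whose name follows neither convention gets no mapping from
this sketch, so it would still come out as "garbage" when copied.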

Another issue, though, is directionality (and character reordering,  
in the case of Indic scripts); I doubt this is handled properly yet.  
In principle, I think the only robust solution would be the use of  
the ActualText feature in PDF, but that is not yet supported.
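
For reference, here is a minimal Python sketch of what an ActualText
wrapper could look like in a PDF content stream: a /Span
marked-content sequence whose property dictionary carries the
replacement text as a UTF-16BE hex string with a byte order mark. The
function name and the glyph ID in the usage line are hypothetical, and
this is not something xdvipdfmx emits as far as this message states.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap text-showing operators in a /Span carrying /ActualText.

    Hypothetical helper: `text` is the logical-order Unicode string a
    viewer should use for copy and search; `show_ops` is the already
    generated operator sequence that draws the glyphs.
    """
    # PDF text strings may be UTF-16BE with a leading U+FEFF byte order mark.
    hex_utf16 = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_utf16}> >> BDC\n{show_ops}\nEMC"


# Hypothetical glyph ID 0012 standing in for an initial-form beh.
print(actual_text_span("\u0628", "<0012> Tj"))
```

Because the replacement text is stored in logical order, this approach
sidesteps both the glyph-to-character mapping and the reordering
problem for the wrapped run.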

JK