[XeTeX] New feature REQUEST for xetex
Jonathan Kew
jfkthame at gmail.com
Tue Feb 23 11:52:22 CET 2016
On 23/2/16 10:37, Zdenek Wagner wrote:
> Hi Jonathan,
>
> how do you put the ActualText to PDF? Is it per syllable, or per word?
Per word.
> We have commercial OCR software that can convert scanned PDFs to pages
> with selectable text. I have not examined it thoroughly, but it seems to
> me that it analyzes the scanned image, splits it into subimages "per word"
> and attaches ActualText to each word. That way it is impossible to
> select just a group of characters; the smallest entity that can be
> copied & pasted (or searched for) is a word. It might fix the
> highlighting problem, but I am just guessing.
I don't think so. Even single-syllable words like भी don't highlight
well in the example.
(FWIW, it is possible to search for a substring within a word, and
Acrobat finds it OK, but it can't accurately highlight what's been
found; you get the same (inaccurate) highlighting of the word regardless
of what substring within it was searched.)
Setting ActualText per syllable would make finer-grained copy/paste
possible (currently, entire words are always copied), but would be
significantly more complex to implement (as well as adding to the PDF
file bloat). I think the per-word version should be a useful start, at
least.
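For readers unfamiliar with the mechanism, here is a rough Python sketch of what per-word tagging amounts to inside a PDF content stream: the operators that draw one word are wrapped in a /Span marked-content sequence whose ActualText property holds the word as a UTF-16BE hex string with a leading byte-order mark. This is only an illustration of the PDF construct, not the code XeTeX or its PDF driver uses; the function names and the glyph IDs are made up for the example.

```python
def actualtext_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the BOM FEFF."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def wrap_word(text: str, glyph_ops: str) -> str:
    """Wrap the glyph-showing operators for one word in a /Span sequence."""
    return f"/Span << /ActualText {actualtext_hex(text)} >> BDC\n{glyph_ops}\nEMC"


if __name__ == "__main__":
    # The glyph IDs below are hypothetical; real values depend on the font.
    print(wrap_word("भी", "<01A2035F> Tj"))
```

Because each span covers a whole word, a viewer that honours ActualText can only hand back the whole word on copy; per-syllable tagging would mean emitting one such span per syllable, which requires segmenting the shaped glyph run first.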
>
>
> Zdeněk Wagner
> http://ttsm.icpf.cas.cz/team/wagner.shtml
> http://icebearsoft.euweb.cz
>
> 2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkthame at gmail.com>:
>
> On 23/2/16 02:54, Andrew Cunningham wrote:
>
> It would probably more than double. I was under the impression that
> ActualText was a tag attribute, so extensive tagging would be needed,
> and actual text added to the tags.
>
>
> The ActualText tagging is highly compressible, so in practice the
> increase in overall PDF size is not all that great.
>
>
> But the question is how to practically make use of ActualText if there
> is a visible text layer.
>
> PDF/UA, for instance, leaves the question deliberately ambiguous.
> ActualText is the way to make the content accessible, but developers
> creating tools for PDF do not actually have to process the ActualText.
>
> So to index and search PDF files you need to build a discovery system
> utilising tools that allow you to specify the use of ActualText in
> preference to a visible text layer.
>
>
> Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
> results in the correct Unicode text (more or less), and Find behaves
> as expected.
>
> Other PDF readers (such as Apple's Preview) may well ignore the
> ActualText tagging, in which case it doesn't help. I don't know
> whether tools like Evince or Okular handle it....
>
>
> I'm attaching two sample PDFs with a simple chunk of Hindi text
> (from the Unicode web site). The first, dev-old.pdf, is what XeTeX
> currently generates (using the "Annapurna SIL" OpenType font). In
> general, Copy/Paste and text search don't work very well -- a few
> characters may be OK, but others are junk.
>
> The second sample, dev-actualtext.pdf, was generated with an
> experimental new \XeTeXgenerateactualtext feature, which
> automatically "tags" each word with an ActualText representation.
>
> Some points to note:
>
> - The file size is 24662 bytes, while dev-old was 22875 bytes. Not
> too bad. Of course, a lot of that is the embedded font data; with
> longer documents that have lots of text but only a few fonts, the
> difference would presumably be somewhat greater.
>
> - Copy/Paste and Search work pretty well in Acrobat Reader. Not in
> Preview.app.
>
> - Highlighting of selected text (in Acrobat Reader) is somewhat
> broken, apparently due to the ActualText tagging (it looks better in
> dev-old). This may be fixable by tweaking exactly how the tagging is
> written into the PDF; I haven't investigated it further.
>
>
> No guarantees at this point as to whether/when this feature will
> actually be available. It was just a quick attempt to hack something
> up, to see how promising the results might be...
>
> JK
>
>
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
> http://tug.org/mailman/listinfo/xetex