search accents in pdf generated by TeX

William F Hammond hmwlfsr at yahoo.com
Sat Jan 29 07:20:03 CET 2022


Ulrike Fischer <news3 at nililand.de> writes:

> Am Thu, 27 Jan 2022 21:55:32 -0800 schrieb William F Hammond via
> texhax:
>
>> First, I don't know what the statement "accented letters are
>> not recognized by the pdf" means.  If we're talking about
>> typesetting with pdftex, then I think that the PDF output is
>> UTF-8 encoded. 
>
> No.

I do understand that a PDF file is not a text file if that
is why you are saying "no".  But ...

>> If one runs the program "pdftotext", which
>> is part of an Ubuntu package called poppler-utils on my
>> Ubuntu platform, the output text is UTF-8 encoded.  I think
>> that text TeX's algorithmic accents are implemented using
                   ^^^^^^^^^^^
>> Unicode combining characters. 
>
> No, not with pdftex. If you compile
>
> \documentclass{article}
>
> \begin{document}
> ä ö ü é è
  ^^^^^^^^^

When I said that I was using TeX's *algorithmic* accents,
I meant  \"a \"o \"u \'e \`e

But, anyway, the original question was about
plain TeX, not LaTeX, and I was trying to address
that.
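
For concreteness, here is a minimal sketch of the kind of plain TeX
input I have in mind.  The file name accents.tex is only a
hypothetical example, and the comments describe the standard
plain.tex behavior as I understand it.

```tex
% accents.tex (hypothetical file name); plain TeX, not LaTeX.
% Compile with:  pdftex accents.tex
% Each control sequence below is one of plain TeX's accent macros,
% which place an accent glyph over the letter that follows by way
% of the \accent primitive.
\"a \"o \"u \'e \`e
\bye
```

Running pdftotext on the resulting PDF is one way to see what text
a given setup actually extracts.  The result may depend on the fonts
and the versions involved.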


> \end{document}
>
> and then copy and paste you will get
>
>     ¨a ¨o ¨u ´e `e
>
> that is 
>    
>     U+00A8a U+00A8o U+00A8u U+00B4e U+0060e
>
> (U+00A8 is for example diaeresis).

I have not been able to duplicate what you say.

Perhaps you have a newer version of pdftex.  Mine
is from TeXLive 2017.  But I doubt if that is the
explanation.  Perhaps there are differences in
"locale" arrangements.

> if you add \usepackage[T1]{fontenc} and so use a font
> which has the needed glyphs then you get the correct
> unicode code points and a searchable pdf
>
>      ä ö ü é è

With what input encoding for your LaTeX source?  Are you
saying that you have an arrangement for pdflatex to read
Unicode up to U+00FF when text-encoded as UTF-8 (as in your
last email)?
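
To make the question concrete, here is a sketch of the kind of
arrangement I am asking about.  The inputenc line is my assumption
and is not taken from your message.  Newer LaTeX releases may make
it unnecessary because they read UTF-8 by default.

```tex
\documentclass{article}
% Assumption: declare explicitly that the source file is UTF-8.
\usepackage[utf8]{inputenc}
% From the quoted suggestion: a font encoding that contains the
% accented letters as single glyphs.
\usepackage[T1]{fontenc}
\begin{document}
ä ö ü é è
\end{document}
```

Whether copy and paste then gives the expected characters
presumably also depends on which fonts are installed and on the PDF
viewer.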

                              -- Bill



