search accents in pdf generated by TeX
William F Hammond
hmwlfsr at yahoo.com
Sat Jan 29 07:20:03 CET 2022
Ulrike Fischer <news3 at nililand.de> writes:
> Am Thu, 27 Jan 2022 21:55:32 -0800 schrieb William F Hammond via
> texhax:
>
>> First, I don't know what the statement "accented letters are
>> not recognized by the pdf" means. If we're talking about
>> typesetting with pdftex, then I think that the PDF output is
>> UTF-8 encoded.
>
> No.
I do understand that a PDF file is not a text file, if that
is why you are saying "no". But ...
>> If one runs the program "pdftotext", which
>> is part of an Ubuntu package called poppler-utils on my
>> Ubuntu platform, the output text is UTF-8 encoded. I think
>> that text TeX's algorithmic accents are implemented using
                   ^^^^^^^^^^^
>> Unicode combining characters.
>
> No, not with pdftex. If you compile
>
> \documentclass{article}
>
> \begin{document}
> ä ö ü é è
  ^^^^^^^^^
When I referred to TeX's *algorithmic* accents,
I meant input such as \"a \"o \"u \'e \`e
In any case, the original question was about
plain TeX, not LaTeX, and that is what I was
trying to address.
> \end{document}
>
> and then copy and paste you will get
>
> ¨a ¨o ¨u ´e `e
>
> that is
>
> U+00A8a U+00A8o U+00A8u U+00B4e U+0060e
>
> (U+00A8, for example, is the diaeresis).
I have not been able to reproduce what you describe.
Perhaps you have a newer version of pdftex; mine
is from TeX Live 2017. But I doubt that is the
explanation. Perhaps there are differences in
"locale" settings.
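For anyone who wants to repeat the comparison, a minimal plain TeX
test of the algorithmic accents could look like the sketch below.
The file name test.tex is only an assumption for illustration, and
the extracted text may differ between pdftex and poppler versions,
so no expected output is given.

```tex
% test.tex (hypothetical file name): plain TeX, default Computer Modern fonts.
% Each accent below is built by TeX's \accent primitive from two glyphs,
% not taken from a single precomposed glyph in the font.
\"a \"o \"u \'e \`e
\bye
% To inspect the result (commands shown as comments):
%   pdftex test.tex
%   pdftotext test.pdf -     % "-" sends the extracted text to stdout
```

Comparing the extracted bytes against U+00A8, U+00B4, and U+0060 followed
by the base letter would show whether the accents come out as separate
spacing characters, as described in the quoted message.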
> If you add \usepackage[T1]{fontenc}, and so use a font
> which has the needed glyphs, then you get the correct
> Unicode code points and a searchable PDF:
>
> ä ö ü é è
With what input encoding for your LaTeX source? Are you
saying that you have an arrangement for pdflatex to read
Unicode up to U+00FF when the source is encoded as UTF-8
(as in your last email)?
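To make the question concrete, the arrangement being asked about would
presumably resemble the following sketch. Loading inputenc with the
utf8 option is an assumption here, since the quoted message does not
say how the source encoding is declared. LaTeX releases from 2018
onward treat UTF-8 as the default input encoding, so the inputenc line
matters mainly for older installations such as TeX Live 2017.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc} % assumption: the source file is saved as UTF-8
\usepackage[T1]{fontenc}    % from the quoted message: font with precomposed accented glyphs
\begin{document}
ä ö ü é è
\end{document}
```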
-- Bill