XeLaTeX to Word/OpenOffice - the state of the art?
BPJ
bpj at melroch.se
Fri Mar 15 13:34:48 CET 2019
On 2019-03-15 at 08:31, Janusz S. Bień wrote:
> On Fri, Mar 15 2019 at 7:19 +01, BPJ wrote:
>> I use, despite myself, Google Docs to convert PDF to DOCX,
>
> How???
>
>> then Pandoc from DOCX to everything else. It works even with weird
>> magazine layouts.
>
> Best regards
>
> Janusz
>
This may be old news to some, but I can’t remember having seen it,
so I am posting it for the record.
I just discovered that you can convert a PDF to Markdown (or any
other format Pandoc supports) by uploading it to Google Drive,
opening it in Google Docs and downloading it from there as DOCX,
then converting the DOCX to Markdown with Pandoc. The result is
quite good!
The steps:
1.  Log into <drive.google.com> in a web browser.
2.  Select the menu [My Drive⏷] → [Upload files…] in the top bar.
    In more recent versions there is instead a “button” [+ New] in
    the top left corner. Click on it and select [File upload] in the
    menu which appears.
3.  A file dialog opens (at least on my system). Browse to the PDF
    file, select it and click [Open].
4.  Open the file from the “Quick access” field. (If this doesn’t
    work, try step 5.)
    i.  The file appears in the “Quick access” field just below
        the top bar. You may need to refresh a couple of times.
    ii. Right-click the file thumbnail; choose [Open with] →
        [Google Docs].
5.  If step 4 doesn’t work (the PDF file doesn’t appear in the
    “Quick access” field):
    i.   Start typing the PDF file name in the [Search Drive] box
         at the top.
    ii.  Click on the file in the menu which appears.
    iii. The file opens in the Drive PDF viewer.
    iv.  At the top there is a menu [Open with Google Docs]. Click
         on it and select Google Docs.

    Or look up the file in the file list and follow 4.ii. (Hard
    when there are lots of files in the list!)
6.  You should now find yourself in the Google Docs document view.
7.  In the [File] menu choose [Download as] → [Microsoft Word
    (.docx)].
8.  Save the DOCX file to disk and convert it with Pandoc the same
    as you would any DOCX file, or edit it with Word/LibreOffice/… if
    you are of that persuasion.
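
The Pandoc half of step 8 can be scripted. Here is a minimal sketch in Python, assuming `pandoc` is on your PATH. The function names and the file name `article.docx` are made up for illustration and are not part of any tool:

```python
import subprocess
from pathlib import Path


def docx_to_markdown_command(docx_path, markdown_path=None):
    """Build the pandoc command line for a DOCX -> Markdown conversion."""
    docx = Path(docx_path)
    # Default output name (an assumption): same name with an .md suffix.
    out = Path(markdown_path) if markdown_path else docx.with_suffix(".md")
    return ["pandoc", str(docx), "--from=docx", "--to=markdown", f"--output={out}"]


def convert_docx_to_markdown(docx_path, markdown_path=None):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(docx_to_markdown_command(docx_path, markdown_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_docx_to_markdown() to run it.
    print(" ".join(docx_to_markdown_command("article.docx")))
```

Swap `--to=markdown` for any other output format Pandoc supports.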
Basic formatting — paragraphs, bold, italics — works very well.
Some more advanced formatting is more or less broken:
- Tables become ordinary text, not very well lined up.
- Nested lists are flattened.
- Small caps text disappears entirely! If you have access to the
  original LaTeX file I suggest putting this in your preamble:

      \renewcommand\textsc[1]{\textbf{\textit{#1}}}

  or, if bold italics actually occur in your document, this:

      \usepackage{textcase}
      \renewcommand\textsc[1]{\textbf{\textit{\MakeTextUppercase{#1}}}}

  Ugly as hell, but sequences of uppercase bold italics are
  unlikely to actually occur in a document and are relatively easy
  to find and replace with something better in a “word processor” or
  in a text editor after conversion from DOCX to some sensible
  format with Pandoc.

  If you post-edit in a “WP” you may try (x)color and something
  like \renewcommand\textsc[1]{\textcolor{red}{#1}} instead. That
  may be hard to find _with_ the “WP” but is relatively easy to find
  _in_ the “WP” for a human eye.
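
The find-and-replace after conversion to Markdown can be scripted too. The sketch below goes with the \MakeTextUppercase variant and assumes that Pandoc's Markdown writer renders bold italics as `***text***`, so check your own output first. `restore_smallcaps` is a hypothetical helper name, not something Pandoc provides:

```python
import re

# Assumption: bold italics appear as ***text*** in the converted Markdown.
BOLD_ITALIC = re.compile(r"\*\*\*([^*\n]+?)\*\*\*")


def restore_smallcaps(markdown, lowercase=False):
    """Rewrap all-uppercase bold-italic runs as Pandoc small caps spans."""

    def rewrap(match):
        text = match.group(1)
        # Leave genuine bold italics (anything containing a lowercase
        # letter, or no letters at all) untouched.
        if text != text.upper() or not any(ch.isalpha() for ch in text):
            return match.group(0)
        if lowercase:
            text = text.lower()
        return f"[{text}]{{.smallcaps}}"

    return BOLD_ITALIC.sub(rewrap, markdown)


if __name__ == "__main__":
    print(restore_smallcaps("See ***KNUTH*** and ***really*** here."))
```

The original capitalization is lost once the text has been uppercased, so `lowercase=True` is only a rough fallback. Runs broken across lines are not matched; converting with `--wrap=none` should avoid that.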
You may want to correct these things in the “word processor” but
my definite preference is to convert the DOCX file to Pandoc’s
extended Markdown with Pandoc, fix things up and then convert
(back) to DOCX. You can then also apply your own custom named
styles for things like color.
http://pandoc.org/MANUAL.html#custom-styles
http://pandoc.org/MANUAL.html#option--reference-doc
The manual still says “For best results, do not make changes to this
file other than modifying the styles used by pandoc”, but changing it
is just what you want to do if you are using custom styles, including
adding your own! BTW you may want to avoid non-ASCII and
non-alphanumeric characters in your custom style names so that you
don’t need to quote your custom-style attribute values!
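
The conversion back to DOCX with a reference document might look like this. It is a sketch only: the function name and all file names are placeholders, and it assumes `pandoc` is on your PATH:

```python
import subprocess


def markdown_to_docx_command(markdown_path, docx_path, reference_doc=None):
    """Build the pandoc command line for a Markdown -> DOCX conversion."""
    command = [
        "pandoc", str(markdown_path),
        "--from=markdown", "--to=docx",
        f"--output={docx_path}",
    ]
    if reference_doc is not None:
        # The reference doc supplies the style definitions, including any
        # custom styles you have added to it.
        command.append(f"--reference-doc={reference_doc}")
    return command


def convert_markdown_to_docx(markdown_path, docx_path, reference_doc=None):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        markdown_to_docx_command(markdown_path, docx_path, reference_doc),
        check=True,
    )


if __name__ == "__main__":
    # A starting reference doc can be produced with:
    #   pandoc -o custom-reference.docx --print-default-data-file reference.docx
    print(" ".join(markdown_to_docx_command(
        "article.md", "article-fixed.docx", "custom-reference.docx")))
```

Writing to a new file name keeps the DOCX downloaded from Google Docs intact for comparison.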
Speaking of small caps, Pandoc has an official syntax for it:
[small caps text]{.smallcaps}, but that is far too verbose by Markdown
standards! ;-) I usually overload Pandoc’s generally useless
strikeout syntax so that I can type ~~small caps text~~ with this
Pandoc Lua filter:

    -- Turn every strikeout (~~text~~) into small caps.
    function Strikeout (elem)
      return pandoc.SmallCaps(elem.content)
    end

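
To apply the filter, save it to a file and pass it to Pandoc with `--lua-filter`. A minimal sketch, where the filter file name `strikeout-smallcaps.lua` and the function name are made up for illustration:

```python
import subprocess


def smallcaps_filter_command(markdown_path, output_path,
                             filter_path="strikeout-smallcaps.lua"):
    """Build a pandoc command line that applies the Lua filter."""
    return [
        "pandoc", str(markdown_path),
        f"--lua-filter={filter_path}",
        f"--output={output_path}",
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to
    # subprocess.run(..., check=True) to run it.
    print(" ".join(smallcaps_filter_command("article.md", "article.docx")))
```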
I hope this is of use to someone!
/bpj