[tldoc] html + \Thanh updates

Sun Aug 17 22:22:36 CEST 2008

Staszek Wawrykiewicz writes:

 > Just corrected and commited. I use more simply way with live4ht.cfg OK?

Hi Staszek, 
I just ran svn up and examined the files in Emacs (converting to
hex where necessary).  Please excuse me if the following is more
confusing than helpful.

It seems that the current version (rev. 10383) is correct now,
except that the double accented character is wrong in texlive-pl.html:

  <li class="itemize">H&#x00E0;n  Th&#x00EA;  Th&#x00E0;nha,

Double accented characters cannot be represented by a single byte in
latin-1 or latin-2.  The correct code point is U+1EBF.  But your html
file contains U+00EA, which corresponds to \^e.

So it should be

  <li class="itemize">H&#x00E0;n  Th&#x1EBF;  Th&#x00E0;nha,
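
To make the difference between the two lines visible without a browser,
here is a small Python sketch (my own illustration, not part of the
tldoc tooling; the helper name describe is made up).  It resolves the
numeric character references and prints the Unicode name of every
non-ASCII character:

```python
import html
import unicodedata

# The two <li> payloads quoted above: what the file contains now, and what it should contain.
current = "H&#x00E0;n  Th&#x00EA;  Th&#x00E0;nha,"
wanted = "H&#x00E0;n  Th&#x1EBF;  Th&#x00E0;nha,"


def describe(markup):
    """Resolve character references and list (code point, name) for non-ASCII characters."""
    text = html.unescape(markup)
    names = [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if ord(ch) > 127
    ]
    return text, names


current_text, current_names = describe(current)
wanted_text, wanted_names = describe(wanted)

for label, names in (("current", current_names), ("wanted", wanted_names)):
    for code_point, name in names:
        print(label, code_point, name)
```

The second entry of each list is the interesting one: only the wanted
line carries the circumflex and the acute together.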

It was also correct in the former version (rev. 10191): 

  <li class="itemize">Hàn&#x00A0;Th&#x1EBF;&#x00A0;Thànha,

This file was latin-1 encoded but your browser assumed latin-2. 
The reason was a typo in the header:

  charset=iso-8559-1"
               ^^^
In ISO-latin-1 and UCS-2 00E0 is \`a while in latin-2 it's \'r.

It worked for you (that is, Polish characters were displayed
properly) because the header was incorrect and the default encoding in
your browser is latin-2.  It may not work for people with other
defaults, like UTF-8.
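
The same effect can be shown outside a browser.  The following Python
sketch is only an illustration: Python's codec registry stands in for
whatever table a browser uses, and the helper is_known_charset is made
up for this example.  It decodes the byte E0 under both charsets and
checks whether the misspelled label is recognized at all:

```python
import codecs

# The single byte in question, as it appears in the latin-1 encoded file.
raw = bytes([0xE0])

as_latin1 = raw.decode("iso-8859-1")  # a with grave
as_latin2 = raw.decode("iso-8859-2")  # r with acute
print(f"latin-1: U+{ord(as_latin1):04X}  latin-2: U+{ord(as_latin2):04X}")


def is_known_charset(label):
    """Return True if Python knows a codec under this label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


# The last label is the typo from the header.
for label in ("iso-8859-1", "iso-8859-2", "iso-8559-1"):
    print(label, is_known_charset(label))
```

A consumer that does not recognize the label has to fall back to some
default, and that default decides which of the two characters appears.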

I can reproduce it here.  When I change the charset in the header to
iso-8859-1 I see Thanh's name displayed properly.  When I change it to
iso-8859-2 then I see \'r instead of \`a.  I don't see any differences
regarding Polish characters, maybe because I don't know what I should
look for.

The new version (rev. 10383) has an iso-latin-2 header which is more
appropriate for Polish.  I suppose that the old one (rev. 10191) was
created in latin-1 and some Polish glyphs weren't displayed properly.
I have no clue. At least tex4ht assumed latin-1 when it processed
Thanh's name.  Staszek, could you dig out rev. 10191 (this is what
is currently on the ISO image) and check?  It is probably better not
to use a web browser for testing because most of them are error
tolerant and you can never be sure whether they try to find a
workaround when they encounter a problem.

It's better to examine the file in a text editor, preferably one
which can produce hex dumps.  If you create a hex dump in Emacs, the
cursor jumps to the position it had in the text file, so the bytes in
question are easy to find.
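
For those who don't use Emacs, a hex dump is easy to produce with a few
lines of Python.  This is just a sketch; the function name hex_dump and
the sample bytes are my own, and the sample mimics a latin-1 encoded
fragment rather than the real file:

```python
def hex_dump(data, width=16):
    """Return hex dump lines: offset, hex bytes, printable ASCII (others as '.')."""
    pad = width * 3 - 1  # width of the hex column, including separating spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return lines


# Sample: "Han", a no-break space and "The" with latin-1 accents, as raw bytes.
sample = "H\u00e0n\u00a0Th\u00ea".encode("iso-8859-1")

for line in hex_dump(sample):
    print(line)

# Offsets and values of all bytes outside the ASCII range.
non_ascii = [(i, f"{b:02x}") for i, b in enumerate(sample) if b > 127]
print(non_ascii)
```

To inspect a real file, read it in binary mode and pass the bytes to
hex_dump; the list of non-ASCII offsets shows where to look.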

When I simply run the Makefile I get an html file with
charset=iso-8859-1.  What did you do to get iso-8859-2?

Thanh's name is spelt properly in the other languages, except in
Czech:

  <li class="itemize">Hàn&#x00A0;Thê´&#x00A0;Thànhovi

But maybe it had been written this way deliberately because if a font
doesn't contain Vietnamese glyphs the result is still acceptable and
much better than a black rectangle.

A more robust definition of \Thanh would be:

\usepackage[T5,T1]{fontenc} %% the latter one is the default.

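% Without tex4ht, \HCode is undefined and compares equal to \UnDef, so the
% name is typeset in T5; with tex4ht, character references go to the HTML.
\def\Thanh{\ifx\HCode\UnDef
   {\fontencoding{T5}\selectfont H\`an~Th\'\ecircumflex~Th\`anh}\else
   \HCode{H\string&#x00E0;n\string&#x00A0;Th\string&#x1EBF;\string&#x00A0;Th\string&#x00E0;nh}\fi}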

If you instruct tex4ht to produce UTF-8, then it's sufficient to
define 

\def\Thanh{{\fontencoding{T5}\selectfont H\`an~Th\'\ecircumflex~Th\`anh}}

UTF-8 support for Vietnamese works quite well in tex4ht.  There are
some test files in texmf-dist/doc/generic/vntex/tests and Eitan
already fixed all known bugs some years ago.  But I have no experience
with other languages...
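
To illustrate why UTF-8 output needs no character references while
latin-1 and latin-2 do, here is a short Python sketch.  It only shows
the encodings themselves and says nothing about what tex4ht does
internally; the helper can_encode is made up for this example:

```python
# Thanh's name with plain spaces, written with escapes to keep this snippet ASCII.
name = "H\u00e0n Th\u1ebf Th\u00e0nh"


def can_encode(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for charset in ("utf-8", "iso-8859-1", "iso-8859-2"):
    print(charset, can_encode(name, charset))

# UTF-8 stores every character directly, using two or three bytes where needed.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)

# The 8-bit charsets need numeric character references for what they lack.
latin1_with_refs = name.encode("iso-8859-1", errors="xmlcharrefreplace")
latin2_with_refs = name.encode("iso-8859-2", errors="xmlcharrefreplace")
print(latin1_with_refs)
print(latin2_with_refs)
```

Note that latin-2 lacks even the a with grave, so a latin-2 page needs
references for three of the characters in the name.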

Maybe I should look into it soon.  It's better to produce UTF-8 and to
avoid code like \llap{\raise... in order to put an accent above
\ecircumflex. Since we have Latin Modern and TeX Gyre now we can do it
much better.  The problem with the documentation is that nobody cares
about it until a few days before the deadline.  Thus, old stuff which
isn't needed anymore is inherited from older releases.

If we want to change anything, we should do it soon, not when people
are confronted with a deadline.  As I said, everything has to be
tested carefully and now we have the time to do it.

BTW, switching to UTF-8 for HTML output doesn't affect PDF output at
all and it's quite independent from input encodings.  Everybody can
still use his favorite input encoding. 

Regards,
  Reinhard

-- 
----------------------------------------------------------------------------
Reinhard Kotucha			              Phone: +49-511-3373112
Marschnerstr. 25
D-30167 Hannover	                      mailto:reinhard.kotucha at web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------