[texhax] TeX hyphenation -- why do so many words get no hyphens

Petr Sojka sojka at informatics.muni.cz
Thu Aug 5 20:01:00 CEST 2004


On Thu, Aug 05, 2004 at 09:54:15AM -0400, Barbara Beeton wrote:
> i forgot to mention in my earlier message that i am trying
> to find frank liang, to obtain permission to post the text
> of his dissertation (word hy-phe-na-tion by com-pu-ter) on
> the tug web site.

A bit of googling turns up a Frank Liang with a small photo
on http://www.technopolicy.net/policy.php?content=advisory
but without an email address.  Another photo of him is in
http://sts.scu.edu/globalization/Brochure.pdf.
Try googling with `"Frank Liang" Shanghai'.
 
Another Frank Liang is with Tyco Electronics in TX, USA.
 
> that document explains the constraints on the selection
> of patterns better than i could, and would be a valuable
> addition to the public record.

On p. 29:
"We decided to base the algorithm on our copy of Webster's _Pocket_
Dictionary, mainly because this was the only word list we had that 
included all derived forms."
and later
"..testing the algorithm against a larger dictionary
obtained from a publisher, containing about 115000 entries
produced about 10000 errors on words not in the pocket
dictionary."

Today's computer resources and fine-tuned patgen generation
parameters (thresholds) make it possible to generate patterns
from a wordlist of 1000000 word forms (a Czech morphology-based
spell checker actually knows 5000000 word forms) with _no_ errors,
a short or empty exception list, and 99.9% coverage of hyphenation
points.  The only real constraint is finding somebody to check
such a big wordlist after it has been prehyphenated by the currently
available patterns (I am willing to do the pattern generation).
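
For anyone wanting to see what such prehyphenated output looks like
on a small scale, here is a minimal plain TeX sketch.  It relies only
on the \showhyphens macro from plain.tex; the three sample words are
arbitrary and stand in for entries of a real wordlist.

```tex
% Plain TeX sketch: report the hyphenation points that the currently
% loaded patterns find for each word.
% The sample words are arbitrary; a real check would take them from
% the wordlist to be reviewed.
\showhyphens{hyphenation computer dictionary}
\bye
```

The hyphenated forms appear on the terminal and in the log file as
part of an "Underfull \hbox" message, so they can be collected and
handed to a human checker.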

There is no problem offering new patterns as a new \language
in addition to the standard hyphen.tex, for those who put
quality first (while keeping backward compatibility, which
is actually lost even by adding new exceptions).
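
A rough sketch of how this could look when building a plain TeX
format with iniTeX follows.  The file name newhyph.tex and the
control sequence \newenglish are made up for illustration; only
\newlanguage, \language and \input are standard plain TeX.

```tex
% Fragment for a format source processed by iniTeX, placed after
% plain.tex has loaded hyphen.tex as \language=0.
% "newhyph.tex" and \newenglish are hypothetical names.
\newlanguage\newenglish   % allocate a fresh language number
\language=\newenglish     % patterns read next belong to this number
\input newhyph            % expected to contain \patterns{...} and,
                          % if needed, \hyphenation{...}
\language=0               % default stays the original hyphen.tex
```

A document would then opt in with \language=\newenglish, while
everything else keeps the old line breaks.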

Best regards
Petr Sojka


