[tex-live] Better ways to find packages and documentation

Wed Jul 4 17:33:20 CEST 2007

Hi,

We're getting closer and closer. :)

Norbert Preining <preining at logic.at> wrote:

> On Tue, 03 Jul 2007, Florent Rougon wrote:
>> I took a little bit of time to answer, because I wanted to look at the
>> new TL infrastructure before (and also because I'm trying to have a life
>
> Please send me your comments and suggestions for this, too (private or
> here on list). We are still in the initial phase and everything is open.

Well, I'm not as deeply involved in that as you are, so for the moment I'm
only in a position to try to understand how it works, but not really to
criticize and judge...

>> Do you want to visit Paris? :)
>
> Yes. When? Do you have some logic/mathematics institute nearby so that I
> can give a talk? That would make it an official journey ;-)

For holidays, we can arrange something in July or August (I think I'll
stay here till at least mid-July; for the rest, the details still have
to be worked out).

As for the talk, well, good idea... Problem is, I don't have close
relationships to university guys (I'm not from that world, I'm from what
we call « Grandes écoles » and only spent one year in University
proper).

But I know there is a guy from the ENS (École Normale Supérieure)
reading this list; it would be great if he could arrange to invite you
for a talk. :)

>> When I read this texlive.tlpdb excerpt, it is not at all obvious (for a
>> program) that foo-de.pdf has the attribute language='de'.
>
> So the problem we/you try to solve is the additional tagging of single
> documents with language tags? Right?

Mostly. For individual documents, there are *already* (even if not
always, or not always up-to-date) language *attributes* in the CTAN
Catalogue, so it would be nice to take advantage of them.

Additionally, I proposed to add *tags* to individual documents, but
that's only a proposal. It allows for finer classification of the docs
but implies a bit more complexity. Reading your mail, it seems you don't
really want to bother with that.

BTW: the emphasis here is to remind you that language='de' in a
     <documentation> element of the Catalogue is *not* what I call a tag
     here: it is an XML _attribute_ of the <documentation>.

     When I say _tag_ in this discussion, it refers to tags in the realm
     of faceted classification, i.e. as implemented in libtagcoll and
     debtags (the latter relying on the former).
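
To make the attribute side of that distinction concrete, here is a
minimal sketch of reading the language attribute of each
<documentation> element. The entry below is invented for illustration:
only the language attribute on <documentation> comes from this
discussion; the other element and attribute names, and the "en" default
for a missing attribute, are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical Catalogue entry. Only language="..." on <documentation>
# is taken from the discussion; everything else is made up.
ENTRY = """\
<entry id="foo">
  <documentation details="User manual" href="doc/latex/foo/foo.pdf"/>
  <documentation details="Handbuch" href="doc/latex/foo/foo-de.pdf"
                 language="de"/>
</entry>
"""


def doc_languages(entry_xml, default="en"):
    """Map each documentation href to its language attribute.

    `default` is used when the attribute is absent (an assumption).
    """
    root = ET.fromstring(entry_xml)
    return {doc.get("href"): doc.get("language", default)
            for doc in root.iter("documentation")}


if __name__ == "__main__":
    for href, lang in sorted(doc_languages(ENTRY).items()):
        print(href, lang)
```

Nothing here involves tags in the debtags sense: the language is plain
XML metadata attached to one element.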

> If we just for now ignore this problem and consider *only* CTAN package
> tagging (isn't it what you proposed below?) then that could be ignored.

Yes, that was a proposal.

> I would say for a start some system that allows me to search for
> packages with some tags, and spits out all/some document files, without
> any further sub-division (what kind of doc file etc) would already be a
> HUGE step forward, probably enough for all kind of needs. People *can*
> generally understand what a file is for, at least after the first 2
> pages.

But it takes time. Most things are only a matter of time. I can read the
whole CTAN English documentation given enough time...

>> I thought that in texlive.tlpdb:
>>   - either the 'name' field indicates the CTAN package name;
>>   - or there is a 'catalogue' field indicating the CTAN package name.
>
> In theory, yes. But nobody ever checked this!!!

Then make reality fit theory. :)

> We have some catalogue entries because I realized that the names are
> changed, but in general we *just*assume* that it is like this.

Why not assume that the tlpdb-to-CTAN package name mapping is correct
(using "catalogue" fields when they exist) and just fix any problem that
arises?

>> FWIW, I couldn't find ctan2tl by browsing the TL repository through the
>> web interface. Where is it?
>
> http://www.tug.org/svn/texlive/trunk/Build/tools/ctan2tl

Ah, thanks. Somehow most of the infrastructure stuff seemed to be in
Master, so I didn't look closely enough in Build. I'm still wondering
about the guidelines that define the contents of each of these dirs...

>> Yup, easy, but I want the Catalogue metadata carried with each doc file.
>
> But didn't you propose that tagging is done on the package level of the
> catalogue, not on the document level?

Yes, that was a proposal, an option. 

> In this case no need?

Well, not much of a need, but still. The language attributes of doc
files in the Catalogue are not what I'd call document tags. So even if
we abandon the idea of document tags, it would still be better to take
advantage of these attributes from the Catalogue metadata.

>> 1) Can a given CTAN package be split among several TEXMF trees (in TL,
>>    in MiKTeX, etc.)? Or rather, do we want to support that?
>
> No.

Good. Then we can have the metadata be self-contained in each TEXMF
tree, with relative paths from the base of the TEXMF tree. This is nice
from a theoretical POV and allows natural extension for TEXMFLOCAL data.

>> 2) Do you want the data for TL-available packages to be split into
>>    individual files for each package, or gathered into big files?
>> 
>>    Advantage for the split version: it's easier to register/unregister a
>>    package by distributors: just add or remove the corresponding files.
>> 
>>    Disadvantage: takes more space, clutters the filesystem.
>
> You mean of the documentation or something else?

The info I'm talking about here is:

  $package is available in TL
  $package has the following tags: macropackage::latex,
                                   field::chemistry, etc.
  $package ships the following doc files:
      relative/path/from/base/of/texmf/tree/file1 $short_desc1 language=de
      relative/path/from/base/of/texmf/tree/file2 $short_desc2 language=en
      ...

We can have this info in one file per CTAN package, or all compiled into
one big file. The first version allows for easier management by
distributors such as Debian (easy to add or remove a package), but it
creates a zillion files, each of which takes one filesystem block, and
this can be slow on DVD media as you pointed out.

From your later comments, I believe you prefer going for one big file
containing all the data for available packages.
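
As a rough sketch of the "one big file" option, the per-package info
listed above could be kept as blank-line-separated stanzas and read back
in a single pass. The field names (Package, Tags, Doc), the line layout
and the class names are all hypothetical; nothing here is an existing
TeX Live format.

```python
from dataclasses import dataclass, field


@dataclass
class DocFile:
    path: str         # relative to the base of the TEXMF tree
    language: str
    description: str


@dataclass
class PackageInfo:
    name: str
    tags: list = field(default_factory=list)
    docs: list = field(default_factory=list)


def dump(packages):
    """Serialize all packages into one text blob, one stanza each."""
    stanzas = []
    for p in packages:
        lines = ["Package: " + p.name, "Tags: " + ", ".join(p.tags)]
        lines += ["Doc: %s language=%s %s"
                  % (d.path, d.language, d.description) for d in p.docs]
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


def load(text):
    """Parse the blob written by dump() back into PackageInfo objects."""
    packages = []
    for stanza in text.strip().split("\n\n"):
        pkg = None
        for line in stanza.splitlines():
            key, _, value = line.partition(": ")
            if key == "Package":
                pkg = PackageInfo(value)
            elif key == "Tags" and pkg is not None:
                pkg.tags = [t.strip() for t in value.split(",") if t.strip()]
            elif key == "Doc" and pkg is not None:
                # Assumes doc paths contain no spaces.
                path, lang, desc = value.split(" ", 2)
                pkg.docs.append(DocFile(path, lang.partition("=")[2], desc))
        if pkg is not None:
            packages.append(pkg)
    return packages
```

The split variant would simply write each stanza to its own file; the
parsing code stays the same, only the number of files opened changes.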

>> 3) Do you want to reduce data redundancy as much as possible?
>
> Yes. The texlive cd will/can carry a copy of the catalogue (in some
> way), and that should be taken for further information.

OK, then I absolutely need a way to link from the "big file" we were
talking about (maybe your available.tlpdb) to each CTAN package. Thus,
we need to specify that TL pkg name == CTAN package name, except when
the tlpdb has a "catalogue" field, in which case this indicates the CTAN
package name. Or something like that.
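
The rule could be sketched like this, assuming the tlpdb is a series of
blank-line-separated stanzas made of "key value" lines with "name" and
optional "catalogue" keys. That layout is my reading of the
texlive.tlpdb excerpt, not a settled specification, and the sample data
is invented.

```python
def ctan_names(tlpdb_text):
    """Return {TL package name: CTAN package name}.

    The CTAN name is the "catalogue" field when present, else the
    "name" field.
    """
    mapping = {}
    for stanza in tlpdb_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, _, value = line.partition(" ")
            # Keep the first occurrence of each key in a stanza.
            fields.setdefault(key, value.strip())
        if "name" in fields:
            mapping[fields["name"]] = fields.get("catalogue", fields["name"])
    return mapping


# Invented sample data, for illustration only.
SAMPLE = """\
name foo
category Package

name bar-tl
category Package
catalogue bar
"""

if __name__ == "__main__":
    for tl_name, ctan_name in sorted(ctan_names(SAMPLE).items()):
        print(tl_name, "->", ctan_name)
```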

>   We want to reduce data redundancy *in the source packages*!!!
>
>   The TLPOBJ files *CAN* (and hopefully will) be enriched with additional
>   information from the catalogue

Ah. You should know what you want. :)

Two possibilities:
  - either we ship the Catalogue in XML format somewhere and we need a
    solid way to link from the other files to the Catalogue;
  - or we put all the needed information in "big files" that are derived
    from the Catalogue at installation time.

>  but first I have to write a catalogue access Perl module (and read
>  xml, grrrr ;-).

Are you trying to parse it manually or what???

With the appropriate Python module (for instance xml.etree.ElementTree,
but there are many others), it is really very simple.

I have a movie catalog that is stored in XML format like this:

<?xml version="1.0" encoding="ISO-8859-15"?>
<collection>
  <disc number="1">
    <video type="movie" version="VF">
      <title>
        L'âge de glace
      </title>
      <director>
        Chris Wedge
      </director>
      <director>
        Carlos Saldanha
      </director>
      <year>
        2002
      </year>
    </video>

    <video type="movie" version="VF">
      <title>
        Smoke
      </title>
      <director>
        Wayne Wang
      </director>
      <year>
        1995
      </year>
    </video>

    <video type="movie" version="VO" audio="en" subtitles="fr">
      <title>
        La nuit des morts-vivants
      </title>
      <director>
        George A. Romero
      </director>
      <year>
        1968
      </year>
    </video>

    <video type="movie" version="VO" audio="fr">
      <title>
        Le mouton enragé
      </title>
      <director>
        Michel Deville
      </director>
      <year>
        1974
      </year>
    </video>
  </disc>

  <disc number="2">

  [...]

  </disc>
</collection>

My Python script generates LaTeX documents and Makefiles for the catalog
in various sort orders (by movie title and by disc number for now). It
uses templates for the various portions of LaTeX code (the global
documents as well as the portions that are repeated for each movie),
which can be stored in
/usr/{local/}?share/progname/templates/{by_title,by_disc_number/} or
$HOME/.progname/templates/{by_title,by_disc_number/}, with configuration
in $HOME/.progname/config.py. It is only 450 lines when discarding the
comments (but not the "usage" help). See attachment.

Parsing the XML data is as simple as this:

import xml.etree.ElementTree as et

def main():

    [...]

    # p is a dictionary containing the parameters read from the
    # config file
    tree = et.ElementTree(file=p["input file"])
    write_catalogs(p, tree)

def write_catalogs(p, tree):
    write_catalog(p, tree, "by title", cmp_func_by_title)
    write_catalog(p, tree, "by disc", cmp_func_by_disc)

def write_catalog(p, tree, catalog_type, cmp_func):
    root = tree.getroot()

    [...]

    l = []

# --------------------8<-----------------------------8<--------------------
# Start of portion that reads the XML data
#
    for disc_elt in root:
        disc_number = int(disc_elt.get("number"))

        for video_elt in disc_elt:
            l.append((video_elt, disc_number))
#
# End of portion that reads the XML data
# --------------------8<-----------------------------8<--------------------

    l.sort(cmp=cmp_func)

    [...]
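
For anyone trying this on Python 3: the cmp= argument of list.sort() no
longer exists there, and functools.cmp_to_key() is the standard
replacement. Below is a self-contained sketch of the same read-and-sort
step. The inline catalog is a cut-down sample, and the body of
cmp_func_by_title is a guess at what such a function might do, not the
one from the attached script.

```python
import functools
import xml.etree.ElementTree as ET

# Cut-down sample in the same shape as the movie catalog above.
CATALOG = """\
<collection>
  <disc number="2">
    <video type="movie" version="VF">
      <title> Smoke </title>
    </video>
  </disc>
  <disc number="1">
    <video type="movie" version="VO" audio="en" subtitles="fr">
      <title> La nuit des morts-vivants </title>
    </video>
  </disc>
</collection>
"""


def read_videos(root):
    """Return a list of (video element, disc number) pairs."""
    videos = []
    for disc_elt in root:
        disc_number = int(disc_elt.get("number"))
        for video_elt in disc_elt:
            videos.append((video_elt, disc_number))
    return videos


def cmp_func_by_title(a, b):
    """Old-style comparison function (hypothetical body)."""
    ta = a[0].findtext("title").strip().lower()
    tb = b[0].findtext("title").strip().lower()
    return (ta > tb) - (ta < tb)


def sorted_titles(catalog_xml):
    videos = read_videos(ET.fromstring(catalog_xml))
    # cmp_to_key adapts a cmp-style function to the key= protocol.
    videos.sort(key=functools.cmp_to_key(cmp_func_by_title))
    return [(v.findtext("title").strip(), disc) for v, disc in videos]


if __name__ == "__main__":
    for title, disc in sorted_titles(CATALOG):
        print(disc, title)
```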

Then I write the particular data for a given movie, properly quoted for
special characters in LaTeX, based on a template like this:

@TITLE@ \emph{[@COMMENTS@]} & @AUDIO_VERSION@ & @AUDIO_LANG@ & @SUBTITLES_LANG@ & @DISC@\\

    # Write the info for each record
    for video_elt, disc_nb in data:
        title = video_elt.findtext("title").strip()

        comments = video_elt.findtext("comments", None)

        # snip code that chooses the right template depending on
        # whether we want comments on movies in the output file
        # (assigns the variable 't')

        audio_version = video_elt.get("version", "")
        audio_lang = video_elt.get("audio", "")
        subtitles_lang = video_elt.get("subtitles", "")

        for from_str, to_str in \
                {"@TITLE@": title,
                 "@COMMENTS@": comments,
                 "@AUDIO_VERSION@": audio_version,
                 "@AUDIO_LANG@": audio_lang,
                 "@SUBTITLES_LANG@": subtitles_lang,
                 "@DISC@": ("%u" % disc_nb)
                 }.items():
            if to_str is not None:
                t = t.replace(from_str, latex_quote(to_str))
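
The latex_quote() helper is not shown above. A minimal stand-in could
look like the sketch below; the escape table is my own choice for
illustration and is not necessarily what the attached script does.
fill_template() is likewise a hypothetical wrapper around the
replacement loop.

```python
# Escapes for the ten characters LaTeX treats specially in text mode.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_quote(s):
    """Escape LaTeX special characters in s.

    Working character by character avoids escaping the backslashes
    and braces introduced by earlier replacements.
    """
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in s)


def fill_template(template, values):
    """Replace each @PLACEHOLDER@ with its quoted value.

    Placeholders whose value is None are left untouched.
    """
    for placeholder, value in values.items():
        if value is not None:
            template = template.replace(placeholder, latex_quote(value))
    return template


if __name__ == "__main__":
    row = fill_template("@TITLE@ & @DISC@\\\\",
                        {"@TITLE@": "Tom & Jerry", "@DISC@": "3"})
    print(row)
```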

I have to say, if you find XML difficult to read by a program, then
maybe, just maybe, you should look on the side of the language in use.
;-)

>   We want to include at least
>   - title/long description
>   - some version/license information
>   - (tagging information?)

Easy to add:

Tags: foo, bar, ...

Or we can have a separate file for all the package tags, as done in
debtags (actually, in Debian, we now have a mixed approach: fast-paced
evolution in /var/lib/debtags/package-tags provided you have the
appropriate lines in /etc/debtags/sources.list; slow-paced [approved by
the ftp-masters] for the tags embedded in the Packages file).
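
For the embedded variant, reading such a field from an RFC-2822 style
stanza takes a few lines with Python's standard email module. The stanza
below is invented; only the "Tags:" field name comes from the example
above, and tags_of() is a hypothetical helper.

```python
from email import message_from_string

# Invented stanza in RFC-2822 style.
STANZA = """\
Package: foo
Version: 1.2
Tags: macropackage::latex, field::chemistry
"""


def tags_of(stanza):
    """Return the list of tags from the Tags field, or [] if absent."""
    msg = message_from_string(stanza)
    raw = msg.get("Tags", "")
    return [t.strip() for t in raw.split(",") if t.strip()]


if __name__ == "__main__":
    print(tags_of(STANZA))
```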

> I can imagine that additionally installed packages (in TEXMFLOCAL) drop
> their description files into TEXMFLOCAL/somewhere.

Sure. Probably not homogeneous if we adopt the "big file" approach, but
it is mostly an aesthetic problem (not homogeneous, because for adding
third-party stuff, I won't adopt the "big file" approach, since it's so
much easier for distributors to add/remove individual files per
package).

> The plan is that every local installation will have a 
> 	local.tlpdb
> (name to be changed) containing *only* those packages which are
> installed.

Good.

> Of course this doesn't handle the TEXMFLOCAL files.

That can be handled with additional files added by the admin.

>>    It is quite possible that we don't need to go so far as tagging
>>    individual doc files:
>
> ACK, and thus I would also ignore the problems you were talking about above,
> about tagging individual files with language tags.
>
> There are *some* packages already in CTAN with language names, e.g., 
> 	lshort-russian
> and some others. I propose that we don't need the complexity of single
> file tagging, and neither the language tags for single files.

Hmmmm... grmmmpffff... okaaaaaaaaaaaaaaaaaay...

(but language='de' isn't a tag, it's an attribute)

>> (well, there is another question as I see from the rest of your mail: do
>> you prefer XML or RFC-2822 format? I saw you have some grief about XML,
>
> You have seen the texlive.tlpdb, I guess you know the answer ;-)

Maybe I have changed your mind now. ;-)

Well, maybe it's not so fun to change data in an already existing XML
file, but at least _reading_ it should be quite easy with a proper
library (module).

> I don't believe you. Well, btw, if you can write Python access modules
> for the TLPSRC/TLPOBJ/TLPDB/TLTREE/... that would be great, too. In fact
> for *application* (i.e., installer/updater, etc) purposes only access
> modules for
> 	TLPOBJ and TLPDB
> would be necessary. The rest is only for us at obj generation time.

Easy... but it's probably better to do that:
  - when the syntax is more or less settled;
  - or when I actually need it for the tool we're discussing.

Or does someone already need Python modules for this stuff??

>> You mean, there will be an unacceptable performance hit if anything in
>> this design causes a program to read one file per CTAN package? Because
>> of DVD head movements and things like that?
>
> Yes.

OK, then let's go for "big files" for everything except third-party
stuff (in order to ease Debian packaging and such).

> I could even imagine that, if we do the tagging on a per package level,
> that we add the tagging to the TLPDB, and then we have the TLPDB of
> installed stuff, the TLPDB of available stuff of the TeX Live
> installation media, and the additional .xml/whatever files dropped into
> TEXMFLOCAL.

Exactly.

> Yes, the installer creates a local.tlpdb for every installation.

Good.

> So my proposal is:
> - tagging is done on a per package level, not per file level

OK.

Hum, well, does everyone agree? :)

> - tags are stored in the Catalogue
> - tags are defined either by (format to be specified) files in the CTAN
>   dir as uploaded by the author, or via the web interface, or by the
>   CTAN maintainers

OK.

> - tags are taken from the catalogue when generating the to be shipped 
>   texlive.tlpdb and stored there

Earlier, you were saying that we would ship the Catalogue. In this case,
if you *really* want to reduce data redundancy, we can go look for the
tags in the Catalogue.

*But* the Catalogue is made up of a zillion files, and we said we won't
accept any design that reads a zillion files. So, we need a compiled
version of the Catalogue, and this in fact can well be in the
texlive.tlpdb file.

Conclusion: no need to ship the Catalogue anymore (in XML format, that
            is)?

(tsss, tsss, tsss... but how do I get my language attributes, then? Will
 you kindly host them in texlive.tlpdb? :)

> - locally installed packages can ship (format to be specified) files 
>   in TEXMFLOCAL/(location to be specified)
> - the doc search program takes the infos/tags from:
> 	- the texlive.tlpdb as shipped on the DVD
> 	- the local.tlpdb of installed packages
> 	- the additional files in TEXMFLOCAL
>   and presents the info in some structured way (would allow people
>   to search also in *not already installed* packages or only under those
>   which are already installed).

Agreed.
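
To check that the three sources combine the way we both expect, here is
a small sketch of the merge and search step. It assumes each source has
already been parsed into a {package name: set of tags} mapping; the
function names and the sample data are hypothetical.

```python
def merge_sources(available, installed, local):
    """Merge tag data from the three sources.

    available: packages in the shipped texlive.tlpdb
    installed: packages in local.tlpdb
    local:     packages described by extra files in TEXMFLOCAL
    Returns {name: {"tags": set, "installed": bool}}.
    """
    index = {}
    for name, tags in available.items():
        index[name] = {"tags": set(tags), "installed": False}
    for source in (installed, local):
        for name, tags in source.items():
            entry = index.setdefault(
                name, {"tags": set(), "installed": False})
            entry["tags"] |= set(tags)
            entry["installed"] = True
    return index


def search(index, wanted, installed_only=False):
    """Return sorted names of packages carrying all the wanted tags."""
    wanted = set(wanted)
    return sorted(
        name for name, entry in index.items()
        if wanted <= entry["tags"]
        and (entry["installed"] or not installed_only))


if __name__ == "__main__":
    index = merge_sources(
        available={"foo": {"field::chemistry"},
                   "bar": {"field::chemistry"}},
        installed={"foo": {"field::chemistry"}},
        local={"mylab": {"field::chemistry"}})
    print(search(index, ["field::chemistry"]))
    print(search(index, ["field::chemistry"], installed_only=True))
```

The installed_only flag covers both cases from the proposal: searching
among everything on the media, or only among what is already installed.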

We're almost there. :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: flo-gen-video-catalogs
Type: application/x-python-code
Size: 17045 bytes
Desc: Program for generating movie catalogs from XML to LaTeX
	format
Url : http://tug.org/pipermail/tex-live/attachments/20070704/903de7f1/attachment-0001.pyc