
Status report: documentation conversion
I have spent a lot more time on converting the filter document to a
"standard" format. This is a report of where I currently stand:

1. The base source is now an Asciidoc file. This uses more-or-less the
default Asciidoc markup, with a number of extensions of my own to
make it a bit richer. I have a custom Asciidoc configuration file that
is used to turn this file into DocBook XML. This seems reasonably
workable, though I have not yet had to deal with figures or indexes.
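
For the record, the whole conversion step is a one-liner (the file
names here are invented for illustration):

  # sketch: Asciidoc source -> DocBook XML, via a custom configuration
  asciidoc -b docbook -f exim-custom.conf -o filter.xml filter.txt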

2. Converting DocBook XML to HTML using xmlto seems to do a reasonable
job, though again, I haven't yet had to deal with figures or indexes.
I have created a private XSL style-sheet that modifies the standard
style to suit my preferences.
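
The invocation is along these lines (file names invented again; the -x
option makes xmlto use my style-sheet instead of the default):

  # sketch: DocBook XML -> single-page HTML with a private XSL layer
  xmlto -x exim-html.xsl html-nochunks filter.xml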

3. The route for converting DocBook XML to plain text is to turn it into
HTML, and then use the w3c browser to make text (w3c does a much better
job than lynx, IMO). There will have to be some fudging here, because
w3c does not do character conversions, and the base file now contains
non-ASCII characters such as en-dashes, typographic quotes and apostrophes,
and the "fi" ligature. Also, it is probably a good idea to put quotes
round strings that, in HTML, are rendered in a monospaced font. So I
plan to write a script that pre-processes the HTML before passing it to
w3c.
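
Something like this is what I have in mind for the pre-processing (a
sketch only: the patterns assume a UTF-8 locale, the file names are
invented, the <tt> substitution assumes that is what the style-sheet
emits for monospace, and the list of substitutions will no doubt grow):

  # fold typographic characters back to ASCII before making plain text,
  # and put quotes round anything that was rendered in a monospaced font
  sed -e 's/–/-/g'   -e 's/—/--/g' \
      -e 's/“/"/g'   -e 's/”/"/g' \
      -e "s/‘/'/g"   -e "s/’/'/g" \
      -e 's/ﬁ/fi/g' \
      -e 's|<tt[^>]*>|&"|g' -e 's|</tt>|"&|g' \
      filter.html > filter-ascii.html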

4. There are problems in turning DocBook XML into PostScript or PDF. The
free technology for doing this appears to be very immature. I have used
xmlto to turn the file into a "formatted objects" (.fo) file, which is
then processed by the "fop" command. Again, I've used a private style
sheet to adjust the default design (reducing the size of fonts for
headings, reducing some of the spacing, changing the way
cross-references are handled, for instance).
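
The pipeline, for anyone who wants to try it (file names invented):

  # sketch: DocBook XML -> XSL-FO -> PostScript (or PDF)
  xmlto -x exim-fo.xsl fo filter.xml    # writes filter.fo
  fop -fo filter.fo -ps filter.ps       # or: fop -fo filter.fo -pdf filter.pdf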

The "fop" command is currently at release 0.20 (at least on my Gentoo
system). Running it produces a lot of errors and "not implemented
yet" warnings (even for a short test file), though it does succeed in
producing output. It is rather slow (it seems to be written in Java).

5. The output produced by xmlto->fo->fop->PostScript is lousy in a
number of ways:

(a) Problem: The hyphenation logic is poor. It seems to hyphenate
"unintelligently", that is, to use hypenation even a line is
pretty "tight" without it, and not considering the look of the
whole paragraph. The result is that you often get several
hyphenated lines in succession. Also, I suspect it is using
algorithmic hyphenation, which IMO gets it wrong too often.

Solution: I have turned hyphenation off.
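
In the stock DocBook XSL style-sheets the relevant knob seems to be the
"hyphenate" parameter, so the customization layer sketched above
(exim-fo.xsl) now contains something like this (the import path varies
between distributions):

  cat > exim-fo.xsl <<'EOF'
  <?xml version="1.0"?>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                  xmlns:fo="http://www.w3.org/1999/XSL/Format"
                  version="1.0">
    <!-- path to the stock FO style-sheets is distribution-dependent -->
    <xsl:import href="/usr/share/sgml/docbook/xsl-stylesheets/fo/docbook.xsl"/>
    <!-- turn hyphenation off -->
    <xsl:param name="hyphenate">false</xsl:param>
  </xsl:stylesheet>
  EOF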

(b) Problem: The pagination is also unintelligent. It is quite capable
of putting a section heading as the last line on a page. It also
generates "widow" and "orphan" lines frequently. The former (last
line of a paragraph - often a short line - on the first line of a
page) look particularly awful.

Solution: My custom configuration provides a means of forcing a
page break, but this is a hack, and is not nice from a maintenance
point of view, because these manual breaks have to be reviewed for
each edition.
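
For reference, the forced break is the usual processing-instruction
trick: the Asciidoc configuration emits a PI, and an extra template in
exim-fo.xsl turns it into a break (the PI name here is illustrative; it
is whatever the configuration is set up to emit). The fragment to add
inside the style-sheet:

    <xsl:template match="processing-instruction('hard-pagebreak')">
      <fo:block break-after="page"/>
    </xsl:template>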

I do not know of a solution to the widows and orphans problem.

(c) I tried to create tables without any lines as a means of
displaying information in a non-monospaced font but in fixed
positions on the page (this will be needed for the main
specification options definitions). I failed to persuade it not to
draw lines. Maybe I've just missed something here.
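
For what it's worth, this is the sort of markup I was generating (the
entries are placeholders); as far as I can see, the frame/colsep/rowsep
attributes ought to suppress the rules, but fop draws them anyway:

    <informaltable frame="none">
      <tgroup cols="2" colsep="0" rowsep="0">
        <tbody>
          <row>
            <entry>some_option</entry>
            <entry>what the option does</entry>
          </row>
        </tbody>
      </tgroup>
    </informaltable>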

(d) Having carefully set up Asciidoc to turn the letter sequence of
"f" followed by "i" into the Unicode value for the "fi" ligature
(because I care about these things), I found that this is not
recognized by "fop". I believe the problem lies in fop rather than in
xmlto, because the ligature comes out correctly in the HTML output. I
imagine that fop has some incomplete font tables or something, because
it manages the typographic quotes and the dashes OK.

Solution: I could pre-process the XML to remove the "fi" ligatures
before building PostScript and PDF. The output would be readable,
but not as nice.
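
That pre-processing would be trivial (file names invented):

  # sketch: strip U+FB01 ligatures before the PostScript/PDF build only
  sed 's/ﬁ/fi/g' filter.xml > filter-print.xml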

6. I am sorely tempted to write a script to turn the DocBook into a
format that I can process with SGCAL. This isn't as stupid as it
seems: it will keep me happy while I'm still around, while at the
same time holding the source in a standardized form that can be
processed in other ways in the future.

But first: I will investigate the problem of the main spec file. This
is *much* bigger, and has many more features that need to be dealt
with. I will report back in due course.

7. Lost facilities: It seems pretty certain that we will have to lose
the "change bars" feature: DocBook doesn't seem to have anything
suitable. Also, the HTML will not be in the current frame format,
though perhaps it can be massaged. Whatever happens, it is unlikely
that we can maintain the distinction between the options index and the
concepts index, as there's only one index facility in DocBook.

8. Texinfo: I have not investigated this yet. I understand there are
HTML->info converters. Let's hope one of them works. :-)

--
Philip Hazel            University of Cambridge Computing Service,
ph10@cus.cam.ac.uk      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Re: Status report: documentation conversion
Philip Hazel wrote:
>
> 7. Lost facilities: It seems pretty certain that we will have to lose
> the "change bars" feature: DocBook doesn't seem to have anything
> suitable.

I'm not a docbook expert, but can't RevisionFlag="Added" be used for that?

- Marc
Re: Status report: documentation conversion
On Fri, 11 Mar 2005, Marc Sherman wrote:

> I'm not a docbook expert, but can't RevisionFlag="Added" be used for that?

Thanks for making me aware of that feature. I'll look closer at it in due
course. At first glance, it seems that it might be useful for, for
example, whole paragraphs. It may not be possible just to flag up a
single line, as sometimes happens at present. (Of course, I'm assuming
that something can process it too...)
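
If it does turn out to be usable, I imagine the source would contain
something like this (an untested sketch):

    <para revisionflag="added">
      A paragraph that is new in this edition.
    </para>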

--
Philip Hazel            University of Cambridge Computing Service,
ph10@cus.cam.ac.uk      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Re: Status report: documentation conversion
Philip Hazel wrote:
>
> Thanks for making me aware of that feature. I'll look closer at it in
> due course. At first glance, it seems that it might be useful for,
> for example, whole paragraphs. It may not be possible just to flag up
> a single line, as sometimes happens at present. (Of course, I'm
> assuming that something can process it too...)

A quick googling discovered this, which might be helpful:
http://sources.redhat.com/ml/docbook-apps/2004-q3/msg00551.html

- Marc
Re: Status report: documentation conversion
On 2005-03-11 Philip Hazel <ph10@cus.cam.ac.uk> wrote:
[...]
> 3. The route for converting DocBook XML to plain text is to turn it into
> HTML, and then use the w3c browser to make text (w3c does a much better
> job than lynx, IMO).
[...]

Hello,
Should that read w3m instead of "w3c"?
cu andreas
--
http://downhill.aus.cc/
Re: Status report: documentation conversion
On Fri, 11 Mar 2005, Andreas Metzler wrote:

> > 3. The route for converting DocBook XML to plain text is to turn it into
> > HTML, and then use the w3c browser to make text (w3c does a much better
> > job than lynx, IMO).
> [...]
>
> Hello,
> Should that read w3m instead of "w3c"?

Yes! Oops. Thinko. Sorry.


--
Philip Hazel            University of Cambridge Computing Service,
ph10@cus.cam.ac.uk      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book