Mailing List Archive

Static html dump
Hello,

I just subscribed (I'm the wikipedia user At18) to ask about the automatic
html dump function. I see from the database page that it's "in
development".

If anyone is interested, I have a rudimentary Perl script that can
read the downloadable SQL dump and write out all the articles as
separate files in a number of alphabetical directories. It's not very
fast, but it works.
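
The core of it is just bucketing articles into directories by their
first letter; roughly like this (a simplified, untested sketch -- the
SQL parsing is glossed over, and names like write_article are only
illustrative):

    use strict;
    use warnings;
    use File::Path qw(mkpath);

    # Write one article into an alphabetical bucket, e.g. out/A/Apple.html
    sub write_article {
        my ($title, $text) = @_;
        my $bucket = uc substr($title, 0, 1);
        $bucket = '_' unless $bucket =~ /^[A-Z]$/;  # digits, punctuation, non-ASCII
        mkpath("out/$bucket") unless -d "out/$bucket";
        (my $file = $title) =~ tr/ /_/;             # no real filename cleanup yet
        open my $fh, '>', "out/$bucket/$file.html"
            or die "can't write $file: $!";
        print $fh $text;                            # still raw wikimarkup for now
        close $fh;
    }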

What's still missing from the script: wikimarkup -> HTML conversion,
some intelligence to autodetect redirects, dealing with images, and so
on. I don't know if someone is in charge of this function. If so, I can
post the script. Otherwise, I can develop it further myself, given some
directions.
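
Redirect detection, at least, should be cheap: as far as I can tell a
redirect is just an article whose text starts with #REDIRECT
[[Target]], so an (untested) check could be as simple as:

    # Return the redirect target if the article is a redirect, undef otherwise.
    sub redirect_target {
        my ($text) = @_;
        return $text =~ /^\s*#REDIRECT\s*\[\[([^\]|#]+)/i ? $1 : undef;
    }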

Alfio
Re: Static html dump
On Tuesday, 20 May 2003 at 03:53, Alfio Puglisi wrote:
> I just subscribed (I'm the wikipedia user At18) to ask about the
> automatic html dump function. I see from the database page that it's
> "in development".

Welcome!

> If anyone is interested, I have a rudimentary Perl script that can
> read the downloadable SQL dump and write out all the articles as
> separate files in a number of alphabetical directories. It's not
> very fast, but it works.
>
> What's still missing from the script: wikimarkup -> HTML conversion,
> some intelligence to autodetect redirects, dealing with images, and
> so on. I don't know if someone is in charge of this function. If so,
> I can post the script. Otherwise, I can develop it further myself,
> given some directions.

Cool! I don't think anyone's really actively working on this at the
moment, so if you'd like to, that would be great.

A few things to consider:

Last year someone started on a static HTML dump system with a hacked-up
version of the wiki code and some post-processing, but never quite
finished it up. I don't think he posted the code, but if you can get
ahold of him he may still have it available:
http://mail.wikipedia.org/pipermail/wikitech-l/2002-November/001292.html

There's also a partial, very experimental offline reader program which
sucks the data out of the dump files. This includes a simplified wiki
parser which, I believe, outputs HTML to use in the wxWindows HTML
viewer widget: http://meta.wikipedia.org/wiki/WINOR
This may be useful to you.

The latest revisions of the wikipedia code can cache the HTML output
pages, but it's not clear whether this would be easy to adapt for
purposes of generating static output.

A couple of the big questions that have come up before are:

* filenames -- making sure they can stay within reasonable limits on
common filesystems, keeping in mind that non-ascii characters and
case-sensitivity may be handled differently on different OSs, and there
may be stronger limits on filename lengths.

* search -- an offline search would be very useful for an offline
reader. JavaScript, Java, and local programs are all possibilities.

* size! with markup, header and footer text tacked onto every page, a
static html dump can be very large. The English wiki could at this
point approach or exceed the size of a CD-ROM without compression. Is
there a way to get the data compressed and still let it be accessible
to common web browsers accessing the filesystem directly? Less
important for a mirror site than a CD, perhaps.

* interlanguage links - it would be nice to be able to include all
languages in a single browsable tree, with appropriate cross-links.

-- brion vibber (brion @ pobox.com)
Re: Static html dump
On Tue, 20 May 2003, Brion Vibber wrote:

>
>Welcome!

Thanks!

>Cool! I don't think anyone's really actively working on this at the
>moment, so if you'd like to, that would be great.
>
>A few things to consider:
>
> [...]

Thanks, I'll take a look at the links.

>A couple of the big questions that have come up before are:
>
>* filenames -- making sure they can stay within reasonable limits on
>common filesystems, keeping in mind that non-ascii characters and
>case-sensitivity may be handled differently on different OSs, and there
>may be stronger limits on filename lengths.

I'll have to find some lowest common denominator. I have already run
into the upper/lower-case problem: it works in URLs and on Unix
machines, but not on Windows. I expect the problem of truncated
filenames to be similar.
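
One scheme I'm thinking about (just an idea, not what the script does
now) is to hex-escape everything outside a lowercase-only safe set, so
that two titles differing only in case can't collide on a
case-insensitive filesystem:

    # Map an article title to a filename that is safe on case-insensitive
    # filesystems: uppercase and non-ASCII characters get hex-escaped.
    sub title_to_filename {
        my ($title) = @_;
        $title =~ s/([^a-z0-9_\-])/sprintf("~%02x", ord($1))/ge;
        # Truncate to stay under common length limits; a real version would
        # need a disambiguating suffix here to avoid collisions.
        return substr($title, 0, 200) . ".html";
    }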

>* search -- an offline search would be very useful for an offline
>reader. JavaScript, Java, and local programs are all possibilities.

This would be hard to do without some sort of index file, at least for
article titles. We don't want the search app to scan an entire CD-ROM!
:-) I suspect that full-text search would be impossible (or deadly
slow) from CD, but quite possible from an "installed" version.
Searching article titles may be workable from CD.
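
For titles, a flat sorted index written at dump time might already be
enough; a small search tool could then binary-search it without
touching the rest of the disc. Something like this (rough sketch,
assuming a %articles hash of title => relative path):

    # Write a sorted "title <TAB> file" index for fast title lookups.
    sub write_title_index {
        my (%articles) = @_;
        open my $fh, '>', 'out/titles.idx' or die "can't write titles.idx: $!";
        for my $title (sort keys %articles) {
            print $fh "$title\t$articles{$title}\n";
        }
        close $fh;
    }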

>* size! with markup, header and footer text tacked onto every page, a
>static html dump can be very large. The English wiki could at this
>point approach or exceed the size of a CD-ROM without compression. Is
>there a way to get the data compressed and still let it be accessible
>to common web browsers accessing the filesystem directly? Less
>important for a mirror site than a CD, perhaps.

The header/footer overhead could be avoided by using frames, but
that's a less portable solution. I will investigate the compression
options.

>* interlanguage links - it would be nice to be able to include all
>languages in a single browsable tree, with appropriate cross-links.


I think I'll leave this for a future improvement... :-))

Ciao,
Alfio