After some delays and bug-hunting, my script for building the static
HTML versions is in acceptable shape.
Here you can see an example, built from an SQL dump from a few weeks
ago (don't try the Search box! I explain why below):
http://www.arcetri.astro.it/~puglisi/wiki/dump/ma/main_page.html
Please don't DoS the connection; it's not a very fast line.
Interested parties can find the script here:
http://www.arcetri.astro.it/~puglisi/wiki/wiki2static.txt
(renamed to .txt due to some server misconfig)
Use a wide terminal for this one. Everything (HTML code included) is in
a single file. The whitespace may look odd because I use 4-space tabs.
There's no need to tell me you don't like the coding style; I already
know :-)))
Some issues:
- the topbar links do not work (known bug :-). The Edit link goes to the
online Wikipedia site.
- interlanguage links are ignored
- some wiki markup is not recognized yet.
- no images are present (of course!)
- filenames should be OK for most filesystems that aren't "8.3"-limited
(max 63 chars, only a-z, 0-9 and underscore); see the filename sketch
after this list.
- despite the two-letter subdirectories, some of them contain over
4,000 files!
- Time: the script takes more than 2 hours on my 1.3 GHz Athlon...
- Size: this dump is about 800 MB (the tar.gz is just 110 MB). I think
I can bring it down to 600-650 MB with a bit of trimming and by
eliminating unnecessary redirects. BUT, without some form of
compression, the English Wikipedia will soon overflow a single CD.
Maybe we should target DVDs? :-)
- Images: no images are present here. AFAIK, each of them has an SQL
record (which my script skips), but the actual image data is not
included. How many megabytes of images do we have? I think it will be
impossible to store the full-size images on a CD; it's certainly
possible on a DVD. Maybe a low-res version could be included on a CD.
- Search: I tried a JavaScript search that worked well for small
databases: it's basically a big array of strings (article titles and
filenames) with a few lines of code that do a regexp match against
them (see the search sketch after this list). For a full-sized
database like this one, the search page becomes an 8-megabyte monster
that takes forever to process (IE grabs 100 MB of memory and stops
there; Opera is even worse). I'll see if I can find a different
solution.
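To make the filename rules above concrete, here is a rough sketch of
the title-to-filename mapping (in TypeScript, purely for illustration;
the function name and exact rules are my guess, not the actual code
from the script):

    // Sketch only: map an article title to a safe filename
    // (lowercase a-z, 0-9 and underscore, max 63 chars) plus the
    // two-letter subdirectory, as in dump/ma/main_page.html.
    function titleToFilename(title: string): string {
      const safe = title
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "_")   // collapse everything else to "_"
        .replace(/^_+|_+$/g, "")       // trim leading/trailing underscores
        .slice(0, 63);                 // respect the 63-character limit
      const subdir = safe.slice(0, 2); // two-letter subdirectory
      return subdir + "/" + safe + ".html";
    }

    // titleToFilename("Main Page") gives "ma/main_page.html"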
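For the curious, the JavaScript search idea is roughly the following
(a sketch in TypeScript with made-up entries; the real generated page
has one entry per article, which is where the 8 MB comes from):

    // Sketch only: one big array of [title, filename] pairs plus a
    // regexp scan over the titles.
    const index: [string, string][] = [
      ["Main Page", "ma/main_page.html"],
      ["Mathematics", "ma/mathematics.html"],
      // ...tens of thousands more entries in the generated page...
    ];

    function search(query: string): [string, string][] {
      const re = new RegExp(query, "i");  // case-insensitive match
      return index.filter(([title]) => re.test(title));
    }

    // search("math") returns [["Mathematics", "ma/mathematics.html"]]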
Enough for now. While I carry on with development, any input is
welcome.
Ciao,
Alfio