Mailing List Archive

Static HTML Wikipedia (CD-ROM)
There seems to be some interest in creating a static HTML distribution
(dump) of Wikipedia; most notably, it is requested on the
Wikipedia:Database_download page and in Feature Requests #596830 on
Sourceforge. This would allow people to download Wikipedia for use
offline, for example from a CD-ROM.

So, I have started work and made my initial version (English only)
available online for anyone on this list to evaluate and test. I am
looking for feedback, suggestions, bug reports and general comments.

http://www.rawlinson.ca:8080/wikipedia/index.html

Please do not attempt to mirror the site as my server and bandwidth won't
be able to handle it. The site is only intended for developers to try and
give feedback. Once everyone is happy with it I will make .tar and .iso
packages available for distribution.

At the moment, the method I use to create the static HTML version is very
lengthy in terms of processing time and requires a number of manual steps.
It takes my 1 GHz machine about 5 hours to generate all the pages.
Ideally, I'll have something more automated and efficient as time goes on.
My plan is to produce an updated static HTML version every few months or
at significant milestones.

I'm not sure how to distribute this static HTML version when it's ready
for a public release. Currently it's about 500 Meg in size (that includes
everything). As I mentioned above I have limited server resources. For
distribution maybe it could be put on the Sourceforge download page, or on
the Wikipedia.org server somewhere (/tarballs)?

Finally, since I am new to Wikipedia and this list, please excuse me while
I learn how things work around here. I am open to criticism, suggestions
and discussion. I am looking forward to working with everyone on
Wikipedia and contributing where I can.


Some Technical Details (for those interested):

- English only (currently)

- uses "printable" pages, no top or side navigation bars

- added links to home, back, copyright and Wikipedia.org to bottom of all
pages (TODO: if a talk page exists a link should be added)

- pages are stored in directories based on first two characters of MD5
hash, same as image storage scheme

- includes all namespaces (talk, users, users_talk, wikipedia_talk, etc.)

- created a list with links to all the items in each namespace to allow
for basic searching of page titles

- redirects replaced with direct link to article
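
As a rough illustration of the directory scheme in the list above, here is
a minimal Python sketch. It assumes the hash is taken over the UTF-8 page
title with spaces turned into underscores, and that the file is named after
the title with an ".html" suffix; the function name shard_path and the root
directory "wikipedia" are hypothetical, not the actual generator script.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(title: str, root: str = "wikipedia") -> PurePosixPath:
    """Return a relative path for a page, bucketed by MD5 prefix.

    Assumption: the bucket is the first two hex characters of the MD5
    digest of the underscore form of the title, giving 256 directories.
    """
    key = title.replace(" ", "_")
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / digest[:2] / (key + ".html")


if __name__ == "__main__":
    print(shard_path("Babel fish"))
```

Two hex characters keep each directory to roughly 1/256 of the pages,
which matters on filesystems that slow down with very large directories.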


Regards,
Steve Rawlinson
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Cool! I like it! Hope to see the CDs at Wal-Mart soon ;-)

A few minor thoughts:
* On every page, there should be a link to the online article, and to
the online edit.
* Images should link to the image text, like wikipedia does, not to the
image itself. One can save the image directly from the page.
* A nice CSS will make it look less "plain", without compromising
compatibility with older browsers. The CD could also have a recent
Mozilla installer on it ;-)
* There could be other index.html files with frames in them, where a
small frame on top, bottom, or side, could carry useful links (Main
Page, Copyrights, the "live" wikipedia, "live" search engine, etc.) and
maybe the logo.
* If you want statistics (date of last edit) at all, maybe add some more
("this was edited 123 times"). [Might need modification of the
"printable" function]
* Should non-existent links be displayed as links? I think so, because
people might feel tempted to write something ;-) [Might need
modification of the "printable" function]

There should also be a "real" offline search function. Shouldn't be too
hard to hack some smallish programs for Windows, Linux, maybe Mac.
Does anyone know if there's something like that out there?

We might also think of adding special "view" functions to the software
(like the "printable version"), to aid projects like this, and maybe
Larry's wikipedia sifter.

Magnus


Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
On Sat, 2002-11-16 at 18:36, Steve Rawlinson wrote:
> There seems to be some interest in creating a static HTML distribution
> (dump) of Wikipedia, most notably it is requested on the
> Wikipedia:Database_download page and in Feature Requests #596830 on
> Sourceforge. This would allow people to download the Wikipedia for use
> offline, for example from a CDROM.
>
> So, I have started work and made my initial version (English only)
> available online for anyone on this list to evaluate and test. I am
> looking for feedback, suggestions, bug reports and general comments.
>
> http://www.rawlinson.ca:8080/wikipedia/index.html

A cool beginning, thanks! :)

> I'm not sure how to distribute this static HTML version when it's ready
> for a public release. Currently it's about 500 Meg in size (that includes
> everything). As I mentioned above I have limited server resources. For
> distribution maybe it could be put on the Sourceforge download page, or on
> the Wikipedia.org server somewhere (/tarballs)?

I expect we could provide both a tarball and a static tree which could
be rsync'ed.

> Finally, since I am new to Wikipedia and this list, please excuse me while
> I learn how things work around here. I am open to criticism, suggestions
> and discussion. I am looking forward to working with everyone on
> Wikipedia and contributing where I can.
>
>
> Some Technical Details (for those interested):
>
> - English only (currently)

That will need to be fixed, of course! :)

> - uses "printable" pages, no top or side navigation bars

Could probably stand to be purtied up at least a little bit.

> - added links to home, back, copyright and Wikipedia.org to bottom of all
> pages (TODO: if a talk page exists a link should be added)

A link to that particular page on the live server would be a *very* good
idea. The regular printable pages include this.

> - pages are stored in directories based on first two characters of MD5
> hash, same as image storage scheme

Some things to think about as far as the actual filenames:
* Length. Wikipedia titles can, I think, get up to ~255 characters; this
may be too long for some systems.
* Acceptable characters. Colons, slashes, quotes, and various non-ascii
characters may appear in titles that cannot be reliably reproduced on
many filesystems. I notice that colons and commas at least are changed
to underscores, possibly some other characters too; conflicts may occur.
Non-ascii chars appear to be left intact; will this work consistently
across different filesystems which may be configured for different
character encodings?
* Case sensitivity. Many filesystems are not case sensitive; we may have
conflicts.
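
The three concerns above (length, unsafe characters, case) can be checked
mechanically before anything is written to disk. The following Python
sketch is only an illustration: the UNSAFE character set, the 64-character
limit and the helper names safe_name and find_conflicts are assumptions
chosen for the example, not what the current dump actually does.

```python
import unicodedata
from collections import defaultdict

# Hypothetical set of characters to replace with underscores.
UNSAFE = set('\\/:*?"<>|,')
# Hypothetical length limit for the name, excluding any extension.
MAX_LEN = 64


def safe_name(title: str) -> str:
    """Map a page title to a conservative filename stem."""
    normalized = unicodedata.normalize("NFC", title)
    cleaned = "".join(
        "_" if (ch in UNSAFE or ch.isspace()) else ch for ch in normalized
    )
    return cleaned[:MAX_LEN]


def find_conflicts(titles):
    """Group titles whose names would clash on a case-insensitive filesystem."""
    groups = defaultdict(list)
    for title in titles:
        groups[safe_name(title).casefold()].append(title)
    return {name: ts for name, ts in groups.items() if len(ts) > 1}


if __name__ == "__main__":
    sample = ["AT&T", "At&t", "Talk:Foo", "Talk,Foo", "Unique"]
    for name, clashing in find_conflicts(sample).items():
        print(name, "<-", clashing)
```

Running a check like this over the full title list would show how many
real conflicts exist before deciding whether a different naming scheme is
needed.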

> - includes all namespaces (talk, users, users_talk, wikipedia_talk, etc.)

User and talk pages are probably not necessary; if you're looking to
discuss the page, you'll be doing it on the live site where you can edit
it (and see the last 6 months' worth of edits which aren't on your
CD-ROM). And, of course, they take up a large chunk of valuable CD real
estate better devoted to future articles.

Thoughts?

> - created a list with links to all the items in each namespace to allow
> for basic searching of page titles

A simple JavaScript-based title search could probably be rigged up out
of that.

> - redirects replaced with direct link to article

Nice.

-- brion vibber (brion @ pobox.com)
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Magnus and Brion:

[I've merged both of your comments into one email. The message is a bit
long, but it keeps it all in one place. Note that M> and B> refer to
Magnus and Brion respectively.]

M> Cool! I like it! Hope to see the CDs at Wal-Mart soon ;-)
B> A cool beginning, thanks! :)

Thanks, I'm glad to see the idea receive such a positive response :)

M> * On every page, there should be a link to the online article, and to
M> the online edit.
B> A link to that particular page on the live server would be a *very* good
B> idea. The regular printable pages include this.

I will add links to the current online article, online edit, and offline
talk. The offline talk page will link to the online talk and online talk
edit. I think that provides access to the online version of everything.

B> User and talk pages are probably not necessary; if you're looking to
B> discuss the page, you'll be doing it on the live site where you can edit
B> it (and see the last 6 months' worth of edits which aren't on your
B> CD-ROM). And, of course, they take up a large chunk of valuable CD real
B> estate better devoted to future articles.

I have debated whether to include the User and Talk pages myself. As
space on the CD gets tight they'll probably be the first things to go.
However, I noticed that articles sometimes provide a link to the
corresponding Talk page. If the talk page was somehow relevant to the
article (which it probably shouldn't be) I wanted to make sure people had
access to it offline.

As for the User pages, I thought it was a good way to give credit to the
Wikipedia contributors. Also, many of the Wikipedia pages and Talk pages
link to User pages where people have made comments.

M> * Images should link to the image text, like wikipedia does, not to the
M> image itself. One can save the image directly from the page.

I'm not sure what happened there, I'll fix it so it behaves the same as
Wikipedia.

M> * A nice CSS will make it look less "plain", without compromising
M> compatibility with older browsers.
B> Could probably stand to be purtied up at least a little bit.

I agree, I'll try using a different CSS to improve things. My intention
is to keep a simple look and feel, but also to stay true to the Wikipedia
style.

M> * The CD could also have a recent Mozilla installer on it ;-)

My assumption is that everyone has an HTML browser on their machine or can
easily get one. Maybe I didn't get your joke?

M> * There could be other index.html files with frames in them, where a
M> small frame on top, bottom, or side, could carry useful links (Main
M> Page, Copyrights, the "live" wikipedia, "live" search engine, etc.) and
M> maybe the logo.

Good idea. Thanks for the sample you emailed me personally. I'll
create two interfaces, one frame based and the other with just the pages.

M> * If you want statistics (date of last edit) at all, maybe add some more
M> ("this was edited 123 times").

The number of edits could be useful to a person reading the article to
determine how "mature" the article is. I'll look into how easy it would
be to implement.

M> * Should non-existent links be displayed as links? I think so, because
M> people might feel tempted to write something ;-)

I agree that links to non-existent articles would be a good way to
encourage people to contribute, but it's at the expense of linking online
which might not be available to them. I have tried to keep the number of
links to online sources low so that offline users aren't frustrated by
numerous broken (in their mind) links. Maybe using the stub marker (like
an ! or *) after the topic would be best (I've seen this preference
setting somewhere). Using an "external" CSS tag with a different color,
as is done with the online version, would also help people distinguish.

B> Some things to think about as far as the actual filenames:

Some good observations about potential filename problems. I suppose I
could use something other than the title, like the unique cur_id integer,
but it wouldn't be as meaningful to people. Using eight characters (26
letters and 10 numbers) there are about 2.8 trillion possible filenames,
so each article could be mapped to a unique 8 character filename. I think
this would avoid all the length and conflict issues you've raised. Or
maybe 64 characters would be better since some meaning could be kept in
the name. I'll think about these issues and see what I can come up with.
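
To make the arithmetic concrete: with 36 symbols and 8 positions there are
36^8 = 2,821,109,907,456 possible names, which is the "about 2.8 trillion"
figure. The Python sketch below shows one way an integer such as cur_id
could be mapped to a fixed-width 8-character name; the choice of lowercase
letters, zero padding, and the function name id_to_name are assumptions
made for illustration only.

```python
import string

# 10 digits + 26 lowercase letters = 36 symbols (assumed alphabet).
ALPHABET = string.digits + string.ascii_lowercase
WIDTH = 8


def id_to_name(cur_id: int) -> str:
    """Encode a non-negative integer as a zero-padded base-36 string."""
    limit = len(ALPHABET) ** WIDTH
    if not 0 <= cur_id < limit:
        raise ValueError(f"id must be in range 0..{limit - 1}")
    chars = []
    n = cur_id
    for _ in range(WIDTH):
        n, remainder = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(len(ALPHABET) ** WIDTH)  # number of distinct 8-character names
    print(id_to_name(12345))
```

Because the mapping is a plain change of base, every distinct id gets a
distinct name, so no conflict check is needed; the cost is that the name
says nothing about the article.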

CD-ROM formats such as the Rock Ridge extensions and Joliet support longer
filenames (at least 64 characters), but plain ISO 9660 at its most
restrictive level (Level 1) only supports the 8+3 convention. I read that
Joliet also supports Unicode characters for international support. Perhaps
it would be enough to ensure the filenames conform to one of these CD-ROM
standards and leave the rest up to the OS?

The other thing to note about potential filename conflicts with my current
setup is that the MD5 hash directory structure provides some safety.
Even if two filenames conflicted (through removing invalid characters),
with the 256 subdirectories there is only a 0.4% chance they'll be put in
the same directory and actually conflict. It's not perfect, but it's not
very likely to happen.
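
The 0.4% figure is simply 1/256, rounded. A small Python sketch of that
reasoning follows; it assumes the directory is chosen from the MD5 hash of
the original (unsanitized) title and that hash prefixes are spread evenly,
and the function names are hypothetical.

```python
def same_dir_probability(hex_chars: int = 2) -> float:
    """Chance that two different titles land in the same hash directory."""
    buckets = 16 ** hex_chars  # 2 hex characters -> 256 directories
    return 1 / buckets


def expected_clashes(conflicting_pairs: int, hex_chars: int = 2) -> float:
    """Expected number of real overwrites among pairs with equal filenames."""
    return conflicting_pairs * same_dir_probability(hex_chars)


if __name__ == "__main__":
    print(f"{same_dir_probability():.4%}")
    print(expected_clashes(1000))
```

The per-pair chance is small, but it scales with the number of conflicting
pairs, so a large dump could still see a handful of overwritten pages.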

B> Non-ascii chars appear to be left intact; will this work consistently
B> across different filesystems which may be configured for different
B> character encodings?

I don't know much about different character encodings. I'll have to do
some research on this unless anyone here knows the answer?

M> There should also be a "real" offline search function. Shouldn't be too
M> hard to hack some smallish programs for Windows, Linux, maybe Mac.
B> A simple JavaScript-based title search could probably be rigged up out
B> of that.

I agree, there needs to be a "real" way to search the articles. My focus
at the moment is to get the HTML and layout correct. The JavaScript-based
search is a nice idea to provide some basic functionality.

One of my goals is to try and keep everything OS and browser independent.
This may not be possible for extra features like a search function, but I
think it should be the goal. It's also one of my goals to try and ensure
it works on the most basic setup that libraries, schools and third-world
countries might have.

M> We might also think of adding special "view" functions to the software
M> (like the "printable version"), to aid projects like this, and maybe
M> Larry's wikipedia sifter.

By software I assume you mean the Wikipedia code base. Doesn't the
current Skin object provide something like this already? If you want to
view things differently you just need to implement a new skin.

> > - English only (currently)
B> That will need to be fixed, of course! :)

It's on my todo list :) Once the English version is working, I'll try
some of the other language Wikipedias. I'm looking forward to learning
more about different character sets and internationalization.

B> > - redirects replaced with direct link to article
B>
B> Nice.

It was really fun to implement because I got to use recursion to follow
the redirects to the final page :) Hopefully, Wikipedia never has any
infinite redirection loops.
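
For what it's worth, a redirect loop can be detected cheaply by remembering
which titles have already been visited. The Python sketch below is an
iterative variant rather than the recursive approach described above, and
it assumes the redirects are available as a simple title-to-target
dictionary; the name resolve_redirect is hypothetical.

```python
def resolve_redirect(title: str, redirects: dict) -> str:
    """Follow redirects from title to the final page.

    redirects maps a redirect page title to its target title.
    Raises ValueError if the chain loops back on itself.
    """
    seen = set()
    current = title
    while current in redirects:
        if current in seen:
            raise ValueError(f"redirect loop involving {current!r}")
        seen.add(current)
        current = redirects[current]
    return current


if __name__ == "__main__":
    table = {"USA": "United States", "United States": "United States of America"}
    print(resolve_redirect("USA", table))
```

A page that is not a redirect is returned unchanged, so the same function
can be applied to every link target while generating the HTML.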

Thank you both Brion and Magnus for your comments. Anyone else with some
thoughts or ideas? I'll post my next revision when it's ready.

Best Regards,
Steve Rawlinson
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Steve Rawlinson wrote:
[cut]
> Thank you both Brion and Magnus for your comments. Anyone else with some
> thoughts or ideas? I'll post my next revision when it's ready.

Why should someone buy or use an encyclopedia on a CD-ROM and not use the
“normal way” online?

I suppose because they do not have Internet access, or only a very slow
and/or expensive connection.

Maybe those CD-ROM readers would also like to write or improve articles.
It would be nice if there were a way for offline Wikipedians to work on
Wikipedia.

Idea:
someone reads an article on the Wikipedia CD-ROM and wants to change
something. He clicks on “edit this page”. He gets a form that sends an
email to Wikipedia by way of his email program.

Something like: to: offlinewiki-request@wikipedia.org
subject: request [[Babel fish]]

He gets back an email with instructions on how to import the attachment
into the offline Wikipedia. After the article has been updated he can
modify it just like on the online Wikipedia. When he is done with the
article he selects “Send update to online Wikipedia”. The update is sent
back to Wikipedia by way of his email program, along with the revision
number of that article. If the online article has not changed in the
meantime, the update goes online directly. If it has changed, the update
goes to a queue where an online Wikipedian must check it.

Or something like that; you get the general idea.
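
The revision check at the end of this idea can be sketched in a few lines
of Python. This is only an illustration of the decision rule; the
OfflineEdit record, its field names, and the "publish" and "review_queue"
labels are hypothetical, and no such mail gateway exists in the Wikipedia
software.

```python
from dataclasses import dataclass


@dataclass
class OfflineEdit:
    """An edit received by email from an offline reader (hypothetical)."""
    title: str
    base_revision: int  # revision the offline user started from
    new_text: str


def route_edit(edit: OfflineEdit, current_revision: int) -> str:
    """Decide what to do with an emailed edit.

    If nobody has changed the article since the user fetched it, the
    edit can go live; otherwise a person has to merge it by hand.
    """
    if edit.base_revision == current_revision:
        return "publish"
    return "review_queue"


if __name__ == "__main__":
    edit = OfflineEdit("Babel fish", base_revision=7, new_text="...")
    print(route_edit(edit, current_revision=7))
    print(route_edit(edit, current_revision=9))
```
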
--
Contact: giskart AT wikipedia.be
Want to write an article too? Wikipedia, the free GNU/FDL encyclopedia
http://www.wikipedia.be
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Thanks for doing this!

Two little comments:

* The Back link on every page seems redundant: every browser has a back
button, but not every browser has javascript.
* The link to www.wikipedia.org on every page should be turned into a
link to [[Wikipedia]] where the offline user can find some information
about the project and the online user can find the URL.

Axel

Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Axel, thanks for the comments.

> * The Back link on every page seems redundant: every browser has a back
> button, but not every browser has javascript.

Now that I think about it some more, you're right. The Back link is
redundant and the JavaScript has the potential to cause problems (or not
work if JavaScript has been disabled). I'll remove it.

> * The link to www.wikipedia.org on every page should be turned into a
> link to [[Wikipedia]] where the offline user can find some information
> about the project and the online user can find the URL.

An excellent idea. My original reason for the link was to give people a
way to find the online Wikipedia. A link to [[Wikipedia]] will accomplish
this and provide more information about the project as you suggest.
Also, Magnus and Brion suggested adding a link to the current online
version of the article and a link to the online edit page, which should
also help people find the online version.

Thanks,
Steve
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
On 16-11-2002, Steve Rawlinson wrote thusly :
> There seems to be some interest in creating a static HTML distribution
> (dump) of Wikipedia, most notably it is requested on the
> Wikipedia:Database_download page and in Feature Requests #596830 on
> Sourceforge. This would allow people to download the Wikipedia for use
> offline, for example from a CDROM.
>
> So, I have started work and made my initial version (English only)
> available online for anyone on this list to evaluate and test. I am
> looking for feedback, suggestions, bug reports and general comments.
>
> http://www.rawlinson.ca:8080/wikipedia/index.html
>
> Please do not attempt to mirror the site as my server and bandwidth won't
> be able to handle it. The site is only intended for developers to try and
> give feedback. Once everyone is happy with it I will make .tar and .iso
> packages available for distribution.

Thanks for your work. I am looking forward to the "final" release.

I think this information should be passed on to the major Linux distro
makers. It might be a plus for them to bundle a free encyclopedia
in their distribution packs. This way Wikipedia would reach a very wide
audience.

Regards,
Kpjas.
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
On Sun, 17 Nov 2002, Giskart wrote:

> Why should someone buy or use an encyclopedia on a CD-ROM and not use the
> "normal way" online?

As you suggest, I expect the primary use for the static HTML will be for a
CD-ROM that people without Internet access (or with only expensive access)
could use. I think that schools and libraries might also greatly benefit
from it.

There are some other good reasons for doing this. Plain HTML format means
that anyone can easily put Wikipedia on their server or LAN for read-only
use. They won't need to install PHP, MySQL, configure the software,
import the database, etc.

Another possibility: if the updates to the static HTML were frequent
enough, others could mirror the site, which would reduce the load on
Wikipedia. Editing and new articles would still need to be done on the
main Wikipedia site, but read-only access could be distributed.

Finally, having lots of HTML copies of Wikipedia around also serves as a
backup. If for some unfortunate reason the Wikipedia server was ever
destroyed or taken offline, the articles (in HTML format) wouldn't be
lost.

It's also just neat to have your own copy of Wikipedia on CD :)

> Maybe those CD-ROM readers would also like to write or improve articles.
> It would be nice if there were a way for offline Wikipedians to work on
> Wikipedia.

I agree, the more people working on Wikipedia the better, but I think
having some type of network connection is absolutely necessary for
contributing to Wikipedia.

> Idea:
> someone reads an article on the Wikipedia CD-ROM and wants to change
> something. He clicks on “edit this page”. He gets a form that sends an
> email to Wikipedia by way of his email program.

An interesting idea, but I think if they have email then they are also
very likely to have web access and could submit changes in the "normal"
way online. I know of people who currently work this way because their
Internet connection is expensive. They only connect long enough to get
the most recent changes and upload any updates.

Best Regards,
Steve Rawlinson
Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
--- Steve Rawlinson <steve@rawlinson.ca> wrote:

> As you suggest, I expect the primary use for the static HTML
> will be for a CD-ROM that people without Internet access (or
> with only expensive access) could use.

It doesn't even have to be expensive: as soon as your internet access
is metered, and the clock is constantly ticking, your surfing habits
change drastically. You simply don't read a three page article about
Catherine the Great "just for fun" online in such a setting. And
metered internet access and metered telephone charges are the norm
throughout the world.

So I think the CDROM Wikipedia meets a great need. In fact, I think
many users in this situation would even want to download Wikipedia so
that they can read it afterwards in peace and quiet. For that, a
minimalistic version that cuts out User: and Talk: and is optimally
compressed would be desirable. Maybe even put the images in a separate
package. How small can you make it?

Axel



Re: Static HTML Wikipedia (CD-ROM) [ In reply to ]
Steve Rawlinson wrote:
[cut]
>>Idea:
>>someone reads an article on the Wikipedia CD-ROM and wants to change
>>something. He clicks on “edit this page”. He gets a form that sends an
>>email to Wikipedia by way of his email program.
>
>
> An interesting idea, but I think if they have email then they are also
> very likely to have web access and could submit changes in the "normal"
> way online. I know of people who currently work this way because their
> Internet connection is expensive. They only connect long enough to get
> the most recent changes and upload any updates.

I think there are people who have email but no WWW access. As for
connecting only to download and submit, and then disconnecting again:
connecting is the most expensive thing to do. It depends on your telecom
operator, of course, but I know from the past, when I used a modem and
not ADSL, that making the connection is very expensive.