Mailing List Archive

Read-only mirror
I'd like to set up a read-only mirror of wikipedia on ross.bomis.com,
and point a URL at it. Possibly we can think of a good automatic
redirection scheme or something. Well, I say "I'd like to...", but
actually that means "I'm willing to pay Jason to..."

Jason posted the other day about a zero budget, but that's not
*exactly* true. :-)

We have some loose hardware sitting around that we might as well press
into service.

Ideally, we would have a way to _automatically_ redirect requests to
the read-only site when the full site is dead.

If throwing hardware at the problem is likely to help, I'll do it.

--Jimbo
Re: Read-only mirror
Jimmy Wales wrote:

>If throwing hardware at the problem is likely to help, I'll do it.
>
>
I'm trying to picture that in my mind ;-)

Seriously: Something that *might* help not only with that problem, but
would likely reduce server load (and thus, crashes in the first place)
would be to run apache and mysql on different servers. IIRC, this is
suggested by both apache and mysql online manuals. Question is whether to
put mysql on the slower or the faster machine (assuming they're not
identical).

Some (third) machine could just mirror the apache server machine and
jump in if the need arises (=apache machine crashes); it could even have
read-write access. Also, no need for up-to-the-minute backups etc.
We'd only get a problem if the mysql machine dies :-(

Magnus
Re: Read-only mirror
Are there figures around of how much bandwidth and how
much traffic wikipedia produces?


Phil

Jimmy Wales wrote:
> I'd like to set up a read-only mirror of wikipedia on ross.bomis.com, and
> point a URL at it. Possibly we can think of a good automatic
> redirection scheme or something. Well, I say "I'd like to...", but
> actually that means "I'm willing to pay Jason to..."
>
> Jason posted the other day about a zero budget, but that's not
> *exactly* true. :-)
>
> We have some loose hardware sitting around that we might as well press
> into service.
>
> Ideally, we would have a way to _automatically_ redirect requests to the
> read-only site when the full site is dead.
>
> If throwing hardware at the problem is likely to help, I'll do it.
>
> --Jimbo
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@wikipedia.org
> http://www.wikipedia.org/mailman/listinfo/wikitech-l
Re: Read-only mirror
> Seriously: Something that *might* help not only with that problem, but
> would likely reduce server load (and thus, crashes in the first place)
> would be to run apache and mysql on different servers. IIRC, this is
> suggested by both apache and mysql online manuals. Question is whether to
> put mysql on the slower or the faster machine (assuming they're not
> identical).


Apache should never crash... if it does, something is wrong! I'd try to
install FreeBSD with Apache on the slower computer, and Red Hat Linux
with MySQL 4 on the faster one.
Even an old PII 400 is fast enough to serve several thousand users at the
same time. It just needs a lot of memory!

And, of course: is your internet connection bandwidth-limited? If so,
that adds a lot of load whenever traffic is near the maximum possible
bandwidth, so keep that in mind.


Phil
Re: Read-only mirror
Magnus Manske wrote:
> >If throwing hardware at the problem is likely to help, I'll do it.
> >

> I'm trying to picture that in my mind ;-)

It's fun to contemplate on days like today, when the server's
so unavailable.

> Question is whether to
> put mysql on the slower or the faster machine (assuming they're not
> identical).

Wikipedia is already on my best machine, so the second machine would
be slower. And at least for now, I'm thinking of using a machine
that currently has a few little websites on it, and those would need
to stay there for now.

--Jimbo
Re: Read-only mirror
Philipp W. wrote:
> Are there figures around of how much bandwidth and how
> much traffic wikipedia produces?

There are: http://www.wikipedia.org/stats/

But bandwidth, per se, is certainly not the issue. Our pipeline is
not even close to full, running Bomis and Wikipedia and everything
else.

--Jimbo
Re: Read-only mirror
On Thu, 6 Feb 2003, Magnus Manske wrote:
> Jimmy Wales wrote:
> >If throwing hardware at the problem is likely to help, I'll do it.
> >
> I'm trying to picture that in my mind ;-)
>
> Seriously: Something that *might* help not only with that problem, but
> would likely reduce server load (and thus, crashes in the first place)
> would be to run apache and mysql on different servers. IIRC, this is
> suggested by both apache and mysql online manuals.

If this is practicable, I fully support it.

> Question is whether to
> put mysql on the slower or the faster machine (assuming they're not
> identical).

Put mysql on the fast one -- it's the database that's our big bad.

> Some (third) machine could just mirror the apache server machine and
> jump in if the need arises (=apache machine crashes); it could even have
> read-write access. Also, no need for up-to-the-minute backups etc.
> We'd only get a problem if the mysql machine dies :-(

The apache machine is less likely to be a problem. My preference would be
something like:

* apache-only: Running apache, php script, image storage, TeX and other
incidentals
* mysql-only: MySQL with the database. That's about it. Maybe in the
future someday it could be postgresql-only. ;)
* backup: another machine running mysql as a slave to the main db
server, and maybe other stuff. Perhaps even apache for a read-
only web-accessible mirror. Takes a few gigs of space, but most
of the time it should be relatively low-traffic.

This backup machine would slurp in updates to the database, but would
still be runnable if no more updates come. New uploads could be copied
over occasionally. I'm uncertain what kind of granularity we can get with
MySQL's replication; can we leave out the user table (in whole or in part)
for instance? Not a big deal if it's just to another of Jimbo's machines,
but I'd be leery of shipping e-mail addresses and password hashes over the
internet to a third-party mirror site.

If we have a backup db and the main one failed (by crashing, by hard drive
failure, by act of god, or simply by getting overfull doing something and
screaming "TOO MANY CONNECTIONS!"), the main apache box could switch over
to use it and clamp into read-only mode. (This is much easier than
failover for the web server, which needs funky IP dealings or possibly a
very ugly DNS hack which is probably a bad idea.)
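
The switch-over described above could be sketched roughly as follows. This is
only an illustration in Python, not the actual wiki code: the host names, the
DbChoice type and the can_connect probe are all assumptions made up for the
example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DbChoice:
    host: str        # database host the web server should talk to
    read_only: bool  # True when edits must be refused


def choose_database(primary: str, backup: str,
                    can_connect: Callable[[str], bool]) -> Optional[DbChoice]:
    """Prefer the primary; fall back to the backup in read-only mode."""
    if can_connect(primary):
        return DbChoice(primary, read_only=False)
    if can_connect(backup):
        return DbChoice(backup, read_only=True)
    return None  # neither database is reachable


# Hypothetical host names; this probe simply pretends the primary is down.
choice = choose_database("db-main", "db-backup",
                         lambda host: host == "db-backup")
print(choice)
```

In a real setup the probe would attempt an actual connection and would treat
"too many connections" the same as a dead server.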

Note: I'm not sure how much bandwidth would be required just for database
traffic, or for updates. I'll check into that tonight. If it's within an
internal network, it shouldn't be a problem.

I don't know if Jimbo can spare two machines, though. (Anyone care to make
a donation? Not a tax write-off as the foundation isn't set up yet,
but if you're addicted to Wikipedia and your time is valuable to you...)

-- brion vibber (brion @ pobox.com)
Re: Read-only mirror
> over occasionally. I'm uncertain what kind of granularity we can get
> with MySQL's replication; can we leave out the user table (in whole or
> in part) for instance? Not a big deal if it's just to another of Jimbo's
> machines, but I'd be leery of shipping e-mail addresses and password
> hashes over the internet to a third-party mirror site.

MySQL supports SSL encryption, so this shouldn't be a problem.

> Note: I'm not sure how much bandwidth would be required just for
> database traffic, or for updates. I'll check into that tonight. If it's
> within an internal network, it shouldn't be a problem.

It's about twice the real data, but MySQL and SSH support compression!
Re: Read-only mirror
On Thu, 2003-02-06 at 20:38, Philipp W. wrote:
> > over occasionally. I'm uncertain what kind of granularity we can get
> > with MySQL's replication; can we leave out the user table (in whole or
> > in part) for instance? Not a big deal if it's just to another of Jimbo's
> > machines, but I'd be leery of shipping e-mail addresses and password
> > hashes over the internet to a third-party mirror site.
>
> MySQL supports SSL encryption, so this shouldn't be a problem.

That's only part of the problem; the other part is trusting the mirror
site to maintain privacy and security at least as well as we do on the
main server. I.e. there is an expectation that we will not give out (or
sell!) a user's e-mail address without their consent and knowledge. And
certainly it seems unsafe to toss password hashes around.

If we fully trust the mirrors with thousands of people's addys and
passwords, then no problem. If we let anyone mirror willy-nilly, then
I'm rather more concerned.

Hmm... http://www.mysql.com/doc/en/Replication_Options.html

There's an option "replicate-ignore-table" for the *slave*, but as far
as I can tell the smallest granularity we can get for the replicatable
data from the *master* end is "binlog-ignore-db". Unless we move
sensitive user data into its own database, I don't think we can exclude
it.
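
To make the two filtering levels concrete, here is a small Python sketch that
just renders the option lines in question. The option names are the ones
mentioned above; the database and table names ("wikidb", "user", "wikiusers")
are hypothetical placeholders, not the real schema.

```python
from typing import Iterable, List, Tuple


def slave_ignore_tables(tables: Iterable[Tuple[str, str]]) -> List[str]:
    """Per-table filter; only takes effect on the slave side."""
    return [f"replicate-ignore-table={db}.{table}" for db, table in tables]


def master_ignore_dbs(dbs: Iterable[str]) -> List[str]:
    """Per-database filter; the finest granularity on the master side."""
    return [f"binlog-ignore-db={db}" for db in dbs]


# Hypothetical names, for illustration only.
slave_lines = slave_ignore_tables([("wikidb", "user")])
master_lines = master_ignore_dbs(["wikiusers"])
print("\n".join(["# slave"] + slave_lines + ["# master"] + master_lines))
```

The slave-side line still lets the sensitive rows leave the master in the
binary log; only the master-side line keeps them from being shipped at all,
which is why a separate database would be needed.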

> > Note: I'm not sure how much bandwidth would be required just for
> > database traffic, or for updates. I'll check into that tonight. If it's
> > within an internal network, it shouldn't be a problem.
>
> It's about twice the real data, but MySQL and SSH support compression!

The binary update log from the latest server run (about 22:00 to 05:00)
is currently at 46,222,269 bytes. That's an average of about 6 megs per
hour, or 150 megs per day. Gzip compression should take that down a fair
bit. (This is for all languages combined.)

Hmm, note to self: don't gzip the *current* log file to see how big it
is; now updates to it are going into magical deleted file land.
Fortunately we don't need them yet. :) (Actually, I can free up a huge
load of disk space if I just dump the older binlogs; if we set up
replication it'll be from a clean dump.) In any case, it compresses to
about 7 megs, so we might estimate 24 megs per day as a minimum.
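
The arithmetic behind those estimates can be checked with a few lines of
Python. This sketch assumes the log window is exactly 7 hours (22:00 to 05:00)
and that "megs" means 2**20 bytes; both are assumptions, since the figures
above are approximate.

```python
RAW_LOG_BYTES = 46_222_269   # size of the binary update log
WINDOW_HOURS = 7             # assumed: exactly 22:00 to 05:00
COMPRESSED_MEGS = 7          # approximate gzip'd size of the same log
MEG = 2 ** 20                # assumed meaning of "meg"


def per_day(amount: float, hours: float) -> float:
    """Scale an amount observed over `hours` up to a 24-hour day."""
    return amount / hours * 24


raw_megs = RAW_LOG_BYTES / MEG
raw_per_hour = raw_megs / WINDOW_HOURS
raw_per_day = per_day(raw_megs, WINDOW_HOURS)
compressed_per_day = per_day(COMPRESSED_MEGS, WINDOW_HOURS)

print(f"raw: {raw_per_hour:.1f} megs/hour, {raw_per_day:.0f} megs/day")
print(f"compressed: {compressed_per_day:.0f} megs/day")
```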


As far as traffic between posited separate db and web servers at
Wikipedia Central:
| Bytes_received | 995768056 |
| Bytes_sent | 1471733779 |
| Uptime | 27260 |

That's about 90 KB per second on average. Should be fine over local network.
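
For reference, the average comes straight from the three status counters above.
A minimal Python check, assuming Uptime is in seconds and that both byte
counters cover that same period:

```python
BYTES_RECEIVED = 995_768_056
BYTES_SENT = 1_471_733_779
UPTIME_SECONDS = 27_260      # assumed unit: seconds

# Total traffic in both directions, averaged over the server's uptime.
total_bytes = BYTES_RECEIVED + BYTES_SENT
bytes_per_second = total_bytes / UPTIME_SECONDS

print(f"{bytes_per_second / 1000:.1f} thousand bytes per second")
```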

-- brion vibber (brion @ pobox.com)