Mailing List Archive

Software-HA
Tomas,

Phew, I was afraid that the list was completely dead. :) Sorry that this
is so long, but I wanna be as clear as possible as to how cool software
HA is and how important it is, and what I'd like to see done....

I'd love to be a part of this. I'll be happy to share my scripts, but
will be halfway embarrassed to do so, as they were written while I was
trying to understand this stuff last year, as opposed to after having a
clue about what was going on. I'm still no expert on the matter, but what
I've got now basically takes care of running a dozen multi-OS servers
and supports a few thousand users with little intervention on my part.
Still, these scripts are a bunch of isolated hacks and features that I
threw together while learning about this, so effectively, they're a
dead end. *sigh*

So what next? My best idea is a "super cron daemon". Not designed to
replace the user-level crond, this one would be responsible for just the
HA services of the system. Each service would say how many seconds
should elapse between checks and give a series of commands to execute.
Commands can be shell commands or Perl statements to evaluate, such as a
subroutine to perform a full check on a remote httpd (check host via
ICMP, check service via socket, check for valid output via socket comms;
depending on state, wait a certain amount of time and take corrective
action). Safety features would include:
* The daemon forking on startup, with both parts watching and restarting
the other as necessary on failure.
* Timeout checks on everything to prevent hangs during a crisis.
* Intelligent child reaping to prevent the program from clogging the
system during problems.
* Communication with a common database to maintain state.
* Standardized error logging for easy log filtering and paging.
* A single-instance mode to control one application (ie. monitor a
running netscape, stop it when it hangs and restart it when it crashes).

The database and function integration are critical to prevent the pager
from being flooded with irrelevant or redundant data, 'cause when a
server crashes I often get pages from each pager-equipped server
notifying me that the server is toast and that any services only it had
are gone -- 20 pages a minute are not uncommon when the router crashes,
because all services and servers seem to vanish to the watchers. The
daemon should also maintain an authenticated interface to turn services
on and off, or modify params. Also detect if the network is hosed, and
don't do anything until it's back -- again, our router crashes
periodically, and the El-Cheapo(TM) network cards in lab PCs are known
to flood the network badly enough that hosts can't talk to each other;
it's ugly when multiple hosts all think they're the primary webserver.
Also, make the process able to start or reconfigure itself -- I've got
one host watching critical hosts on the network, and a second watching
the first guy ... if the first guy dies, I'm blind and deaf until I fix
the first one ... I want the backup to reconfigure itself and start
watching critical hosts itself.
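
To make the idea concrete, here's a minimal sketch of what the
scheduling core of such a daemon might look like in Perl -- the service
names, intervals and check routines are made-up placeholders, and the
forking watchdog pair and database hooks are left out:

  #!/usr/bin/perl -w
  # Sketch of a "super cron" scheduling loop: each service declares a
  # check interval (seconds) and a check routine; every check runs under
  # alarm() so a hung check can't stall the daemon.
  use strict;

  my %services = (
      httpd => { interval => 60, check => \&check_httpd, last => 0 },
      nfs   => { interval => 60, check => \&check_nfs,   last => 0 },
  );

  while (1) {
      my $now = time;
      for my $name (keys %services) {
          my $s = $services{$name};
          next if $now - $s->{last} < $s->{interval};
          $s->{last} = $now;
          my $ok = eval {
              local $SIG{ALRM} = sub { die "timeout\n" };
              alarm(30);               # never let one check hang everything
              my $r = $s->{check}->();
              alarm(0);
              $r;
          };
          if    (!defined $ok) { warn "$name check died: $@" }
          elsif (!$ok)         { warn "$name is down\n" }  # corrective action here
          # state changes would be recorded in the common database here
      }
      sleep 1;
  }

  sub check_httpd { 1 }   # placeholder: real check connects to port 80
  sub check_nfs   { 1 }   # placeholder: real check reads an NFS file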

Most events are triggered by monitoring things in these categories:
* Local host - resources like RAM, load average, disk space.
* Local files - existence, content, modifications and appends.
* Local services - running processes and state.
* Remote hosts - via ICMP ping.
* Remote services - via socket connection to the service, checking for a
correct response to a query (see the sketch after this list).
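
For the last category, a minimal sketch of a socket check against an
httpd, assuming IO::Socket is available; the host name is an example:

  #!/usr/bin/perl -w
  # Connect to a remote httpd, send a request, and verify the reply
  # looks sane -- host down, service down, and garbage output all fail.
  use strict;
  use IO::Socket::INET;

  sub check_http {
      my ($host) = @_;
      my $sock = IO::Socket::INET->new(
          PeerAddr => $host, PeerPort => 80, Timeout => 10,
      ) or return 0;                         # connect failed: it's down
      print $sock "HEAD / HTTP/1.0\r\n\r\n";
      my $reply = <$sock> || '';
      close $sock;
      return $reply =~ m{^HTTP/\d\.\d 200};  # valid output via socket?
  }

  print check_http('www.example.com') ? "up\n" : "down\n";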

Samples of this in action (I've actually done all of these; a sketch of
the disk check follows the list):
* Local disk fillage - every minute, df the interesting filesystems (ie.
not the jaz drive) and report if the percent free or minimum megs free
is too small; dump older files from /tmp and /var; page the sysadmin for
help.
* Remote NFS server - every minute, see if an NFS-mounted file can be
read -- retry for 30 seconds. If the NFS server is rebooting (determined
via SNMP or a secret port status check), keep checking and advise the
admin of the status; else attempt to restart the remote server (and
reboot it if that fails).
* HTTPD server process failover - the primary server checks the process
list for httpd and connects each minute to see if it's responding like
it's supposed to, and acts immediately; it also checks its virtual
servers every few minutes and delays action by 5 minutes (in case the
system's rebooting). The backup system watches the primary host and
httpd; if the host dies for more than a minute, it grabs its IPs and
fires up httpd. When the old primary is back up, it notices that there's
a new primary server up and goes into backup watch mode.
* IP sponge - soaks unused IPs onto spare network interfaces. Prevents
assholes from stealing your IPs or putting up systems without approval
on vital subnets.
* Mirroring - the backup server mirrors the primary; when it dies, the
backup grabs the IPs the primary served, starts its services and becomes
the new primary. I've got this working in read-only mode currently, as I
can't guarantee that the backup's disk has the most recent copy of the
primary's data; but as far as nfs, web, and ftp go, the backup ensures
that read-only copies of files are accessible when the main data
repository is down.
* Executive failure report - haven't been able to do this yet.
Basically, the system determines the priority of failures and only
reports significant ones. Ie. it'll simply report that a host is down,
instead of saying it's down and that its ftp, mail and whatever servers
aren't responding as separate pages.
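
As a flavor of what these scripts look like, here's a minimal sketch of
the disk-fillage check in Perl; the mount points and thresholds are
examples, and the cleanup/paging actions are stubbed out:

  #!/usr/bin/perl -w
  # df the interesting filesystems and complain when they get too full.
  use strict;

  my %watch = ( '/var' => 90, '/tmp' => 95 );   # mount point => max %used

  open(DF, "df -k |") or die "can't run df: $!";
  while (<DF>) {
      my @f = split;
      next unless @f >= 6 and $f[4] =~ /^(\d+)%$/;
      my ($used, $mount) = ($1, $f[5]);
      if (exists $watch{$mount} and $used > $watch{$mount}) {
          # the real script dumps old files and pages the sysadmin here
          warn "$mount is ${used}% full (limit $watch{$mount}%)\n";
      }
  }
  close DF;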

Although the stuff above seems complicated, it's all been done and is
possible. There's far greater benefit from doing this basic software HA
than from rewriting the kernel, drivers and programs. Downtime due to
software, system, service, and basic admin errors is vastly more
common than hardware errors. Unmonitored servers with few, busy admins
can be down half the time; well-monitored servers manned by sharp crews
(lotsa dough for babysitting) can hit 95% availability during monitoring
hours, with plenty of midnight crashes (bad for Internet sites); my
present software HA brings 99.97% availability* on a 24-hour basis
(~3 min downtime per server, per week: 3 of the week's 10080 minutes is
about 0.03% downtime), and reduces admins to doing "real work" instead
of babysitting. The most that hardware HA could provide on top of that
is an additional 0.03% availability (that's the remaining 3 minutes a
week) -- so is an additional 3 minutes a week worth all the work? Only
if it significantly reduces dataloss, or there's a REALLY good reason
the server can't be down for 3 minutes a week, is hardware HA important.

The 99.97% quoted was achieved over 3 months of intense use by
thousands of users of mail, nfs, ftp, cgi, websites, paging, databases,
and other services, including our fifteen-person tech team (none of whom
do Unix) who use the server consoles as X terminals for their daily
work. The monitoring period includes dying hardware (disks, network
cards, and controller boards), and a kernel bug that caused NFS-mounted
directories to become unavailable after a certain number of mounts.
Since I stopped using amd, the servers run for a month before needing
downtime (so far they've only been rebooted to physically move them and
to re-register IPs after a really bad router crash), so downtime is
almost exclusively due to conditions beyond my control.

By turning these scripts into one flexible and extendable program, you
get a single self-healing process run across multiple hosts to create a
RAISE (Redundant Array of Inexpensive SystEms, coined by me) -- as in
what your boss and clients will give you. The Sequents, Suns, and HPs
that cost millions to run and monitor are down 20 to 40 times as much as
my small stack of Linux boxes doing failover, which cost a total of $15k
to buy and only require maintenance for infrequent reboots and swapping
out dying disks ... total downtime is less than 20 minutes a week for
all half dozen servers. A platform-independent system makes it possible
for other architectures to be thrown into the mix to maximize
availability and speed. I've got new Sun 450s coming, and Sun Ultras and
SGI workstations being decommissioned, that are going to be added to
this fray. Programming this in Perl means faster development and
easier-to-write extensions -- so more people benefit and contribute.
Also, this stuff isn't compute-intensive or time-critical -- a 30-second
delay is irrelevant when the network dies for at least that long once a
day.

Hardware HA is important, but will only make sense once software HA is
established. No hardware HA can deal with the main reasons that systems
are unavailable -- it's not going to restart your webserver and notify
you that it died, nor fail over your ftp sites to a working system, nor
tell you that your mail queue is full. By combining hardware and
software HA you create an unbeatable combination: a trio of Linux boxes
costing $15k (one for NFS, RAID & vitals, two for service failover) is
up far more, requires less maintenance, and gets a lot more done than
the finest clusters that cost many dozens of times as much.

-igal

> I will probably start developing something similar to what you did in
> the very near future. I guess you've done a lot of work already, so I
> think it'd be intelligent to either have a look at your system to
> learn from it and write my own, or to take yours and expand/adapt it.
>
> I don't know how far I should go in explaining my own ideas here and
> if I should lead this discussion if it ever evolves on the ha-list.
>
> Well, tell me if you're interested, and then we can discuss further
> how we could get the thing off the ground.
Software-HA [ In reply to ]
Thanks much Louis! Based on the features, this is exactly what I was
looking for! I'm eager to rip this thing to pieces and see what it
does.... :)

-igal

Louis Mandelstam wrote:
>
> On Wed, 8 Oct 1997, Igal Koshevoy wrote:
>
> > Phew, I was afraid that the list was completely dead. :) Sorry that this
> > is so long, but I wanna be as clear as possible as to how cool software
> > HA is and how important it is, and what I'd like to see done....
>
> I don't have time to read your message in full right now, but my eyes
> caught some comments in your message which lead me to believe you may
> want to look at the following URL -
>
> http://consult.ml.org/~trockij/mon
>
> Regards
>
> ---------------------------------------------------------------|-----|--
> Louis Mandelstam             Tel +27 83 227-0712    Symphony    /|\   /|\
> Linux systems integration    http://sr.co.za        Research    { }   { }
> Johannesburg, South Africa   mailto:louis@sr.co.za  (Pty)Ltd   {___} {___}
Software-HA [ In reply to ]
Igal Koshevoy (igal@mail.irn.pdx.edu) wrote:
>
> Phew, I was afraid that the list was completely dead. :) Sorry that this
> is so long, but I wanna be as clear as possible as to how cool software
> HA is and how important it is, and what I'd like to see done....

I am sorry for being so quiet, but it seems I am buried under job and
family. We are expecting our third child any day now. I had hoped to be
able to work on the project during my vacation -- no luck. AFAICS I'll
not have the time to merge the new suggestions into the HOWTO soon.

*sigh* if only the day had 48 hours.
Software-HA [ In reply to ]
Well I'm sorry my Re: is so short, but I'm trying hard not to drown in
work now, so just this:

On Thu, 9 Oct 1997, Linas Vepstas wrote:

> 2) disk mirroring between machines,

http://www.coda.cs.cmu.edu/

*t

--------------------------------------------------------------------------------
Tomas Pospisek's mailing-lists mailbox
www.SPIN.ch - Internet Services in Graubuenden/Switzerland
--------------------------------------------------------------------------------
Tom: Anyway, winter is coming; I think I'll get myself a few more
Tom: Zyxel transformers...
Roli: But get enough of them. Once the things break, they don't
Roli: get warm anymore.
Software-HA [ In reply to ]
On Fri, 10 Oct 1997, T's Mailing Lists wrote:

> Well I'm sorry my Re: is so short, but I'm trying hard not to drown in
> work now, so just this:
>
> On Thu, 9 Oct 1997, Linas Vepstas wrote:
>
> > 2) disk mirroring between machines,
>
> http://www.coda.cs.cmu.edu/

Have a look at Uniq's UPFS product. It's currently not offered on
Linux, but provides interesting food for thought.

http://www.uniq.com.au/

--
Martin Pool
Software-HA [ In reply to ]
Linas,

Before I start typing, I emphasize that none of this stuff provides zero
downtime, nor ought it be depended on for vital systems (ie. life
support, flight control). This will not provide the security of hardware
HA. The structures I will present will seem pretty lame compared to
hardware solutions and complex (read: expensive) zero-downtime systems;
still, their importance is underscored because this approach (1) is
possible with existing, off-the-shelf hardware, (2) requires relatively
little coding, (3) deals with the primary cause of downtime, and (4)
will remain useful even alongside hardware HA solutions.

Based on my experience, detection and running action scripts are simple
-- I use action scripts to grab IPs & services, and to alert me of
what's happening. They're basically nothing more than loops that check
to see if the status has changed (ie. a host or service died), and do
things upon change (see the sketch below). Preventing TCP connections
from dropping would be most cool -- one of the structures I'll discuss
deals with this, but may be too ambitious for the moment.
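
A minimal sketch of such a loop, assuming a Linux-ish ping and ifconfig
(flags vary by platform); the address, interface alias and service
command are examples, and paging/logging are stubbed out:

  #!/usr/bin/perl -w
  # Watch the primary; on an up->down transition grab its IP and start
  # the service, and on down->up give the address back.
  use strict;

  my $primary = '10.0.0.10';
  my $was_up  = 1;

  while (1) {
      my $is_up = system("ping -c 1 $primary >/dev/null 2>&1") == 0;
      if ($was_up and not $is_up) {
          system("ifconfig eth0:1 $primary up");   # take over the IP
          system("/usr/sbin/httpd");               # fire up the service
          # page the admin and record the state change here
      }
      elsif ($is_up and not $was_up) {
          system("ifconfig eth0:1 down");          # primary is back
      }
      $was_up = $is_up;
      sleep 15;
  }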

I would discourage the tk/tcl interface because there aren't enough
systems you can use it on. I've been using a web-based interface
because I can use it anywhere I can telnet from, which makes me feel
pretty safe. Doing the interface twice would seem redundant.

> infrastructure & mirroring
Here are some software HA setups I've used or considered:

1. Duplication - two identical systems, with the spare constantly
mirroring the disk. When the master dies, the spare grabs the IPs and
starts the necessary services; no users or services that write to the
disk are allowed -- read the mirroring section below to learn why. I've
got this setup now and don't like it. When the master dies, someone's
gotta get the dead system working and transfer services back to it.
That's bad, because until the master data server's up, no one can log in
or write to their files.

2. Offloading - one system is an exclusive fileserver with a RAID; other
systems mount its disks, have IPs, and run services. A dedicated
fileserver won't fail often, certainly not as frequently as systems full
of logged-in users and services; when it dies, it gets powercycled by an
X10 power switch triggered by the machines hanging off it; if it doesn't
come up, a hardware failure is assumed and someone needs to attach its
RAID connector to another box and turn that into the exclusive
fileserver -- this takes however many seconds it takes to boot and fsck.
As for the systems dangling off it, when a system dies, its brethren
grab its services and IPs in first-come, first-served order. This is
what I'm moving towards, because it requires no mirroring -- the data is
always up to date, and I don't have to drop everything to fix the dead
server. If the fileserver dies, then I'm back to what I'm doing with the
duplication structure above.

3. Proxy offloading - same as above, but a proxy server holds all the
IPs and listens on all ports; it watches the dangling hosts and proxies
requests to them. Based on my understanding, this could prevent dropped
connections completely, since every connection is supervised by the
proxy. However, I've not seen software that could do this and I'm not
qualified to write it (although I'd love to learn), so I've not touched
this. This is as close to zero downtime as is possible in a
software-only scenario, since the only downtime will be caused by the
proxy server dying.
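
For what it's worth, the relay part isn't that much code -- here's a
sketch of a forking TCP proxy in Perl that tries backends in order. The
addresses and ports are examples, and real use would need smarter
backend selection and error handling:

  #!/usr/bin/perl -w
  # Accept on the service port, connect to the first live backend, and
  # shuttle bytes both ways with one child per direction.
  use strict;
  use IO::Socket::INET;

  $SIG{CHLD} = 'IGNORE';      # auto-reap finished connection handlers
  my @backends = ('10.0.0.10:80', '10.0.0.11:80');   # primary, spare

  my $listen = IO::Socket::INET->new(
      LocalPort => 8080, Listen => 10, ReuseAddr => 1,
  ) or die "can't listen: $!";

  while (my $client = $listen->accept) {
      if (fork) { close $client; next }     # parent keeps accepting
      my $server;
      for my $b (@backends) {               # first live backend wins
          last if $server = IO::Socket::INET->new(PeerAddr => $b,
                                                  Timeout  => 5);
      }
      exit unless $server;
      if (fork) { relay($client, $server) }   # client -> backend
      else      { relay($server, $client) }   # backend -> client
      exit;
  }

  sub relay {
      my ($from, $to) = @_;
      my $buf;
      syswrite($to, $buf) while sysread($from, $buf, 4096);
  }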

> mirroring
I do mirroring with rsync and ssh; however, since this doesn't happen in
real time and takes a while, the backup is not current enough to be used
in write-through mode after a failure without losing data that didn't
get transferred over. Solutions to this are either to keep the data on a
shared repository (RAID disk, NFS server, etc.) or to create a driver
that'll write to both local and remote file systems simultaneously.
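
The mirroring pass itself is just rsync over ssh run from cron; a
minimal sketch, with example paths and hostname:

  #!/usr/bin/perl -w
  # Push each tree to the spare: -a preserves permissions/times/links,
  # --delete keeps the mirror exact, -e ssh encrypts the transfer.
  use strict;

  my @trees = qw(/home /var/spool/mail /usr/local/www);
  my $spare = 'backup.example.com';

  for my $tree (@trees) {
      my $rc = system('rsync', '-a', '--delete', '-e', 'ssh',
                      "$tree/", "$spare:$tree/");
      warn "rsync of $tree failed\n" if $rc;
  }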

> database mirroring
As long as your database writes to disk regularly, mirroring will save
you. In the offloading setups, when a failover happens, the new server
checks the database for errors, autofixes them, and starts the engine.
True database mirroring requires rewriting the databases, and the
clients would have to attempt a reconnect. Oracle's top DB does this,
but I'm sure MySQL, Postgres95 or SOLID could be adapted.

> network ARP takeover
If our Cisco is down, no one can talk to anyone else and all our network
connections are gone ... this is the ultimate SPOF. Any ideas on dealing
with this?

-igal

PS: I've contacted the author of Mon to see if we can collaborate to
merge the best components of our systems.
Software-HA [ In reply to ]
> > network ARP takeover
> If our Cisco is down, no one can talk to anyone else and all our network
> connections are gone ... this is the ultimate SPOF. Any ideas on dealing
> with this?

Two routers, with two ethernet cards in each. Router 1 advertises a good
default route to the hosts over both LANs. The metrics on the LANs are
weighted so one LAN is favoured. All hosts run routed and pick up a
default route dynamically.

When a LAN fails, the routes switch LANs in about 2 minutes. If the
router fails, then router 2, which is doing the same thing but with a
worse metric, will kick in and, as the other router is down, become the
best route.

In theory you can also set the two routers to exchange BGP4 with each
other and have a line each to different providers via different telcos,
balancing traffic normally but switching over in a failure case.
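
On the host side this is tiny -- assuming a BSD-style routed, something
like:

  # Listen to RIP advertisements from both routers without supplying
  # any routes ourselves; the kernel then follows the best metric.
  routed -q

(Check routed(8) on your platform; flags and any /etc/gateways syntax
vary between implementations.)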

Alan