Tomas,
Phew, I was afraid that the list was completely dead. :) Sorry that this
is so long, but I wanna be as clear as possible as to how cool software
HA is and how important it is, and what I'd like to see done....
I'd love to be a part of this. I'll be happy to share my scripts, but
will be halfway embarrassed to do so as they were written while I was
trying to understand this stuff last year, as opposed to after having a
clue of what was going on. I'm still no expert on the matter, but what
I've got now basically takes care of running a dozen multi-OS servers
and supports a few thousand users with little intervention on my part.
Still, these scripts are a bunch of isolated hacks and features that I
threw together while learning about this, so effectively, they're a
dead end. *sigh*
So what next? My best idea is a "super cron daemon". Not designed to
replace the user-level crond, this one would be responsible for just the
HA services of the system. Each service would specify how many seconds
should elapse between checks and a series of commands to execute.
Commands can be shell commands or Perl statements to evaluate, such as a
subroutine to perform a full check on a remote httpd (check host via
ICMP, check service via socket, check for valid output via socket comms;
depending on state, wait a certain amount of time and take corrective
action). Safety and control features would include:
* The daemon forks on startup, with both parts watching each other and
restarting the other as necessary on failure.
* Timeout checks on everything to prevent hangs during a crisis.
* Intelligent child reaping to prevent the program from clogging the
system during problems.
* Communication with a common database to maintain state.
* Standardized error logging for easy log filtering and paging.
* Single-instance mode to control one application (e.g. monitor a
running netscape, stop it when it hangs and restart it when it crashes).
* An authenticated interface to turn services on and off, or modify
params.
* Detection of a hosed network, doing nothing until it's back -- again,
our router crashes periodically and the El-Cheapo(TM) network cards in
lab PCs are known to flood the network badly enough that hosts can't
talk to each other; it's ugly when multiple hosts all think they're the
primary webserver.
* The ability to start or reconfigure itself -- I've got one host
watching critical hosts on the network, and a second watching the first
guy ... if the first guy dies, I'm blind and deaf until I fix it ... I
want the backup to reconfigure itself and start watching the critical
hosts itself.
The database and function integration are critical to prevent the pager
from being flooded with irrelevant or redundant data, 'cause when a
server crashes I often get pages from each pager-equipped server
notifying me that the server is toast and any services that only it had
are gone -- 20 pages a minute are not uncommon when the router crashes,
because all services and servers seem to vanish to the watchers.
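To make the scheduling idea a bit more concrete, here's a minimal sketch
of the "interval plus series of commands" loop. It's in Python rather
than Perl purely for illustration, and the names (Service, run_check,
run_due, main_loop) are made up for this example, not taken from my
existing scripts. It only covers the per-service interval and the
timeout on shell commands; the forking watchdog, database and paging
parts are left out.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, List, Union

# A check is either a shell command string or a callable returning True on success.
Check = Union[str, Callable[[], bool]]


@dataclass
class Service:
    name: str
    interval: float        # seconds that should elapse between checks
    checks: List[Check]    # run in order; the first failure stops the series
    timeout: float = 10.0  # per-command limit so a hung command cannot stall the loop
    next_run: float = 0.0  # timestamp of the next scheduled run


def run_check(check: Check, timeout: float) -> bool:
    """Run one check; a timeout or a non-zero exit status counts as failure."""
    if callable(check):
        # Note: callables are trusted to enforce their own timeouts in this sketch.
        return bool(check())
    try:
        result = subprocess.run(check, shell=True, timeout=timeout,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def run_due(services: List[Service], now: float) -> List[str]:
    """Run every service whose interval has elapsed; return the names that failed."""
    failed = []
    for svc in services:
        if now < svc.next_run:
            continue
        svc.next_run = now + svc.interval
        # all() short-circuits, so later checks are skipped after a failure.
        if not all(run_check(c, svc.timeout) for c in svc.checks):
            failed.append(svc.name)
    return failed


def main_loop(services: List[Service], on_failure: Callable[[str], None],
              tick: float = 1.0) -> None:
    """Poll forever, handing each failed service name to on_failure."""
    while True:
        for name in run_due(services, time.monotonic()):
            on_failure(name)
        time.sleep(tick)
```

A corrective action would be plugged in through on_failure; keeping
run_due separate from the endless loop makes the scheduling logic easy
to test without waiting on a real clock.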
Most events are triggered by monitoring things in these categories:
* Local host - resources like RAM, load average, disk space.
* Local files - existence, content, modifications and appends.
* Local services - running processes and state.
* Remote hosts - via ICMP ping.
* Remote services - via socket connection to service, checks for correct
response to query.
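As an illustration of the "remote services" category, here's a rough
sketch of a socket check: connect, send a query, and see whether the
reply contains what's expected. It's Python for illustration only, and
the function name check_remote_service and its parameters are
assumptions of this example. The ICMP ping step is left out because raw
sockets normally need root.

```python
import socket


def check_remote_service(host, port, query, expected, timeout=5.0):
    """Return True if host:port answers query with a reply containing expected.

    query and expected are bytes. Only the first chunk of the reply
    (up to 4096 bytes) is inspected, which is enough for a banner or
    a status line.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            if query:
                sock.sendall(query)
            reply = sock.recv(4096)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    return expected in reply
```

For a web server the call might look like
check_remote_service("www.example.com", 80, b"HEAD / HTTP/1.0\r\n\r\n", b"HTTP/1."),
where the host name is just a placeholder.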
Samples of this in action (I've actually done all but the last):
* Local disk fillage - every minute, df interesting filesystems (i.e. not
jaz drive) and report if percent free or min megs free is too small;
dump older files from /tmp and /var; page sysadmin for help.
* Remote NFS server - every minute, see if NFS mounted file can be read
-- retry for 30 seconds. If NFS server is rebooting (via SNMP or secret
port status check) keep checking and advise admin of status; else
attempt to restart remote server (and reboot it if that fails).
* HTTPD server process failover - the primary server checks the process
list for httpd and connects each minute to see if it's responding like
it's supposed to, and acts immediately on failure; it also checks its
virtual servers every few minutes and delays action by 5 minutes (in
case the system's rebooting). The backup system watches the primary host
and httpd; if the host dies for more than a minute, it grabs the
primary's IPs and fires up httpd. When the old primary is back up, it
notices that there's a new primary server up and goes into backup watch
mode.
* IP sponge - soaks unused IPs onto spare network interfaces. Prevents
assholes from stealing your IPs or putting up systems without approval
on vital subnets.
* Mirroring - the backup server mirrors the primary; when the primary
dies, the backup grabs the IPs the primary served, starts its service
and becomes the new primary. I've got this working in read-only mode
currently, as I can't guarantee that the backup's disk has the most
recent copy of the primary's data; but as far as nfs, web, ftp go, the
backup ensures that read-only copies of files are accessible when the
main data repository is down.
* Executive failure report - haven't been able to do this yet.
Basically, the system determines the priority of failures and only
reports significant ones. I.e. it'll simply report that a host is down,
instead of sending separate pages saying that it's down and that its
ftp, mail and whatever servers aren't responding.
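Since the executive failure report is the one piece that doesn't exist
yet, here's a minimal sketch of how the suppression could work. This is
a guess at a design, not something running anywhere: the function name
summarize_failures, its inputs, and the message wording are all
assumptions. It also assumes that when every watched host goes silent
at once, the watcher's own network (e.g. the router) is the likelier
culprit, so it sends one page instead of dozens.

```python
def summarize_failures(failed_hosts, failed_services, watched_hosts):
    """Collapse raw check failures into the few pages worth sending.

    failed_hosts:    set of host names that don't answer ping
    failed_services: set of (host, service) pairs whose check failed
    watched_hosts:   set of every host being monitored
    """
    failed_hosts = set(failed_hosts)
    watched_hosts = set(watched_hosts)

    # Assumption: if all of several watched hosts vanish together,
    # report a network problem rather than one page per host.
    if len(watched_hosts) > 1 and failed_hosts >= watched_hosts:
        return ["network unreachable: all %d watched hosts are silent"
                % len(watched_hosts)]

    pages = ["host down: %s" % host for host in sorted(failed_hosts)]
    # A dead host implies its services are dead too, so only report
    # services whose host is still answering.
    pages += ["service down: %s on %s" % (service, host)
              for host, service in sorted(failed_services)
              if host not in failed_hosts]
    return pages
```

With web1 down and smtp failing on a live mail1, this yields two pages
(one for the host, one for the service) no matter how many of web1's
own services also failed their checks.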
Although the stuff above seems complicated, it's all been done and is
possible. There's far greater benefit from doing this basic software HA
than from rewriting the kernel, drivers and programs. Downtime due to
software, system, service, and basic admin errors is far more common
than downtime due to hardware errors. Unmonitored servers with few, busy
admins can be down half the time; well monitored servers manned by sharp
crews (lotsa dough for babysitting) can hit 95% availability during
monitoring hours with plenty of midnight crashes (bad for Internet
sites); my present software HA brings 99.97% availability* on a 24-hour
basis (~3 min downtime per server, per week) and frees admins to do
"real work" instead of babysitting. The most that hardware HA could add
is the remaining .03% availability (that's 3 minutes a week) -- so is an
additional 3 minutes a week worth all the work? If it significantly
reduces data loss or there's a REALLY good reason the server can't be
down for 3 minutes a week, then yes, hardware HA is important.
The 99.97% quoted was achieved over 3 months of intense use by
thousands of users of mail, nfs, ftp, cgi, websites, paging, databases,
and other services, including our fifteen-person tech team (none do
Unix) who use the server consoles as X terminals for their daily work.
The monitoring period includes dying hardware (disks, network cards, and
controller boards), and a kernel bug that caused NFS-mounted directories
to become unavailable after a certain number of mounts. Since I've
stopped using AMD, the servers run for a month before needing downtime
(so far only rebooted to physically move them and to re-register IPs
after a really bad router crash), so downtime is almost exclusively due
to conditions beyond my control.
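For anyone who wants to check availability figures against downtime
budgets, the conversion is simple arithmetic over the 10080 minutes in a
week. A small sketch (Python, with function names made up for this
example):

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def availability_percent(downtime_minutes_per_week):
    """Percentage of the week a server is up, given its weekly downtime."""
    return 100.0 * (1.0 - downtime_minutes_per_week / MINUTES_PER_WEEK)


def downtime_minutes_per_week(availability_pct):
    """Weekly downtime in minutes implied by an availability percentage."""
    return MINUTES_PER_WEEK * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # 95% availability leaves about 504 minutes (8.4 hours) of downtime a week.
    print(downtime_minutes_per_week(95.0))
    # Availability implied by roughly 3 minutes of downtime a week.
    print(availability_percent(3.0))
```

The same two functions work for any per-server figure, so it's easy to
see what an extra "nine" actually buys in minutes.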
By turning these scripts into one flexible and extensible program, a
single self-healing process can run across multiple hosts to create a
RAISE (Redundant Array of Inexpensive SystEms, coined by me), as in what
your boss and clients will give you. The Sequents, Suns, and HPs that
cost millions to run and monitor are down 20 to 40 times as much as my
small stack of Linux boxes doing failover, which cost a total of $15k to
buy and only require maintenance for infrequent reboots and swapping out
dying disks ... total downtime is less than 20 minutes a week for all
half dozen servers. A platform-independent system makes it possible for
other architectures to be thrown into the mix to maximize availability
and speed. I've got new Sun 450s coming, and Sun Ultras and SGI
workstations being decommissioned that are going to be added to this
fray. Programming this in Perl means faster development and extensions
that are easier to write -- so more people benefit and contribute. Also,
this stuff isn't compute intensive or time critical -- 30 seconds is
irrelevant when the network dies for that long at least once a day.
Hardware HA is important, but will only make sense once software HA is
established. No hardware HA can deal with the main reasons that systems
are unavailable -- it's not going to restart your webserver and notify
you that it died; nor fail over your ftp sites to a working system; nor
tell you that your mail queue is full. By combining hardware and
software HA you create an unbeatable combination: a trio of Linux boxes
costing $15k (one for NFS, RAID & vitals, two for service failover) is
up far more, requires less maintenance, and gets a lot more done than
any of the finest clusters that cost many dozens of times as much.
-igal
> I will start probably start developing something similar as you did in
> the very near future. I guess you've done a lot of work allready, so I
> think it'd be intelligent to either have a look at your system to
> learn from it and write my own, or to take yours and expand/adapt it.
>
> I don't know how far i should go in explaining my own ideas here and
> if i should lead this discussion if it ever evolves in the ha-list.
>
> Well, tell me if you're interested, and then we can discuss further
> how we could get the thing off the ground.