Thanks to everyone who gave advice on their two-tier solutions (nice ASCII
art!).
A two-tier solution with a single front-end, or a redundant dedicated load
balancer front-end, would solve one thing for us: when one of our
individual servers goes down, round-robin DNS still directs people to the
downed server. We've been hesitant to adopt a two-tier solution because
the second tier would also require redundancy, and thus we'd simply be
moving the problem up a level. However, we have some graphic content on
our site, so the second tier could potentially be useful for caching
static content as well, and that might outweigh other considerations.
Wackamole in conjunction with the top tier might be nice, but our switches
have this interesting tendency to learn MAC addresses and then not forget
about them--making it difficult to relocate an IP address to another
machine on-the-fly.
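For what it's worth, the usual workaround for stale switch MAC tables is to broadcast gratuitous ARPs right after the address moves. A minimal sketch, assuming the iputils `arping` tool is available; the interface and address below are placeholders, not values from our setup:

```shell
# Sketch of a failover helper: after moving a virtual IP to this machine,
# broadcast gratuitous ARPs so switches and neighbors refresh their tables.
# Assumes iputils `arping`; interface/address are placeholders.
send_gratuitous_arp() {
    vip="$1"; iface="$2"
    cmd="arping -U -c 3 -I $iface $vip"   # -U: unsolicited (gratuitous) ARP
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"                       # print instead of sending
    else
        $cmd
    fi
}

# Example (dry run): DRY_RUN=1 send_gratuitous_arp 192.0.2.10 hme0
```

Whether the switches honor the gratuitous ARP depends on their configuration, of course.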
Also (and this is unrelated to my original post), wackamole wouldn't help
in a particular case I've run into twice lately (although a load-balancing
switch with some monitoring capabilities would), where the Apache server
goes belly-up with the following sequence of errors.
The instigator appears to be a segfault in some child Apache process (not
the moderator process itself):
[Tue Aug 27 12:37:42 2002] [notice] child pid 5765 exit signal
Segmentation Fault (11)
[Tue Aug 27 12:37:42 2002] [warn] long lost child came home! (pid 5765)
Subsequently, every Apache process goes into a loop with this repeated
error:
[Tue Aug 27 12:38:00 2002] [error] Child 6281 failed to establish
umbilical to moderator!
Then the server itself remains running, but load climbs to ridiculous
levels and the web application becomes unusable.
It then becomes necessary to reboot the server; restarting the Apache
process works initially, but the same problem recurs instantly if an
"apachectl graceful" is executed in order to get the counting right on the
backhand status page. This is true even if the UnixSocketDir, all .pid files,
any shared memory segments and all Apache processes are cleaned up.
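For reference, the cleanup between restarts looks roughly like this. A sketch only: the field positions assume the Solaris-style `ipcs -s` output shown later in this message, and the paths are placeholders for the actual UnixSocketDir and pid file; the destructive steps are commented out.

```shell
# Print the IDs of semaphores owned by a given user, reading `ipcs -s`
# output on stdin (Solaris layout: T ID KEY MODE OWNER GROUP).
sem_ids_for_owner() {
    awk -v owner="$1" '$1 == "s" && $5 == owner { print $2 }'
}

# Destructive cleanup steps, commented out here (paths are placeholders):
# apachectl stop
# rm -f /usr/local/apache/logs/httpd.pid
# rm -rf /var/backhand/sockets/*          # the UnixSocketDir contents
# ipcs -s | sem_ids_for_owner nobody | while read id; do ipcrm -s "$id"; done
# pkill -u nobody httpd
```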
My hunch is that there's something about the use of shared memory on
Solaris that's causing a problem with mod_backhand; our Apache binary has
a number of other things compiled in including mod_ssl--perhaps something
is in conflict. However, I'm having trouble tracking the problem down. The
only thing I could find in the FAQ related to the use of shared memory was
to move the User and Group directives up in httpd.conf; but that seems not
to be the problem here, even though these are Solaris machines. An
interesting twist is that prior to my latest reboots, "ipcs -a" showed
something similar to this on all machines:
T ID    KEY       MODE        OWNER  GROUP  CREATOR CGROUP NSEMS OTIME    CTIME
Semaphores:
s 65536 0xc0deb00 --ra------- nobody nobody root    other  3     14:58:02 14:53:45
Now they all show this:
s 65536 0xc0deb00 --ra------- nobody nobody root    root   3     14:58:02 14:53:45
So the Creator Group for the semaphore has changed somehow after my last
reboot--although I'm the only one with a login on these systems, and
haven't applied any patches that would change shared memory behavior. The
UnixSocketDir has these permissions:
drwx------ 2 nobody nobody
which should be correct (the Apache user and group are nobody:nobody as
well). Could it be that different shared memory permissions on machines in
the same backhand cluster were causing the problem? They're all in sync
now--though I can't determine why. I thought I'd float this by the list as
well to see if anyone had run into it. Ideas?
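In case anyone wants to check their own cluster, here's roughly the comparison I did, sketched as a script. It assumes ssh access to each machine and the `ipcs -a` column layout shown above (CGROUP as the 8th field of an "s" line); the hostnames are placeholders.

```shell
# Sketch: compare the semaphore creator group (CGROUP) across cluster hosts.
# Assumes the Solaris `ipcs -a` layout shown above, where CGROUP is the 8th
# field of a semaphore ("s") line; hostnames are placeholders.
sem_cgroup() {
    awk '$1 == "s" { print $8 }'
}

# for h in www1 www2 www3; do
#     printf '%s: ' "$h"
#     ssh "$h" ipcs -a | sem_cgroup
# done
```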
James Ervin
ATN Messaging Systems
UNC-Chapel Hill
On Tue, 27 Aug 2002, Theo Schlossnagle wrote:
> James Blackwell wrote:
>
> >Though this only partially solves your problem (for example, it reduces
> >redundancy to a large extent)... this is what we do. We have a single
> >"dumb" front end machine that doesn't serve itself, which then talks to
> >multiple backend boxes that do the real work and feed it back. Perhaps
> >some ascii art is called for::
> >
> > Internet
> > |
> > **************
> > * Dumb Server*
> > **************
> > |
> > | Local Network
> > |
> > |
> > -----------------------
> > | | |
> > ******** ******** ********
> > *Serv 1* *Serv 2* *Serv 3*
> > ******** ******** ********
> >
> >So forth and so on. The redundancy isn't as much of an issue as it would
> >seem; one can use simple high-availability scripts to take over if the
> >"dumb" front-end server fails.
> >
> >
> We do the same thing for one of our clients. But we run two front-end
> "dumb servers" each running Apache+mod_ssl+mod_backhand and wackamole
> for failover. It works like a charm. You turn one machine off and
> there is no noticeable service interruption.
>
> One of the biggest benefits in this particular client's architecture is
> that the developers can log in and restart the back-end servers (which was a
> requirement) and the SSL keys are completely self-contained on the
> front-end machines and only the System Admins have access to those
> machines via shell. The developers couldn't get at the keys even if they wanted to!
>
> This is very useful if you have very "heavy" servers on the back-end.
> If you have "thin" servers throughout, I am not sure of the advantage
> of tiering your architecture. One immediate downside is that you can
> only support the number of concurrent connections that your front tier
> supports. If that is an issue, there are many interesting things you
> can do to have a workable two-tier solution, but I think it is much
> healthier to first ask yourself why a single tier doesn't work.
>
> --
> Theo Schlossnagle
> 1024D/82844984/95FD 30F1 489E 4613 F22E 491A 7E88 364C 8284 4984
> 2047R/33131B65/71 F7 95 64 49 76 5D BA 3D 90 B9 9F BE 27 24 E7