Mailing List Archive

two-tier solution discussion & a Solaris-related question
Thanks to everyone who gave advice on their two-tier solutions (nice ASCII
art!).

A two-tier solution with a single front-end, or a redundant dedicated load
balancer front-end, would solve one thing for us: when one of our
individual servers goes down, round-robin DNS still directs people to the
downed server. We've been hesitant to adopt a two-tier solution because
the second tier would also require redundancy, and thus we'd simply be
moving the problem up a level. However, we have some graphic content on
our site, so the second tier could potentially be useful for caching
static content as well, and that might outweigh other considerations.
Wackamole in conjunction with the top tier might be nice, but our switches
have this interesting tendency to learn MAC addresses and then not forget
about them--making it difficult to relocate an IP address to another
machine on-the-fly.
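
The round-robin DNS failure mode described above is what a monitoring front end addresses: it probes each back end and only uses the ones that answer. A minimal shell sketch, assuming curl is installed and that a plain HTTP GET of / is an adequate health check. The function names and host names are illustrative and are not part of mod_backhand or wackamole:

```shell
#!/bin/sh
# Hypothetical health probe: succeeds if the host answers an HTTP GET of /
# within 3 seconds. Assumes curl is available on the monitoring machine.
probe() {
    curl -fsS -o /dev/null --max-time 3 "http://$1/"
}

# Print only the hosts (given as arguments) that pass the probe.
live_backends() {
    for host in "$@"; do
        if probe "$host"; then
            echo "$host"
        fi
    done
}

# Example (host names are placeholders):
#   live_backends web1.example.com web2.example.com
```

Keeping the probe in its own function makes it easy to swap in a stricter check, such as fetching a known application page instead of /.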

Also (and this is unrelated to my original post), wackamole wouldn't help
in a particular case I've run into twice lately (although a load-balancing
switch with some monitoring capabilities would), where the Apache server
goes belly-up with the following sequence of errors.

The instigator appears to be a segfault in some child Apache process (not
the moderator process itself):

[Tue Aug 27 12:37:42 2002] [notice] child pid 5765 exit signal Segmentation Fault (11)
[Tue Aug 27 12:37:42 2002] [warn] long lost child came home! (pid 5765)

Subsequently, every Apache process goes into a loop with this repeated
error:

[Tue Aug 27 12:38:00 2002] [error] Child 6281 failed to establish umbilical to moderator!

Then the server itself remains running, but load climbs to ridiculous
levels and the web application becomes unusable.
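
Because the server stays up while looping on this error, a plain "is the port open" check will not catch it. One way to detect the condition is to count the umbilical errors in recent log output. This is a rough sketch, not anything shipped with mod_backhand; the function names, the default threshold of 10, and the log path in the example are all assumptions:

```shell
#!/bin/sh
# Count "failed to establish umbilical" errors in Apache error-log text
# read from standard input. grep -c exits non-zero when the count is 0,
# so "|| true" keeps the function's exit status clean.
count_umbilical_errors() {
    grep -c 'failed to establish umbilical' || true
}

# Succeed (exit 0) when the count on standard input reaches the
# threshold given as the first argument (default: 10).
umbilical_loop_detected() {
    threshold=${1:-10}
    count=$(count_umbilical_errors)
    [ "$count" -ge "$threshold" ]
}

# Example (the log path is a placeholder):
#   tail -n 200 /usr/local/apache/logs/error_log \
#       | umbilical_loop_detected 20 && echo "backhand umbilical loop"
```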

It then becomes necessary to reboot the server; restarting the Apache
process works initially, but the same problem recurs instantly if an
"apachectl graceful" is executed in order to get the counting right in the
backhand webpage. This is true even if the UnixSocketDir, all .pid files,
any shared memory segments, and all Apache processes are cleaned up.

My hunch is that there's something about the use of shared memory on
Solaris that's causing a problem with mod_backhand; our Apache binary has
a number of other things compiled in including mod_ssl--perhaps something
is in conflict. However, I'm having trouble tracking the problem down. The
only thing I could find in the FAQ related to the use of shared memory was
to move the User and Group directives up in httpd.conf, but that seems not
to be the problem, even though these are Solaris machines. An interesting
twist to the problem is that prior to my latest reboots, "ipcs -a" would
show something similar to this on all machines:

T  ID     KEY        MODE         OWNER   GROUP   CREATOR  CGROUP  NSEMS  OTIME     CTIME
Semaphores:
s  65536  0xc0deb00  --ra-------  nobody  nobody  root     other   3      14:58:02  14:53:45

Now they all show this:

s  65536  0xc0deb00  --ra-------  nobody  nobody  root     root    3      14:58:02  14:53:45

So the Creator Group for the semaphore has changed somehow after my last
reboot--although I'm the only one with a login on these systems, and
haven't applied any patches that would change shared memory behavior. The
UnixSocketDir has these permissions:

drwx------ 2 nobody nobody

which should be correct (the Apache user and group are nobody:nobody as
well). Could it be that different shared memory permissions on machines in
the same backhand cluster were causing the problem? They're all in sync
now--though I can't determine why. I thought I'd float this by the list as
well to see if anyone had run into it. Ideas?
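
To check whether the machines in a cluster really agree on semaphore ownership, the relevant columns can be pulled out of the ipcs listing and compared. A small sketch, assuming the Solaris column order shown in the listing above (T ID KEY MODE OWNER GROUP CREATOR CGROUP ...); the function name and the host names in the example are made up:

```shell
#!/bin/sh
# Read "ipcs -a" output on standard input and print, for each semaphore
# row (first field "s"), the key, owner, group, creator and creator group.
# Header and section lines are skipped because their first field differs.
sem_owners() {
    awk '$1 == "s" { print $3, $5, $6, $7, $8 }'
}

# Example (host names are placeholders): collect one summary per machine,
# then compare the files with diff.
#   for h in web1 web2 web3; do
#       ssh "$h" ipcs -a | sem_owners > "/tmp/sem.$h"
#   done
#   diff /tmp/sem.web1 /tmp/sem.web2
```

Other platforms print ipcs columns in a different order, so the field numbers would need adjusting there.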

James Ervin
ATN Messaging Systems
UNC-Chapel Hill




On Tue, 27 Aug 2002, Theo Schlossnagle wrote:

> James Blackwell wrote:
>
> >Though this only partially solves your problem (for example, it reduces
> >redundancy to a large extent)... this is what we do. We have a single
> >"dumb" front end machine that doesn't serve itself, which then talks to
> >multiple backend boxes that do the real work and feed it back. Perhaps
> >some ascii art is called for:
> >
> >                Internet
> >                    |
> >             **************
> >             * Dumb Server*
> >             **************
> >                    |
> >                    |  Local Network
> >                    |
> >                    |
> >         -----------------------
> >         |          |          |
> >     ********   ********   ********
> >     *Serv 1*   *Serv 2*   *Serv 3*
> >     ********   ********   ********
> >
> >So forth and so on. The redundancy isn't as much of an issue as it would
> >seem; one can use simple high availability scripts to back up if the
> >"dumb" frontend server fails.
> >
> >
> We do the same thing for one of our clients. But we run two front-end
> "dumb servers" each running Apache+mod_ssl+mod_backhand and wackamole
> for failover. It works like a charm. You turn one machine off and
> there is no noticeable service interruption.
>
> One of the biggest benefits in this particular client's architecture is
> that the developers can log in and restart the back-end servers (which
> was a requirement) and the SSL keys are completely self-contained on the
> front-end machines and only the System Admins have access to those
> machines via shell. The developers can't get at the keys even if they
> wanted to!
>
> This is very useful if you have very "heavy" servers on the back-end.
> If you have "thin" servers throughout, I am not sure of the advantage
> of tiering your architecture. One immediate downside is that you can
> only support the number of concurrent connections that your front tier
> supports. If that is an issue, there are many interesting things you
> can do to have a workable two-tier solution, but I think it is much
> healthier to first ask yourself why a single tier doesn't work.
>
> --
> Theo Schlossnagle
> 1024D/82844984/95FD 30F1 489E 4613 F22E 491A 7E88 364C 8284 4984
> 2047R/33131B65/71 F7 95 64 49 76 5D BA 3D 90 B9 9F BE 27 24 E7
>
two-tier solution discussion & a Solaris-related question [ In reply to ]
James,

We ran into the ARP caching problem with our Cisco switches and
routers as well. However, I dug around a little bit and found a
small program called SendArp. This lets you send ARP requests
using any source and destination you need.

Grab the source here:
http://www.abiogenesis.com/jrs/sendarp/

We use the following couple lines of shell script to send
an ARP request when we take over a down IP:

---
# set up external virtual IPs
# casesladder.com / igl.net
ifconfig eth0:110 12.129.199.226 broadcast 12.129.199.224 netmask 255.255.255.0

# grab our local MAC address (fifth field of the HWaddr line)
IGL_MAC=$(ifconfig eth0:110 | awk '/HWaddr/ { print $5 }')

# send an ARP packet to the router to let it know where we are
# (run sendarp directly; wrapping it in backticks would try to
# execute its output as a second command)
sendarp -t 2 -p 12.129.199.226 -h "$IGL_MAC" -P 12.129.199.225
---
Notes:

12.129.199.225 is our router IP. Repeat the sendarp line for each
additional switch or router that needs the new MAC info, changing
only the destination IP.

Using this method, takeover takes about 1.5 seconds for the ARP
to update and traffic to start flowing.
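
The "repeat this line" note can be folded into a loop over destinations. The sketch below reuses the sendarp flags exactly as they appear in the script above; the function name, the DRY_RUN variable, and the second destination address in the example are illustrative only:

```shell
#!/bin/sh
# Announce a taken-over IP to several routers/switches with sendarp.
# usage: announce_ip VIRTUAL_IP MAC DEST_IP [DEST_IP ...]
# With DRY_RUN=1 the commands are printed instead of executed.
announce_ip() {
    vip=$1
    mac=$2
    shift 2
    for dest in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sendarp -t 2 -p $vip -h $mac -P $dest"
        else
            sendarp -t 2 -p "$vip" -h "$mac" -P "$dest"
        fi
    done
}

# Example (the second destination is a placeholder):
#   announce_ip 12.129.199.226 "$IGL_MAC" 12.129.199.225 12.129.199.230
```

Running once with DRY_RUN=1 is a cheap way to confirm the destination list before a real takeover.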

Hope this helps,
Jeremy

___________________________________________
Jeremy 'Case' Rusnak
President & Founder
Case's Ladder, Inc. http://www.igl.net/
___________________________________________



-----Original Message-----
From: backhand-users-admin@lists.backhand.org
[mailto:backhand-users-admin@lists.backhand.org] On Behalf Of James
Ervin
Sent: Tuesday, August 27, 2002 12:38 PM
To: backhand-users@lists.backhand.org
Subject: [m_b_users] two-tier solution discussion & a Solaris-related
question



_______________________________________________
backhand-users mailing list
backhand-users@lists.backhand.org
http://lists.backhand.org/mailman/listinfo/backhand-users
two-tier solution discussion & a Solaris-related question [ In reply to ]
Jeremy Rusnak wrote:

>James,
>
>We ran into the ARP caching problem with our Cisco switches and
>routers as well. However, I dug around a little bit and found a
>small program called SendArp. This lets you send ARP requests
>using any source and destination you need.
>
>
>
If you are using wackamole, this should all be transparent. Wackamole
has integrated ARP spoofing support that is quite a bit more powerful.
It collects the IPs in the ARP caches of each of the clustered machines
and sends a couple of ARP responses to each IP in the aggregate pool.
Of course, you can additionally specify explicit IP addresses to send
responses to as well.
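
To picture the "aggregate pool" idea outside of wackamole, the following sketch merges ARP cache listings from several machines into one de-duplicated list of IPs. It is not wackamole's implementation; it assumes "arp -an" output in which each entry shows the IP in parentheses, and the function and host names are hypothetical:

```shell
#!/bin/sh
# Read one or more "arp -an" listings on standard input and print the
# unique IP addresses found in parentheses, one per line.
arp_pool() {
    sed -n 's/^[^(]*(\([0-9.]*\)).*/\1/p' | sort -u
}

# Example (host names are placeholders):
#   for h in front1 front2; do ssh "$h" arp -an; done | arp_pool
```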

--
Theo Schlossnagle
Principal Consultant
OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Phone: +1 301 776 6376 Fax: +1 410 880 4879
1024D/82844984/95FD 30F1 489E 4613 F22E 491A 7E88 364C 8284 4984
2047R/33131B65/71 F7 95 64 49 76 5D BA 3D 90 B9 9F BE 27 24 E7