Mailing List Archive

Database connection pooling... (beyond Apache::DBI)
Howdy. We're all probably pretty familiar with Apache::DBI and
the fact that it opens a database connection per apache process. Sounds
groovy and works well with only one or two servers. Everything is gravy
until you get a cluster of servers, i.e. 20-30 machines, each with 300+
processes. I've looked into DBD::Proxy and DBI::ProxyServer, which seem
really cool, and I'm wondering if anyone has any experience with those two
modules. Yes? No?
Are there any great solutions to this problem aside from tossing a
TON of RAM into the database server and turning up the number of database
connections? I have a few ideas of my own, but I don't want to reinvent
the wheel and want to hear what's out there. Thanks. --SC


PS I'd like to post the results of this discussion and compile it
into something that would be suitable for the FAQ.

--
Sean Chittenden
sean.chittenden@usa.net
Re: Database connection pooling... (beyond Apache::DBI)
Depending on your app, you may want to at least consider FastCGI. For
handling a single task that doesn't result in lots of data going back to
the client that needs to get hit MANY times a second, it is nearly ideal,
at least architecture-wise. If you have 100 different scripts that get
called with equal frequency and that all do DB stuff, it's not a good fit.
A mix of mod_perl and FastCGI can solve most problems--they tend to
complement each other, IMHO. If you're not familiar with fastcgi, it's a
protocol that lets an app start up once and then essentially handle
requests in a while() loop, allowing you to use just a single db
connection pretty effectively if you're doing lots of little (i.e. short)
queries or updates. There's an apache module called mod_fastcgi that's
available at www.fastcgi.com. Ok, I'll stop blaspheming. :)
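The win of that while() loop is that one long-lived process amortizes a single db connection across many requests. A rough illustration of the pattern (Python, with an in-memory sqlite3 handle standing in for a real database and a list of paths standing in for incoming requests; this shows the shape of a FastCGI worker, not the actual protocol):

```python
import sqlite3

def run_worker(requests):
    """Serve many short queries over one long-lived connection,
    the way a FastCGI process reuses its single db handle."""
    conn = sqlite3.connect(":memory:")  # opened once, at "app startup"
    conn.execute("CREATE TABLE hits (path TEXT)")
    responses = []
    for path in requests:  # the while() accept-loop, in spirit
        conn.execute("INSERT INTO hits (path) VALUES (?)", (path,))
        count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
        responses.append(f"{path}: hit #{count}")
    conn.close()
    return responses
```

The connect cost is paid once, so lots of little queries stay cheap.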

- Bill

On Wed, 23 Feb 2000, Sean Chittenden wrote:

> Howdy. We're all probably pretty familiar with Apache::DBI and
> the fact that it opens a database connection per apache process. Sounds
> groovy and works well with only one or two servers. Everything is gravy
> until you get a cluster of servers, i.e. 20-30 machines, each with 300+
> processes. I've looked into DBD::Proxy and DBI::ProxyServer, which seem
> really cool, and I'm wondering if anyone has any experience with those two
> modules. Yes? No?
> Are there any great solutions to this problem aside from tossing a
> TON of RAM into the database server and turning up the number of database
> connections? I have a few ideas of my own, but I don't want to reinvent
> the wheel and want to hear what's out there. Thanks. --SC
>
>
> PS I'd like to post the results of this discussion and compile it
> into something that would be suitable for the FAQ.
>
> --
> Sean Chittenden
> sean.chittenden@usa.net
>
>
Re: Database connection pooling... (beyond Apache::DBI)
Sean Chittenden <sean@serverninjas.com> writes:

> Howdy. We're all probably pretty familiar with Apache::DBI and
> the fact that it opens a database connection per apache process. Sounds
> groovy and works well with only one or two servers. Everything is gravy
> until you get a cluster of servers, i.e. 20-30 machines, each with 300+
> processes.

300+ perl processes per machine? No way. The only way that would make _any_
sense is if your perl code is extremely i/o dependent and extremely light.
Even then you're way better off having the i/o operations
queued quickly and processed asynchronously.

In all likelihood what you really want is more like 16-32 processes per
machine. That should be more than enough to peg the cpu at 100% even if your
pages are all database driven. Be sure to have separate servers for non-perl
code and a cluster of 20 * 32 processes should be able to handle a few million
hits per day even for loosely coded perl code (assuming the database can keep
up, of course).

32 * 30 is 900 connections. If you're building something of this order of
magnitude there are plenty of databases around that scale to 900 connections.
Oracle with MTS on suitable hardware shouldn't really blink at it, for
example.

> Are there any great solutions to this problem aside from tossing a
> TON of RAM into the database server and turning up the number of database
> connections? I have a few ideas of my own, but I don't want to reinvent
> the wheel and want to hear what's out there. Thanks. --SC

The DBD::Proxy scheme is worth experimenting with. If you try it under load I
would love to see the results. It has the advantage that additional features
could be hacked into DBI::ProxyServer that aren't supported by the underlying
database, such as asynchronous connections.

The alternatives are:

1) a different architecture that allows for a non-1-to-1 mapping between script
engine and web server processes, i.e. FastCGI. This doesn't completely solve
the problem since you may be doing substantial non-database processing in
these scripts as well.

2) database specific features such as Oracle MTS or Oracle Connection Manager.

> PS I'd like to post the results of this discussion and compile it
> into something that would be suitable for the FAQ.

There have been a few previous similar discussions; you could start by reading
the archives.


--
greg
Re: Database connection pooling... (beyond Apache::DBI)
Greg Stark wrote:
>
> Sean Chittenden <sean@serverninjas.com> writes:
>
> > Howdy. We're all probably pretty familiar with Apache::DBI and
> > the fact that it opens a database connection per apache process. Sounds
> > groovy and works well with only one or two servers. Everything is gravy
> > until you get a cluster of servers, ie 20-30 machines, each with 300+
> > processes.
>
> 300+ perl processes per machine? No way. The only way that would make _any_
> sense is if your perl code is extremely i/o dependent and extremely light.
> Even then you're way better off having the i/o operations
> queued quickly and processed asynchronously.

This conversation happens on an approximately biweekly schedule, either
on modperl or dbi-users, or some other list I have the misfortune of
frequenting. Please allow me to expand upon this subject a bit.

I have not yet gotten a satisfactory answer from anyone who starts these
threads regarding why they want connection pooling. I suspect that
people think it is needed because everyone else (Netscape, Microsoft,
Bea) is doing it. There is a particular kind of application where
pooled connections are useful, and there are particular situations where
it is a waste. Every project I have ever done falls into the latter
category, and I can only think of a few cases that fall under the
former.

Connection pooling is a system where your application server threads or
processes, which number n on a single machine, share a pool of database
connections which number fewer than n. This is done to minimize the
number of database connections which are open at once, which in turn is
supposed to reduce the load on the database server. This is effective
when database activity is a small fraction of the total load on the
application server. For example, if your application server mostly
performs matrix manipulation, and only occasionally hits the database,
it would make sense for it to relinquish the connection when it is not
in use.

The downside to connection pooling is that it imposes some overhead.
The connections must be managed properly, and the scheme should be
transparent to the programmer whose code is using it. So when a piece
of code requests a database connection, the connection manager needs to
decide which one to return. It may have to wait for one to free up, or
it may have to open one based on some low-water-mark heuristic. It may
also need to decide that a connection consumer has died or gone away,
possibly taking the connection with it. So you can see that opening a
pooled connection is more computationally expensive than opening a
dedicated connection.
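A minimal sketch of that bookkeeping (Python; the `Pool` class and `connect` factory are invented for illustration, and a real manager would also health-check connections and reclaim them from dead consumers):

```python
import queue
import threading

class Pool:
    """Toy connection pool: hands out at most max_size connections,
    grows lazily, and makes callers wait once everything is checked out."""

    def __init__(self, connect, max_size=4):
        self._connect = connect        # factory that opens a new connection
        self._max = max_size
        self._idle = queue.Queue()     # connections nobody is currently using
        self._opened = 0
        self._lock = threading.Lock()

    def acquire(self, timeout=5.0):
        try:
            return self._idle.get_nowait()      # fast path: reuse an idle one
        except queue.Empty:
            pass
        with self._lock:
            if self._opened < self._max:        # grow the pool on demand
                self._opened += 1
                return self._connect()
        return self._idle.get(timeout=timeout)  # otherwise block until freed

    def release(self, conn):
        self._idle.put(conn)                    # hand it back for reuse
```

Every `acquire` off the warm path pays for locking and possibly a wait, which is exactly the extra cost described above relative to a dedicated per-process handle.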

This pooling overhead is a total waste of time when the majority of what
your application is doing is database-related. If your program will
issue 100 queries and perform a transaction during the course of
fulfilling a request, pooled connections will not make sense. The
reason is that Apache already provides a mechanism for killing off
database connections in this scenario. If a process or thread is
sitting about idle, Apache will come along and terminate it, freeing the
database connection in the process. For database-bound or transactional
programs, the one-to-one mapping of processes to database connections is
ideal.

Pooling is also less attractive because modern databases can handle many
connections. Oracle with MTS will run fine with just as many
connections as you care to open. The application designer should study
how many connections he realistically plans to open. If your
application is bound by database performance, it makes sense to cap the
number of clients, so you would not allow your applications to open too
many connections. If your application is transactional, you don't have
any choice but to give each process its own dedicated connection. If
your application is compute-bound, then your database is lightly loaded
and you won't mind opening a lot of connections.

The summary is that if your application is database-bound, or is
processing transactions, you don't need or even want connection pooling.

-jwb
Re: Database connection pooling... (beyond Apache::DBI)
"Jeffrey W. Baker" wrote:
>
> This conversation happens on an approximately biweekly schedule, either
> on modperl or dbi-users, or some other list I have the misfortune of
> frequenting. Please allow me to expand upon this subject a bit.
>
> I have not yet gotten a satisfactory answer from anyone who starts these
> threads regarding why they want connection pooling. I suspect that
> people think it is needed because everyone else (Netscape, Microsoft,
> Bea) is doing it. There is a particular kind of application where
> pooled connections are useful, and there are particular situations where
> it is a waste. Every project I have ever done falls into the latter
> category, and I can only think of a few cases that fall under the
> former.

Interesting point of view. I'd never stopped to think about how necessary
it REALLY was...

> Connection pooling is a system where your application server threads or
> processes, which number n on a single machine, share a pool of database
> connections which number fewer than n. This is done to minimize the
> number of database connections which are open at once, which in turn is
> supposed to reduce the load on the database server. This is effective
> when database activity is a small fraction of the total load on the
> application server. For example, if your application server mostly
> performs matrix manipulation, and only occasionally hits the database,
> it would make sense for it to relinquish the connection when it is not
> in use.
>
> The downside to connection pooling is that it imposes some overhead.
> The connections must be managed properly, and the scheme should be
> transparent to the programmer whose code is using it. So when a piece
> of code requests a database connection, the connection manager needs to
> decide which one to return. It may have to wait for one to free up, or
> it may have to open one based on some low-water-mark heuristic. It may
> also need to decide that a connection consumer has died or gone away,
> possibly taking the connection with it. So you can see that opening a
> pooled connection is more computationally expensive than opening a
> dedicated connection.

Ah, but just opening a dedicated connection is so slow as to be avoided
as much as possible. Adding a little overhead here wouldn't seem to hurt,
since the overhead to begin with already makes you want to reopen your
db handles as infrequently as you can.

> This pooling overhead is a total waste of time when the majority of what
> your application is doing is database-related. If your program will
> issue 100 queries and perform a transaction during the course of
> fulfilling a request, pooled connections will not make sense. The
> reason is that Apache already provides a mechanism for killing off
> database connections in this scenario. If a process or thread is
> sitting about idle, Apache will come along and terminate it, freeing the
> database connection in the process. For database-bound or transactional
> programs, the one-to-one mapping of processes to database connections is
> ideal.

There is certainly something that's cleaner about this approach. I hadn't
really thought of this either.

> Pooling is also less attractive because modern databases can handle many
> connections. Oracle with MTS will run fine with just as many

Ok, stupid question: What is MTS? Multithreaded Server? Part of the
Enterprise package or Standard?

> connections as you care to open. The application designer should study
> how many connections he realistically plans to open. If your
> application is bound by database performance, it makes sense to cap the
> number of clients, so you would not allow your applications to open too
> many connections. If your application is transactional, you don't have
> any choice but to give each process its own dedicated connection. If
> your application is compute-bound, then your database is lightly loaded
> and you won't mind opening a lot of connections.
>
> The summary is that if your application is database-bound, or is
> processing transactions, you don't need or even want connection pooling.
>
> -jwb

Hmm...are you sure the point about transactions is true? Couldn't you do
database connection pooling and have transactions going, as long as you
ensured any previous transactions had been rolled back? I can see where it
could potentially get ugly to determine the exact state of a db handle, but
I know that uglier problems have been solved. :)

I think my desire for pooled database connections is related to just wanting
flexibility. It's great that we have a decent system for doing database access
from mod_perl, but if there were a clean way to get pooled database
connections, I wouldn't have to worry about the ratio of db-related handlers
vs. not. I could just write code that would run fast, maybe do some db stuff
or not, and then just maybe play with the number of db connections if that
became a point of contention. Finally, there are licensing concerns for some
that would make pushing the number of db connections down advantageous. It
would be ideal to be able to make the choice between pooled db connections or
not, depending on what you decided was a good fit.

One last case where a smaller number of db connections would make sense is
when the db accesses made by handlers are pretty short. If you hit the db for
an insert or something and the amount of time spent in the db is small
compared to the rest of the time needed to process the request (say for an ad
server or something) AND you need to handle a large number of requests per
second, then it would be nice to just have a few database handles shared.
Realistically I don't know that any overhead imposed by pooled db connections
wouldn't get in the way of things, but hey, it's a theoretical example. :)
Anyway, perhaps this topic gets beaten to death but I still find it
interesting to discuss...

- Bill
Re: Database connection pooling... (beyond Apache::DBI)
According to Bill:
>
> Ah, but just opening a dedicated connection is so slow as to be avoided
> as much as possible. Adding a little overhead here wouldn't seem to hurt,
> since the overhead to begin with already makes you want to reopen your
> db handles as infrequently as you can.

If you have MySQL on the same machine, new connections are very fast.

> > This pooling overhead is a total waste of time when the majority of what
> > your application is doing is database-related. If your program will
> > issue 100 queries and perform a transaction during the course of
> > fulfilling a request, pooled connections will not make sense. The
> > reason is that Apache already provides a mechanism for killing off
> > database connections in this scenario. If a process or thread is
> > sitting about idle, Apache will come along and terminate it, freeing the
> > database connection in the process. For database-bound or transactional
> > programs, the one-to-one mapping of processes to database connections is
> > ideal.
>
> There is certainly something that's cleaner about this approach. I hadn't
> really thought of this either.

If you can map the need for a particular database connection into
the URL, you can easily get a pooling effect by running multiple
instances of the back-end mod_perl servers, letting the front end
send the requests to the appropriate copy via ProxyPass or
RewriteRule directives. The normal apache child management
mechanism will keep a reasonable number of each backend running.
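For illustration, that URL-based split might look something like this in the front end's httpd.conf (the ports, URL spaces, and backend layout are made up for the example):

```apache
# Send each URL space to the backend instance that holds the matching
# database connections; Apache's child management sizes each pool.
ProxyPass /reports/ http://localhost:8001/reports/
ProxyPass /orders/  http://localhost:8002/orders/

# RewriteRule equivalent, proxying with the [P] flag:
RewriteEngine On
RewriteRule ^/search/(.*)$ http://localhost:8003/search/$1 [P]
```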

> > If your application is transactional, you don't have
> > any choice but to give each process its own dedicated connection.

Does anyone actually get away with spanning database transactions
across multiple web hits? What if the user just goes away? I've
always tried to accumulate data that needed more than one form into
hidden fields so the transaction can be completed in one shot, even
if it requires checking the initial values at that point to make
sure they haven't been changed by someone else.
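That "check the initial values" step is optimistic concurrency control: make the final UPDATE conditional on the values the form originally carried, and treat zero affected rows as a conflict. A sketch (Python/sqlite3; the `items` table and its price column are invented for the example):

```python
import sqlite3

def commit_if_unchanged(conn, item_id, expected_price, new_price):
    """Apply the accumulated form data in one shot, but only if the row
    still holds the value the user originally saw in the hidden field."""
    cur = conn.execute(
        "UPDATE items SET price = ? WHERE id = ? AND price = ?",
        (new_price, item_id, expected_price),
    )
    conn.commit()
    return cur.rowcount == 1  # False: someone else changed it first
```

The whole transaction lives inside one request, so no db handle is held across hits.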

Les Mikesell
les@mcs.com
Re: Database connection pooling... (beyond Apache::DBI)
Leslie Mikesell wrote:
>
> According to Bill:
> >
> > Ah, but just opening a dedicated connection is so slow as to be avoided
> > as much as possible. Adding a little overhead here wouldn't seem to hurt,
> > since the overhead to begin with already makes you want to reopen your
> > db handles as infrequently as you can.
>
> If you have MySQL on the same machine new connections are very fast.

Doesn't it at least require a fork() somewhere, though? I'd be interested
in getting numbers for, say, cgi scripts that connect to the db every
time using mysql and oracle on the backend, just for comparison's
sake. Is it that noticeable a difference?

> If you can map the need for a particular database connection into
> the URL, you can easily get a pooling effect by running multiple
> instances of the back-end mod_perl servers, letting the front end
> send the requests to the appropriate copy via ProxyPass or
> RewriteRule directives. The normal apache child management
> mechanism will keep a reasonable number of each backend running.

This is true, but doesn't it seem like a lot of setup overhead? It would be
nicer in some ways not to suddenly have another set of config
files, log files, etc. to deal with. It's not that what's described is
particularly hard, but when you're already trying to solve other problems at
the same time, the little things add up. Ideally I'd like to just edit a
config file or something and have the rest be transparent. Of course this
is completely impossible today, but I said "ideally." : )

> > > If your application is transactional, you don't have
> > > any choice but to give each process its own dedicated connection.
>
> Does anyone actually get away with spanning database transactions
> across multiple web hits? What if the user just goes away? I've
> always tried to accumulate data that needed more than one form into
> hidden fields so the transaction can be completed in one shot, even
> if it requires checking the initial values at that point to make
> sure they haven't been changed by someone else.
>
> Les Mikesell
> les@mcs.com

Most stuff I've done is all relative to a single script, so it hasn't
been an issue...

- Bill
Re: Database connection pooling... (beyond Apache::DBI)
Leslie Mikesell wrote:

> Does anyone actually get away with spanning database transactions
> across multiple web hits? What if the user just goes away? I've
> always tried to accumulate data that needed more than one form into
> hidden fields so the transaction can be completed in one shot, even
> if it requires checking the initial values at that point to make
> sure they haven't been changed by someone else.

I do the same thing here. Trying to make the database transaction live
across requests is a dead end. How would you ensure that all the
requests went to the same Apache child process?

-jwb
Re: Database connection pooling... (beyond Apache::DBI)
Agreed... though, I wouldn't use hidden form fields. This sounds
like a job for Apache::Session. :-)

Cheers,
Richard

> > Does anyone actually get away with spanning database transactions
> > across multiple web hits? What if the user just goes away? I've
> > always tried to accumulate data that needed more than one form into
> > hidden fields so the transaction can be completed in one shot, even
> > if it requires checking the initial values at that point to make
> > sure they haven't been changed by someone else.
>
> I do the same thing here. Trying to make the database transaction live
> across requests is a dead end. How would you ensure that all the
> requests went to the same Apache child process?

-------------------------------------------------------------------------
Richard Dice * Personal 514 816 9568 * Fax 514 816 9569
Open Source Evangelist, HBE Software * http://www.hbesoftware.com
ShadNet Creator * http://shadnet.shad.ca/ * rdice@shadnet.shad.ca
Occasional Writer, HotWired * http://www.hotwired.com/webmonkey/
"squeeze the world 'til it's small enough to join us heel to toe"
- jesus jones
Re: Database connection pooling... (beyond Apache::DBI)
According to Bill:

> > > Ah, but just opening a dedicated connection is so slow as to be avoided
> > > as much as possible. Adding a little overhead here wouldn't seem to hurt,
> > > since the overhead to begin with already makes you want to reopen your
> > > db handles as infrequently as you can.
> >
> > If you have MySQL on the same machine new connections are very fast.
>
> Doesn't it at least require a fork() somewhere, though? I'd be interested
> in getting numbers for, say, cgi scripts that connect to the db every
> time using mysql and oracle on the backend, just for comparison's
> sake. Is it that noticeable a difference?

MySQL claims to run threaded (I haven't looked at the code) and if
you are running on the same machine it uses unix domain sockets
instead of TCP. It is faster at startup than anything else you are
likely to find.

> > If you can map the need for a particular database connection into
> > the URL, you can easily get a pooling effect by running multiple
> > instances of the back-end mod_perl servers, letting the front end
> > send the requests to the appropriate copy via ProxyPass or
> > RewriteRule directives. The normal apache child management
> > mechanism will keep a reasonable number of each backend running.
>
> This is true, but doesn't it seem like a lot of setup overhead? It would be
> nicer in some ways to not have to suddenly have another set of config
> files, log files, etc to deal with. It's not that what's described is
> particularly hard, but when you're already trying to solve other problems at
> the same time, the little things add up. Ideally I'd like to just edit a
> config file or something and have the rest be transparent. Of course this
> is completely impossible today, but I said "ideally." : )

The first step in dealing with a busy mod_perl site is almost always
to set up a lightweight front-end proxy to handle static content
and allow the mod_perl back ends to complete and move on to the next
job while the proxy dribbles back the replies over the often slow
internet connections. This may be enough by itself. I normally
see a 10 to 1 ratio of front to back end servers. If you still
have too many back end servers with database connections it is
a simple matter to split the back end load with some ProxyPass
or RewriteRule directives. You may want to split the backend
load across machines as well, and the proxy setup handles this
nicely.

In fact I don't quite understand what the people who
invented the different protocols for fastcgi and jserve proxying
were thinking of. What protocol can be better than http for
passing through an http request? What is going to be better
than an http server at responding?
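The front-end/back-end split being described is plain http proxying; a minimal version of the front end's config might look like this (the document root, URL space, and port are illustrative):

```apache
# Lightweight front end: serve static files itself, pass only the
# dynamic URL space through to the mod_perl back end over plain HTTP.
DocumentRoot /var/www/static
ProxyPass        /app/ http://localhost:8080/app/
ProxyPassReverse /app/ http://localhost:8080/app/
```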

Les Mikesell
les@mcs.com