Mailing List Archive

Fork, Threading, and Select madness. Please help!
Ok, I have been working on a website which is now available at
http://www.biddin.com/ that performs a search of multiple auction sites
all at once, and then returns the results. The problem is that it is
very slow, and it appears that the primary bottleneck is in waiting on
the search engines at the various sites to begin reading data.

Well, this makes sense because the code to implement the search is
currently pure sequential with no threading or non-blocking IO
whatsoever. This means that a search request on Amazon can only occur
after a successful completion of a search on EBay (which is notoriously
slow).

What I'm getting around to asking (slowly<g>) is, what is the best way
to implement an algorithm that can retrieve the data from ALL 3 sites
simultaneously, so that the code is never blocked up waiting on one
particular bad site? Note that it should also be possible to time
out if a search site is failing, and it should be easily extensible
to handle more than 3 simultaneous requests in the future. Also, keep
in mind that threading is my LAST possible choice, as the Python
installation at my presence provider doesn't support threading. :)

Suggestions as to where I can read up on implementing this as a series
of forked processes, and whether or not this would be practical are
certainly welcome. Also, if anyone has any brilliant ideas for how it
could easily be done with the Select mechanism, I'd love to hear it.
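[Ed. note: as one answer to the forked-process question, here is a minimal sketch (POSIX-only, with hypothetical job names, error handling omitted, and results assumed small enough to fit a pipe buffer). Each child runs one search and sends its pickled result back to the parent through a pipe.]

```python
import os
import pickle

def run_in_children(jobs):
    """Fork one child per job (a dict of name -> zero-argument callable).
    Each child writes its pickled return value to a pipe and exits; the
    parent collects every pipe's contents and reaps the children."""
    pipes = {}
    for name, job in jobs.items():
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                 # child: run the job, report, exit
            os.close(r)
            os.write(w, pickle.dumps(job()))
            os._exit(0)
        os.close(w)                  # parent keeps only the read end
        pipes[name] = r

    results = {}
    for name, r in pipes.items():
        data = b""
        while True:                  # read until the child closes its end
            chunk = os.read(r, 4096)
            if not chunk:
                break
            data += chunk
        os.close(r)
        results[name] = pickle.loads(data)
    for _ in jobs:
        os.wait()                    # reap the zombies
    return results
```

All three searches run concurrently in separate processes, so one slow site only delays its own pipe, not the others.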

---------------
Jesse D. Sightler
http://www3.pair.com/jsight/

PS - Not that it matters, but much of this code will probably be
released under GPL eventually. :)
Fork, Threading, and Select madness. Please help! [ In reply to ]
Jesse D. Sightler wrote:

> Ok, I have been working on a website which is now available at
> http://www.biddin.com/ that performs a search of multiple auction
> sites all at once, and then returns the results. The problem is
> that it is very slow, and it appears that the primary bottleneck is
> in waiting on the search engines at the various sites to begin
> reading data.
>
> Well, this makes sense because the code to implement the search is
> currently pure sequential with no threading or non-blocking IO
> whatsoever.
>
> What I'm getting around to asking (slowly<g>) is, what is the best
> way to implement an algorithm that can retrieve the data from ALL 3
> sites simultaneously, so that the code is never blocked up waiting
> on one particular bad site?

In theory, select (multiplexing) is the best solution, especially
since you are just gathering data. But this might well mean trashing
all of your existing code. I'd recommend using asyncore if you're
doing this from scratch.

If you're using one of the higher-level protocol modules (urllib,
httplib), a quick glance (I'm no expert on these guys) says
you'll have to thread or use subprocesses. Multiplexing usually means
turning your logic inside out (you're no longer driven by meaningful
exchanges, but by drips and drabs of data appearing in network
buffers). So the transformation from blocking sockets to multiplexed
non-blocking sockets is major surgery.
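[Ed. note: to make the "drips and drabs" style concrete, here is a minimal select-based sketch. It multiplexes reads across several already-connected sockets; the test below uses local socket pairs in place of real auction-site connections.]

```python
import select
import socket

def read_all(socks, timeout=5.0):
    """Drain several sockets concurrently with select(), so one slow
    peer never blocks the others. Returns {socket: bytes received}."""
    results = {s: b"" for s in socks}
    pending = list(socks)
    while pending:
        readable, _, _ = select.select(pending, [], [], timeout)
        if not readable:
            break                    # nobody spoke in time: give up on the rest
        for s in readable:
            chunk = s.recv(4096)
            if chunk:                # a drip of data arrived: stash it
                results[s] += chunk
            else:                    # peer closed its end: this fetch is done
                pending.remove(s)
                s.close()
    return results
```

Note how the loop is driven by whichever socket happens to have data, not by the order of the requests; that inversion is the "major surgery" mentioned above.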

- Gordon
Fork, Threading, and Select madness. Please help! [ In reply to ]
Thanks for your thoughtful reply. As it turned out, threading wasn't
nearly as bad a solution as I thought it might be<0.5wink>. I just
downloaded Python 1.5.2 and recompiled with "./configure --with-threads"
and everything worked out nicely on the FreeBSD box that my webserver
uses.

Of course, doing this on my linux box resulted in something that causes
my kernel to blow up in flames a few hours after running the Python
interpreter with any threaded program. But that's ok, at least it
works with my ISP.

Anyway, threads seem to be helping me make things faster, so right now
that is my solution. The select idea is probably theoretically better,
but would be a big challenge to implement due to the architecture of the
code.<sigh>
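[Ed. note: a minimal sketch of that thread-per-site pattern, in modern Python with a join deadline so a dead site can't stall the whole search. The job names and callables here are illustrative, not from the actual biddin.com code.]

```python
import threading
import time

def fetch_all(jobs, timeout=10.0):
    """Run each job (a dict of name -> zero-argument callable, e.g. one
    HTTP fetch per auction site) in its own thread. Jobs still running
    at the deadline are simply left out of the results."""
    results = {}
    lock = threading.Lock()

    def worker(name, job):
        value = job()
        with lock:                   # results dict is shared: serialize writes
            results[name] = value

    threads = [threading.Thread(target=worker, args=item, daemon=True)
               for item in jobs.items()]
    for t in threads:
        t.start()
    deadline = time.monotonic() + timeout
    for t in threads:                # wait, but never past the shared deadline
        t.join(max(0.0, deadline - time.monotonic()))
    return dict(results)
```

Adding a fourth site is just one more entry in the jobs dict, which answers the extensibility requirement from the original post.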

Gordon McMillan wrote:
>
> Jesse D. Sightler wrote:
>
> > Ok, I have been working on a website which is now available at
> > http://www.biddin.com/ that performs a search of multiple auction
> > sites all at once, and then returns the results. The problem is
> > that it is very slow, and it appears that the primary bottleneck is
> > in waiting on the search engines at the various sites to begin
> > reading data.
> >
> > Well, this makes sense because the code to implement the search is
> > currently pure sequential with no threading or non-blocking IO
> > whatsoever.
> >
> > What I'm getting around to asking (slowly<g>) is, what is the best
> > way to implement an algorithm that can retrieve the data from ALL 3
> > sites simultaneously, so that the code is never blocked up waiting
> > on one particular bad site?
>
> In theory, select (multiplexing) is the best solution, especially
> since you are just gathering data. But this might well mean trashing
> all of your existing code. I'd recommend using asyncore if you're
> doing this from scratch.
>
> If you're using one of the higher-level protocol modules (urllib,
> httplib), a quick glance (I'm no expert on these guys) says
> you'll have to thread or use subprocesses. Multiplexing usually means
> turning your logic inside out (you're no longer driven by meaningful
> exchanges, but by drips and drabs of data appearing in network
> buffers). So the transformation from blocking sockets to multiplexed
> non-blocking sockets is major surgery.
>
> - Gordon
Fork, Threading, and Select madness. Please help! [ In reply to ]
Jesse D. Sightler wrote:

> Thanks for your thoughtful reply. As it turned out, threading
> wasn't nearly as bad a solution as I thought it might be<0.5wink>.
> I just downloaded Python 1.5.2 and recompiled with "./configure
> --with-threads" and everything worked out nicely on the FreeBSD box
> that my webserver uses.

You're welcome. Wish all my advice came out that well!

> Of course, doing this on my linux box resulted in something that
> causes my kernel to blow up in flames a few hours after running the
> Python interpreter with any threaded program. But that's ok, at
> least it works with my ISP.

If you compiled with gcc, you might try egcs. A recent thread on
Linux threading problems arrived at that fix, but I'm not sure it's
the same problem.

I'd never encountered the problem, but rebuilt with egcs anyway.


- Gordon