Mailing List Archive

timeout on urllib.urlopen?
I'm trying to use urllib.urlopen() on a big list of urls, some of which are
dead (they don't return a 404, just no response). And the function just waits.
Is there any way to specify a timeout period for this function? thanks,

Kevin

timeout on urllib.urlopen? [ In reply to ]
Hello!

On Mon, 26 Apr 1999, Kevin L wrote:
> I'm trying to use urllib.urlopen() on a big list of urls, some of which are
> dead (they don't return a 404, just no response). And the function just waits.
> Is there any way to specify a timeout period for this function? thanks,

This is a well-known problem with sockets. No, there is no simple way to
specify a timeout. You should rewrite your program to use either fork() or
threads: one thread (or the parent process, if you use fork()) controls the
other and kills the watched thread/process after the timeout. Recently I
rewrote my URL checker to use fork() (the checker itself is still being
debugged and will be published soon).
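
For example, here is a minimal thread-based sketch of that idea (the names
and details are mine and it is untested; note that Python cannot forcibly
kill a thread, so the controlling thread can only abandon the worker after
the timeout):

import urllib
import threading

class URLFetcher(threading.Thread):
    # worker thread: fetches one URL and stores the result
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.result = None

    def run(self):
        self.result = urllib.urlopen(self.url).read()

def fetch_with_timeout(url, timeout = 30):
    fetcher = URLFetcher(url)
    fetcher.setDaemon(1)    # a hung fetch won't keep the process alive at exit
    fetcher.start()
    fetcher.join(timeout)   # controlling thread waits at most `timeout' seconds
    if fetcher.isAlive():
        return None         # timed out -- abandon the worker thread
    return fetcher.result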

> Kevin

Oleg.
----
Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/
Programmers don't die, they just GOSUB without RETURN.
timeout on urllib.urlopen? [ In reply to ]

On Mon, Apr 26, 1999 at 04:48:44AM +0000, Kevin L wrote:
>
> I'm trying to use urllib.urlopen() on a big list of urls, some of which are
> dead (they don't return a 404, just no response). And the function just waits.
> Is there any way to specify a timeout period for this function? thanks,
>
> Kevin
>

greetings,

attached, please find a short, lightly tested module that might do what you
are looking for. Please let me know if this is what you need. It's a piece
of code I wrote for a larger application, and it seems to get the job done
nicely. Suggestions for optimizations, etc., are welcome.

regards,
J
--
|| visit gfd <http://quark.newimage.com:8080/>
|| psa member #293 <http://www.python.org/>
|| New Image Systems & Services, Inc. <http://www.newimage.com/>

[attachment: nonblockinghttp.py]

import socket
import string
import select

from urlparse import urlparse
from httplib import HTTP, HTTP_PORT

from errno import EINPROGRESS, ETIMEDOUT

class localHTTP(HTTP):
    def __init__(self, host = '', port = 0, timeout = 10.0):
        self.connect_timeout = timeout
        HTTP.__init__(self, host, port)

    def connect(self, host, port = 0):
        # accept a "host:port" string if no explicit port was given
        if not port:
            i = string.find(host, ":")
            if i >= 0:
                host, port = host[:i], host[i+1:]
                try:
                    port = string.atoi(port)
                except string.atoi_error:
                    raise socket.error, "nonnumeric port"
        if not port:
            port = HTTP_PORT

        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if self.debuglevel > 0:
            print "connect:", (host, port)

        # start a non-blocking connect; EINPROGRESS just means "in progress"
        self.sock.setblocking(0)
        try:
            self.sock.connect((host, port))
        except socket.error, why:
            if why[0] != EINPROGRESS:
                raise

        # wait until the socket becomes writable, or the timeout expires
        (r, w, e) = select.select([], [self.sock], [], self.connect_timeout)
        if w == [self.sock]:
            # writable can also mean the connect failed; check SO_ERROR
            err = self.sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err:
                raise socket.error, (err, "error during connect phase")
            self.sock.setblocking(1)
            return
        else:
            raise socket.error, (ETIMEDOUT, "timeout during connect phase")

def checkurl(url):
    if not url:
        return None

    u = urlparse(url)
    netloc = u[1]
    path = u[2]

    h = localHTTP(netloc)
    h.set_debuglevel(0)
    h.putrequest("HEAD", path)
    h.putheader("accept", "text/html")
    h.putheader("accept", "text/plain")
    h.endheaders()

    # returns (status code, reason phrase, response headers)
    return h.getreply()

if __name__ == "__main__":
    print checkurl("http://quark.newimage.com:8080/")

timeout on urllib.urlopen? [ In reply to ]
jam <jam@newimage.com> writes:

> On Mon, Apr 26, 1999 at 04:48:44AM +0000, Kevin L wrote:
> >
> > I'm trying to use urllib.urlopen() on a big list of urls, some of which are
> > dead (they don't return a 404, just no response). And the function just waits.
> > Is there any way to specify a timeout period for this function? thanks,

....
> attached, please find a short lightly tested module that might do what you
> are looking for.. please let me know if this is what you need. it's a piece
> of code I wrote for a larger application, and it seems to get the job done
> nicely. suggestions for optimizations, etc, accepted.

I once used SIGALRM to force a timeout. Maybe somebody could comment
on that approach?


/steffen

--8<----8<----8<----8<--
import signal

def alarmHandler(*args):
    """
    signal handler for SIGALRM, just raise an exception
    """
    raise "TimeOut"

....
# note: SIGALRM is Unix-only and must be set in the main thread
signal.signal(signal.SIGALRM, alarmHandler)
try:
    # set timeout
    signal.alarm(120)

    #... urllib.urlretrieve pages

    signal.alarm(0)     # cancel the alarm if we finished in time
except "TimeOut":
    # some error handling
    signal.alarm(0)
--8<----8<----8<----8<--
--
steffen@cyberus.ca <> Gravity is a myth -- the Earth sucks!
timeout on urllib.urlopen? [ In reply to ]
On Mon, Apr 26, 1999 at 07:59:50AM -0400, Steffen Ries wrote:
>
> I used once SIGALRM to force a timeout. Maybe somebody could comment
> on that approach?
>
>
> /steffen
>
[..snipped code..]

greetings,

All well and good (the more ideas the better), except that if something goes
wrong, all you learn is that a timeout happened within 120 seconds. With the
'select' approach, you have a chance to record the specific error the socket
hit: sometimes the server is down ('connection refused'), sometimes the web
server itself is having problems, sometimes the network is down, etc. You can
import additional 'errno' symbols and trap them if necessary, and since you
give the select call an explicit timeout, you can trap that case separately
as well.
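
For instance, a rough (untested) sketch of that kind of error reporting,
reusing the checkurl() function from the module I posted earlier:

import socket
from errno import ECONNREFUSED, ETIMEDOUT, EHOSTUNREACH, ENETUNREACH

def checkurl_verbose(url):
    # wrap checkurl() and report *why* a URL failed, not just that it did
    try:
        return checkurl(url)
    except socket.error, why:
        if why[0] == ECONNREFUSED:
            print url, "- server is down (connection refused)"
        elif why[0] == ETIMEDOUT:
            print url, "- no response before the timeout"
        elif why[0] in (EHOSTUNREACH, ENETUNREACH):
            print url, "- network problem (host or net unreachable)"
        else:
            print url, "- failed:", why
        return None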

hope that helps.

regards,
J
--
|| visit gfd <http://quark.newimage.com:8080/>
|| psa member #293 <http://www.python.org/>
|| New Image Systems & Services, Inc. <http://www.newimage.com/>