Mailing List Archive

Web Crawler
Hi,

I have been writing a web crawler in Java for quite some time now. Since
Lucene doesn't contain one itself, I was wondering whether you would be
interested in a contribution to the Lucene project.

I would probably call it version 0.4. It has quite a modular design, it's
multithreaded, and still pretty simple.

And it's optimized for speed. I spent some time with a profiler to get the
beast FAST and keep memory consumption low. It contains an optimized HTML
parser that extracts just the necessary information and doesn't waste time
or objects.
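
For illustration, extracting links without building a full document tree can
be as simple as the sketch below. This is just a regex-based stand-in, not
the parser described above, and it only handles quoted href attributes:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal link extractor: pulls href values out of anchor tags without
// building a full document tree.
public class LinkExtractor {

    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));   // the captured href value
        }
        return links;
    }
}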

I was able to reach a maximum of 3.7 MB/sec on a 100 MBit line in a
MAN-style network (a university campus with about 150 web servers).

Its only purpose is to crawl documents and links and store them somewhere.
Nothing is done with the documents (though it would be easy to incorporate
computation steps, doing so would probably shift the balance between I/O and
CPU usage until one of them becomes a bottleneck). A connection to the
Lucene engine has yet to be provided.
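
For illustration, feeding fetched pages into Lucene could look roughly like
the following sketch. It is a hypothetical glue class, not part of the
crawler; the field layout is an assumption, and the calls follow the Lucene
1.x API (IndexWriter, Field.Keyword/Text/UnStored):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical glue between the crawler and a Lucene index.
public class LuceneSink {

    private final IndexWriter writer;

    public LuceneSink(String indexDir) throws IOException {
        // true = create a new index in indexDir
        writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
    }

    public void store(String url, String title, String body) throws IOException {
        Document doc = new Document();
        doc.add(Field.Keyword("url", url));        // stored, not tokenized
        doc.add(Field.Text("title", title));       // stored and tokenized
        doc.add(Field.UnStored("contents", body)); // tokenized, not stored
        writer.addDocument(doc);
    }

    public void close() throws IOException {
        writer.close();
    }
}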

I have also made a lot of optimizations to RAM usage, but some data
structures are still kept in main memory (notably the hash of visited URLs),
which limits the number of files that can be crawled.

Since it's not a production release yet, it still has some limitations.
Some work remains to be done, I still have a lot of ideas, and pretty much
all of the configuration is still done in the Java source code (well, at
least most of it is concentrated in the main() method). Since I have only
used it myself, this has been fine so far.

Cheers,

Clemens Marschner


Re: Web Crawler
Sounds good!
I think it will be very useful.
I am writing a crawler too, but it is not complete or multithreaded yet.
My crawler will run on your own machine because of bandwidth usage, high CPU
usage, disk I/O, etc.
It will also work as an NT service, and I'll access it via RMI to manage it
from a remote machine.

I can tell you in advance that keeping all the visited links in memory will
kill your machine after about 150,000 links. I tested that: I crawled
amazon.com, and after 200,000 links the CPU was at 100%, no response to
events, nothing. The best part? It wasn't even working, because all the time
was wasted searching whether the array already contained the current URL to
make the enqueue/ignore decision. Same thing when inserting or deleting a
link from the queue!
Nice.
I think a database approach would be good for that.

Bye


Re: Web Crawler
> I can tell you in advance that keeping all the visited links in memory
> will kill your machine after about 150,000 links. I tested that: I crawled
> amazon.com, and after 200,000 links the CPU was at 100%, no response to
> events, nothing. The best part?

It's not that bad with my crawler. I recently crawled 600,000 docs, and from
what I recall the memory usage was somewhere between 150 and 200 MB. But
that's still too much.

> It wasn't even working, because all the time was wasted searching whether
> the array already contained the current URL to make the enqueue/ignore
> decision. Same thing when inserting or deleting a link from the queue!

Sounds like you don't use a HashMap?
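
For reference, a plain java.util.HashSet already gives constant-time
membership checks for the visited set. A minimal sketch (class and method
names are made up, and it still keeps everything in RAM, which is exactly
the limitation discussed above):

import java.util.HashSet;
import java.util.Set;

// Visited-URL bookkeeping with constant-time membership checks.
public class VisitedUrls {

    private final Set<String> seen = new HashSet<>();

    // Returns true if the URL is new and should be enqueued.
    public synchronized boolean markVisited(String url) {
        return seen.add(url);   // add() returns false if already present
    }
}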

The queue is easier; I wrote a caching queue that only holds small blocks of
the queue(s) in RAM and keeps most of it in files.
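
Such a spill-to-disk queue could be sketched roughly as follows. This is not
the actual implementation; the single spill file, the block size, and the
one-URL-per-line ASCII format are assumptions:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Disk-backed FIFO queue: only small head and tail blocks live in RAM;
// everything in between sits in a spill file, one URL per line.
public class SpillingUrlQueue {

    private static final int BLOCK = 1000;                        // in-memory block size

    private final Deque<String> head = new ArrayDeque<>();        // oldest entries
    private final Deque<String> tail = new ArrayDeque<>();        // newest entries
    private final RandomAccessFile spill;
    private long readPos = 0;                                     // next unread byte in the file

    public SpillingUrlQueue(String fileName) throws IOException {
        spill = new RandomAccessFile(fileName, "rw");
        spill.setLength(0);
    }

    public synchronized void enqueue(String url) throws IOException {
        tail.addLast(url);
        if (tail.size() >= BLOCK) {
            // Append the newest block to the end of the spill file to bound RAM use.
            spill.seek(spill.length());
            while (!tail.isEmpty()) {
                spill.writeBytes(tail.pollFirst() + "\n");
            }
        }
    }

    // Returns the oldest queued URL, or null if the queue is empty.
    public synchronized String dequeue() throws IOException {
        if (head.isEmpty()) {
            refillHead();
        }
        return head.pollFirst();
    }

    // FIFO order is: head block (oldest), then unread file contents, then tail block.
    private void refillHead() throws IOException {
        spill.seek(readPos);
        for (int i = 0; i < BLOCK && readPos < spill.length(); i++) {
            String line = spill.readLine();
            readPos = spill.getFilePointer();
            if (line != null) {
                head.addLast(line);
            }
        }
        if (head.isEmpty()) {                                     // nothing left on disk
            while (!tail.isEmpty()) {
                head.addLast(tail.pollFirst());
            }
        }
    }
}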

> I think a database approach would be good for that.

No, I fear that's far too slow.
The problem is that both inserts and lookups have to be fast, and (since
docs can point to any URL on the planet) both take place at random positions
in the whole URL collection. That means you can't put parts of it on disk
and keep the rest in RAM without losing too much performance.
The solution I see is keeping the information for complete hosts on disk and
only keeping a certain number of hosts in RAM that can be handled at once.
Then even the decision whether a doc has already been crawled has to be
queued until the host info is loaded back into memory.
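
A rough sketch of that host-partitioned scheme might look like the following
(all names, the one-file-per-host layout, and the deferred-check bookkeeping
are assumptions, not the actual design):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Visited URLs for a handful of "resident" hosts stay in RAM; checks for any
// other host are parked until that host's set is swapped back in.
public class HostPartitionedFrontier {

    private final File dir;                                     // one file per host
    private final Map<String, Set<String>> resident = new HashMap<>();
    private final Map<String, List<String>> deferred = new HashMap<>();
    private final Queue<String> output = new ArrayDeque<>();    // URLs cleared for crawling

    public HostPartitionedFrontier(File dir) { this.dir = dir; }

    // Either decides immediately (host resident) or defers the decision.
    public void offer(String host, String url) {
        Set<String> seen = resident.get(host);
        if (seen != null) {
            if (seen.add(url)) output.add(url);                 // new URL: crawl it
        } else {
            deferred.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
    }

    // Swap a host's visited set into RAM and replay its deferred checks.
    public void loadHost(String host) throws IOException {
        Set<String> seen = new HashSet<>();
        File f = new File(dir, host);
        if (f.exists()) {
            try (BufferedReader r = new BufferedReader(new FileReader(f))) {
                String line;
                while ((line = r.readLine()) != null) seen.add(line);
            }
        }
        resident.put(host, seen);
        List<String> backlog = deferred.remove(host);
        if (backlog != null) {
            for (String url : backlog) offer(host, url);
        }
    }

    // Write a host's visited set back to disk and free the RAM.
    public void evictHost(String host) throws IOException {
        Set<String> seen = resident.remove(host);
        if (seen == null) return;
        try (PrintWriter w = new PrintWriter(new FileWriter(new File(dir, host)))) {
            for (String url : seen) w.println(url);
        }
    }

    public String next() { return output.poll(); }              // null if nothing ready
}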
Another thing I have in mind is to compress the URLs in memory. First of
all, a URL can be divided into several parts, some of which occur in a lot
of URLs (e.g., the host name). And URLs contain only a limited number of
different characters, so Huffman encoding would probably be quite efficient.
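
As a sketch of the first part of that idea (illustrative only): the host,
which is shared by many URLs, is replaced by a small integer id, and only
the path bytes are kept per URL; since those bytes use a limited alphabet,
they could then be Huffman-coded on top.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Compact URL storage: one shared host table plus per-URL path bytes.
public class CompactUrlStore {

    private final Map<String, Integer> hostIds = new HashMap<>();
    private final List<String> hosts = new ArrayList<>();

    // One stored URL: a host id plus the raw path bytes.
    public static final class Entry {
        final int hostId;
        final byte[] path;
        Entry(int hostId, byte[] path) { this.hostId = hostId; this.path = path; }
    }

    public Entry compress(String url) {
        // Split "http://host/path" into host and path (very simplified).
        int start = url.indexOf("//") + 2;
        int slash = url.indexOf('/', start);
        String host = (slash < 0) ? url.substring(start) : url.substring(start, slash);
        String path = (slash < 0) ? "/" : url.substring(slash);

        Integer id = hostIds.get(host);
        if (id == null) {
            id = hosts.size();
            hosts.add(host);
            hostIds.put(host, id);
        }
        return new Entry(id, path.getBytes(StandardCharsets.US_ASCII));
    }

    public String decompress(Entry e) {
        // Assumes the http scheme for simplicity.
        return "http://" + hosts.get(e.hostId) + new String(e.path, StandardCharsets.US_ASCII);
    }
}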


Clemens


RE: Web Crawler
> -----Original Message-----
> From: Clemens Marschner [mailto:cmad@lanlab.de]
> Sent: Wednesday, April 24, 2002 11:14 PM
> To: Lucene Developers List; korfut@lycos.com
> Subject: Re: Web Crawler
>
> Another thing I have in mind is to compress the URLs in memory. First of
> all, a URL can be divided into several parts, some of which occur in a lot
> of URLs (e.g., the host name). And URLs contain only a limited number of
> different characters, so Huffman encoding would probably be quite
> efficient.
>
See this: http://www.almaden.ibm.com/cs/k53/www9.final/

. "In CS2, each URL is stored in 10 bytes. In CS1, each link requires 8 bytes to store as both an in-link and out-link; in CS2, an average of only 3.4 bytes are used. Second, CS2 provides additional functionality in the form of a host database. For example, in CS2, it is easy to get all the in-links for a given node, or just the in-links from remote hosts.

Like CS1, CS2 is designed to give high-performance access to all this data on a high-end machine with enough RAM to store the database in memory. On a 465 MHz Compaq AlphaServer 4100 with 12GB of RAM, it takes 70-80 ms to convert a URL into an internal id or vice versa, and then only 0.15 ms/link to retrieve each in-link or out-link. On a uniprocessor machine, a BFS that reaches 100M nodes takes about 4 minutes; on a 2-processor machine we were able to complete a BFS every two minutes."
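
The core trick behind those numbers is mapping every URL to a dense integer
id so that links can be stored as plain ints instead of strings. A minimal
sketch of that idea (illustrative only, not the CS2 implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Link graph over dense integer ids: the URL string is stored once per node,
// and every edge is just an int in an adjacency list.
public class LinkGraph {

    private final Map<String, Integer> idOf = new HashMap<>();
    private final List<String> urlOf = new ArrayList<>();
    private final List<List<Integer>> outLinks = new ArrayList<>();

    public int idFor(String url) {
        Integer id = idOf.get(url);
        if (id == null) {
            id = urlOf.size();
            idOf.put(url, id);
            urlOf.add(url);
            outLinks.add(new ArrayList<>());
        }
        return id;
    }

    public void addLink(String fromUrl, String toUrl) {
        outLinks.get(idFor(fromUrl)).add(idFor(toUrl));
    }

    public List<Integer> outLinksOf(String url) {
        return outLinks.get(idFor(url));
    }
}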


peter
