Mailing List Archive: LARM Crawler: Repository

Ok I think I got your point.

> You have MySQL to hold your links.
> You have N crawler threads.
> You don't want to hit MySQL a lot, so you get links to crawl in batches
> (e.g. each crawler thread tells MySQL: give me 1000 links to crawl).

[.just to make it clear: this looks like the threads would be "pulling"
tasks. In fact the MessageHandler pushes tasks that are then distributed to
the thread pool.]

That's still one of the things I wanted to change in the next time: at the
moment the message processing pipeline is not working in batch mode. Each
URLMessage is transmitted on its own. I already wanted to change this in
order to reduce the number of synchronization points. But it was not top
priority because the overall process was still pretty much I/O bound.
But from my early experiments with the Repository I see now that this
becomes more and more important.

I read the paper about the WebRace crawler
(http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html, pp.
12-15) and thought about the following way the repository should work:
- for each URL put into the pipeline, look if it has already been crawled,
and if, save the lastModified date into the URLMessage
- in the crawling task, look if the lastModified timestamp is set. if so,
send an "If-Modified-Since" header along with the GET command
-> if a 304 (not-modified-since) is the answer, load the outbound links out
of the repository and put them back into the pipeline
-> if a 404 (or similar) statuscode is returned, delete the file and its
links from the repository
-> if it was modified, delete the old stored links, update the timestamp of
the doc in the repository and continue as if the file was new
- if the file is new, load it, parse it, save its links, and put them back
into the pipeline

(Any better ideas?)

I have already implemented a rather naive approach of that today, which (by
no surprise) turns out to be slower than crawling everything from the
start...

What I've learned:
- The repository must load the information about already crawled documents
into main memory after the start (which means the main memory must be large
enough to hold all these URLs + some extra info, which is already done in
URLVisitedFilter at this time) and, more importantly...
- have a more efficient means of accessing the links than in a regular SQL
table with {referer, target} pairs. The Meta-Info store mentioned in the
WebRace crawler may be a solution (it's a plain text file that contains all
document meta-data and whose index is held in main memory), but it prevents
the URLs from being sorted other ways (e.g. all INlinks to a document),
which is what I need for my further studies.

> The crawler fetches all pages, and they go through your component
> pipeline and get processed.
> What happens if after fetching 100 links from this batch of 1000 the
> crawler thread dies? Do you keep track of which links in that batch
> you've crawled, so that in case the thread dies you don't recrawl
> those?
> That's roughly what I meant.

First of all, I have invested a lot of time to prevent any threads from
dying. That's a reason why I took HTTPClient, because it has never hung so
far. A lot of exceptions are caught on the task level. I've had a lot of
problems with hanging threads when I still used the java.net.URLConnection
classes, but no more.
I have also learned that "whatever can go wrong, will go wrong, very soon".
That is why I patched the HTTPClient classes to introduce a maximum file
size to be fetched.
I can imagine some sort of crawler trap when a server process sends
characters very slowly, as it is used in some spam filters. That's where the
ThreadMonitor comes in. Each task publishes its state (i.e. "loading data"),
and the ThreadMonitor restarts it when it remains in a state for too long.
That's the place where the ThreadMonitor could save the rest of the batch.
This way, the ThreadMonitor could become the single point of failure, but
the risk that this thread is hanging is reduced by keeping it simple. Just
like a watchdog hardware that looks that traffic lights work at a street
crossing...

Regards,

Clemens

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Hello,

--- Clemens Marschner <cmad@lanlab.de> wrote:
> Ok I think I got your point.
>
> > You have MySQL to hold your links.
> > You have N crawler threads.
> > You don't want to hit MySQL a lot, so you get links to crawl in
> batches
> > (e.g. each crawler thread tells MySQL: give me 1000 links to
> crawl).
>
>
> [.just to make it clear: this looks like the threads would be
> "pulling"
> tasks. In fact the MessageHandler pushes tasks that are then
> distributed to the thread pool.]

Something must be doing pulling of links from MySQL by doing SELECT...

> That's still one of the things I wanted to change in the next time:
> at the
> moment the message processing pipeline is not working in batch mode.
> Each
> URLMessage is transmitted on its own. I already wanted to change this
> in
> order to reduce the number of synchronization points. But it was not
> top
> priority because the overall process was still pretty much I/O bound.
> But from my early experiments with the Repository I see now that this
> becomes more and more important.

Uh, doing it one by one will make the DB become the bottleneck quickly.
MySQL extends the 'standard SQL' by allowing you to get N rows at the
time, as do MS-SQL Server and PostgreSQL (offset, limit).

> I read the paper about the WebRace crawler
> (http://citeseer.nj.nec.com/zeinalipour-yazti01highperformance.html,
> pp.
> 12-15) and thought about the following way the repository should
> work:
> - for each URL put into the pipeline, look if it has already been
> crawled,
> and if, save the lastModified date into the URLMessage
> - in the crawling task, look if the lastModified timestamp is set. if
> so,
> send an "If-Modified-Since" header along with the GET command
> -> if a 304 (not-modified-since) is the answer, load the outbound
> links out
> of the repository and put them back into the pipeline

Do you need to do that? If you've seen this page before, and you have
already extracted and stored links found in it, then those links will
get their turn, they will be pulled out of the link repository when
their time comes. In other words, you don't need to put them in the
pipeline at this point.

> -> if a 404 (or similar) statuscode is returned, delete the file and
> its links from the repository

Delete the file, sure, although you may want to consider giving it a
few strikes before throwing it out for good (e.g. 404 once - make a
note of it, 404 again - make a note of it, 404 for the third time -
you're out!), thus also giving it a chance to 'recover' (e.g. 404 once,
404 twice, but 200 next time - page stays in repository).
But links, why remove links found in a page that you are about to
delete? They will likely still be valid, and what's more, perhaps this
now 404 page might have been the only path to them, so if you remove
them now, you may never find them again, if no other page points to
them.

> -> if it was modified, delete the old stored links, update the
> timestamp of
> the doc in the repository and continue as if the file was new

I wouldn't delete old links, I'd live them, and just try to add links
anyway, ensuring that you don't end up with duplicate links in the
repository.

> - if the file is new, load it, parse it, save its links, and put them
> back into the pipeline

Why not pipeline, why not repository? (I may be missing something
about how you handle link storage).
When you say pipeline, is this something that is persistent
(CachingQueue?) or in memory?
When I say link repository I'm referring to your MySQL database.

> (Any better ideas?)
>
> I have already implemented a rather naive approach of that today,
> which (by
> no surprise) turns out to be slower than crawling everything from the
> start...
>
> What I've learned:
> - The repository must load the information about already crawled
> documents
> into main memory after the start (which means the main memory must be
> large
> enough to hold all these URLs + some extra info, which is already
> done in
> URLVisitedFilter at this time) and, more importantly...

It has to load all this information in order to be able to check if a
link extracted from a fetched page has already been visited?
If so, this approach will, obviously, not scale.
AltaVista folks use a smaller set of popular URLs and spatially(?)
close URLs in memory, and keep everything else on disk. That way they
don't require a lot of RAM for storing that, and disk accesses for 'has
this link already been seen?' checks are infrequent.

> - have a more efficient means of accessing the links than in a
> regular SQL
> table with {referer, target} pairs. The Meta-Info store mentioned in
> the
> WebRace crawler may be a solution (it's a plain text file that
> contains all
> document meta-data and whose index is held in main memory), but it
> prevents
> the URLs from being sorted other ways (e.g. all INlinks to a
> document), which is what I need for my further studies.

I think pulling links out of the DB is not such a huge problem if you
do it in batches, but updating each individual link's row in the DB
will be. I don't know a way around that.

> > The crawler fetches all pages, and they go through your component
> > pipeline and get processed.
> > What happens if after fetching 100 links from this batch of 1000
> the
> > crawler thread dies? Do you keep track of which links in that
> batch
> > you've crawled, so that in case the thread dies you don't recrawl
> > those?
> > That's roughly what I meant.
>
> First of all, I have invested a lot of time to prevent any threads
> from
> dying. That's a reason why I took HTTPClient, because it has never
> hung so
> far. A lot of exceptions are caught on the task level. I've had a lot
> of
> problems with hanging threads when I still used the
> java.net.URLConnection
> classes, but no more.
> I have also learned that "whatever can go wrong, will go wrong, very
> soon".
> That is why I patched the HTTPClient classes to introduce a maximum
> file
> size to be fetched.
> I can imagine some sort of crawler trap when a server process sends
> characters very slowly, as it is used in some spam filters. That's
> where the
> ThreadMonitor comes in. Each task publishes its state (i.e. "loading
> data"),
> and the ThreadMonitor restarts it when it remains in a state for too
> long.
> That's the place where the ThreadMonitor could save the rest of the
> batch.
> This way, the ThreadMonitor could become the single point of failure,
> but
> the risk that this thread is hanging is reduced by keeping it simple.
> Just
> like a watchdog hardware that looks that traffic lights work at a
> street crossing...
>
> Regards,
>
> Clemens

Otis

__________________________________________________
Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup
http://fifaworldcup.yahoo.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>