Mailing List Archive

LARM Crawler: Status
Hi,

I just want to keep you informed about how we plan to integrate the LARM
crawler with Lucene. I'm working with Mehran Mehr on two major topics:

1. Lucene storage: We want to see a web document as a bunch of name-value
pairs, one of which is the URL while another could be the document itself.
From within the storage pipeline, these web documents can be enhanced or
changed. At the end sits the Lucene storage, which takes a web document
and stores its contents as fields within a Lucene index. So the storage
itself is stupid. We can think of a lot of preprocessing steps that can
occur before the store process itself takes place: document conversion,
HTML removal, header extraction, lemmatization and other linguistic
processing, and so forth. The storage itself can also be only an
intermediary step: web documents could also be saved to plain files or a
JMS topic, allowing the processing steps to be divided in a temporal or
spatial manner.
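To make that concrete, here is a rough sketch of what the final "stupid"
step could look like against the Lucene 1.x API (the web document is
simplified to a Hashtable of strings here; the real LARM interfaces will
differ):

    import java.util.Enumeration;
    import java.util.Hashtable;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /** Sketch: copy every name-value pair of a web document into a
     *  Lucene Document. The storage knows nothing about the pairs. */
    public class LuceneStorage {
        private IndexWriter writer;

        public LuceneStorage(String indexDir) throws java.io.IOException {
            // 'true' creates a fresh index; a long-running crawler
            // would open an existing one with 'false'
            writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        }

        public void store(Hashtable webDoc) throws java.io.IOException {
            Document doc = new Document();
            for (Enumeration e = webDoc.keys(); e.hasMoreElements();) {
                String name  = (String) e.nextElement();
                String value = (String) webDoc.get(name);
                if (name.equals("url"))
                    doc.add(Field.Keyword(name, value)); // stored, untokenized
                else
                    doc.add(Field.Text(name, value));    // stored, tokenized
            }
            writer.addDocument(doc);
        }

        public void close() throws java.io.IOException {
            writer.optimize();
            writer.close();
        }
    }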

2. Configuration: The crawler is very modular and mainly consists of
several producer/consumer pipelines that define where documents come from
and how they are processed. We want this whole pipeline to be configurable
(remember, most of it is still done from within the source code). That way
we can provide different configurations for different purposes: one could
mimic the behavior of "wget", for example, another could build a fast
one-machine crawler for a medium-size intranet, while a third
configuration could be distributed and crawl a major part of the web.

As soon as we have done these two things, I think we can move the crawler
and Lucene a bit closer together.

We are still looking for people to help us. If you have resources left for
further development (design, code, test), please read the technical
overview document and the TODO.txt files in the lucene-sandbox repository,
and contact me.

--Clemens




Re: LARM Crawler: Status
Clemens,

> I just want to keep you informed about how we plan to integrate the
> LARM crawler with Lucene. I'm working with Mehran Mehr on two major
> topics:

Just curious - is that the Iranian Olympic silver medalist from 1996?

> 1. Lucene storage: We want to see a web document as a bunch of
> name-value pairs, one of which is the URL while another could be the
> document itself.

If you are interested, I can send you a class that is written as a
NekoHTML Filter, which I use for extracting title, body, meta keywords
and description.
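Roughly, it has the following shape (a from-memory sketch against the
Xerces XNI interfaces that NekoHTML filters extend; the real class does a
bit more):

    import org.apache.xerces.xni.Augmentations;
    import org.apache.xerces.xni.QName;
    import org.apache.xerces.xni.XMLAttributes;
    import org.apache.xerces.xni.XMLString;
    import org.cyberneko.html.filters.DefaultFilter;

    /** Collects title, body text and meta keywords/description while
     *  the page streams through the NekoHTML filter chain. */
    public class ExtractFilter extends DefaultFilter {
        public StringBuffer title = new StringBuffer();
        public StringBuffer body  = new StringBuffer();
        public String keywords = "", description = "";
        private boolean inTitle = false;

        public void startElement(QName el, XMLAttributes attrs,
                                 Augmentations augs) {
            if (el.rawname.equalsIgnoreCase("title")) {
                inTitle = true;
            } else if (el.rawname.equalsIgnoreCase("meta")) {
                String name = attrs.getValue("name");
                if ("keywords".equalsIgnoreCase(name))
                    keywords = attrs.getValue("content");
                else if ("description".equalsIgnoreCase(name))
                    description = attrs.getValue("content");
            }
            super.startElement(el, attrs, augs); // keep the chain going
        }

        public void endElement(QName el, Augmentations augs) {
            if (el.rawname.equalsIgnoreCase("title")) inTitle = false;
            super.endElement(el, augs);
        }

        public void characters(XMLString text, Augmentations augs) {
            (inTitle ? title : body).append(text.toString());
            super.characters(text, augs);
        }
    }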

> From within the storage pipeline, these web documents can be enhanced
> or changed. At the end sits the Lucene storage, which takes a web
> document and stores its contents as fields within a Lucene index. So
> the storage itself is stupid. We can think of a lot of preprocessing
> steps that can occur before the store process itself takes place:
> document conversion, HTML removal, header extraction, lemmatization and
> other linguistic processing, and so forth. The storage itself can also
> be only an intermediary step: web documents could also be saved to
> plain files or a JMS topic, allowing the processing steps to be divided
> in a temporal or spatial manner.

Have I mentioned the k2d2.org framework here before?
I read about it in JavaPro a few months ago and chose it for an
application that I was/am writing. It allows for a very elegant and
simple (in terms of use) producer/consumer pipeline.
I've actually added a bit of functionality to the version that's at
k2d2.org and sent it to the author, who will, I believe, include it in
the new version.
Also, the framework is getting support for distributed consumer pipelines
over different communication protocols (JMS, RMI, BEEP...). That is not
available yet, but the author told me about it over a month ago.

> 2. Configuration: The crawler is very modular and mainly consists of
> several producer/consumer pipelines that define where documents come
> from and how they are processed. We want this whole pipeline to be
> configurable (remember, most of it is still done from within the source
> code). That way we can provide different configurations for different
> purposes: one could mimic the behavior of "wget", for example, another
> could build a fast one-machine crawler for a medium-size intranet,
> while a third configuration could be distributed and crawl a major part
> of the web.

The k2d2.org stuff doesn't have anything that allows for dynamic
configurations, but it may still be good to use, because then you don't
have to worry about developing, maintaining, and fixing yet another
component that should really be just a piece of infrastructure on top of
which you construct your application-specific logic.

> As soon as we have done these two things, I think we can move the
> crawler and Lucene a bit closer together.
>
> We are still looking for people to help us. If you have resources left
> for further development (design, code, test), please read the technical
> overview document and the TODO.txt files in the lucene-sandbox
> repository, and contact me.

I will try to test it.

Otis


Re: LARM Crawler: Status // Avalon?
> If you are interested, I can send you a class that is written as a
> NekoHTML Filter, which I use for extracting title, body, meta keywords
> and description.

Sure, send it over. But isn't the example packaged with Lucene doing the
same?

> Have I mentioned the k2d2.org framework here before?
> I read about it in JavaPro a few months ago and chose it for an
> application that I was/am writing. It allows for a very elegant and
> simple (in terms of use) producer/consumer pipeline.
> I've actually added a bit of functionality to the version that's at
> k2d2.org and sent it to the author, who will, I believe, include it in
> the new version.
> Also, the framework is getting support for distributed consumer
> pipelines over different communication protocols (JMS, RMI, BEEP...).
> That is not available yet, but the author told me about it over a month
> ago.

Hmm.. I'll have a look at it. But keep in mind that the current solution is
working already, and we probably only need one very simple way to transfer
the data.

> > We want this whole pipeline to be configurable
> > (remember, most of it is still done from within the source code).
> ...
> The k2d2.org stuff doesn't have anything that allows for dynamic
> configurations, but it may still be good to use, because then you don't
> have to worry about developing, maintaining, and fixing yet another
> component that should really be just a piece of infrastructure on top
> of which you construct your application-specific logic.

Yep, right; that's what I hate about C++ programs (also known as
'yet-another-linked-list implementations' :-)). I'll have a look at it; I
just think the patterns used in LARM are probably too simple to be worth
the exchange. But I'll see.

By the way, I thought about the "putting it all together in config files"
thing: it's probably sufficient to have a couple of applications (main
classes) that put the basic stuff together and whose parts are then
configurable through property files. At least for now.
It's just a feeling, but I fear some things could become very nasty if we
have to invent a declarative configuration language that describes the
configuration of the pipelines, or whose components at least tell the
configuring class which other components they need to know about... (oh,
that looks like we need component-based development...)
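Something like this is what I imagine; the property names and the
storage.class trick are made up just to illustrate:

    import java.io.FileInputStream;
    import java.util.Properties;

    /** Sketch of the "main class + property file" idea: the main class
     *  fixes the pipeline shape, the property file supplies the knobs.
     *  All property and class names here are invented. */
    public class CrawlMain {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.load(new FileInputStream(args[0])); // e.g. intranet.properties

            int threads  = Integer.parseInt(
                props.getProperty("fetcher.threads", "10"));
            String start = props.getProperty("crawl.startUrl");

            // components are picked by class name, so a different
            // property file yields a different pipeline without
            // touching the code
            Object storage =
                Class.forName(props.getProperty("storage.class")).newInstance();

            System.out.println("would start " + threads + " fetchers at "
                + start + ", storing via " + storage.getClass().getName());
        }
    }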

>> Lots of open questions:
>> - LARM doesn't have the notion of closing everything down. What
>> happens if IndexWriter is interrupted?

I must add that in general I don't have experience with using Lucene
incrementally, that is, updating the index while others are using it. Is
that working smoothly?

> As in what if it encounters an exception (e.g. somebody removes the
> index directory)? I guess one of the items that should then maybe get
> added to the to-do list is checkpointing, for starters.

Hm... what do you mean...?
From what I understand you mean that then the doc is stored in a repository
until the index is available again...? [confused]


One last thought:
- the crawler should be started as a daemon process (at least
- it should wake up from time to time to crawl changed pages
- it should provide a management and status interface to the outside.
- it internally needs the ability to run service jobs while crawling
(keeping memory tidy, collecting stats, etc.)

From what I know, these matters could be addressed by the Apache
Avalon/Phoenix project. Does anyone know anything about it?
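
Independent of Avalon, here is the bare-bones shape of the daemon loop I
have in mind (both hooks are placeholders):

    /** A thread that wakes up periodically to re-crawl changed pages
     *  and to run service jobs in between. */
    public class CrawlerDaemon extends Thread {
        private volatile boolean running = true;
        private long intervalMillis;

        public CrawlerDaemon(long intervalMillis) {
            this.intervalMillis = intervalMillis;
            setDaemon(true);
        }

        public void run() {
            while (running) {
                recrawlChangedPages();   // placeholder hooks
                collectStats();
                try { Thread.sleep(intervalMillis); }
                catch (InterruptedException e) { running = false; }
            }
        }

        public void shutdown() { running = false; interrupt(); }

        private void recrawlChangedPages() { /* ... */ }
        private void collectStats()        { /* ... */ }
    }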

Clemens





Re: LARM Crawler: Status // Avalon?
Hello,

--- Clemens Marschner <cmad@lanlab.de> wrote:
> > If you are interested, I can send you a class that is written as a
> > NekoHTML Filter, which I use for extracting title, body, meta
> > keywords and description.
>
> Sure, send it over. But isn't the example packaged with Lucene doing
> the same?

It's attached. I'm sending it to the list, in case anyone searches the
list archives and needs code like this.

> > Have I mentioned the k2d2.org framework here before?
> > I read about it in JavaPro a few months ago and chose it for an
> > application that I was/am writing. It allows for a very elegant and
> > simple (in terms of use) producer/consumer pipeline.
> > I've actually added a bit of functionality to the version that's at
> > k2d2.org and sent it to the author, who will, I believe, include it
> > in the new version.
> > Also, the framework is getting support for distributed consumer
> > pipelines over different communication protocols (JMS, RMI, BEEP...).
> > That is not available yet, but the author told me about it over a
> > month ago.
>
> Hmm.. I'll have a look at it. But keep in mind that the current
> solution is working already, and we probably only need one very simple
> way to transfer the data.

I know, if it works it may not need fixing, but I thought you may want
to get rid of the infrastructure part of your code if there is
something that does it nicely already.

> > > We want this whole pipeline to be configurable
> > > (remember, most of it is still done from within the source code).
> > ...
> > The k2d2.org stuff doesn't have anything that allows for dynamic
> > configurations, but it may still be good to use, because then you
> > don't have to worry about developing, maintaining, and fixing yet
> > another component that should really be just a piece of
> > infrastructure on top of which you construct your
> > application-specific logic.
>
> Yep, right; that's what I hate about C++ programs (also known as
> 'yet-another-linked-list implementations' :-)). I'll have a look at it;
> I just think the patterns used in LARM are probably too simple to be
> worth the exchange. But I'll see.

This k2d2 framework is super simple to use. Register consumers, put
something in the front queue, extend a base class and override a single
method that takes an object and returns an object (or null if it
consumes it). Pipeline done.
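
In code, a stage then looks about like this; BaseConsumer is a stand-in I
made up for the real k2d2 base class, since I don't have the source in
front of me, but the single override is the whole story:

    // stand-in for the real k2d2 base class, just to show the shape
    abstract class BaseConsumer {
        public abstract Object consume(Object message);
    }

    public class HtmlStripStage extends BaseConsumer {
        public Object consume(Object message) {
            String html = (String) message;
            // return the stripped text so the next consumer gets it;
            // returning null would consume the message here
            return html.replaceAll("<[^>]*>", " ");
        }
    }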

> By the way, I thought about the "putting it all together in config
> files" thing: it's probably sufficient to have a couple of applications
> (main classes) that put the basic stuff together and whose parts are
> then configurable through property files. At least for now.
> It's just a feeling, but I fear some things could become very nasty if
> we have to invent a declarative configuration language that describes
> the configuration of the pipelines, or whose components at least tell
> the configuring class which other components they need to know about...
> (oh, that looks like we need component-based development...)

I don't have a better suggestion right now.

> >> Lots of open questions:
> >> - LARM doesn't have the notion of closing everything down. What
> >> happens if IndexWriter is interrupted?
>
> I must add that in general I don't have experience with using Lucene
> incrementally, that is, updating the index while others are using it.
> Is that working smoothly?

Yes, in my experience it works without problems.
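
Roughly, the pattern that has worked for me looks like this (paths made
up); the point is just that an open searcher keeps its snapshot of the
index until you reopen it:

    // an open searcher is not disturbed by a concurrent update...
    IndexSearcher searcher = new IndexSearcher("/tmp/index");

    IndexWriter writer =
        new IndexWriter("/tmp/index", new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(Field.Keyword("url", "http://example.org/"));
    doc.add(Field.Text("body", "some fetched page text"));
    writer.addDocument(doc);
    writer.close(); // flush; the old searcher still sees the old state

    // ...but it has to be reopened to see the new document
    searcher.close();
    searcher = new IndexSearcher("/tmp/index");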

> > As in what if it encounters an exception (e.g. somebody removes the
> > index directory)? I guess one of the items that should then maybe get
> > added to the to-do list is checkpointing, for starters.
>
> Hm... what do you mean...?
> From what I understand you mean that then the doc is stored in a
> repository until the index is available again...? [confused]

What I meant was this.
You have MySQL to hold your links.
You have N crawler threads.
You don't want to hit MySQL a lot, so you get links to crawl in batches
(e.g. each crawler thread tells MySQL: give me 1000 links to crawl).
The crawler fetches all pages, and they go through your component
pipeline and get processed.
What happens if the crawler thread dies after fetching 100 links from
this batch of 1000? Do you keep track of which links in that batch
you've already crawled, so that you don't recrawl those after a
restart?
That's roughly what I meant.
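
In code, a minimal version of that bookkeeping could look like this
(table and column names invented; 'LEASED' is the state a thread would
set when it grabs its batch of 1000):

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    /** Per-link checkpointing: a thread leases a batch, then marks each
     *  link DONE as it is processed; on restart, leased-but-not-done
     *  links are simply re-queued. */
    public class Checkpointer {
        private PreparedStatement markDone;

        public Checkpointer(Connection conn) throws Exception {
            markDone = conn.prepareStatement(
                "UPDATE links SET state = 'DONE' WHERE url = ?");
        }

        public void done(String url) throws Exception {
            markDone.setString(1, url);
            markDone.executeUpdate(); // could be batched to cut round trips
        }

        /** Run once at startup: whatever a dead thread had leased but
         *  not finished goes back into the queue. */
        public static void recover(Connection conn) throws Exception {
            conn.createStatement().executeUpdate(
                "UPDATE links SET state = 'NEW' WHERE state = 'LEASED'");
        }
    }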

> One last thought:
> - the crawler should be started as a daemon process (at least
> optionally)
> - it should wake up from time to time to crawl changed pages
> - it should provide a management and status interface to the outside.
> - it internally needs the ability to run service jobs while crawling
> (keeping memory tidy, collecting stats, etc.)
>
> From what I know, these matters could be addressed by the Apache
> Avalon/Phoenix project. Does anyone know anything about it?

To me Avalon looks relatively complex, but from what I've read it is a
piece of software designed to allow applications like your crawler to
run on top of it. I'm stating the obvious, for some.

Otis

