Hi,
I just want to keep you informed about how we plan to integrate the LARM
crawler with Lucene. I'm working with Mehran Mehr on two major topics:
1. Lucene storage: We want to treat a web document as a set of name-value
pairs, one of which is the URL and another could be the document itself.
From within the storage pipeline, these web documents can be enhanced or
changed. At the end of the pipeline is the Lucene storage, which takes a web
document and stores its contents as fields in a Lucene index. So the storage
itself is deliberately simple. We can think of a lot of preprocessing steps
that can occur before the store step itself takes place: document conversion,
HTML removal, header extraction, lemmatization and other linguistic
processing, and so forth. The storage can also be just an intermediate step:
web documents could instead be saved in plain files or sent to a JMS topic, so
that the processing steps can be separated in time or in space.
2. Configuration: The crawler is very modular and mainly consists of several
producer/consumer pipelines that define where documents come from and how
they are processed. We want this whole pipeline to be configurable (remember,
most of it is still set up from within the source code). That way, we can
provide different configurations for different purposes: one could mimic the
behavior of "wget", for example, another could build a fast single-machine
crawler for a medium-size intranet, while a third could be distributed and
crawl a major part of the web.
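
To make the two points a bit more concrete, here is a rough sketch of how
they could fit together. It is only an illustration, not the LARM code: every
class, method, and step name in it is made up, the "configuration" is just a
list of step names, and the in-memory storage merely stands in for a real
Lucene-backed one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from LARM or Lucene.
class PipelineSketch {

    // A web document is just a set of name-value pairs.
    static Map<String, String> webDocument(String url, String content) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("content", content);
        return doc;
    }

    // Preprocessing step: copy the <title> text into its own field.
    static Map<String, String> extractTitle(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        Matcher m = Pattern
                .compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(doc.get("content"));
        if (m.find()) {
            out.put("title", m.group(1).trim());
        }
        return out;
    }

    // Preprocessing step: crude tag removal, for illustration only.
    static Map<String, String> stripHtml(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        String text = doc.get("content").replaceAll("<[^>]*>", " ");
        out.put("content", text.replaceAll("\\s+", " ").trim());
        return out;
    }

    // Step registry: a configuration refers to steps by these names.
    static final Map<String, UnaryOperator<Map<String, String>>> STEPS = new LinkedHashMap<>();
    static {
        STEPS.put("extractTitle", PipelineSketch::extractTitle);
        STEPS.put("stripHtml", PipelineSketch::stripHtml);
    }

    // The storage does no processing; it only receives finished documents.
    interface Storage {
        void store(Map<String, String> doc);
    }

    // Stand-in for a Lucene storage that would write each pair as a field.
    static class MemoryStorage implements Storage {
        final List<Map<String, String>> stored = new ArrayList<>();

        @Override
        public void store(Map<String, String> doc) {
            stored.add(doc);
        }
    }

    // Run the configured steps in order, then hand the result to the storage.
    static void process(Map<String, String> doc, List<String> stepNames, Storage storage) {
        Map<String, String> current = doc;
        for (String name : stepNames) {
            UnaryOperator<Map<String, String>> step = STEPS.get(name);
            if (step == null) {
                throw new IllegalArgumentException("unknown step: " + name);
            }
            current = step.apply(current);
        }
        storage.store(current);
    }

    public static void main(String[] args) {
        MemoryStorage storage = new MemoryStorage();
        Map<String, String> doc = webDocument("http://example.org/",
                "<html><head><title>Hello</title></head><body><p>Some <b>text</b></p></body></html>");
        // The step list plays the role of a configuration file here.
        process(doc, Arrays.asList("extractTitle", "stripHtml"), storage);
        System.out.println(storage.stored.get(0));
    }
}
```

Swapping MemoryStorage for a file-based or JMS-based implementation, or
changing the step list, would not touch the steps themselves, which is the
kind of flexibility we are after.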
As soon as we have done these two things, I think we can move the crawler
and Lucene a bit closer together.
We are still looking for people to help us. If you have time to spare for
further development (design, code, test), please read the technical
overview document and the TODO.txt files in the lucene-sandbox repository,
and contact me.
--Clemens
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>