Mailing List Archive: LARM Web Crawler: LuceneStorage [experimental]

Jun 17, 2002, 6:11 PM

Hi,

I have added an experimental version of a LuceneStorage to the LARM crawler,
available from CVS in lucene-sandbox. Crawled documents can now be indexed
directly into a Lucene index.

Sorry, no configuration files yet. Configuration is done in
...larm/FetcherMain.java
The main class FetcherMain is currently set up to store the contents in a
Lucene index called "luceneIndex".

Lots of open questions:
- LARM has no notion of shutting everything down cleanly. What happens if
IndexWriter is interrupted?
- I haven't tried to read from the index yet...
- How should all this be configured from a config file?
... (it's late)
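
One way the shutdown question could be approached, as a rough sketch only: register a JVM shutdown hook that closes the storage exactly once. ShutdownSketch is a hypothetical helper, not a LARM or Lucene class, and the Closeable is just a stand-in for whatever object ends up owning the IndexWriter. A hook like this runs on normal exit and on Ctrl-C, but not when the process is killed hard, so it does not answer what happens to a half-written index.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of LARM: closes a resource at most once,
// either explicitly or from a JVM shutdown hook.
final class ShutdownSketch {
    private final Closeable resource;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    ShutdownSketch(Closeable resource) {
        this.resource = resource;
    }

    // Registers the hook; returns the thread so a caller could remove it again.
    Thread install() {
        Thread hook = new Thread(this::closeOnce, "storage-shutdown");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    // Returns true only for the call that actually closed the resource.
    boolean closeOnce() {
        if (!closed.compareAndSet(false, true)) {
            return false;
        }
        try {
            resource.close();
        } catch (IOException e) {
            System.err.println("close failed: " + e.getMessage());
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for an index writer; an assumption for illustration only.
        Closeable fakeWriter = () -> System.out.println("index closed");
        new ShutdownSketch(fakeWriter).install();
        System.out.println("crawling...");
    }
}
```

The compare-and-set guard is there because the crawler might also want to close the storage itself when it finishes normally, and the hook would otherwise close it a second time.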

Please try it:

To build and run it:
- put Ant in your path
- provide a build.properties with the location of the Lucene jar file
(lucene.jar=), just like javacc in lucene/build.xml
- put HTTPClient.jar from http://innovation.ch/java and the jakarta-oro library
into libs
- type:

ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>

Example:

ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5

Note: restrictto is a regular expression. The URLs tested against it are
normalized beforehand: they are made lower case, index.* is removed, and some
other corrections are applied (see URLNormalizer.java for details).
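
To illustrate the note above, here is a minimal sketch of how normalization and the restrictto check could fit together. It is not the code in URLNormalizer.java: the class RestrictToSketch, the index-file pattern, and the choice to require a match on the whole URL are all assumptions made for illustration, and the "other corrections" are left out.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; the real rules live in URLNormalizer.java and may differ.
final class RestrictToSketch {
    // Assumed rule: drop a trailing "index.<something>" path segment.
    private static final Pattern INDEX_FILE = Pattern.compile("/index\\.[^/?#]*$");

    // Lower-cases the URL, then strips a trailing index file if present.
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return INDEX_FILE.matcher(lower).replaceFirst("/");
    }

    // True if the normalized URL matches the restrictto expression as a whole.
    static boolean isAllowed(String url, Pattern restrictTo) {
        return restrictTo.matcher(normalize(url)).matches();
    }

    public static void main(String[] args) {
        Pattern restrictTo = Pattern.compile("http://localhost.*");
        System.out.println(normalize("HTTP://LocalHost/Docs/Index.html"));
        System.out.println(isAllowed("HTTP://LocalHost/Docs/Index.html", restrictTo));
        System.out.println(isAllowed("http://example.com/", restrictTo));
    }
}
```

Because restrictto is a regular expression and not a prefix, http://localhost.* also matches hosts such as http://localhost.example.org/. An expression like http://localhost/.* is stricter.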

Note: LuceneStorage is dumb; it just takes the WebDocument and stores it.
With the current config that means the tags are stored too, and there is only
one "content" field that contains everything. I plan to write another storage
that uses the HTMLDocument from the demo package to store HTML documents.

Please note that when this storage is added to the storage pipeline, the whole
crawling process becomes CPU-bound instead of I/O-bound. We already have plans
for how to do the distribution.

Feel free to contact me if there are questions.
Still Looking For Contributors!

Clemens

--------------------------------------
http://www.cmarschner.net

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Re: LARM Web Crawler: LuceneStorage [experimental]

otis_gospodnetic at yahoo

Jun 18, 2002, 2:55 PM

I see nice progress here.
I will try it in the near future (time permitting!).

> I have added an experimental version of a LuceneStorage to the LARM
> crawler, available from CVS in lucene-sandbox. That means crawled
> documents can now directly be indexed into a lucene index.
>
> Sorry, no configuration files yet. Config is done in
> ...larm/FetcherMain.java
> The main class FetcherMain is now configured to store the contents in
> a lucene index called "luceneIndex".
>
> Lots of open questions:
> - LARM doesn't have the notion of closing everything down. What
> happens if IndexWriter is interrupted?

As in, what if it encounters an exception (e.g. somebody removes the
index directory)? I guess one of the items that should then get added to
the to-do list is checkpointing, for starters.

> - I haven't tried to read from the index yet...

Heh, I'm familiar with that situation.

> - How to configure the stuff from a config file
> ... (it's late)

Property file with name=value pairs and some init() method that is
called at the beginning may be sufficient.
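
Something along these lines is what I have in mind, as a rough sketch only. CrawlerConfigSketch is a hypothetical class, not existing LARM code; the key names simply mirror the -D options from the command line, and the defaults for restrictto and threads are my own assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical config holder; nothing here is existing LARM code.
final class CrawlerConfigSketch {
    String startUrl;
    String restrictTo;
    int threads;

    // Reads name=value pairs; meant to be called once before the crawl starts.
    void init(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        startUrl = require(props, "start");
        // Assumed defaults: no restriction, single thread.
        restrictTo = props.getProperty("restrictto", ".*").trim();
        threads = Integer.parseInt(props.getProperty("threads", "1").trim());
    }

    // Fails early when a mandatory key is missing or empty.
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        String text = "start=http://localhost/\n"
                + "restrictto=http://localhost.*\n"
                + "threads=5\n";
        CrawlerConfigSketch config = new CrawlerConfigSketch();
        config.init(new StringReader(text));
        System.out.println(config.startUrl + " with " + config.threads + " threads");
    }
}
```

One thing to watch with property files: a backslash is an escape character there, so a restrictto expression containing one has to double it in the file.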

> Please try it:
>
> To build and run it,
> - put ANT in your path
> - provide a build.properties with the location of the lucene Jar file
> (lucene.jar=)
> (just like javacc in lucene/build.xml)
> - put HTTPClient.jar from http://innovation.ch/java and jakarta-oro
> library into libs
> - type:
>
> ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>
>
> ex.:
> ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5
>
> note: restrictto is a regular expression; the URLs tested against it
> are normalized beforehand, which means they are made lower case,
> index.* are removed, and some other corrections
> (see URLNormalizer.java for details)

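Removing index.* may be too aggressive and incorrect in some situations,
e.g. when a server's default document is something other than index.*, so
that /dir/ and /dir/index.html are different pages.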

> note: LuceneStorage is dumb; it just takes the WebDocument and stores
> it. That means with the current config it also stores tags, and only
> one "content" field that contains everything. I plan to write another
> storage that uses the HTMLDocument from the demo package to store HTML
> documents.

Nice.
I found NekoHTML to do a nice job of 'dehtmlization'.

> Please note that when adding this storage to the storage pipeline,
> the whole crawling process becomes CPU- instead of I/O bound. We
> already have plans how to do the distribution.
>
> Feel free to contact me if there are questions.
> Still Looking For Contributors!
>
> Clemens

Excellent!

Otis
