Hi,
I have added an experimental version of a LuceneStorage to the LARM crawler,
available from CVS in lucene-sandbox. That means crawled documents can now
be indexed directly into a Lucene index.
Sorry, no configuration files yet. Config is done in
...larm/FetcherMain.java
The main class FetcherMain is now configured to store the contents in a
lucene index called "luceneIndex".
Lots of open questions:
- LARM doesn't have the notion of closing everything down. What happens if
IndexWriter is interrupted?
- I haven't tried to read from the index yet...
- How to configure the stuff from a config file
... (it's late)
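On the first open question, one possible approach (a sketch, not something
LARM does today) is a JVM shutdown hook that closes the writer when the
process is asked to stop. The names ShutdownSketch and CloseOnce are made up
for illustration, and a generic java.io.Closeable stands in for the real
IndexWriter. A hook like this does not run if the JVM is killed hard.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownSketch {

    /** Wraps a resource so that the underlying close() runs at most once. */
    static final class CloseOnce implements Closeable {
        private final Closeable target;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        CloseOnce(Closeable target) {
            this.target = target;
        }

        @Override
        public void close() throws IOException {
            // Only the first caller gets through; later calls are no-ops.
            if (closed.compareAndSet(false, true)) {
                target.close();
            }
        }

        boolean isClosed() {
            return closed.get();
        }
    }

    /** Registers a hook that closes the resource on orderly JVM shutdown. */
    static Thread registerShutdownHook(final Closeable resource) {
        Thread hook = new Thread(new Runnable() {
            public void run() {
                try {
                    resource.close();
                } catch (IOException e) {
                    System.err.println("close failed: " + e);
                }
            }
        });
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

In the crawler, the wrapped resource would be whatever object owns the
IndexWriter (hypothetically the LuceneStorage itself), so that both the normal
end of a crawl and an interrupt go through the same close path.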
Please try it:
To build and run it,
- put ANT in your path
- provide a build.properties with the location of the lucene Jar file
(lucene.jar=)
(just like javacc in lucene/build.xml)
- put HTTPClient.jar from http://innovation.ch/java and jakarta-oro library
into libs
- type:
ant run -Dstart=<starturl> -Drestrictto=<restricttourl> -Dthreads=<numThreads>
ex.:
ant run -Dstart=http://localhost/ -Drestrictto=http://localhost.* -Dthreads=5
note: restrictto is a regular expression; the URLs tested against it are
normalized beforehand, which means they are made lower case, index.* is
removed, and some other corrections are applied (see URLNormalizer.java for
details)
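To illustrate how normalization interacts with restrictto, here is a minimal
sketch. It is not the code in URLNormalizer.java: the class name
RestrictToSketch and the exact rules (lower-casing the whole URL, dropping a
trailing index.* file name, requiring the pattern to match the whole URL) are
assumptions based only on the note above, and it uses java.util.regex rather
than jakarta-oro.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class RestrictToSketch {

    // Assumed rule: a final path segment named "index.<something>" is dropped.
    private static final Pattern INDEX_FILE = Pattern.compile("/index\\.[^/?#]*$");

    /** Lower-cases the URL and strips a trailing index.* file name. */
    static String normalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return INDEX_FILE.matcher(lower).replaceFirst("/");
    }

    /** Tests the normalized URL against the restrictto expression (full match assumed). */
    static boolean isAllowed(String url, String restrictTo) {
        return Pattern.matches(restrictTo, normalize(url));
    }

    public static void main(String[] args) {
        String restrictTo = "http://localhost.*";
        System.out.println(normalize("HTTP://LocalHost/Docs/Index.html"));
        System.out.println(isAllowed("HTTP://LocalHost/Docs/Index.html", restrictTo));
        System.out.println(isAllowed("http://example.org/", restrictTo));
    }
}
```

Because the URLs are lower-cased before the test, a restrictto expression
containing upper-case letters would never match under these assumptions.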
note: LuceneStorage is dumb; it just takes the WebDocument and stores it.
That means with the current config it also stores the tags, and there is only
one "content" field that contains everything. I plan to write another storage
that uses the HTMLDocument from the demo package to store HTML documents.
Please note that when this storage is added to the storage pipeline, the whole
crawling process becomes CPU-bound instead of I/O-bound. We already have plans
for how to do the distribution.
Feel free to contact me if there are questions.
Still Looking For Contributors!
Clemens
--------------------------------------
http://www.cmarschner.net
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>