
Notes about webcrawler-LARM contribution
Hello,

A few notes about the webcrawler-LARM contribution, which I just
imported into the Lucene Sandbox. I will put these notes in the
contribution's README.txt later as well.

- This contribution requires:
a) HTTPClient (not Jakarta's, but this one:
http://www.innovation.ch/java/HTTPClient/)
b) the Jakarta ORO package for regular expressions

- The original archive file that I got from Clemens had ORO and
HTTPClient in its lib directory. I don't think we should include
those there, so I took them out.

- This contribution also uses a third-party (X?)HTML parser, which is
included.
I am not sure whether Clemens modified this parser in any way. If
not, maybe we don't have to include it and can instead just add it to
the list of required packages.

- There is no Ant build file yet, just a build.sh script.
A build.xml for this contribution should be really simple to write.

- The key classes are documented fairly well; the less central ones
are not, but Clemens told me yesterday that he wants to document them
more. I got the feeling that he wants to do it soon/now.

- Clemens would be happy to use the Lucene Sandbox repository for
further development. I would like to give him access to this
repository. That will eliminate dealing with diffs, patching,
conflicts, etc., and one of the reasons for having the sandbox as a
separate repository was to allow access to a broader group of
developers. I will send a separate email asking for +1s.

- Uh, it just occurred to me that I only looked at about a dozen
classes, compiled the code, etc., but I have not actually tried
running it. Oops. I do get the feeling, from looking at the code,
that it will run as documented.

- This code requires(?) JDK 1.4, as it uses the assert keyword.
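
For anyone wondering why the assert keyword ties the code to JDK 1.4:
assert became a language keyword in that release, and the 1.4 compiler
accepts it only when invoked with -source 1.4. Here is a minimal
sketch of the construct. The class AssertDemo and the method nextDepth
are hypothetical names made up for illustration, not taken from the
LARM sources.

```java
// Hypothetical illustration only; these names do not come from LARM.
class AssertDemo {

    // Returns depth + 1. The assert documents the precondition and is
    // checked only when the JVM runs with assertions enabled (-ea).
    static int nextDepth(int depth) {
        assert depth >= 0 : "depth must not be negative: " + depth;
        return depth + 1;
    }

    // Reports whether assertions are enabled. The assignment inside
    // the assert is executed only when assertions are switched on.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println("nextDepth(0) = " + nextDepth(0));
    }
}
```

Note that assertions are disabled at run time by default, so the
checks only fire when the JVM is started with -ea (or
-enableassertions).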


That's all I can think of for now.
Clemens is subscribed to this list as well, so if you have questions
you can post them here.

Otis

