I'd like to add my +1 to the proposal and my +1 to keeping Lucene as
a library that can exist separately from the applications. Perhaps the
applications should be separate targets in the Lucene project (and build
process) or perhaps they can be separate projects. I think keeping them
together would be good because Lucene's APIs may need to evolve to
support these applications better and because this will help ensure that
changes to the Lucene API are reflected in the applications as soon as
they are made, rather than with the lag that can occur if the
applications are treated as separate, dependent projects.
See below for some additional ideas for the crawler.
Mark Tucker wrote:
>I like what you included in your proposal and suggest doing all that (over time) and taking the following into consideration:
>
>Indexers/Crawlers
>
> General Settings
> SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
> IndexerTimeout - kill this crawler thread after long period of inactivity
> IncludeFilter - include only items matching filter
> ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
>
I'm actually working on a crawler right now, but it is a derivative of
WebSPHINX. The original WebSPHINX has not changed in a very long time,
and it is currently licensed under the LGPL. Perhaps we can get
permission from the copyright holders to relicense it under the APL (or
do we even need to?). I made a number of bug fixes to it, added
rudimentary support for cookies, and added support for HTTP redirects.
One thing that I
like in WebSPHINX is that it has a forgiving HTML parser that can deal
with many kinds of broken HTML. Also, it has a very interesting
framework for analyzing parsed content, but this goes beyond the
requirements for use with Lucene.
I use the crawler with Lucene, but there is a layer of application
classes between the two, so the kind of integration that has been
proposed here has not yet been done. Anyway, I found that in addition to
the Include and Exclude filters, it is helpful to be able to say that
you want some page "expanded" (i.e. parsed and links followed), but not
"indexed" (i.e. added to Lucene's index). And vice versa: sometimes it
is useful to index a page but not expand it. Also, filters can be
evaluated on links before they are followed, and then a second time on
the final URLs of the pages retrieved. Normally the two are the same, but
HTTP redirects can force the final URL to be something very different
from the original link.
Perhaps one way to represent these conditions is to have the following
"language" instead of include and exclude filters:
"include:" regex
"exclude:" regex
"noindex:" regex
"noexpand:" regex
The first two work as the include/exclude, but for things that pass
these two, the others add handling properties that are used in
processing the link and the page. Disclaimer: I'm experimenting with
this now and these ideas are only about two days old, so please take
them as such. Since we got into the discussion, I figured I'd put them
on the table.
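To make the idea concrete, here is a rough sketch of how the four
filters could be evaluated. As I said, this is a two-day-old idea, so
the class and method names are entirely hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed filter "language": include/exclude
// decide whether a URL is accepted at all, and noindex/noexpand refine
// how an accepted URL is handled. Since filters can be evaluated twice
// (on the link and again on the final URL after redirects), accepts()
// may be called for both.
public class CrawlFilter {
    private final Pattern include, exclude, noindex, noexpand;

    public CrawlFilter(String include, String exclude,
                       String noindex, String noexpand) {
        this.include = Pattern.compile(include);
        this.exclude = Pattern.compile(exclude);
        this.noindex = Pattern.compile(noindex);
        this.noexpand = Pattern.compile(noexpand);
    }

    /** A URL passes only if it matches include and does not match exclude. */
    public boolean accepts(String url) {
        return include.matcher(url).find() && !exclude.matcher(url).find();
    }

    /** Accepted URLs are added to the index unless they match noindex. */
    public boolean shouldIndex(String url) {
        return accepts(url) && !noindex.matcher(url).find();
    }

    /** Accepted URLs have their links followed unless they match noexpand. */
    public boolean shouldExpand(String url) {
        return accepts(url) && !noexpand.matcher(url).find();
    }
}
```

With this, a site map page could be marked noindex (expand it, follow
its links, but keep it out of the index), while a PDF could be marked
noexpand (index it, but do not try to follow links from it).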
>
> MaxItems - stops indexing after x items
> MaxMegs - stops indexing after x MB of data
>
> File System Indexer
> URLReplacePrefix - can crawl c:\ but expose URL as http://mysever/docs/
>
Question: does this information really belong in the index? Perhaps the
root path should be specified and each document tagged with a path
relative to it, while the URL used to prefix the document paths is given
once for the entire index and is easy to change.
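A sketch of what I mean, with the prefix applied only at display time so
that changing the server name needs no re-index (the class and field
names are made up for illustration):

```java
// Illustrative sketch: store only the path relative to the crawl root
// in the index, and keep the URL prefix as a single per-index setting
// that is applied when results are shown.
public class UrlMapper {
    private final String urlPrefix;   // e.g. "http://myserver/docs/"

    public UrlMapper(String urlPrefix) {
        // Normalize so the prefix always ends with a slash.
        this.urlPrefix = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
    }

    /** Convert a file path under the crawl root into the relative path
        stored in the index (forward slashes, no leading slash). */
    public static String toRelative(String root, String path) {
        String rel = path.substring(root.length()).replace('\\', '/');
        return rel.startsWith("/") ? rel.substring(1) : rel;
    }

    /** Build the externally visible URL from a stored relative path. */
    public String toUrl(String relativePath) {
        return urlPrefix + relativePath;
    }
}
```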
>
>
> Web Indexer
> HTTPUser
> HTTPPassword
> HTTPUserAgent
> ProxyServer
> ProxyUser
> ProxyPassword
> HTTPSCertificate
> HTTPSPrivateKey
>
Apache Commons has an HttpClient package that has some similar concepts
and even implements them to some degree. I found it still a bit rough
and dependent on JDK 1.3, but I believe fixing it would be easier than
writing a new one from scratch. It uses the notion of an HttpState, a
state container for an HTTP user agent that holds things like
authentication credentials and cookies. HTTPS support is easy to add
with JSSE (which is the approach taken by the HttpClient from Commons).
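The HttpState idea could be sketched like this. To be clear, this is
just an illustration of the concept, not the actual Commons HttpClient
API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of an HttpState-style container: one object holds
// everything that makes up a user agent's session (credentials keyed by
// realm, plus cookies) so a crawler can carry it across requests.
public class CrawlerState {
    private final Map<String, String[]> credentials = new HashMap<>();
    private final List<String> cookies = new ArrayList<>();

    /** Record a user/password pair for an authentication realm. */
    public void setCredentials(String realm, String user, String password) {
        credentials.put(realm, new String[] { user, password });
    }

    /** Look up credentials for a realm, or null if none were set. */
    public String[] getCredentials(String realm) {
        return credentials.get(realm);
    }

    /** Remember a cookie received from a server. */
    public void addCookie(String cookie) {
        cookies.add(cookie);
    }

    public List<String> getCookies() {
        return cookies;
    }
}
```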
>
>
> Other Possible Indexers
> Microsoft Exchange 5.5/2000
> Lotus Notes
> Newsgroup (NNTP)
> Documentum
> ODBC/OLEDB
> XML - index single XML that represents multiple documents
>
One idea that might prove useful is to add a "DocumentFetcher" in
addition to the DocumentIndexer. The two would go hand in hand: document
entries created in Lucene by a particular Indexer could be understood by
a corresponding Fetcher. The Fetcher would then encapsulate retrieving
source documents or creating useful pointers to them (like URLs).
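A rough sketch of the pairing, with all names hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of the Indexer/Fetcher pairing: each Indexer
// writes entries that its matching Fetcher knows how to resolve back
// into a pointer to the source document. None of this is Lucene API.
public class FetcherPair {

    public interface DocumentFetcher {
        /** Turn an index entry's stored fields back into a pointer
            (such as a URL) to the source document. */
        String resolve(Map<String, String> fields);
    }

    /** Example pairing: a file-system indexer stores a relative path,
        and its fetcher rebuilds the externally visible URL from it. */
    public static class FileSystemFetcher implements DocumentFetcher {
        private final String urlPrefix;

        public FileSystemFetcher(String urlPrefix) {
            this.urlPrefix = urlPrefix;
        }

        public String resolve(Map<String, String> fields) {
            return urlPrefix + fields.get("relativePath");
        }
    }
}
```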
Another idea is to split the document storage and "envelope" from its
content. The content has a MIME type and can be handed to a parser,
passed to a document factory, mapped to fields, and so on. However, the
logic of retrieving a PDF file from a Lotus Notes database (and creating
a URL to point back to it) is different from getting the same PDF file
from the file system. The same parser and document factory can still be
used, though.
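The envelope/content split might look something like this (illustrative
names only): the envelope carries the source-specific details, while the
content is just bytes plus a MIME type, so one parser registry serves
every source.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch of the envelope/content split: the parser is chosen by MIME
// type alone, regardless of whether the bytes came from the file
// system, Lotus Notes, or HTTP.
public class ContentDispatcher {

    public interface ContentParser {
        /** Extract indexable text from a stream of content bytes. */
        String extractText(InputStream content);
    }

    private final Map<String, ContentParser> parsersByMimeType = new HashMap<>();

    public void register(String mimeType, ContentParser parser) {
        parsersByMimeType.put(mimeType, parser);
    }

    /** Dispatch to the parser registered for this MIME type. */
    public String parse(String mimeType, InputStream content) {
        ContentParser parser = parsersByMimeType.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for " + mimeType);
        }
        return parser.extractText(content);
    }
}
```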
>
>
>Document Factory
> General
> The minimum properties for each document should be:
> URL
> Title
> Abstract
> Full Text
> Score
>
> HTML
>        Support for META tags including Dublin Core syntax
>
> Other Possible Document Factories
> Office Docs - DOC, XLS, PPT
> PDF
>
>
>Thanks for the great proposal.
>
Yes! Absolutely! Great proposal!
--Dmitry