I'd like to add my +1 to the proposal and my +1 to keeping Lucene as
a library that can exist separately from the applications. Perhaps the
applications should be separate targets in the Lucene project (and build
process) or perhaps they can be separate projects. I think keeping them
together would be good because Lucene's APIs may need to evolve to
support these applications better and because this will help ensure that
changes to the Lucene API are reflected in the applications as soon as
they are made, rather than with the lag that can occur if the
applications are treated as separate, dependent projects.
See below for some additional ideas for the crawler.
Mark Tucker wrote:
>I like what you included in your proposal and suggest doing all that (over time) and taking the following into consideration:
>
>Indexers/Crawlers
>
> General Settings
> SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
> IndexerTimeout - kill this crawler thread after long period of inactivity
> IncludeFilter - include only items matching filter
> ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
>
I'm actually working on a crawler right now, but it is a derivative of
WebSPHINX. The original WebSPHINX has not changed in a very long time,
and it is currently licensed under the LGPL. Perhaps we can get
permission from the copyright holders to relicense it under the APL (or
do we even need to?). I made a number of bug fixes to it, added
rudimentary support for cookies, and added support for HTTP redirects.
One thing that I
like in WebSPHINX is that it has a forgiving HTML parser that can deal
with many kinds of broken HTML. Also, it has a very interesting
framework for analyzing parsed content, but this goes beyond the
requirements for use with Lucene.
I use the crawler with Lucene, but there is a layer of application
classes between the two, so the kind of integration that has been
proposed here has not yet been done. Anyway, I found that in addition to
the Include and Exclude filters, it is helpful to be able to say that
you want some page "expanded" (i.e. parsed and links followed), but not
"indexed" (i.e. added to Lucene's index). And vice versa: sometimes it
is useful to index a page but not expand it. Also, filters can be
evaluated on links before they are followed, and then a second time on
the final URLs of the pages retrieved. Normally the two are the same, but
HTTP redirects can force the final URL to be something very different
from the original link.
Perhaps one way to represent these conditions is to have the following
"language" instead of include and exclude filters:
"include:" regex
"exclude:" regex
"noindex:" regex
"noexpand:" regex
The first two work as the include/exclude, but for things that pass
these two, the others add handling properties that are used in
processing the link and the page. Disclaimer: I'm experimenting with
this now and these ideas are only about two days old, so please take
them as such. Since we got into the discussion, I figured I'd put them
on the table.
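To make the idea concrete, here is a rough sketch of how the four
filters could be evaluated. As I said, this is a two-day-old idea, so
the class and method names are entirely hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed filter "language": include/exclude
// decide whether a URL is accepted at all, and noindex/noexpand refine
// how an accepted URL is handled. Since filters can be evaluated twice
// (on the link and again on the final URL after redirects), accepts()
// may be called for both.
public class CrawlFilter {
    private final Pattern include, exclude, noindex, noexpand;

    public CrawlFilter(String include, String exclude,
                       String noindex, String noexpand) {
        this.include = Pattern.compile(include);
        this.exclude = Pattern.compile(exclude);
        this.noindex = Pattern.compile(noindex);
        this.noexpand = Pattern.compile(noexpand);
    }

    /** A URL passes only if it matches include and does not match exclude. */
    public boolean accepts(String url) {
        return include.matcher(url).find() && !exclude.matcher(url).find();
    }

    /** Accepted URLs are added to the index unless they match noindex. */
    public boolean shouldIndex(String url) {
        return accepts(url) && !noindex.matcher(url).find();
    }

    /** Accepted URLs have their links followed unless they match noexpand. */
    public boolean shouldExpand(String url) {
        return accepts(url) && !noexpand.matcher(url).find();
    }
}
```

With this, a site map page could be marked noindex (expand it, follow
its links, but keep it out of the index), while a PDF could be marked
noexpand (index it, but do not try to follow links from it).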
>
> MaxItems - stops indexing after x items
> MaxMegs - stops indexing after x MB of data
>
> File System Indexer
> URLReplacePrefix - can crawl c:\ but expose URL as http://mysever/docs/
>
Question: does this information really belong in the index? Perhaps the
root path should be specified and each document tagged with a path
relative to it, while the URL used to prefix the document paths is given
once for the entire index and is easy to change.
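A sketch of what I mean, with the prefix applied only at display time so
that changing the server name needs no re-index (the class and field
names are made up for illustration):

```java
// Illustrative sketch: store only the path relative to the crawl root
// in the index, and keep the URL prefix as a single per-index setting
// that is applied when results are shown.
public class UrlMapper {
    private final String urlPrefix;   // e.g. "http://myserver/docs/"

    public UrlMapper(String urlPrefix) {
        // Normalize so the prefix always ends with a slash.
        this.urlPrefix = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
    }

    /** Convert a file path under the crawl root into the relative path
        stored in the index (forward slashes, no leading slash). */
    public static String toRelative(String root, String path) {
        String rel = path.substring(root.length()).replace('\\', '/');
        return rel.startsWith("/") ? rel.substring(1) : rel;
    }

    /** Build the externally visible URL from a stored relative path. */
    public String toUrl(String relativePath) {
        return urlPrefix + relativePath;
    }
}
```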
>
>
> Web Indexer
> HTTPUser
> HTTPPassword
> HTTPUserAgent
> ProxyServer
> ProxyUser
> ProxyPassword
> HTTPSCertificate
> HTTPSPrivateKey
>
Apache Commons has an HttpClient package that has some similar concepts
and even implements them to some degree. I found it still a bit rough
and dependent on JDK 1.3, but I believe fixing it would be easier than
writing a new one from scratch. It uses the notion of an HttpState, a
state container for an HTTP user agent that holds things like
authentication credentials and cookies. HTTPS support is easy to add
with JSSE (which is the approach taken by the HttpClient from Commons).
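The HttpState idea could be sketched like this. To be clear, this is
just an illustration of the concept, not the actual Commons HttpClient
API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of an HttpState-style container: one object holds
// everything that makes up a user agent's session (credentials keyed by
// realm, plus cookies) so a crawler can carry it across requests.
public class CrawlerState {
    private final Map<String, String[]> credentials = new HashMap<>();
    private final List<String> cookies = new ArrayList<>();

    /** Record a user/password pair for an authentication realm. */
    public void setCredentials(String realm, String user, String password) {
        credentials.put(realm, new String[] { user, password });
    }

    /** Look up credentials for a realm, or null if none were set. */
    public String[] getCredentials(String realm) {
        return credentials.get(realm);
    }

    /** Remember a cookie received from a server. */
    public void addCookie(String cookie) {
        cookies.add(cookie);
    }

    public List<String> getCookies() {
        return cookies;
    }
}
```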
>
>
> Other Possible Indexers
> Microsoft Exchange 5.5/2000
> Lotus Notes
> Newsgroup (NNTP)
> Documentum
> ODBC/OLEDB
> XML - index single XML that represents multiple documents
>
One idea that might prove useful is to add a "DocumentFetcher" in
addition to the DocumentIndexer. The two would go hand in hand: document
entries created in Lucene by a particular Indexer could be understood by
a corresponding Fetcher. The Fetcher would then encapsulate retrieving
source documents or creating useful pointers to them (like URLs).
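A rough sketch of the pairing, with all names hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of the Indexer/Fetcher pairing: each Indexer
// writes entries that its matching Fetcher knows how to resolve back
// into a pointer to the source document. None of this is Lucene API.
public class FetcherPair {

    public interface DocumentFetcher {
        /** Turn an index entry's stored fields back into a pointer
            (such as a URL) to the source document. */
        String resolve(Map<String, String> fields);
    }

    /** Example pairing: a file-system indexer stores a relative path,
        and its fetcher rebuilds the externally visible URL from it. */
    public static class FileSystemFetcher implements DocumentFetcher {
        private final String urlPrefix;

        public FileSystemFetcher(String urlPrefix) {
            this.urlPrefix = urlPrefix;
        }

        public String resolve(Map<String, String> fields) {
            return urlPrefix + fields.get("relativePath");
        }
    }
}
```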
Another idea is to split the document storage and "envelope" from its
content. The content has a MIME type and can be handed to a parser,
passed to a document factory, mapped to fields, and so on. However, the
logic of retrieving a PDF file from a Lotus Notes database (and creating
a URL to point back to it) is different from getting the same PDF file
from the file system. The same parser and document factory can still be
used, though.
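The envelope/content split might look something like this (illustrative
names only): the envelope carries the source-specific details, while the
content is just bytes plus a MIME type, so one parser registry serves
every source.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch of the envelope/content split: the parser is chosen by MIME
// type alone, regardless of whether the bytes came from the file
// system, Lotus Notes, or HTTP.
public class ContentDispatcher {

    public interface ContentParser {
        /** Extract indexable text from a stream of content bytes. */
        String extractText(InputStream content);
    }

    private final Map<String, ContentParser> parsersByMimeType = new HashMap<>();

    public void register(String mimeType, ContentParser parser) {
        parsersByMimeType.put(mimeType, parser);
    }

    /** Dispatch to the parser registered for this MIME type. */
    public String parse(String mimeType, InputStream content) {
        ContentParser parser = parsersByMimeType.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for " + mimeType);
        }
        return parser.extractText(content);
    }
}
```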
>
>
>Document Factory
> General
> The minimum properties for each document should be:
> URL
> Title
> Abstract
> Full Text
> Score
>
> HTML
>        Support for META tags including Dublin Core syntax
>
> Other Possible Document Factories
> Office Docs - DOC, XLS, PPT
> PDF
>
>
>Thanks for the great proposal.
>
Yes! Absolutely! Great proposal!
--Dmitry