Mailing List Archive

Re: Proposal for Lucene [ In reply to ]
On Sat, 2002-02-09 at 07:58, Kelvin Tan wrote:
> Here it is. Released under APL (I kinda copied and pasted the license from
> some Fulcrum code). Some (current) limitations:
>
> 1. Only a single datasource is supported at this point in time (support for
> multiple datasources can be easily added through the configuration file and
> improving SearchConfiguration)
> 2. Documentation isn't really complete. (Is it ever?)
> 3. It's a filesystem-based indexer. It's not too difficult to decouple the
> filesystem bit and make it more generic, but I don't have a need for it
> presently.
> 4. A temp folder is needed for extracting Zip, GZip and Tar files. I tried
> using outputstreams but they turned out to be quite a nightmare...

Great, I'll take a look at all of this when I get back next week (I'm going
to Boston for a week and will be out of touch).

> 5. There's a JDBCDatasource for indexing a table from databases (the table
> stores metadata of the file to index. There should still be some way to
> obtain the file to index. This ties back to 3.). I really ought to provide
> an example on how to use it...
>

What's that good for...? Wouldn't one just create an index on the
database?

> Questions and feedback are really welcome.
>
> I've attached the source-only version, but there's a full version (with
> libs) at http://www.relevanz.com/search_full.zip.
>
--
www.superlinksoftware.com
www.sourceforge.net/projects/poi - port of Excel format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
- fix java generics!


The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: Proposal for Lucene [ In reply to ]
Here it is. Released under APL (I kinda copied and pasted the license from
some Fulcrum code). Some (current) limitations:

1. Only a single datasource is supported at this point in time (support for
multiple datasources can be easily added through the configuration file and
improving SearchConfiguration)
2. Documentation isn't really complete. (Is it ever?)
3. It's a filesystem-based indexer. It's not too difficult to decouple the
filesystem bit and make it more generic, but I don't have a need for it
presently.
4. A temp folder is needed for extracting Zip, GZip and Tar files. I tried
using outputstreams but they turned out to be quite a nightmare...
5. There's a JDBCDatasource for indexing a table from databases (the table
stores metadata of the file to index. There should still be some way to
obtain the file to index. This ties back to 3.). I really ought to provide
an example on how to use it...

Questions and feedback are really welcome.

I've attached the source-only version, but there's a full version (with
libs) at http://www.relevanz.com/search_full.zip.
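
For readers who want a concrete picture of the ContentHandler-to-extension mapping mentioned in this thread, here is a minimal sketch in present-day Java. It is not taken from the attached source: HandlerRegistry, handlerFor and extractText are hypothetical names chosen for illustration, and the real classes may look quite different.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the attached source.
final class HandlerRegistrySketch {

    /** Turns the raw text of one file into the text that should be indexed. */
    interface ContentHandler {
        String extractText(String rawContent);
    }

    /** Maps lower-cased file extensions (without the dot) to handlers. */
    static final class HandlerRegistry {
        private final Map<String, ContentHandler> handlers = new HashMap<>();

        void register(String extension, ContentHandler handler) {
            handlers.put(extension.toLowerCase(Locale.ROOT), handler);
        }

        /** Returns the handler for the file name, or empty if none is mapped. */
        Optional<ContentHandler> handlerFor(String fileName) {
            int dot = fileName.lastIndexOf('.');
            if (dot < 0 || dot == fileName.length() - 1) {
                return Optional.empty(); // no extension at all
            }
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            return Optional.ofNullable(handlers.get(ext));
        }
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register("txt", raw -> raw);
        // Crude tag stripping, for illustration only.
        registry.register("html", raw -> raw.replaceAll("<[^>]*>", " ").trim());

        String text = registry.handlerFor("index.HTML")
                .map(handler -> handler.extractText("<p>hello</p>"))
                .orElse("");
        System.out.println(text);
    }
}
```

Keying the lookup on the extension keeps the indexer itself unaware of file formats; supporting a new format would then only mean registering one more handler.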

----- Original Message -----
From: Andrew C. Oliver <acoliver@apache.org>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Sent: Friday, February 08, 2002 9:18 PM
Subject: Re: Proposal for Lucene


> Is this open source? APL'd? Where can I look at it?
>
> -Andy
>
> On Thu, 2002-02-07 at 20:27, Kelvin Tan wrote:
> > Great suggestions all around, and I'm pretty much in agreement with
> > what's been said.
> >
> > In my app, I've built a mini-framework around the searching such that
> > I'm able to map ContentHandlers (which index file contents) to file
> > extensions. I've been wanting to clean it up and contribute it for a while,
> > but haven't overcome the inertia to do so. Also introduced a DataSource
> > (which can pretty much be anything, like a filesystem, a database, a URL,
> > etc) from which to obtain the data to index, so I think it _could_ be in line
> > with what some of you have in mind.
> >
> > I could also use a lot of feedback on what's been done too...
> >
> > So what's the plan to move forward?
> >
> > K
Re: Proposal for Lucene [ In reply to ]
----- Original Message -----
From: Andrew C. Oliver <acoliver@apache.org>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Sent: Saturday, February 09, 2002 8:57 PM
Subject: Re: Proposal for Lucene


[snip]
>
> > 5. There's a JDBCDatasource for indexing a table from databases (the table
> > stores metadata of the file to index. There should still be some way to
> > obtain the file to index. This ties back to 3.). I really ought to provide
> > an example on how to use it...
> >
>
> What's that good for...? Wouldn't one just create an index on the
> database?

You can do that, but IMHO that's not a very unified way of searching, since
there are now 2 indexes to search. What if the data resides in more
than one database? In different databases? What I tried to do was basically
have Lucene handle _all_ indexing and searching...

I can definitely see the pros of having the database index the data though;
I just didn't think it was very clean.

Regards,
K


Re: Proposal for Lucene [ In reply to ]
Hi All,

I just wanted to apologize for not responding to all of your feedback.
I have read it. I rushed to get this proposal done before I leave this
week, despite unexpectedly working late a few nights at my new job. I'm
leaving for Boston tomorrow and will return in a week. I'll go over all
the messages, test them against the proposal and update it with all of
the great ideas. (I may have limited access to the Internet, hence the
delay)

When I get back I'll submit two things:

* The revised proposal and
* A first iteration implementation plan

I'm a firm believer in iterative development, and the feature set as
I've outlined it is probably more than I'd want to try and do in one
swoop. Moreover, I don't want to do any more of this myself than I have
to ;-) so it's great if I can put stubs out there and have others submit
the rest.

As far as keeping Lucene an API goes, that is my every intention. This is
new stuff to be added to Lucene as part of the project but in separate
packages. As for creating another project for this, that sounds
counter-productive. Secondly, the bar on Jakarta for creating new
projects is pretty dern high, and I don't think splitting the Lucene
community into really smart people like Doug and people like me who just
want to add high-level stuff to improve usability is productive.

What I'm going to propose in the plan for our first iteration will be a
limited set of classes and interfaces for starting this. These will be
limited in scope and will keep the full feature set in mind, but not
implement it. The whole idea behind this is that it lets us have
several crawlers, etc not just one so everyone can contribute to the
effort.

In the first iteration implementation plan (some of this may be just
in-lined patches), I'll propose an ant target or two, some packages,
etc.

In revising the existing proposal, there is no need to wait for me. I
put the XML sources for the proposal at the same url as the html
generated version. Just submit patches to it (diff -u original.xml
new.xml) to this list. (http://www.trilug.org/~acoliver/luceneplan.xml)
I realize the XML is slightly incompatible with the Lucene doc build (it
uses Cocoon instead of Anakia). I'll be happy to convert it (mostly
making the <link> tags into <a> tags) when I get back (or if someone else
wants to, that's fine) and we can add it to the Lucene docs provided
that's agreeable to everyone. I've also included a tarball version of
my personal website, which is nothing more than an ant-based docbuild
that we use for POI and the three webpages (including the Lucene plan)
with some supporting images. It's about 5.6 MB, but for those inclined you
can download it at (http://www.trilug.org/~acoliver/personal.tar.bz2).
The targets are cleandocs and docs. (You can use clean, but it will make
generation take longer.)

Once again, thanks for all of the feedback. I'll submit the revised
proposal when I get back next Saturday and we'll round us out a nice
set of new features. I highly recommend everyone tear at htDig.
They've done a great job of providing general functionality. They also
DO in fact have an API as part of the project in addition to the
"application". It's a nice piece of software...albeit a pain to install
and I think Lucene will exceed it all around once we add these features.

I look forward to working with you all on this functionality.

Thanks,

Andy
Re: Proposal for Lucene [ In reply to ]
done.

On Sat, 2002-02-23 at 12:58, Otis Gospodnetic wrote:
> Yes, please.
>
> Otis
>
> --- "Andrew C. Oliver" <acoliver@apache.org> wrote:
> > Hi All,
> >
> > I'm about to start revising the Lucene proposal with the ideas I got
> > back. Is it okay with everyone if I convert this to work in anakia
> > and
> > commit it into the cvs repository so that we all can work on it and
> > patch it etc? I'll shortly work on the implementation plan as well
> > (basic proposed interfaces etc).
> >
> > Thanks,
> >
> > Andy
RE: Proposal for Lucene [ In reply to ]
Mark,

I implemented your suggestions. I didn't add URLReplacePrefix as that
was already in the AbstractIndexer (base context etc.); maybe I should
clarify it. Suggestions? I also didn't put in the *standard
fields* as I'm not sure that is appropriate; I think we should make that
configurable. I'm open to more discussion on that.

I also did not put in *other document factories* -- I don't want to list
every possible one. The proposal is only meant to give a few examples for
illustration. (Perhaps that should be stated more clearly?)

The "WEB Indexer" -- Perhaps this shows the need for more settings, and
maybe we need a further extraction that pulls all of the sources
together aside from the crawling datasource handler? Anyone have any
suggestions on that?

This is now in CVS... take a look when you get the chance and make sure
I didn't leave out anything that I shouldn't have.
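
As an illustration of how the general settings in Mark's list (quoted below) might behave at run time, here is a rough Java sketch of MaxItems, MaxMegs and SleeptimeBetweenCalls. The CrawlBudget class and its methods are assumptions made up for this example; they are not the classes in CVS, and treating a "meg" as 1024 * 1024 bytes is also an assumption.

```java
// Hypothetical sketch: CrawlBudget is an invented name, not part of the proposal.
final class CrawlBudgetSketch {

    /** Tracks how much has been indexed against MaxItems and MaxMegs. */
    static final class CrawlBudget {
        private final int maxItems;
        private final long maxBytes;
        private int items;
        private long bytes;

        CrawlBudget(int maxItems, long maxMegs) {
            this.maxItems = maxItems;
            this.maxBytes = maxMegs * 1024L * 1024L; // assumes 1 MB = 1024 * 1024 bytes
        }

        /** Records one indexed item of the given size. */
        void record(long sizeInBytes) {
            items++;
            bytes += sizeInBytes;
        }

        /** True once either limit has been reached. */
        boolean exhausted() {
            return items >= maxItems || bytes >= maxBytes;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long sleeptimeBetweenCallsMillis = 10; // pause so the source is not flooded
        CrawlBudget budget = new CrawlBudget(3, 1);
        long[] documentSizes = {200_000, 300_000, 700_000, 50_000};

        int indexed = 0;
        for (long size : documentSizes) {
            if (budget.exhausted()) {
                break; // stop indexing once a limit is hit
            }
            budget.record(size);
            indexed++;
            Thread.sleep(sleeptimeBetweenCallsMillis);
        }
        System.out.println("indexed " + indexed + " items");
    }
}
```

Checking the budget before each item means the item that crosses a limit is still indexed; a stricter variant would check the size before recording it.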

On Thu, 2002-02-07 at 15:03, Mark Tucker wrote:
> I like what you included in your proposal and suggest doing all that (over time) and taking the following into consideration:
>
> Indexers/Crawlers
>
> General Settings
> SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
> IndexerTimeout - kill this crawler thread after long period of inactivity
> IncludeFilter - include only items matching filter
> ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
> MaxItems - stops indexing after x items
> MaxMegs - stops indexing after x MB of data
>
> File System Indexer
> URLReplacePrefix - can crawl c:\ but expose URL as http://myserver/docs/
>
> Web Indexer
> HTTPUser
> HTTPPassword
> HTTPUserAgent
> ProxyServer
> ProxyUser
> ProxyPassword
> HTTPSCertificate
> HTTPSPrivateKey
>
> Other Possible Indexers
> Microsoft Exchange 5.5/2000
> Lotus Notes
> Newsgroup (NNTP)
> Documentum
> ODBC/OLEDB
> XML - index single XML that represents multiple documents
>
>
> Document Factory
> General
> The minimum properties for each document should be:
> URL
> Title
> Abstract
> Full Text
> Score
>
> HTML
> Support for META tags including Dublin Core syntax
>
> Other Possible Document Factories
> Office Docs - DOC, XLS, PPT
> PDF
>
>
> Thanks for the great proposal.
>
> Mark Tucker
>
>
> -----Original Message-----
> From: Andrew C. Oliver [mailto:acoliver@apache.org]
> Sent: Thursday, February 07, 2002 5:35 AM
> To: Lucene Developers List
> Subject: Proposal for Lucene
>
>
> Hi All,
>
> These are just a few thoughts about Lucene. Please send me your feedback,
> critiques and thoughts.
>
> If you folks would take a look:
>
> http://www.trilug.org/~acoliver/luceneplan.html
>
> if you'd like to submit patches:
>
> http://www.trilug.org/~acoliver/luceneplan.xml
>
> Once I've gotten feedback from the developer community I'll send this to
> the user community as well.
>
> Thanks,
>
> Andy
--
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document
format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh


Re: Proposal for Lucene [ In reply to ]
On Thu, 2002-02-07 at 16:39, Dmitry Serebrennikov wrote:
> I'd like to add my +1 to the proposal and my +1 to keeping the Lucene as
> a library that can exist separately from the applications. Perhaps the
> applications should be separate targets in the Lucene project (and build
> process) or perhaps they can be separate projects. I think keeping them
> together would be good because Lucene's APIs may need to evolve to
> support these applications better and because this will help ensure that
> changes to Lucene API are reflected in the applications as soon as they
> are made and not with a lag that can come about if the applications are
> treated as separate, dependent projects.
>
> See below for some additional ideas for the crawler.
>
> Mark Tucker wrote:
>
> >I like what you included in your proposal and suggest doing all that (over time) and taking the following into consideration:
> >
> >Indexers/Crawlers
> >
> > General Settings
> > SleeptimeBetweenCalls - can be used to avoid flooding a machine with too many requests
> > IndexerTimeout - kill this crawler thread after long period of inactivity
> > IncludeFilter - include only items matching filter
> > ExcludeFilter - exclude items matching filter (can be used with IncludeFilter)
> >
> I'm working on a crawler right now actually, but it is a derivative of
> WebSPHINX. The original WebSPHINX has not changed in a very long time,
> but it is licensed under LGPL at the moment. Perhaps we can get
> permission from the copyright holders to transfer it to APL (or do we
> even need to?). I made a number of bug fixes to it, added support for
> cookies (rudimentary) and support for HTTP redirects. One thing that I
> like in WebSPHINX is that it has a forgiving HTML parser that can deal
> with many kinds of broken HTML. Also, it has a very interesting
> framework for analyzing parsed content, but this goes beyond the
> requirements for use with Lucene.
>

I'm pretty sure they'd have to make it APL for us to collaborate
significantly.

> I use the crawler with Lucene, but there is a layer of application
> classes between the two, so the kind of integration that has been
> proposed here has not yet been done. Anyway, I found that in addition to
> the Include and Exclude filters, it is helpful to be able to say that
> you want some page "expanded" (i.e. parsed and links followed), but not
> "indexed" (i.e. added to Lucene's index). And vice versa, it sometimes
> seems useful to index a page but not expand it. Also, filters can
> be evaluated on links before they are followed, and then the second time
> on final URLs of pages retrieved. Normally the two are the same, but
> HTTP redirects can force the final URL to be something very different
> from the original link.
>

Ahh... that does make sense to me... I've added this. I had to read
this like 3 or 4 times.. Please look over the changes I made and make
sure I explained it properly.. (It could be my little brain just took a
few times to grasp it ;-) ).

> Perhaps one way to represent these conditions is to have the following
> "language" instead of include and exclude filters:
>
> "include:" regex
> "exclude:" regex
> "noindex:" regex
> "noexpand:" regex
>
> The first two work as the include/exclude, but for things that pass
> these two, the others add handling properties that are used in
> processing the link and the page. Disclaimer: I'm experimenting with
> this now and these ideas are only about two days old, so please take
> them as such. Since we got into the discussion, I figured I'd put them
> on the table.
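
One possible reading of the four-keyword rule language above, sketched in Java. Everything here is an assumption layered on Dmitry's description: the LinkRules and Decision names are invented, an empty include list is taken to mean "include everything", and a rule is taken to match when its regex is found anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: class names and matching semantics are assumptions.
final class LinkRulesSketch {

    /** Outcome for one URL: whether it is crawled at all, indexed, and expanded. */
    static final class Decision {
        final boolean accepted;
        final boolean index;
        final boolean expand;

        Decision(boolean accepted, boolean index, boolean expand) {
            this.accepted = accepted;
            this.index = index;
            this.expand = expand;
        }
    }

    static final class LinkRules {
        private final List<Pattern> include = new ArrayList<>();
        private final List<Pattern> exclude = new ArrayList<>();
        private final List<Pattern> noIndex = new ArrayList<>();
        private final List<Pattern> noExpand = new ArrayList<>();

        /** Adds one rule; the keyword is one of include, exclude, noindex, noexpand. */
        LinkRules add(String keyword, String regex) {
            Pattern pattern = Pattern.compile(regex);
            switch (keyword) {
                case "include": include.add(pattern); break;
                case "exclude": exclude.add(pattern); break;
                case "noindex": noIndex.add(pattern); break;
                case "noexpand": noExpand.add(pattern); break;
                default: throw new IllegalArgumentException("unknown rule: " + keyword);
            }
            return this;
        }

        /** include/exclude decide acceptance; noindex/noexpand only adjust handling. */
        Decision decide(String url) {
            boolean included = include.isEmpty() || matchesAny(include, url);
            if (!included || matchesAny(exclude, url)) {
                return new Decision(false, false, false);
            }
            return new Decision(true, !matchesAny(noIndex, url), !matchesAny(noExpand, url));
        }

        private static boolean matchesAny(List<Pattern> patterns, String url) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(url).find()) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        LinkRules rules = new LinkRules()
                .add("include", "^http://example\\.org/")
                .add("exclude", "\\.gif$")
                .add("noindex", "/toc\\.html$")
                .add("noexpand", "/archive/");

        // Check the link as written, then again with the final URL after a redirect.
        String link = "http://example.org/toc.html";
        String finalUrl = "http://example.org/archive/2002/toc.html";
        for (String url : new String[] {link, finalUrl}) {
            Decision d = rules.decide(url);
            System.out.println(url + " accepted=" + d.accepted
                    + " index=" + d.index + " expand=" + d.expand);
        }
    }
}
```

Calling decide once on the link and once on the final URL reflects the point made earlier in this message that a redirect can change which rules apply.
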
>
> >
> > MaxItems - stops indexing after x items
> > MaxMegs - stops indexing after x MB of data
> >
> > File System Indexer
> > URLReplacePrefix - can crawl c:\ but expose URL as http://myserver/docs/
> >
> Question: does this information really belong in the index? Perhaps the
> root path should be specified, and the documents tagged with a relative
> path to that path, but I think that, maybe, the URL to prefix the
> document paths with should be given once per entire index and be easy to
> change.
>

Yes it must be in the index. This replace context is already in the
abstractcrawler.
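
To show what the prefix replacement amounts to, here is a small Java sketch. The toUrl method is a made-up helper, not the code in the abstract crawler; it uses plain string handling, does no URL escaping, and matches the root path case-sensitively.

```java
// Hypothetical sketch: the method name and signature are assumptions.
final class UrlReplacePrefixSketch {

    /**
     * Maps a crawled file path to the URL that should be exposed.
     * Plain string handling only; no URL escaping is attempted.
     */
    static String toUrl(String rootPath, String urlPrefix, String filePath) {
        if (!filePath.startsWith(rootPath)) {
            throw new IllegalArgumentException(filePath + " is not under " + rootPath);
        }
        // Path relative to the crawl root, with Windows separators normalized.
        String relative = filePath.substring(rootPath.length()).replace('\\', '/');
        while (relative.startsWith("/")) {
            relative = relative.substring(1);
        }
        String prefix = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
        return prefix + relative;
    }

    public static void main(String[] args) {
        System.out.println(toUrl("c:\\", "http://myserver/docs/", "c:\\guide\\intro.html"));
    }
}
```

Whether the resulting URL or only the relative path ends up stored in the index is exactly the question debated above; the sketch covers the mapping itself and takes no side.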

> >
> >
> > Web Indexer
> > HTTPUser
> > HTTPPassword
> > HTTPUserAgent
> > ProxyServer
> > ProxyUser
> > ProxyPassword
> > HTTPSCertificate
> > HTTPSPrivateKey
> >
> Apache Commons has an HttpClient package that has some similar concepts and
> even implements them to some degree. I found it a bit rough still and
> dependent on JDK 1.3, but I believe it can be fixed more easily than a new
> one can be written. It uses a notion of an HttpState, which is a state container
> for an HTTP user agent, containing things like authentication
> credentials and cookies. HTTPS support is easy to add with JSSE (which
> is the approach taken by the HttpClient from the Commons).
>

I actually had HttpClient in mind (have only looked at the description)
the whole time I typed this. We can use whatever, but it makes sense
to use this if it's available. Such specific details don't belong in
this particular proposal (we're answering "What", not "How"), but once we
get a proposal we like we can look at that in the implementation plan.

> >
> >
> > Other Possible Indexers
> > Microsoft Exchange 5.5/2000
> > Lotus Notes
> > Newsgroup (NNTP)
> > Documentum
> > ODBC/OLEDB
> > XML - index single XML that represents multiple documents
> >
> One idea that might prove useful is to add a "DocumentFetcher" in
> addition to the DocumentIndexer. The two would go hand in hand, and
> document entries created in Lucene by a particular Indexer can be
> understood by a corresponding Fetcher. The Fetcher would then
> encapsulate retrieval of source documents or creating useful pointers to
> them (like URLs).
>

I like that... I'm just trying to figure out "How" to do that
(design-wise). How do we separate the concerns of the retrieval from
the link crawling etc.? Could you perhaps patch the proposal with a
design?
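
Purely as a starting point for that design discussion, here is one way the Indexer/Fetcher pairing could look in Java. All names are hypothetical, a plain Map stands in for a Lucene document so the sketch runs without Lucene on the classpath, and the file:// pointer assumes an absolute Unix-style path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these interfaces are not part of the proposal or of Lucene.
final class FetcherSketch {

    /** Creates an index entry (here just named string fields) for one source item. */
    interface DocumentIndexer {
        Map<String, String> index(String sourceId, String text);
    }

    /** Understands entries written by its matching indexer and points back to the source. */
    interface DocumentFetcher {
        boolean understands(Map<String, String> entry);

        String pointerFor(Map<String, String> entry);
    }

    static final class FileSystemIndexer implements DocumentIndexer {
        @Override
        public Map<String, String> index(String path, String text) {
            Map<String, String> entry = new HashMap<>();
            entry.put("indexer", "filesystem"); // tag that the fetcher looks for
            entry.put("path", path);
            entry.put("contents", text);
            return entry;
        }
    }

    static final class FileSystemFetcher implements DocumentFetcher {
        @Override
        public boolean understands(Map<String, String> entry) {
            return "filesystem".equals(entry.get("indexer"));
        }

        @Override
        public String pointerFor(Map<String, String> entry) {
            if (!understands(entry)) {
                throw new IllegalArgumentException("entry was not written by the filesystem indexer");
            }
            return "file://" + entry.get("path"); // assumes an absolute Unix-style path
        }
    }

    public static void main(String[] args) {
        Map<String, String> entry = new FileSystemIndexer().index("/srv/docs/a.txt", "hello");
        System.out.println(new FileSystemFetcher().pointerFor(entry));
    }
}
```

In this reading the only coupling between crawling and retrieval is the tag the indexer writes into each entry, so link crawling could live in the indexer while the fetcher stays a small lookup component.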

> Another idea is to split the document storage and "envelope" from its
> content. The content is subject to a MIME type and can be handed to a
> parser, passed to a document factory, mapped to fields, etc. However,
> the logic of retrieving a PDF file from a Lotus Notes database (and
> creating a URL to point back to it), is different than getting the same
> PDF file from the file system. The same parser and a document factory
> can still be used though.
>

Right. I'm not sure we should do this at first... maybe in a later
iteration. That's a lot to bite off in one chew. I want to match and
slightly exceed htDig at first (not a competitive thing, it's just what I
use currently). Nail the 80% first and worry about the 20% later so
that we minimize up-front complexity (iterative programming, etc.).
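
For a later iteration, the envelope/content split described above might be sketched in Java as follows. StoredDocument, Content and ParserRegistry are invented names, and the "lotus-notes" source is only a label here; nothing in the sketch talks to a real Notes database.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: all class names are assumptions made for illustration.
final class EnvelopeSketch {

    /** The payload: a MIME type plus raw bytes, independent of where it was stored. */
    static final class Content {
        final String mimeType;
        final byte[] bytes;

        Content(String mimeType, byte[] bytes) {
            this.mimeType = mimeType;
            this.bytes = bytes;
        }
    }

    /** The envelope: which storage the item came from and how to point back to it. */
    static final class StoredDocument {
        final String source;
        final String locator;
        final Content content;

        StoredDocument(String source, String locator, Content content) {
            this.source = source;
            this.locator = locator;
            this.content = content;
        }
    }

    /** Picks a parser by MIME type only, so every storage back end can share it. */
    static final class ParserRegistry {
        private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

        void register(String mimeType, Function<byte[], String> parser) {
            parsers.put(mimeType, parser);
        }

        String parse(Content content) {
            Function<byte[], String> parser = parsers.get(content.mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("no parser for " + content.mimeType);
            }
            return parser.apply(content.bytes);
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

        Content body = new Content("text/plain", "same text".getBytes(StandardCharsets.UTF_8));
        StoredDocument fromDisk = new StoredDocument("filesystem", "/srv/docs/a.txt", body);
        StoredDocument fromNotes = new StoredDocument("lotus-notes", "db1/doc42", body);

        // Different envelopes, same parser.
        for (StoredDocument doc : new StoredDocument[] {fromDisk, fromNotes}) {
            System.out.println(doc.source + ": " + registry.parse(doc.content));
        }
    }
}
```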

-Andy

> >
> >
> >Document Factory
> > General
> > The minimum properties for each document should be:
> > URL
> > Title
> > Abstract
> > Full Text
> > Score
> >
> > HTML
> > Support for META tags including Dublin Core syntax
> >
> > Other Possible Document Factories
> > Office Docs - DOC, XLS, PPT
> > PDF
> >
> >
> >Thanks for the great proposal.
> >
> Yes! Absolutely! Great proposal!
>
> --Dmitry
>
>
>
Re: Proposal for Lucene [ In reply to ]
On Fri, 2002-02-08 at 05:26, Manfred Schäfer wrote:
> Hi,
>
> i would suggest two sub-projects:
>

I think "packages" would be a more appropriate description; I
wouldn't call them "subprojects", so to speak.

> 1. Crawler - retrieving docs, wherever they are.....
>
> 2. DocumentHandler - extract text, create appropriate fields etc.
>

+1, that's what I was getting at in the proposal about DocumentFactory
etc.

> The second is a layer on top of Lucene. The first is an autonomous package, which
> should be nicely integrated with Lucene/DocumentHandler, but should also be
> usable for other projects.
>

hummm...I'm not entirely sure I'd go that far. Well encapsulated for
sure but How usable by other projects is up to them not us...

> I've included my code, to show you, what i've done. It isn't too useful yet,
> because it is integrated in our product, but you can get the idea. Actually i've
> written two things:
>
> 1: A robot for crawling a remote server via http and writing all the data to
> local filesystem, then importing it into our db and
> (at the same time) replacing all links with internal links. So we could emulate
> a web-Site from this crawled Data!
> [com.synformation.script.utilities.importtool]
>

I looked through this! Great stuff! Do you own this code? Are you
able to donate it to Lucene (APL and all)? It looks like a great
starting point. We'd have to do some refactoring but it looks pretty
dern good to me. I haven't tried running it, just skimmed through.

> 2: (I've rewritten some of the code from 1 for that, so this is much cleaner) A
> customer needs a tool for importing local mini-websites from the file system via
> an applet, sending them to the web server and importing them as described in point 1. I've
> tried to write it in such a way that it could include the functionality of point 1
> (retrieving via HTTP), but that is mostly untested.
> [com.synformation.script.utilities.fileimport]
>
My brain didn't parse that..

> I'm not saying that you (we) should use this. But I think it's time to come up with
> more concrete plans. I'm interested in helping with the crawler.
>

If you're able to donate it (legally) I kinda think there is a lot
here. It of course needs to be refactored to meet some of the
objectives we've outlined, but a darn good starting point IMHO!

>
> Best regards,
>
> manfred
>
>
>
>
Re: Proposal for Lucene [ In reply to ]
Wow, this is an awesome starting point! I'm awed! The object model is
nice and abstracted and yet clean and simple. I only scanned it, but I
already feel like I understand it. Are you okay with us putting this in
a scratchpad area in the Lucene repository (I gather "yes") and refactoring
it as a starting point?

Has anyone else looked at this? Any objections?

-Andy


RE: Proposal for Lucene [ In reply to ]
When I try to unzip the file with WinZip, I get the following error:

Cannot open file: it does not appear to be a valid archive.

Can someone send or post a new zip file?

Thanks,

Mark Tucker

> I've attached the source-only version, but there's a full version (with
> libs) at http://www.relevanz.com/search_full.zip.

Re: Proposal for Lucene [ In reply to ]
Mark,

My web server is acting all weird -- somehow this zip file refuses to
download completely via HTTP (both in IE and Netscape, but downloading via
FTP is fine).

The workaround is that I've renamed it to
http://www.relevanz.com/search_full.z. If your friendly zip program doesn't
recognize it (though WinZip does), just rename it to search_full.zip. :)

Regards,
Kelvin

----- Original Message -----
From: "Mark Tucker" <MTucker@infoimage.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Tuesday, February 26, 2002 4:49 AM
Subject: RE: Proposal for Lucene


When I try to unzip the file with WinZip, I get the following error:

Cannot open file: it does not appear to be a valid archive.

Can someone send or post a new zip file?

Thanks,

Mark Tucker

> I've attached the source-only version, but there's a full version (with
> libs) at http://www.relevanz.com/search_full.zip.

Re: Proposal for Lucene [ In reply to ]
----- Original Message -----
From: "Andrew C. Oliver" <acoliver@apache.org>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Monday, February 25, 2002 12:48 AM
Subject: Re: Proposal for Lucene


> Wow this is an awesome starting point! I'm awed!
> The object model is
> nice and abstracted and yet clean and simple.. I only scanned it but I
> already feel like I understand it. Are you okay with us putting this in
> a scratchpad area in lucene repository (I gather "yes") and refactoring
> it as a starting point?

I'd be more than happy if you could do that. It would be nice if Lucene had
the equivalent of the commons-sandbox or turbine-stratum, a workplace
of sorts.

Regards,
Kelvin

>
> Has anyone else looked at this? Any objections?
>
> -Andy
>
>
Re: Proposal for Lucene [ In reply to ]
Hi,


> > 2: (I've rewritten some of the code from 1 for that, so this is much cleaner) A
> > customer needs a tool for importing local mini-Websites on the file-system via
> > an applet, send it to the Web-Server and import it as described in point 1. I've
> > tried to write it in a way, that it could include the functionality of point 1
> > (retrieving vie http), but that is mostly untested.
> > [com.synformation.script.utilities.fileimport]
> >
> My brain didn't parse that..

com.synformation.script.utilities.fileimport is about crawling HTML pages on a remote
server (the process runs on the remote server), packing up all the crawled files and sending
them via HTTP to the server, which processes them further. Maybe that's also not clear
enough, but that doesn't matter: all you have to know is that it is a kind of
refactored version of com.synformation.script.utilities.httpimport.


>
> If you're able to donate it (legally) I kinda think there is a lot
> here. It of course needs to be refactored to meet some of the
> objectives we've outlined, but a darn good starting point IMHO!
>

Tell me what I have to do to donate it. Is copying the Apache license into every file
enough? It has a few dependencies on some other classes, but those are always very
small util methods, so I can remove them and add a little sample showing how to use it.
Btw: I've seen that avalon/excalibur/phoenix also has a crawler component (just in
case you want to compare).

regards,

manfred





Re: Proposal for Lucene [ In reply to ]
On Tue, 2002-02-26 at 05:41, Manfred Schäfer wrote:
> Hi,
>
>
> > > 2: (I've rewritten some of the code from 1 for that, so this is much cleaner) A
> > > customer needs a tool for importing local mini-Websites on the file-system via
> > > an applet, send it to the Web-Server and import it as described in point 1. I've
> > > tried to write it in a way, that it could include the functionality of point 1
> > > (retrieving vie http), but that is mostly untested.
> > > [com.synformation.script.utilities.fileimport]
> > >
> > My brain didn't parse that..
>
> com.synformation.script.utilities.fileimport is about crawling html pages on a remote
> server (the process is on the remote server), packing all crawled files up and sending
> it via http to the server, which is processing it further. Maybe thats also not clear
> enough, but that is not necessary: All you have to know is, that it is a kind of
> refactored version for com.synformation.script.utilities.httpimport.
>
>

now I get it.

> >
> > If you're able to donate it (legally) I kinda think there is a lot
> > here. It of course needs to be refactored to meet some of the
> > objectives we've outlined, but a darn good starting point IMHO!
> >
>
> Tell me what i've to do, to donate it. Is copying the apache license into every file
> enough ? It has little dependencies from some other classes, but that are always very
> little util methods, so i can remove it and add a little sample how to use it.
> btw: I've seen, that avalon/excalibut/phoenix has also a crawler component (just the
> case you want to compare).
>

Naw, you just have to own the code. I can use a little Perl script to
put the header in (did I say that...duck...I'm so embarrassed). Basically
I'm just asking "Can you legally donate this, copyright-wise, etc.?
Are you willing to donate this?" If your boss, for instance, owns the
code then the answer would be no, yes. And that wouldn't work ;-)

-Andy

> regards,
>
> manfred
>
>
>
>
>
Re: Re: Proposal for Lucene [ In reply to ]
>On Tue, 26 Feb 2002 19:11:07 0000 Manfred Schäfer
<mschaefer@bouncy.com> wrote:
>Hi,
>
>
>>
>> Naw, you just have to own the code, I can use a little PERL script to
>> put the header (did I say that...duck...I'm so embarrassed). Basically
>> I'm just saying "Can you legally donate this, copywright wise, etc.?
>> Are you willing to donate this?"
>
>My boss has granted to donate the source.
>

great!

>
>> If your boss, for instance, owns the
>> code the the answer would be no, yes. And that wouldn't work ;-)
>
>???
>

nevermind.

>regards,
>
>Manfred
>
>
>
Re: Proposal for Lucene [ In reply to ]
Hi,


>
> Naw, you just have to own the code, I can use a little PERL script to
> put the header (did I say that...duck...I'm so embarrassed). Basically
> I'm just saying "Can you legally donate this, copywright wise, etc.?
> Are you willing to donate this?"

My boss has granted permission to donate the source.


> If your boss, for instance, owns the
> code the the answer would be no, yes. And that wouldn't work ;-)

???

regards,

Manfred



Re: Proposal for Lucene [ In reply to ]
Hi Manfred/Kelvin (whose name I saw on a lot of this),

I'm back on the on cycle and I was about to commit this stuff so we
could start refactoring. I've got it building and all set up and ready,
but I wanted to make sure that you're still okay with it.

Once I get it in lucene-sandbox we can start refactoring it and adding
the new features.

Are we good to go? Let me know and then we can watch the CVS commit
messages fly into lucene-sandbox...

Thanks,

-Andy

Re: Proposal for Lucene [ In reply to ]
Note that I will also be putting some web crawler code in the sandbox
soon. The code is from Clemens, who posted a few messages recently.

Good, let's see some refactoring!

Otis


Re: Proposal for Lucene [ In reply to ]
Cool, dude, let's put it all in the same place and refactor it together. I
didn't even repackage this yet; I figured we'd put it in, get it building
and then pick, choose and enhance.

(Like, I want to rip JDOM out and use commons-logging for this bad boy
because log4j has bitten me cleanly in the rump too many times with its
forever-changing interfaces.)

I don't want to do this alone. Let's work together. My interest is
in getting some interfaces and a pluggable architecture in here, and you
had some great ideas on that, IIRC.

I've even got the build and all set up. When these guys say go I'll hit
the button and we can go to town.
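
One way to picture the pluggable architecture, following Kelvin's earlier description of mapping ContentHandlers to file extensions, is a small registry keyed by extension. This is only a sketch under assumptions: the ContentHandler interface as written here, HandlerRegistry, and the two sample handlers are hypothetical names, not the API of the contributed code, and the HTML handler strips tags with a naive regular expression where a real one would use a proper parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plug-in point: extracts indexable text from raw file content. */
interface ContentHandler {
    String extractText(String rawContent);
}

/** Hypothetical registry that picks a handler by file extension. */
final class HandlerRegistry {
    private final Map<String, ContentHandler> byExtension = new HashMap<>();
    private final ContentHandler fallback;

    HandlerRegistry(ContentHandler fallback) {
        this.fallback = fallback;
    }

    /** Registers a handler for an extension such as "html" or ".html" (case-insensitive). */
    void register(String extension, ContentHandler handler) {
        byExtension.put(normalize(extension), handler);
    }

    /** Returns the handler for the file's extension, or the fallback if none is registered. */
    ContentHandler forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return fallback;
        }
        ContentHandler handler = byExtension.get(normalize(fileName.substring(dot + 1)));
        return handler != null ? handler : fallback;
    }

    private static String normalize(String extension) {
        String bare = extension.startsWith(".") ? extension.substring(1) : extension;
        return bare.toLowerCase(Locale.ROOT);
    }
}

/** Two toy handlers, for illustration only. */
final class SampleHandlers {
    /** Plain text is passed through with surrounding whitespace removed. */
    static final ContentHandler PLAIN_TEXT = raw -> raw.trim();

    /** Naive HTML handling: drop anything that looks like a tag, then collapse whitespace. */
    static final ContentHandler NAIVE_HTML =
            raw -> raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();

    private SampleHandlers() {}
}
```

New formats would then be added by registering another handler rather than by changing the indexer itself, which is the kind of interface-driven design being asked for here.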

-Andy

On Fri, 2002-05-03 at 22:48, Otis Gospodnetic wrote:
> Note that I will also be putting some web crawler code in the sandbox
> soon. The code is from Clemens, who posted a few messages recently.
>
> Good, lets see some refactoring!
>
> Otis
>
>
> --- "Andrew C. Oliver" <acoliver@apache.org> wrote:
> > Hi Manfred/Kelvin (whose name I saw on a lot of this),
> >
> > I'm back on the on cycle and I was about to commit this stuff so we
> > could start refactoring, I've got it building and all set up and
> > ready.
> > But I wanted to make sure that you're still okay with it.
> >
> > Once I get it in lucene-sandbox we can start refactoring it and
> > adding
> > the new features.
> >
> > Are we good to go? Let me know and then we can watch the CVS commit
> > messages fly into lucene-sandbox...
> >
> > Thanks,
> >
> > -Andy
> >
> > On Fri, 2002-02-08 at 05:26, Manfred Schäfer wrote:
> > > Hi,
> > >
> > > I would suggest two sub-projects:
> > >
> > > 1. Crawler: retrieving docs, wherever they are.
> > >
> > > 2. DocumentHandler: extracting text, creating appropriate fields, etc.
> > >
> > > The second is a layer on top of Lucene. The first is an autonomous
> > > package, which should be nicely integrated with Lucene/Document-Handler
> > > but should also be usable by other projects.
> > >
> > > I've included my code to show you what I've done. It isn't too useful
> > > yet, because it is integrated into our product, but you can get the
> > > idea. Actually, I've written two things:
> > >
> > > 1: A robot for crawling a remote server via HTTP and writing all the
> > > data to the local filesystem, then importing it into our DB and (at
> > > the same time) replacing all links with internal links. So we could
> > > emulate a website from this crawled data!
> > > [com.synformation.script.utilities.importtool]
> > >
> > > 2: (I've rewritten some of the code from 1 for this, so it is much
> > > cleaner.) A customer needs a tool for importing local mini-websites
> > > from the filesystem via an applet, sending them to the web server and
> > > importing them as described in point 1. I've tried to write it in a
> > > way that it could include the functionality of point 1 (retrieving
> > > via HTTP), but that is mostly untested.
> > > [com.synformation.script.utilities.fileimport]
> > >
> > > I'm not saying that you (we) should use this. But I think it's time to
> > > come to more concrete plans. I'm interested in helping with the
> > > crawler.
> > >
> > > Kind regards,
> > >
> > > manfred
> > >
--
http://www.superlinksoftware.com
http://jakarta.apache.org/poi - port of Excel/Word/OLE 2 Compound
Document
format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh


Re: Proposal for Lucene [ In reply to ]
Andy,

I'm up for it. I've made further changes to what I previously posted and am
keen on getting it into sandbox.

K
----- Original Message -----
From: "Andrew C. Oliver" <acoliver@apache.org>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Saturday, May 04, 2002 10:23 AM
Subject: Re: Proposal for Lucene


Hi Manfred/Kelvin (whose name I saw on a lot of this),

I'm back on the on cycle and I was about to commit this stuff so we
could start refactoring, I've got it building and all set up and ready.
But I wanted to make sure that you're still okay with it.

Once I get it in lucene-sandbox we can start refactoring it and adding
the new features.

Are we good to go? Let me know and then we can watch the CVS commit
messages fly into lucene-sandbox...

Thanks,

-Andy

Re: Proposal for Lucene [ In reply to ]
Woohoo, I can feel the energy all the way here in NYC! :)

Clemens' contribution is in jakarta-lucene-sandbox now, so go ahead,
look and play.


I will send a separate note about this contribution now.

Otis


--- "Andrew C. Oliver" <acoliver@apache.org> wrote:
> Cool dude, let's put it all in the same place and refactor it together. I
> didn't even repackage this yet; I figured we'd put it in, get it building,
> and then pick, choose and enhance.
>
> (Like, I want to rip jdom out and use commons-logging for this bad boy,
> because log4j has bitten me cleanly in the rump too many times with its
> forever-changing interfaces.)
>
> I don't want to do this alone. Let's work together. My interest is
> in getting some interfaces and a pluggable architecture in here, and you
> had some great ideas on that, IIRC.
>
> I've even got the build all set up. When these guys say go, I'll hit
> the button and we can go to town.
>
> -Andy
Re: Proposal for Lucene [ In reply to ]
Okay, I'll go ahead and commit what I have, and then you can post the new
version.

On Sat, 2002-05-04 at 04:33, Kelvin Tan wrote:
> Andy,
>
> I'm up for it. I've made further changes to what I previously posted and am
> keen on getting it into sandbox.
>
> K
Re: Proposal for Lucene [ In reply to ]
Yeah dude, when a process gets moved onto the Andy-CPU it gets some
juice behind it... when it moves back off and some other process swaps
onto it, then well... ;-)

-Andy


On Sat, 2002-05-04 at 10:09, Otis Gospodnetic wrote:
> Woohoo, I can feel the energy all the way here in NYC! :)
>
> Clemens' contribution is in jakarta-lucene-sandbox now, so go ahead,
> look and play.
>
>
> I will send a separate note about this contribution now.
>
> Otis
Re: Proposal for Lucene [ In reply to ]
Otis,
Thanks for cranking and getting all this stuff rolling.

--Peter


On 5/4/02 7:09 AM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com> wrote:

> Woohoo, I can feel the energy all the way here in NYC! :)
>
> Clemens' contribution is in jakarta-lucene-sandbox now, so go ahead,
> look and play.
>
>
> I will send a separate note about this contribution now.
>
> Otis