Mailing List Archive: indexing and searching different file formats

pradeepk at robosoftin

Feb 13, 2002, 8:54 AM

Post #1 of 7 (1251 views)

Hi Lucene friends!

How can files of different formats be indexed and searched? (As far as I
know, Lucene comes with an HTML indexer and searcher, and also an XML
indexer, but is there any way to index files irrespective of the file
type?)
Any suggestions will be greatly appreciated.

Thanks in advance.
Pradeep

--------------------------------------------------------------
Robosoft Technologies, Mangalore, India

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: indexing and searching different file formats

alibby at commnav

Feb 13, 2002, 9:20 AM

Post #2 of 7 (1241 views)

Pradeep,
Lucene does not currently provide a way to convert documents to text
for indexing. There is talk of adding this kind of thing to the goals
of the project, along with providing crawlers to traverse web,
local disk, FTP, and RDBMS sources of data.

The problem with indexing irrespective of file type is that each document
format contains embedded information that must be stripped out (or ignored)
and the text needs to be retrieved for indexing. An extreme example is
a PDF, which has a considerably complicated document format.
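
As a rough illustration of the "strip out the embedded information" step,
here is a minimal sketch for a much simpler format than PDF: plain HTML.
The class name MarkupStripper and the regular expressions are assumptions
made for this example only; they are not part of Lucene, and a real HTML
indexer would want a proper parser rather than regexes.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Lucene class): reduces simple HTML to plain text. */
final class MarkupStripper {
    // Script and style bodies are not document text, so drop them whole.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag is treated as embedded information to ignore.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private MarkupStripper() {}

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes the literal text "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>Full-text &amp; fast</p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

The resulting string is what would be handed to the indexer. Binary formats
such as PDF need a real format library for this step.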

On the contributions page there are some pointers that may provide information
about processing the types of documents you're interested in.

http://jakarta.apache.org/lucene/docs/contributions.html

If you've not taken the time to do so, look at the FAQs; they are very
informative:

http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.jguru.com/faq/Lucene

Good luck!

Andy

--
--------------------------------------------------
Andrew Libby
CommNav, Inc
alibby@commnav.com

Re: indexing and searching different file formats

carlson at bookandhammer

Feb 13, 2002, 4:46 PM

Post #3 of 7 (1248 views)

Hi Pradeep,

The Lucene Document is not document-type specific. It is a Lucene class
which is made up of fields (which have different options).
Data in a document is parsed and put into one or more of these fields.

So Lucene can really handle any kind of document; there just needs to be a
document parser that puts the document into the Lucene Document format.
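
The idea can be sketched without touching the Lucene API at all. In the
example below a plain Map of field name to text stands in for a Lucene
Document, and FormatParser, ParserRegistry, and the field names "path" and
"contents" are all hypothetical names chosen for illustration. Each file
type gets its own parser, and the indexing side only ever sees fields.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical parser contract: raw content in, named fields out. */
interface FormatParser {
    Map<String, String> parse(String name, String rawContent);
}

/** Hypothetical registry that picks a parser by file extension. */
final class ParserRegistry {
    private final Map<String, FormatParser> parsersByExtension = new LinkedHashMap<>();

    void register(String extension, FormatParser parser) {
        parsersByExtension.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the fields for one file; the Map stands in for a Lucene Document. */
    Map<String, String> toFields(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        FormatParser parser = parsersByExtension.get(ext);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser.parse(fileName, rawContent);
    }
}

final class ParserRegistryDemo {
    static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        // Plain text: the content is already indexable.
        registry.register("txt", (name, raw) -> {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", name);
            fields.put("contents", raw.trim());
            return fields;
        });
        // HTML: naive tag removal, good enough for a sketch only.
        registry.register("html", (name, raw) -> {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", name);
            fields.put("contents",
                raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
            return fields;
        });
        return registry;
    }

    public static void main(String[] args) {
        System.out.println(
            defaultRegistry().toFields("a.html", "<p>Hello <b>Lucene</b></p>"));
    }
}
```

Supporting a new format then means registering one more parser; nothing on
the indexing or searching side has to change.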

I hope this helps.

--Peter

Re: indexing and searching different file formats

pradeepk at robosoftin

Feb 14, 2002, 3:58 AM

Post #4 of 7 (1251 views)

Thanks a lot, Andy.
-Pradeep

--------------------------------------------------------------
Robosoft Technologies, Mangalore, India

Re: indexing and searching different file formats

eliot at isogen

Feb 14, 2002, 10:10 AM

Post #5 of 7 (1236 views)

Andrew Libby wrote:

> and the text needs to be retrieved for indexing. An extreme example is
> a PDF, which has a considerably complicated document format.

The PJ library from www.etymon.com provides a pretty complete and
easy-to-use API for getting info from PDF docs. It wouldn't be too hard
to write a PDF indexer for Lucene using this library. The main challenge
would be guessing word boundaries in strings where spaces have been
replaced with explicit shift values by the formatter.
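
One possible heuristic for the word-boundary problem, sketched under stated
assumptions: a run of text is modeled as a list of string fragments
interleaved with numeric shift values, following the convention of PDF's TJ
operator, where adjustments are in thousandths of a text-space unit and a
negative value widens the gap. The List<Object> representation, the class
name ShiftSpacing, and the threshold of 200 are made up for this example and
are not the PJ library's API. A real indexer would likely tune the threshold
per font.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical heuristic: turn wide explicit shifts back into spaces. */
final class ShiftSpacing {
    /** Assumed cutoff, in thousandths of a text-space unit. */
    static final double WORD_GAP_THRESHOLD = 200.0;

    /** Items are either String fragments or Number shift values. */
    static String joinRun(List<Object> run) {
        StringBuilder out = new StringBuilder();
        for (Object item : run) {
            if (item instanceof String) {
                out.append((String) item);
            } else if (item instanceof Number) {
                double shift = ((Number) item).doubleValue();
                // Negative shifts open a gap; small ones are just kerning.
                boolean wideGap = -shift >= WORD_GAP_THRESHOLD;
                boolean needsSpace =
                    out.length() > 0 && out.charAt(out.length() - 1) != ' ';
                if (wideGap && needsSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Object> run = Arrays.asList("Full", -15, "-text", -300, "sea", 20, "rch");
        System.out.println(joinRun(run)); // prints "Full-text search"
    }
}
```

Only the -300 shift is wide enough to count as a word break here; the -15
and 20 adjustments are treated as kerning and the fragments are joined.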

Cheers,

Eliot
--
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX 78752 Phone: 512.656.4139

Re: indexing and searching different file formats

kelvin at relevanz

Feb 14, 2002, 6:09 PM

Post #6 of 7 (1243 views)

Uhmmm, I can contribute something which does a pretty decent job if anyone's
interested...

Just have to clean it up a little...

Regards,
Kelvin

Re: indexing and searching different file formats

kelvin at relevanz

Feb 14, 2002, 8:52 PM

Post #7 of 7 (1240 views)

Known limitations here:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html

HTH.

Regards,
Kelvin

PS: The PJ library is GPL'ed. Commercial licenses go for $5,000 per 100 copies
(1 CPU per copy).
