Mailing List Archive

Re: lucene-user Digest 24 May 2002 14:31:22 -0000 Issue 105
>
> ------------------------------------------------------------------------
>
> Subject: powerpoint: sometimes it works...sometimes it doesn't
> From: Bruce Altner <baltner@hq.nasa.gov>
> Date: Wed, 22 May 2002 20:13:46 -0400
> To: lucene-user@jakarta.apache.org
>
>
> Greetings:
>
> I am brand new to Lucene so please forgive me if the following is too
> naive for polite replies...
>
> I have built a web-based application to schedule and archive brown bag
> talks. The system uses a database for scheduling and for searching by
> title, author, topic, abstract, etc., but I want to add full-text
> searching of the PowerPoint files actually presented during the
> seminars.
>
> So I ran a quick index (using the demo API) on the ppt file of a past
> talk I'd given and Lucene handled it very well, finding hits 95% of
> the time. I was quite impressed and excited about the possibilities
> but then I indexed a more recent talk and Lucene failed completely,
> never once finding a term.
>
> Any idea why it would work on one ppt file but not on another? The
> first was created using PowerPoint from Office 97 and the latter
> (failed) example with Office 2000, so that's a strong possibility, but
> I wanted to run this by folks on the list for opinions.
>
> Thanks!
>
> Bruce
>
> PS My brown bag app is my second go-round with the Jakarta Turbine
> framework. Ain't open source great!
>

If I understand correctly what you did, it is pure chance that it worked
in the first place. As far as I know, Lucene has no built-in parsers for
any document type, PPT included. Instead, it treats every stream or
String you give it as plain English text (you can change the language by
selecting a different stemmer, but it will still assume plain text).

Various efforts are underway to build a framework for plugging in
different document parsers. There is also a project (on Apache?) that
has an OLE parser written in Java, which might be suitable for working
with PPT files.

My guess is that the first file you tried stored its text as plain
strings, whereas the second stored it in some other encoding (or
compressed or encrypted). That is all I can say without knowing more
about the PPT file format. If you can find a program or library that
extracts the text from a PPT file, you should then be able to index that
text with Lucene easily. This may not be as elegant as the eventual
solution from the projects mentioned above, but it is how everyone uses
Lucene with various document types today.
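
To illustrate the guess about plain strings versus another encoding,
here is a minimal sketch in plain Java. It does not use Lucene and does
not parse PPT. The class name PlainTextRuns, the sample string, and the
choice of UTF-16LE as the "other encoding" are assumptions made for
illustration only. The sketch scans raw bytes for runs of printable
ASCII, which is roughly what a plain-text indexer can pick up from a
binary file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PlainTextRuns {

    // Collects runs of printable ASCII bytes that are at least minLen long.
    static List<String> extract(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b < 0x7f) {
                current.append((char) b);
            } else {
                // Any non-printable byte ends the current run.
                if (current.length() >= minLen) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= minLen) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the same text surrounded by control characters.
        String sample = "\u0001full text search\u0002";

        // Case 1: text stored one byte per character.
        byte[] asAscii = sample.getBytes(StandardCharsets.US_ASCII);
        // Case 2: text stored two bytes per character (assumed encoding).
        byte[] asUtf16 = sample.getBytes(StandardCharsets.UTF_16LE);

        System.out.println(extract(asAscii, 4)); // prints [full text search]
        System.out.println(extract(asUtf16, 4)); // prints []
    }
}
```

In the second case every letter is followed by a zero byte, so no run
reaches the minimum length and nothing searchable comes out. Whether
this is what actually happened with the Office 2000 file is not known;
the point is only that the text has to be extracted and decoded properly
before it is handed to Lucene.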

Good luck.
Dmitry.


