Text Extractor

Jul 10, 2007, 6:39 AM

Post #1 of 4 (3993 views)

Hello,

I am looking for a text extractor (tool set) that can pull text out of a range of file formats, such as office documents. The extracted text would then be indexed with Lucene. A Java API would be best, but is not required. Does anyone know of such a tool set or project?

Best Regards

Stefan

Stefan Schuh
Senior SW-Engineer
--------------------------------------------------------------------------------------------
COI GmbH
Erlanger Straße 62 Phone +49 9132 73 83 4775
91074 Herzogenaurach Fax +49 9132 73 83 4959
http://www.coi.de mailto:Stefan.Schuh@coi.de
--------------------------------------------------------------------------------------------
COI - Solutions for Documents
--------------------------------------------------------------------------------------------

COI Consulting für Office und Information Management GmbH
Registered office: Herzogenaurach
Register court: AG Fürth, HRB 3692; VAT ID: DE 811159097
Managing directors: Giovanni Santamaria, Andreas Schwarze

This e-mail message is intended only for the use of the named recipient(s) and contains information
which may be confidential or privileged. If you are not the intended recipient, be aware that any
distribution, or use of the contents of this information is prohibited.
If you have received this electronic transmission in error, please notify the sender
and delete the material from the computer.

Re: Text Extractor

wtaeger at epo

Jul 10, 2007, 7:03 AM

Post #2 of 4 (3821 views)

You could use the scripting language Perl, which is freely available for
many platforms.
You would then need to install several modules that provide access to
Word, Excel, PDF, SQL, Zip, Tar, ...

Google for Perl, CPAN, ActiveState ...

If you try this, please let me know how it goes.

Best regards

Wolfgang Täger
Examiner Telecommunications | Dir. 2.4.1.4
European Patent Office
Landsberger Str. 30 | 80339 Munich | Germany
Tel. +49 (0)89 2399 6957
wtaeger@epo.org
http://www.epo.org

Re: Text Extractor

jukka.zitting at gmail

Jul 10, 2007, 7:15 AM

Post #3 of 4 (3854 views)

Hi,

On 7/10/07, Schuh, Stefan <Stefan.Schuh@coi.de> wrote:
> I am looking for a text extractor (tool set) which could be used, to get
> text data out of several file formats like office documents and so on.
> The text data (extract) could then be used to index with lucene. Best
> would be a java api, but not required. Does any one have knowledge
> of such a tool set or project?

The Tika project [1] in the Apache Incubator is just getting started
on implementing such a generic toolkit. Unfortunately, we haven't
released anything yet.

You may also want to check out the Lius project [2], which is one of
the source codebases that Tika will draw on. Another potential match
is the Aperture project [3].

[1] http://incubator.apache.org/tika/
[2] http://sourceforge.net/projects/lius/
[3] http://aperture.sourceforge.net/
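
As a rough illustration of what a generic extraction toolkit could look like from the caller's side, here is a minimal Java sketch using only the standard library. Every name in it (ExtractorSketch, TextExtractor, register, stripTags) is hypothetical and is not the API of Tika, Lius, or Aperture. It assumes the format can be picked from the file extension, and the HTML handling is a naive tag stripper rather than a real parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry that maps file extensions to text extractors. */
public class ExtractorSketch {

    /** Hypothetical extractor contract: raw bytes in, plain text out. */
    public interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for one extension (case-insensitive). */
    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extracted text, or empty if no extractor matches. */
    public Optional<String> extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            return Optional.empty();
        }
        return Optional.of(extractor.extract(content));
    }

    /** Naive tag removal for illustration only; not a real HTML parser. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        ExtractorSketch registry = new ExtractorSketch();
        registry.register("txt", b -> new String(b, StandardCharsets.UTF_8));
        registry.register("html",
                b -> stripTags(new String(b, StandardCharsets.UTF_8)));

        byte[] page = "<p>Hello <b>Lucene</b></p>".getBytes(StandardCharsets.UTF_8);
        // The resulting string is what you would hand to the indexer.
        System.out.println(registry.extract("page.html", page).orElse("(unsupported)"));
        System.out.println(registry.extract("report.doc", new byte[0]).orElse("(unsupported)"));
    }
}
```

The point of the registry is that the indexing code only ever sees plain strings; support for a new format means registering one more extractor, for example one backed by a format-specific library.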

BR,

Jukka Zitting

Re: Text Extractor

michaelrlevy at gmail

Jul 23, 2007, 5:57 AM

Post #4 of 4 (3780 views)

It might be worth your while to look at Nutch, a web search application
based on Lucene that can also search local filesystems. It includes parsers
for several common office document formats.

http://lucene.apache.org/nutch/
