Mailing List Archive

R: PDF parser for Lucene
I also used this technique to index PDF files. Here's the class I used to
obtain a "text" input stream by a PDF file, using then it to build a Reader
for the indexing engine:


import java.io.*;
import java.util.Properties;

public class PDFToTextConverter extends FileToTextConverter
{
public PDFToTextConverter(String as_file_path,
Properties ax_props)
{
super(as_file_path,ax_props); //superclass just save args in
instance variables
//is_file_path and
ix_props...
}

public InputStream getInputStream()
{
String ls_command;
Process lx_child = null;

ls_command = ix_props.get("pdftotext-path") + " " + is_file_path + " -";

try
{
lx_child = Runtime.getRuntime().exec(ls_command);

return lx_child.getInputStream();
}
catch(Exception e)
{
return null;
}
}
}


The application I used to convert pdf to text (called by
Runtime.getRuntime().exec(...)) was xpdf 0.93, that you can find at
http://www.foolabs.com/xpdf in both unix and windows versions.


-----Messaggio originale-----
Da: Cecil, Paula New [mailto:cnew@fuse.net]
Inviato: venerdì 23 novembre 2001 17.37
A: Lucene Users List; Kelvin Tan
Oggetto: Re: PDF parser for Lucene


Inspired by the Unix "strings" command, I have written a subclass of
FilterReader; which I have called BinaryReader. The idea is simply to index
any proprietary file format by filtering out all non-printable characters.
The assumption is that text is text. It will end up with more than the
"visible" text, but not less. After I have tested and made some examples I
will post it here.



----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 2:48 AM
Subject: Re: PDF parser for Lucene


> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API Etymon Pj http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but it can be coded.
I'll
> leave you to do it because it's kinda fun, but I could provide it if
anyone
> wants it.
>
> I've also implemented it so that the searches can be performed on a
> page-by-page basis. That's pretty cool, i think.
>
> ----- Original Message -----
> From: <sampreet@interactive1.com>
> To: <lucene-user@jakarta.apache.org>
> Cc: <bkopic@interactive1.hr>
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
>
> > Hello,
> >
> > We have been using PDFHandler - a pdf parser provided by websearch, to
> > search in pdf files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary.
> However,
> > it gives some yen signs and other special symbols in the title, summary
> and
> > contents. If anyone is using the websearch component to parse pdf files
> and
> > have encountered this problem, kindly give your suggestions.
> >
> > Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> > encoding as Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> >
> > Anyway, it is simple and Open Source.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>



--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>



- Disclaimer -
This email and any attachments thereto may contain information which is
confidential and/or protected by intellectual property rights and are
intended for the sole use of the recipient(s) named above. Any use of the
information contained herein (including, but not limited to, total or
partial reproduction, communication or distribution in any form) or the
taking of any action in reliance on the contents, by persons other than the
designated recipient(s) is strictly prohibited.

If you have received this email in error, please notify the sender either by
telephone or by email and delete the material from any computer.

Thank you for your cooperation.



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>