Mailing List Archive

RE: PDF parser for Lucene
Hello,

We have been using PDFHandler - a PDF parser provided by websearch - to
search in PDF files. We are trying to get the contents using
pdfHandler.getContents() to build a context-sensitive summary. However, it
returns yen signs and other special symbols in the title, summary and
contents. If anyone is using the websearch component to parse PDF files and
has encountered this problem, kindly share your suggestions.

Note - most of the PDF files use WinAnsiEncoding, and setting the encoding
to Win-12xx doesn't help (roughly the re-decoding sketched below).
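For anyone trying to reproduce this: WinAnsiEncoding corresponds closely to
Windows code page 1252, so the re-decoding we tried amounts to roughly the
following sketch. decode() and rawBytes are made-up names, not part of the
websearch API.

import java.io.UnsupportedEncodingException;

public class WinAnsiDecode {
    // Hypothetical sketch: re-interpret extracted bytes as windows-1252
    // ("Cp1252" in Java), which is essentially what WinAnsiEncoding is.
    public static String decode(byte[] rawBytes)
            throws UnsupportedEncodingException {
        return new String(rawBytes, "Cp1252");
    }
}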

Thanks in advance,

Sampreet
Programmer


You could try this one:
http://www.i2a.com/websearch/

...and then tell me how it works for you.
=:o)


Anyway, it is simple and Open Source.


Have fun,
Paulo Gaspar


Re: PDF parser for Lucene [ In reply to ]
I'm not too familiar with websearch's PDF parsing.

I use a nice API, Etymon Pj: http://www.etymon.com/pj/

It doesn't come with the ability to extract text, but that can be coded. I'll
leave you to do it because it's kinda fun, but I can provide it if anyone
wants it.

I've also implemented it so that searches can be performed on a
page-by-page basis (sketched below). That's pretty cool, I think.
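In rough outline, the page-by-page idea is just one Lucene Document per page,
so hits can be reported per page. This is a simplified sketch, not my actual
code: indexPages() and pageTexts are made-up names, and it assumes you have
already extracted the text of each page.

import java.io.IOException;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PageIndexer {
    // Index each page of one PDF as its own Document.
    public static void indexPages(String file, String[] pageTexts)
            throws IOException {
        IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer(), true);
        for (int i = 0; i < pageTexts.length; i++) {
            Document doc = new Document();
            doc.add(Field.UnIndexed("filename", file));        // stored only
            doc.add(Field.Keyword("page", String.valueOf(i + 1)));
            doc.add(Field.Text("body", pageTexts[i]));         // indexed text
            writer.addDocument(doc);
        }
        writer.close();
    }
}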

----- Original Message -----
From: <sampreet@interactive1.com>
To: <lucene-user@jakarta.apache.org>
Cc: <bkopic@interactive1.hr>
Sent: Friday, November 23, 2001 4:39 PM
Subject: RE: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
Inspired by the Unix "strings" command, I have written a subclass of
FilterReader, which I have called BinaryReader. The idea is simply to make
any proprietary file format indexable by filtering out all non-printable
characters. The assumption is that text is text: you end up with more than
the "visible" text, but not less. After I have tested it and put together
some examples, I will post it here.



----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 2:48 AM
Subject: Re: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
That's pretty interesting, because I've done something similar, though I
(vainly) try to build some intelligence into it.

First I run the text through a regex which filters out most of the words
containing nonsense characters (anything outside [A-Za-z0-9_@]).

Then I run each word through a couple of tests which try to guess whether it
is an English word - testing for 3 consonants in a row, or for 2 digits which
are not next to each other (can you think of a real word like that?!). Based
on these, I rank the documents. Finally, if the word is shorter than 5
characters, I run it through a wordlist.
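
In rough outline, the tests look something like this. A simplified sketch,
not my actual code: looksNonsense() is a made-up name and the wordlist is
left empty for brevity.

import java.util.HashSet;
import java.util.Set;

public class NonsenseFilter {

    // Load a real dictionary here; left empty for brevity.
    private static final Set WORDLIST = new HashSet();

    // True if the token fails any of the heuristics described above.
    public static boolean looksNonsense(String w) {
        int consonantRun = 0;
        int lastDigit = -2;
        for (int i = 0; i < w.length(); i++) {
            char c = w.charAt(i);
            // Words containing characters outside [A-Za-z0-9_@] are nonsense.
            if (!(Character.isLetterOrDigit(c) || c == '_' || c == '@')) {
                return true;
            }
            if (Character.isDigit(c)) {
                // Two digits that are not next to each other.
                if (lastDigit >= 0 && i - lastDigit > 1) {
                    return true;
                }
                lastDigit = i;
                consonantRun = 0;
            } else if (Character.isLetter(c) && "aeiouAEIOU".indexOf(c) < 0) {
                // Three consonants in a row.
                if (++consonantRun >= 3) {
                    return true;
                }
            } else {
                consonantRun = 0;
            }
        }
        // Words shorter than 5 characters must appear in the wordlist.
        return w.length() < 5 && !WORDLIST.contains(w.toLowerCase());
    }
}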

The results are <emphasis>acceptable</emphasis>. The difficulty is that with
large documents (>5MB) you end up with a lot of nonsense terms, actually
exceeding Lucene's built-in limit of 10,000 terms per document (there to
limit memory usage). The result is that such documents are not indexed
completely.

I'd be interested to see how you filter your documents...:)

----- Original Message -----
From: Cecil, Paula New <cnew@fuse.net>
To: Lucene Users List <lucene-user@jakarta.apache.org>; Kelvin Tan
<kelvin@relevanz.com>
Sent: Saturday, November 24, 2001 12:36 AM
Subject: Re: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
Here's part of my email to Otis, with some additions at the bottom.

I was rather intrigued by Websearch's abilities and wanted to compare it with
Pj's, so I ran both on several more PDFs, of a greater variety than I had
tried before. The results were pretty disappointing.

Generally, any PDF file that can be processed by Websearch can also be
handled by Pj. Text is extracted correctly, except for special characters
(which are replaced by a \{code} escape). Whilst I had previously enjoyed
relative success with Pj for extracting text from PDFs, there were many PDF
files on which it simply fell flat on its face.

Probing further, this is what I found. If the PDF is encrypted, the text
generally can't be extracted (Pj has a getEncryptedDictionary() method, which
apparently returns the encryption dictionary if the document is encrypted).
If an encoding method other than ASCII85 or Flate is used, Pj can't handle it
(I've seen LZWDecode used - that's LZW compression, not zip). And then there
are other failures about which I haven't a clue... :)

As a rule of thumb, if the PDF is all plain text (impractical, of course, and
defeating the entire purpose of PDF files), Pj can handle it without a
glitch.

The method of going through the PDF file and extracting all text through some
kind of Reader (brought up by Paula New Cecil) probably wouldn't be effective
either. Most PDF content streams are FlateDecoded, i.e. compressed with the
Flate (zlib/deflate) algorithm. You can, however, read such a stream using
java.util.zip.InflaterInputStream and decompress it first.
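
For example, a minimal sketch: it assumes you have already isolated the raw
bytes of a single FlateDecode'd stream object (everything between the stream
and endstream keywords), which in a real PDF the parser has to locate for
you.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

public class FlateDump {
    // Inflate one compressed PDF content stream into a String.
    public static String inflate(byte[] raw) throws IOException {
        InflaterInputStream in =
            new InflaterInputStream(new ByteArrayInputStream(raw));
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}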

<newly-added>
I was bored and decided to try the files that Pj failed to handle using
xpdf v0.92 instead (specifically pdftotext, under Windows):
http://www.foolabs.com/xpdf

Same results as with Pj: encrypted files are not extracted ("Error: Copying
of text from this document is not allowed."), and the other files fail with
one error or another.

Does anyone have a solution for this?? :)
</newly-added>

Kelvin

----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 6:48 PM
Subject: Re: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
Relative to PDFs, Kelvin is correct: my reader class completely failed. I
brought the PDF up in TextPad to take a look... nothing there in "readable"
form.

My tests with typical office documents seemed to work OK, as did some other
selections of non-text files. TurboTax failed - I'm sure they encrypt, which
makes perfect sense. A search also failed on PowerPoint "word art" text.



----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 7:48 PM
Subject: Re: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
It had occurred to me too that a useful enhancement might be to filter out
nonsense tokens.

Certainly you are way ahead of me on this. You're welcome to put your logic
into the class (or send it to me and I'll take a stab at it). My BinaryReader
also squeezes out sequences of whitespace, but it will replace a binary
character with a space under certain conditions.

I found a lot of single letters in the MS Office files, which I think the
analyzer will get rid of (??? - I'm still pretty new to Lucene).

At any rate, here is my BinaryReader. Improvements welcome!
/* Usage example (writer is an open org.apache.lucene.index.IndexWriter,
   Field is org.apache.lucene.document.Field):

     FileReader fr = new FileReader(args[0]);
     BufferedReader br = new BufferedReader(fr);
     BinaryReader binr = new BinaryReader(br);
     org.apache.lucene.document.Document doc =
         new org.apache.lucene.document.Document();
     doc.add(Field.UnIndexed("filename", args[0]));
     doc.add(Field.Text("body", binr));
     writer.addDocument(doc);
*/

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

public class BinaryReader extends FilterReader {

    // State for collapsing each run of binary characters into one space.
    private char prevchar = '\r';
    private char thischar;

    public BinaryReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        int c;
        // Keep reading past suppressed characters rather than returning
        // them raw.
        while ((c = in.read()) != -1) {
            if (bintest((char) c)) {
                return thischar;
            }
        }
        return -1;
    }

    /**
     * Decides what to emit for character c. Returns true if a character
     * should be emitted (left in thischar), false if c is suppressed.
     */
    private boolean bintest(char c) {
        if (c >= '!' && c <= 'z') {
            // Printable ASCII '!' through 'z': pass through unchanged.
            thischar = c;
            prevchar = c;
            return true;
        } else if (c == '\n' || c == '\t' || c == '\r') {
            // Ordinary whitespace: pass through, but remember a space so a
            // following binary run does not emit another separator.
            thischar = c;
            prevchar = ' ';
            return true;
        } else if (prevchar != ' ') {
            // First binary character after visible text: emit one space.
            thischar = ' ';
            prevchar = ' ';
            return true;
        }
        // Further binary characters in the same run: drop them.
        return false;
    }

    public int read(char[] cbuf) throws IOException {
        return read(cbuf, 0, cbuf.length);
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        char[] cb = new char[len];
        int num = 0;
        // Loop until at least one character survives filtering, so we never
        // return 0 just because a chunk was entirely binary.
        while (num == 0) {
            int cnt = in.read(cb, 0, len);
            if (cnt == -1) {
                return -1;
            }
            int loc = off;
            for (int i = 0; i < cnt; i++) {
                if (bintest(cb[i])) {
                    cbuf[loc++] = thischar;
                    num++;
                }
            }
        }
        return num;
    }
}

----- Original Message -----
From: Kelvin Tan <kelvin@relevanz.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Friday, November 23, 2001 4:24 PM
Subject: Re: PDF parser for Lucene


Re: PDF parser for Lucene [ In reply to ]
A strings-like interpretation is bound to fail on PDFs, since PDF files often
have positioning information between every darn letter of a word. The
developers of PDF were apparently striving for ultimate flexibility in
formatting and positioning, but performance (and larger file sizes) is the
obvious downside.
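
To make that concrete, a decompressed content stream typically looks
something like this made-up fragment, where the numbers inside the array are
kerning adjustments applied between glyphs:

    BT /F1 12 Tf [(H) -28 (ell) -15 (o wor) -22 (ld)] TJ ET

so a naive strings-style scan sees the letters chopped into pieces and
interleaved with operators and numbers.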


----- Original Message -----
From: "Cecil, Paula New" <cnew@fuse.net>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Saturday, November 24, 2001 5:46 PM
Subject: Re: PDF parser for Lucene


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>