If you want to parse PDF documents, the best approach would be to use
the Adobe IFilter for PDF, which is a COM component. You will need to
write a Java client that interacts with that COM component.
I believe it is easily doable, but I have never done anything like
this.
It's a very interesting project, though.
Also, because the IFilter is a COM component, you will have to perform
the PDF-to-text conversion on a Windows machine.
http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp
Regards,
Ivaylo Zlatev
-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Saturday, February 16, 2002 7:15 AM
To: Lucene Developers List
Subject: RE: HTMLParser
Hm, I thought this site would have a PDF parser, but it does not.
It does seem to have an RTF parser:
http://cobase-www.cs.ucla.edu/pub/javacc/
Perhaps some of these things can be adopted by Lucene: people could
contribute Java classes for interacting with specific parsers, and all
of that could then be included in Lucene to work together with those
DocumentHandlers mentioned a few days ago.
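
A minimal sketch of how such contributed per-format classes might plug
together, assuming a hypothetical TextExtractor interface and a
HandlerRegistry that picks an extractor by file extension. Neither name
is part of Lucene or of the DocumentHandler proposal; they are made up
here purely for illustration, and the plain-text extractor stands in for
a real PDF or RTF parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: not a Lucene class, only an illustration.
public class HandlerRegistry {

    /** Hypothetical contract: turn raw file bytes into plain text for indexing. */
    public interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    // Maps a lower-cased file extension (without the dot) to its extractor.
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks the extractor from the file name's extension and runs it. */
    public String extract(String fileName, byte[] content) throws Exception {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }

    public static void main(String[] args) throws Exception {
        HandlerRegistry registry = new HandlerRegistry();
        // Stand-in extractor; a contributed PDF or RTF class would go here instead.
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("notes.TXT", sample)); // prints "hello"
    }
}
```

The extracted string would then be handed to whatever code builds the
Lucene Document; that step is left out because it depends on how the
DocumentHandlers end up being defined.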
Otis
--- Daniel Calvo <dcalvo@task.com.br> wrote:
> Maybe...I'll have to give it a try first
>
> Anyway, I was playing with Lucene's HTMLParser in order to understand
> a little better how JavaCC works. My real interest is in PDF
> and RTF parsers. I've tried the Websearch PDF parser, but it only worked
> well with the examples provided. I wasn't able to correctly parse
> even PDF files distributed by Adobe. I've also had a lot of
> trouble with files converted to PDF (probably via dvi2pdf or
> something like that). Recently I read on this list (or maybe it was
> on the users list) that someone else was having trouble with
> both the Websearch and PJ library parsers.
>
> I've just downloaded Adobe's PDF Specification, and later I'll try to
> see if there's any room for improvement in the Websearch code. I
> know PDF has various features (compression, cryptography, etc.) that
> complicate the parsing, and I'm not willing to spend much time
> on this, but I'll probably try something.
>
> --Daniel
>
> > -----Original Message-----
> > From: Paulo Gaspar [mailto:paulo.gaspar@krankikom.de]
> > Sent: Friday, February 15, 2002 11:14 PM
> > To: Lucene Developers List
> > Subject: RE: HTMLParser
> >
> >
> > Could the following Xerces-based HTML parser be of interest for
> > your work?
> >
> > This is just the initial ANNOUNCE but there are further
> > developments.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> > > -----Original Message-----
> > > From: Andy Clark [mailto:andyc@apache.org]
> > > Sent: Saturday, February 09, 2002 4:16 AM
> > > To: general@xml.apache.org
> > > Cc: xerces-j-dev@xml.apache.org
> > > Subject: [ANNOUNCE] Xerces HTML Parser
> > >
> > >
> > > For a long time users have asked if Xerces can parse HTML files.
> > > But since most HTML documents are not well-formed XML documents,
> > > it is generally not possible to use a conforming XML parser to
> > > read HTML documents.
> > >
> > > However, the Xerces Native Interface (XNI) that is the foundation
> > > of the Xerces2 implementation defines a framework that allows
> > > different kinds of parsers to be constructed by connecting a
> > > pipeline of parser components. Therefore, as long as a component
> > > can be written that generates the appropriate XNI "events", then
> > > it can be used to emit SAX events, build DOM trees, or anything
> > > else that you can think of.
> > >
> > > So, as a fun little exercise, I have written a basic HTML parser
> > > using XNI. It consists of an HTML scanner component that can scan
> > > HTML files and generate XNI events and a tag balancing component.
> > > The tag balancer cleans up the events produced by the scanner,
> > > balancing mismatched tags and adding tags where necessary. And
> > > it does all of this in a streaming manner to minimize the amount
> > > of memory required.
> > >
> > > Since I wrote the HTML parser as an example of using XNI and
> > > because the code is considered alpha quality (but it seems to
> > > work quite well, actually!), I am posting the code with a very
> > > limited license. Even though it contains the complete source
> > > code for the HTML parser, the license only allows the user to
> > > experiment but gives no right to actually use the code in a
> > > product.
> > >
> > > If the source isn't "free" or "open", why release it at all?
> > > I want to get an idea of what people think of the code first.
> > > Then, if there's enough interest, I would like to either donate
> > > the code to the Xerces-J project or make it available elsewhere
> > > under a true open source license.
> > >
> > > So, if you've been looking for a way to parse HTML documents
> > > please try out the HTML parser and let me know what you think.
> > > There should be enough information in the documentation to get
> > > you started. Check out the "NekoHTML" project listed on my
> > > Apache web site: http://www.apache.org/~andyc/
> > >
> > > Have fun!
> > >
> > > --
> > > Andy Clark * andyc@apache.org
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> > > For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> > >
> >
> > --
> > To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
> >