Mailing List Archive: HTML Parser

HTML Parser

Apr 9, 2002, 7:21 AM

Post #1 of 12 (3028 views)

Hi,

I am working with the Lucene demo and would like to compile the demo so that
I may eventually modify it for my own use. I am using the source from
lucene-demos-1.2-rc4.jar.zip.

However, the HTMLParser class has the filename HTMLParser.jj and won't
compile. I changed the name to HTMLParser.java, but I still get the same
problem.

Any help would be greatly appreciated.

Thanks,
Neal

Neal Weinstein
Manager Software Development
blue*spark
neal@bluespark.com
T (416) 971-6612 x205
F (416) 971-6549
489 King Street West, Suite 200
Toronto, Ontario M5V 1K4 Canada
www.bluespark.com

RE: HTML Parser [ In reply to ]

mark at javamark

Apr 9, 2002, 7:26 AM

Post #2 of 12 (3025 views)


Neal

That's because HTMLParser.jj is NOT a Java file; it contains the grammar
for JavaCC, which generates the Java sources from it. Have a look at

http://www.quiotix.com/downloads/html-parser/

Regards

Mark


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

HTML parser [ In reply to ]

otis_gospodnetic at yahoo

Apr 18, 2002, 10:28 PM

Post #3 of 12 (3039 views)


Hello,

I need to select an HTML parser for the application that I'm writing
and I'm not sure what to choose.
The HTML parser included with Lucene looks flimsy, JTidy looks like a
hack and an overkill, using classes written for Swing
(javax.swing.text.html.parser) seems wrong, and I haven't tried David
McNicol's parser (included with Spindle).

Somebody on this list must have done some research on this subject.
Can anyone share some experiences?
Have you found a better HTML parser than any of those I listed above?
If your application deals with HTML, what do you use for parsing it?

Thanks,
Otis


Re: HTML parser [ In reply to ]

parrt at jguru

Apr 18, 2002, 10:38 PM

Post #4 of 12 (3021 views)



Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it
accepts. Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all
sorts of HTML from various websites to suck them into the jGuru search
engine. I use a simple stripHTML() method I wrote to handle it. Works
great. Kills everything but the text. Is that the kind of thing you
are looking for, or do you really want to parse rather than filter?

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org


Re: HTML parser [ In reply to ]

otis_gospodnetic at yahoo

Apr 18, 2002, 11:05 PM

Post #5 of 12 (3017 views)


Hello Terence,

Ah, you got me.
I guess I need a bit of both.
I need to just strip HTML and get raw body text so that I can stick it
in Lucene's index.
I would also like something that can extract at least the
<title>...</title> stuff, so that I can stick that in a separate field
in the Lucene index.
While doing that I, like you, need to be able to handle poorly
formatted web pages.

In the future I may need something that has the ability to extract HREFs,
but I'll stick to one of the XP principles and just look for something
that meets current needs :)

I looked for an ANTLR-based HTML parser a few days ago, but must have
missed the one you pointed out. I'll take a look at it now.
Can you share or describe your stripHTML method? Simple Java that
looks for <s and >s, or something smarter?

Thanks,
Otis
P.S.
This type of thing makes me wish I could use Perl or Python :)


RE: HTML parser [ In reply to ]

mark at javamark

Apr 19, 2002, 8:40 AM

Post #6 of 12 (3021 views)


You can use the Swing HTML parser to do this, but it is only an HTML 3.2
DTD-based parser.
I have written (attached) a total hack job for breaking up an HTML page into
its component parts; the code gives you an idea. If anyone wants to know how
to use the Swing-based parser, I can add some code.
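
For anyone who wants to try the Swing route, here is a minimal sketch of driving javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback to collect the title and the remaining text separately. It is not the attached hack job. The class name SwingTextExtractor and its methods are made up for illustration, and the whitespace handling is only a guess at what an indexer would want.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects <title> text and all other text separately.
class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inStyle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = true;
        if (tag == HTML.Tag.STYLE) inStyle = true;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = false;
        if (tag == HTML.Tag.STYLE) inStyle = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inStyle) return; // ignore stylesheet text
        StringBuilder target = inTitle ? title : body;
        // Assumption: separate text chunks with a space so words from
        // adjacent elements do not run together.
        target.append(data).append(' ');
    }

    String getTitle() { return normalize(title); }

    String getBody() { return normalize(body); }

    // Collapse runs of whitespace into single spaces.
    private static String normalize(StringBuilder sb) {
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    static SwingTextExtractor parse(Reader in) throws IOException {
        SwingTextExtractor callback = new SwingTextExtractor();
        // The last argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(in, callback, true);
        return callback;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>My Page</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        SwingTextExtractor result = parse(new StringReader(html));
        System.out.println("title: " + result.getTitle());
        System.out.println("body:  " + result.getBody());
    }
}
```

The two strings could then go into separate fields of the index, one for the title and one for the body text.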

Mark


Re: HTML parser [ In reply to ]

parrt at jguru

Apr 19, 2002, 10:21 AM

Post #7 of 12 (3020 views)


Hi Otis,

The idea behind stripHTML is pretty simple. It's just a hand-built
lexer that looks like this:

while more char
    if comment start, scarf til end comment
    if char is < then
        if SCRIPT tag scarf til end SCRIPT;
        [same with A, STYLE, HEAD, FORM etc...]
    endif
    if &blort; and not stuff like LT, GT, AMP, QUOT, NBSP, scarf
    if whitespace scarf leaving just one bit of whitespace
    otherwise just add char to stripped text.
endwhile

Nothing tricky in the code, but I should have used ANTLR itself to build
this lexer. It got to be a big method with lots of bookkeeping code, as
in all lexers.
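
The loop above could be sketched in Java roughly as follows. This is an illustrative sketch, not the actual stripHTML() method. The class name HtmlStripper, the exact list of skipped elements, the choice to decode the five known entities, and the choice to treat every tag as a word break are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class HtmlStripper {
    // Elements dropped together with their content. SCRIPT, A, STYLE, HEAD and
    // FORM come from the description above; the exact list is an assumption.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("script", "a", "style", "head", "form"));

    static String stripHtml(String html) {
        StringBuilder out = new StringBuilder();
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (html.startsWith("<!--", i)) {
                // Comment: skip to the closing "-->", or to the end if unterminated.
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? n : end + 3;
            } else if (c == '<') {
                String name = tagName(html, i + 1);
                i = skipPast(html, '>', i);
                if (SKIPPED.contains(name)) {
                    // Drop everything up to and including the matching end tag.
                    int endTag = findEndTag(html, name, i);
                    i = (endTag < 0) ? n : skipPast(html, '>', endTag);
                }
                appendSpace(out); // assumption: a tag separates words
            } else if (c == '&' && entityEnd(html, i) > 0) {
                int semi = entityEnd(html, i);
                appendEntity(out, html.substring(i + 1, semi).toLowerCase(Locale.ROOT));
                i = semi + 1;
            } else if (Character.isWhitespace(c)) {
                appendSpace(out); // collapse whitespace runs to one space
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString().trim();
    }

    // Lower-cased element name starting at 'start'; empty for end tags.
    private static String tagName(String html, int start) {
        int end = start;
        while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) end++;
        return html.substring(start, end).toLowerCase(Locale.ROOT);
    }

    // Index just past the next 'ch' at or after 'from', or the end of input.
    private static int skipPast(String html, char ch, int from) {
        int idx = html.indexOf(ch, from);
        return (idx < 0) ? html.length() : idx + 1;
    }

    // Case-insensitive search for "</name" not followed by another name character.
    private static int findEndTag(String html, String name, int from) {
        String target = "</" + name;
        for (int i = from; i + target.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, target, 0, target.length())) {
                int after = i + target.length();
                if (after == html.length() || !Character.isLetterOrDigit(html.charAt(after))) return i;
            }
        }
        return -1;
    }

    // Index of the ';' closing an entity that starts at 'amp', or -1 if none.
    private static int entityEnd(String html, int amp) {
        int j = amp + 1;
        while (j < html.length()
                && (Character.isLetterOrDigit(html.charAt(j)) || html.charAt(j) == '#')) j++;
        return (j > amp + 1 && j < html.length() && html.charAt(j) == ';') ? j : -1;
    }

    // Decode the five known entities; every other entity is dropped.
    private static void appendEntity(StringBuilder out, String name) {
        switch (name) {
            case "lt": out.append('<'); break;
            case "gt": out.append('>'); break;
            case "amp": out.append('&'); break;
            case "quot": out.append('"'); break;
            case "nbsp": appendSpace(out); break;
            default: break;
        }
    }

    private static void appendSpace(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') out.append(' ');
    }
}
```

Calling HtmlStripper.stripHtml(page) on a fetched page returns a single line of text suitable for handing to an indexer.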

Ter


RE: HTML parser [ In reply to ]

ian at plusfour

Apr 19, 2002, 10:40 AM

Post #8 of 12 (3018 views)


Are there core classes in Lucene that allow one to feed it links, so that
it captures the contents of those URLs into the index?

Or does one write a file-capture class that fetches each URL and stores the
file in a directory, and then index the local directory?

Ian


RE: HTML parser [ In reply to ]

otis_gospodnetic at yahoo

Apr 19, 2002, 10:50 AM

Post #9 of 12 (3016 views)


Such classes are not included with Lucene.
This was _just_ mentioned on this list earlier today.
Look at the archives and search for crawler, URL, lucene sandbox, etc.
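
For the do-it-yourself route, a file-capture step can be quite small. The following is only a sketch using the standard java.net and java.nio.file classes; the class name PageCapture and its capture() method are hypothetical, and it does no crawling, link following, or error recovery.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves the content behind one URL into a local directory.
class PageCapture {
    static Path capture(URL url, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        try (InputStream in = url.openStream()) {
            // Overwrite any earlier copy of the same page.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

The directory filled this way can then be indexed like any other local directory of HTML files.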

Otis


Re: HTML parser [ In reply to ]

black at apple

Apr 19, 2002, 2:26 PM

Post #10 of 12 (3017 views)


While trying to research the same thing, I found the following. Here's
a good example of link extraction:

http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

It seems like I could use this to also get the text out from between the
tags but haven't been able to do it yet. It seems like it should be
simple but geez...my head hurts.
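
One way to get both the links and the text from a single pass is a callback for the Swing HTML parser, sketched below. This assumes the Swing parser (javax.swing.text.html.parser) is acceptable for the job; it is not taken from the article linked above. The class name LinkAndTextCollector and its members are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers HREF values and the text between tags.
class LinkAndTextCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            // Named anchors have no HREF, so the attribute may be null.
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) links.add(href.toString());
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Assumption: separate chunks with a space; this can leave a stray
        // space before punctuation that directly follows a tag.
        text.append(data).append(' ');
    }

    List<String> getLinks() { return links; }

    String getText() { return text.toString().replaceAll("\\s+", " ").trim(); }

    static LinkAndTextCollector collect(Reader in) throws IOException {
        LinkAndTextCollector callback = new LinkAndTextCollector();
        new ParserDelegator().parse(in, callback, true);
        return callback;
    }
}
```

The text comes from handleText(), which the parser calls for each run of character data, so no extra work is needed to find what sits between the tags.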


Re: HTML parser [ In reply to ]

lists at ehatchersolutions

Apr 19, 2002, 5:32 PM

Post #11 of 12 (3022 views)


HttpUnit (which uses JTidy under the covers) makes child's play out of
pulling out links and navigating to them.

The only caveat (and this would be true for practically all tools, I
suspect) is that the HTML has to be relatively well-formed for it to work
well. JTidy can be somewhat forgiving though.

Erik


Re: HTML parser [ In reply to ]

brian at quiotix

Apr 19, 2002, 5:46 PM

Post #12 of 12 (3021 views)


>While trying to research the same thing, I found the following...here's a
>good example of link extraction.....

Try http://www.quiotix.com/opensource/html-parser

It's easy to write a Visitor which extracts the links; it should take about
ten lines of code.

--
Brian Goetz
Quiotix Corporation
brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-8032

http://www.quiotix.com

