Mailing List Archive: HTML Parser

HTML Parser

Dec 17, 2001, 9:31 AM

Post #1 of 12 (2471 views)

Hi,

How should I integrate the HTML Parser (which is in the demo directory) in a
new project ?
In particular with the HTMLParser.jj file.
Do a need to compile it before trying to use it in my code.
Any help would be apreciated !
Thank.

-----
Christophe

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: HTML Parser [ In reply to ]

karl at gan

Dec 18, 2001, 4:05 AM

Post #2 of 12 (2453 views)

Permalink

*.jj files are compiled with javacc, there is a javacc.zip file in your lib
directory, but you should download the compilerset.

mvh karl øie

-----Original Message-----
From: Christophe GOGUYER DESSAGNES [mailto:cgd@arcadsoftware.com]
Sent: 17. desember 2001 17:32
To: lucene-user@jakarta.apache.org
Subject: HTML Parser

Hi,

How should I integrate the HTML Parser (which is in the demo directory) in a
new project ?
In particular with the HTMLParser.jj file.
Do a need to compile it before trying to use it in my code.
Any help would be apreciated !
Thank.

-----
Christophe

--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: HTML parser [ In reply to ]

lucene at libero

Apr 20, 2002, 6:29 AM

Post #3 of 12 (2444 views)

Permalink

Hi all,

I'm very interested about this thread. I also have to solve the problem
of spidering web sites, creating index (weel about this there is the
BIG problem that lucene can't be integrated easily with a DB),
extracting links from the page repeating all the process.

For extracting links from a page I'm thinking to use JTidy. I think
that with this library you can also parse a non well formed page (that
you can take from the web with URLConnection) setting the property to
clean the page. The class Tidy() returns a org.w3c.dom.Document that
you can use for analizing all the document: for example you can use
doc.getElementsByTagName(a) for taking all the a elements. You can
parse as xml.

Did someone solve the problem to spider recursively a web pages?

Laura

>
> >While trying to research the same thing, I found the following...here
's a
> >good example of link extraction.....
>
> Try http://www.quiotix.com/opensource/html-parser
>
> Its easy to write a Visitor which extracts the links; should take abou
t ten
> lines of code.
>
>
>
> --
> Brian Goetz
> Quiotix Corporation
> brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-
8032
>
> http://www.quiotix.com
>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-
unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-
help@jakarta.apache.org>
>
>

Re:_HTML_parser [ In reply to ]

otis_gospodnetic at yahoo

Apr 20, 2002, 7:10 AM

Post #4 of 12 (2446 views)

Permalink

Laura,

Search the lucene-user and lucene-dev archives for things like:
crawler
spider
spindle
lucene sandbox

Spindle is something you may want to look at, as is MoJo (not mentioned
on lucene lists, use Google).

Otis

> Did someone solve the problem to spider recursively a web pages?

> > >While trying to research the same thing, I found the
> following...here
> 's a
> > >good example of link extraction.....
> >
> > Try http://www.quiotix.com/opensource/html-parser
> >
> > Its easy to write a Visitor which extracts the links; should take
> abou
> t ten
> > lines of code.

__________________________________________________
Do You Yahoo!?
Yahoo! Games - play chess, backgammon, pool and more
http://games.yahoo.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: HTML parser [ In reply to ]

lucene at libero

Apr 21, 2002, 5:47 AM

Post #5 of 12 (2443 views)

Permalink

Hi Otis,

thanks for your reply. I have been looking for Spindle and Mojo for 2
hours but I don't found anything.

Can you help me? Wher can I find something?

Thanks for your help and time

Laura

> Laura,
>
> Search the lucene-user and lucene-dev archives for things like:
> crawler
> spider
> spindle
> lucene sandbox
>
> Spindle is something you may want to look at, as is MoJo (not mentione
d
> on lucene lists, use Google).
>
> Otis
>
> > Did someone solve the problem to spider recursively a web pages?
>
> > > >While trying to research the same thing, I found the
> > following...here
> > 's a
> > > >good example of link extraction.....
> > >
> > > Try http://www.quiotix.com/opensource/html-parser
> > >
> > > Its easy to write a Visitor which extracts the links; should take
> > abou
> > t ten
> > > lines of code.
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Games - play chess, backgammon, pool and more
> http://games.yahoo.com/
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-
unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-
help@jakarta.apache.org>
>
>

Re:_HTML_parser [ In reply to ]

otis_gospodnetic at yahoo

Apr 21, 2002, 9:27 AM

Post #6 of 12 (2446 views)

Permalink

Laura,

http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b

Oops, it's JoBo, not MoJo :)
http://www.matuschek.net/software/jobo/

Otis

--- "lucene@libero.it" <lucene@libero.it> wrote:
> Hi Otis,
>
> thanks for your reply. I have been looking for Spindle and Mojo for 2
>
> hours but I don't found anything.
>
> Can you help me? Wher can I find something?
>
> Thanks for your help and time
>
>
> Laura
>
>
>
>
> > Laura,
> >
> > Search the lucene-user and lucene-dev archives for things like:
> > crawler
> > spider
> > spindle
> > lucene sandbox
> >
> > Spindle is something you may want to look at, as is MoJo (not
> mentione
> d
> > on lucene lists, use Google).
> >
> > Otis
> >
> > > Did someone solve the problem to spider recursively a web pages?
> >
> > > > >While trying to research the same thing, I found the
> > > following...here
> > > 's a
> > > > >good example of link extraction.....
> > > >
> > > > Try http://www.quiotix.com/opensource/html-parser
> > > >
> > > > Its easy to write a Visitor which extracts the links; should
> take
> > > abou
> > > t ten
> > > > lines of code.
> >
> >
> > __________________________________________________
> > Do You Yahoo!?
> > Yahoo! Games - play chess, backgammon, pool and more
> > http://games.yahoo.com/
> >
> > --
> > To unsubscribe, e-mail: <mailto:lucene-user-
> unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail: <mailto:lucene-user-
> help@jakarta.apache.org>
> >
> >

__________________________________________________
Do You Yahoo!?
Yahoo! Games - play chess, backgammon, pool and more
http://games.yahoo.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: HTML parser [ In reply to ]

black at apple

Apr 21, 2002, 2:49 PM

Post #7 of 12 (2442 views)

Permalink

I think I have found the secret recipe for doing this.......

1. The example at Sun for link extraction..this was very easy to convert
over to my application.
http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

2. Brian Goetz's (great) Library at
http://www.quiotix.com/opensource/html-parser

While the "Visitor Design Pattern" might make your eyes cross at first,
it's actually pretty cool. Here's a simple Vistor class that I wrote
to extract the HTML. I also make reference to a piece of code for
searching and replacing strings in a class called StripperUtils.java.
If I understood the Visitor thing better, I could probably produce
something more elegant...like the conversion of "&..;" with it's
appropriate unencoded text.

---------------begin HTMLTextVisitor.java --------

import com.quiotix.html.parser.*;
import java.io.*;

public class HTMLTextVisitor extends HtmlVisitor{
protected PrintWriter out;

public HTMLTextVisitor(OutputStream os) {
out = new PrintWriter(os);
}

public HTMLTextVisitor(OutputStream os, String encoding) throws
UnsupportedEncodingException {
out = new PrintWriter( new OutputStreamWriter(os,
encoding) );
}

public void finish() {
out.flush();
}

public void visit(HtmlDocument.Text t) {
String txt = t.toString();
txt = StripperUtils.replace(txt," "," ");
txt = StripperUtils.replace(txt," "," "); // for some
wierd reason, the first pass doesn't get all of them
out.print(txt);
}

---------- end HTMLTextVisitor-------

--------- begin StripperUtils.java ----------

public static String replace(String originalText,
String subStringToFind, String
subStringToReplaceWith) {
int s = 0;
int e = 0;
StringBuffer newText = new StringBuffer();
while ((e = originalText.indexOf(subStringToFind, s)) >= 0) {
newText.append(originalText.substring(s, e));
newText.append(subStringToReplaceWith);
s = e + subStringToFind.length();
}
newText.append(originalText.substring(s));
return newText.toString();
}

--------- End StripperUtils.java --------------

On Saturday, April 20, 2002, at 09:29 AM, lucene@libero.it wrote:

> Hi all,
>
> I'm very interested about this thread. I also have to solve the problem
> of spidering web sites, creating index (weel about this there is the
> BIG problem that lucene can't be integrated easily with a DB),
> extracting links from the page repeating all the process.
>
> For extracting links from a page I'm thinking to use JTidy. I think
> that with this library you can also parse a non well formed page (that
> you can take from the web with URLConnection) setting the property to
> clean the page. The class Tidy() returns a org.w3c.dom.Document that
> you can use for analizing all the document: for example you can use
> doc.getElementsByTagName(a) for taking all the a elements. You can
> parse as xml.
>
> Did someone solve the problem to spider recursively a web pages?
>
> Laura
>
>
>
>
>>
>>> While trying to research the same thing, I found the following...here
> 's a
>>> good example of link extraction.....
>>
>> Try http://www.quiotix.com/opensource/html-parser
>>
>> Its easy to write a Visitor which extracts the links; should take abou
> t ten
>> lines of code.
>>
>>
>>
>> --
>> Brian Goetz
>> Quiotix Corporation
>> brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-
> 8032
>>
>> http://www.quiotix.com
>>
>>
>> --
>> To unsubscribe, e-mail: <mailto:lucene-user-
> unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail: <mailto:lucene-user-
> help@jakarta.apache.org>
>>
>>

Re: Re: HTML parser [ In reply to ]

pixel at bitmechanic

Apr 21, 2002, 10:16 PM

Post #8 of 12 (2450 views)

Permalink

On Sun, 21 Apr 2002, [iso-8859-1] lucene@libero.it wrote:

> thanks for your reply. I have been looking for Spindle and Mojo for 2
> hours but I don't found anything.

spindle is at:

http://www.bitmechanic.com/projects/spindle/

cheers

-- James

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: HTML parser [ In reply to ]

lucene at libero

Apr 22, 2002, 2:43 AM

Post #9 of 12 (2440 views)

Permalink

Hi all,

did someone try jobo?

It seems a good software which can be extended.

Has someone some experiences about it?

Laura

> Laura,
>
> http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b
>
> Oops, it's JoBo, not MoJo :)
> http://www.matuschek.net/software/jobo/
>
> Otis
>
> --- "lucene@libero.it" <lucene@libero.it> wrote:
> > Hi Otis,
> >
> > thanks for your reply. I have been looking for Spindle and Mojo for
2
> >
> > hours but I don't found anything.
> >
> > Can you help me? Wher can I find something?
> >
> > Thanks for your help and time
> >
> >
> > Laura
> >
> >
> >
> >
> > > Laura,
> > >
> > > Search the lucene-user and lucene-dev archives for things like:
> > > crawler
> > > spider
> > > spindle
> > > lucene sandbox
> > >
> > > Spindle is something you may want to look at, as is MoJo (not
> > mentione
> > d
> > > on lucene lists, use Google).
> > >
> > > Otis
> > >
> > > > Did someone solve the problem to spider recursively a web pages?
> > >
> > > > > >While trying to research the same thing, I found the
> > > > following...here
> > > > 's a
> > > > > >good example of link extraction.....
> > > > >
> > > > > Try http://www.quiotix.com/opensource/html-parser
> > > > >
> > > > > Its easy to write a Visitor which extracts the links; should
> > take
> > > > abou
> > > > t ten
> > > > > lines of code.
> > >
> > >
> > > __________________________________________________
> > > Do You Yahoo!?
> > > Yahoo! Games - play chess, backgammon, pool and more
> > > http://games.yahoo.com/
> > >
> > > --
> > > To unsubscribe, e-mail: <mailto:lucene-user-
> > unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail: <mailto:lucene-user-
> > help@jakarta.apache.org>
> > >
> > >
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Games - play chess, backgammon, pool and more
> http://games.yahoo.com/
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-
unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-
help@jakarta.apache.org>
>
>

Re: Re:_HTML_parser [ In reply to ]

kelvin at relevanz

Apr 24, 2002, 3:26 AM

Post #10 of 12 (2443 views)

Permalink

Otis, what's the final conclusion you've arrived at regarding the HTML
filter/parsing?

I have pretty much the same requirements as you do right now (extract text,
and obtain the title).

Kelvin

----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, April 22, 2002 12:27 AM
Subject: Re:_HTML_parser

> Laura,
>
> http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b
>
> Oops, it's JoBo, not MoJo :)
> http://www.matuschek.net/software/jobo/
>
> Otis
>
> --- "lucene@libero.it" <lucene@libero.it> wrote:
> > Hi Otis,
> >
> > thanks for your reply. I have been looking for Spindle and Mojo for 2
> >
> > hours but I don't found anything.
> >
> > Can you help me? Wher can I find something?
> >
> > Thanks for your help and time
> >
> >
> > Laura
> >
> >
> >
> >
> > > Laura,
> > >
> > > Search the lucene-user and lucene-dev archives for things like:
> > > crawler
> > > spider
> > > spindle
> > > lucene sandbox
> > >
> > > Spindle is something you may want to look at, as is MoJo (not
> > mentione
> > d
> > > on lucene lists, use Google).
> > >
> > > Otis
> > >
> > > > Did someone solve the problem to spider recursively a web pages?
> > >
> > > > > >While trying to research the same thing, I found the
> > > > following...here
> > > > 's a
> > > > > >good example of link extraction.....
> > > > >
> > > > > Try http://www.quiotix.com/opensource/html-parser
> > > > >
> > > > > Its easy to write a Visitor which extracts the links; should
> > take
> > > > abou
> > > > t ten
> > > > > lines of code.
> > >
> > >
> > > __________________________________________________
> > > Do You Yahoo!?
> > > Yahoo! Games - play chess, backgammon, pool and more
> > > http://games.yahoo.com/
> > >
> > > --
> > > To unsubscribe, e-mail: <mailto:lucene-user-
> > unsubscribe@jakarta.apache.org>
> > > For additional commands, e-mail: <mailto:lucene-user-
> > help@jakarta.apache.org>
> > >
> > >
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Games - play chess, backgammon, pool and more
> http://games.yahoo.com/
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: _HTML_parser [ In reply to ]

paulo.gaspar at krankikom

Apr 24, 2002, 1:29 PM

Post #11 of 12 (2451 views)

Permalink

Did anyone take a look at NekoHTML?
http://www.apache.org/~andyc/

Con: needs Xerces.

Have fun,
Paulo Gaspar

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: html parser [ In reply to ]

cmad at lanlab

Jun 17, 2002, 2:56 PM

Post #12 of 12 (2447 views)

Permalink

yes. download JavaCC, point to it in built.properties as said by Ant, and
build it.

----- Original Message -----
From: <Srinivas.Kotamraju@bmsus.com>
To: <lucene-user@jakarta.apache.org>
Sent: Monday, June 17, 2002 11:00 PM
Subject: html parser

> hi all,
> does any one how to use the htmlparser.jj class in the demo package? does
> this class needs to be build before using it.
> thanks
> srini
>
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>