I think I have found the secret recipe for doing this...

1. The example at Sun for link extraction. This was very easy to convert
over to my application:
http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

2. Brian Goetz's (great) library at
http://www.quiotix.com/opensource/html-parser

While the "Visitor Design Pattern" might make your eyes cross at first,
it's actually pretty cool. Here's a simple Visitor class that I wrote
to extract the text from HTML. It also references a search-and-replace
method from a class called StripperUtils.java. If I understood the
Visitor pattern better, I could probably produce something more
elegant, like converting each "&..;" entity into its
appropriate unencoded text.
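The entity conversion wished for above can be done with a small lookup table. This is only a sketch; the EntityDecoder name and the handful of entities it knows about are mine, not from the original code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of the original post): decodes a few
// common HTML entities into their plain-text equivalents.
public class EntityDecoder {
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&amp;", "&");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
    }

    public static String decode(String text) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                // Look for the terminating ';' and try the lookup table.
                int end = text.indexOf(';', i);
                if (end >= 0) {
                    String replacement = ENTITIES.get(text.substring(i, end + 1));
                    if (replacement != null) {
                        out.append(replacement);
                        i = end + 1;
                        continue;
                    }
                }
            }
            // Unknown entity or plain character: copy it through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(EntityDecoder.decode("Tom &amp; Jerry&nbsp;&lt;3"));
    }
}
```

Unrecognized entities pass through untouched, so running it over already-decoded text is harmless.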
---------------begin HTMLTextVisitor.java --------
import com.quiotix.html.parser.*;
import java.io.*;

public class HTMLTextVisitor extends HtmlVisitor {

    protected PrintWriter out;

    public HTMLTextVisitor(OutputStream os) {
        out = new PrintWriter(os);
    }

    public HTMLTextVisitor(OutputStream os, String encoding)
            throws UnsupportedEncodingException {
        out = new PrintWriter(new OutputStreamWriter(os, encoding));
    }

    public void finish() {
        out.flush();
    }

    public void visit(HtmlDocument.Text t) {
        String txt = t.toString();
        // Collapse doubled spaces; done twice because, for some weird
        // reason, the first pass doesn't get all of them.
        txt = StripperUtils.replace(txt, "  ", " ");
        txt = StripperUtils.replace(txt, "  ", " ");
        out.print(txt);
    }
}
---------- end HTMLTextVisitor-------
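Brian's point that a link-extracting visitor takes about ten lines can be illustrated without the library itself. This self-contained sketch mimics the pattern with a minimal node hierarchy; every class name here is invented for illustration and is not the quiotix API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy visitor-pattern sketch (names invented for illustration): a tiny
// document tree and a visitor that collects href values from anchor tags.
public class LinkVisitorDemo {

    interface Node { void accept(Visitor v); }

    interface Visitor {
        void visitTag(Tag t);
        void visitText(Text t);
    }

    static class Tag implements Node {
        final String name;
        final String href; // null when the tag has no href attribute
        Tag(String name, String href) { this.name = name; this.href = href; }
        public void accept(Visitor v) { v.visitTag(this); }
    }

    static class Text implements Node {
        final String body;
        Text(String body) { this.body = body; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    // The "ten lines of code" part: remember every href on an <a> tag.
    static class LinkCollector implements Visitor {
        final List<String> links = new ArrayList<>();
        public void visitTag(Tag t) {
            if ("a".equals(t.name) && t.href != null) links.add(t.href);
        }
        public void visitText(Text t) { /* text nodes carry no links */ }
    }

    static List<String> extractLinks(List<Node> doc) {
        LinkCollector c = new LinkCollector();
        for (Node n : doc) n.accept(c);
        return c.links;
    }

    public static void main(String[] args) {
        List<Node> doc = Arrays.asList(
            new Tag("a", "http://jakarta.apache.org/lucene/"),
            new Text("Lucene"),
            new Tag("b", null));
        System.out.println(extractLinks(doc));
    }
}
```

The nice property of the pattern is that HTMLTextVisitor above and a link collector can walk the same parsed tree; each visitor just ignores the node types it doesn't care about.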
--------- begin StripperUtils.java ----------
public class StripperUtils {

    public static String replace(String originalText,
                                 String subStringToFind,
                                 String subStringToReplaceWith) {
        int s = 0;
        int e = 0;
        StringBuffer newText = new StringBuffer();
        while ((e = originalText.indexOf(subStringToFind, s)) >= 0) {
            newText.append(originalText.substring(s, e));
            newText.append(subStringToReplaceWith);
            s = e + subStringToFind.length();
        }
        newText.append(originalText.substring(s));
        return newText.toString();
    }
}
--------- End StripperUtils.java --------------
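The hard-coded double call in visit() above hints at why the first pass "doesn't get all of them": a single left-to-right pass over a run of four spaces collapses it to two, not one, because matches can't overlap. A repeat-until-stable wrapper fixes that for any run length. The class name here is mine, a sketch, not part of the original post:

```java
// Hypothetical helper (not from the original post): applies a single-pass
// substring replacement repeatedly until the text stops changing, removing
// the need for a hard-coded second pass.
public class FixedPointReplace {

    // Same single-pass algorithm as StripperUtils.replace.
    public static String replace(String text, String find, String repl) {
        int s = 0, e;
        StringBuffer out = new StringBuffer();
        while ((e = text.indexOf(find, s)) >= 0) {
            out.append(text.substring(s, e));
            out.append(repl);
            s = e + find.length();
        }
        out.append(text.substring(s));
        return out.toString();
    }

    // Repeat until a pass makes no change.  Assumes repl does not contain
    // find (otherwise this would never terminate).
    public static String replaceAll(String text, String find, String repl) {
        String previous;
        do {
            previous = text;
            text = replace(text, find, repl);
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        // Four spaces need two passes to collapse down to one.
        System.out.println(replaceAll("a    b", "  ", " "));
    }
}
```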
On Saturday, April 20, 2002, at 09:29 AM, lucene@libero.it wrote:
> Hi all,
>
> I'm very interested in this thread. I also have to solve the problem
> of spidering web sites, creating the index (well, about this there is
> the BIG problem that Lucene can't be integrated easily with a DB),
> extracting the links from each page, and repeating the whole process.
>
> For extracting links from a page I'm thinking of using JTidy. I think
> that with this library you can also parse a non-well-formed page (which
> you can fetch from the web with URLConnection) by setting the property
> to clean the page. The Tidy class returns an org.w3c.dom.Document that
> you can use for analyzing the whole document: for example you can use
> doc.getElementsByTagName("a") to get all the <a> elements. You can
> parse it as XML.
>
> Did someone solve the problem of spidering web pages recursively?
>
> Laura
>
>
>
>
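On the JTidy route Laura describes: once Tidy has produced an org.w3c.dom.Document, pulling the anchors really is just getElementsByTagName("a"). The sketch below uses the JDK's own XML parser, which only accepts well-formed markup; in a real spider, JTidy would be the piece that cleans the raw HTML first. Class and method names are mine:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: collect href values from a DOM Document, the same step one
// would run after cleaning a page with JTidy.
public class DomLinkExtractor {

    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            // getAttribute returns "" when the attribute is absent.
            String href = a.getAttribute("href");
            if (!href.isEmpty()) links.add(href);
        }
        return links;
    }

    // Parse a well-formed (X)HTML string; real HTML needs tidying first.
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://jakarta.apache.org/lucene/\">Lucene</a>"
                + "<a>no href</a></body></html>";
        System.out.println(extractLinks(parse(page)));
    }
}
```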
>>
>>> While trying to research the same thing, I found the following...here's a
>>> good example of link extraction.....
>>
>> Try http://www.quiotix.com/opensource/html-parser
>>
>> It's easy to write a Visitor which extracts the links; should take about
>> ten lines of code.
>>
>>
>>
>> --
>> Brian Goetz
>> Quiotix Corporation
>> brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-8032
>>
>> http://www.quiotix.com
>>
>>
>> --
>> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>>
>>