Mailing List Archive

Html documents parsing
I am confused about how Lucene parses an HTML document. It doesn't do any
tag stripping (or does it?), so does that mean it also indexes all the HTML
tags? If so, a search for "body" will return every HTML document previously
indexed.
I'd appreciate it if anyone could shed some light on FAQ.10 about indexing.


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Html documents parsing
This question is similar to the Meta-Tags question posted to the list
earlier today (or was it yesterday?). The Lucene distribution includes a
few simple applications that demonstrate what Lucene is capable of and
how it can be used, but those demos are not really part of the Lucene
code itself.
It is up to you to write the application around Lucene, which in your
case would include parsing the HTML and handing Lucene only the text you
want indexed.

Perhaps JTidy (http://jtidy.sf.net/) could come in handy here...
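
To illustrate the kind of pre-processing meant here, below is a minimal
sketch of stripping the markup before indexing. It is not taken from the
Lucene demos and it does not use JTidy; it swaps in the HTML parser that
ships with the JDK (javax.swing.text.html.parser.ParserDelegator) so that it
runs with no extra jars. The class name HtmlTextExtractor and the method
extractText are hypothetical names chosen for this example. The returned
string is what you would then put into the Lucene field you index, so that
tag names such as "body" never reach the index.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; not part of Lucene or JTidy.
final class HtmlTextExtractor {

    /** Returns the character data of an HTML document, dropping the tags. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of text through handleText();
        // start and end tags go to other callbacks, which we leave empty.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <b>Lucene</b> users</p></body></html>";
        // Only this extracted text would be handed to the indexer.
        System.out.println(extractText(html));
    }
}
```

A real application would probably keep the title and the body text in
separate fields, which means tracking the current tag in handleStartTag and
handleEndTag as well; that is left out here to keep the sketch short.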

Otis


--- Emmanuel Bridonneau <EBridonneau@epicentric.com> wrote:
> I am confused about how Lucene parses an HTML document. It doesn't do
> any tag stripping (or does it?), so does that mean it also indexes all
> the HTML tags? If so, a search for "body" will return every HTML
> document previously indexed.
> I'd appreciate it if anyone could shed some light on FAQ.10 about
> indexing.

