Mailing List Archive

how to parse XHTML
Hi,

I'm new to Lucene, and I was wondering how I should parse XHTML files.
Should I name them with the .HTML file extension and use
org.apache.lucene.demo.IndexHTML or name them with the .XML file extension
and use an XML parser?

Also, I would like to keep my XHTML files with a .XHTML file extension, if
possible, but that's not so important.

Thanks,
Terry.

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: how to parse XHTML [ In reply to ]
Terry,

These are not really Lucene questions. Lucene will let you index text,
but you need to figure out how to parse your XHTML files yourself.
Take a look at JTidy on sf.net; I think JTidy can help you with parsing
XHTML, or perhaps Xerces from xml.apache.org can.

Otis
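
A minimal sketch of the XML-parser route suggested above, assuming the XHTML
is well-formed. It uses the JAXP parser bundled with the JDK rather than JTidy
or Xerces directly, and it stops short of Lucene itself: the two strings it
returns are what you would then hand to Lucene as field text. The class name
XhtmlTextExtractor and the method extract are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlTextExtractor {

    /** Parses well-formed XHTML and returns { title text, body text }. */
    public static String[] extract(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // XHTML elements live in a namespace, so parse namespace-aware.
        factory.setNamespaceAware(true);
        try {
            // Xerces-style feature name (assumed to be supported by the
            // parser in use): do not download the XHTML DTD named in a DOCTYPE.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd",
                false);
        } catch (ParserConfigurationException e) {
            // Parser does not know this feature; carry on with its defaults.
        }
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // "*" matches any namespace, so this finds <title> and <body>
        // whether or not the XHTML namespace is declared.
        String title = textOf(doc.getElementsByTagNameNS("*", "title"));
        String body = textOf(doc.getElementsByTagNameNS("*", "body"));
        return new String[] { title, body };
    }

    /** Text of the first node in the list, with whitespace collapsed. */
    private static String textOf(NodeList nodes) {
        if (nodes.getLength() == 0) {
            return "";
        }
        return nodes.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "  <head><title>Sample page</title></head>\n"
            + "  <body>\n"
            + "    <h1>Hello</h1>\n"
            + "    <p>Lucene indexes <em>plain</em> text.</p>\n"
            + "  </body>\n"
            + "</html>\n";
        String[] fields = extract(sample);
        System.out.println("title: " + fields[0]);
        System.out.println("body:  " + fields[1]);
    }
}
```

Because the parser reads content rather than file names, the files can keep
their .XHTML extension. Input that is not well-formed will make the parser
throw, which is where a cleanup tool such as JTidy would come in.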

--- Terry McGregor <trmcgregor@hotmail.com> wrote:
>
> Hi,
>
> I'm new to Lucene, and I was wondering how I should parse XHTML
> files.
> Should I name them with the .HTML file extension and use
> org.apache.lucene.demo.IndexHTML or name them with the .XML file
> extension
> and use an XML parser?
>
> Also, I would like to keep my XHTML files with a .XHTML file
> extension, if
> possible, but that's not so important.
>
> Thanks,
> Terry.


Re: how to parse XHTML [ In reply to ]
Terry,

Check out the Contributions section of the Lucene site. It lists a few XML
document parsers.

--Peter

On 3/5/02 9:08 PM, "Otis Gospodnetic" <otis_gospodnetic@yahoo.com> wrote:

> Terry,
>
> These are not really Lucene questions. Lucene will let you index text,
> but you need to figure out how to parse your XHTML files yourself.
> Take a look at JTidy on sf.net; I think JTidy can help you with parsing
> XHTML, or perhaps Xerces from xml.apache.org can.
>
> Otis