Mailing List Archive

RE: character filter issue/tokenizing host names
> -----Original Message-----
> From: Oshima, Scott [mailto:soshima@business.com]
> Sent: Tuesday, January 08, 2002 11:53 PM
> To: 'Lucene Users List'
> Subject: character filter issue
>
>
> Suppose we have one field with one string abc-xxx.com
>
> When I query for abc-xxx.com it returns 0 hits.
>
> BUT when i query for something like xxx.com it returns results fine.
>
> not sure what lucene is doing with the dashes. i am using the default
> standardfilter, lowercasefilter, stopfilter and porterstemfilter.
>
> Does anyone know how to get around this?

>
> thanks.
>
> -scott
>

I don't, but I would suggest that you not analyze (tokenize) host names
at all. Apologies if that was not your intent and abc-xxx.com was only
an example.

If you analyze host names with the standard filters, abc.xxx.com and
abc-xxx.com are indexed with the same terms. I think the best approach
is to use three fields:
1. site (host) name: jakarta.apache.org
2. domain name (maybe you don't need it): apache.org
3. url: jakarta.apache.org/lucene/docs/index.html

The site name and domain name fields are not tokenized (and you can
store a hash of the value to save space), but the url field is tokenized.
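
A minimal sketch of the three fields, in plain Python rather than Lucene
code. The function name build_fields, the "last two labels" rule for the
domain, and the simple splitter are all assumptions made for illustration;
the splitter is only a stand-in and is not how Lucene's standard filters
actually behave.

```python
import re

def build_fields(url):
    """Split a scheme-less URL into site, domain and tokenized url terms."""
    # Site (host) name: everything before the first slash, kept as one value.
    host = url.split("/", 1)[0].lower()
    # Assumption: the domain is the last two labels of the host.
    # This is wrong for suffixes such as co.uk; a real system needs a suffix list.
    domain = ".".join(host.split(".")[-2:])
    # Only the url field is tokenized: split on anything that is not a letter or digit.
    url_terms = [t for t in re.split(r"[^0-9a-z]+", url.lower()) if t]
    return {"site": host, "domain": domain, "url_terms": url_terms}

fields = build_fields("jakarta.apache.org/lucene/docs/index.html")

# With this splitter the dotted and dashed hosts produce identical terms,
# while the untokenized site values stay distinct.
dotted = build_fields("abc.xxx.com")
dashed = build_fields("abc-xxx.com")
same_terms = dotted["url_terms"] == dashed["url_terms"]
distinct_sites = dotted["site"] != dashed["site"]
```

An exact match on the site value can then tell abc.xxx.com and abc-xxx.com
apart, which the tokenized terms alone cannot do.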

Why? We run a search engine (not Lucene based ;) that has only a
tokenized url field. We can't restrict a search to documents from the
index.hu domain (a Hungarian portal), because we also get back a lot of
documents with URLs such as www.something.com/locales/hu/index.html,
index_hu.html, or www.hu.com/index.
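
The problem can be reproduced with a small sketch. The tokenizer below is
an assumed stand-in (split on non-alphanumerics), not our engine's real
one, and the path on index.hu and the host www.example.com are made up to
complete the example URLs.

```python
import re

def tokenize(text):
    """Return the set of lowercase alphanumeric terms in the text."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}

def host_of(url):
    """Return the untokenized host part of a scheme-less URL."""
    return url.split("/", 1)[0].lower()

urls = [
    "index.hu/news",                            # hypothetical path
    "www.something.com/locales/hu/index.html",
    "www.example.com/index_hu.html",            # hypothetical host
    "www.hu.com/index",
]

# Searching the tokenized url field: every URL containing both terms matches.
query_terms = tokenize("index.hu")
by_terms = [u for u in urls if query_terms <= tokenize(u)]

# Filtering on an untokenized host field: only the real index.hu pages match.
by_host = [u for u in urls if host_of(u) == "index.hu"]
```

Here by_terms contains all four URLs, while by_host contains only the one
that is really served from index.hu.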

On Google you can filter by site name.
On Northern Light you can filter by words in the URL.

peter