Mailing List Archive

Using lucene more effectively
Hi,

I have implemented search on my website using Lucene. However, I still have
a few questions -- mostly because of the flexible nature of Lucene. I would
like to learn from users who have some real world experience.

My situation: English documents. Three searchable 'fields' per document --
reference id, short description and content.

* Analyzer
-----------
I am using a combination of the Standard tokenizer with the LowerCase, Stop
and PorterStem filters. Is this the preferred combo or is there anything
better?

* Query
-----------
Right now I am directly using the query parser. However, I am uncertain as
to whether this is the best approach, especially with the myriad of Query
classes (Fuzzy, Wildcard, etc.). I would like to set up 'internal' boost
factors, i.e. maybe description is twice as important as content, etc., but
I don't want users to enter the boost factors in the query itself. Any
experience shared will be greatly appreciated.

BTW, what is a Fuzzy Query?

* Ranking
------------
I've read the FAQ on generating the "stars" but am still a bit confused. For
example, searching a 2 page document that contains the word 'email' about 7
or 8 times gives a score of 0.07. Now I would've thought that this is a 4
star at least (if not a 5) kind of search. In fact, I rarely get a 0.8+
score. I am aware that the score depends on the total number of words as
well, and that makes it even more confusing how to design a 'starring'
strategy.

* General
------------
* Has anybody implemented aliasing yet? If yes, can you please point me to
it?

* My search is going to be used by businesses, and I read some articles that
said that organizing search results by "topics" is strongly preferred.
E.g. if the word 'penguin' is searched, then it would be a bit useless to
show sites about arctic life, Linux, penguin research, the hockey team,
etc. all on the same page. Better would be to organize these results by
topic. This of course is a very simple example that can be sorted out by
adding more keywords, but in business scenarios this may not be easy. Has
anybody ventured into this? Any pointers will be useful.

Thanks much in advance
-Nikhil
AW: Using lucene more effectively
>I've read the FAQ on generating the "stars" but am still a bit
>confused. For example, searching a 2 page document that
>contains the word 'email' about 7 or 8 times gives a score of
>0.07. Now I would've thought that this is a 4 star at least
>(if not a 5) kind of search. In fact, I rarely get a 0.8+
>score. I am aware that the score depends on the total number
>of words as well, and that makes it even more confusing how to
>design a 'starring' strategy.

One way to have more meaningful "stars" would be to use a scale factor
that will be applied to the scores before you decide how many stars to
display.

As an example, all scores could be scaled such that the document with
the highest score will get 100% (and thus gets 5 stars):


float scaleFactor = hits.score(0);
float currentDocumentScore;

for (int i = 0; i < hits.length(); i++)
{
    // scale the raw score relative to the best hit
    currentDocumentScore = hits.score(i) * scaleFactor;

    // calculate number of "stars" here

    // do something with hits.doc(i)
}

--
Maik Schreiber
IQ Computing - http://www.iq-computing.de
mailto: info@iq-computing.de
AW: Using lucene more effectively
> float scaleFactor = hits.score(0);

Sorry. That should of course read:

float scaleFactor = 100f / hits.score(0);
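
For the "calculate number of stars" step, here is a minimal, self-contained
sketch of one possible mapping. It works on a plain float array standing in
for the values of hits.score(i), and it assumes the scores are sorted with
the best hit first. The class name StarRating, the method names, and the
choice of 20 percentage points per star are only illustrative assumptions,
not anything Lucene provides.

```java
// Illustrative sketch only: StarRating and its methods are hypothetical
// helpers, not part of Lucene. The array stands in for hits.score(i).
class StarRating {

    // Scale raw scores so that the best (first) hit becomes 100.
    // Assumes rawScores is sorted in descending order.
    static float[] normalize(float[] rawScores) {
        float[] scaled = new float[rawScores.length];
        if (rawScores.length == 0 || rawScores[0] <= 0f) {
            return scaled; // nothing to scale; avoids division by zero
        }
        float scaleFactor = 100f / rawScores[0];
        for (int i = 0; i < rawScores.length; i++) {
            scaled[i] = rawScores[i] * scaleFactor;
        }
        return scaled;
    }

    // Map a percentage (0..100) to 1..5 stars, one star per 20 points.
    // The result is clamped so rounding noise never yields 0 or 6 stars.
    static int stars(float percent) {
        int stars = (int) Math.ceil(percent / 20f);
        return Math.max(1, Math.min(5, stars));
    }

    public static void main(String[] args) {
        float[] raw = {0.07f, 0.035f, 0.007f}; // made-up example scores
        float[] scaled = normalize(raw);
        for (int i = 0; i < scaled.length; i++) {
            System.out.println(raw[i] + " -> " + stars(scaled[i]) + " star(s)");
        }
    }
}
```

With this mapping the best hit always gets 5 stars, no matter how small its
raw score is, so the stars only say how a hit compares to the other hits of
the same query. Whether that is what you want to show to users is a design
decision.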

--
Maik Schreiber
IQ Computing - http://www.iq-computing.de
mailto: info@iq-computing.de