Mailing List Archive

How to explain Lucene's ranking algorithm to someone who is not technical?
Hi everyone,

If you were asked to explain how Lucene's ranking algorithm works to someone
who is not technical and doesn't understand math, how would you go about it?

I'm going to list what I see as key points to use but please correct me
where correction is needed and do add where addition is needed. Here are
the talking points I can think of.

Search terms are: "to be or not to be, that is the question". The examples
below are simple term searches (no booleans, no phrases, no fields, etc.).

1) Documents that contain all or most of the search terms are ranked
highest.
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... to be, that is the question ... ...
hit #3: ... ... is the question ... ...

2) Documents that contain all or most of the search terms more often than
other documents are ranked higher.
hit #1: ... ... to be or not to be, that is the question and is still the
question ... ...
hit #2: ... ... to be or not to be, that is the question ... ...
hit #3: ... ... to be, that is the question ... ...

3) Documents that contain the search terms closer to each other are ranked
higher.
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... to be or not to be, is what is being asked, that is the
question ... ...
hit #3: ... ... is the question ... ...

4) Among documents that contain the same search terms the same number of
times, the shorter document is ranked higher.
hit #1: to be or not to be, that is the question
hit #2: ... ... to be or not to be, that is the question ... ...

5) Documents that contain more of the complex / longer terms are ranked
higher than those containing more of the lighter terms.
hit #1: ... ... to be or not to be, that is the question and is still the
question to question ... ...
hit #2: ... ... to be or not to be and to be or not to be, and to be or not
to be, that is the question ... ...

6) Documents that contain the search terms in the same order as the query
are ranked higher:
hit #1: ... ... to be or not to be, that is the question ... ...
hit #2: ... ... question the that is be not to be or be ... ...

I think I got all of the above right (I'm not sure about #6).

Thanks

Steven
Re: How to explain Lucene's ranking algorithm to someone who is not technical?
1. This isn't true. Your query has 10 terms. A document that matches all 10
terms poorly will rank lower than a document that matches 9 of the 10 terms
very well. However, having more matches usually correlates with better
scores: the final score of a boolean query is the sum of the scores of all
the term queries it wraps, so each additional match adds to the score.

2. Assuming that all documents have the same length, this is correct.

3. This is incorrect. Proximity is ignored by default. You can boost based
on proximity data by adding optional phrase clauses, but this is not the
default behavior. And I can't find the paper, but I remember reading one
that said that contrary to intuition, leveraging proximity didn't actually
improve relevancy.

4. This is correct.

5. It would be correct if you replaced "complex / longer" with "rarest".

6. This is incorrect: term positions, and therefore order and proximity,
are not taken into account for boolean queries.
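
In case a concrete illustration helps with points 2, 4, 5 and 6, here is a
rough sketch of a BM25-style scorer in plain Python. It is not Lucene code:
the names tokenize and bm25_scores, the sample documents, and the constants
k1 = 1.2 and b = 0.75 are assumptions made for this example, and Lucene's
real analyzers and scorers differ in their details. It only shows the
general shape: the score is a sum over query terms, rarer terms weigh more,
longer documents are penalized, and word order plays no part.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumed BM25 parameters, chosen for illustration


def tokenize(text):
    # Lowercase and split on anything that is not a letter.
    # A stand-in for a real analyzer, not Lucene's tokenizer.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return cleaned.split()


def bm25_scores(query, docs):
    """Return one score per document: the sum of per-term scores."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        # Length normalization: longer documents get a larger penalty.
        norm = K1 * (1 - B + B * len(doc) / avg_len)
        score = 0.0
        for term in tokenize(query):  # each query term is its own clause
            if tf[term] == 0:
                continue  # a missing term simply adds nothing
            # Rarer terms (low df) get a higher weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


# Points 4 and 6: same words, different order or different length.
query = "to be or not to be, that is the question"
docs = [
    "to be or not to be that is the question",
    "question the that is be not to be or to",
    "to be or not to be that is the question alas poor yorick i knew him horatio",
]
exact, shuffled, padded = bm25_scores(query, docs)
print(exact == shuffled)  # True: word order is ignored
print(exact > padded)     # True: the shorter document wins

# Point 5: one match on a rare term can beat three matches on a common one.
docs2 = [
    "the cat and the dog and the bird",
    "a cat asked one question about birds today",
    "the dog chased the cat",
    "the bird sang",
]
common_hit, rare_hit, _, _ = bm25_scores("the question", docs2)
print(rare_hit > common_hit)  # True: "question" is rarer than "the"
```

The first two documents contain exactly the same words the same number of
times, so a bag-of-words scorer like this one cannot tell them apart. To
reward order or proximity you would have to add something extra, such as
the optional phrase clauses mentioned in point 3.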

On Mon, Apr 19, 2021 at 1:10 AM Steven White <swhite4141@gmail.com> wrote:

> Hi everyone,
>
> If you are asked to explain how Lucene's algorithm works, to someone who is
> not technical and doesn't understand math, how do you go about doing so?
>
> I'm going to list what I see as key points to use but please correct me
> where correction is needed and do add where addition is needed. Here are
> the talking points I can think of.
>
> Search terms are: "to be or not to be, that is the question" The examples
> below or simple term search (no booleans, no phrase, no fields, etc.)
>
> 1) Documents that contain all or most of the search terms are ranked
> highest.
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... to be, that is the question ... ...
> hit #3: ... ... is the question ... ...
>
> 2) Documents that contain all or most of the search terms, more often than
> other documents are ranked higher.
> hit #1: ... ... to be or not to be, that is the question and is still the
> question ... ...
> hit #2: ... ... to be or not to be, that is the question ... ...
> hit #3: ... ... to be, that is the question ... ...
>
> 3) Documents that contain the search terms closer to each other are ranked
> higher
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... to be or not to be, is what being asked, that is the
> question ... ...
> hit #3: ... ... is the question ... ...
>
> 4) Documents that contain the exact search terms, including number of times
> search terms occur, the smaller document is ranked higher
> hit #1: to be or not to be, that is the question
> hit #2: ... ... to be or not to be, that is the question ... ...
>
> 5) Documents that contain more of the complex / longer terms are ranked
> higher than those containing more of the lighter terms.
> hit #1: ... ... to be or not to be, that is the question and is still the
> question to question ... ...
> hit #2: ... ... to be or not to be and to be or not to be, and to be or not
> to be, that is the question ... ...
>
> 6) Documents that contain search terms, match the order, are ranked higher:
> hit #1: ... ... to be or not to be, that is the question ... ...
> hit #2: ... ... question the that is be not to be or be ... ...
>
> I think I get all the above right (I'm not sure about #6).
>
> Thanks
>
> Steven
>


--
Adrien