Mailing List Archive

HNSW search with threshold
Hi!

There are some use cases where we need to find vectors with the distance
(by some metric) to the given vector V less than the given threshold T.
That task is very similar to the knn problem, but in this case we don't
have a quantity of the nearest neighbours *k*.

As I see, the current implementation of knn doesn't provide such
functionality. But at the first glance it is not very difficult to modify
the method *search* of *HnswGraph *to implement that feature (do not limit
*result* size and get rid of candidates which exceed threshold).

But maybe that idea has some not obvious problems which I haven't noticed,
and in reality an implementation of that idea would have fundamental
difficulties?
Re: HNSW search with threshold [ In reply to ]
+1 to adding a scoring threshold. I think it could be another
parameter to KnnVectorQuery. Do you want to have a try at adding this?
If so, please feel free to open a PR and I will be happy to guide you.

On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com> wrote:
>
> Hi!
>
> There are some use cases where we need to find vectors with the distance (by some metric) to the given vector V less than the given threshold T. That task is very similar to the knn problem, but in this case we don't have a quantity of the nearest neighbours k.
>
> As I see, the current implementation of knn doesn't provide such functionality. But at the first glance it is not very difficult to modify the method search of HnswGraph to implement that feature (do not limit result size and get rid of candidates which exceed threshold).
>
> But maybe that idea has some not obvious problems which I haven't noticed, and in reality an implementation of that idea would have fundamental difficulties?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: HNSW search with threshold [ In reply to ]
Thanks, Michael!
Yes, I will try.

??, 8 ????. 2022 ?. ? 03:31, Michael Sokolov <msokolov@gmail.com>:

> +1 to adding a scoring threshold. I think it could be another
> parameter to KnnVectorQuery. Do you want to have a try at adding this?
> If so, please feel free to open a PR and I will be happy to guide you.
>
> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com>
> wrote:
> >
> > Hi!
> >
> > There are some use cases where we need to find vectors with the distance
> (by some metric) to the given vector V less than the given threshold T.
> That task is very similar to the knn problem, but in this case we don't
> have a quantity of the nearest neighbours k.
> >
> > As I see, the current implementation of knn doesn't provide such
> functionality. But at the first glance it is not very difficult to modify
> the method search of HnswGraph to implement that feature (do not limit
> result size and get rid of candidates which exceed threshold).
> >
> > But maybe that idea has some not obvious problems which I haven't
> noticed, and in reality an implementation of that idea would have
> fundamental difficulties?
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: HNSW search with threshold [ In reply to ]
I wonder if it would actually be a good idea to support filtering _only_
based on distance. In the worst case scenario, this may require traversing
the whole HNSW graph and would run in linear time with the number of
vectors, with a high constant factor since we'd need to compute a distance
for every vector? I imagine that this would only make sense for low values
of the radius, so that few vectors would match, but this looks to me like
it would be hard to predict whether a given radius would actually match a
small set of vectors. Should the query still require a `k` value in
addition to the radius to make sure it doesn't go wild?

On Tue, Nov 8, 2022 at 7:26 AM Alexey Gorlenko <agorlenko1@gmail.com> wrote:

> Thanks, Michael!
> Yes, I will try.
>
> ??, 8 ????. 2022 ?. ? 03:31, Michael Sokolov <msokolov@gmail.com>:
>
>> +1 to adding a scoring threshold. I think it could be another
>> parameter to KnnVectorQuery. Do you want to have a try at adding this?
>> If so, please feel free to open a PR and I will be happy to guide you.
>>
>> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com>
>> wrote:
>> >
>> > Hi!
>> >
>> > There are some use cases where we need to find vectors with the
>> distance (by some metric) to the given vector V less than the given
>> threshold T. That task is very similar to the knn problem, but in this case
>> we don't have a quantity of the nearest neighbours k.
>> >
>> > As I see, the current implementation of knn doesn't provide such
>> functionality. But at the first glance it is not very difficult to modify
>> the method search of HnswGraph to implement that feature (do not limit
>> result size and get rid of candidates which exceed threshold).
>> >
>> > But maybe that idea has some not obvious problems which I haven't
>> noticed, and in reality an implementation of that idea would have
>> fundamental difficulties?
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

--
Adrien
Re: HNSW search with threshold [ In reply to ]
I think we can support both parameters: k and threshold. And if we need to
get all docs by the threshold, we just will set k == Integer.MAX_VALUE.

??, 10 ????. 2022 ?. ? 12:43, Adrien Grand <jpountz@gmail.com>:

> I wonder if it would actually be a good idea to support filtering _only_
> based on distance. In the worst case scenario, this may require traversing
> the whole HNSW graph and would run in linear time with the number of
> vectors, with a high constant factor since we'd need to compute a distance
> for every vector? I imagine that this would only make sense for low values
> of the radius, so that few vectors would match, but this looks to me like
> it would be hard to predict whether a given radius would actually match a
> small set of vectors. Should the query still require a `k` value in
> addition to the radius to make sure it doesn't go wild?
>
> On Tue, Nov 8, 2022 at 7:26 AM Alexey Gorlenko <agorlenko1@gmail.com>
> wrote:
>
>> Thanks, Michael!
>> Yes, I will try.
>>
>> ??, 8 ????. 2022 ?. ? 03:31, Michael Sokolov <msokolov@gmail.com>:
>>
>>> +1 to adding a scoring threshold. I think it could be another
>>> parameter to KnnVectorQuery. Do you want to have a try at adding this?
>>> If so, please feel free to open a PR and I will be happy to guide you.
>>>
>>> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com>
>>> wrote:
>>> >
>>> > Hi!
>>> >
>>> > There are some use cases where we need to find vectors with the
>>> distance (by some metric) to the given vector V less than the given
>>> threshold T. That task is very similar to the knn problem, but in this case
>>> we don't have a quantity of the nearest neighbours k.
>>> >
>>> > As I see, the current implementation of knn doesn't provide such
>>> functionality. But at the first glance it is not very difficult to modify
>>> the method search of HnswGraph to implement that feature (do not limit
>>> result size and get rid of candidates which exceed threshold).
>>> >
>>> > But maybe that idea has some not obvious problems which I haven't
>>> noticed, and in reality an implementation of that idea would have
>>> fundamental difficulties?
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>
> --
> Adrien
>
Re: HNSW search with threshold [ In reply to ]
That would work for me, though this is something that I would like to be
documented as not recommended.

On Thu, Nov 10, 2022 at 2:33 PM Alexey Gorlenko <agorlenko1@gmail.com>
wrote:

> I think we can support both parameters: k and threshold. And if we need to
> get all docs by the threshold, we just will set k == Integer.MAX_VALUE.
>
> ??, 10 ????. 2022 ?. ? 12:43, Adrien Grand <jpountz@gmail.com>:
>
>> I wonder if it would actually be a good idea to support filtering _only_
>> based on distance. In the worst case scenario, this may require traversing
>> the whole HNSW graph and would run in linear time with the number of
>> vectors, with a high constant factor since we'd need to compute a distance
>> for every vector? I imagine that this would only make sense for low values
>> of the radius, so that few vectors would match, but this looks to me like
>> it would be hard to predict whether a given radius would actually match a
>> small set of vectors. Should the query still require a `k` value in
>> addition to the radius to make sure it doesn't go wild?
>>
>> On Tue, Nov 8, 2022 at 7:26 AM Alexey Gorlenko <agorlenko1@gmail.com>
>> wrote:
>>
>>> Thanks, Michael!
>>> Yes, I will try.
>>>
>>> ??, 8 ????. 2022 ?. ? 03:31, Michael Sokolov <msokolov@gmail.com>:
>>>
>>>> +1 to adding a scoring threshold. I think it could be another
>>>> parameter to KnnVectorQuery. Do you want to have a try at adding this?
>>>> If so, please feel free to open a PR and I will be happy to guide you.
>>>>
>>>> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi!
>>>> >
>>>> > There are some use cases where we need to find vectors with the
>>>> distance (by some metric) to the given vector V less than the given
>>>> threshold T. That task is very similar to the knn problem, but in this case
>>>> we don't have a quantity of the nearest neighbours k.
>>>> >
>>>> > As I see, the current implementation of knn doesn't provide such
>>>> functionality. But at the first glance it is not very difficult to modify
>>>> the method search of HnswGraph to implement that feature (do not limit
>>>> result size and get rid of candidates which exceed threshold).
>>>> >
>>>> > But maybe that idea has some not obvious problems which I haven't
>>>> noticed, and in reality an implementation of that idea would have
>>>> fundamental difficulties?
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>
>> --
>> Adrien
>>
>

--
Adrien
Re: HNSW search with threshold [ In reply to ]
I think it's fine to warn about this, but in general large values of K
will increase cost, with or without thresholding, so this is not a new
thing to warn about

On Thu, Nov 10, 2022 at 5:50 AM Adrien Grand <jpountz@gmail.com> wrote:
>
> That would work for me, though this is something that I would like to be documented as not recommended.
>
> On Thu, Nov 10, 2022 at 2:33 PM Alexey Gorlenko <agorlenko1@gmail.com> wrote:
>>
>> I think we can support both parameters: k and threshold. And if we need to get all docs by the threshold, we just will set k == Integer.MAX_VALUE.
>>
>> ??, 10 ????. 2022 ?. ? 12:43, Adrien Grand <jpountz@gmail.com>:
>>>
>>> I wonder if it would actually be a good idea to support filtering _only_ based on distance. In the worst case scenario, this may require traversing the whole HNSW graph and would run in linear time with the number of vectors, with a high constant factor since we'd need to compute a distance for every vector? I imagine that this would only make sense for low values of the radius, so that few vectors would match, but this looks to me like it would be hard to predict whether a given radius would actually match a small set of vectors. Should the query still require a `k` value in addition to the radius to make sure it doesn't go wild?
>>>
>>> On Tue, Nov 8, 2022 at 7:26 AM Alexey Gorlenko <agorlenko1@gmail.com> wrote:
>>>>
>>>> Thanks, Michael!
>>>> Yes, I will try.
>>>>
>>>> ??, 8 ????. 2022 ?. ? 03:31, Michael Sokolov <msokolov@gmail.com>:
>>>>>
>>>>> +1 to adding a scoring threshold. I think it could be another
>>>>> parameter to KnnVectorQuery. Do you want to have a try at adding this?
>>>>> If so, please feel free to open a PR and I will be happy to guide you.
>>>>>
>>>>> On Mon, Nov 7, 2022 at 6:38 AM Alexey Gorlenko <agorlenko1@gmail.com> wrote:
>>>>> >
>>>>> > Hi!
>>>>> >
>>>>> > There are some use cases where we need to find vectors with the distance (by some metric) to the given vector V less than the given threshold T. That task is very similar to the knn problem, but in this case we don't have a quantity of the nearest neighbours k.
>>>>> >
>>>>> > As I see, the current implementation of knn doesn't provide such functionality. But at the first glance it is not very difficult to modify the method search of HnswGraph to implement that feature (do not limit result size and get rid of candidates which exceed threshold).
>>>>> >
>>>>> > But maybe that idea has some not obvious problems which I haven't noticed, and in reality an implementation of that idea would have fundamental difficulties?
>>>>> >
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>
>>>
>>> --
>>> Adrien
>
>
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org