
Raising the Value of MAX_DIMENSIONS of Vector Values
Hi Lucene Team,

In general, I have advised very strongly against our team at MongoDB
modifying the Lucene source, except in scenarios where we have strong needs
for a particular customization. Ultimately, people can do what they would
like to do.

That being said, we have a number of customers preparing to use Lucene for
dense vector search. There are many language models that produce
embeddings with more than 1024 dimensions. I remember Michael Wechner's
email <https://www.mail-archive.com/dev@lucene.apache.org/msg314281.html>
about one instance with OpenAI.

> I just tried to test the OpenAI model
> "text-similarity-davinci-001" with 12288 dimensions


It seems that customers who attempt to use these models should not be
turned away; it could be sufficient to explain the issues. The only ones I
have identified are two expected ones, very slow indexing throughput and
high CPU usage, plus a less well-defined risk of more numerical error.
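
On the numerical-error point: Lucene stores vectors and computes
similarities in single precision, and rounding error in a float dot
product grows with the number of terms summed. A quick, Lucene-independent
illustration of the trend (plain Java, assuming naive float accumulation
over random inputs):

import java.util.Random;

public class DotProductError {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    for (int dim : new int[] {256, 1024, 12288}) {
      float[] a = new float[dim];
      float[] b = new float[dim];
      for (int i = 0; i < dim; i++) {
        a[i] = rnd.nextFloat();
        b[i] = rnd.nextFloat();
      }
      float f = 0f;   // single-precision accumulation
      double d = 0d;  // double-precision reference
      for (int i = 0; i < dim; i++) {
        f += a[i] * b[i];
        d += (double) a[i] * b[i];
      }
      // The gap between the two sums widens as dimension grows.
      System.out.printf("dim=%5d  |float - double| = %.3e%n",
          dim, Math.abs(f - d));
    }
  }
}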

I opened an issue <https://github.com/apache/lucene/issues/1060> and a PR
<https://github.com/apache/lucene/pull/1061> for the discussion as well. I
would appreciate guidance on where we think the warning should go. I feel
like burying it in a Javadoc is a less-than-ideal experience; a warning at
startup would be better. In the PR, I increased the max limit by a factor
of twenty. We should let users use the system based on their needs, even
if it was not designed or optimized for the models they bring, because we
need the feedback and the data from the real world.
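
To make the startup-warning idea concrete, here is a purely hypothetical
guard; none of these names exist in Lucene, and the PR itself only raises
the constant. It illustrates accepting larger vectors while logging a
warning whenever a field exceeds the old default:

import java.util.logging.Logger;

public class VectorDimensionGuard {
  private static final Logger LOG =
      Logger.getLogger(VectorDimensionGuard.class.getName());

  static final int PREVIOUS_DEFAULT = 1024; // today's hard limit
  static final int RAISED_MAX = 20 * 1024;  // the 20x bump from the PR

  // Hypothetical check, invented for illustration: reject only truly
  // enormous vectors, and warn (rather than refuse) above the old default.
  static void checkDimension(int dimension) {
    if (dimension > RAISED_MAX) {
      throw new IllegalArgumentException(
          "vector dimension " + dimension + " exceeds maximum " + RAISED_MAX);
    }
    if (dimension > PREVIOUS_DEFAULT) {
      LOG.warning("Indexing " + dimension + "-dimensional vectors; expect "
          + "slower indexing, higher CPU usage, and more accumulated "
          + "floating-point error.");
    }
  }
}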

Is there something I'm overlooking from a risk standpoint?

Best,
--
Marcus Eagan
Re: Raising the Value of MAX_DIMENSIONS of Vector Values
I agree that Lucene should support vector sizes that depend on the model
one is choosing.

For example, Weaviate seems to do this:

https://weaviate.slack.com/archives/C017EG2SL3H/p1659981294040479

Thanks

Michael


Re: Raising the Value of MAX_DIMENSIONS of Vector Values
Thank you, Marcus, for raising this; it's an important topic! On the issue
you filed, Mike pointed to the JIRA ticket where we've been discussing this
(https://issues.apache.org/jira/browse/LUCENE-10471) and suggested
commenting with the embedding models you've heard about from users. This
seems like a good idea to me too -- looking forward to discussing more on
that JIRA issue. (Unless we get caught in the middle of the migration --
then we'll discuss once it's been moved to GitHub!)

Julie

Re: Raising the Value of MAX_DIMENSIONS of Vector Values
Hi all,

Any news on the MAX_DIMENSIONS discussion?

https://github.com/apache/lucene/issues/11507

I just implemented Cohere.ai embeddings. Cohere offers the following
embedding sizes (see the quick comparison after the list):

small: 1024
medium: 2048
large: 4096
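
As a quick, hypothetical comparison against the current 1024 cap (the
sizes are just the ones listed above), only the "small" model fits, and
exactly at the limit:

import java.util.LinkedHashMap;
import java.util.Map;

public class CohereDims {
  // Lucene's current limit on indexed vector dimensions.
  static final int MAX_DIMENSIONS = 1024;

  public static void main(String[] args) {
    Map<String, Integer> models = new LinkedHashMap<>();
    models.put("small", 1024);
    models.put("medium", 2048);
    models.put("large", 4096);
    models.forEach((name, dims) ->
        System.out.printf("%-6s %4d dims -> %s%n", name, dims,
            dims <= MAX_DIMENSIONS ? "fits under the limit" : "rejected"));
  }
}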

Cohere also has a nice demo, described at

https://txt.cohere.ai/building-a-search-based-discord-bot-with-language-models/

though I am not sure which model they are using for the demo.

Thanks

Michael

