Raising the Value of MAX_DIMENSIONS of Vector Values

Hi Lucene Team,

In general, I have advised very strongly against our team at MongoDB
modifying the Lucene source, except in scenarios where we have strong needs
for a particular customization. Ultimately, people can do what they would
like to do.

That being said, we have a number of customers preparing to use Lucene for
dense vector search. Many language models produce embeddings with more than
1024 dimensions. I remember Michael Wechner's email
<https://www.mail-archive.com/dev@lucene.apache.org/msg314281.html> about
one instance with OpenAI:

> I just tried to test the OpenAI model
> "text-similarity-davinci-001" with 12288 dimensions

It seems that customers who attempt to use these models should not be
turned away. It could be sufficient to explain the issues. The only ones I
have identified are two expected ones, very slow indexing throughput and
high CPU usage, plus a less well-defined risk of increased numerical error.
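
For concreteness, here is a minimal sketch of what a user hits today with a
12288-dimension embedding. It assumes Lucene 9.x, where the cap lives in
VectorValues.MAX_DIMENSIONS; the field name and similarity function below
are arbitrary choices for illustration:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.KnnVectorField;
  import org.apache.lucene.index.VectorSimilarityFunction;

  public class DimensionLimitDemo {
    public static void main(String[] args) {
      // A 12288-dimension embedding, as text-similarity-davinci-001 produces.
      float[] embedding = new float[12288];

      Document doc = new Document();
      // Throws IllegalArgumentException before indexing even starts: the
      // field type rejects any dimension above VectorValues.MAX_DIMENSIONS.
      doc.add(new KnnVectorField("embedding", embedding,
          VectorSimilarityFunction.DOT_PRODUCT));
    }
  }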

I opened an issue <https://github.com/apache/lucene/issues/1060> and PR
<https://github.com/apache/lucene/pull/1061> for the discussion as well. I
would appreciate guidance on where we think the warning should go. I feel
like burying it in a Javadoc is a less than ideal experience; it would be
better to warn on startup. In the PR, I increased the max limit by a
factor of twenty. We should let users use the system based on their needs,
even if it was not designed or optimized for the models they bring, because
we need the feedback and the data from the world.
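
For reference, the substance of the change is roughly a one-line bump of
the constant. This is only an illustrative sketch; the exact value and
location should be taken from the PR itself:

  // In org.apache.lucene.index.VectorValues:
  public static final int MAX_DIMENSIONS = 20480; // was 1024; raised 20x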

Is there something I'm overlooking from a risk standpoint?

Best,
--
Marcus Eagan