Hi Uwe
Thanks again for your feedback; I got it working now :-)
I am using a simplified version, which I am posting below so that it
might help others, provided this implementation makes sense.
By the way, when a new version of Lucene is released, how do I best find
out whether "Lucene95Codec" is still the most recent default codec or
whether there is a new default codec?
Thanks
Michael
---
@Autowired private LuceneCodecFactory luceneCodecFactory;
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(luceneCodecFactory.getCodec());
----
package com.erkigsnek.webapp.services;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;
import java.io.IOException;
@Slf4j
@Component
public class LuceneCodecFactory {
    private final int maxDimensions = 16384;

    /**
     * Returns a Lucene95Codec whose KNN vectors format reports maxDimensions as its limit.
     */
    public Codec getCodec() {
        //return Lucene95Codec.getDefault();
        log.info("Get codec ...");
        Codec codec = new Lucene95Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                var delegate = new Lucene95HnswVectorsFormat();
                log.info("Maximum Vector Dimension: " + maxDimensions);
                return new DelegatingKnnVectorsFormat(delegate, maxDimensions);
            }
        };
        return codec;
    }
}
/**
 * This class exists because Lucene95HnswVectorsFormat's getMaxDimensions method is final and we
 * need to work around that constraint to allow more than the default number of dimensions.
 */
@Slf4j
class DelegatingKnnVectorsFormat extends KnnVectorsFormat {
    private final KnnVectorsFormat delegate;
    private final int maxDimensions;

    public DelegatingKnnVectorsFormat(KnnVectorsFormat delegate, int maxDimensions) {
        // Reuse the delegate's name so segments are read back with the wrapped format
        super(delegate.getName());
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
    }

    @Override
    public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        log.info("Maximum vector dimension: " + maxDimensions);
        return maxDimensions;
    }
}
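----
For readers without a Lucene setup at hand, the delegation idea above can be
sketched in plain Java. Everything in this sketch is a hypothetical stand-in
(SketchVectorsFormat, SketchHnswFormat, SketchDelegatingFormat,
DimensionLimitSketch); none of these are Lucene classes, and the check only
imitates, under my own assumptions, how a writer might ask the format for its
limit. It is meant to show why wrapping helps when the stock class is final,
not how Lucene implements it.

```java
import java.util.Objects;

// Hypothetical stand-in for a vectors format; NOT a Lucene class.
abstract class SketchVectorsFormat {
    private final String name;

    protected SketchVectorsFormat(String name) {
        this.name = Objects.requireNonNull(name);
    }

    String getName() {
        return name;
    }

    /** Upper bound on vector dimensions this format accepts for a field. */
    abstract int getMaxDimensions(String fieldName);

    /** Stands in for the reader/writer plumbing that a wrapper forwards. */
    abstract String writerDescription();
}

// Plays the role of the stock format: final, so its limit cannot be changed by subclassing.
final class SketchHnswFormat extends SketchVectorsFormat {
    SketchHnswFormat() {
        super("SketchHnsw");
    }

    @Override
    int getMaxDimensions(String fieldName) {
        return 1024;
    }

    @Override
    String writerDescription() {
        return "hnsw-writer";
    }
}

// Wraps any format, keeps its name, forwards the plumbing, and reports its own limit.
final class SketchDelegatingFormat extends SketchVectorsFormat {
    private final SketchVectorsFormat delegate;
    private final int maxDimensions;

    SketchDelegatingFormat(SketchVectorsFormat delegate, int maxDimensions) {
        super(delegate.getName());
        this.delegate = delegate;
        this.maxDimensions = maxDimensions;
    }

    @Override
    int getMaxDimensions(String fieldName) {
        return maxDimensions;
    }

    @Override
    String writerDescription() {
        return delegate.writerDescription();
    }
}

class DimensionLimitSketch {
    // Assumed writer-side check: ask the format for its limit, then validate the vector.
    static void checkDimensions(SketchVectorsFormat format, String field, float[] vector) {
        int max = format.getMaxDimensions(field);
        if (vector.length > max) {
            throw new IllegalArgumentException(
                "Field [" + field + "] vector's dimensions must be <= [" + max
                    + "]; got " + vector.length);
        }
    }

    public static void main(String[] args) {
        float[] vector = new float[1536];
        SketchVectorsFormat stock = new SketchHnswFormat();
        try {
            checkDimensions(stock, "vector", vector);
        } catch (IllegalArgumentException e) {
            System.out.println("stock format rejected: " + e.getMessage());
        }
        // The wrapper raises the limit without touching the final class.
        checkDimensions(new SketchDelegatingFormat(stock, 16384), "vector", vector);
        System.out.println("wrapped format accepted " + vector.length + " dimensions");
    }
}
```

The wrapper keeps the delegate's name and forwards everything except the
limit, which mirrors what DelegatingKnnVectorsFormat does above with
fieldsWriter and fieldsReader.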
On 19.10.23 at 11:23, Uwe Schindler wrote:
> Hi Michael,
>
> The max vector dimension limit is no longer checked in the field type,
> as it is the responsibility of the codec to enforce it.
>
> You need to build your own codec that returns a different setting so
> it can be enforced by IndexWriter. See Apache Solr's code for how to wrap
> the existing KnnVectorsFormat so it returns another limit:
> <https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L159-L183>
>
>
> Basically you need to subclass Lucene95Codec as is done here:
> <https://github.com/apache/solr/blob/6d50c592fb0b7e0ea2e52ecf1cde7e882e1d0d0a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L99-L146>
> and return a different vectors format, such as a delegator as described
> before.
>
> The responsibility was shifted to the codec, because there may be
> better alternatives to HNSW that have different limits especially with
> regard to performance during merging and query response times, e.g.
> BKD trees.
>
> Uwe
>
> On 19.10.2023 at 10:53, Michael Wechner wrote:
>> I forgot to mention that using the custom FieldType with a vector
>> dimension of 1536 does work with Lucene 9.7.0
>>
>> Thanks
>>
>> Michael
>>
>>
>>
>> On 19.10.23 at 10:39, Michael Wechner wrote:
>>> Hi
>>>
>>> I recently upgraded Lucene to 9.8.0 and was running tests with
>>> OpenAI's embedding model, which has the vector dimension 1536 and
>>> received the following error
>>>
>>> Field[vector]vector's dimensions must be <= [1024]; got 1536
>>>
>>> whereas this worked previously with the hack to override the vector
>>> dimension using a custom FieldType
>>>
>>> float[] vector = ...
>>> FieldType vectorFieldType = new CustomVectorFieldType(vector.length,
>>> VectorSimilarityFunction.COSINE);
>>>
>>> and setting
>>>
>>> KnnFloatVectorField vectorField = new
>>> KnnFloatVectorField("VECTOR_FIELD", vector, vectorFieldType);
>>>
>>> But this does not seem to work anymore with Lucene 9.8.0
>>>
>>> Is this hack now prevented by the Lucene code itself, or any idea
>>> how to make this work again?
>>>
>>> Whatever one thinks of OpenAI, the embedding model
>>> "text-embedding-ada-002" is really good, and it is sad that one
>>> cannot use it with Lucene because of the 1024-dimension restriction.
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>