Mailing List Archive

Connecting Lucene with ChatGPT Retrieval Plugin
Hi all,

I recently set up the ChatGPT retrieval plugin locally:

https://github.com/openai/chatgpt-retrieval-plugin

I think it would be nice to consider submitting a Lucene implementation
for this plugin:

https://github.com/openai/chatgpt-retrieval-plugin#future-directions

By default the plugin uses OpenAI's model "text-embedding-ada-002"
with 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

which means one won't be able to use it out-of-the-box with Lucene.

There is a similar request here:

https://learn.microsoft.com/en-us/answers/questions/1192796/open-ai-text-embedding-dimensions

I understand we just recently had a lengthy discussion about increasing
the max dimension, and whatever one thinks of OpenAI, the fact is that it
has a huge impact, and I think it would be nice if Lucene could be part
of this "revolution". All we have to do is increase the limit from 1024
to 1536, or even 2048, for example.
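
To illustrate, with the current limit something like the following fails
at index time. This is just a sketch; the field name and similarity
function are arbitrary:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    // text-embedding-ada-002 returns 1536-dimensional embeddings
    float[] embedding = new float[1536];

    Document doc = new Document();
    // throws IllegalArgumentException today, because 1536 exceeds the
    // hard-coded maximum of 1024
    doc.add(new KnnFloatVectorField("embedding", embedding,
        VectorSimilarityFunction.COSINE));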

Since the performance seems to be linear with the vector dimension,
several members have done performance tests successfully, and 1024 seems
to have been chosen as the max dimension quite arbitrarily in the first
place, I think it should not be a problem to increase the max dimension
by a factor of 1.5 or 2.

WDYT?

Thanks

Michael



Re: Connecting Lucene with ChatGPT Retrieval Plugin
Hello Michael,

I agree. I think it makes sense to support OpenAI embeddings.

Best,
Christian


Re: Connecting Lucene with ChatGPT Retrieval Plugin
There is already a pull request for Elasticsearch which also
mentions the max size of 1024:

https://github.com/openai/chatgpt-retrieval-plugin/pull/83



Re: Connecting Lucene with ChatGPT Retrieval Plugin
The PR mentions an Elasticsearch PR
<https://github.com/elastic/elasticsearch/pull/95257> that increased the
dim to 2048 in Elasticsearch.

Curious how you use Lucene's KNN search. Lucene's KNN supports one vector
per document, but usually multiple/many vectors are needed for a
document's content, so we would have to split the document content into
chunks and create one Lucene document per document chunk.
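
Something like this sketch is what I have in mind (field names are made
up, and the embeddings are assumed to be computed beforehand, e.g. by
calling the OpenAI embeddings endpoint per chunk):

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.VectorSimilarityFunction;

    // one Lucene document per chunk of the source document
    void indexChunks(IndexWriter writer, String documentId,
                     List<String> chunks, List<float[]> embeddings)
            throws IOException {
        for (int i = 0; i < chunks.size(); i++) {
            Document doc = new Document();
            doc.add(new StringField("document_id", documentId,
                Field.Store.YES));
            doc.add(new StringField("chunk_id", documentId + "_" + i,
                Field.Store.YES));
            doc.add(new KnnFloatVectorField("embedding", embeddings.get(i),
                VectorSimilarityFunction.COSINE));
            writer.addDocument(doc);
        }
    }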

The ChatGPT plugin directly stores the chunk text in the underlying
vector db. If there are lots of documents, will it be a concern to store
the full document content in Lucene? In the traditional inverted-index
use case, is it common to store the full document content in Lucene?

Another question: if you use Lucene as a vector db, do you still need the
inverted index? I wonder what the use case would be for using an inverted
index together with a vector index. If we don't need the inverted index,
would it be better to use other vector dbs? For example, PostgreSQL also
added vector support recently.

Thanks,
Jun

Re: Connecting Lucene with ChatGPT Retrieval Plugin
I tried my best in the previous thread to set out a plan of action to
decide what should be done with that limit, and I tried to summarise the
possible next steps multiple times, but the discussion steered into other
directions (fierce opposition, benchmarking, etc.).

I created a new thread,
"Dimensions Limit for KNN vectors - Next Steps",
just to collect the possible options we have and then vote.

Let's see how it goes.

Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benedetti@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


Re: Connecting Lucene with ChatGPT Retrieval Plugin
I agree with Robert Muir that an increase of the 1024 limit as it
currently stands in FloatVectorValues or ByteVectorValues would bind the
API; we could not decrease it afterwards, even if we needed to change the
vector engine.

Would it be possible to move the limit definition to an HNSW-specific
implementation, where it would only bind HNSW?
I don't know this area of the code well. It seems to me the
FloatVectorValues implementation is unfortunately not HNSW-specific. Is
this on purpose? We should be able to replace the vector engine, no?
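
Something like this purely hypothetical sketch is what I mean; these
types do not exist in Lucene today, they are only meant to show where the
limit could live:

    // Hypothetical: the dimension limit is owned by the vector engine's
    // format, so it binds only HNSW, and a replacement engine could
    // declare a different limit without an API change.
    abstract class KnnEngineFormat {
        /** Largest vector dimension this engine has been validated for. */
        abstract int maxDimensions(String fieldName);
    }

    class HnswEngineFormat extends KnnEngineFormat {
        @Override
        int maxDimensions(String fieldName) {
            return 1024;
        }
    }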

Re: Connecting Lucene with ChatGPT Retrieval Plugin
Lucene is a library. I don't see how it would be exposed in this plugin,
which is about services.


Re: Connecting Lucene with ChatGPT Retrieval Plugin
I'm adding Lucene HNSW to Cassandra for vector search. One of my test
harnesses loads 50k OpenAI embeddings. It works as expected; as someone
pointed out, performance should be linear with respect to vector size,
and that is what I see. I would not be afraid of increasing the max size.

In parallel, Cassandra is also adding numerical indexes using Lucene's
k-d tree. We definitely expect people to want to compose the two (topK
vector matches that also satisfy some other predicates).
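
In Lucene terms that composition looks roughly like this (field names
invented; searcher and queryVector are assumed to exist):

    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.search.KnnFloatVectorQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // topK vector matches restricted to documents that also satisfy a
    // k-d tree (points) range predicate
    Query filter = IntPoint.newRangeQuery("price", 0, 100);
    Query knn = new KnnFloatVectorQuery("embedding", queryVector, 10, filter);
    TopDocs hits = searcher.search(knn, 10);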

But I agree that classic term-based relevance queries are probably less
useful combined with vector search.



--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: Connecting Lucene with ChatGPT Retrieval Plugin
It looks like the framework is designed to support self-hosted plugins.


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: Connecting Lucene with ChatGPT Retrieval Plugin
Yes, you would split the document into multiple chunks. The ChatGPT
retrieval plugin does this by itself, and AFAIK the default chunk size is
200 tokens
(https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py).

It also creates a unique ID for each document you upload, which is saved
as "document_id" (at least for Weaviate) together with the chunk text.

Regarding a Lucene implementation, you might want to store the chunk text
outside of the Lucene index, using only a chunk id.
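
For example, a sketch of the query side (queryEmbedding, the searcher and
the external store are assumed; the external store here is just a Map):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.KnnFloatVectorQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // vector search in Lucene, chunk text resolved from an external
    // store keyed by the stored chunk_id
    void query(IndexSearcher searcher, float[] queryEmbedding,
               Map<String, String> externalStore) throws IOException {
        TopDocs hits = searcher.search(
            new KnnFloatVectorQuery("embedding", queryEmbedding, 5), 5);
        for (ScoreDoc sd : hits.scoreDocs) {
            String chunkId =
                searcher.storedFields().document(sd.doc).get("chunk_id");
            String chunkText = externalStore.get(chunkId);
            System.out.println(chunkId + ": " + chunkText);
        }
    }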

HTH

Michael

Re: Connecting Lucene with ChatGPT Retrieval Plugin
I assumed that you would wrap Lucene in a minimal REST service, or use
Solr or Elasticsearch.
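
For example, a minimal sketch with the JDK's built-in HTTP server. The
request parsing, embedding call and Lucene search are left out; the
endpoint name follows the plugin's /query route:

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class LuceneRetrievalService {
        public static void main(String[] args) throws Exception {
            HttpServer server =
                HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/query", exchange -> {
                byte[] request = exchange.getRequestBody().readAllBytes();
                // here: parse the JSON query, embed the query text, run a
                // KnnFloatVectorQuery against an IndexSearcher, and build
                // the JSON response the plugin expects
                byte[] response =
                    "{\"results\": []}".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, response.length);
                exchange.getResponseBody().write(response);
                exchange.close();
            });
            server.start();
        }
    }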

Re: Connecting Lucene with ChatGPT Retrieval Plugin
I tracked down a weird bug I was seeing to our cosine similarity
returning NaN with high-dimension vectors. The fix is here:
https://github.com/apache/lucene/pull/12281
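
For what it's worth, one generic way a float cosine can produce NaN (not
necessarily the exact failure this PR fixes) is overflow in the
accumulated dot product and norms:

    // if both the dot product and the norm product overflow to Infinity,
    // the final division yields NaN
    float dot = Float.MAX_VALUE * 2f;    // overflows to Infinity
    float norm = Float.MAX_VALUE * 2f;   // overflows to Infinity
    System.out.println(dot / norm);      // prints NaN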



--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
Re: Connecting Lucene with ChatGPT Retrieval Plugin
Do you anticipate that the vector engine would be changed in a way that
fundamentally (and intentionally) precluded larger vectors? I would think
that the ability to support larger vectors should be a key criterion for
any changes to be made. Certainly, if optimizations at specific sizes
(due to a power-of-2 size or some other numerical coincidence) are found
in the future, we should have ways of picking that up if people use the
beneficial size, but I don't understand the idea that we would support a
change to the engine that precludes larger vectors in the long run. It
makes great sense to have a default limit, because it's important to
communicate that "beyond this point we haven't tested, we don't know what
happens and you are on your own", but forcing a code fork for folks to do
that testing only creates a barrier if they find something useful that
they want to contribute back...

On the proposal thread I like the configurability option, FWIW.


--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)