Mailing List Archive

Right Way to Read vectors from Index
Hi,
Our project uses Lucene 9.7.0, and we frequently need to read vectors from the index for a set of documents. We tried two approaches:
1. Index the vector as a stored field and retrieve it whenever needed using the StoredFields APIs.
2. Use LeafReader’s API to read the vector. Here, random access to documents is very slow.
Which one is the right approach, and can you suggest a better one? Also, why isn’t there a straightforward API like the StoredFields API for reading vectors?

Regards,
Uthra
Re: Right Way to Read vectors from Index
Can you describe your use case in more detail (beyond having to read the
vectors)?

Thanks

Michael

On 09.02.24 at 12:28, Uthra wrote:
> Hi,
> Our project uses Lucene 9.7.0, and we frequently need to read vectors from the index for a set of documents. We tried two approaches:
> 1. Index the vector as a stored field and retrieve it whenever needed using the StoredFields APIs.
> 2. Use LeafReader’s API to read the vector. Here, random access to documents is very slow.
> Which one is the right approach, and can you suggest a better one? Also, why isn’t there a straightforward API like the StoredFields API for reading vectors?
>
> Regards,
> Uthra


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Right Way to Read vectors from Index
Hi Michael,
The use case is to handle index updates for documents with a vector field without resending the vector in the change data every time. The change data will consist only of “updated_field(s):value(s)”, and I will read the vector value from the index to update the document.

Thanks,
Uthra

> On 09-Feb-2024, at 7:13 PM, Michael Wechner <michael.wechner@wyona.com> wrote:
>
> Can you describe your use case in more detail (beyond having to read the vectors)?
>
> Thanks
>
> Michael
Re: Right Way to Read vectors from Index
Hi,

> Using LeafReader’s API to read vector. Here the Random accessing of documents is very slow.

Is it possible that you are creating a new VectorValues instance for every
doc whose value you want to look up?
Ideally, you should sort your docids and then advance to them one by one,
or call nextDoc. Then retrieve the value stored at the position pointed to
by the iterator.

This would look something like this method:
https://github.com/apache/lucene/blob/branch_9_7/lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java#L226
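
As a rough, self-contained illustration of the sort-then-advance pattern described above: the `ForwardVectorIterator` class below is a hypothetical stand-in for a forward-only per-segment vector iterator and is not a Lucene class. In Lucene 9.7 the corresponding calls would presumably be `LeafReader.getFloatVectorValues(field)`, then `advance(docId)` and `vectorValue()` on the returned iterator, with segment-local doc ids. The sketch copies each vector because an iterator may reuse its array between calls.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SortedVectorRead {

    /** Hypothetical stand-in for a forward-only vector iterator (not a Lucene class). */
    public static final class ForwardVectorIterator {
        public static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docIds;      // ascending doc ids that have a vector
        private final float[][] vectors; // vectors[i] belongs to docIds[i]
        private int pos = -1;

        public ForwardVectorIterator(int[] docIds, float[][] vectors) {
            this.docIds = docIds;
            this.vectors = vectors;
        }

        /** Moves forward to the first doc id >= target; it can never move backwards. */
        public int advance(int target) {
            do {
                pos++;
            } while (pos < docIds.length && docIds[pos] < target);
            return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
        }

        /** Vector of the doc the iterator is currently positioned on. */
        public float[] vectorValue() {
            return vectors[pos];
        }
    }

    /** Reads the vectors of the wanted docs with ONE iterator and ONE forward pass. */
    public static Map<Integer, float[]> readVectors(ForwardVectorIterator it, int[] wantedDocIds) {
        // Sort and de-duplicate so that every advance() call moves strictly forward.
        int[] sorted = Arrays.stream(wantedDocIds).sorted().distinct().toArray();
        Map<Integer, float[]> result = new LinkedHashMap<>();
        int current = -1;
        for (int target : sorted) {
            if (current < target) {
                current = it.advance(target);
            }
            if (current == ForwardVectorIterator.NO_MORE_DOCS) {
                break; // no later doc has a vector
            }
            if (current == target) {
                // Copy, because the iterator may reuse the returned array.
                result.put(target, it.vectorValue().clone());
            }
            // If current > target, the wanted doc has no vector and is skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        ForwardVectorIterator it = new ForwardVectorIterator(
            new int[] {0, 2, 5, 9},
            new float[][] {{0.0f, 0.1f}, {2.0f, 2.1f}, {5.0f, 5.1f}, {9.0f, 9.1f}});
        Map<Integer, float[]> vectors = readVectors(it, new int[] {9, 2, 3});
        vectors.forEach((doc, v) -> System.out.println(doc + " -> " + Arrays.toString(v)));
    }
}
```

The point of the pattern is that the iterator is created once per segment and only ever moves forward, instead of a fresh iterator being created and scanned for every requested document.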

TL;DR: I don't think it should be *that* slow. Maybe share the code you are
using to retrieve the vectors?

Regards,
Gautam Worah.


On Sun, Feb 11, 2024 at 4:40 AM Uthra <uthrakumarvs@gmail.com> wrote:

> Hi Michael,
> The use case is to handle index updates for documents with a vector
> field without resending the vector in the change data every time. The change
> data will consist only of “updated_field(s):value(s)”, and I will read
> the vector value from the index to update the document.
>
> Thanks,
> Uthra
Re: Right Way to Read vectors from Index
Thanks for explaining, Uthra!

IIUC, the text/data from which the vector was originally generated has
not changed, only some other data (e.g. metadata) that is also part of
the Lucene document, right?
So, if you want to update the other data within the Lucene document, you
first retrieve the Lucene document, create a new Lucene document, update
the changed data, and keep the unchanged vector, which means you don't
need to regenerate the vector, right?

Thanks

Michael



On 11.02.24 at 13:39, Uthra wrote:
> Hi Michael,
> The use case is to handle index updates for documents with a vector field without resending the vector in the change data every time. The change data will consist only of “updated_field(s):value(s)”, and I will read the vector value from the index to update the document.
>
> Thanks,
> Uthra
Re: Right Way to Read vectors from Index
@Gautam - Thanks for your response. Our LeafReader API approach is the one you mentioned. I wanted to confirm the best way to read vectors for our case.

@Michael - Yes, Michael, that’s the case here.

Regards,
Uthra

> On 12-Feb-2024, at 1:23 PM, Michael Wechner <michael.wechner@wyona.com> wrote:
>
> Thanks for explaining, Uthra!
>
> IIUC, the text/data from which the vector was originally generated has not changed, only some other data (e.g. metadata) that is also part of the Lucene document, right?
> So, if you want to update the other data within the Lucene document, you first retrieve the Lucene document, create a new Lucene document, update the changed data, and keep the unchanged vector, which means you don't need to regenerate the vector, right?
>
> Thanks
>
> Michael
Re: Right Way to Read vectors from Index
Hi,

Reading information back from the inverted index (and also from vectors)
is always slow, because the data is not stored "as is" for easy
reconsumption. To allow easy reindexing, the input data must be serialized
to a "stored" field in parallel to the indexed value.

Elasticsearch uses the approach of having a single, separate "stored
only" binary field in the index that contains the "_source" data of the
whole document in a machine-readable format (JSON/CBOR/SMILE). When a
document is updated in the index, the updater reads the original source,
applies the updates to it, and then reindexes the document. All other
fields in Elasticsearch are not stored (unless you explicitly opt in to
that).
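
A minimal sketch of that read-modify-reindex step, under stated assumptions: the `Map` stands in for the deserialized "_source" of one document (including its vector), and the class and method names are made up for illustration. Reading the stored field and handing the rebuilt document to something like `IndexWriter.updateDocument` is left out here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SourceUpdate {

    /**
     * Returns a new full document: every field of the stored source, with the
     * fields named in the change data overwritten. Fields that are not part of
     * the change data (for example the vector) are carried over unchanged.
     */
    public static Map<String, Object> applyUpdate(
            Map<String, Object> storedSource, Map<String, Object> changes) {
        Map<String, Object> merged = new LinkedHashMap<>(storedSource);
        merged.putAll(changes);
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical stored source of one document.
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", "42");
        stored.put("category", "draft");
        stored.put("vector", new float[] {0.1f, 0.2f, 0.3f});

        // Change data carries only the updated field, not the vector.
        Map<String, Object> changes = Map.of("category", "published");

        Map<String, Object> merged = applyUpdate(stored, changes);
        System.out.println("category = " + merged.get("category"));
        System.out.println("vector   = " + Arrays.toString((float[]) merged.get("vector")));
    }
}
```

The merged map is what would then be converted back into a full Lucene document and reindexed, so the vector never has to be resent or regenerated.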

In Solr it is very similar, but there the stored values are serialized
to companion fields with the same name. There is currently no separate
Lucene StoredField implementation for storing vectors, but it's easy to
do: you could use a binary (byte[]) stored field to preserve the vector
data (e.g., serialized in little- or big-endian order).
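
One possible way to do the serialization, as a sketch only: the helper below uses plain `java.nio` and picks little-endian order arbitrarily, and the `VectorBytes` class name is made up. The resulting `byte[]` could presumably be passed to Lucene's `StoredField(String, byte[])` constructor. When reading back, `Document.getBinaryValue(String)` returns a `BytesRef` with `bytes`, `offset` and `length`, which is why the decoder takes an offset and a length.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class VectorBytes {

    /** Serializes a float vector to bytes, 4 bytes per component, little-endian. */
    public static byte[] toBytes(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    /** Decodes a vector from a slice of a byte array written by toBytes. */
    public static float[] fromBytes(byte[] bytes, int offset, int length) {
        if (length % Float.BYTES != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + Float.BYTES);
        }
        float[] vector = new float[length / Float.BYTES];
        ByteBuffer.wrap(bytes, offset, length)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {0.25f, -1.5f, 3.0f};
        byte[] stored = toBytes(original);
        float[] restored = fromBytes(stored, 0, stored.length);
        System.out.println(Arrays.toString(restored));
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

Whichever byte order is chosen, it has to be the same on the write and the read side, so it is worth fixing it explicitly rather than relying on a default.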

I tend to favour the Elasticsearch approach of having a single stored
field containing the whole document in machine-readable form.

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: Right Way to Read vectors from Index
It is good that you asked this question, because it made me realize
that there is some room for improvement in our own application :-)

Thanks

Michael

On 12.02.24 at 11:35, Uthra wrote:
> @Gautam - Thanks for your response. Our LeafReader API approach is the one you mentioned. I wanted to confirm the best way to read vectors for our case.
>
> @Michael - Yes, Michael, that’s the case here.
>
> Regards,
> Uthra