Mailing List Archive

Store arrays in DocValues and keep the original order
Hi~

We are trying to build an OLAP database based on Lucene, and we rely heavily on Lucene's DocValues (as our column store).

We are trying to use DocValues to store array-typed fields. For example, if we want to store field1 and field2 from this JSON document in DocValues, SORTED_NUMERIC and SORTED_SET respectively seem to be our only options.

{
"field1": [ 3, 1, 1, 2 ],
"field2": [ "c", "a", "a", "b" ]
}


When we store field1 in SORTED_NUMERIC and field2 in SORTED_SET, we will get this result:

field1:

* origin: [3, 1, 1, 2]
* in SORTED_NUMERIC: [1, 1, 2, 3]

field2:

* origin: ["c", "a", "a", "b"]
* in SORTED_SET: ords [0, 1, 2] terms ["a", "b", "c"]

The original ordering relationship of the elements in the array is lost.
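
To make the effect concrete, here is a minimal sketch in plain Java that only emulates what the two types expose per document. It does not call Lucene at all; the class and method names are made up for illustration, and TreeSet ordering matches Lucene's term ordering only for simple ASCII values like these.

```java
import java.util.Arrays;
import java.util.TreeSet;

class DocValuesOrderDemo {
    // Emulates SORTED_NUMERIC: every value is kept, but in ascending order.
    static long[] asSortedNumeric(long[] values) {
        long[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Emulates SORTED_SET: distinct terms only, in sorted order.
    static String[] asSortedSet(String[] values) {
        return new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints [1, 1, 2, 3]
        System.out.println(Arrays.toString(asSortedNumeric(new long[] {3, 1, 1, 2})));
        // prints [a, b, c]
        System.out.println(Arrays.toString(asSortedSet(new String[] {"c", "a", "a", "b"})));
    }
}
```

In both cases the position of each element in the source array can no longer be recovered.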

We're guessing that Lucene's DocValues are designed primarily for sorting and aggregation, so the original order of elements may not matter.

But in our use case, it is important to keep the original order of the elements in the array (we allow users to access array elements with the subscript operator).

Does Lucene have plans to add new DocValues types that can store arrays and keep the original order of their elements?

Thanks!
Re: Store arrays in DocValues and keep the original order
Depending on what you use the field for, you can use BinaryDocValuesField
which encodes a byte[] and lets you store the data however you want. But
how are you using these fields later at search time?
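
As a rough sketch of that idea for a numeric array like field1: the codec below is plain Java, and its class name, method names, and byte layout (8 bytes per value, big-endian, original order) are assumptions for illustration, not anything Lucene prescribes. The comments show where BinaryDocValuesField and BytesRef would come in.

```java
import java.nio.ByteBuffer;

class LongArrayCodec {
    // Writes each value as 8 big-endian bytes, in the array's original order.
    // At index time the result could be stored with something like:
    //   doc.add(new BinaryDocValuesField("field1", new BytesRef(encode(values))));
    static byte[] encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // Number of elements in an encoded value of the given byte length.
    static int size(int byteLength) {
        return byteLength / Long.BYTES;
    }

    // Subscript access: reads element `index` directly, without decoding the rest.
    // `offset` is where the encoded value starts inside `bytes`; a BytesRef read
    // back from BinaryDocValues carries such an offset and it may be non-zero.
    static long get(byte[] bytes, int offset, int index) {
        return ByteBuffer.wrap(bytes).getLong(offset + index * Long.BYTES);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new long[] {3, 1, 1, 2});
        System.out.println(get(encoded, 0, 0)); // prints 3
        System.out.println(get(encoded, 0, 3)); // prints 2
    }
}
```

Fixed-width values keep subscript access O(1); a variable-length encoding would be more compact but would need a scan or an offset table.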

On Tue, Jun 28, 2022 at 3:46 PM linfeng lu <linfeng.lu@hotmail.com> wrote:

> [...]
Re: Store arrays in DocValues and keep the original order
You're correct that these doc value fields are primarily meant for sorting,
as well as some other use cases like faceting. What you've discovered
is also correct: these fields don't maintain the original ordering,
and SORTED_SET dedupes values (
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/index/DocValuesType.html
).

There's no technical reason new doc value types couldn't be added that
maintain original ordering and don't dedupe, but whether or not there are
enough use cases to justify that is a question that would need to be
considered. +1 to Shai's suggestion to build on BinaryDocValues. By
extending BinaryDocValuesField, you can encode the doc values however you
like. An example of this can be seen here:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/IntRangeDocValuesField.java
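
For a string array like field2, one possible encoding is sketched below in plain Java. The class name, method names, and layout (element count, then the end offset of each element, then the UTF-8 bytes) are hypothetical choices for illustration. The sketch assumes the encoded value starts at index 0 of the byte array; when reading a BytesRef back from Lucene, its offset would have to be added.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class StringArrayCodec {
    // Layout: [count:int][end offset of each element:int x count][UTF-8 data]
    // Elements are written in their original order; duplicates are kept.
    static byte[] encode(String[] values) {
        byte[][] utf8 = new byte[values.length][];
        int dataLength = 0;
        for (int i = 0; i < values.length; i++) {
            utf8[i] = values[i].getBytes(StandardCharsets.UTF_8);
            dataLength += utf8[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (1 + values.length) + dataLength);
        buf.putInt(values.length);
        int end = 0;
        for (byte[] b : utf8) {
            end += b.length;
            buf.putInt(end); // end offset relative to the start of the data section
        }
        for (byte[] b : utf8) {
            buf.put(b);
        }
        return buf.array();
    }

    static int size(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt(0);
    }

    // Subscript access via the offset table; other elements are not decoded.
    static String get(byte[] bytes, int index) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = buf.getInt(0);
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + count);
        }
        int dataStart = Integer.BYTES * (1 + count);
        int start = index == 0 ? 0 : buf.getInt(Integer.BYTES * index);
        int end = buf.getInt(Integer.BYTES * (index + 1));
        return new String(bytes, dataStart + start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new String[] {"c", "a", "a", "b"});
        System.out.println(size(encoded));   // prints 4
        System.out.println(get(encoded, 0)); // prints c
        System.out.println(get(encoded, 2)); // prints a
    }
}
```

Unlike SORTED_SET, this stores each document's strings inline with no shared term dictionary, so repeated values cost space in every document.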

Hope this helps.

Cheers,
-Greg

On Tue, Jun 28, 2022 at 5:52 AM Shai Erera <serera@gmail.com> wrote:

> [...]