Mailing List Archive

Using setIndexSort on a binary field
Hi all,

Could someone point me to an example of using
IndexWriterConfig.setIndexSort on a field containing binary values?

To be specific, the fields are constructed using the Field(String name,
byte[] value, IndexableFieldType type) constructor, and I'd like to try
using the java.util.Arrays.compareUnsigned method to sort the fields.

Thanks,
Alex
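
For reference, the ordering the question asks for can be shown with the JDK alone. This is a minimal sketch, assuming Java 9 or later (where Arrays.compareUnsigned was added); the class name, method name, and sample keys are made up for illustration and are not from the thread.

```java
import java.util.Arrays;
import java.util.Comparator;

public class UnsignedByteOrder {
    /** Returns a sorted copy of the keys, comparing bytes as values in 0..255. */
    public static byte[][] sortUnsigned(byte[][] keys) {
        // Lexicographic comparison; a key that is a prefix of another sorts first.
        Comparator<byte[]> unsigned = Arrays::compareUnsigned;
        byte[][] copy = keys.clone();
        Arrays.sort(copy, unsigned);
        return copy;
    }

    public static void main(String[] args) {
        byte[][] keys = { {(byte) 0x80}, {0x7F}, {0x01, 0x02}, {0x01} };
        for (byte[] key : sortUnsigned(keys)) {
            // Java prints bytes as signed, so 0x80 shows up as -128.
            System.out.println(Arrays.toString(key));
        }
    }
}
```

Running main prints [1], [1, 2], [127], [-128] on separate lines: 0x80 sorts last because it is treated as 128, whereas a signed comparison would place it first.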
Re: Using setIndexSort on a binary field
Hi Alex,

You need to use a BinaryDocValuesField so that the field is indexed with
doc values.

`Field` is not going to work because it only indexes the data while index
sorting requires doc values.
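
A rough sketch of an index sort on a doc-values field follows. Note one substitution: it uses SortedDocValuesField with SortField.Type.STRING rather than BinaryDocValuesField, on the assumption that the standard SortField types read sorted or numeric doc values. BytesRef values compare byte by byte as unsigned, which should match the Arrays.compareUnsigned ordering asked about. The field name "key", the sample values, and the in-memory directory are illustrative only, and this has not been checked against a specific Lucene release.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class BinaryIndexSortSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig();
        // "key" is a hypothetical field name; Type.STRING sorts on its sorted doc values.
        config.setIndexSort(new Sort(new SortField("key", SortField.Type.STRING)));

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, config)) {
            byte[][] keys = { {(byte) 0x80}, {0x7F}, {0x01} };
            for (byte[] key : keys) {
                Document doc = new Document();
                // The sort field must be present as doc values on each document.
                doc.add(new SortedDocValuesField("key", new BytesRef(key)));
                writer.addDocument(doc);
            }
            writer.commit();
        }
    }
}
```

The sort only affects document order inside each segment that gets written; it does not decide which segment a document lands in.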

On Fri, Oct 15, 2021 at 6:40 PM Alex K <aklibisz@gmail.com> wrote:

> [...]


--
Adrien
Re: Using setIndexSort on a binary field
Thanks Adrien. This makes me think I might not be understanding the use
case for index sorting correctly. I basically want to make it so that my
terms are sorted across segments. For example, let's say I have integer
terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in segment
1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on.
With default indexing settings, I see terms duplicated across segments. I
thought index sorting was the way to achieve this, but the use of doc
values makes me think it might actually be used for something else? Is
something like what I described possible? Any clarification would be great.

Thanks,
Alex


On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand <jpountz@gmail.com> wrote:

> [...]
Re: Using setIndexSort on a binary field
Yeah, index sorting doesn't do that -- it sorts *within* each segment
so that when documents are iterated (within that segment) by any of
the many DocIdSetIterators that underlie the Lucene search API, they
are retrieved in the order specified (which is then also docid order).
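
The distinction can be illustrated without Lucene. The toy model below is only a sketch: segments are plain lists of integers, and the class and method names are invented for illustration. Sorting each list on its own, which is roughly what index sorting does per segment, leaves the value ranges of different segments overlapping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SegmentSortSketch {
    /** Sorts each segment independently; no value moves to another segment. */
    public static List<List<Integer>> sortWithinSegments(List<List<Integer>> segments) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> segment : segments) {
            List<Integer> sorted = new ArrayList<>(segment);
            Collections.sort(sorted);
            out.add(sorted);
        }
        return out;
    }

    /** True if every value in an earlier segment is smaller than every value in a later one. */
    public static boolean rangesDisjointAndOrdered(List<List<Integer>> segments) {
        Integer previousMax = null;
        for (List<Integer> segment : segments) {
            if (segment.isEmpty()) {
                continue;
            }
            if (previousMax != null && Collections.min(segment) <= previousMax) {
                return false;
            }
            previousMax = Collections.max(segment);
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Integer>> segments =
            List.of(List.of(7, 1, 4), List.of(5, 2, 8), List.of(3, 9, 6));
        List<List<Integer>> sorted = sortWithinSegments(segments);
        System.out.println(sorted);
        System.out.println(rangesDisjointAndOrdered(sorted));
    }
}
```

Here main prints [[1, 4, 7], [2, 5, 8], [3, 6, 9]] followed by false: each segment is ordered internally, but the segments still cover overlapping ranges.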

To achieve what you want you would have to tightly control the
indexing process. For example you could configure a NoMergePolicy to
prevent the segments you manually create from being merged, set a very
large RAM buffer size on the index writer so it doesn't unexpectedly
flush a segment while you're indexing, and then index documents in the
sequence you want to group them by, committing after each block of
documents. But this is a very artificial setup; it wouldn't survive
any normal indexing workflow where merges are allowed, documents may
be updated, etc.
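
The controlled setup described above might look roughly like the following. Treat it as a sketch under assumptions rather than a tested recipe: the field name "term", the block size of 10, the 1024 MB buffer, and the in-memory directory are all illustrative, and it relies on a single indexing thread so that each commit flushes one segment.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class OneSegmentPerBlockSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig();
        // Never merge the segments created below.
        config.setMergePolicy(NoMergePolicy.INSTANCE);
        // Far more than these small documents need, so no flush happens mid-block.
        config.setRAMBufferSizeMB(1024);

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, config)) {
            for (int block = 0; block < 10; block++) {
                // Terms 1..10 go in the first block, 11..20 in the second, and so on.
                for (int term = block * 10 + 1; term <= block * 10 + 10; term++) {
                    Document doc = new Document();
                    doc.add(new StringField("term", Integer.toString(term), Field.Store.NO));
                    writer.addDocument(doc);
                }
                // Committing after each block closes off the current segment.
                writer.commit();
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                // Each leaf corresponds to one segment.
                System.out.println("segments: " + reader.leaves().size());
            }
        }
    }
}
```

As noted, this layout is fragile: it only holds while merges stay disabled and no documents are updated or added out of order.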

For testing purposes we've recently added the ability to rearrange the
index (IndexRearranger) according to a specific assignment of docids
to segments - you could apply this to an existing index. But again,
this is not really intended for use in a production on-line index that
receives updates.

On Fri, Oct 15, 2021 at 1:27 PM Alex K <aklibisz@gmail.com> wrote:
> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Using setIndexSort on a binary field
Thanks Michael. Totally agree this is a contrived setup. It's mostly for
benchmarking purposes right now. I was actually able to rephrase my problem
in a way that made more sense for the existing setIndexSort API using float
doc values and saw an appreciable speedup in searches. The IndexRearranger
is also good to know about.

Cheers,
Alex

On Sun, Oct 17, 2021 at 9:32 AM Michael Sokolov <msokolov@gmail.com> wrote:

> [...]