Mailing List Archive: get distinct values from indexreader for given field

vvsevel at gmail

Nov 22, 2023, 11:09 AM

Post #1 of 4

Hello,

In Lucene 6 I was doing this to get all values for a given field
knowing its type:

public List<Object> getDistinctValues(IndexReader reader, String fieldname,
        Class<? extends Object> type) throws IOException {

    List<Object> values = new ArrayList<Object>();
    Fields fields = MultiFields.getFields(reader);
    if (fields == null) return values;

    Terms terms = fields.terms(fieldname);
    if (terms == null) return values;

    TermsEnum iterator = terms.iterator();

    BytesRef value = iterator.next();

    while (value != null) {
        if (type == Long.class) {
            values.add(LegacyNumericUtils.prefixCodedToLong(value));
        } else if (type == Integer.class) {
            values.add(LegacyNumericUtils.prefixCodedToInt(value));
        } else if (type == Boolean.class) {
            values.add(LegacyNumericUtils.prefixCodedToInt(value) == 1 ? TRUE : FALSE);
        } else if (type == Date.class) {
            values.add(new Date(LegacyNumericUtils.prefixCodedToLong(value)));
        } else if (type == String.class) {
            values.add(value.utf8ToString());
        } else {
            // ...
        }

        value = iterator.next();
    }

    return values;
}

I am trying to upgrade to Lucene 9. Two things changed along the way:
- LegacyNumericUtils has been removed in favor of the points API (PointValues)
- MultiFields.getFields() has been dropped, and I read that we are encouraged
to avoid the Fields API in general

What is the proper way to get the distinct values for a specific
field in a reader?

Thanks for your help,

vs

Re: get distinct values from indexreader for given field

msfroh at gmail

Nov 28, 2023, 1:42 PM

Post #2 of 4

Hello!

Instead of MultiFields.getFields(), you can use MultiTerms.getTerms(reader,
fieldname) to get the Terms instance.

To decode your long / int values, you should be able to use
LongPoint/IntPoint.unpack to write the values into an array:

long[] val = new long[1]; // Assuming 1-D values
LongPoint.unpack(value, 0, val);
values.add(val[0]);
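
Once a raw numeric value has been decoded, the per-type dispatch from the original method can probably stay much as it was. Below is a minimal sketch of that step using only the JDK. The class name RawValueConverter and the method convert are hypothetical helpers for illustration, not part of Lucene, and the sketch assumes booleans were indexed as 0/1 and dates as epoch milliseconds, as in the original code.

```java
import java.util.Date;

// Hypothetical helper, not a Lucene class: maps an already-decoded
// numeric value to the Java type the caller asked for.
final class RawValueConverter {

    static Object convert(long raw, Class<?> type) {
        if (type == Long.class) {
            return raw; // autoboxed to Long
        }
        if (type == Integer.class) {
            return Math.toIntExact(raw); // throws ArithmeticException on overflow
        }
        if (type == Boolean.class) {
            // Assumption: booleans were indexed as 1 (true) or 0 (false).
            return raw == 1 ? Boolean.TRUE : Boolean.FALSE;
        }
        if (type == Date.class) {
            // Assumption: dates were indexed as epoch milliseconds.
            return new Date(raw);
        }
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
}
```

With this in place, the loop body would reduce to something like values.add(RawValueConverter.convert(val[0], type)).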

Hope that helps,
Froh

Re: get distinct values from indexreader for given field

msfroh at gmail

Nov 28, 2023, 2:45 PM

Post #3 of 4

Oh -- of course if you're using IntPoint / LongPoint for your numeric
fields, they won't be indexed as terms, so loading terms for them won't
work.

It's not the prettiest solution, but I think the following should let you
collect the set of distinct point values for an IntPoint field:

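final Set<Integer> collectedValues = new TreeSet<>();
for (LeafReaderContext lrc : reader.leaves()) {
    LeafReader lr = lrc.reader();
    PointValues.IntersectVisitor collectingVisitor = new PointValues.IntersectVisitor() {
        @Override
        public void visit(int docID) throws IOException {
            // Only used for cells reported as fully inside the query,
            // which compare() below never does.
        }

        @Override
        public void visit(int docID, byte[] packedValue) {
            // 1-D field: the single int value starts at offset 0.
            collectedValues.add(IntPoint.decodeDimension(packedValue, 0));
        }

        @Override
        public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
            // Force a visit of every value in every cell.
            return PointValues.Relation.CELL_CROSSES_QUERY;
        }
    };

    PointValues pointValues = lr.getPointValues(fieldname);
    if (pointValues != null) { // null when this segment has no points for the field
        pointValues.intersect(collectingVisitor);
    }
}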
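
For readers wondering what IntPoint.decodeDimension does with the packed bytes: as far as I know, Lucene stores each int dimension as 4 big-endian bytes with the sign bit flipped, so that unsigned byte-wise order matches signed numeric order. The standalone sketch below re-implements that idea with the JDK only, so it can be run without Lucene on the classpath. The class SortableIntBytes and its methods are hypothetical names for illustration, and the encoding shown is my understanding of the format rather than a copy of Lucene's source.

```java
import java.util.Set;
import java.util.TreeSet;

// Standalone illustration of a sortable int encoding; not Lucene API.
final class SortableIntBytes {

    // Flip the sign bit, then write the 4 bytes big-endian, so that
    // comparing the bytes as unsigned values matches signed int order.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Read 4 big-endian bytes starting at the given byte offset and
    // flip the sign bit back to recover the original int.
    static int decodeDimension(byte[] packed, int offset) {
        int flipped = ((packed[offset] & 0xFF) << 24)
                | ((packed[offset + 1] & 0xFF) << 16)
                | ((packed[offset + 2] & 0xFF) << 8)
                | (packed[offset + 3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    public static void main(String[] args) {
        // Collect distinct values the same way the visitor above does.
        Set<Integer> distinct = new TreeSet<>();
        for (int v : new int[] {42, -7, 42, 0, -7}) {
            distinct.add(decodeDimension(encode(v), 0));
        }
        System.out.println(distinct); // prints [-7, 0, 42]
    }
}
```

The TreeSet gives both de-duplication and sorted output, which is why the visitor in the post can simply add every decoded value it sees.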

Re: get distinct values from indexreader for given field

vvsevel at gmail

Dec 5, 2023, 10:54 AM

Post #4 of 4

Thanks Michael