Mailing List Archive

Dynamic numeric range
Hello everyone,

I have been exploring ways to compute dynamic numeric range facet counts
without requiring users to specify the ranges.

An example use-case might be a price filter on an e-commerce site. Instead
of requiring ranges to be pre-defined before doing facet counting in
Lucene, it would be really cool if Lucene could examine the matching
products and automatically determine relevant price ranges.
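
To make the limitation concrete, here is a minimal sketch in plain Python (not Lucene's API) of counting against ranges that were picked by hand up front. The function name `count_fixed_ranges`, the sample prices, and the range edges are all made up for illustration.

```python
from bisect import bisect_right


def count_fixed_ranges(values, edges):
    """Count values in half-open ranges [edges[i], edges[i + 1]).

    `edges` must be sorted ascending. Values outside every range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Index of the range whose lower edge is the largest one <= v.
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


# Hypothetical prices of the products matching a query.
prices = [3.5, 7.0, 12.0, 18.0, 19.99, 45.0, 250.0, 999.0]

# Hypothetical ranges chosen before looking at the data.
edges = [0, 10, 20, 50]

counts = count_fixed_ranges(prices, edges)
for lo, hi, count in zip(edges, edges[1:], counts):
    print(f"[{lo}, {hi}): {count}")
```

With these hand-picked edges, the two most expensive products fall outside every range, which is the kind of mismatch that data-driven ranges would avoid.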

I saw this blog post
<https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-variablewidthhistogram-aggregation.html>
where Elasticsearch implemented similar functionality, and I think it would
be useful to bring a similar idea into Lucene itself.

I am very early in thinking about this. Has anybody else thought about this?

If anyone is also interested or has any thoughts, please let me know. I am
more than happy to learn from you :)

Thanks,
Yuting
Re: Dynamic numeric range
[Disclaimer: I work with Yuting on Product Search at Amazon]

I think this is super interesting to explore! It would be useful to
have an implementation like LongRangeFacetCounts /
DoubleRangeFacetCounts that "discovers" its own ranges based on the
distribution of the underlying data, rather than requiring the user to
specify the ranges up-front.

Could you please create a Jira issue to track this work? That would be
a good place to track progress, as well as any suggestions others
might have on how to best implement this. Thanks for bringing up the
idea!

Cheers,
-Greg

On Wed, May 26, 2021 at 3:46 PM Yuti G <gan.yuti@gmail.com> wrote:

Re: Dynamic numeric range
Disclaimer: I work with Yuting at Amazon Product Search, but the thoughts in
this mail are entirely my own.

I looked into whether something similar has been done in other libraries and
came across some interesting options in matplotlib's hist function
<https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html>.
matplotlib supports automated histograms with different options for binning
the data:

1. Provide the number of bins you want.
2. Let the library decide the number of bins and the uniform bin width
(with Scott's method
<https://docs.astropy.org/en/stable/api/astropy.stats.scott_bin_width.html#astropy.stats.scott_bin_width>,
the Freedman-Diaconis rule
<https://docs.astropy.org/en/stable/api/astropy.stats.freedman_bin_width.html#astropy.stats.freedman_bin_width>,
or other methods
<https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges>).

I think the second option could provide reasonably good results for
most "normal" use cases.
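
As a rough illustration of letting the data choose a uniform bin width, the sketch below computes the width with Scott's rule (h = 3.49 * sigma / n^(1/3)) and with the Freedman-Diaconis rule (h = 2 * IQR / n^(1/3)), using only the Python standard library. The function names are hypothetical, and the standard-deviation and quartile conventions are assumptions that may differ slightly from what numpy or astropy do.

```python
import math
import statistics


def scott_bin_width(values):
    """Scott's rule: h = 3.49 * sigma / n^(1/3), with the population std dev."""
    n = len(values)
    sigma = statistics.pstdev(values)
    return 3.49 * sigma / n ** (1 / 3)


def freedman_diaconis_bin_width(values):
    """Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    n = len(values)
    # Quartiles with linear interpolation; other conventions give other IQRs.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / n ** (1 / 3)


def uniform_edges(values, width):
    """Split [min, max] into equally wide bins no wider than `width`."""
    lo, hi = min(values), max(values)
    if width <= 0 or hi == lo:
        # Degenerate data: fall back to a single bin.
        return [lo, hi]
    num_bins = math.ceil((hi - lo) / width)
    step = (hi - lo) / num_bins
    return [lo + i * step for i in range(num_bins + 1)]


# Hypothetical prices of the matching products.
prices = [3.5, 7.0, 12.0, 18.0, 19.99, 45.0, 250.0, 999.0]
for rule in (scott_bin_width, freedman_diaconis_bin_width):
    width = rule(prices)
    print(rule.__name__, width, uniform_edges(prices, width))
```

Both rules shrink the bin width as the number of values grows. The Freedman-Diaconis rule relies on the interquartile range, so it is less sensitive to outliers such as a single very expensive product.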

The blog post you linked produces bins with different widths. That is
another important factor to consider when deciding on an implementation.
Lots of directions to explore!
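
One simple way to get bins with different widths is to start with one cluster per value and keep merging the two neighbouring clusters whose centroids are closest until a target number of buckets remains. The sketch below illustrates that idea in plain Python under those assumptions. It is not the algorithm Elasticsearch uses, `variable_width_buckets` is a hypothetical name, and the quadratic loop is only suitable for small inputs.

```python
def variable_width_buckets(values, target_buckets):
    """Greedily merge adjacent clusters until `target_buckets` remain.

    Returns a list of (min, max, count) tuples, one per bucket.
    """
    # Each cluster is [min, max, sum, count]; start with one per value.
    clusters = [[v, v, v, 1] for v in sorted(values)]

    def centroid(c):
        return c[2] / c[3]

    while len(clusters) > max(target_buckets, 1):
        # Find the adjacent pair whose centroids are closest together.
        best = min(
            range(len(clusters) - 1),
            key=lambda i: centroid(clusters[i + 1]) - centroid(clusters[i]),
        )
        left, right = clusters[best], clusters[best + 1]
        merged = [left[0], right[1], left[2] + right[2], left[3] + right[3]]
        clusters[best:best + 2] = [merged]

    return [(c[0], c[1], c[3]) for c in clusters]


# Hypothetical prices with three natural groups.
prices = [1, 2, 3, 50, 52, 500]
for lo, hi, count in variable_width_buckets(prices, 3):
    print(f"[{lo}, {hi}]: {count}")
```

Unlike uniform bins, the resulting buckets follow the gaps in the data, so sparse regions do not produce long runs of empty bins.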

Regards,
Gautam Worah.


On Wed, May 26, 2021 at 3:52 PM Yuti G <gan.yuti@gmail.com> wrote:

Re: Dynamic numeric range
Hi Gautam,

Thank you for sharing your thoughts!

Matplotlib is a great resource that I should look at. Currently, I am
exploring the possibility of generating equi-probable buckets with variable
widths based on the data. A uniform bin width is also a good starting point,
since buckets from different histograms would be easy to merge.
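
To sketch both ideas in plain Python: equi-probable buckets can be approximated by cutting the sorted values at evenly spaced ranks, and uniform-width histograms merge by simply adding the counts of matching bucket indices. All function names below are hypothetical, and the partial histograms stand for counts computed over separate subsets of the data.

```python
from collections import Counter


def equal_frequency_edges(values, num_buckets):
    """Edges such that each bucket holds roughly the same number of values.

    Buckets are [edges[i], edges[i + 1]), except the last one, which also
    includes its upper edge. Repeated values can produce repeated edges.
    """
    if not values or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    for k in range(1, num_buckets):
        # Value at the k-th evenly spaced rank.
        edges.append(ordered[k * n // num_buckets])
    edges.append(ordered[-1])
    return edges


def uniform_histogram(values, width):
    """Map bucket index i to the count of values in [i * width, (i + 1) * width)."""
    return Counter(int(v // width) for v in values)


def merge_histograms(a, b):
    """Merge two uniform histograms that share the same bucket width."""
    return a + b


# Hypothetical prices: narrow buckets where data is dense, wide ones elsewhere.
prices = [1, 2, 3, 4, 10, 20, 50, 100]
print(equal_frequency_edges(prices, 4))

# Two partial histograms with the same width merge by adding counts.
part_a = uniform_histogram([1, 4, 12], 10)
part_b = uniform_histogram([3, 14, 27], 10)
print(sorted(merge_histograms(part_a, part_b).items()))
```

Merging works here only because both histograms share the same bucket boundaries. Equal-frequency edges depend on the data each histogram saw, so combining them would need extra work such as re-bucketing.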


Thanks,
Yuting Gan

On Wed, Jun 2, 2021 at 9:11 PM Gautam Worah <worah.gautam@gmail.com> wrote:

Re: Dynamic numeric range
Hi Greg,

Thank you so much! I will create a Jira issue today.

Best,
Yuting

On Wed, Jun 2, 2021 at 7:02 PM Greg Miller <gsmiller@gmail.com> wrote:
