Mailing List Archive

FacetResult#value semantics?
Hi folks-

I'm trying to make sure I have a proper understanding of what
FacetResult#value is meant to represent, particularly in multi-valued
doc scenarios. Apologies if I'm missing something obvious, but it
seems that either my understanding is incorrect, or we have a bug in
how we count multi-value docs. This is particularly relevant to me at
the moment since I'm working on a couple facet-related changes, and I
want to make sure I've got a proper understanding of this field.
Thanks!

From the Javadocs:
/**
 * Total value for this path (sum of all child counts, or sum of all child values),
 * even those not included in the topN.
 */
public final Number value;

So from the Javadocs, it seems this is simply the sum of all values
for the given dim+path. In the case of single-value docs, this would
also represent the total number of documents containing a value for
the given dim+path, which seems fairly useful (i.e., it might be nice
to know how many documents contain a value for a given facet
dim+path). On the other hand, if docs can be multi-valued, this seems
somewhat less useful. If this is truly the sum of the values for the
given dim+path, each document can contribute more than one count, so
the user can no longer interpret this as the number of documents that
have at least one value for the facet dim+path. It seems as though it
would be more useful to provide the number of documents with a given
dim+path value instead of just the total count, but this is where I'm
probably just misunderstanding something.
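
To make the two readings concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: `FacetValueSketch`, `childCounts`, `sumOfChildCounts`, and `docCount` are hypothetical names, and the three sample documents are invented for illustration. It only shows how "sum of all child counts" and "number of docs with at least one value" diverge once a document carries more than one value for the same dimension.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class FacetValueSketch {

  /** Per-child counts for one dimension; a doc counts at most once per distinct child. */
  public static Map<String, Integer> childCounts(List<List<String>> docs) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (List<String> values : docs) {
      // De-duplicate within the doc so a repeated value is not counted twice.
      for (String v : new LinkedHashSet<>(values)) {
        counts.merge(v, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Reading 1: the dimension-level value is the sum of all child counts. */
  public static int sumOfChildCounts(List<List<String>> docs) {
    int total = 0;
    for (int c : childCounts(docs).values()) {
      total += c;
    }
    return total;
  }

  /** Reading 2: the dimension-level value is the number of docs with at least one value. */
  public static int docCount(List<List<String>> docs) {
    int total = 0;
    for (List<String> values : docs) {
      if (!values.isEmpty()) {
        total++;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Each inner list holds one doc's values for a single (made-up) "Author" dimension.
    List<List<String>> docs = List.of(
        List.of("Lisa"),          // single-valued doc
        List.of("Lisa", "Bob"),   // multi-valued doc
        List.of());               // doc with no value for this dimension

    System.out.println(childCounts(docs));      // {Lisa=2, Bob=1}
    System.out.println(sumOfChildCounts(docs)); // 3
    System.out.println(docCount(docs));         // 2
  }
}
```

With only single-valued docs the two numbers coincide; the multi-valued doc is what pushes the sum (3) above the doc count (2).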

Finally, looking at the way taxonomy facets are counted, it looks like
this value is populated with the total number of documents, and
populated with -1 in multi-value cases where an accurate doc count
can't be provided (see IntTaxonomyFacets L:228 for example). This
isn't consistent with the implementation in LongValueFacetCounts
though, which will always populate the total of all values, ignoring
single- vs. multi-valued cases (see LongValueFacetCounts L:163). It
appears the implementation in SortedSetDocValuesFacetCounts will also
"double count" multi-value cases similar to LongValueFacetCounts.

So... which do we think it is? Is it meant to be the total number of
docs, or the total of all values? Can anyone shed some light on this?
Thanks a bunch!

Cheers,
-Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: FacetResult#value semantics?
Hi Greg,

I agree the javadocs do not make it very clear :) And the different facet
impls are doing different things!

But I think we really should strive to count each hit only once. So, if a
multi-valued field has FacetLabel A more than once, we should count it only
once in the "value" for that FacetLabel?

I.e. the goal here is to say "if the user were to drill down on this value,
how many total hits would THAT query return?".

Mike McCandless

http://blog.mikemccandless.com


On Sun, May 9, 2021 at 12:40 PM Greg Miller <gsmiller@gmail.com> wrote:

Re: FacetResult#value semantics?
Hi Greg,

I think your understanding is correct. I tried to create test cases
<https://github.com/gautamworah96/lucene/commit/042878117308f76629a27b0bcf83e25f074dc8b1>
for FastTaxonomyFacetCounts (inherits from IntTaxonomyFacets) and
LongValueFacetCounts.

FastTaxonomyFacetCounts treats repeated values within a document as a single
entry and returns the count of a dim+path as the number of documents that
contain that dim+path.
On the other hand, LongValueFacetCounts counts every occurrence separately and
returns the number of instances of the dim+path value (each doc can be counted
more than once).

> In the case of single-value docs, this would
> also represent the total number of documents containing a value for
> the given dim+path, which seems fairly useful

+1

I think the "each doc can be counted more than once" logic also has merit.
For example, you could probably use it for counting the number of times a
movie has been watched in a person->list of movies watched schema.

I don't have any specific thoughts on the inconsistency issue because it
seems that LongValueFacetCounts and IntTaxonomyFacets were designed for
different purposes?
The latter supports hierarchical values, needs an explicit specification
for multi values and supports the getSpecificValue API.
It does seem odd that different groups of facet classes treat counts
slightly differently.

As a side note:
I think we can make the
org.apache.lucene.facet.TestLongValueFacetCounts#testRandomMultiValued test
case more robust by forcing it to use at least one duplicate multi-value?

Thanks
- Gautam


On Sun, May 9, 2021 at 9:40 AM Greg Miller <gsmiller@gmail.com> wrote:

Re: FacetResult#value semantics?
Hi all,

We use the facets a lot to generate all kinds of nice aggregates on our
data, and we also needed to make the distinction between FIELD_COUNT and
DOCUMENT_COUNT, where the former increases for each multi-value, and the
latter only once for each document that contains that field at least once.

Maybe that distinction can be worded/implemented somehow to make it all
more consistent?
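
As a rough illustration of that distinction, here is a minimal sketch in plain Java. It is not Lucene code and not the implementation described above: `CountModeSketch`, the `CountMode` enum, and the `count` method are hypothetical names borrowed from the FIELD_COUNT / DOCUMENT_COUNT wording, and the sample data reuses the "movies watched" example from earlier in the thread.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class CountModeSketch {

  /** Hypothetical modes, named after the distinction described above. */
  public enum CountMode { FIELD_COUNT, DOCUMENT_COUNT }

  /** Counts each label; every inner list holds one doc's (possibly repeated) values. */
  public static Map<String, Integer> count(List<List<String>> docs, CountMode mode) {
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> values : docs) {
      Set<String> seenInDoc = new HashSet<>();
      for (String v : values) {
        // Set#add returns false when v was already seen in this doc.
        boolean firstInDoc = seenInDoc.add(v);
        // FIELD_COUNT counts every occurrence; DOCUMENT_COUNT only the first per doc.
        if (mode == CountMode.FIELD_COUNT || firstInDoc) {
          counts.merge(v, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // One doc per person, listing the movies that person watched (repeats allowed).
    List<List<String>> watched = List.of(
        List.of("Alien", "Alien", "Heat"),
        List.of("Alien"));

    System.out.println(count(watched, CountMode.FIELD_COUNT));    // {Alien=3, Heat=1}
    System.out.println(count(watched, CountMode.DOCUMENT_COUNT)); // {Alien=2, Heat=1}
  }
}
```

In this sketch DOCUMENT_COUNT is the number that lines up with "how many hits would a drill-down on this value return", while FIELD_COUNT answers "how many times was this value recorded".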

On Mon, May 10, 2021 at 5:02 PM Gautam Worah <worah.gautam@gmail.com> wrote:

Re: FacetResult#value semantics?
Thanks Mike/Gautam/Rob!

I've created LUCENE-9952 to track the work of making FacetResult#value
consistently report doc count (as the taxonomy-based implementations
do).

Rob (or anyone else), do you think there's value in _also_ reporting
"field count"? Sounds like you may have some use-cases for this.
Should we cut a separate issue to track adding a "field count" concept
to FacetResult?

Cheers,
-Greg

On Mon, May 10, 2021 at 8:05 AM Rob Audenaerde <rob.audenaerde@gmail.com> wrote:
Re: FacetResult#value semantics?
Hi Greg,

Honestly, I think the implementation should just focus on getting the
document counts correct and consistent in all places, and having the
javadoc explain what is happening. That said, it would be nice if the
implementations were easy to extend, so that people who want
field_count can implement it easily themselves :)

In our solution, we use the AssociatedFacetFields a lot and do aggregates
on them (like sum, max, avg), but I extended the base Facets class to
implement this.

-Rob

On Mon, May 10, 2021 at 6:36 PM Greg Miller <gsmiller@gmail.com> wrote:

Re: FacetResult#value semantics?
Thanks Rob! For now, I've focused on making the javadoc a bit more
clear and fixing the bug, but it's an interesting thought to see if
there's a better way to make this extensible. I don't think it would
be particularly challenging to do this, so if there's enough interest
in that functionality (i.e., making the calculation of
FacetResult#value more extensible) I think we could open a separate
issue to track?

Cheers,
-Greg

On Tue, May 11, 2021 at 12:23 AM Rob Audenaerde
<rob.audenaerde@gmail.com> wrote: