Mailing List Archive

Accessibility of CollectedSearchGroup's state
In an effort to prepare Elasticsearch for modularization, we are
investigating and eliminating split packages. The situation has improved
through recent refactoring in Lucene 9.0 [1], but a number of split
packages still remain. This message identifies one such so that it can
be discussed in isolation, with a view to a potential solution either in
Lucene or possibly within Elasticsearch itself.

Elasticsearch has a collapsing search collector that groups documents
based on field values and collapses based on the top sorted documents,
`CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
`FirstPassGroupingCollector` [3], and extends its functionality to
get the top docs in just a single pass. As a subclass, the CTDC
leverages the sorted top N unique groups by means of the protected
`FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
when performing the collapsing. Specifically, the
`CollectedSearchGroup.topDoc` field is of interest in order to retrieve
the number of the top document. The `topDoc` field is package-private
and therefore not normally accessible to the CTDC (without resorting to
nasty tricks!).

Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
as a set of `CollectedSearchGroup`, and that CSG is a public class, it
would appear that the lack of public access to its state is likely an
oversight, rather than a deliberate design decision (otherwise, from
outside the package CSG adds no apparent value and appears as if a
marker interface, which is not useful to any subclasses).

Minimally, the elasticsearch collector requires read access to the
`CollectedSearchGroup.topDoc` field. This could be achieved by adding
an accessor method to CSG, that returns the primitive int doc value. But
this whole API seems fairly low-level and powerful (you should know what
you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
makes its state available through public fields, so maybe we could just
make CSG's state public too?

lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java

public class CollectedSearchGroup<T> extends SearchGroup<T> {
- int topDoc;
- int comparatorSlot;
+
+ /** The number of the top document. */
+ public int topDoc;
+
+ /** The field comparator slot. */
+ public int comparatorSlot;
}

-Chris.

[1] https://issues.apache.org/jira/browse/LUCENE-9319
[2] https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
[3] https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Accessibility of CollectedSearchGroup's state [ In reply to ]
I feel sorry for increasing the scope of all these requests for changes
that you make, but the way Elasticsearch overrides this collector feels
wrong to me as any change in the implementation details of this collector
would probably break Elasticsearch's collector too. In my opinion,
CollectedSearchGroup should not even be public. My preference would be to
copy this collector to the Elasticsearch code base and fold the changes
from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super
familiar with this code, so I might be missing something. Maybe Jim or Alan
have an opinion.

On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty
<christopher.hegarty@elastic.co.invalid> wrote:

> In an effort to prepare Elasticsearch for modularization, we are
> investigating and eliminating split packages. The situation has improved
> through recent refactoring in Lucene 9.0 [1], but a number of split
> packages still remain. This message identifies one such so that it can
> be discussed in isolation, with a view to a potential solution either in
> Lucene or possibly within Elasticsearch itself.
>
> Elasticsearch has a collapsing search collector that groups documents
> based on field values and collapses based on the top sorted documents,
> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
> `FirstPassGroupingCollector` [3], and extends its functionality to
> get the top docs in just a single pass. As a subclass, the CTDC
> leverages the sorted top N unique groups by means of the protected
> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
> when performing the collapsing. Specifically, the
> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve
> the number of the top document. The `topDoc` field is package-private
> and therefore not normally accessible to the CTDC (without resorting to
> nasty tricks!).
>
> Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
> as a set of `CollectedSearchGroup`, and that CSG is a public class, it
> would appear that the lack of public access to its state is likely an
> oversight, rather than a deliberate design decision (otherwise, from
> outside the package CSG adds no apparent value and appears as if a
> marker interface, which is not useful to any subclasses).
>
> Minimally, the elasticsearch collector requires read access to the
> `CollectedSearchGroup.topDoc` field. This could be achieved by adding
> an accessor method to CSG, that returns the primitive int doc value. But
> this whole API seems fairly low-level and powerful (you should know what
> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
> makes its state available through public fields, so maybe we could just
> make CSG's state public too?
>
>
> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
>
> public class CollectedSearchGroup<T> extends SearchGroup<T> {
> - int topDoc;
> - int comparatorSlot;
> +
> + /** The number of the top document. */
> + public int topDoc;
> +
> + /** The field comparator slot. */
> + public int comparatorSlot;
> }
>
> -Chris.
>
> [1] https://issues.apache.org/jira/browse/LUCENE-9319
> [2]
> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
> [3]
> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Adrien
Re: Accessibility of CollectedSearchGroup's state [ In reply to ]
I agree, we should have a SinglePassGroupingCollector in Elasticsearch and
reduce the visibility of these expert classes in Lucene.
As it stands today, the FirstPassGroupingCollector could be a final class
imo.


Le jeu. 14 oct. 2021 à 18:42, Adrien Grand <jpountz@gmail.com> a écrit :

> I feel sorry for increasing the scope of all these requests for changes
> that you make, but the way Elasticsearch overrides this collector feels
> wrong to me as any change in the implementation details of this collector
> would probably break Elasticsearch's collector too. In my opinion,
> CollectedSearchGroup should not even be public. My preference would be to
> copy this collector to the Elasticsearch code base and fold the changes
> from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super
> familiar with this code, so I might be missing something. Maybe Jim or Alan
> have an opinion.
>
> On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty
> <christopher.hegarty@elastic.co.invalid> wrote:
>
>> In an effort to prepare Elasticsearch for modularization, we are
>> investigating and eliminating split packages. The situation has improved
>> through recent refactoring in Lucene 9.0 [1], but a number of split
>> packages still remain. This message identifies one such so that it can
>> be discussed in isolation, with a view to a potential solution either in
>> Lucene or possibly within Elasticsearch itself.
>>
>> Elasticsearch has a collapsing search collector that groups documents
>> based on field values and collapses based on the top sorted documents,
>> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
>> `FirstPassGroupingCollector` [3], and extends its functionality to
>> get the top docs in just a single pass. As a subclass, the CTDC
>> leverages the sorted top N unique groups by means of the protected
>> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
>> when performing the collapsing. Specifically, the
>> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve
>> the number of the top document. The `topDoc` field is package-private
>> and therefore not normally accessible to the CTDC (without resorting to
>> nasty tricks!).
>>
>> Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
>> as a set of `CollectedSearchGroup`, and that CSG is a public class, it
>> would appear that the lack of public access to its state is likely an
>> oversight, rather than a deliberate design decision (otherwise, from
>> outside the package CSG adds no apparent value and appears as if a
>> marker interface, which is not useful to any subclasses).
>>
>> Minimally, the elasticsearch collector requires read access to the
>> `CollectedSearchGroup.topDoc` field. This could be achieved by adding
>> an accessor method to CSG, that returns the primitive int doc value. But
>> this whole API seems fairly low-level and powerful (you should know what
>> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
>> makes its state available through public fields, so maybe we could just
>> make CSG's state public too?
>>
>>
>> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
>>
>> public class CollectedSearchGroup<T> extends SearchGroup<T> {
>> - int topDoc;
>> - int comparatorSlot;
>> +
>> + /** The number of the top document. */
>> + public int topDoc;
>> +
>> + /** The field comparator slot. */
>> + public int comparatorSlot;
>> }
>>
>> -Chris.
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-9319
>> [2]
>> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
>> [3]
>> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> Adrien
>
Re: Accessibility of CollectedSearchGroup's state [ In reply to ]
Hi Adrien, Jim,

First, thank you for your time and insightful comments. They are very much appreciated, and will no doubt lead to the best solution.

I will reimplement the elasticsearch collector as suggested, and then circle back here with a proposal for the reduction in accessibility most appropriate for Lucene.

-Chris.

P.S. Eliminating split packages shines a spotlight on tech debt. Each case different and requiring / deserving of its own investigation and analysis. I try to do as much of this upfront, but really it needs the valuable input from folk on this list. Thank you for your time.

> On 14 Oct 2021, at 19:05, jim ferenczi <jim.ferenczi@gmail.com> wrote:
>
> I agree, we should have a SinglePassGroupingCollector in Elasticsearch and reduce the visibility of these expert classes in Lucene.
> As it stands today, the FirstPassGroupingCollector could be a final class imo.
>
>
> Le jeu. 14 oct. 2021 à 18:42, Adrien Grand <jpountz@gmail.com <mailto:jpountz@gmail.com>> a écrit :
> I feel sorry for increasing the scope of all these requests for changes that you make, but the way Elasticsearch overrides this collector feels wrong to me as any change in the implementation details of this collector would probably break Elasticsearch's collector too. In my opinion, CollectedSearchGroup should not even be public. My preference would be to copy this collector to the Elasticsearch code base and fold the changes from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super familiar with this code, so I might be missing something. Maybe Jim or Alan have an opinion.