In an effort to prepare Elasticsearch for modularization, we are
investigating and eliminating split packages. The situation has improved
through recent refactoring in Lucene 9.0 [1], but a number of split
packages still remain. This message identifies one such so that it can
be discussed in isolation, with a view to a potential solution either in
Lucene or possibly within Elasticsearch itself.
Elasticsearch has a collapsing search collector that groups documents
based on field values and collapses based on the top sorted documents,
`CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
`FirstPassGroupingCollector` [3], and extends its functionality to
get the top docs in just a single pass. As a subclass, the CTDC
leverages the sorted top N unique groups by means of the protected
`FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
when performing the collapsing. Specifically, the
`CollectedSearchGroup.topDoc` field is of interest in order to retrieve
the number of the top document. The `topDoc` field is package-private
and therefore not normally accessible to the CTDC (without resorting to
nasty tricks!).
Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
as a set of `CollectedSearchGroup`, and that CSG is a public class, it
would appear that the lack of public access to its state is likely an
oversight, rather than a deliberate design decision (otherwise, from
outside the package CSG adds no apparent value and appears as if a
marker interface, which is not useful to any subclasses).
Minimally, the elasticsearch collector requires read access to the
`CollectedSearchGroup.topDoc` field. This could be achieved by adding
an accessor method to CSG, that returns the primitive int doc value. But
this whole API seems fairly low-level and powerful (you should know what
you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
makes its state available through public fields, so maybe we could just
make CSG's state public too?
lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
public class CollectedSearchGroup<T> extends SearchGroup<T> {
- int topDoc;
- int comparatorSlot;
+
+ /** The number of the top document. */
+ public int topDoc;
+
+ /** The field comparator slot. */
+ public int comparatorSlot;
}
-Chris.
[1] https://issues.apache.org/jira/browse/LUCENE-9319
[2] https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
[3] https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
investigating and eliminating split packages. The situation has improved
through recent refactoring in Lucene 9.0 [1], but a number of split
packages still remain. This message identifies one such so that it can
be discussed in isolation, with a view to a potential solution either in
Lucene or possibly within Elasticsearch itself.
Elasticsearch has a collapsing search collector that groups documents
based on field values and collapses based on the top sorted documents,
`CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
`FirstPassGroupingCollector` [3], and extends its functionality to
get the top docs in just a single pass. As a subclass, the CTDC
leverages the sorted top N unique groups by means of the protected
`FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
when performing the collapsing. Specifically, the
`CollectedSearchGroup.topDoc` field is of interest in order to retrieve
the number of the top document. The `topDoc` field is package-private
and therefore not normally accessible to the CTDC (without resorting to
nasty tricks!).
Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
as a set of `CollectedSearchGroup`, and that CSG is a public class, it
would appear that the lack of public access to its state is likely an
oversight, rather than a deliberate design decision (otherwise, from
outside the package CSG adds no apparent value and appears as if a
marker interface, which is not useful to any subclasses).
Minimally, the elasticsearch collector requires read access to the
`CollectedSearchGroup.topDoc` field. This could be achieved by adding
an accessor method to CSG, that returns the primitive int doc value. But
this whole API seems fairly low-level and powerful (you should know what
you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
makes its state available through public fields, so maybe we could just
make CSG's state public too?
lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
public class CollectedSearchGroup<T> extends SearchGroup<T> {
- int topDoc;
- int comparatorSlot;
+
+ /** The number of the top document. */
+ public int topDoc;
+
+ /** The field comparator slot. */
+ public int comparatorSlot;
}
-Chris.
[1] https://issues.apache.org/jira/browse/LUCENE-9319
[2] https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
[3] https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org