Mailing List Archive: Finding out which fields matched the query

Finding out which fields matched the query

Jun 24, 2022, 8:14 PM

Post #1 of 16 (899 views)

Hello!

I’m an MSCS student at BU and I am learning to use Lucene. I am trying to
find out which fields matched a query. For example, if a document has
10 fields and 2 of them match the query, I want to get the names of
those two fields.

I have tried calling the explain() method and parsing its description with
a regex. However, that takes too much time.

What is an efficient way to get the matched fields? Would you
please offer some help? Thank you so much!

Best regards,
Yichen Sun

Re: Finding out which fields matched the query [ In reply to ]

serera at gmail

Jun 26, 2022, 12:52 AM

Post #2 of 16 (899 views)

Hi Yichen,

I think you can implement a custom Collector which tracks the fields that
were matched for each Scorer. Here is an example of such a Collector:

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermScorer;

public class FieldMatchingCollector implements Collector {

  /** Holds the number of matching documents for each field. */
  public final Map<String, Integer> matchingFieldCounts = new HashMap<>();

  /** Holds which fields were matched for each document (keyed by index-wide doc ID). */
  public final Map<Integer, Set<String>> docMatchingFields = new HashMap<>();

  private final Set<Scorer> termScorers = new HashSet<>();

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) {
    final int docBase = context.docBase;
    return new LeafCollector() {

      @Override
      public void setScorer(Scorable scorer) throws IOException {
        // Rebuild the set of leaf term scorers for this segment.
        termScorers.clear();
        getSubTermScorers(scorer, termScorers);
      }

      @Override
      public void collect(int doc) {
        int basedDoc = doc + docBase;
        for (Scorer scorer : termScorers) {
          // A term scorer positioned on this doc means its term occurs in the doc.
          if (doc == scorer.docID()) {
            // We know that we're dealing w/ TermScorers
            String matchingField =
                ((TermQuery) scorer.getWeight().getQuery()).getTerm().field();
            docMatchingFields.computeIfAbsent(basedDoc, d -> new HashSet<>()).add(matchingField);
            matchingFieldCounts.merge(matchingField, 1, Integer::sum);
          }
        }
      }
    };
  }

  /** Recursively collects all TermScorer leaves below the given scorer. */
  private void getSubTermScorers(Scorable scorer, Set<Scorer> set) throws IOException {
    if (scorer instanceof TermScorer) {
      set.add((Scorer) scorer);
    } else {
      for (Scorable.ChildScorable child : scorer.getChildren()) {
        getSubTermScorers(child.child, set);
      }
    }
  }
}

This is of course an example implementation and you can optimize it to
match your needs (e.g. if you're only interested in the set of matching fields
you can change "matchingFieldCounts" to a Set<String>). Note that
"docMatchingFields" is expensive; I've only included it as an example (and
for debugging purposes), and I recommend omitting it in a real application.

To use it you can do something like:

// Need to use this searcher to guarantee the bulk scorer API isn't used.
IndexSearcher searcher = new ScorerIndexSearcher(reader);

// Parse the query to match against a list of searchable fields
QueryParser qp = new MultiFieldQueryParser(FIELDS_TO_SEARCH_ON, new StandardAnalyzer());
Query query = qp.parse(queryText);

// Collect the matching fields
FieldMatchingCollector fieldMatchingCollector = new FieldMatchingCollector();
// If needed, collect the top matching documents too
TopScoreDocCollector topScoreDocCollector =
    TopScoreDocCollector.create(10, Integer.MAX_VALUE);
searcher.search(query, MultiCollector.wrap(topScoreDocCollector, fieldMatchingCollector));

System.out.println("matchingFieldCounts = " + fieldMatchingCollector.matchingFieldCounts);
System.out.println("docMatchingFields = " + fieldMatchingCollector.docMatchingFields);
System.out.println("totalHits = " + topScoreDocCollector.getTotalHits());

Hope this helps!

Shai

Re: Finding out which fields matched the query [ In reply to ]

jornfranke at gmail

Jun 26, 2022, 11:52 PM

Post #3 of 16 (899 views)

What is the reason you need the matched fields? Maybe your use case can be solved with something completely different from knowing which fields were matched.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Finding out which fields matched the query [ In reply to ]

romseygeek at gmail

Jun 27, 2022, 1:48 AM

Post #4 of 16 (899 views)

The Matches API will give you this information - it’s still likely to be fairly slow, but it’s a lot easier to use than trying to parse Explain output.

Query query = ….;
Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);

// matches() returns null if the document does not match the query
Matches m = w.matches(context, doc);
List<String> matchingFields = new ArrayList<>();
if (m != null) {
  for (String field : m) {
    matchingFields.add(field);
  }
}

Bear in mind that `matches` doesn’t maintain any state between calls, so calling it for every matching document is likely to be slow; for those cases Shai’s suggestion of using a Collector and examining low-level scorers will perform better, but it won’t work for every query type.

Re: Finding out which fields matched the query [ In reply to ]

dawid.weiss at gmail

Jun 27, 2022, 1:56 AM

Post #5 of 16 (899 views)

The matches API is awesome. Use it. You can also get a rough glimpse
into a superset of fields potentially matching the query via:

query.visit(
    new QueryVisitor() {
      @Override
      public boolean acceptField(String field) {
        affectedFields.add(field);
        return false;
      }
    });

https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)

I'd go with the Matches API though.

Dawid

Re: Finding out which fields matched the query [ In reply to ]

serera at gmail

Jun 27, 2022, 2:51 AM

Post #6 of 16 (899 views)

Out of curiosity and for education purposes, is the Collector approach I
proposed wrong/inefficient? Or less efficient than the matches() API?

I'm thinking, if you want to both match/rank documents and as a side effect
know which fields matched, the Collector will perform better than
Weight.matches(), but I could be wrong.

Shai

Re: Finding out which fields matched the query [ In reply to ]

uwe at thetaphi

Jun 27, 2022, 3:00 AM

Post #7 of 16 (899 views)

I think the collector approach is perfectly fine for mass-processing of
queries.

By the way: Elasticsearch/OpenSearch already have a built-in feature that
works on top of the collector API in a way similar to what you described
(as far as I remember). It is a bit different in that you can tag any clause
in a BQ (so any query) with a "name" (they call it a "named query",
https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
When you get the search results, each hit tells you which named
queries matched it. The actual implementation is a wrapper query around
each of those clauses that holds the name. During hit collection it just
collects all named query instances found in the query tree. I think that in
their implementation the wrapper query's scorer somehow adds the name to
some global state.

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de

Re: Finding out which fields matched the query [ In reply to ]

serera at gmail

Jun 27, 2022, 3:23 AM

Post #8 of 16 (899 views)

Thanks Uwe, I didn't know about named queries, but it seems useful. Is
there interest in getting similar functionality in Lucene, or perhaps just
the FieldMatching collector? I'd be happy to PR-it.

As for the use case, I was thinking of using something similar to this collector
for some kind of (simple) entity recognition task. If you have a corpus of
documents with many fields which denote product attributes, you could match
a word like "Red" to the various product attribute fields and determine
based on the matching fields + their doc count whether this word likely
represents a Color or Brand entity (hint: it matches both, the question is
which is more probable).

I'm sure there are other ways to achieve this, and probably much smarter
NER implementations, but this one is at least based on the actual data that
you index which guarantees something about the results you will receive if
applying a certain attribute filtering.
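
To make the idea above concrete, here is a minimal sketch in plain Java (no Lucene dependency) of how the per-field document counts could be turned into a guess about the entity type. The class name, the method names and the sample counts are all hypothetical; the only assumption is that the input has the same shape as the "matchingFieldCounts" map from the collector posted earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntityGuess {

    /** Converts raw per-field match counts into shares of the total. */
    public static Map<String, Double> fieldShares(Map<String, Integer> matchingFieldCounts) {
        long total = 0;
        for (int count : matchingFieldCounts.values()) {
            total += count;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : matchingFieldCounts.entrySet()) {
            // Guard against an empty or all-zero map.
            shares.put(e.getKey(), total == 0 ? 0.0 : (double) e.getValue() / total);
        }
        return shares;
    }

    /** Returns the field with the highest match count, or null if the map is empty. */
    public static String mostLikelyField(Map<String, Integer> matchingFieldCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : matchingFieldCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the query word "red"; not real data.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("color", 900);
        counts.put("brand", 100);
        System.out.println("most likely field: " + mostLikelyField(counts));
        System.out.println("shares: " + fieldShares(counts));
    }
}
```

Picking the field with the highest raw count is the simplest possible rule; a real application would probably also want to normalize by how many documents populate each field at all.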

Shai

Re: Finding out which fields matched the query [ In reply to ]

romseygeek at gmail

Jun 27, 2022, 4:10 AM

Post #9 of 16 (899 views)

Your approach is almost certainly more efficient, but it might give you false matches in some cases - for example, if you have a complex query with many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is positioned on the correct document, but which is part of a clause that doesn’t actually match. It also only works for term queries, so it won’t match phrases or span/interval groups. And Matches will work on points or docvalues queries as well. The reason I added Matches in the first place was precisely to handle these weird corner cases - I had written highlighters which more or less did the same thing you describe with a Collector and the Scorable tree, and I would occasionally get bad highlights back.
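
The false-match case can be illustrated with a toy model in plain Java. This is not Lucene code: the tiny query tree, the class names and the sample document below are all made up for illustration, and real scorer positions depend on how the query is iterated. The sketch only contrasts "which leaf terms occur in the document" with "which leaf terms sit under a clause that matches as a whole".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class FalseMatchDemo {

    /** A node in a toy boolean query tree; a document is a set of "field:term" strings. */
    public interface Node {
        boolean matches(Set<String> doc);
    }

    static final class Term implements Node {
        final String field;
        final String text;
        Term(String field, String text) { this.field = field; this.text = text; }
        public boolean matches(Set<String> doc) { return doc.contains(field + ":" + text); }
    }

    /** Boolean clause: either all children are required (MUST) or at least one (SHOULD). */
    static final class Bool implements Node {
        final boolean allRequired;
        final List<Node> children;
        Bool(boolean allRequired, Node... children) {
            this.allRequired = allRequired;
            this.children = Arrays.asList(children);
        }
        public boolean matches(Set<String> doc) {
            for (Node c : children) {
                if (c.matches(doc) != allRequired) return !allRequired;
            }
            return allRequired;
        }
    }

    /** Leaf-only check: every term present in the doc counts, ignoring its parent clause. */
    public static Set<String> naiveFields(Node node, Set<String> doc) {
        Set<String> out = new TreeSet<>();
        collect(node, doc, out, false);
        return out;
    }

    /** Clause-aware check: only descend into sub-clauses that match as a whole. */
    public static Set<String> preciseFields(Node node, Set<String> doc) {
        Set<String> out = new TreeSet<>();
        collect(node, doc, out, true);
        return out;
    }

    private static void collect(Node node, Set<String> doc, Set<String> out, boolean clauseAware) {
        if (node instanceof Term) {
            if (node.matches(doc)) out.add(((Term) node).field);
        } else if (!clauseAware || node.matches(doc)) {
            for (Node c : ((Bool) node).children) collect(c, doc, out, clauseAware);
        }
    }

    /** Sample query: (+color:red +brand:acme) OR title:red. */
    public static Node sampleQuery() {
        Node must = new Bool(true, new Term("color", "red"), new Term("brand", "acme"));
        return new Bool(false, must, new Term("title", "red"));
    }

    /** Sample document: contains color:red and title:red, but not brand:acme. */
    public static Set<String> sampleDoc() {
        return new HashSet<>(Arrays.asList("color:red", "title:red"));
    }

    public static void main(String[] args) {
        System.out.println("naive:   " + naiveFields(sampleQuery(), sampleDoc()));
        System.out.println("precise: " + preciseFields(sampleQuery(), sampleDoc()));
    }
}
```

In this model the document matches only through the title clause, yet the leaf-only check also reports "color", because that term occurs in the document even though the conjunction it belongs to fails.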

Re: Finding out which fields matched the query [ In reply to ]

dawid.weiss at gmail

Jun 27, 2022, 4:21 AM

Post #10 of 16 (899 views)

A side note - I've been using a highlighter based on matches API for
quite some time now and it's been fantastic. Very precise and handles
non-trivial queries (interval queries) very well.

https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html

Dawid

Re: Finding out which fields matched the query [ In reply to ]

serera at gmail

Jun 27, 2022, 4:46 AM

Post #11 of 16 (899 views)

Thanks Alan, yeah I guess I was thinking about the use case I described,
which involves (usually) simple term queries, but you're definitely right
about complex boolean clauses as well as non-term queries.

I think the case for highlighter is different though? I mean you usually
generate highlights only for the top-K results and therefore are probably
less affected by whether the matches() API is slower than a Collector. And
if you invoke the API for every document in the index, it might be much
slower (depending on the index size) than the Collector.

Maybe a hybrid approach which runs the query and caches the docs in a
DocIdSet (like FacetsCollector does) and then invokes the matches() API
only on those hits, will let you enjoy the best of both worlds? Assuming
though that the number of matching documents is not huge.

So it seems there are several options and one should choose based on their
use case. Do you see an advantage for Lucene to offer a Collector for this
use case? Or should we tell users to use the matches API?

Shai


Re: Finding out which fields matched the query [ In reply to ]

jpountz at gmail

Jun 27, 2022, 4:48 AM

Post #12 of 16 (899 views)

Uwe,

Elasticsearch's named queries do not actually use a collector. After the top
hits have been evaluated for the whole query, the named queries are evaluated
independently on each of the top hits. It's probably faster than the
collector approach since it doesn't add per-document overhead to
collection, but also less flexible since it cannot compute statistics
across all matches.


--
Adrien

Re: Finding out which fields matched the query [ In reply to ]

uwe at thetaphi

Jun 27, 2022, 5:25 AM

Post #13 of 16 (899 views)

Hi Adrien,

Maybe it changed a bit, but the last time I looked into it, it was somehow
wrapping all queries in a wrapper "NamedQuery" or similar. When it
collected hits, it was able to figure out, through a wrapper somewhere around
weight/scorer/DISI, that the query was a hit, and set a flag. It could be
that this bit is only set when it goes into the topdocs, but in general
the work was done at collection phase.

I use this feature quite often, also when scanning results, and it is about
as fast as without named queries (at least for my queries - maybe the
result scanning and data transfer took longer than the overhead).

Uwe

P.S.: We at PANGAEA use the feature to implement our "OAI-PMH sets"
(Open Archives Initiative Protocol for Metadata Harvesting, a standard API
used in the library world). This is for data centers harvesting our metadata:
all the delivered results dynamically get tagged with their assigned sets
(represented as queries). All those set queries are added as named
SHOULD queries to the main query, and for each result it returns which
set a PANGAEA dataset belongs to (as this is required by the protocol).


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de

Re: Finding out which fields matched the query [ In reply to ]

wunder at wunderwood

Jun 27, 2022, 8:11 AM

Post #14 of 16 (899 views)

For a quick hack, you can use highlighting. That does more than you want, showing which words match, but it does have the info.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)


Re: Finding out which fields matched the query [ In reply to ]

romseygeek at gmail

Jun 28, 2022, 6:08 AM

Post #15 of 16 (899 views)

I think it depends on what information we actually want to get here. If it’s just finding which fields matched in which document, then running Matches over the top-k results is fine. If you want to get some kind of aggregate data, as in you want to get a list of fields that matched in *any* document (or conversely, a list of fields that *didn’t* match - useful if you want to prune your schema, for example), then Matches will be too slow. But at the same time, queries are designed to tell you which *documents* match efficiently, and they are allowed to advance their sub-queries lazily or indeed not at all if the result isn’t needed for scoring. So we don’t really have any way of finding this kind of information via a collector that is accurate and performs reasonably.
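
A sketch of running Matches over the top-k results, assuming Lucene 9.x on the classpath. The class and method names (TopHitFieldMatches, matchingFields) are made up for illustration; the Lucene calls are the createWeight/matches calls already shown in this thread, plus ReaderUtil.subIndex to locate the segment of each hit. Note that Weight.matches() expects a segment-local doc ID, so the segment's docBase is subtracted first.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Matches;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;

/** Hypothetical helper: reports the matching fields for each of the top-k hits. */
final class TopHitFieldMatches {

  static Map<Integer, Set<String>> matchingFields(IndexSearcher searcher, Query query, int topK)
      throws IOException {
    // Phase 1: ordinary ranked search, no per-document bookkeeping.
    TopDocs topDocs = searcher.search(query, topK);

    // Phase 2: ask the Matches API about each hit only.
    Weight weight =
        searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();

    Map<Integer, Set<String>> fieldsByDoc = new LinkedHashMap<>();
    for (ScoreDoc hit : topDocs.scoreDocs) {
      // Find the segment containing this global doc ID.
      LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
      // matches() takes a segment-local doc ID and returns null for non-matching docs.
      Matches matches = weight.matches(leaf, hit.doc - leaf.docBase);
      Set<String> fields = new TreeSet<>();
      if (matches != null) {
        for (String field : matches) {
          fields.add(field);
        }
      }
      fieldsByDoc.put(hit.doc, fields);
    }
    return fieldsByDoc;
  }
}
```

The cost here is proportional to k rather than to the total number of matching documents, which is why this variant does not help with the aggregate case described above.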

It *might* be possible to rework Matches so that they act more like an iterator and maintain their state within a segment, but there hasn’t been a pressing need for that so far.


Re: Finding out which fields matched the query [ In reply to ]

serera at gmail

Jun 28, 2022, 11:04 PM

Post #16 of 16 (896 views)

I think it's a matter of trade-offs. For example, when you do faceting we
require complete evaluation, and since this field matching is a kind of
aggregation, I think it's OK if that's how it works. Users can choose which
technique they want to apply based on their use case.
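
To illustrate the "kind of aggregation" point: once per-document matching fields are available (by whatever technique), the per-field counts are a simple fold over them, much like facet counting. A minimal sketch in plain Java; the class name FieldMatchCounts is made up, and the input map mirrors the docMatchingFields map from the collector posted earlier in the thread.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: aggregates per-document matching fields into per-field doc counts. */
final class FieldMatchCounts {

  /** For each field, counts the number of documents in which that field matched. */
  static Map<String, Integer> countByField(Map<Integer, Set<String>> docMatchingFields) {
    Map<String, Integer> counts = new TreeMap<>();
    for (Set<String> fields : docMatchingFields.values()) {
      for (String field : fields) {
        // Each document contributes at most one count per field because fields is a Set.
        counts.merge(field, 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

As with facets, the counts are only complete if every matching document was visited, which is where the complete-evaluation requirement comes from.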

Anyway, I don't think we must introduce this kind of collector in Lucene;
it's definitely something someone can write in their own project.

Shai
