Mailing List Archive

debugging query execution plan
Do we have a way to understand how BooleanQuery (and other composite
queries) are advancing their child queries? For example, a simple
conjunction of two queries advances the more restrictive (lower
cost()) query first, enabling the more costly query to skip over more
documents. But we may not be making the best choice in every case, and
I would like to know, for some query, how we are doing. For example,
we could execute in a debugging mode, interposing something that wraps
or observes the Scorers in some way, gathering statistics about how
many documents are visited by each Scorer, which can be aggregated for
later analysis.

This is motivated by a use case we have in which we currently
post-filter our query results in a custom collector using some filters
that we know to be expensive (they must be evaluated on every
document), but we would rather express these post-filters as Queries
and have them advanced during the main Query execution. However when
we tried to do that, we saw some slowdowns (in spite of marking these
Queries as high-cost) and I suspect it is due to the iteration order,
but I'm not sure how to debug.

Suggestions welcome!

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: debugging query execution plan [ In reply to ]
Generally you go over the least frequent to most frequent required terms as
the pivot. As you get more and more complicated queries, the ordering of
the query results tends to dominate. This also means there are two types
of measurement. One for the running query and one for query + results in
order.

tim

On Thu, May 6, 2021 at 12:20 PM Michael Sokolov <msokolov@gmail.com> wrote:

> Do we have a way to understand how BooleanQuery (and other composite
> queries) are advancing their child queries? For example, a simple
> conjunction of two queries advances the more restrictive (lower
> cost()) query first, enabling the more costly query to skip over more
> documents. But we may not be making the best choice in every case, and
> I would like to know, for some query, how we are doing. For example,
> we could execute in a debugging mode, interposing something that wraps
> or observes the Scorers in some way, gathering statistics about how
> many documents are visited by each Scorer, which can be aggregated for
> later analysis.
>
> This is motivated by a use case we have in which we currently
> post-filter our query results in a custom collector using some filters
> that we know to be expensive (they must be evaluated on every
> document), but we would rather express these post-filters as Queries
> and have them advanced during the main Query execution. However when
> we tried to do that, we saw some slowdowns (in spite of marking these
> Queries as high-cost) and I suspect it is due to the iteration order,
> but I'm not sure how to debug.
>
> Suggestions welcome!
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: debugging query execution plan [ In reply to ]
We have something like that in Elasticsearch that wraps queries in order to
be able to report cost, matchCost and the number of calls to
nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
the query tree.

It's not perfect as it needs to disable some optimizations in order to work
properly. For instance bulk scorers are disabled and conjunctions are not
inlined, which means that clauses may run in a different order. So results
need to be interpreted carefully as the way the query gets executed when
observed may differ a bit from how it gets executed normally. That said it
has still been useful in a number of cases. I don't think our
implementation works when IndexSearcher is configured with an executor but
we could maybe put it in sandbox and iterate from there?

For your case, do you think it could be attributed to deleted docs? Deleted
docs are checked before two-phase confirmation and collectors but after
disjunctions/conjunctions of postings.

Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :

> Do we have a way to understand how BooleanQuery (and other composite
> queries) are advancing their child queries? For example, a simple
> conjunction of two queries advances the more restrictive (lower
> cost()) query first, enabling the more costly query to skip over more
> documents. But we may not be making the best choice in every case, and
> I would like to know, for some query, how we are doing. For example,
> we could execute in a debugging mode, interposing something that wraps
> or observes the Scorers in some way, gathering statistics about how
> many documents are visited by each Scorer, which can be aggregated for
> later analysis.
>
> This is motivated by a use case we have in which we currently
> post-filter our query results in a custom collector using some filters
> that we know to be expensive (they must be evaluated on every
> document), but we would rather express these post-filters as Queries
> and have them advanced during the main Query execution. However when
> we tried to do that, we saw some slowdowns (in spite of marking these
> Queries as high-cost) and I suspect it is due to the iteration order,
> but I'm not sure how to debug.
>
> Suggestions welcome!
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: debugging query execution plan [ In reply to ]
Thanks Adrien, that is something like what I had in mind. If you are
able to share, that could be very helpful. And -- deleted docs is not
something I had considered, it's possibly a problem here. I'd have to
go check - I think these "filter" Queries were implemented in the
second part of the two-phase iteration.

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>
> We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>
> It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>
> For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>
> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :
>>
>> Do we have a way to understand how BooleanQuery (and other composite
>> queries) are advancing their child queries? For example, a simple
>> conjunction of two queries advances the more restrictive (lower
>> cost()) query first, enabling the more costly query to skip over more
>> documents. But we may not be making the best choice in every case, and
>> I would like to know, for some query, how we are doing. For example,
>> we could execute in a debugging mode, interposing something that wraps
>> or observes the Scorers in some way, gathering statistics about how
>> many documents are visited by each Scorer, which can be aggregated for
>> later analysis.
>>
>> This is motivated by a use case we have in which we currently
>> post-filter our query results in a custom collector using some filters
>> that we know to be expensive (they must be evaluated on every
>> document), but we would rather express these post-filters as Queries
>> and have them advanced during the main Query execution. However when
>> we tried to do that, we saw some slowdowns (in spite of marking these
>> Queries as high-cost) and I suspect it is due to the iteration order,
>> but I'm not sure how to debug.
>>
>> Suggestions welcome!
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: debugging query execution plan [ In reply to ]
Instead of a Collector, why isn't this a TwoPhaseIterator with a high
matchCost?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <msokolov@gmail.com> wrote:

> Thanks Adrien, that is something like what I had in mind. If you are
> able to share, that could be very helpful. And -- deleted docs is not
> something I had considered, it's possibly a problem here. I'd have to
> go check - I think these "filter" Queries were implemented in the
> second part of the two-phase iteration.
>
> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > We have something like that in Elasticsearch that wraps queries in order
> to be able to report cost, matchCost and the number of calls to
> nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
> the query tree.
> >
> > It's not perfect as it needs to disable some optimizations in order to
> work properly. For instance bulk scorers are disabled and conjunctions are
> not inlined, which means that clauses may run in a different order. So
> results need to be interpreted carefully as the way the query gets executed
> when observed may differ a bit from how it gets executed normally. That
> said it has still been useful in a number of cases. I don't think our
> implementation works when IndexSearcher is configured with an executor but
> we could maybe put it in sandbox and iterate from there?
> >
> > For your case, do you think it could be attributed to deleted docs?
> Deleted docs are checked before two-phase confirmation and collectors but
> after disjunctions/conjunctions of postings.
> >
> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a
> écrit :
> >>
> >> Do we have a way to understand how BooleanQuery (and other composite
> >> queries) are advancing their child queries? For example, a simple
> >> conjunction of two queries advances the more restrictive (lower
> >> cost()) query first, enabling the more costly query to skip over more
> >> documents. But we may not be making the best choice in every case, and
> >> I would like to know, for some query, how we are doing. For example,
> >> we could execute in a debugging mode, interposing something that wraps
> >> or observes the Scorers in some way, gathering statistics about how
> >> many documents are visited by each Scorer, which can be aggregated for
> >> later analysis.
> >>
> >> This is motivated by a use case we have in which we currently
> >> post-filter our query results in a custom collector using some filters
> >> that we know to be expensive (they must be evaluated on every
> >> document), but we would rather express these post-filters as Queries
> >> and have them advanced during the main Query execution. However when
> >> we tried to do that, we saw some slowdowns (in spite of marking these
> >> Queries as high-cost) and I suspect it is due to the iteration order,
> >> but I'm not sure how to debug.
> >>
> >> Suggestions welcome!
> >>
> >> -Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: debugging query execution plan [ In reply to ]
Just chiming in here to answer David's question since I have some familiarity:

In this specific case, the logic was implemented inside a Collector
and we tried to move it into a Query abstraction using a
TwoPhaseIterator with a high matchCost. The first-phase would match on
all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
second phase would do the costly check. The matchCost was advertised
as reader.maxDoc(). ("reader" in this example is from the
LeafReaderContext).

Moving the logic behind a Query abstraction caused performance
regressions. So one theory is that it was somehow leading iteration
with an expensive "match all docs" DISI, but we don't actually know if
that's true right now.

Cheers,
-Greg

On Fri, May 7, 2021 at 8:41 AM David Smiley <dsmiley@apache.org> wrote:
>
> Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> Thanks Adrien, that is something like what I had in mind. If you are
>> able to share, that could be very helpful. And -- deleted docs is not
>> something I had considered, it's possibly a problem here. I'd have to
>> go check - I think these "filter" Queries were implemented in the
>> second part of the two-phase iteration.
>>
>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>> >
>> > We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>> >
>> > It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>> >
>> > For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>> >
>> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :
>> >>
>> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> queries) are advancing their child queries? For example, a simple
>> >> conjunction of two queries advances the more restrictive (lower
>> >> cost()) query first, enabling the more costly query to skip over more
>> >> documents. But we may not be making the best choice in every case, and
>> >> I would like to know, for some query, how we are doing. For example,
>> >> we could execute in a debugging mode, interposing something that wraps
>> >> or observes the Scorers in some way, gathering statistics about how
>> >> many documents are visited by each Scorer, which can be aggregated for
>> >> later analysis.
>> >>
>> >> This is motivated by a use case we have in which we currently
>> >> post-filter our query results in a custom collector using some filters
>> >> that we know to be expensive (they must be evaluated on every
>> >> document), but we would rather express these post-filters as Queries
>> >> and have them advanced during the main Query execution. However when
>> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> but I'm not sure how to debug.
>> >>
>> >> Suggestions welcome!
>> >>
>> >> -Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: debugging query execution plan [ In reply to ]
Thanks for the clarification Greg. I've been looking into this recently and
filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch
that these DocIdSetIterator.all(maxDoc) iterators have a
non-negligible cost inside ConjunctionDISI. Ultimately I closed the issue
because the TPI design seems to prohibit removing them :-(. Feel free to
comment there nonetheless if you have any thoughts on the matter. For my
part, I have some benchmarking to do in Solr for a related matter that
would move certain queries that work at the collector stage to be TPIs
-- SOLR-14164.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, May 7, 2021 at 6:06 PM Greg Miller <gsmiller@gmail.com> wrote:

> Just chiming in here to answer David's question since I have some
> familiarity:
>
> In this specific case, the logic was implemented inside a Collector
> and we tried to move it into a Query abstraction using a
> TwoPhaseIterator with a high matchCost. The first-phase would match on
> all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
> second phase would do the costly check. The matchCost was advertised
> as reader.maxDoc(). ("reader" in this example is from the
> LeafReaderContext).
>
> Moving the logic behind a Query abstraction caused performance
> regressions. So one theory is that it was somehow leading iteration
> with an expensive "match all docs" DISI, but we don't actually know if
> that's true right now.
>
> Cheers,
> -Greg
>
> On Fri, May 7, 2021 at 8:41 AM David Smiley <dsmiley@apache.org> wrote:
> >
> > Instead of a Collector, why isn't this a TwoPhaseIterator with a high
> matchCost?
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <msokolov@gmail.com>
> wrote:
> >>
> >> Thanks Adrien, that is something like what I had in mind. If you are
> >> able to share, that could be very helpful. And -- deleted docs is not
> >> something I had considered, it's possibly a problem here. I'd have to
> >> go check - I think these "filter" Queries were implemented in the
> >> second part of the two-phase iteration.
> >>
> >> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
> >> >
> >> > We have something like that in Elasticsearch that wraps queries in
> order to be able to report cost, matchCost and the number of calls to
> nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
> the query tree.
> >> >
> >> > It's not perfect as it needs to disable some optimizations in order
> to work properly. For instance bulk scorers are disabled and conjunctions
> are not inlined, which means that clauses may run in a different order. So
> results need to be interpreted carefully as the way the query gets executed
> when observed may differ a bit from how it gets executed normally. That
> said it has still been useful in a number of cases. I don't think our
> implementation works when IndexSearcher is configured with an executor but
> we could maybe put it in sandbox and iterate from there?
> >> >
> >> > For your case, do you think it could be attributed to deleted docs?
> Deleted docs are checked before two-phase confirmation and collectors but
> after disjunctions/conjunctions of postings.
> >> >
> >> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a
> écrit :
> >> >>
> >> >> Do we have a way to understand how BooleanQuery (and other composite
> >> >> queries) are advancing their child queries? For example, a simple
> >> >> conjunction of two queries advances the more restrictive (lower
> >> >> cost()) query first, enabling the more costly query to skip over more
> >> >> documents. But we may not be making the best choice in every case,
> and
> >> >> I would like to know, for some query, how we are doing. For example,
> >> >> we could execute in a debugging mode, interposing something that
> wraps
> >> >> or observes the Scorers in some way, gathering statistics about how
> >> >> many documents are visited by each Scorer, which can be aggregated
> for
> >> >> later analysis.
> >> >>
> >> >> This is motivated by a use case we have in which we currently
> >> >> post-filter our query results in a custom collector using some
> filters
> >> >> that we know to be expensive (they must be evaluated on every
> >> >> document), but we would rather express these post-filters as Queries
> >> >> and have them advanced during the main Query execution. However when
> >> >> we tried to do that, we saw some slowdowns (in spite of marking these
> >> >> Queries as high-cost) and I suspect it is due to the iteration order,
> >> >> but I'm not sure how to debug.
> >> >>
> >> >> Suggestions welcome!
> >> >>
> >> >> -Mike
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: debugging query execution plan [ In reply to ]
Thanks David, I took a look at the Jira and see the problem with
trying to do away with the "all docs" approximation. We'll be
revisiting our performance regression on our side soon, so hopefully
we'll get to the bottom of what's going on, particularly around
whether-or-not we're somehow leading iteration with this "all docs"
approximation. If we find something interesting, I'll report back.
Thanks again!

Cheers,
-Greg

On Fri, May 7, 2021 at 6:33 PM David Smiley <dsmiley@apache.org> wrote:
>
> Thanks for the clarification Greg. I've been looking into this recently and filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch that these DocIdSetIterator.all(maxDoc) iterators have a non-negligible cost inside ConjunctionDISI. Ultimately I closed the issue because the TPI design seems to prohibit removing them :-(. Feel free to comment there nonetheless if you have any thoughts on the matter. For my part, I have some benchmarking to do in Solr for a related matter that would move certain queries that work at the collector stage to be TPIs -- SOLR-14164.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, May 7, 2021 at 6:06 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>> Just chiming in here to answer David's question since I have some familiarity:
>>
>> In this specific case, the logic was implemented inside a Collector
>> and we tried to move it into a Query abstraction using a
>> TwoPhaseIterator with a high matchCost. The first-phase would match on
>> all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the
>> second phase would do the costly check. The matchCost was advertised
>> as reader.maxDoc(). ("reader" in this example is from the
>> LeafReaderContext).
>>
>> Moving the logic behind a Query abstraction caused performance
>> regressions. So one theory is that it was somehow leading iteration
>> with an expensive "match all docs" DISI, but we don't actually know if
>> that's true right now.
>>
>> Cheers,
>> -Greg
>>
>> On Fri, May 7, 2021 at 8:41 AM David Smiley <dsmiley@apache.org> wrote:
>> >
>> > Instead of a Collector, why isn't this a TwoPhaseIterator with a high matchCost?
>> >
>> > ~ David Smiley
>> > Apache Lucene/Solr Search Developer
>> > http://www.linkedin.com/in/davidwsmiley
>> >
>> >
>> > On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <msokolov@gmail.com> wrote:
>> >>
>> >> Thanks Adrien, that is something like what I had in mind. If you are
>> >> able to share, that could be very helpful. And -- deleted docs is not
>> >> something I had considered, it's possibly a problem here. I'd have to
>> >> go check - I think these "filter" Queries were implemented in the
>> >> second part of the two-phase iteration.
>> >>
>> >> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>> >> >
>> >> > We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>> >> >
>> >> > It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>> >> >
>> >> > For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>> >> >
>> >> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :
>> >> >>
>> >> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> >> queries) are advancing their child queries? For example, a simple
>> >> >> conjunction of two queries advances the more restrictive (lower
>> >> >> cost()) query first, enabling the more costly query to skip over more
>> >> >> documents. But we may not be making the best choice in every case, and
>> >> >> I would like to know, for some query, how we are doing. For example,
>> >> >> we could execute in a debugging mode, interposing something that wraps
>> >> >> or observes the Scorers in some way, gathering statistics about how
>> >> >> many documents are visited by each Scorer, which can be aggregated for
>> >> >> later analysis.
>> >> >>
>> >> >> This is motivated by a use case we have in which we currently
>> >> >> post-filter our query results in a custom collector using some filters
>> >> >> that we know to be expensive (they must be evaluated on every
>> >> >> document), but we would rather express these post-filters as Queries
>> >> >> and have them advanced during the main Query execution. However when
>> >> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> >> but I'm not sure how to debug.
>> >> >>
>> >> >> Suggestions welcome!
>> >> >>
>> >> >> -Mike
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: debugging query execution plan [ In reply to ]
+1 to start from the Elasticsearch implementation for low-level query
execution tracing, which I think is from (pre-7.10) ASL2 licensed code?

That sounds helpful, even with the Heisenberg caveats.

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:

> We have something like that in Elasticsearch that wraps queries in order
> to be able to report cost, matchCost and the number of calls to
> nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
> the query tree.
>
> It's not perfect as it needs to disable some optimizations in order to
> work properly. For instance bulk scorers are disabled and conjunctions are
> not inlined, which means that clauses may run in a different order. So
> results need to be interpreted carefully as the way the query gets executed
> when observed may differ a bit from how it gets executed normally. That
> said it has still been useful in a number of cases. I don't think our
> implementation works when IndexSearcher is configured with an executor but
> we could maybe put it in sandbox and iterate from there?
>
> For your case, do you think it could be attributed to deleted docs?
> Deleted docs are checked before two-phase confirmation and collectors but
> after disjunctions/conjunctions of postings.
>
> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :
>
>> Do we have a way to understand how BooleanQuery (and other composite
>> queries) are advancing their child queries? For example, a simple
>> conjunction of two queries advances the more restrictive (lower
>> cost()) query first, enabling the more costly query to skip over more
>> documents. But we may not be making the best choice in every case, and
>> I would like to know, for some query, how we are doing. For example,
>> we could execute in a debugging mode, interposing something that wraps
>> or observes the Scorers in some way, gathering statistics about how
>> many documents are visited by each Scorer, which can be aggregated for
>> later analysis.
>>
>> This is motivated by a use case we have in which we currently
>> post-filter our query results in a custom collector using some filters
>> that we know to be expensive (they must be evaluated on every
>> document), but we would rather express these post-filters as Queries
>> and have them advanced during the main Query execution. However when
>> we tried to do that, we saw some slowdowns (in spite of marking these
>> Queries as high-cost) and I suspect it is due to the iteration order,
>> but I'm not sure how to debug.
>>
>> Suggestions welcome!
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
Re: debugging query execution plan [ In reply to ]
Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any
concerns regardless. Jack volunteered to bring this code to Lucene by
removing the Elasticsearch-specific bits.

On Mon, May 10, 2021 at 4:55 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

> +1 to start from the Elasticsearch implementation for low-level query
> execution tracing, which I think is from (pre-7.10) ASL2 licensed code?
>
> That sounds helpful, even with the Heisenberg caveats.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>
>> We have something like that in Elasticsearch that wraps queries in order
>> to be able to report cost, matchCost and the number of calls to
>> nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
>> the query tree.
>>
>> It's not perfect as it needs to disable some optimizations in order to
>> work properly. For instance bulk scorers are disabled and conjunctions are
>> not inlined, which means that clauses may run in a different order. So
>> results need to be interpreted carefully as the way the query gets executed
>> when observed may differ a bit from how it gets executed normally. That
>> said it has still been useful in a number of cases. I don't think our
>> implementation works when IndexSearcher is configured with an executor but
>> we could maybe put it in sandbox and iterate from there?
>>
>> For your case, do you think it could be attributed to deleted docs?
>> Deleted docs are checked before two-phase confirmation and collectors but
>> after disjunctions/conjunctions of postings.
>>
>> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a
>> écrit :
>>
>>> Do we have a way to understand how BooleanQuery (and other composite
>>> queries) are advancing their child queries? For example, a simple
>>> conjunction of two queries advances the more restrictive (lower
>>> cost()) query first, enabling the more costly query to skip over more
>>> documents. But we may not be making the best choice in every case, and
>>> I would like to know, for some query, how we are doing. For example,
>>> we could execute in a debugging mode, interposing something that wraps
>>> or observes the Scorers in some way, gathering statistics about how
>>> many documents are visited by each Scorer, which can be aggregated for
>>> later analysis.
>>>
>>> This is motivated by a use case we have in which we currently
>>> post-filter our query results in a custom collector using some filters
>>> that we know to be expensive (they must be evaluated on every
>>> document), but we would rather express these post-filters as Queries
>>> and have them advanced during the main Query execution. However when
>>> we tried to do that, we saw some slowdowns (in spite of marking these
>>> Queries as high-cost) and I suspect it is due to the iteration order,
>>> but I'm not sure how to debug.
>>>
>>> Suggestions welcome!
>>>
>>> -Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

--
Adrien
Re: debugging query execution plan [ In reply to ]
FYI this got just checked in:
https://issues.apache.org/jira/browse/LUCENE-9965.

I'd be curious to know if it helps with your problem, Mike.

On Wed, May 12, 2021 at 1:54 PM Adrien Grand <jpountz@gmail.com> wrote:

> Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any
> concerns regardless. Jack volunteered to bring this code to Lucene by
> removing the Elasticsearch-specific bits.
>
> On Mon, May 10, 2021 at 4:55 PM Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> +1 to start from the Elasticsearch implementation for low-level query
>> execution tracing, which I think is from (pre-7.10) ASL2 licensed code?
>>
>> That sounds helpful, even with the Heisenberg caveats.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>>
>>> We have something like that in Elasticsearch that wraps queries in order
>>> to be able to report cost, matchCost and the number of calls to
>>> nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
>>> the query tree.
>>>
>>> It's not perfect as it needs to disable some optimizations in order to
>>> work properly. For instance bulk scorers are disabled and conjunctions are
>>> not inlined, which means that clauses may run in a different order. So
>>> results need to be interpreted carefully as the way the query gets executed
>>> when observed may differ a bit from how it gets executed normally. That
>>> said it has still been useful in a number of cases. I don't think our
>>> implementation works when IndexSearcher is configured with an executor but
>>> we could maybe put it in sandbox and iterate from there?
>>>
>>> For your case, do you think it could be attributed to deleted docs?
>>> Deleted docs are checked before two-phase confirmation and collectors but
>>> after disjunctions/conjunctions of postings.
>>>
>>> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a
>>> écrit :
>>>
>>>> Do we have a way to understand how BooleanQuery (and other composite
>>>> queries) are advancing their child queries? For example, a simple
>>>> conjunction of two queries advances the more restrictive (lower
>>>> cost()) query first, enabling the more costly query to skip over more
>>>> documents. But we may not be making the best choice in every case, and
>>>> I would like to know, for some query, how we are doing. For example,
>>>> we could execute in a debugging mode, interposing something that wraps
>>>> or observes the Scorers in some way, gathering statistics about how
>>>> many documents are visited by each Scorer, which can be aggregated for
>>>> later analysis.
>>>>
>>>> This is motivated by a use case we have in which we currently
>>>> post-filter our query results in a custom collector using some filters
>>>> that we know to be expensive (they must be evaluated on every
>>>> document), but we would rather express these post-filters as Queries
>>>> and have them advanced during the main Query execution. However when
>>>> we tried to do that, we saw some slowdowns (in spite of marking these
>>>> Queries as high-cost) and I suspect it is due to the iteration order,
>>>> but I'm not sure how to debug.
>>>>
>>>> Suggestions welcome!
>>>>
>>>> -Mike
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>
> --
> Adrien
>


--
Adrien
Re: debugging query execution plan [ In reply to ]
Yes, I'm looking forward to checking it out, and really appreciate the
work to bring it here! I'm going to be out in the woods for a few
days, but it's on my list to try this out when I get back

On Wed, Jun 9, 2021 at 5:41 PM Adrien Grand <jpountz@gmail.com> wrote:
>
> FYI this got just checked in: https://issues.apache.org/jira/browse/LUCENE-9965.
>
> I'd be curious to know if it helps with your problem, Mike.
>
> On Wed, May 12, 2021 at 1:54 PM Adrien Grand <jpountz@gmail.com> wrote:
>>
>> Indeed this is code is ASL2 pre-7.10, but I wouldn't have expected any concerns regardless. Jack volunteered to bring this code to Lucene by removing the Elasticsearch-specific bits.
>>
>> On Mon, May 10, 2021 at 4:55 PM Michael McCandless <lucene@mikemccandless.com> wrote:
>>>
>>> +1 to start from the Elasticsearch implementation for low-level query execution tracing, which I think is from (pre-7.10) ASL2 licensed code?
>>>
>>> That sounds helpful, even with the Heisenberg caveats.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <jpountz@gmail.com> wrote:
>>>>
>>>> We have something like that in Elasticsearch that wraps queries in order to be able to report cost, matchCost and the number of calls to nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in the query tree.
>>>>
>>>> It's not perfect as it needs to disable some optimizations in order to work properly. For instance bulk scorers are disabled and conjunctions are not inlined, which means that clauses may run in a different order. So results need to be interpreted carefully as the way the query gets executed when observed may differ a bit from how it gets executed normally. That said it has still been useful in a number of cases. I don't think our implementation works when IndexSearcher is configured with an executor but we could maybe put it in sandbox and iterate from there?
>>>>
>>>> For your case, do you think it could be attributed to deleted docs? Deleted docs are checked before two-phase confirmation and collectors but after disjunctions/conjunctions of postings.
>>>>
>>>> Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <msokolov@gmail.com> a écrit :
>>>>>
>>>>> Do we have a way to understand how BooleanQuery (and other composite
>>>>> queries) are advancing their child queries? For example, a simple
>>>>> conjunction of two queries advances the more restrictive (lower
>>>>> cost()) query first, enabling the more costly query to skip over more
>>>>> documents. But we may not be making the best choice in every case, and
>>>>> I would like to know, for some query, how we are doing. For example,
>>>>> we could execute in a debugging mode, interposing something that wraps
>>>>> or observes the Scorers in some way, gathering statistics about how
>>>>> many documents are visited by each Scorer, which can be aggregated for
>>>>> later analysis.
>>>>>
>>>>> This is motivated by a use case we have in which we currently
>>>>> post-filter our query results in a custom collector using some filters
>>>>> that we know to be expensive (they must be evaluated on every
>>>>> document), but we would rather express these post-filters as Queries
>>>>> and have them advanced during the main Query execution. However when
>>>>> we tried to do that, we saw some slowdowns (in spite of marking these
>>>>> Queries as high-cost) and I suspect it is due to the iteration order,
>>>>> but I'm not sure how to debug.
>>>>>
>>>>> Suggestions welcome!
>>>>>
>>>>> -Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>
>>
>> --
>> Adrien
>
>
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org