Mailing List Archive: Relative cpu cost of fetching term frequency during scoring

Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 19, 2023, 11:56 PM

Post #1 of 11 (453 views)

Hi,
I want to understand whether fetching the term frequency of a term during
scoring is a relatively CPU-bound operation.
Context: I am storing a custom term frequency during indexing and later
using it for scoring at query execution time (in the Scorer's score()
method). I noticed a performance drop in my application, and I suspect this
change is the cause.
Any insight or related articles for reference would be appreciated.

*Thanks and Regards,*
*Vimal Jain*

Re: Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 19, 2023, 11:57 PM

Post #2 of 11 (453 views)

Note: I am using Lucene 7.7.3.

*Thanks and Regards,*
*Vimal Jain*

Re: Relative cpu cost of fetching term frequency during scoring

jpountz at gmail

Jun 20, 2023, 12:29 AM

Post #3 of 11 (453 views)

You say you observed a performance drop. What are you comparing against?

Re: Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 20, 2023, 1:29 AM

Post #4 of 11 (453 views)

OK, sorry, I realized that I need to provide more context.
We used to create a Lucene query consisting of custom term queries
for different fields and, based on the type of field, we assigned a
boost that was used in scoring.
Now we want to get rid of the different fields. Instead of creating
multiple term queries, we create only one term query for the merged field,
and the scorer of this term query (on the merged field) uses the custom
term frequency info (stored during indexing) to deduce the type of token
and hence the score that we were using earlier.
So the perf drop is observed relative to the earlier implementation (with
multiple term queries).
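
A minimal sketch of one way a token type could be packed into a custom term frequency at index time and unpacked again at scoring time. The bit layout, the class name `TypedFreq`, and the boost table are hypothetical; the thread does not say how the type is actually encoded. The sketch keeps the packed value at 1 or above, since a term frequency below 1 is not meaningful.

```java
/** Hypothetical layout: the low 3 bits hold the token type, the remaining bits hold the count. */
final class TypedFreq {
    static final int TYPE_BITS = 3;
    static final int TYPE_MASK = (1 << TYPE_BITS) - 1;

    // Hypothetical per-type boosts, standing in for the old per-field boosts.
    static final float[] BOOST_BY_TYPE = {1.0f, 2.0f, 4.0f, 0.5f, 1.0f, 1.0f, 1.0f, 1.0f};

    /** Packs a count (>= 1) and a type (0..7) into one positive int. */
    static int encode(int count, int type) {
        if (count < 1 || count > (Integer.MAX_VALUE >>> TYPE_BITS)) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        if (type < 0 || type > TYPE_MASK) {
            throw new IllegalArgumentException("type out of range: " + type);
        }
        return (count << TYPE_BITS) | type;
    }

    static int count(int packed) {
        return packed >>> TYPE_BITS;
    }

    static int type(int packed) {
        return packed & TYPE_MASK;
    }

    /** What a scorer might compute from the value it reads back as the term frequency. */
    static float score(int packed) {
        return BOOST_BY_TYPE[type(packed)] * count(packed);
    }

    public static void main(String[] args) {
        int packed = encode(3, 2);
        System.out.println("packed=" + packed + " type=" + type(packed)
                + " count=" + count(packed) + " score=" + score(packed));
    }
}
```

With a layout like this, the decoding itself is a shift and a mask, so it is unlikely to be the expensive part of scoring.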

*Thanks and Regards,*
*Vimal Jain*

Re: Relative cpu cost of fetching term frequency during scoring

jpountz at gmail

Jun 20, 2023, 9:15 AM

Post #5 of 11 (453 views)

Intuitively, replacing a disjunction across multiple fields with a single
term query should always be faster.

You're saying that you're storing the type of token as part of the term
frequency. This doesn't sound like something that would play well with
dynamic pruning, so I wonder if this is the reason why you are seeing
slower queries. But since you mentioned custom term queries, maybe you
never actually took advantage of dynamic pruning?

--
Adrien

Re: Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 20, 2023, 9:57 AM

Post #6 of 11 (453 views)

Thanks, Adrien, for the quick response.
Yes, I am replacing disjuncts across multiple fields with a single custom
term query over the merged field.
Can you please provide more details on what you mean by dynamic pruning
in the context of a custom term query?

Re: Relative cpu cost of fetching term frequency during scoring

jpountz at gmail

Jun 20, 2023, 12:36 PM

Post #7 of 11 (453 views)

Lucene has logic to only evaluate a subset of the matching documents when
retrieving top-k hits. This leverages the Scorer#getMaxScore API. If you
never implemented it on your custom query, then you never took advantage of
dynamic pruning anyway. I wrote a bit more about it
<https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand>
a few years ago if you're curious.
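
A toy model of the idea, assuming postings are split into blocks and each block carries an upper bound on the score of its documents. It is a sketch of the general technique only, not Lucene code: the class `BlockMaxPruningSketch`, the block layout, and the sample scores are all made up for illustration. Once the top-k heap is full, any block whose maximum score cannot beat the current k-th best score is skipped without scoring its documents.

```java
import java.util.PriorityQueue;

/** Toy model of block-max pruning; not Lucene code. */
final class BlockMaxPruningSketch {

    /** Top-k scores in ascending order, plus how many documents were scored. */
    static final class Result {
        final float[] topScores;
        final int scored;

        Result(float[] topScores, int scored) {
            this.topScores = topScores;
            this.scored = scored;
        }
    }

    // Stands in for per-block maximum scores that a real index would precompute.
    static float blockMax(float[] block) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : block) {
            max = Math.max(max, s);
        }
        return max;
    }

    static Result topK(float[][] blocks, int k, boolean prune) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap of the current top k
        int scored = 0;
        for (float[] block : blocks) {
            if (prune && heap.size() == k && blockMax(block) <= heap.peek()) {
                continue; // no document in this block can enter the top k
            }
            for (float score : block) {
                scored++;
                if (heap.size() < k) {
                    heap.add(score);
                } else if (score > heap.peek()) {
                    heap.poll();
                    heap.add(score);
                }
            }
        }
        float[] top = new float[heap.size()];
        for (int i = 0; i < top.length; i++) {
            top[i] = heap.poll();
        }
        return new Result(top, scored);
    }

    public static void main(String[] args) {
        float[][] blocks = {{1f, 5f, 3f}, {2f, 1f, 2f}, {9f, 4f, 1f}, {3f, 3f, 2f}};
        System.out.println("scored without pruning: " + topK(blocks, 2, false).scored);
        System.out.println("scored with pruning:    " + topK(blocks, 2, true).scored);
    }
}
```

Pruning only helps when the per-block bounds are tight. If the frequency encodes a token type rather than a count, the bound derived from it may be loose, and few blocks can be skipped.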

--
Adrien

Re: Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 20, 2023, 11:53 PM

Post #8 of 11 (453 views)

Thanks, Adrien, I had a look at your blog post. It looks like
Scorer#getMaxScore was added in Lucene 8.0; I am using 7.7.3.
A side question: is there any resource to help migrate to a newer major
version? I see a lot of APIs changed from v7 to v8.

*Thanks and Regards,*
*Vimal Jain*

Re: Relative cpu cost of fetching term frequency during scoring

jpountz at gmail

Jun 21, 2023, 1:13 AM

Post #9 of 11 (453 views)

As far as your performance problem is concerned, I don't know. Can you
compare the number of documents that need to be evaluated in both cases,
e.g. by running `IndexSearcher#count` on your two queries? If they're
similar, can you run your new query under a profiler to figure out what its
bottleneck is?

Regarding migration to a newer major version, there is a MIGRATE.txt that
gives some advice:
https://github.com/apache/lucene/blob/releases/lucene-solr/8.0.0/lucene/MIGRATE.txt

--
Adrien

Re: Relative cpu cost of fetching term frequency during scoring

vkjk89 at gmail

Jun 21, 2023, 9:00 PM

Post #10 of 11 (453 views)

I profiled the new code and found that the API call below is the most time
consuming:
org.apache.lucene.index.PostingsEnum#freq
If I comment out this call and use a random integer instead for testing
purposes, then perf is at least 5x that of the old code.
Are there any thoughts on why term frequency calls on PostingsEnum are that
slow?

*Thanks and Regards,*
*Vimal Jain*

Re: Relative cpu cost of fetching term frequency during scoring

jpountz at gmail

Jun 26, 2023, 5:40 AM

Post #11 of 11 (439 views)

This is a bit surprising. Can you share the profiler output (e.g. a
screenshot) to see what is slow within the `PostingsEnum#freq` call?

`PostingsEnum#freq` may need to decode a block of freqs, but I would
generally not expect it to be 5x slower than decoding doc IDs for the
same block.
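
A toy model of what "decoding a block of freqs" on demand can look like, assuming doc IDs for a block are already available and the frequencies are decoded only when first requested. This is an illustration of lazy block decoding in general, not Lucene's codec; the class `LazyFreqBlockSketch` and the one-byte-per-freq encoding are hypothetical, and the thread does not establish whether 7.7.3 decodes freqs this way.

```java
/** Toy model of lazily decoding one block of term frequencies; not Lucene code. */
final class LazyFreqBlockSketch {
    private final int[] docIds;        // doc IDs of one block, assumed already decoded
    private final byte[] encodedFreqs; // one unsigned byte per freq, a stand-in for a packed block
    private int[] decodedFreqs;        // filled on the first freq() call
    private int pos = -1;
    int decodeCount = 0;               // how many times the block was decoded

    LazyFreqBlockSketch(int[] docIds, byte[] encodedFreqs) {
        if (docIds.length != encodedFreqs.length) {
            throw new IllegalArgumentException("one freq per doc expected");
        }
        this.docIds = docIds;
        this.encodedFreqs = encodedFreqs;
    }

    /** Advances to the next doc; returns -1 when the block is exhausted. */
    int nextDoc() {
        pos++;
        return pos < docIds.length ? docIds[pos] : -1;
    }

    /** Returns the freq of the current doc, decoding the whole block on first use. */
    int freq() {
        if (pos < 0 || pos >= docIds.length) {
            throw new IllegalStateException("not positioned on a document");
        }
        if (decodedFreqs == null) {
            decodedFreqs = new int[encodedFreqs.length];
            for (int i = 0; i < encodedFreqs.length; i++) {
                decodedFreqs[i] = encodedFreqs[i] & 0xFF;
            }
            decodeCount++;
        }
        return decodedFreqs[pos];
    }

    public static void main(String[] args) {
        LazyFreqBlockSketch block =
                new LazyFreqBlockSketch(new int[] {2, 5, 9}, new byte[] {1, 4, 2});
        for (int doc = block.nextDoc(); doc != -1; doc = block.nextDoc()) {
            System.out.println("doc " + doc + " freq " + block.freq());
        }
        System.out.println("block decodes: " + block.decodeCount);
    }
}
```

In a model like this, a profiler would attribute the whole block decode to the first `freq()` call, while later calls on the same block are array reads.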

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org