Mailing List Archive

Deduplication of search results with custom sort
Hi,
I need to deduplicate search results by a specific field, and I have no idea
how to implement this properly.
I have tried grouping with setGroupDocsLimit(1), and it gives me the expected
results, but performance is not very good.
I think I need something like DiversifiedTopDocsCollector, but one that is
suitable for collecting TopFieldDocs.
Is there any way to achieve deduplication with existing Lucene components,
or do I need to implement my own DiversifiedTopFieldsCollector?
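The requirement can be pictured with a small Lucene-free toy model (Doc and its key/rank/date fields are made-up stand-ins, not Lucene API): keep the single best document per dedupe key under a multi-field sort, then return the survivors in sort order.

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy model of "dedupe by field under a custom sort", no Lucene involved.
// Doc, key, rank and date are illustrative stand-ins.
public class DedupeBySort {
    record Doc(int id, String key, int rank, long date) {}

    // Custom multi-field sort: ascending rank, then newest date first.
    static final Comparator<Doc> SORT =
            Comparator.comparingInt(Doc::rank)
                      .thenComparing(Comparator.comparingLong(Doc::date).reversed());

    // Keep the best doc per key, then the top `limit` overall by the sort.
    static List<Doc> search(List<Doc> docs, int limit) {
        Map<String, Doc> bestPerKey = new HashMap<>();
        for (Doc d : docs) {
            bestPerKey.merge(d.key(), d, (a, b) -> SORT.compare(a, b) <= 0 ? a : b);
        }
        return bestPerKey.values().stream().sorted(SORT).limit(limit)
                         .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc(1, "x", 2, 100),
                new Doc(2, "x", 1, 50),   // better doc for key "x"
                new Doc(3, "y", 1, 200),
                new Doc(4, "z", 3, 10));
        System.out.println(search(docs, 2)); // doc 3, then doc 2
    }
}
```

The thread below is about doing this efficiently inside collection rather than as a post-pass.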
Re: Deduplication of search results with custom sort
Is the field that you are using to dedupe stored as a docvalue?

Re: Deduplication of search results with custom sort
Yes, it is

On Fri, 9 Oct 2020 at 14:25, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarelli4@bloomberg.net> wrote:

Re: Deduplication of search results with custom sort
How many documents in the collection, how many groups, and how long is it taking to do the grouping vs no grouping?

Also, if you remove the custom sort is it still slow?

Re: Deduplication of search results with custom sort
At the Solr level there is the CollapsingQParserPlugin; see:
https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html

You could perhaps steal some ideas from that if you
need this at the Lucene level.

Best,
Erick
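For context, collapsing in Solr is a query-time filter query. A sketch of what such a request could look like (the field name videoHash is made up for illustration):

```
q=title:lucene&fq={!collapse field=videoHash}&sort=score desc&rows=1000
```

By default the collapse keeps the highest-scoring document of each group as the group head; the min=/max= local params let you select the head by a numeric field instead, which is the part worth borrowing for a custom-sort dedupe.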

> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarelli4@bloomberg.net> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Deduplication of search results with custom sort
I have 12,000,000 documents and 6,500,000 groups.

With sort: it takes around 1 sec without grouping, 2 sec with grouping, and
12 sec with setAllGroups(true).
Without sort: it takes around 0.2 sec without grouping, 0.6 sec with
grouping, and 10 sec with setAllGroups(true).

Thank you, Erick, I will look into it

On Fri, 9 Oct 2020 at 14:32, Erick Erickson <erickerickson@gmail.com> wrote:

Re: Deduplication of search results with custom sort
This is going to be fairly painful. You need to keep a list 6.5M
items long, sorted.

Before diving in there, I’d really back up and ask what the use-case
is. Returning 6.5M docs to a user is useless, so are you doing
some kind of analytics maybe? In which case, and again
assuming you’re using Solr, Streaming Aggregation might
be a better option.

This really sounds like an XY problem. You’re trying to solve problem X
and asking how to accomplish it with Y. What I’m questioning
is whether Y (grouping) is a good approach or not. Perhaps if
you explained X there’d be a better suggestion.

Best,
Erick

> On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emetsds@gmail.com> wrote:
>
> I have 12_000_000 documents, 6_500_000 groups
>
> With sort: It takes around 1 sec without grouping, 2 sec with grouping and
> 12 sec with setAllGroups(true)
> Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> grouping and 10 sec with setAllGroups(true)
>
> Thank you, Erick, I will look into it
>
> ??, 9 ???. 2020 ?. ? 14:32, Erick Erickson <erickerickson@gmail.com>:
>
>> At the Solr level, CollapsingQParserPlugin see:
>> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>>
>> You could perhaps steal some ideas from that if you
>> need this at the Lucene level.
>>
>> Best,
>> Erick
>>
>>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
>> dceccarelli4@bloomberg.net> wrote:
>>>
>>> Is the field that you are using to dedupe stored as a docvalue?
>>>
>>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04To:
>> java-user@lucene.apache.org
>>> Subject: Deduplication of search result with custom with custom sort
>>>
>>> Hi,
>>> I need to deduplicate search results by specific field and I have no idea
>>> how to implement this properly.
>>> I have tried grouping with setGroupDocsLimit(1) and it gives me expected
>>> results, but has not very good performance.
>>> I think that I need something like DiversifiedTopDocsCollector, but
>>> suitable for collecting TopFieldDocs.
>>> Is there any possibility to achieve deduplication with existing lucene
>>> components, or do I need to implement my own
>> DiversifiedTopFieldsCollector?
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Deduplication of search results with custom sort
6,500,000 is the total count of groups in the entire collection. I only
return the top 1000 to users.
I use Lucene, and I have documents that can share the same docvalue; I want
to deduplicate these documents by that docvalue during search.
Also, I sort my documents by multiple fields, and because of this I can't
use DiversifiedTopDocsCollector, which works with the relevance score only.

On Fri, 9 Oct 2020 at 16:02, Erick Erickson <erickerickson@gmail.com> wrote:

Re: Deduplication of search results with custom sort
As Erick said, can you tell us a bit more about the use case?
There might be another way to achieve the same result.

What are these documents?
Why do you need 1000 docs per user?


Re: Deduplication of search results with custom sort
My learnings from dealing with this problem:

We faced a similar problem before, and did the following things:

1) Don't request totalGroupCount, and the response was fast; computing the
group count is an expensive task, so this helps if you can live without
groupCount. You can still approximate pagination: keep paging until you get
empty results, since the group count will be less than the total count.
2) Use more shards, so you can get the best out of parallel execution.

I have seen use-cases of 60M total documents, deduplicated on a doc values
field, with 4 shards.

Query-time SLA is around 5-6 seconds. Not unbearable for users.

Let me know if you find a better solution.

On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarelli4@bloomberg.net> wrote:

Re: Deduplication of search results with custom sort
Thank you very much for helping!

There isn't much I can add about my use case. I have user-generated video
titles, plus hash codes that tell me when two uploads are the same video.
Users search videos by title, and I should return the top 1000 unique
videos to them.

I will try to use grouping without counting groups. Otherwise I'll look
here https://issues.apache.org/jira/browse/SOLR-11831 or here
https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html

Thanks again!

On Fri, 9 Oct 2020 at 18:57, Jigar Shah <jigaronline@gmail.com> wrote:

Re: Deduplication of search results with custom sort
> https://issues.apache.org/jira/browse/SOLR-11831
I collaborated on the Las Vegas patch; I don't think that patch will be
merged (it modifies too many things in the core), and we ended up
reimplementing it as a standalone plugin.
Also keep in mind that the patch makes a difference only if you are using
SolrCloud, while it seems that you are using Lucene directly.

Do you really need to return 1000 results to the user? Is this for paging
purposes?

Do you know how frequent the groups are? If they are not too frequent and
you are not strict about 1000, you might retrieve more, say 2000, without
grouping and then do the deduping afterwards.

Cheers,
Diego
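Diego's over-fetch idea is easy to prototype outside Lucene. A minimal sketch, assuming the hits arrive already ordered by the custom sort (Hit and the hash field are illustrative stand-ins, not Lucene types):

```java
import java.util.*;

// Over-fetch then dedupe: fetch more sorted hits than needed (say 2000),
// keep only the first hit per hash (the best one, since the list is
// already in sort order), and stop once the requested count is reached.
public class OverfetchDedupe {
    record Hit(int docId, String hash) {}

    static List<Hit> dedupe(List<Hit> sortedHits, int wanted) {
        Set<String> seen = new HashSet<>();
        List<Hit> out = new ArrayList<>(wanted);
        for (Hit h : sortedHits) {
            if (seen.add(h.hash())) {            // first occurrence wins
                out.add(h);
                if (out.size() == wanted) break;
            }
        }
        return out;  // may be shorter than `wanted` if duplicates dominate
    }

    public static void main(String[] args) {
        List<Hit> sorted = List.of(
                new Hit(1, "a"), new Hit(2, "a"), new Hit(3, "b"), new Hit(4, "c"));
        System.out.println(dedupe(sorted, 2)); // keeps docs 1 and 3
    }
}
```

The caveat is the one Diego hints at: if a result window contains more duplicates than the over-fetch margin covers, the page comes up short and you have to re-query with a larger window.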


Re: Deduplication of search results with custom sort [ In reply to ]
I studied the Las Vegas patch and came away with one simple idea.
FirstPassGroupingCollector collects a CollectedSearchGroup per group inside
itself, and CollectedSearchGroup contains the docId and sortValues. This is
exactly what I need. Thanks for the help!
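The single-pass idea described above can be sketched roughly as follows, assuming the Lucene 8.x grouping module; the "hash" field, the sort fields, and the searcher/query variables are placeholders, and in Lucene before 8.0 getTopGroups also took a fillFields flag:

```java
import java.util.Collection;

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.grouping.FirstPassGroupingCollector;
import org.apache.lucene.search.grouping.SearchGroup;
import org.apache.lucene.search.grouping.TermGroupSelector;
import org.apache.lucene.util.BytesRef;

// Collect the top 1000 groups, ordered by a custom multi-field sort
// rather than by relevance alone.
Sort sort = new Sort(new SortField("title", SortField.Type.STRING),
                     SortField.FIELD_SCORE);
FirstPassGroupingCollector<BytesRef> collector =
    new FirstPassGroupingCollector<>(new TermGroupSelector("hash"), sort, 1000);
searcher.search(query, collector);

// Each SearchGroup exposes the group value and its sortValues; since only
// one document per group is wanted, no second grouping pass is needed.
Collection<SearchGroup<BytesRef>> topGroups = collector.getTopGroups(0);
```

Because only the first pass runs, none of the per-group document collection of a full grouping search is paid for.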

Mon, 12 Oct 2020 at 17:38, Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarelli4@bloomberg.net>:

> https://issues.apache.org/jira/browse/SOLR-11831 is the Las Vegas patch; I
> collaborated on it. I don't think that patch will be merged, as it modifies
> too many things in the core; we ended up reimplementing it as a standalone
> plugin. Also keep in mind that the patch makes a difference only if you are
> using SolrCloud, while it seems that you are using Lucene directly.
>
> Do you really need to return 1000 results to the user? Is this for paging
> purposes?
>
> Do you know how frequent the groups are? If they are not too frequent and
> you are not strict on 1000, you could retrieve more, say 2000, without
> grouping and then do the deduplication afterwards.
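The over-fetch-then-dedupe step is just a first-seen filter over the already-sorted hit list. A minimal standalone sketch, with a key-extraction function standing in for the per-document docvalues lookup (all names here are illustrative, not from any Lucene API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class DedupeByKey {
    // Keep only the first hit seen for each key, preserving input order
    // (i.e. the sort order of the over-fetched results), and stop once
    // `limit` unique hits have been collected.
    public static <T, K> List<T> dedupe(List<T> hits, Function<T, K> key, int limit) {
        Set<K> seen = new HashSet<>();
        List<T> out = new ArrayList<>();
        for (T hit : hits) {
            if (seen.add(key.apply(hit))) { // add() is false for repeated keys
                out.add(hit);
                if (out.size() == limit) {
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hits sorted upstream; the leading letter stands in for the hash key.
        List<String> hits = Arrays.asList("a1", "a2", "b1", "a3", "b2", "c1");
        System.out.println(dedupe(hits, s -> s.substring(0, 1), 2)); // [a1, b1]
    }
}
```

The catch Diego implies: if fewer than 1000 unique keys survive among the 2000 fetched hits, a second, deeper fetch is needed, so this only pays off when duplicates are rare.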
>
> Cheers,
> Diego
>
>
> From: java-user@lucene.apache.org At: 10/12/20 13:02:46 To:
> java-user@lucene.apache.org
> Subject: Re: Deduplication of search result with custom sort
>
> Thank you very much for helping!
>
> There isn't much I can add about my use case. I have user-generated video
> titles, plus hash codes that tell me which entries are the same video.
> Users search videos by title, and I should return the top 1000 unique
> videos to them.
>
> I will try to use grouping without counting groups. Otherwise I'll look
> here https://issues.apache.org/jira/browse/SOLR-11831 or here
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>
> Thanks again!
>
> Fri, 9 Oct 2020 at 18:57, Jigar Shah <jigaronline@gmail.com>:
>
> > My learnings from dealing with this problem:
> >
> > We faced a similar problem before and did the following things:
> >
> > 1) Don't request totalGroupCount, since computing the group count is an
> > expensive task; skipping it made the response fast. This assumes you can
> > live without the group count, but you can still approximate pagination:
> > keep paging until you get empty results, then stop.
> > 2) Have more shards, so you get the best out of parallel execution.
> >
> > I have seen use cases of 60M total documents, deduping on a doc-values
> > field, with 4 shards.
> >
> > Query-time SLA is around 5-6 seconds. Not unbearable for users.
> >
> > Let me know if you find a better solution.
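Dropping the total-group count while still collapsing to one document per group can be sketched with the grouping module's convenience API (a sketch only, against Lucene's GroupingSearch; the "hash" field and the sort/searcher/query variables are placeholders):

```java
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;

// Group on the dedupe field, keep one top document per group, and skip
// the expensive total-group count.
GroupingSearch grouping = new GroupingSearch("hash");
grouping.setGroupSort(sort);      // the custom multi-field sort
grouping.setGroupDocsLimit(1);    // one document per group = dedupe
grouping.setAllGroups(false);     // don't compute totalGroupCount
TopGroups<?> topGroups = grouping.search(searcher, query, 0, 1000);
```

Leaving setAllGroups(false) (its default) is what avoids the 10-12 second penalty reported earlier in the thread, at the cost of not knowing how many groups exist in total.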
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 9, 2020 at 11:45 AM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > dceccarelli4@bloomberg.net> wrote:
> >
> > > As Erick said, can you tell us a bit more about the use case?
> > > There might be another way to achieve the same result.
> > >
> > > What are these documents?
> > > Why do you need 1000 docs per user?
> > >
> > >
> > > From: java-user@lucene.apache.org At: 10/09/20 14:25:02 To:
> > > java-user@lucene.apache.org
> > > Subject: Re: Deduplication of search result with custom sort
> > >
> > > 6_500_000 is the total count of groups in the entire collection. I only
> > > return the top 1000 to users.
> > > I use Lucene, and I have documents that can share the same docvalue; I
> > > want to deduplicate these documents by that docvalue during search.
> > > Also, I sort my documents by multiple fields, and because of this I
> > > can't use DiversifiedTopDocsCollector, which works with the relevance
> > > score only.
> > >
> > > Fri, 9 Oct 2020 at 16:02, Erick Erickson <erickerickson@gmail.com>:
> > >
> > > > This is going to be fairly painful. You need to keep a list 6.5M
> > > > items long, sorted.
> > > >
> > > > Before diving in there, I’d really back up and ask what the use-case
> > > > is. Returning 6.5M docs to a user is useless, so are you doing
> > > > some kind of analytics, maybe? In which case, and again
> > > > assuming you’re using Solr, Streaming Aggregation might
> > > > be a better option.
> > > >
> > > > This really sounds like an XY problem. You’re trying to solve
> problem X
> > > > and asking how to accomplish it with Y. What I’m questioning
> > > > is whether Y (grouping) is a good approach or not. Perhaps if
> > > > you explained X there’d be a better suggestion.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 9, 2020, at 8:19 AM, Dmitry Emets <emetsds@gmail.com>
> wrote:
> > > > >
> > > > > I have 12_000_000 documents, 6_500_000 groups
> > > > >
> > > > > With sort: It takes around 1 sec without grouping, 2 sec with
> > grouping
> > > > and
> > > > > 12 sec with setAllGroups(true)
> > > > > Without sort: It takes around 0.2 sec without grouping, 0.6 sec
> with
> > > > > grouping and 10 sec with setAllGroups(true)
> > > > >
> > > > > Thank you, Erick, I will look into it
> > > > >
> > > > > Fri, 9 Oct 2020 at 14:32, Erick Erickson <erickerickson@gmail.com>:
> > > > >
> > > > >> At the Solr level, CollapsingQParserPlugin see:
> > > > >>
> > > >
> > >
> >
> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
> > > > >>
> > > > >> You could perhaps steal some ideas from that if you
> > > > >> need this at the Lucene level.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON)
> <
> > > > >> dceccarelli4@bloomberg.net> wrote:
> > > > >>>
> > > > >>> Is the field that you are using to dedupe stored as a docvalue?
> > > > >>>
> > > > >>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04 To:
> > > > >> java-user@lucene.apache.org
> > > > >>> Subject: Deduplication of search result with custom sort
> > > > >>>
> > > > >>> Hi,
> > > > >>> I need to deduplicate search results by specific field and I have
> > no
> > > > idea
> > > > >>> how to implement this properly.
> > > > >>> I have tried grouping with setGroupDocsLimit(1) and it gives me
> > > > expected
> > > > >>> results, but has not very good performance.
> > > > >>> I think that I need something like DiversifiedTopDocsCollector,
> but
> > > > >>> suitable for collecting TopFieldDocs.
> > > > >>> Is there any possibility to achieve deduplication with existing
> > > lucene
> > > > >>> components, or do I need to implement my own
> > > > >> DiversifiedTopFieldsCollector?
> > > > >>>
> > > > >>>
> > > > >>
> > > > >>
> > > > >>
> > ---------------------------------------------------------------------
> > > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>