Mailing List Archive

Retrieving query-time join fromQuery hits
Hi,

When using Lucene’s query-time join feature [1], how can the hits from the
first phase which determine / contribute to the returned results be
retrieved?

This topic has been brought up before [2], and at the time the
recommendation was to re-run the query with added constraints based on the
join field's values. Is there an alternative when the goal is to get the
contributing hits for every returned result, and when the toField can hold
multiple terms?

I see that the info tracked by the Join API covers the scores and the
terms collected in the first phase. During this feature’s development [3],
a three-phase approach was also considered: record the fromSearcher’s
docIds, translate them into joinable terms, and then record the
toSearcher’s docIds. However, even if docId info were recorded between
phases, it would still have to be exposed somehow.

Thanks,
Stefan Onofrei

[1] https://lucene.apache.org/core/8_5_1/join/org/apache/lucene/search/join/JoinUtil.html
[2] https://lucene.472066.n3.nabble.com/access-to-joined-documents-td4412376.html
[3] https://issues.apache.org/jira/browse/LUCENE-3602
Re: Retrieving query-time join fromQuery hits
I am trying first to understand the proposed solution from the previous
thread.

You run query #1, it returns top N hits. From those hits you ask JoinUtil
to create the "joined" query #2. You run the query #2 to get the top final
(joined) hits.

Then, to reconstruct which docids from query #1 matched which hits from
query #2, do you run a new query for every hit out of query #2? E.g. if
you want top 10 hits, you must run 10 new queries in the end, to match up
each docid in the final result set with each docid hit from query #1?
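
To make the cost of that approach concrete, here is a minimal plain-Java sketch of it. It is an in-memory model, not Lucene code: the classes FromDoc and ToDoc, the methods join and contributors, and the use of a Predicate as the "fromQuery" are all hypothetical stand-ins chosen for illustration. The point is only that the follow-up step repeats work on the from side once per joined hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Plain-Java stand-in for the two indexes; none of this is Lucene API.
final class RequeryPerHitSketch {

    /** A document on the "from" side: id, one join key, and a text field to query. */
    static final class FromDoc {
        final int id;
        final String joinKey;
        final String text;

        FromDoc(int id, String joinKey, String text) {
            this.id = id;
            this.joinKey = joinKey;
            this.text = text;
        }
    }

    /** A document on the "to" side; its toField may hold several terms. */
    static final class ToDoc {
        final int id;
        final Set<String> toTerms;

        ToDoc(int id, String... toTerms) {
            this.id = id;
            this.toTerms = new HashSet<>(Arrays.asList(toTerms));
        }
    }

    /** Query #1 and #2: collect join keys from matching from-docs, keep to-docs sharing a key. */
    static List<ToDoc> join(List<FromDoc> fromDocs, Predicate<FromDoc> fromQuery,
                            List<ToDoc> toDocs) {
        Set<String> keys = new HashSet<>();
        for (FromDoc f : fromDocs) {
            if (fromQuery.test(f)) {
                keys.add(f.joinKey);
            }
        }
        List<ToDoc> hits = new ArrayList<>();
        for (ToDoc t : toDocs) {
            for (String term : t.toTerms) {
                if (keys.contains(term)) {
                    hits.add(t);
                    break;
                }
            }
        }
        return hits;
    }

    /** Follow-up step: one more pass over the from side for every joined hit. */
    static Map<Integer, List<Integer>> contributors(List<FromDoc> fromDocs,
                                                    Predicate<FromDoc> fromQuery,
                                                    List<ToDoc> joinedHits) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (ToDoc hit : joinedHits) {
            List<Integer> ids = new ArrayList<>();
            for (FromDoc f : fromDocs) {
                // Original query plus the added constraint on the join key.
                if (fromQuery.test(f) && hit.toTerms.contains(f.joinKey)) {
                    ids.add(f.id);
                }
            }
            result.put(hit.id, ids);
        }
        return result;
    }
}
```

In this model, asking for the contributors of 10 joined hits means 10 constrained passes over the from side, which matches the "10 new queries" described above.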

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 12, 2020 at 12:23 PM Stefan Onofrei <stefanonofrei@gmail.com>
wrote:

Re: Retrieving query-time join fromQuery hits
Actually, I do not see how this can work efficiently with per-hit queries
after the join.

For each of the final joined hits, you must 1) retrieve the join key
value(s) by pulling doc values iterators and advancing to the right docid,
2) run another query to "join backwards" to the hits from the left side of
the join.

I don't see how step 2) can work efficiently when there are many possible
hits on the left side that might have matched those join keys.

Elasticsearch offers query-time joins ... I wonder how it retrieves and
returns hits from both left and right? It seems like the left side of the
join must retain some state, to know which top hits corresponded to those
join values, and then expose an API to retrieve them?
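
As a rough illustration of what "retaining some state" could look like, here is a plain-Java sketch. It is not how JoinUtil or Elasticsearch works; the class JoinStateSketch and its methods recordFromHits and joinWithContributors are hypothetical names, and the maps stand in for what the two collection phases would see. Phase 1 remembers which from-docIds produced each join term, and phase 2 looks the contributors up per to-doc, including the case where the toField holds several terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model only; a real implementation would live in the join collectors.
final class JoinStateSketch {

    /**
     * Phase 1: fromHitKeys maps each from-docId that matched the fromQuery to its
     * join key. The returned state maps join term -> from-docIds that produced it.
     */
    static Map<String, List<Integer>> recordFromHits(Map<Integer, String> fromHitKeys) {
        Map<String, List<Integer>> byTerm = new HashMap<>();
        for (Map.Entry<Integer, String> e : fromHitKeys.entrySet()) {
            byTerm.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byTerm;
    }

    /**
     * Phase 2: toDocTerms maps each to-docId to the terms in its toField.
     * A to-doc is a hit if any of its terms was recorded; its contributors are
     * the union of the from-docIds recorded under those terms.
     */
    static Map<Integer, SortedSet<Integer>> joinWithContributors(
            Map<String, List<Integer>> byTerm, Map<Integer, List<String>> toDocTerms) {
        Map<Integer, SortedSet<Integer>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : toDocTerms.entrySet()) {
            SortedSet<Integer> contributors = new TreeSet<>();
            for (String term : e.getValue()) {
                List<Integer> ids = byTerm.get(term);
                if (ids != null) {
                    contributors.addAll(ids);
                }
            }
            if (!contributors.isEmpty()) {
                result.put(e.getKey(), contributors);
            }
        }
        return result;
    }
}
```

The trade-off in this sketch is memory: the retained map grows with the number of from-side hits, so a real version would probably need to cap it, for example to the top hits only.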

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 20, 2020 at 6:31 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

Re: Retrieving query-time join fromQuery hits
Hi, Stefan.
Have you considered faceting/aggregation over the `from` field?


--
Sincerely yours
Mikhail Khludnev
Re: Retrieving query-time join fromQuery hits
Thanks for the replies.

@Mike: Yes, I think the idea is to run separate queries for each of the
resulting hits, as you described. I am concerned about the performance
implications of going down this route, especially when dealing with large
result sets.

@Mikhail: Thanks for the suggestion! I actually hadn't thought of that.
Could you please provide more details on how we could approach the problem
from this angle?

Thanks,
Stefan Onofrei

On Wed, Jun 3, 2020 at 9:59 PM Mikhail Khludnev <mkhl@apache.org> wrote:

Re: Retrieving query-time join fromQuery hits
Hi, Stefan.
I'm just thinking out loud. Let's say we join FromDoc, with fields (FromID,
FromFK), to ToDoc via ToDoc.ID=FromFK.
The results are ToDocs, obviously. But if we compute a facet on FromFK over
fromQuery, its values match the ToDoc.IDs. We can then sub-facet (or
nested-facet) by FromID, which gives us the full relation. Not sure if
it helps.
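
A small plain-Java sketch of that idea, under the assumption that the fromQuery hits are available as (FromID, FromFK) pairs. It does not use any faceting library; the class NestedFacetSketch and the method facetByFk are hypothetical, and the grouping collector only imitates what a facet on FromFK with a sub-facet on FromID would return.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Imitates "facet on FromFK, sub-facet on FromID" with a grouping collector.
final class NestedFacetSketch {

    /** One from-doc that matched the fromQuery. */
    static final class FromDoc {
        final String fromId;
        final String fromFk;

        FromDoc(String fromId, String fromFk) {
            this.fromId = fromId;
            this.fromFk = fromFk;
        }
    }

    /**
     * Top-level buckets are FromFK values (equal to ToDoc.ID under the join);
     * each bucket lists the FromIDs that fell into it.
     */
    static Map<String, List<String>> facetByFk(List<FromDoc> fromQueryHits) {
        return fromQueryHits.stream().collect(Collectors.groupingBy(
                d -> d.fromFk,
                TreeMap::new,
                Collectors.mapping(d -> d.fromId, Collectors.toList())));
    }
}
```

For a returned ToDoc, looking up its ID in the resulting map gives the FromIDs that contributed to it, and the size of that list is the facet count for the bucket.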

On Mon, Jun 8, 2020 at 11:37 AM Stefan Onofrei <stefanonofrei@gmail.com>
wrote:



--
Sincerely yours
Mikhail Khludnev