Mailing List Archive: Trying to understand cross-collection-join routing/hashing choices and behavior

I'm trying to understand the cross-collection JOIN
<https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#cross-collection-join>
documentation, behavior, choices, and viability.

*# Terminology language choice*

"""routerField - If the documents are routed to shards using the
CompositeID router by the join field, then that field name should be
specified in the configuration here. This will allow the parser to optimize
the resulting HashRange query."""

"""routed - If true, the cross collection join query will use each shard’s
hash range to determine the set of join keys to retrieve for that shard.
This parameter improves the performance of the cross-collection join, but
it depends on the local collection being routed by the to field. If this
parameter is not specified, the cross collection join query will try to
determine the correct value automatically."""

*Question 1*: Why overload terminology like "route" when, AFAICT, these
parameters do NOT route? Based on my reading of the code, all they do is add
a hash_range fq parameter to the remote join query request. Filtering results
is not routing, so this fosters confusion. Is there reasoning behind this,
or is it just happenstance?

*# Implied vs Actual behavior*

My reading of the code base is this: the hash_range parameter is always
populated with the "fromField" value. The routerField is only compared
against the "toField" for equality to enable the hash_range filter, and even
that happens only as a fallback when "routed" is not set.

It's a little strange to me that "routerField" is not used as a router
field, or even as a hash field. It is only used as a flag for "if a query
is joining to this field then use hash_range filter on the fromField" (or
at least that's how I read the code).
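To make that reading concrete, here is a minimal Python sketch of the decision as described above. It is not the Solr implementation: the function name, its arguments, and the returned parameter dict are hypothetical, and the {!hash_range f=... l=... u=...} spelling is assumed from the hash range query parser documentation.

```python
from typing import Dict, Optional, Tuple


def remote_join_params(
    from_field: str,
    to_field: str,
    query: str,
    shard_range: Tuple[int, int],
    routed: Optional[bool] = None,
    router_field: Optional[str] = None,
) -> Dict[str, str]:
    """Build the parameters of the remote join-key request (hypothetical helper)."""
    if routed is not None:
        # An explicit "routed" value wins outright.
        use_hash_range = routed
    else:
        # Fallback: routerField acts only as a flag compared against toField.
        use_hash_range = router_field is not None and router_field == to_field

    params = {"q": query, "fl": from_field}
    if use_hash_range:
        lower, upper = shard_range
        # The filter is always built on the *from* field, never on routerField.
        params["fq"] = "{!hash_range f=%s l=%d u=%d}" % (from_field, lower, upper)
    return params
```

Note that router_field never appears in the generated filter; it only decides whether the filter is added at all.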

*Question 2:* Is my reading of the code correct? Can we try to update the
documentation to be more explicit about this?

*# Routing*

*Question 3:* Is there a reason why actual routing was not used? I'm not
familiar with the Solr code base, but it seems like it'd be nicer to use
existing routing behavior in this context instead of querying everything and
filtering the results. This seems like it would need 2 things: first, the
_route_ value from the current "local" request, and second, either the
local client (like how solrj does) or the remote "/export" handler would
need to recognize and handle this parameter. Is that obviously doable or
not doable? I'm trying to understand why that approach wasn't taken
originally.
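For illustration only, this is roughly what a routed request for join keys might look like if the _route_ value were forwarded. The helper name, base URL, and collection name are made up, and whether the "/export" handler would honor _route_ in this context is exactly the open question above.

```python
from urllib.parse import urlencode


def routed_export_url(base_url: str, collection: str, query: str,
                      join_field: str, route_key: str) -> str:
    """Build a hypothetical /export request that carries a _route_ value."""
    params = {
        "q": query,
        "fl": join_field,
        "sort": f"{join_field} asc",
        # Assumption: the receiving side recognizes and applies _route_.
        "_route_": route_key,
    }
    return f"{base_url}/{collection}/export?{urlencode(params)}"


url = routed_export_url(
    "http://localhost:8983/solr",  # hypothetical host
    "Collection-2",
    "user_groups:xyz",
    "entity_id",
    "tenantId!",
)
print(url)
```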

*# Hashing*

Here is the behavior touted in the docs for HashRangeQueryParser
<https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#hash-range-query-parser>:
"""In the cross collection join case, the hash range query parser is used
to ensure that each shard only gets the set of join keys that would end up
on that shard. This query parser uses the MurmurHash3_x86_32. This is the
same as the default hashing for the default composite ID router in Solr."""

The documentation mentions "CompositeID router", which we know hashes the
prefix (split on "!") and routes on the first/top 16 bits of that hash (with
the lower 16 bits provided by the rest of the doc "id" on inserts).

The CrossCollectionJoinQuery uses 16 bits from the current/local shard
range, which seems fine and good. However, the HashRangeQuery appears to hash
the entire field
<https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/join/HashRangeQuery.java#L116-L117>.
So I'm struggling to understand how this would work, especially since the
join field and the "route" field are sourced from the same value. Either
the join field is a compositeId, in which case the HashRangeQuery code
appears to be invalid, as it would not hash "A!B" the same way the actual
router would hash "A"; or the join field is not a compositeId, in which case
for it to work it would have to hold exactly the value of the actual
compositeId prefix, as in this doc: {"id":"A!B", "myJoinField": "A"}. (Or
maybe using "router.field=myJoinField" works without the compositeId/"!"
format?)

And if the join field is not a compositeId, then the only thing you could
join on is the broad category (tenant/product/etc.) that is used as the
compositeId prefix. That would severely limit the use-case of the plugin: it
prevents joins on something more akin to record-ids/foreign-keys, and only
allows you to narrow down the results by what you know ahead of time to cram
into the "v=" query field.

*Question 4:* Not a specific question so much as "am I onto something here
or am I missing something and off base?"
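The mismatch can be checked with a small self-contained sketch. It implements MurmurHash3_x86_32 in pure Python and compares a whole-field hash with a two-part composite hash built as described above (top 16 bits from the prefix, lower 16 bits from the rest). The function names are mine, results are unsigned where Solr uses signed ints, and the composite layout assumes the default 16/16 bit split, so treat it as an approximation rather than Solr's code.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    rounded = n & ~3
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[rounded:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def whole_field_hash(value: str) -> int:
    """Hash the entire field value, as HashRangeQuery appears to do."""
    return murmur3_x86_32(value.encode("utf-8"))


def composite_id_hash(doc_id: str) -> int:
    """Two-part compositeId hash, assuming the default 16/16 bit split."""
    prefix, sep, rest = doc_id.partition("!")
    if not sep:
        return whole_field_hash(doc_id)
    return (whole_field_hash(prefix) & 0xFFFF0000) | (whole_field_hash(rest) & 0x0000FFFF)


for label, value in [
    ('composite  "A!B"', composite_id_hash("A!B")),
    ('whole      "A!B"', whole_field_hash("A!B")),
    ('whole      "A"  ', whole_field_hash("A")),
]:
    print(f"{label}: {value:08x}")
```

By construction, the composite hash of "A!B" shares its top 16 bits with the whole-field hash of "A", which is why a join field holding just the prefix value can line up with the shard range, while a whole-field hash of "A!B" has no such relationship.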

Actually, reading through the test code, I now see that my hypothesized "it
could only work if router key and join field are the same value" is
actually what is tested. The data is set up
<https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L128-L130> with
product_id as the compositeId prefix. Then all the test queries
<https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L166-L217>
are joins on another field with the same product_id value. So that explains
how it can work.

*# Alternative Use-Case*
While I'm here I guess I'll fill in the use-case I was hoping for based on
how we currently do local joins. We want to have two collections which both
route on the same tenantId, whereas our join is on more of a foreign-key,
as seen below.

// Collection-1
{
"id": "tenantId!abc",
"entity": "userUpload",
"entity_id": "abc",
"uploadedBy": "123"
}

// Collection-2
{
"id": "tenantId!123",
"entity": "user",
"entity_id": "123",
"user_groups": ["xyz",...]
}

// Query Collection-1, join example adapted to crossCollection. This will
// include user-upload documents that were uploaded by the user in group xyz.
{!join method="crossCollection"
fromIndex="Collection-2" // remote
from="entity_id" // remote
to="uploadedBy" // local
v="user_groups:xyz" // remote search filter
}

This query works locally and should work remotely, cross-collection, but it
appears incompatible with the current routing/hashing behavior of the
plugin.
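A rough sketch of why the hash-range optimization and this foreign-key join do not line up. The local document lands on a shard chosen by its tenant prefix, while the filter keeps only join keys whose own hash falls in that shard's range. Everything here is a simplification under stated assumptions: CRC32 stands in for Solr's MurmurHash3, shard ranges are equal unsigned slices aligned to 16-bit boundaries, and all helper names are hypothetical.

```python
import zlib
from typing import List, Tuple


def h32(value: str) -> int:
    """Stand-in 32-bit hash (CRC32). NOT the hash Solr uses."""
    return zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id: str) -> int:
    """Top 16 bits from the prefix, lower 16 bits from the rest of the id."""
    prefix, _, rest = doc_id.partition("!")
    return (h32(prefix) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)


def shard_ranges(num_shards: int) -> List[Tuple[int, int]]:
    """Equal slices of the unsigned 32-bit space, aligned to 16-bit boundaries."""
    step = (0x10000 // num_shards) << 16
    ranges = []
    for i in range(num_shards):
        lo = i * step
        hi = 0xFFFFFFFF if i == num_shards - 1 else lo + step - 1
        ranges.append((lo, hi))
    return ranges


def join_key_retrieved(local_doc_id: str, join_key: str,
                       ranges: List[Tuple[int, int]]) -> bool:
    """Would a hash-range filter for the local doc's shard keep this join key?"""
    doc_hash = composite_hash(local_doc_id)
    lo, hi = next(r for r in ranges if r[0] <= doc_hash <= r[1])
    return lo <= h32(join_key) <= hi


ranges = shard_ranges(4)
# Joining on the routing prefix itself always lines up with the local shard.
print(join_key_retrieved("tenantId!abc", "tenantId", ranges))
# Foreign-key style join values land in the local shard's range only by chance.
kept = sum(join_key_retrieved("tenantId!abc", str(k), ranges) for k in range(1000))
print(f"{kept} of 1000 foreign keys would survive the filter")
```

With a single shard every key survives, so the problem only shows up once the local collection is actually sharded and the hash-range filter is in effect.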

At this point I have worked through it enough that I understand how it
currently works, and on rereading the docs they kind of make more sense now,
as if the information was there the whole time. Still, I think this is
worth raising for awareness and discussion. I don't currently have the
need/time to update the plugin to expand its behavior, but I might be able
to update the documentation to make it clearer, so that others don't go
through the same rollercoaster and deep dive that I've gone through.

Thanks a bunch for any assistance or information regarding this!

Hello Zachariah,

You have sent this to the wrong list. This is the Lucene dev list. Your
message should go to dev@solr.apache.org.
The Lucene & Solr projects have split; not long ago, this would have been
the right list.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Thu, Dec 1, 2022 at 12:56 AM Zachariah Kendall <
zachariahkendall@gmail.com> wrote:
