Mailing List Archive

Parallel SQL join on multivalued fields
Hi,

I don't know if this is the right place for my question, anyway I'll try to
explain the issue here and understand together with you if it's worth
working on it.
I'm working with the parallel sql feature of Solr. Even though , looking at
the documentation, the join seems not supported, the join works. I gave a
look to the code and I understood that it works thanks to the calcite
features, (the framework on top of which is is build parellel sql feature).
My project doesn't need to works with big amount of data, and I think that
the calcite join feature can work well for my use case.
The problems arise when I need to join two multivalued field.
In parallel sql, the JOIN operation on two multivalued field seems to works
matching the two fields as a unique string, so that, for example , a
document with the join field like this :["a","b"]; will match with a
document with a join field exactly equal, like this: ["a","b"]; otherwise
if even only one element is different, they do not match.The right way to
do this should be first to explode the document in more document as much as
the number of element in the multivalued field (cross product on the field)
and then perform the join. I managed to solve the problem using streaming
expression:

innerJoin(
sort(
cartesianProduct(

search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
asc",qt="/select",rows="1000"),
supplier
),
by="supplier asc"
),
sort(
cartesianProduct(

search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
asc",qt="/select",rows="10000"),
supplier
),
by="supplier asc"
),
on="supplier"
)
with suplier as a multivalued field. It works very well.
Anyway it would be great if the JOIN of multivalued field performed this
behavior with cartesian product also in parallel sql.
I think this could be a very powerful improvement, apply sql in a not
normalized collection/table. It could be possible to implement this
feature? I will be very glad to work on it.

Thank you,

Piero
Re: Parallel SQL join on multivalued fields [ In reply to ]
any suggestion on this?
Thanks

Il giorno mer 1 lug 2020 alle ore 11:22 Piero Scrima <pierscr82@gmail.com>
ha scritto:

> Hi,
>
> I don't know if this is the right place for my question, anyway I'll try
> to explain the issue here and understand together with you if it's worth
> working on it.
> I'm working with the parallel sql feature of Solr. Even though , looking
> at the documentation, the join seems not supported, the join works. I gave
> a look to the code and I understood that it works thanks to the calcite
> features, (the framework on top of which is is build parellel sql feature).
> My project doesn't need to works with big amount of data, and I think that
> the calcite join feature can work well for my use case.
> The problems arise when I need to join two multivalued field.
> In parallel sql, the JOIN operation on two multivalued field seems to
> works matching the two fields as a unique string, so that, for example , a
> document with the join field like this :["a","b"]; will match with a
> document with a join field exactly equal, like this: ["a","b"]; otherwise
> if even only one element is different, they do not match.The right way to
> do this should be first to explode the document in more document as much as
> the number of element in the multivalued field (cross product on the field)
> and then perform the join. I managed to solve the problem using streaming
> expression:
>
> innerJoin(
> sort(
> cartesianProduct(
>
> search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
> asc",qt="/select",rows="1000"),
> supplier
> ),
> by="supplier asc"
> ),
> sort(
> cartesianProduct(
>
> search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
> asc",qt="/select",rows="10000"),
> supplier
> ),
> by="supplier asc"
> ),
> on="supplier"
> )
> with suplier as a multivalued field. It works very well.
> Anyway it would be great if the JOIN of multivalued field performed this
> behavior with cartesian product also in parallel sql.
> I think this could be a very powerful improvement, apply sql in a not
> normalized collection/table. It could be possible to implement this
> feature? I will be very glad to work on it.
>
> Thank you,
>
> Piero
>
Re: Parallel SQL join on multivalued fields [ In reply to ]
I think the first step would be comprehensive unit tests for joins in
Parallel SQL, coupled with performance tests so we understand how
distributed performs at scale through the calcites framework. Then we know
if we can actually say joins are really supported. Then we can add the
documentation.

If join support becomes part of parallel SQL then we can actively look at
improving them.

If you want to add the unit tests I can find the time to help commit and I
can help with the performance tests.




Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Jul 21, 2020 at 5:02 AM Piero Scrima <pierscr82@gmail.com> wrote:

> any suggestion on this?
> Thanks
>
> Il giorno mer 1 lug 2020 alle ore 11:22 Piero Scrima <pierscr82@gmail.com>
> ha scritto:
>
>> Hi,
>>
>> I don't know if this is the right place for my question, anyway I'll try
>> to explain the issue here and understand together with you if it's worth
>> working on it.
>> I'm working with the parallel sql feature of Solr. Even though , looking
>> at the documentation, the join seems not supported, the join works. I gave
>> a look to the code and I understood that it works thanks to the calcite
>> features, (the framework on top of which is is build parellel sql feature).
>> My project doesn't need to works with big amount of data, and I think that
>> the calcite join feature can work well for my use case.
>> The problems arise when I need to join two multivalued field.
>> In parallel sql, the JOIN operation on two multivalued field seems to
>> works matching the two fields as a unique string, so that, for example , a
>> document with the join field like this :["a","b"]; will match with a
>> document with a join field exactly equal, like this: ["a","b"]; otherwise
>> if even only one element is different, they do not match.The right way to
>> do this should be first to explode the document in more document as much as
>> the number of element in the multivalued field (cross product on the field)
>> and then perform the join. I managed to solve the problem using streaming
>> expression:
>>
>> innerJoin(
>> sort(
>> cartesianProduct(
>>
>> search(census_defence_system,q="*:*",fl="id,defence_system,description,supplier",sort="id
>> asc",qt="/select",rows="1000"),
>> supplier
>> ),
>> by="supplier asc"
>> ),
>> sort(
>> cartesianProduct(
>>
>> search(census_components,q="*:*",fl="id,compoenent_name,supplier",sort="id
>> asc",qt="/select",rows="10000"),
>> supplier
>> ),
>> by="supplier asc"
>> ),
>> on="supplier"
>> )
>> with suplier as a multivalued field. It works very well.
>> Anyway it would be great if the JOIN of multivalued field performed this
>> behavior with cartesian product also in parallel sql.
>> I think this could be a very powerful improvement, apply sql in a not
>> normalized collection/table. It could be possible to implement this
>> feature? I will be very glad to work on it.
>>
>> Thank you,
>>
>> Piero
>>
>