Mailing List Archive

Taxonomy vs SSDVFF for faceted search
Hello everyone,

We are trying to choose between Taxonomy and SortedSetDocValuesFacetField
implementations for faceted search, and based on available information and
our quick tests, the differences are as follows:

- Taxonomy is faster at query time (on our test workload, the difference is
sometimes higher than the documented 25%). SortedSet also adds latency to an
NRT refresh.
- Taxonomy is slower at index time and, unlike the SortedSet implementation,
does not scale as well beyond 4 threads (there is a lot of contention in the
DirectoryTaxonomyWriter#addCategory() and UTF8TaxonomyWriterCache.get()
synchronized blocks)
- SortedSet does not support hierarchical queries
- SortedSet does not require a sidecar index
- Tie-break differences for labels with the same count
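
As a rough illustration of the index-time point in the list above, here is a toy model of a taxonomy that hands out ordinals to category paths. It is not Lucene code: the class ToyTaxonomy, the "/" path separator and the method names are assumptions made for this sketch. It only shows why a single synchronized entry point can become a bottleneck with many indexing threads, and how parents receive ordinals before their children.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a taxonomy (hypothetical, not the Lucene implementation). */
final class ToyTaxonomy {
    // Full path (components joined with '/') -> ordinal. Assumes no '/' or empty components.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // parents.get(ordinal) is the ordinal of the parent category.
    private final List<Integer> parents = new ArrayList<>();

    ToyTaxonomy() {
        ordinals.put("", 0); // ordinal 0 is the root
        parents.add(-1);     // the root has no parent
    }

    /**
     * Returns the ordinal for a path such as ("Author", "Bob"), creating it and any
     * missing ancestors on first sight. Every caller takes the same lock, so
     * concurrent indexing threads queue up here.
     */
    synchronized int addCategory(String... path) {
        int parent = 0;
        StringBuilder key = new StringBuilder();
        for (String component : path) {
            if (key.length() > 0) {
                key.append('/');
            }
            key.append(component);
            Integer existing = ordinals.get(key.toString());
            if (existing == null) {
                existing = parents.size(); // next free ordinal, append-only
                ordinals.put(key.toString(), existing);
                parents.add(parent);
            }
            parent = existing;
        }
        return parent;
    }

    synchronized int parentOf(int ordinal) {
        return parents.get(ordinal);
    }

    synchronized int size() {
        return parents.size();
    }
}
```

In this model a document stores only the integer ordinals, and the path-to-ordinal mapping lives in a separate structure, which is the role the sidecar index plays in the discussion above.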

Am I missing something, or is that everything we should take into account as
of today?

I know that Solr and ES use their own faceting for historical reasons, but
are there any other large Lucene-based products that have chosen one
implementation over the other? Do we know why?
Any insight into lesser-known trade-offs and production experience is greatly
appreciated!

--
Thank you,
Alex
Re: Taxonomy vs SSDVFF for faceted search
Alex,

With Zulia, our Lucene-based implementation (
https://github.com/zuliaio/zuliasearch), we have gone back and forth. We
started with Taxonomy, switched, and then switched back to Taxonomy. In
our experience the Taxonomy-based approach is more scalable and
performant. We do large searches (sometimes returning millions of
results) with about 20 facets, some of them high-cardinality.
A small-dataset version of the Zulia-backed tool we released for
COVID can be found here (
https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907).
If you click on the facet tab you can see how we use facets. I believe the
use case might largely drive the choice.

Thanks,
Matt

On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <
alexanderlukyanchikov@gmail.com> wrote:

Re: Taxonomy vs SSDVFF for faceted search
Hi Matt,
That's very interesting, thanks for the response! Did you have any issues
with Taxonomy indexing performance, or try to optimize it somehow?
Also, any problems maintaining a sidecar index or experience building a
distributed system around it with sharding/rebalancing?

--
Regards,
Alex


On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics411@gmail.com> wrote:

Re: Taxonomy vs SSDVFF for faceted search
Alex,

We did consider trying to optimize Taxonomy indexing performance but we
never really got around to it. The sidecar index is annoying to deal with
and we have had occasional issues with it. Zulia has sharding implemented.
The main issue there is not the taxonomy but rather getting exact
counts without returning all facet values. We chose to implement a method
similar to Elasticsearch's (
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error).
For replication we plan to use the index replication built into
Lucene. The framework for routing queries and such is already there, but
the actual copying of the index has not been implemented yet, so I can't
speak to that. Hope this helps some.
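
The per-bucket count error idea mentioned above can be sketched roughly as follows. This is not Zulia's or Elasticsearch's code: the class ShardFacetMerge, its Bucket type and the shardSize parameter are hypothetical names for this example. The sketch assumes each shard returns at most shardSize labels with the highest counts. When a shard that returned a full list did not include a label, that label could still exist on the shard with at most the smallest count the shard returned, so that value is added to the label's error bound.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Merges per-shard top-N facet counts and tracks a worst-case error (hypothetical sketch). */
final class ShardFacetMerge {

    /** Merged count for one label plus an upper bound on the documents that may be missing. */
    static final class Bucket {
        final long count;
        final long errorUpperBound;

        Bucket(long count, long errorUpperBound) {
            this.count = count;
            this.errorUpperBound = errorUpperBound;
        }
    }

    /**
     * @param shardResults one map per shard: label -> count for that shard's top labels
     * @param shardSize    how many labels each shard was asked to return
     */
    static Map<String, Bucket> merge(List<Map<String, Long>> shardResults, int shardSize) {
        // Sum the counts that the shards did report.
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, Long> shard : shardResults) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                counts.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        Map<String, Bucket> merged = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            long error = 0;
            for (Map<String, Long> shard : shardResults) {
                boolean mayHaveOmitted = !shard.isEmpty() && shard.size() >= shardSize;
                if (mayHaveOmitted && !shard.containsKey(e.getKey())) {
                    // The label was cut off here, so it has at most the smallest returned count.
                    error += Collections.min(shard.values());
                }
            }
            merged.put(e.getKey(), new Bucket(e.getValue(), error));
        }
        return merged;
    }
}
```

A bucket whose error bound is zero has an exact count. A non-zero bound tells the caller how far off the count could be without asking every shard for all of its facet values.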

Thanks,
Matt


On Wed, Apr 28, 2021 at 5:48 PM Alexander Lukyanchikov <
alexanderlukyanchikov@gmail.com> wrote:

Re: Taxonomy vs SSDVFF for faceted search
Hi Alex-

Amazon's product search engine, a fairly large-scale application
(w.r.t. index size, traffic, and use-case complexity), is built on
top of Lucene. We have found taxonomy-based faceting to work
well for us generally, and haven't needed to do much to optimize
beyond what's already there. As you can imagine, with Amazon's catalog
being quite broad, we have a large number of unique facets available
for customers to use, which means a single facet-field storing all
dimensions can have high cardinality (as is the case by default with
taxonomy facets). This is an area where we have experimented a little
bit (e.g., "sharding" facets into separate fields to lower cardinality
of counting at query-time), but we tend to find Lucene works well
"as-is" for the most part in this space. The last bit I'll mention
here is that, for fields that are numeric and low-cardinality in
nature, LUCENE-7927
(https://issues.apache.org/jira/browse/LUCENE-7927) added the ability
to count these cases a bit more efficiently than trying to apply a
taxonomy-based approach.
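
To illustrate why numeric, low-cardinality fields can be counted without a taxonomy, here is a minimal sketch of direct value counting. It is not the code from the issue referenced above: the class SmallValueCounter and the DENSE_LIMIT cut-off are assumptions for this example. Small non-negative values are counted in a plain array indexed by the value itself, and anything else falls back to a hash map, so no label-to-ordinal lookup is needed.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts occurrences of numeric values directly (hypothetical sketch). */
final class SmallValueCounter {
    // Assumed cut-off for this sketch: values in [0, DENSE_LIMIT) use the array.
    private static final int DENSE_LIMIT = 1024;

    private final int[] dense = new int[DENSE_LIMIT];
    private final Map<Long, Integer> sparse = new HashMap<>();

    /** Records one occurrence of the given value. */
    void increment(long value) {
        if (value >= 0 && value < DENSE_LIMIT) {
            dense[(int) value]++; // the value doubles as the array index
        } else {
            sparse.merge(value, 1, Integer::sum); // rare or large values
        }
    }

    /** Returns how many times the value has been recorded. */
    int count(long value) {
        if (value >= 0 && value < DENSE_LIMIT) {
            return dense[(int) value];
        }
        return sparse.getOrDefault(value, 0);
    }
}
```

For a field such as a star rating from 1 to 5, every increment is a single array write.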

Happy faceting!

Cheers,
-Greg

On Thu, Apr 29, 2021 at 5:09 AM Matt Davis <kryptonics411@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Taxonomy vs SSDVFF for faceted search
Hi Greg, Matt,
Thank you for the responses, they're very helpful, and it's great to hear that
Taxonomy is successfully used for large-scale products!

Our biggest concern with it right now is future complications related to
index split and merge, which we are most likely going to use to implement
sharding and rebalancing. While split should not be too complicated
(pre-split taxonomy works for the divided parts, and we can build an
optimized taxonomy without unused categories in background for each new
index), the merge seems to be challenging and probably involves tricky
logic to translate ordinals from respective taxonomies, also taking into
account parent-child order guarantees for hierarchical categories.

I wonder if anyone has implemented something similar, or has any thoughts or
ideas about that?

--
Regards,
Alex


On Thu, Apr 29, 2021 at 6:08 AM Greg Miller <gsmiller@gmail.com> wrote:

Re: Taxonomy vs SSDVFF for faceted search
Interesting, Alex. So for your "merge" case, are you suggesting you
would have a different taxonomy index for each segment and would need
to merge those? I could be completely mistaken (I'm not nearly as
familiar with the indexing side of things), but I thought Lucene
maintains one single taxonomy index regardless of how many shards
there are. It should be append-only where new ordinals are created
when they're first seen, and then stay stable through merges. Or am I
misunderstanding your use-case and you're actually doing some shard
management on top of what Lucene is doing?

Cheers,
-Greg

On Thu, Apr 29, 2021 at 2:48 PM Alexander Lukyanchikov
<alexanderlukyanchikov@gmail.com> wrote:
Re: Taxonomy vs SSDVFF for faceted search
Hi Greg,
Yes, in the future we are considering implementing our own shard management with
whole-index split and merge operations, so for now we just wanted to make
sure that Taxonomy won't make it too complicated.
In fact, I recently found TaxonomyMergeUtils, which does just that:
it merges pairs of main and taxonomy indexes. It looks pretty straightforward,
although the complexity is a bit hidden there:
https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyMergeUtils.java

It's definitely slower than a regular index merge because of the additional
work to translate the ordinals, but it's still much better than a full
re-index.
I would appreciate any advice from anyone who has used this approach.
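
The ordinal translation mentioned above can be pictured with a small, self-contained sketch. It does not reproduce TaxonomyMergeUtils: the class TaxonomyMergeSketch, its method names and the representation of a taxonomy as a list of "/"-joined paths indexed by ordinal are all assumptions for this example. Because each source taxonomy stores parents before children, walking it in ordinal order and appending unseen paths to the destination keeps the parent-before-child guarantee there as well.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy merge of two taxonomies with ordinal translation (hypothetical sketch). */
final class TaxonomyMergeSketch {

    /**
     * Adds every category of source to dest (append-only) and returns a map where
     * map[sourceOrdinal] == destOrdinal. Both lists hold full paths indexed by
     * ordinal, with the root "" at 0 and parents stored before their children.
     * dest must be a mutable list.
     */
    static int[] mergeInto(List<String> dest, List<String> source) {
        Map<String, Integer> destOrdinals = new HashMap<>();
        for (int i = 0; i < dest.size(); i++) {
            destOrdinals.put(dest.get(i), i);
        }
        int[] map = new int[source.size()];
        for (int src = 0; src < source.size(); src++) {
            String path = source.get(src);
            Integer existing = destOrdinals.get(path);
            if (existing == null) {
                existing = dest.size(); // new category: append at the end
                dest.add(path);
                destOrdinals.put(path, existing);
            }
            map[src] = existing;
        }
        return map;
    }

    /** Rewrites the ordinals stored for one document using the translation map. */
    static int[] translate(int[] docOrdinals, int[] map) {
        int[] out = new int[docOrdinals.length];
        for (int i = 0; i < docOrdinals.length; i++) {
            out[i] = map[docOrdinals[i]];
        }
        return out;
    }
}
```

The extra cost compared with a plain index merge shows up in translate: every stored ordinal of every document from the source index has to be rewritten.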

--
Thank you,
Alex


On Fri, Apr 30, 2021 at 3:53 PM Greg Miller <gsmiller@gmail.com> wrote:

> Interesting Alex. So for your "merge" case, are you suggesting you
> would have a different taxonomy index for each segment and would need
> to merge those? I could be completely mistaken (I'm not nearly as
> familiar with the indexing side of things), but I thought Lucene
> maintains one single taxonomy index regardless of how many shards
> there are. It should be append-only where new ordinals are created
> when they're first seen, and then stay stable through merges. Or am I
> misunderstanding your use-case and you're actually doing some shard
> management on top of what Lucene is doing?
>
> Cheers,
> -Greg
>
> On Thu, Apr 29, 2021 at 2:48 PM Alexander Lukyanchikov
> <alexanderlukyanchikov@gmail.com> wrote:
> >
> > Hi Greg, Matt,
> > Thank you for the responses, it's very helpful and great to hear that
> > Taxonomy is successfully used for large scale products!
> >
> > Our biggest concern with it right now is future complications related to
> > index split and merge, which we are most likely going to use to implement
> > sharding and rebalancing. While split should not be too complicated
> > (pre-split taxonomy works for the divided parts, and we can build an
> > optimized taxonomy without unused categories in background for each new
> > index), the merge seems to be challenging and probably involves tricky
> > logic to translate ordinals from respective taxonomies, also taking into
> > account parent-child order guarantees for hierarchical categories.
> >
> > I wonder if anyone implemented something similar, or have any thoughts or
> > ideas about that?
> >
> > --
> > Regards,
> > Alex
> >
> >
> > On Thu, Apr 29, 2021 at 6:08 AM Greg Miller <gsmiller@gmail.com> wrote:
> >
> > > Hi Alex-
> > >
> > > Amazon's product search engine is built on top of Lucene, which is a
> > > fairly large-scale application (w.r.t. both index size, traffic and
> > > use-case complexity). We have found taxonomy-based faceting to work
> > > well for us generally, and haven't needed to do much to optimize
> > > beyond what's already there. As you can imagine, with Amazon's catalog
> > > being quite broad, we have a large number of unique facets available
> > > for customers to use, which means a single facet-field storing all
> > > dimensions can have high cardinality (as is the case by default with
> > > taxonomy facets). This is an area where we have experimented a little
> > > bit (e.g., "sharding" facets into separate fields to lower cardinality
> > > of counting at query-time), but we tend to find Lucene works well
> > > "as-is" for the most part in this sapce. The last bit I'll mention
> > > here is that, for fields that are numeric and low-cardinality in
> > > nature, LUCENE-7927
> > > (https://issues.apache.org/jira/browse/LUCENE-7927) added the ability
> > > to count these cases a bit more efficiently than trying to apply a
> > > taxonomy-based approach.
> > >
> > > Happy faceting!
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Thu, Apr 29, 2021 at 5:09 AM Matt Davis <kryptonics411@gmail.com>
> > > wrote:
> > > >
> > > > Alex,
> > > >
> > > > > We did consider trying to optimize Taxonomy indexing performance but we
> > > > > never really got around to it. The sidecar index is annoying to deal with
> > > > > and we have had occasional issues with it. Zulia has sharding implemented.
> > > > > The main issue here is not the taxonomy but rather just getting exact
> > > > > counts with returning all facet values. We chose to implement a method
> > > > > similar to Elasticsearch (
> > > > > https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error
> > > > > ).
> > > > > For replication we plan to use the index replication built into
> > > > > Lucene. The framework is currently there for routing queries and such but
> > > > > the actual copying of the index has not been implemented yet so I can't
> > > > > speak to that. Hope this helps some.
> > > >
> > > > Thanks,
> > > > Matt
> > > >
> > > >
> > > > On Wed, Apr 28, 2021 at 5:48 PM Alexander Lukyanchikov <
> > > > alexanderlukyanchikov@gmail.com> wrote:
> > > >
> > > > > Hi Matt,
> > > > > > It's very interesting, thanks for the response! Did you have any issues
> > > > > > with Taxonomy indexing performance, or maybe tried to optimize it somehow?
> > > > > > Also, any problems maintaining a sidecar index or experience building a
> > > > > > distributed system around it with sharding/rebalancing?
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Alex
> > > > >
> > > > >
> > > > > > On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics411@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > > Alex,
> > > > > >
> > > > > > > With our Lucene-based implementation of Zulia (
> > > > > > > https://github.com/zuliaio/zuliasearch) we have gone back and forth. We
> > > > > > > started with Taxonomy and switched and then switched back to taxonomy. In
> > > > > > > our experience the Taxonomy based approach is more scalable and
> > > > > > > performant. We do large searches (sometimes returning millions of
> > > > > > > results) with about 20 facets being run with some high cardinality facets.
> > > > > > > A small dataset version of the tool that is backed by zulia we released for
> > > > > > > covid can be found here (
> > > > > > > https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907
> > > > > > > ).
> > > > > > > If you click on the facet tab you can see how we use facets. I believe the
> > > > > > > use case might largely drive the choice.
> > > > > >
> > > > > > Thanks,
> > > > > > Matt
> > > > > >
> > > > > > On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <
> > > > > > alexanderlukyanchikov@gmail.com> wrote:
> > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > > We are trying to choose between Taxonomy and SortedSetDocValuesFacetField
> > > > > > > > implementations for faceted search, and based on available information and
> > > > > > > > our quick tests, the difference is the following -
> > > > > > > >
> > > > > > > > - Taxonomy is faster at query time (on our test workload, the difference
> > > > > > > > sometimes is higher than the documented 25%). Also SortedSet adds latency
> > > > > > > > to an NRT refresh.
> > > > > > > > - Taxonomy is slower at index time, and unlike the SortedSet implementation,
> > > > > > > > it does not scale as well with more than 4 threads (a lot of contention at
> > > > > > > > DirectoryTaxonomyWriter#addCategory() and UTF8TaxonomyWriterCache.get()
> > > > > > > > synchronized blocks)
> > > > > > > > - SortedSet does not support hierarchical queries
> > > > > > > > - SortedSet does not require a sidecar index
> > > > > > > > - Tie-break differences for labels with the same count
> > > > > > > >
> > > > > > > > Am I missing something, or is that everything we should take into account
> > > > > > > > as of today?
> > > > > > > >
> > > > > > > > I know that Solr and ES use their own faceting for historical reasons, but
> > > > > > > > are there any other large Lucene-based products, which have chosen one
> > > > > > > > implementation over another? Do we know why?
> > > > > > > > Any insight on lesser-known trade-offs and production experience is greatly
> > > > > > > > appreciated!
> > > > > > >
> > > > > > > --
> > > > > > > Thank you,
> > > > > > > Alex
> > > > > > >
> > > > > >
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>
>
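For anyone following the ordinal discussion quoted above (an append-only
taxonomy where ordinals are created on first sight, and the need to translate
ordinals when two taxonomies are merged), here is a minimal toy sketch of the
idea. It is plain Java and does not use Lucene; the class and method names
(TaxonomyMergeSketch, Taxonomy, merge) and the '/'-separated path encoding are
made up for illustration. As far as I know, Lucene's facet module has its own
helpers for this (TaxonomyMergeUtils and DirectoryTaxonomyWriter#addTaxonomy
with an OrdinalMap), so treat the code below only as a model of what such a
merge has to do, not as how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of merging one taxonomy into another. Hypothetical, not Lucene code. */
public class TaxonomyMergeSketch {

    /** Append-only taxonomy: a category's ordinal is its insertion position. */
    static final class Taxonomy {
        private final List<String> paths = new ArrayList<>();
        private final Map<String, Integer> ordinals = new HashMap<>();

        Taxonomy() {
            add(""); // ordinal 0 is reserved for the root (empty path)
        }

        /** Returns the ordinal of path, creating it and any missing ancestors. */
        int add(String path) {
            Integer existing = ordinals.get(path);
            if (existing != null) {
                return existing; // already known: the ordinal never changes
            }
            if (!path.isEmpty()) {
                int slash = path.lastIndexOf('/');
                // Add the parent first so parent ordinal < child ordinal.
                add(slash < 0 ? "" : path.substring(0, slash));
            }
            int ord = paths.size();
            paths.add(path);
            ordinals.put(path, ord);
            return ord;
        }

        String path(int ord) {
            return paths.get(ord);
        }

        int size() {
            return paths.size();
        }
    }

    /**
     * Merges src into dest and returns the translation table:
     * result[srcOrdinal] is the ordinal of the same category in dest.
     */
    static int[] merge(Taxonomy src, Taxonomy dest) {
        int[] map = new int[src.size()];
        // Walking src in ordinal order visits parents before children,
        // so dest keeps the same parent-before-child guarantee.
        for (int ord = 0; ord < src.size(); ord++) {
            map[ord] = dest.add(src.path(ord));
        }
        return map;
    }
}
```

In a real merge the returned table would then be used to rewrite the facet
ordinals stored in the source index's documents before they are added to the
destination index. Categories that exist on both sides keep the destination's
ordinal, and only unseen ones get new ordinals appended at the end.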