Mailing List Archive: SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

Mar 28, 2024, 8:37 AM

Hello,

We are trying to search for phrases where the last term is a prefix match.
For example, find all documents that contain "foo bar.*", with a
configurable slop between "foo" and "bar". We were able to do this using
`SpanNearQuery` where the last clause is a `SpanMultiTermQueryWrapper` that
wraps a `PrefixQuery`. However, this seems to run into the limit of 1024
clauses very quickly if the last term appears as a common prefix in the
index.

I have a branch that reproduces the query at
https://github.com/apache/lucene/compare/main...yixunx:yx/span-query-limit?expand=1,
and also pasted the code below.

It seems that if slop = 0 then we can use `MultiPhraseQuery` instead, which
doesn't hit the clause limit. For the slop != 0 case, is it intended that
`SpanMultiTermQueryWrapper` can easily hit the clause limit, or am I using
the queries wrong? Is there a workaround other than increasing
`maxClauseCount`?

Thank you for the help!

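```java
// Package names below assume a recent Lucene (9.1+ / main); adjust for older versions.
import java.util.UUID;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.queries.spans.SpanNearQuery;
import org.apache.lucene.queries.spans.SpanTermQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.util.LuceneTestCase;

public class TestSpanNearQueryClauseLimit extends LuceneTestCase {

  private static final String FIELD_NAME = "field";
  private static final int NUM_DOCUMENTS = 1025;

  /**
   * Creates an index with NUM_DOCUMENTS documents. Each document has a
   * text field in the form of "abc foo bar_[UUID]".
   */
  private Directory createIndex() throws Exception {
    Directory dir = newDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < NUM_DOCUMENTS; i++) {
        Document doc = new Document();
        doc.add(new TextField(FIELD_NAME, "abc foo bar_" + UUID.randomUUID(), Field.Store.YES));
        writer.addDocument(doc);
      }
      writer.commit();
    }
    return dir;
  }

  public void testSpanNearQueryClauseLimit() throws Exception {
    Directory dir = createIndex();

    // Find documents that match "abc <some term> bar.*", which should
    // match all documents.
    try (IndexReader reader = DirectoryReader.open(dir)) {
      Query query = new SpanNearQuery.Builder(FIELD_NAME, true)
          .setSlop(1)
          .addClause(new SpanTermQuery(new Term(FIELD_NAME, "abc")))
          .addClause(new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term(FIELD_NAME, "bar"))))
          .build();

      // This throws an exception if NUM_DOCUMENTS is > 1024:
      //   org.apache.lucene.search.IndexSearcher$TooManyNestedClauses:
      //   Query contains too many nested clauses;
      //   maxClauseCount is set to 1024
      TopDocs docs = new IndexSearcher(reader).search(query, 10);
      System.out.println(docs.totalHits);
    }

    dir.close();
  }
}
```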

Thank you,
Yixun Xu

Re: SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

rcmuir at gmail

Mar 28, 2024, 9:57 AM

Using spans and wildcards together is asking for trouble: you will hit
limits, and it is inefficient by definition.
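
A rough, Lucene-free model of why the limit is hit so quickly, assuming the wrapped prefix is expanded at search time into one clause per matching term in the index. The class and method names here (`PrefixExpansionSketch`, `expandPrefix`, `exceedsLimit`) are hypothetical and only illustrate the counting; they are not Lucene's API, and `bar_<i>` stands in for the `bar_<UUID>` terms from the test above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model of prefix expansion; not Lucene code.
class PrefixExpansionSketch {

    // Default clause limit quoted in the thread.
    static final int MAX_CLAUSE_COUNT = 1024;

    /** Returns every term in the sorted dictionary that starts with the prefix. */
    static List<String> expandPrefix(SortedSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so we can
        // seek to the prefix and stop at the first term that no longer matches.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    /** True if expanding the prefix would need more clauses than the limit allows. */
    static boolean exceedsLimit(SortedSet<String> termDictionary, String prefix) {
        return expandPrefix(termDictionary, prefix).size() > MAX_CLAUSE_COUNT;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>();
        dictionary.add("abc");
        dictionary.add("foo");
        // One unique "bar_..." term per document, as in the test above.
        for (int i = 0; i < 1025; i++) {
            dictionary.add("bar_" + i);
        }
        System.out.println("terms matching bar*: " + expandPrefix(dictionary, "bar").size());
        System.out.println("exceeds limit: " + exceedsLimit(dictionary, "bar"));
    }
}
```

The number of clauses depends on how many distinct terms share the prefix, not on how many documents match, which is why a field full of unique suffixes is the worst case.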

I'd recommend changing your indexing so that your queries are fast
and you aren't using wildcards that enumerate many terms at
search time.
Don't index words such as "bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6"
and then use wildcards to match just "bar".
Instead, add a synonym "bar" (or similar, whatever you want) to
"bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6".
This way you can match it with an ordinary TermQuery: "bar".

E.g. for your simple example, this would look approximately like this:
instead of: "abc foo bar_" + UUID.randomUUID()
index something like: "abc foo bar bar_" + UUID.randomUUID()

But if you add the synonym with an analyzer, then
bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6 and its synonym "bar" will
sit at the same position, so your spans/sloppy phrases will work fine.
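
The same-position idea can be sketched without Lucene. This is a minimal model, assuming whitespace tokenization and assuming the synonym is simply the part of the token before the first underscore; `SamePositionSynonymSketch`, `analyze`, and `orderedNear` are hypothetical names, and a real setup would do this inside the analysis chain rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of same-position synonyms; not Lucene code.
class SamePositionSynonymSketch {

    /** A term together with its position in the field. */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Splits on whitespace. For a token such as "bar_1234" it also emits
     * "bar" at the SAME position (the equivalent of a position increment of 0).
     */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String word : text.trim().split("\\s+")) {
            tokens.add(new Token(word, position));
            int underscore = word.indexOf('_');
            if (underscore > 0) {
                tokens.add(new Token(word.substring(0, underscore), position));
            }
            position++;
        }
        return tokens;
    }

    /** Ordered proximity match: `second` follows `first` with at most `slop` positions between them. */
    static boolean orderedNear(List<Token> tokens, String first, String second, int slop) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Token b : tokens) {
                if (b.term.equals(second)
                        && b.position > a.position
                        && b.position - a.position - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = analyze("abc foo bar_294e50e1");
        // A single exact term "bar" now stands in for the prefix "bar*".
        System.out.println(orderedNear(tokens, "abc", "bar", 1));
        System.out.println(orderedNear(tokens, "abc", "bar", 0));
    }
}
```

Because the synonym shares a position with the original token, the distance between "abc" and "bar" is the same as the distance between "abc" and the full token, so the slop semantics of the original query are preserved while the query itself needs only one clause for the last term.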

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

yixunx at gmail

Mar 28, 2024, 12:52 PM

That makes sense. Thank you!