Mailing List Archive

Memory Usage?
We're having problems with memory allocation (and thus garbage collection)
that seem to be directly attributable to our heavy use of Lucene indexes. I
don't have a lot of data on this yet, but has anyone else seen this as an
issue? Is there any way that Lucene could be tuned to use less memory
during searches?

Thanks,
Scott
RE: Memory Usage?
I'm surprised that your memory use is that high.

An IndexReader requires:
  one byte per field per document in index (norms)
  one open file per file in index
  1/128 of the Terms in the index
    a Term has two pointers (8 bytes)
    and a String (4 pointers = 24 bytes, one to 16-bit chars)

A Search requires:
  1 1024-byte buffer per TermQuery
  2 128-int buffers per TermQuery
  2 1024-byte buffers per PhraseQuery term
  1 1024-element bucket array per BooleanQuery
    each bucket has 5 fields, and hence requires ~20 bytes
  1 bit per document in index per DateFilter

A Hits requires:
  up to n+100 ScoreDocs (float+int, 8 bytes),
    where n is the highest Hits.doc(n) accessed
  up to 200 Document objects

I may have forgotten something...

Let's assume that your 1M-document index has 2M unique terms, that you only
look at the top 100 hits, that your index has three fields, and that the
typical document has two stored fields of 20 characters each. Your 30-term
boolean query over this 1M-document index should then use around the
following numbers of bytes:
IndexReader:
  3,000,000 (norms)
  1,000,000 (1/128 of 2M terms, each requiring ~50 bytes)
during search:
  50,000 (TermQuery buffers)
  20,000 (BooleanQuery buckets)
  100,000 (DateFilter bit vector)
in Hits:
  2,000 (200 ScoreDocs)
  30,000 (up to 200 cached Documents)

So searches should run in about a 5MB heap (the items above total roughly
4.2MB). Are my assumptions off?

You can also see why it is useful to keep a single IndexReader and use it
for all queries. (IndexReader is thread safe.)
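
For example, something like this (a minimal sketch; the index path, field
name, and analyzer choice are illustrative assumptions, not prescribed):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.*;

  public class SharedSearcher {
    public static void main(String[] args) throws Exception {
      // Open once at startup; since IndexReader is thread safe, every
      // query from every thread can go through this single instance.
      IndexReader reader = IndexReader.open("index");
      Searcher searcher = new IndexSearcher(reader);

      Query query = QueryParser.parse(args[0], "contents",
                                      new StandardAnalyzer());
      Hits hits = searcher.search(query);

      // Only touch the hits you need: Hits fetches ScoreDocs in batches
      // and caches up to ~200 Document objects.
      for (int i = 0; i < hits.length() && i < 100; i++) {
        Document doc = hits.doc(i);
        System.out.println(hits.score(i) + " " + doc.get("path"));
      }
    }
  }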

You could also run 'java -Xrunhprof:heap=sites' to see what's using memory.
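
For example (YourSearchApp stands in for your own main class; on the Sun
JDKs the profile is written to java.hprof.txt when the VM exits):

  java -Xrunhprof:heap=sites YourSearchApp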

Doug

RE: Memory Usage?
I am not very familiar with the output of -Xrunhprof, but I've attached the
output of a run of a search through an index of 50,000 documents. It gave
me out-of-memory errors until I allocated 100 megabytes of heap space.

The top 10:

SITES BEGIN (ordered by live bytes) Sun Nov 11 15:50:31 2001
          percent           live           alloc'ed         stack  class
 rank   self  accum      bytes  objs       bytes     objs    trace  name
    1 26.41% 26.41%   12485200 12005    45566560    43814     1783  [B
    2 25.18% 51.59%   11904880 11447    44867680    43142     1796  [B
    3  4.15% 55.74%    1962904 69214   171546352  5510292     1632  [C
    4  3.83% 59.58%    1812096  3432     1812096     3432     1768  [I
    5  3.83% 63.41%    1812096  3432     1812096     3432     1769  [I
    6  3.34% 66.75%    1580688 65862   130618992  5442458     1631  java.lang.String
    7  3.19% 69.95%    1509584 44763     1509584    44763      458  [C
    8  3.03% 72.98%    1432416 44763     1432416    44763      459  org.apache.lucene.index.TermInfo
    9  2.27% 75.25%    1074312 44763     1074312    44763      457  java.lang.String
   10  2.23% 77.48%    1053792 65862    87079328  5442458     1631  org.apache.lucene.index.Term

and the top 3 traces were:

TRACE 1783:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)

TRACE 1796:
  org.apache.lucene.store.InputStream.refill(InputStream.java:165)
  org.apache.lucene.store.InputStream.readByte(InputStream.java:80)
  org.apache.lucene.store.InputStream.readVInt(InputStream.java:106)
  org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:100)

TRACE 1632:
  java.lang.String.<init>(String.java:198)
  org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:134)
  org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:114)
  org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:166)

I've attached the whole trace as gzipped.txt

regards,
Anders Nielsen

RE: Memory Usage?
This was a single query? How many terms, and of what type are in the query?
From the trace it looks like there could be over 40,000 terms in the query!
Is this a prefix or wildcard query? These can generate *very* large
queries...

Doug
RE: Memory Usage?
This was a big boolean query with several prefix queries, but no wildcard
queries, in the OR branches.

RE: Memory Usage?
> From: Anders Nielsen [mailto:anders@visator.dk]
>
> This was a big boolean query with several prefix queries, but no
> wildcard queries, in the OR branches.

Well it looks like those prefixes are expanding to a lot of terms, a total
of over 40,000! (A prefix query expands into a BooleanQuery with all the
terms matching the prefix.)
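
Roughly, the expansion works like this (a sketch against the TermEnum API,
not the exact PrefixQuery code):

  import java.io.IOException;
  import org.apache.lucene.index.*;
  import org.apache.lucene.search.*;

  public class PrefixExpansion {
    // Expand prefix* on a field into one optional TermQuery per matching
    // indexed term -- this is why a short prefix can balloon into tens
    // of thousands of terms.
    public static BooleanQuery expand(IndexReader reader, String field,
                                      String prefix) throws IOException {
      BooleanQuery expanded = new BooleanQuery();
      TermEnum terms = reader.terms(new Term(field, prefix));
      try {
        do {
          Term t = terms.term();
          if (t == null || !t.field().equals(field)
              || !t.text().startsWith(prefix))
            break;                                      // past the prefix range
          expanded.add(new TermQuery(t), false, false); // optional clause
        } while (terms.next());
      } finally {
        terms.close();
      }
      return expanded;
    }
  }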

If most of these expansions are low-frequency, then a simple fix should
improve things considerably. I've attached an optimized version of
TermQuery that will hold less memory per low-frequency term. In
particular, if a term occurs fewer than 128 times, then its 1024-byte
InputStream buffer is freed immediately.
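
The idea, in rough outline (a sketch of the approach only, not the actual
attachment):

  import java.io.IOException;
  import org.apache.lucene.index.*;

  class SmallTermReader {
    // For a low-frequency term, read the whole posting list up front and
    // release the underlying InputStream (and its 1024-byte buffer) right
    // away, instead of holding the stream open for the whole search.
    static void readSmallPostings(IndexReader reader, Term term)
        throws IOException {
      int df = reader.docFreq(term);
      if (df < 128) {                      // fits in a single buffer refill
        int[] docs = new int[df];
        int[] freqs = new int[df];
        TermDocs td = reader.termDocs(term);
        for (int i = 0; i < df && td.next(); i++) {
          docs[i] = td.doc();
          freqs[i] = td.freq();
        }
        td.close();                        // buffer freed immediately
        // ... score from the in-memory arrays ...
      }
    }
  }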

Tell me how this works. Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times,
perhaps BooleanScorer should use a different algorithm when there are
thousands of terms. In this case it might use less memory to construct an
array of score buckets for all documents. If (query.termCount() * 1024) >
(12 * getMaxDoc()) then this would use less memory. In your case, with
500,000 documents and a 40,000 term query, it's currently taking 40MB/query,
and could be done in 6MB/query. This optimization would not be too
difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.
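
In code form the choice might look like this (termCount() is hypothetical,
not an existing method):

  // ~1024 bytes of buffer per query term, vs. ~12 bytes of score bucket
  // per document in the index: pick whichever costs less.
  boolean useDocumentBuckets =
      (long) query.termCount() * 1024 > 12L * reader.maxDoc();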

Doug
RE: Memory Usage?
I think something like this would be a HUGE boon for us. We do a lot of
complex queries on a lot of different indexes and end up suffering from
severe garbage collection issues on our system. I'd be willing to help out
in any way to make this issue go away as soon as possible.

Scott

RE: Memory Usage?
> From: Scott Ganyo [mailto:scott.ganyo@eTapestry.com]
>
> I think something like this would be a HUGE boon for us. We do a lot
> of complex queries on a lot of different indexes and end up suffering
> from severe garbage collection issues on our system. I'd be willing
> to help out in any way to make this issue go away as soon as possible.

Did you try the code I just sent out? Did it help much?

A problem with things like PrefixQuery is that they let folks easily
construct queries which are *very* expensive to evaluate. It is no
coincidence that Google et al. do not permit this sort of query. So, while
we can remove some of the GC overhead, don't forget that these are still
expensive operations and will still be rather slow. A feature like
PrefixQuery should thus be used sparingly.

Doug

Re: Memory Usage?
> This was a single query? How many terms, and of what type are in the query?
> From the trace it looks like there could be over 40,000 terms in the query!
> Is this a prefix or wildcard query? These can generate *very* large
> queries...

I think the fact that prefix / wildcard queries can generate such
nasty huge queries is a good reason to consider taking the Foo* syntax
out of the query parser. I realize that it's cool, and there are some
people who probably rely on it already, but it seems that the prefix
query is in the category of "professional driver, don't try this at
home" and therefore should be kept behind the locked cabinet where
ordinary shoppers won't take it home by accident (to mix some
metaphors horribly).

RE: Memory Usage?
Hmm, I seem to be getting a different number of hits when I use the files
you sent out.

RE: Memory Usage?
> From: Anders Nielsen [mailto:anders@visator.dk]
>
> Hmm, I seem to be getting a different number of hits when I use the
> files you sent out.

Please provide more information! Is it larger or smaller than before? By
how much? What differences show up in the hits? That's a terrible bug
report...

I think before it may have been possible to get a spurious hit if a query
term only occurred in deleted documents. A wildcard query with 40,000 terms
might make this sort of thing happen more often, and unless you tried to
access the Hits.doc() for such a hit, you would not see an error. If this
was in fact a problem, the code I just sent out would have fixed it. So
your results may in fact be better. Or there may be a bug in what I sent.
Or both!

For the cases I have tried I get the same results with and without those
changes.

Doug

Re: Memory Usage?
Since this is changing behavior that people are depending on, what about
creating a new QueryParser called QueryParserSafe that excludes that
option? I don't like the idea of removing functionality with no backward
compatibility.

Any thoughts?

--Peter
RE: Memory Usage?
> -----Original Message-----
> From: Brian Goetz [mailto:brian@quiotix.com]
> Sent: Tuesday, November 13, 2001 8:58 AM
> To: Lucene Users List
> Cc: lucene-dev@jakarta.apache.org
> Subject: Re: Memory Usage?
>
> > Since this is changing behavior that people are depending on, what
> > about creating a new QueryParser called QueryParserSafe that excludes
> > that option? I don't like the idea of removing functionality with no
> > backward compatibility.
>
> I knew this was coming.
>
> I'm sorry, but I have to laugh just a little bit. The new query
> parser has only existed for less than two months -- and people have
> built empires based on it? I'm perfectly willing to debate whether
> it's a good idea or not to remove the wildcard match syntax from the
> query parser, but I think the "backward compatibility" argument is one
> of the less compelling arguments against doing so. Bear in mind that
> no one is suggesting removing the functionality from the core -- just
> restricting its use to programmatically generated queries. A strong
> argument can be made for not exposing the "don't try this at home"
> behavior through an interface that is bound to be used by naive
> end-users.

How about this:
"You must have at least four non-wildcard characters in a word before
you introduce a wildcard." (source:
http://www.northernlight.com/docs/search_help_optimize.html)

I think the best approach would be to have a parameter (of the query
parser? of IndexSearcher?) to set the minimum number of non-wildcard
characters required before any wildcard.
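
For instance, a guard along these lines (hypothetical sketch --
MIN_PREFIX_LENGTH and checkPrefix are made-up names, not an existing
Lucene option):

  import org.apache.lucene.queryParser.ParseException;

  class PrefixGuard {
    static final int MIN_PREFIX_LENGTH = 4;  // e.g. Northern Light's rule

    // Reject prefixes shorter than the minimum before they ever reach
    // PrefixQuery and get expanded.
    static void checkPrefix(String prefix) throws ParseException {
      if (prefix.length() < MIN_PREFIX_LENGTH)
        throw new ParseException(
            "at least " + MIN_PREFIX_LENGTH
            + " non-wildcard characters required before a wildcard");
    }
  }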

peter
RE: Memory Usage?
Oh, I'm sorry. I was searching through an index of about 50,000 documents.
I have no deleted entries.

With the files you sent out I got 4 hits; with the old lucene.jar file I
got 8.

I'll try to investigate further which hits don't show up, and why.



Kind regards,
Anders Nielsen

