Mailing List Archive

Fwd: Error in function refill...
Message forwarded to the list at Henry's request. -- Marvin
-----------------------------------------------------------

Hello all,

I'm getting the following error when searching on a field ('site' in
this case) - or no error at all when performing a phrase search (but
not all phrase searches result in an error):

Error in function refill at ../c_src/KinoSearch/Store/InStream.c:100:
Read past EOF of /www/sites/zen.co.za/htdocs/test00/testindex/_2.cf
(start: 3184 len 3097),...

\t at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 158,...

\tKinoSearch::Searcher::search_hit_collector('KinoSearch::Searcher=HASH(0x1092570)',
'hit_collector', 'KinoSearch::Search::SortCollector=SCALAR(0x10c78b0)',
'weight', 'KinoSearch::Search::PhraseWeight=HASH(0x11064d0)', 'filter',
'undef') called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 109,...

\tKinoSearch::Searcher::search_top_docs('KinoSearch::Searcher=HASH(0x1092570)',
'num_wanted', 10, 'query',
'KinoSearch::Search::PhraseQuery=HASH(0x114db80)', 'filter', 'undef',
'sort_spec', 'KinoSearch::Search::SortSpec=HASH(0x63c160)', ...)
called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Search/Hits.pm...

\tKinoSearch::Search::Hits::seek(3) called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 77,...

\tKinoSearch::Searcher::search('KinoSearch::Searcher=HASH(0x1092570)',
'query', 'site:www.site.com/index.html', 'offset', 0, 'num_wanted', 10,
'sort_spec', 'KinoSearch::Search::SortSpec=HASH(0x63c160)', ...)
called at
/www.../htdocs/test/searchtest.cgi line 46,...
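
Both error reports in this thread show a start offset at or beyond the
reported len (3184 vs 3097, and 480 vs 480). The following is a minimal
sketch of the kind of bounds check that could raise such a message. It
assumes that "len" is the file length and that the check is a simple
comparison; bytes_available is a hypothetical name, and this is not the
actual InStream.c logic.

```perl
use strict;
use warnings;

# Hypothetical bounds check (an assumption, not KinoSearch code): refuse
# to refill the read buffer when the requested offset is at or past the
# end of the file.
sub bytes_available {
    my ( $path, $start, $len ) = @_;
    die "Read past EOF of $path (start: $start len $len)\n"
        if $start >= $len;
    return $len - $start;
}

# Offsets and lengths taken from the two reports in this thread.
for my $case ( [ '_2.cf', 3184, 3097 ], [ '_1.cf', 480, 480 ] ) {
    my $ok = eval { bytes_available(@$case); 1 };
    print $ok ? "ok\n" : "error: $@";
}
```

Read this way, the message means the searcher was handed a file position
that does not exist in the file, which points at a bad stored pointer
rather than a truncated file.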


This seems to happen with 0.20 or the latest svn build. In each case, I
reindexed the test files, and the schema is the same, etc.

I'm sorting using a KinoSearch::Search::SortSpec object on a special
field which is indexed but not analysed (the error occurs with or
without the sorting btw).

Somehow, it's related to the content of the second batch of test files:
I'm indexing two sites (each with many pages) and merging the
subindexes into one, which is then searched. If I only index and search
site (a), no errors occur, but site (b) presents a problem whether its
index is separate or merged.
Fwd: Error in function refill... [ In reply to ]
Here's a repeatable test case using the us_constitution/ data
which on my setup fails consistently (using Rev: 2228):

Change USConSchema.pm and add another field (title2 in this example):

our %FIELDS = (
    title   => 'KinoSearch::Schema::FieldSpec',
    content => 'KinoSearch::Schema::FieldSpec',
    url     => 'USConSchema::UnIndexedField',

    title2  => 'KinoSearch::Schema::FieldSpec',
);


Change invindexer.plx:

$doc{title} = $1;
$doc{title2} = "test.co.us";


Run ./invindexer.plx

Then search for:

title2:test.co.us (FAILS)
title2:test.co.us/ (FAILS)
title2:test;co;us (FAILS)
title2:test.co (WORKS)
title2:test.co.us/123 (WORKS)


The apache error log:

[Wed Mar 28 17:34:12 2007] [error] [client 19.2.18.11] Read past EOF
of uscon_invindex/_1.cf (start: 480 len 480) at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 169
[Wed Mar 28 17:34:12 2007] [error] [client 19.2.18.11]
\tKinoSearch::Searcher::collect('KinoSearch::Searcher=HASH(0xf763d0)',
'collector', 'KinoSearch::Search::TopDocCollector=SCALAR(0xf76850)',
'query', 'KinoSearch::Search::PhraseQuery=HASH(0xff4470)', 'filter',
'undef', 'num_wanted', 10, ...) called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 106
[Wed Mar 28 17:34:12 2007] [error] [client 19.2.18.11]
\tKinoSearch::Searcher::search_top_docs('KinoSearch::Searcher=HASH(0xf763d0)',
'num_wanted', 10, 'query',
'KinoSearch::Search::PhraseQuery=HASH(0xff4470)', 'filter', 'undef',
'sort_spec', 'undef', ...) called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Search/Hits.pm
line 47
[Wed Mar 28 17:34:12 2007] [error] [client 19.2.18.11]
\tKinoSearch::Search::Hits::seek(3) called at
/usr/lib/perl5/site_perl/5.8.8/x86_64-linux-thread-multi/KinoSearch/Searcher.pm
line 76
[Wed Mar 28 17:34:12 2007] [error] [client 19.2.18.11]
\tKinoSearch::Searcher::search('KinoSearch::Searcher=HASH(0xf763d0)',
'query', 'title2:test.co.us', 'offset', 0, 'num_wanted', 10) called at
/www/sites/test/htdocs/test00/search.cgi line 29
Fwd: Error in function refill... [ In reply to ]
On Mar 28, 2007, at 8:22 AM, Henk - CityWEB wrote:

> Here's a repeatable test case using the us_constitution/ data

Fab! Thanks!

I was *so* not looking forward to trying to track this down in a huge
dataset.

Banzai,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Fwd: Error in function refill... [ In reply to ]
On Wed, 28 Mar 2007, Marvin Humphrey wrote:

> > Here's a repeatable test case using the us_constitution/ data
>
> Fab! Thanks!
>
> I was *so* not looking forward to trying to track this down in a huge
> dataset.


Hi Marvin,

Any progress on this one?

Regards
Henry
Fwd: Error in function refill... [ In reply to ]
On Apr 2, 2007, at 11:46 PM, Henk - CityWEB wrote:

> On Wed, 28 Mar 2007, Marvin Humphrey wrote:
>
>>> Here's a repeatable test case using the us_constitution/ data
>>
>> Fab! Thanks!
>>
>> I was *so* not looking forward to trying to track this down in a huge
>> dataset.
>
>
> Hi Marvin,
>
> Any progress on this one?

I've made progress. It appears to be an indexing glitch, rather than
a search-time issue -- however, I think it occurs pretty infrequently
and does not cause corruption to propagate with ongoing incremental
indexing. If I'm right, it affects the first term in any field other
than the first when that term occurs more than 16 times in the
collection. (In your test, the offending term is "title2:co".) And
it only affects phrase queries at present, which is how QueryParser
interprets that domain name search.
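
To illustrate why a domain name search ends up as a phrase query, here
is a minimal sketch. It assumes an analyzer that breaks text on any run
of non-word characters; tokenize and query_kind are hypothetical helpers
for illustration, not QueryParser's API.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query value into lowercase word tokens,
# assuming any run of non-word characters is a token boundary.
sub tokenize {
    my ($text) = @_;
    return map { lc($_) } ( $text =~ /\w+/g );
}

# Hypothetical helper: more than one token means the value has to match
# as an ordered sequence of terms, i.e. as a phrase.
sub query_kind {
    my ($text) = @_;
    my @tokens = tokenize($text);
    return @tokens > 1 ? 'phrase' : 'term';
}

for my $value ( 'test.co.us', 'test;co;us', 'test' ) {
    printf "%-12s => %-6s [%s]\n", $value, query_kind($value),
        join( ' ', tokenize($value) );
}
```

Under that assumption, 'test.co.us' and 'test;co;us' both reduce to the
same three tokens. The sketch only shows why these searches are phrase
queries; it does not predict which of them fail.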

Hopefully I will have confirmation and a fix soon.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Fwd: Error in function refill... [ In reply to ]
On Tue, 3 Apr 2007, Marvin Humphrey wrote:
> I've made progress. It appears to be an indexing glitch, rather than
> a search-time issue -- however, I think it occurs pretty infrequently
> and does not cause corruption to propagate with ongoing incremental
> indexing. If I'm right, it affects the first term in any field other
> than the first when that term occurs more than 16 times in the
> collection. (In your test, the offending term is "title2:co".) And
> it only affects phrase queries at present, which is how QueryParser
> interprets that domain name search.
>
> Hopefully I will have confirmation and a fix soon.

Excellent! Thanks Marvin.
Fwd: Error in function refill... [ In reply to ]
On Mar 28, 2007, at 8:22 AM, Henk - CityWEB wrote:
> Here's a repeatable test case using the us_constitution/ data

This bug is solved as of revision 2314. The file format also changed
with that revision, so you'll need to regenerate any indexes if you
upgrade. There will be other file format changes over the next
couple of days, as well, so heads up.

The problem was similar to the crashing bug that Karel K. isolated in
0.20_01 and that was fixed in 0.20_02. Again, we've switched files
as we've switched fields, but failed to update a file pointer. Again,
it only affects fields other than the first. Again, it's
intermittent -- only affecting terms which occur more than 16 times
in the corpus.
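
The class of bug described above can be sketched with a toy model. It
assumes each field's data lives in its own file and that positions are
stored as deltas from the previous position; encode_pointers and
decode_pointers are hypothetical names, and none of this is the actual
KinoSearch code or fix.

```perl
use strict;
use warnings;

# Toy writer (an assumption for illustration): store each file position
# as a delta from the previous one. Each field has its own file, so the
# base has to restart at 0 whenever the writer switches fields.
sub encode_pointers {
    my ( $fields, $reset_on_switch ) = @_;
    my @encoded;
    my $last = 0;
    for my $pointers (@$fields) {
        $last = 0 if $reset_on_switch;    # the step the buggy writer skips
        my @deltas;
        for my $pointer (@$pointers) {
            push @deltas, $pointer - $last;
            $last = $pointer;
        }
        push @encoded, \@deltas;
    }
    return \@encoded;
}

# Toy reader: always starts each field's file at position 0.
sub decode_pointers {
    my ($encoded) = @_;
    my @fields;
    for my $deltas (@$encoded) {
        my $pointer = 0;
        my @pointers;
        for my $delta (@$deltas) {
            $pointer += $delta;
            push @pointers, $pointer;
        }
        push @fields, \@pointers;
    }
    return \@fields;
}

my @fields = ( [ 0, 40, 95 ], [ 0, 12, 30 ] );
for my $reset ( 1, 0 ) {
    my $decoded = decode_pointers( encode_pointers( \@fields, $reset ) );
    printf "reset=%d  field 1: %s  field 2: %s\n", $reset,
        join( ',', @{ $decoded->[0] } ), join( ',', @{ $decoded->[1] } );
}
```

In this model the first field always decodes correctly, and only later
fields pick up a stale base, which is consistent with the bug affecting
only fields other than the first.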

However, fixing this bug -- reliably, at least -- was harder, because
it was smack dab in the middle of the gnarliest loop in KS. These
sorts of bugs have been cropping up because the Lucene file format,
from which the KS file format was derived, is just too complicated.
Here's the relevant section from the Lucene File Formats document:

DocSkip records the document number before every SkipInterval-th
document in TermFreqs. If payloads are disabled for the term's
field, then DocSkip represents the difference from the previous
value in the sequence. If payloads are enabled for the term's
field, then DocSkip/2 represents the difference from the previous
value in the sequence. If payloads are enabled and DocSkip is
odd, then PayloadLength is stored indicating the length of the
last payload before the SkipInterval-th document in TermPositions.
FreqSkip and ProxSkip record the position of every SkipInterval-th
entry in FreqFile and ProxFile, respectively. File positions are
relative to the start of TermFreqs and Positions, to the previous
SkipDatum in the sequence.

Y'follow that? Me neither.

Well I know what each item does. But I can't keep it all in my head
and follow control flow through code that implements that spec.
What's happened is that several features and optimizations have been
crammed into this part of Lucene, but there isn't a plugin convention
designed to help integrate them in an orderly manner.
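
As a reading aid for the quoted passage, here is a minimal sketch of a
decoder for those skip entries. It makes several assumptions that the
quoted text does not state: the input is a flat list of integers that
have already been unpacked from the file, each entry is laid out as
DocSkip, optional PayloadLength, FreqSkip, ProxSkip, the first entry is
relative to 0, and the payload length carries over when DocSkip is even.
decode_skip_data is a hypothetical name, not a Lucene or KinoSearch
function.

```perl
use strict;
use warnings;

# Decode skip entries from a flat list of already-unpacked integers.
# Assumed layout per entry: DocSkip, [PayloadLength], FreqSkip, ProxSkip.
sub decode_skip_data {
    my ( $values, $payloads_enabled ) = @_;
    my @queue = @$values;
    my ( $doc, $freq_ptr, $prox_ptr ) = ( 0, 0, 0 );
    my $payload_len;    # assumed to carry over until the next odd DocSkip
    my @entries;
    while (@queue) {
        my $doc_skip = shift @queue;
        if ($payloads_enabled) {
            # Odd DocSkip: a PayloadLength value follows.
            $payload_len = shift @queue if $doc_skip % 2;
            # DocSkip/2 is the document number delta.
            $doc_skip = int( $doc_skip / 2 );
        }
        # All three values are deltas from the previous entry.
        $doc      += $doc_skip;
        $freq_ptr += shift @queue;
        $prox_ptr += shift @queue;
        push @entries,
            {
            doc         => $doc,
            freq_ptr    => $freq_ptr,
            prox_ptr    => $prox_ptr,
            payload_len => $payload_len,
            };
    }
    return \@entries;
}

my $plain         = decode_skip_data( [ 16, 48, 130, 16, 50, 121 ], 0 );
my $with_payloads = decode_skip_data( [ 33, 4, 48, 130, 32, 50, 121 ], 1 );
printf "doc %d freq_ptr %d prox_ptr %d\n", @{$_}{qw( doc freq_ptr prox_ptr )}
    for @$plain, @$with_payloads;
```

Even in this reduced form, one loop has to juggle three running deltas
plus a flag bit that changes how the next value is read.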

So, the fix has been to implement just such a plugin system -- which
I've long planned -- then squash the bug in the newly simplified
code. And it worked like a charm.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Fwd: Error in function refill... [ In reply to ]
On Sun, 15 Apr 2007, Marvin Humphrey wrote:
> This bug is solved as of revision 2314. The file format also changed
> with that revision, so you'll need to regenerate any indexes if you
> upgrade. There will be other file format changes over the next
> couple of days, as well, so heads up.

Woohoo! Thanks Marvin. The index format change is not a problem;
while waiting for the index format to crystallize I've taken the time to
reorganise my code (a little) anyway.


> However, fixing this bug -- reliably, at least -- was harder, because
> it was smack dab in the middle of the gnarliest loop in KS. These
> sorts of bugs have been cropping up because the Lucene file format,
> from which the KS file format was derived, is just too complicated.
> Here's the relevant section from the Lucene File Formats document:
>
> DocSkip records the document number before every SkipInterval-th
> document in TermFreqs. If payloads are disabled for the term's
> field, then DocSkip represents the difference from the previous
> value in the sequence. If payloads are enabled for the term's
> field, then DocSkip/2 represents the difference from the previous
> value in the sequence. If payloads are enabled and DocSkip is
> odd, then PayloadLength is stored indicating the length of the
> last payload before the SkipInterval-th document in TermPositions.
> FreqSkip and ProxSkip record the position of every SkipInterval-th
> entry in FreqFile and ProxFile, respectively. File positions are
> relative to the start of TermFreqs and Positions, to the previous
> SkipDatum in the sequence.
>
> Y'follow that? Me neither.

[gulp], that reads like one of those memoranda in opposition to the
objection of the amended submission subject to the order following
depositions and declarations of expert witnesses for amended motions
for summary judgement... [feeling nauseous yet?]
(SCO vs IBM, www.groklaw.net).

> So, the fix has been to implement just such a plugin system -- which
> I've long planned -- then squash the bug in the newly simplified
> code. And it worked like a charm.

Brilliant. Thank you very much for killing this bug.