Hello, Devs!
I tried to introduce a custom index to speedup *infix* queries. Note: I'm
interested in cases where EdgeNGram is not an option. For example, if the
term 'foobar' is stored in a block at position 200, and 'bar' at 100. I try
to put the following suffixes in FST:
foobar->[200]
oobar->[200]
obar->[200]
bar->[100,200]
ar->[100,200]
r->[100,200]
The idea is to seekCeil(oba) (when querying *oba*), get lists of offsets,
read term blocks, filter terms by *oba*, expand *oba* to foobar term.
Gotcha!
The standard blocktree codec seemed too complex for hacking.
I took blockterms with VariableGapTermsIndexWriter and added terms index
file with FST of all terms suffixes mapped to offsets to blocks in
blockterms' .tib. Also, I have to write term heads in .tib blocks, since
blocks store only term tails.
The first problem is the suffixes FST size: FST in .tiv stores only part of
terms. Since suffix terms don't have their own blocks and just point to
original terms blocks, their FST contains all of them.
And it's a really important property of original terms index: output
offsets are increasing as input terms order. This allows to store only part
of terms, and outputs for suffixes FST are unordered, and it requires to
store all suffixes.
For 5mln enwiki docs index tiv size is 1.9M and full suffixes FST takes
3.3G. I even limited suffixes length to reduce their number.
Benchmark shows only 10% gain. So, it's a failure. I only can say that it
might show better improvement with fewer huge segments. Otherwise, with
many smaller segments fully scanning term dictionaries is comparable to
seeking suffixes FST and scanning certain blocks.
WDYT? Do you have a better idea or point to some whitepaper?
Are there any codec examples of writing top-level data structures like
GlobalOrds index or livedocs bitmask?
--
Sincerely yours
Mikhail Khludnev
I tried to introduce a custom index to speedup *infix* queries. Note: I'm
interested in cases where EdgeNGram is not an option. For example, if the
term 'foobar' is stored in a block at position 200, and 'bar' at 100. I try
to put the following suffixes in FST:
foobar->[200]
oobar->[200]
obar->[200]
bar->[100,200]
ar->[100,200]
r->[100,200]
The idea is to seekCeil(oba) (when querying *oba*), get lists of offsets,
read term blocks, filter terms by *oba*, expand *oba* to foobar term.
Gotcha!
The standard blocktree codec seemed too complex for hacking.
I took blockterms with VariableGapTermsIndexWriter and added terms index
file with FST of all terms suffixes mapped to offsets to blocks in
blockterms' .tib. Also, I have to write term heads in .tib blocks, since
blocks store only term tails.
The first problem is the suffixes FST size: FST in .tiv stores only part of
terms. Since suffix terms don't have their own blocks and just point to
original terms blocks, their FST contains all of them.
And it's a really important property of original terms index: output
offsets are increasing as input terms order. This allows to store only part
of terms, and outputs for suffixes FST are unordered, and it requires to
store all suffixes.
For 5mln enwiki docs index tiv size is 1.9M and full suffixes FST takes
3.3G. I even limited suffixes length to reduce their number.
Benchmark shows only 10% gain. So, it's a failure. I only can say that it
might show better improvement with fewer huge segments. Otherwise, with
many smaller segments fully scanning term dictionaries is comparable to
seeking suffixes FST and scanning certain blocks.
WDYT? Do you have a better idea or point to some whitepaper?
Are there any codec examples of writing top-level data structures like
GlobalOrds index or livedocs bitmask?
--
Sincerely yours
Mikhail Khludnev