Mailing List Archive

FST codec for *infix* queries. No luck so far.
Hello, Devs!
I tried to introduce a custom index to speed up *infix* queries. Note: I'm
interested in cases where EdgeNGram is not an option. For example, suppose
the term 'foobar' is stored in a block at offset 200, and 'bar' in a block
at offset 100. I put the following suffixes into an FST:
foobar->[200]
oobar->[200]
obar->[200]
bar->[100,200]
ar->[100,200]
r->[100,200]
The idea is to seekCeil(oba) when querying *oba*, get the lists of block
offsets, read those term blocks, filter the terms by *oba*, and thus expand
*oba* to the term foobar. Gotcha!
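To make the scheme concrete, here is a toy Python sketch of the same idea: a plain sorted list stands in for the FST, a dict stands in for the .tib blocks, and all names and data are made up. It is only an illustration of the lookup flow, not the codec itself:

```python
from bisect import bisect_left
from collections import defaultdict

# Toy term blocks: block offset -> terms stored in that block.
blocks = {100: ["bar", "baz"], 200: ["foobar", "football"]}

# Build the suffix index: every suffix of every term maps to the offsets
# of the blocks containing a term with that suffix.
suffix_index = defaultdict(set)
for offset, terms in blocks.items():
    for term in terms:
        for i in range(len(term)):
            suffix_index[term[i:]].add(offset)

# Sorted keys allow a seekCeil-style jump to the first suffix >= the query.
keys = sorted(suffix_index)

def infix_search(infix):
    """Find all terms containing `infix`: seek suffixes, then filter blocks."""
    offsets = set()
    # seekCeil: jump to the first suffix that could start with `infix`.
    pos = bisect_left(keys, infix)
    while pos < len(keys) and keys[pos].startswith(infix):
        offsets |= suffix_index[keys[pos]]
        pos += 1
    # Read the candidate blocks and keep only the terms that really match.
    return sorted(t for off in offsets for t in blocks[off] if infix in t)

print(infix_search("oba"))  # -> ['foobar']
```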
The standard blocktree codec seemed too complex for hacking.
I took BlockTerms with VariableGapTermsIndexWriter and added a terms-index
file holding an FST of all term suffixes mapped to the offsets of blocks in
BlockTerms' .tib. I also had to write term heads into the .tib blocks, since
the blocks store only term tails.
The first problem is the size of the suffixes FST. The FST in .tiv stores
only a fraction of the terms, but since suffix terms don't have their own
blocks and just point to the original terms' blocks, the suffixes FST has
to contain all of them.
This relies on a really important property of the original terms index:
output offsets increase in input term order, which is what allows storing
only a fraction of the terms. The outputs of the suffixes FST are unordered,
so it has to store every suffix.
For a 5M-doc enwiki index, .tiv is 1.9 MB while the full suffixes FST takes
3.3 GB. I even limited suffix lengths to reduce their number.
The benchmark shows only a 10% gain, so it's a failure. I can only say that
it might show a better improvement with fewer, huge segments. Otherwise,
with many smaller segments, fully scanning the term dictionaries is
comparable to seeking the suffixes FST and scanning the relevant blocks.
WDYT? Do you have a better idea, or a pointer to a whitepaper?
Are there any codec examples of writing top-level data structures like a
GlobalOrds index or the livedocs bitmask?
--
Sincerely yours
Mikhail Khludnev
Re: FST codec for *infix* queries. No luck so far.
Hi Mikhail,

I don't have any spectacular suggestions, but here are a few things stemming from experience.

1) While the problem is intellectually interesting, I have rarely found
anybody who'd be comfortable with using infix suggestions - people are
very used to "completions" happening on a prefix of one or multiple
words (see my note below, though).

2) Wouldn't it be better/more efficient to maintain an FST/index of
word suffix(es) -> complete word, instead of offsets within the block?
This can be combined with term frequency to limit the suggested words
to just certain categories (or the most frequent terms), which would
make the FST smaller still.
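A rough sketch of what (2) could look like, with a plain dict standing in for the FST and a made-up frequency cutoff (the term frequencies and threshold here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical term frequencies; only terms above a cutoff get indexed,
# which keeps the suffix -> word map (and an FST built from it) small.
term_freq = {"foobar": 50, "bar": 120, "barista": 3, "rebar": 40}
MIN_FREQ = 10

suffix_to_words = defaultdict(list)
for word, freq in term_freq.items():
    if freq < MIN_FREQ:
        continue  # prune rare terms up front
    for i in range(len(word)):
        suffix_to_words[word[i:]].append(word)

def suggest(infix):
    """Collect complete words whose suffixes extend `infix`, by frequency."""
    words = set()
    for suffix, ws in suffix_to_words.items():
        if suffix.startswith(infix):
            words.update(ws)
    return sorted(words, key=lambda w: -term_freq[w])

print(suggest("bar"))  # most frequent matches first; 'barista' was pruned
```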

3) I'd never try to store infixes shorter than 2-3 characters (you
said you did this - "I even limited suffixes length to reduce their
number"). This requires folks to type longer input, but it prevents FST
bloat and in general leads to higher-quality suggestions (short infixes
would otherwise match far too many terms).

> Otherwise, with many smaller segments fully scanning term dictionaries is comparable to seeking suffixes FST and scanning certain blocks.

Yeah, I'd expect the automaton here to be huge. The complexity of the
vocabulary and number of characters in the language will also play a
key role.

4) IntelliJ IDEA has this kind of "search everywhere" functionality
which greps for infixes (it is really nice). I recall looking at the
open-source engine to see how it was done, and my conclusion from
glancing over the code was that it's a fixed, coarse, n-gram-based
index of consecutive letters pointing at potential matches, which are
then revalidated against the query. So you have a super-simple index
with a very fast lookup, and the cost of verifying and finding exact
matches is shifted to once you have a candidate list. While this
doesn't help with Lucene indexes, perhaps it's a sign that this
particular task needs a different index/search paradigm?
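The scheme in (4) is easy to sketch: a coarse trigram index produces candidates, which are then revalidated against the full query. This is a hypothetical toy version (the names and data are invented, and the real engine is surely more involved):

```python
from collections import defaultdict

documents = ["SearchEverywhereAction", "FileStructurePopup", "StructureViewModel"]

N = 3  # index trigrams of consecutive letters

index = defaultdict(set)
for doc_id, name in enumerate(documents):
    s = name.lower()
    for i in range(len(s) - N + 1):
        index[s[i:i + N]].add(doc_id)

def search(query):
    q = query.lower()
    # Candidate set: docs containing every trigram of the query.
    grams = [q[i:i + N] for i in range(len(q) - N + 1)]
    candidates = set.intersection(*(index[g] for g in grams)) if grams else set()
    # Revalidate: trigram containment is necessary but not sufficient.
    return [documents[d] for d in sorted(candidates) if q in documents[d].lower()]

print(search("ruct"))  # infix match via the coarse index + verification
```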


Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: FST codec for *infix* queries. No luck so far.
One small datapoint: Amazon's customer-facing product search now includes
some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only
as a fallback when the prefix suggesters didn't find compelling options.

And I think Netflix's suggester used to be primarily infix, but now when I
tested it, I get no suggestions at all, only live search results, which I
like less :)

Mike McCandless

http://blog.mikemccandless.com


Re: FST codec for *infix* queries. No luck so far.
I'm not sure under which scenario ngrams (EdgeNGrams) would not be an
option? Another thing to try might be something like BPE (byte pair
encoding). In this encoding, you train a set of tokens from a
vocabulary based on frequency of occurrence, and agglomerate them
iteratively until the vocabulary reaches a size you like. You tend
to end up with commonly occurring subwords (morphemes) that could
be good indexing choices for this sort of thing?
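For reference, the classic BPE training loop can be sketched in a few lines. This is a toy implementation over a made-up word-frequency vocabulary, just to show the iterative agglomeration:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a word -> frequency vocabulary."""
    # Represent each word as a tuple of symbols, starting with characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Re-segment every word with the new merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

words = {"lower": 5, "lowest": 3, "newer": 6, "wider": 2}
merges, vocab = train_bpe(words, 4)
print(merges)  # learned subword merges, most frequent first
```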

Re: FST codec for *infix* queries. No luck so far.
I built the original Netflix autocomplete. That used edge Ngrams running on Solr 1.3.

It isn’t a really big index. There just aren’t that many movies and TV shows. I think we had 70k titles and 150k people (actors, directors, …).

We handled one corner case in the client code. Movies with a one-character title must show up for that character or they are unmatchable. You can’t type more characters to match A, M, X, or Z (all movies). That special case still works on dvd.netflix.com, but not on the streaming site.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Re: FST codec for *infix* queries. No luck so far.
Hello, Dawid.
Thanks for your answers. Let me comment below.

On Tue, Apr 26, 2022 at 3:13 PM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Hi Mikhail,
>
> I don't have any spectacular suggestions but something stemming from
> experience.
>
> 1) While the problem is intellectually interesting, I rarely found
> anybody who'd be comfortable with using infix suggestions - people are
> very used to "completions" happening on a prefix of one or multiple
> words (see my note below, though).
>
It's interesting that I asked about generic search for *foo* queries, but
you read it as a question about infix suggestions.
It's a little bit odd, but I often meet customers who ask about generic
*infix* search - find me everything including the letters 'foo'.
I usually try to convince them that they are focusing on the positive
results, while such high-recall search is prone to false positives, which
makes it quite useless.


>
> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
> word suffix(es) -> complete word instead of offsets within the block?
> This can be combined with term frequency to limit the number of
> suggested words to just certain categories (or most frequent terms)
> which would make the fst smaller still.
>
Well, I did a prototype which uses the infix suggester for query expansion.
It looks quite good, but it's a small Lucene index, not an FST with term
outputs. Also, for such odd requirements pruning is undesirable - find me
everything, you know.


>
> 3) I'd never try to store infixes shorter than 2, 3 characters (you
> said you did it - "I even limited suffixes length to reduce their
> number"). This requires folks to type in longer input but prevents fst
> bloat and in general leads to higher-quality suggestions (since
> there'll be so many of them).
>
Good point. Short infixes are ruled out.



--
Sincerely yours
Mikhail Khludnev
Re: FST codec for *infix* queries. No luck so far.
Hi, Michael.

On Tue, Apr 26, 2022 at 10:45 PM Michael Sokolov <msokolov@gmail.com> wrote:

> I'm not sure under which scenario ngrams (edgengrams) would not be an
> option?

EdgeNGrams bump the index size several times over, since postings are
repeated for every derived term. Some systems can't afford such a big footprint.
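The blow-up is easy to see: EdgeNGram emits one token per prefix, so a term of length n produces n indexed tokens, each repeating the document's postings. A quick illustration with made-up terms:

```python
def edge_ngrams(term, min_len=1):
    """All prefixes of `term` of length >= min_len, as EdgeNGram would emit."""
    return [term[:i] for i in range(min_len, len(term) + 1)]

terms = ["foobar", "bar", "football"]
emitted = [g for t in terms for g in edge_ngrams(t)]
# Each original term turns into len(term) tokens, so postings for a
# document are duplicated once per derived token.
print(len(terms), "terms ->", len(emitted), "indexed tokens")
```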


> Another to try maybe would be something like BPE (byte pair
> encoding). In this encoding, you train a set of tokens from a
> vocabulary based on frequency of occurrence, and agglomerate them
> iteratively until you have the vocabulary at a size you like. You tend
> to end up with commonly-ocurring subwords (morphemes) that can
> possibly be good indexing choices for this sort of thing?
>
It's a productive idea, but some queries would yield no results due to this
pruning.




--
Sincerely yours
Mikhail Khludnev
Re: FST codec for *infix* queries. No luck so far.
> It's interesting that I asked about generic search for *foo* queries, but you read it as a question about infix suggestions.

Right, apologies. These are a bit related though, aren't they?

> It's a little bit odd but I meet customers who ask about generic search for *infix* often - find me everything including these letters 'foo'.

My counterargument is typically to show those customers how many
terms and documents match such queries on a large index (and what
those terms are)... But you're right - some power users do have
legitimate needs for prefix/suffix/infix queries (and even regexp
queries). In those cases, if they were rare, I simply ran a full scan on
the terms index and expanded the query. If the number of potentially
matching terms was huge, the search would abort and wait for more
input from the user so that the scope stayed saner. I'd be interested in
whether this can be done more efficiently.
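That fallback can be sketched like this (hypothetical names, and the abort threshold would of course be much larger in practice):

```python
# Hypothetical bound on how many terms an infix expansion may collect.
MAX_EXPANSIONS = 2

def expand_infix(term_dict, infix, max_expansions=MAX_EXPANSIONS):
    """Full scan of the term dictionary; abort when the expansion gets too big."""
    matches = []
    for term in term_dict:  # linear scan over all terms, as in the fallback
        if infix in term:
            matches.append(term)
            if len(matches) > max_expansions:
                # Too many candidates: give up and ask the user for more input.
                return None
    return matches

terms = ["bar", "barn", "foobar", "rhubarb", "cat"]
print(expand_infix(terms, "bar"))  # -> None (too many matches, aborted)
print(expand_infix(terms, "rhu"))  # -> ['rhubarb']
```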

> I usually try to convince them that they are focusing on positive results, but such high recall search is prone for false positives, and this makes it quite useless.

Agreed.

D.
