Mailing List Archive

more abstract interfaces to kinosearch
There are two things I find myself wanting over and over when I use KS:

* easier adding/loading: $index->add(@objects), or perhaps $index->load()
(using a predefined, per-schema data source)
* easier searching, without using QueryParser, like SQL::Abstract (since
that's what I'm used to):
$hits = $index->search({ foo => 1, bar => [qw(baz quux)] })
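For what it's worth, here's a sketch of the normalization pass I have in mind
for the condition hash. The AND/OR tree shape is just my working assumption,
not anything in KS or in a released KSx module:

```perl
use strict;
use warnings;

# Hypothetical normalization pass for an SQL::Abstract-style condition
# hash.  The intermediate AND/OR tree shape is my own invention; it's
# just the form I'd want before building actual query objects.
sub normalize_conditions {
    my ($conds) = @_;
    my @clauses;
    for my $field ( sort keys %$conds ) {
        my $value = $conds->{$field};
        if ( ref $value eq 'ARRAY' ) {
            # A listref means "any of these values": OR the terms.
            push @clauses, [ OR => map { [ term => $field, $_ ] } @$value ];
        }
        else {
            push @clauses, [ term => $field, $value ];
        }
    }
    # Top-level keys AND together, as in SQL::Abstract.
    return [ AND => @clauses ];
}

my $tree = normalize_conditions( { foo => 1, bar => [qw(baz quux)] } );
# $tree now holds:
# [ AND => [ OR => [ term => 'bar', 'baz' ], [ term => 'bar', 'quux' ] ],
#          [ term => 'foo', 1 ] ]
```

A second pass would then walk the tree and emit whatever Query/Filter objects
turn out to be appropriate.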

I've created prototypes of both of these for a project I'm working on. If
anyone's interested, I'd love suggestions; I'll probably try to spin them off
into e.g. KSx::Manager or KSx::AbstractSearch.

In both cases my fundamental goal is to reduce the number of objects and method
calls needed in code that manipulates or uses the invindex, because refactoring
is nice and I don't like doing all that typing.

hdp.
more abstract interfaces to kinosearch
On Jun 20, 2007, at 8:09 PM, Hans Dieter Pearcey wrote:

> * easier adding/loading: $index->add(@objects), or perhaps $index->load()
>   (using a predefined, per-schema data source)

I can see those statements meaning several different things. Can
you please elaborate on what it is you're trying to do?

> * easier searching, without using QueryParser, like SQL::Abstract (since
>   that's what I'm used to):
>   $hits = $index->search({ foo => 1, bar => [qw(baz quux)] })

Sounds like a good idea for an extension. Constructing large queries
via the KinoSearch::Search::Query subclass constructors can be
tedious. At the same time, I wouldn't want to have to
implement/document those constructs in Searcher itself.

> I've created prototypes of both of these for a project I'm working on.
> If anyone's interested, I'd love suggestions; I'll probably try to spin
> them off into e.g. KSx::Manager or KSx::AbstractSearch.

I might encourage you to put those into KSx::Index::Manager and
KSx::Search::Abstract, but it's just a suggestion.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
On Wed, Jun 20, 2007 at 09:05:58PM -0700, Marvin Humphrey wrote:
>
> On Jun 20, 2007, at 8:09 PM, Hans Dieter Pearcey wrote:
>
> > * easier adding/loading: $index->add(@objects), or perhaps $index->load()
> >   (using a predefined, per-schema data source)
>
> I can see those statements meaning several different things. Can
> you please elaborate on what it is you're trying to do?

That's because they do mean different things, but I was too tired when I was
writing that email to see how confusing it was.

Right now I have a base index manager class. That base class has a structure
for handling new input; that's where the $index->add(@objects) example came
from. Subclasses have a to_doc() method that they override. The whole thing
is not much more than a convenient wrapper around a combination of open/clobber
and for (@objects) { $inv->add_doc }
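In sketch form (only add_doc is real KS API; the rest of the naming, including
the open/finish steps, is mine):

```perl
package My::Index::Manager;    # hypothetical name, not a KSx module
use strict;
use warnings;

# Thin wrapper over InvIndexer.  Subclasses override to_doc() to turn
# an application object into an indexable hashref.
sub add {
    my ( $self, @objects ) = @_;
    my $invindexer = $self->open_invindexer;    # assumed open/clobber step
    $invindexer->add_doc( $self->to_doc($_) ) for @objects;
    $invindexer->finish;                        # assumed commit step
    return scalar @objects;
}

1;
```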

I also have a subclass that knows where its data comes from. It keeps track of
the last-indexed document from that data source and can create/update
invindexes based on various parameters (e.g. userid => 17). This means I don't
actually have to pass it any objects at all; I can just say $index->update
periodically and it takes care of finding new things to index and indexing
them.

That part hasn't been properly abstracted yet. It's very specific to that
particular subclass and data source (and indexing schema), and it's also not
very polished, so I'm not sure how it will turn out yet; in particular, if I'm
generating multiple indexes from a single data source, the last-update kind of
metadata needs to be shared between them, they all need to be updated with a
single pull of data instead of each looking for updates, etc.

hdp.
more abstract interfaces to kinosearch
On Jun 21, 2007, at 5:53 AM, Hans Dieter Pearcey wrote:

> The whole thing is not much more than a convenient wrapper around a
> combination of open/clobber and for (@objects) { $inv->add_doc }

I've often subclassed InvIndexer to provide convenience methods like
these. They tend to look like this:

    sub add_employees {
        my ( $self, @employees ) = @_;
        for my $employee (@employees) {
            $self->add_doc({
                name => $employee->get_name,
                job  => $employee->get_role,
            });
        }
    }

> I also have a subclass that knows where its data comes from.

---->8 SNIP 8<----

> I can just say $index->update periodically and it takes care of
> finding new things to index and indexing them.

Nice. I think I'll start using that pattern. :)

> It's very specific to that
> particular subclass and data source (and indexing schema),

Yes, that matches my experience. I think it will be challenging to
come up with an abstraction layer for this problem that matches the
elegance and universality of your proposed search query builder.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
On Wed, Jun 20, 2007 at 09:05:58PM -0700, Marvin Humphrey wrote:
> > * easier searching, without using QueryParser, like SQL::Abstract (since
> >   that's what I'm used to):
> >   $hits = $index->search({ foo => 1, bar => [qw(baz quux)] })
>
> Sounds like a good idea for an extension. Constructing large queries
> via the KinoSearch::Search::Query subclass constructors can be
> tedious. At the same time, I wouldn't want to have to
> implement/document those constructs in Searcher itself.

Now that I've worked on this for a little bit, I have some best
practices/performance questions.

Take the example query that I gave above:

{ foo => 1, bar => [qw(baz quux)] }

Let's say this translates to the SQL

WHERE foo = 1 AND (bar = 'baz' OR bar = 'quux')

I can't find any way to do the OR, there, in a Query, only with PolyFilter. In
this particular case that isn't a big problem, since there's still the
TermQuery for 'foo = 1', but it's easy to imagine a query that translates
entirely into Filters and has no Query. What do I feed the searcher then?

Given the existence of QueryFilter, is there any documentation on what you
would want to use as a query and what you would want to use as a filter via
QueryFilter? I looked through the source, but it was on a plane back from YAPC
and I was tired, so I may have missed clues. I suspect it's something like
"filters should be reused because they cache stuff, while queries can be
one-off because they don't", but I'm not sure.

If that's all there is, is there any roadblock to a Query analogue to
PolyFilter (PolyQuery)? Is there something about OR that makes it difficult to
do there? Likewise, is there anything about the RangeFilter that would make it
difficult to turn into a query? For my application, I have both OR logic and
ranges that I expect to change with some frequency.

hdp.
more abstract interfaces to kinosearch
On Jun 28, 2007, at 6:54 PM, Hans Dieter Pearcey wrote:

> I can't find any way to do the OR, there, in a Query, only with PolyFilter.

BooleanQuery?

> "filters should be reused because they cache stuff, while queries can
> be one-off because they don't",

In general, that's the case.

Filters are on/off, so their result can be cached in a BitVector.

Caching the result of a Scorer (derived from a Query) would be much
more expensive, because you'd need to keep 1 32-bit doc number and 1
float score around for each match. In a worst case scenario -- every
doc matches -- we're talkin' 64 times as expensive as a Filter.

> Likewise, is there anything about the RangeFilter that would make it
> difficult to turn into a query?

Actually, there is. It's hard to know how matches should contribute
to the score.

If you issue a query which effectively says "give me matches where
content contains 'foo' OR date is greater than 2000-01-01", how
should docs which match only the date compare to docs which only
match foo? If the date is very rare, should it contribute more?

Lucene offers three solutions.

One, punt and say "use a RangeFilter". This is the approach
KinoSearch has taken.

Two, weight rare terms within the range more heavily than common
terms within the range. This is expensive and produces occasionally
bizarre results. It's the oldest approach, and the consensus is that
it's not a very good one.

Three, apply a constant score, as in the Lucene class
ConstantScoreRangeQuery. I like this approach so long as the
constant score is zero, which basically makes it a filter. :) Once
you have non-zero scores, though, scaling the contribution that it
should make to the score is a headache.

Another approach would be to find the number of terms that exist
within the range, sum their doc_freqs together, and use that to
calculate IDF and weight the score. That's expensive, though, and
only appropriate for esoteric situations.

I think the ultimate solution will be to make MatchFieldQuery public
and give it a constant score which defaults to zero. Then it could
be combined with a RangeFilter to produce the same effect as a
ConstantScoreRangeQuery. MatchFieldQuery is relatively simple, and
lets you do things that require kludges otherwise.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
I keep forgetting that I'm not using all of kinosearch's functionality. In
particular, I don't actually care about scoring, because I'm using KS like a
database and always sort by some inherent property of the documents.

On Thu, Jun 28, 2007 at 11:00:21PM -0700, Marvin Humphrey wrote:
>
> On Jun 28, 2007, at 6:54 PM, Hans Dieter Pearcey wrote:
>
> > I can't find any way to do the OR, there, in a Query, only with PolyFilter.
>
> BooleanQuery?

I don't see how I'd do this just in terms of matching. Maybe I don't
understand SHOULD?

> > "filters should be reused because they cache stuff, while queries can
> > be one-off because they don't",
>
> In general, that's the case.
>
> Filters are on/off, so their result can be cached in a BitVector.
>
> Caching the result of a Scorer (derived from a Query) would be much
> more expensive, because you'd need to keep 1 32-bit doc number and 1
> float score around for each match. In a worst case scenario -- every
> doc matches -- we're talkin' 64 times as expensive as a Filter.

There's another question here that I may not have made explicit enough. If
some particular selection mechanism is available both as a Query and as a
Filter -- e.g. BooleanQuery, which you can also use as part of a QueryFilter --
is there any reason to prefer one over the other, assuming that you are (as I
am) only interested in matching, not scoring? Do Filters have any kind of
startup overhead compared to Queries, etc.?

> I think the ultimate solution will be to make MatchFieldQuery public
> and give it a constant score which defaults to zero. Then it could
> be combined with a RangeFilter to produce the same effect as a
> ConstantScoreRangeQuery. MatchFieldQuery is relatively simple, and
> lets you do things that require kludges otherwise.

I had found MatchFieldQuery, and thought that that might work, but didn't know
enough internals to be sure. I like this idea. What can I do to make it work?

hdp.
more abstract interfaces to kinosearch
On Jun 29, 2007, at 5:45 AM, Hans Dieter Pearcey wrote:

>> BooleanQuery?
>
> I don't see how I'd do this just in terms of matching. Maybe I don't
> understand SHOULD?

If you add two clauses to a BooleanQuery with SHOULD, then their
result sets get OR'd together.

    $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
    $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );
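To express your earlier "foo = 1 AND (bar = baz OR bar = quux)" example,
you'd nest one BooleanQuery inside another. A rough sketch -- the TermQuery
and Term constructor signatures here are from memory, so treat the details
as assumptions:

```perl
use strict;
use warnings;
use KinoSearch::Search::BooleanQuery;
use KinoSearch::Search::TermQuery;
use KinoSearch::Index::Term;

# bar = 'baz' OR bar = 'quux'
my $bar_query = KinoSearch::Search::BooleanQuery->new;
for my $value (qw(baz quux)) {
    my $term_query = KinoSearch::Search::TermQuery->new(
        term => KinoSearch::Index::Term->new( 'bar', $value ),
    );
    $bar_query->add_clause( query => $term_query, occur => 'SHOULD' );
}

# foo = 1 AND ( bar-clause )
my $query = KinoSearch::Search::BooleanQuery->new;
$query->add_clause(
    query => KinoSearch::Search::TermQuery->new(
        term => KinoSearch::Index::Term->new( 'foo', '1' ),
    ),
    occur => 'MUST',
);
$query->add_clause( query => $bar_query, occur => 'MUST' );
```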

> If some particular selection mechanism is available both as a Query and
> as a Filter -- e.g. BooleanQuery, which you can also use as part of a
> QueryFilter -- is there any reason to prefer one over the other, assuming
> that you are (as I am) only interested in matching, not scoring? Do
> Filters have any kind of startup overhead compared to Queries, etc.?

If you don't care about scoring and you can reuse Filters, you should
use as many as practical.

Scorers require hitting the disk.

QueryFilters and PolyFilters, once their internal caches are warmed,
do not.

The startup cost for a RangeFilter only happens once per field per
IndexReader, when a portion of that field's lexicon is read into
memory. The main per-query cost is a single burst of disk activity
to look up the search term and assign it a "term number" based on
where it falls in the lexicon, after which everything else is CPU
crunching and memory access.

>> I think the ultimate solution will be to make MatchFieldQuery public
>> and give it a constant score which defaults to zero. Then it could
>> be combined with a RangeFilter to produce the same effect as a
>> ConstantScoreRangeQuery. MatchFieldQuery is relatively simple, and
>> lets you do things that require kludges otherwise.
>
> I had found MatchFieldQuery, and thought that that might work, but
> didn't know enough internals to be sure. I like this idea. What can
> I do to make it work?

Sorry for the delayed response -- I had to think this over.

I've resisted making MatchFieldQuery public because I didn't feel
like its API was mature enough. I'm still not sure about it, and I
don't want to add it to the list of things that have to get done
prior to the release of 0.20. For the time being, I suggest you go
ahead and use MatchFieldQuery as is, but mark that aspect of your
module experimental. Looking forward, you can help move things along
by participating in design discussions about subclassing strategies.

A lot of the KS public API and class design is pretty solid. To
touch on one aspect, I'm pleased that the Query components allow you
to create your own query building mechanism as an alternative to
QueryParser. I'm also more certain than ever that the decision to
limit QueryParser to a much simpler syntax than its Lucene
counterpart was the right one. What you are doing demonstrates that
it is possible to write custom KSx extensions to play the
Query-building role, and if someone wants to write a Lucene-ish query
parser that supports syntax like 'boost^3', they can. Core
KinoSearch, by opting out of the more complex high-level task, lowers
its support costs and maintains greater flexibility.

This is successful modularization, "divide and conquer", "loose
coupling", etc, in action. Every class has its own reasonably
contained problem domain. There are no "God Objects" that know too
much or do too much. The components tolerate being assembled into
many different configurations.

The main goal of KinoSearch 0.30 will be to reproduce this
flexibility across more phases of search and indexing. Scorer should
be public and it should not be so challenging to subclass. If that
were already the case, somebody could whip up KSx::Search::RangeQuery
and you could use it without waiting for me to act.

For 0.20, though, it's time to think reductively (to echo a sentiment
expressed by Nathan Kurz). Rather than add new public APIs, it's
time to yank Hits->seek (and simplify Searcher), migrate some
documentation out of POD and onto the new wiki, and possibly
redact the public APIs for Analyzer, Token, and TokenBatch, marking
them as experimental once again so that we have the option to modify
them.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
On Mon, Jul 02, 2007 at 08:17:55AM -0700, Marvin Humphrey wrote:
> If you add two clauses to a BooleanQuery with SHOULD, then their
> result sets get OR'd together.
>
>     $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
>     $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );

Is this true even when (like me) you are only interested in matching?

Also, is there some reason that this isn't documented? I wanted to make
Searcher::Abstract support arbitrary nesting of conditions, and the fact that
some operations appeared to be only supported as part of filters (which would
make getting the correct precedence tricky) was a big reason I didn't
(referring to PolyFilter and RangeFilter not having Query analogues).

One thing I thought about doing was turning *everything* into a Filter, and
having a Query that did something like "match all documents", but I wasn't sure
how to build such a query.

> If you don't care about scoring and you can reuse Filters, you should
> use as many as practical.

What if I can't reuse Filters, but I don't care about scoring?

> Sorry for the delayed response -- I had to think this over.
>
> I've resisted making MatchFieldQuery public because I didn't feel
> like its API was mature enough. I'm still not sure about it, and I
> don't want to add it to the list of things that have to get done
> prior to the release of 0.20. For the time being, I suggest you go
> ahead and use MatchFieldQuery as is, but mark that aspect of your
> module experimental. Looking forward, you can help move things along
> by participating in design discussions about subclassing strategies.

> This is successful modularization, "divide and conquer", "loose
> coupling", etc, in action. Every class has its own reasonably
> contained problem domain. There are no "God Objects" that know too
> much or do too much. The components tolerate being assembled into
> many different configurations.

I agree that core KS has taken the right direction here.

The one place where this seems less true is the distinction between scoring and
matching, as I noted previously. My guess (because I don't know anything about
IR theory, or whatever) is that you assumed that of COURSE people wouldn't want
just matching and not scoring, and so the documentation and currently-existing
classes reflect that assumption, but the documentation does not extend to the
building blocks that someone like me would need in order to create the
theoretical RangeQuery. It ends up looking like there's some arbitrary
difference between the capabilities of Query and Filter, when in fact there is
a perfectly good reason they're different.

Of course, I'm approaching it from a different direction, so I have different
assumptions; I want to treat KS more like a traditional database, which means I
have different expectations, 'unique' constraints, stuff like that.

hdp.
more abstract interfaces to kinosearch
On 7/2/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> If you don't care about scoring and you can reuse Filters, you should
> use as many as practical.

I've been trying to understand the role of filters as well. My
impression is that they are a good stopgap while custom scoring is
hard (i.e., what Marvin says is probably good practice for the present),
but that once it becomes easy to subclass scorers, filters can go away.
That is, they should be folded into the scoring apparatus rather than
wrapped around the HitCollector.

To my mind, the main disadvantage of filters is that they happen too
late, after all the rest of the query has already run. Consider some
expensive query that we want to filter, say a search for a mention of
the band "The The" in the category "Music".

With the current filtering approach, we have to do an expensive phrase
match with position checks on every document that contains the word 'the',
and only afterward is the category checked. While the category check
against a bit vector is very efficient, the rest of the query is
terribly expensive and disk intensive.

Contrast this with an ANDScorer with ordered subclauses
'category:music' and 'text:"the-the"'. Checking whether the category
field contains the term 'music' is really fast and efficient, and if
the music category is relatively small we only have to do the
expensive part of the query on a very small subset of the documents.

Sure, you say, but what about restricting to ranges? Surely you need a
filter for that? Currently, yes, but only because it's hard to write a
custom scorer. I don't see any inherent limitation that would prevent
a custom scorer from dealing with ranges directly. And if one wanted,
I don't see any reason that its next-doc method has to use an index at
all --- it could work from a cached bit vector just as well.

> If that were already the case, somebody could whip up
> KSx::Search::RangeQuery and you could use it without waiting for me
> to act.

Yes, yes, that! With that in mind, and thinking of the long-term, are
there things that can currently be done with the Filter apparatus that
couldn't be done as efficiently (or more so) with a Scorer? This isn't
to say that discriminating HitCollectors need to go away, only that
maybe they don't need to be part of the primary API.

Nathan Kurz
nate@verse.com
more abstract interfaces to kinosearch
On Jul 2, 2007, at 8:38 AM, Hans Dieter Pearcey wrote:

>> If you add two clauses to a BooleanQuery with SHOULD, then their
>> result sets get OR'd together.
>>
>>     $bool_query->add_clause( query => $term_query_a, occur => 'SHOULD' );
>>     $bool_query->add_clause( query => $term_query_b, occur => 'SHOULD' );
>
> Is this true even when (like me) you are only interested in matching?

In theory the (unfinished) MatchPosting class is supposed to help out
with situations like yours. However, because it doesn't store token
position, it doesn't support phrase matching, and maybe it needs to
be rethought.

> Also, is there some reason that this isn't documented?

No, there's no reason. I guess I thought the capabilities were
implied by the class name. Looks like usability testing has revealed
a flaw! ;)

> a Query that did something like "match all documents"

That would be the as-yet-non-existent MatchAllDocsQuery, which would
have an interface similar to MatchFieldQuery. The difference between
the two would be analogous to the difference between 'SELECT doc_num'
and 'SELECT doc_num WHERE foo IS NOT NULL'.

>> If you don't care about scoring and you can reuse Filters, you should
>> use as many as practical.
>
> What if I can't reuse Filters, but I don't care about scoring?

If you can't reuse QueryFilters or PolyFilters, they offer no
advantage. They're probably mildly less efficient than just adding a
clause to the query.

>> This is successful modularization, "divide and conquer", "loose
>> coupling", etc, in action. Every class has its own reasonably
>> contained problem domain. There are no "God Objects" that know too
>> much or do too much. The components tolerate being assembled into
>> many different configurations.
>
> I agree that core KS has taken the right direction here.
>
> The one place where this seems less true is the distinction between
> scoring and matching, as I noted previously.

Lucene has a family of Query subclasses called SpanQueries, which I
have not ported and don't intend to ever put in KinoSearch's core.
What I'd like to do is make it possible for someone to write a
KSx::Spans distro. It might even include a KSx::Spans::QueryParser
subclass that uses SpanTermQuery in place of TermQuery and so on.

Analogously, it should be possible to create a suite of Queries/
Scorers which are optimized for matching alone. I believe that the
changes to KinoSearch's file format in 0.20 and the introduction of
Posting should facilitate this, but the OO infrastructure needs more
work.

In the meantime, the current KS query classes don't exactly suck if
you just need matching. :)

> My guess (because I don't know anything about IR theory, or whatever)
> is that you assumed that of COURSE people wouldn't want just matching
> and not scoring,

It's true that returning results ranked by relevancy is something I
put a high priority on, but I've definitely thought about other
cases. It's just that unstructured search is a more pressing
problem. There are a lot of good databases out there. KS shouldn't
aspire to compete with PostgreSQL.

> Of course, I'm approaching it from a different direction, so I have
> different assumptions; I want to treat KS more like a traditional
> database, which means I have different expectations, 'unique'
> constraints, stuff like that.

KinoSearch is always going to be optimized for the use case of a
large number of queries against a single view of an index.

I don't think we'll have to make a choice between matching alone and
matching with scoring, though. It should be possible to support both
without compromise.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
On Jul 2, 2007, at 12:32 PM, Nathan Kurz wrote:

> To my mind, the main disadvantage of filters is that they happen too
> late, after all the rest of the query has already run.

Interesting point. I suppose if we added a filter_bits argument to
Scorer_Collect, we could apply the filter earlier. Roughly, replace
this...

    do {
        ...
    } while (Scorer_Next(self));

... with this:

    do {
        ...
    } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));

The downside is that it takes the filtering logic out of HitCollector
and puts it in Scorer. HitCollector is sparse; Scorer has a lot more
going on.

> Contrast this with an ANDScorer with ordered subclauses
> 'category:music' and 'text:"the-the"'. Checking whether the category
> field contains the term 'music' is really fast and efficient, and if
> the music category is relatively small we only have to do the
> expensive part of the query on a very small subset of the documents.

The advantage of implementing this with a QueryFilter is that instead
of combining the results from two PostingList objects, you're
combining a PostingList with a BitVector.

PostingList objects are iterators that refill from disk; BitVectors
are memory only. Fewer objects refilling from disk means fewer disk
seeks, which means better performance.

PList_Skip_To helps to optimize the case where you have one common
term and one rare term -- the larger PostingList will spend most of
its time skipping, which is cheaper than calling PList_Next over and
over. But it's still probably faster to have fewer PostingLists and
more BitVectors.

>
> Sure, you say, but what about restricting to ranges? Surely you need a
> filter for that? Currently, yes, but only because it's hard to write a
> custom scorer. I don't see any inherent limitation that would prevent
> a custom scorer from dealing with ranges directly. And if one wanted,
> I don't see any reason that its next-doc method has to use an index at
> all --- it could work from a cached bit vector just as well.

Yeah, that's true. :) A TermScorer just iterates over a
PostingList, spitting back document numbers (and derived scores).
You could just as easily feed it from a bitvector using
BitVec_Next_Set_Bit (which is in 0.15 but got zapped as unnecessary
in 0.20). Such a scorer would have to either supply a fixed score or
no score, since the BitVector would only store document number info.

> are there things that can currently be done with the Filter apparatus
> that couldn't be done as efficiently (or more so) with a Scorer?

Filters are going to be faster in some cases because of the caching.
However, BitVectors don't work well remotely, since they're too big
to pass around -- that's why SearchClient doesn't support filtering.
So Filters have their drawbacks, too.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch
On 7/2/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> Rather than add new public APIs, it's
> time to yank Hits->seek (and simplify Searcher),

This is definitely a direction I would like to see. As a raving
lunatic, I have the nagging fear that I will come across as a raving
lunatic if I keep raving without offering code. But in my crazed
silence, I've been making some stabs at simplifying the search API.
I've been peeling away layers, trying to figure out the minimum that
one can work with.

In short, what I've done is to write a set of more generic search
classes (Perl and C) that don't presume any particular scoring system.
Right now I'm concentrating on my specific application of searching
short documents with Proximity and Scope scoring, but ideally I'll
figure out how to retrofit the existing scoring mechanisms as well.
It's not working yet, and may not be the right direction for
KinoSearch as a whole, but it's helping me understand the task better
and might eventually make for good fodder for a simplified KinoSearch
API.

The hierarchy I'm working with is much flatter and more
straightforward: a reusable Query produces an index-specific Scorer
that returns a HitCollector: no Searcher, no Weight objects, no
Similarity object, no twisty mazes, no delayed inits, and no
Hits->seek. It's probably not quite generic enough for general use,
but I think it's going to be possible to move it in that direction
once I get it working. It's certainly simpler to comprehend, and I
think it might end up more efficient as well.
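In pseudo-Perl, the whole loop ends up about this simple (every class and
method name below is a placeholder, not KS API):

```perl
use strict;
use warnings;

# Sketch of the flattened hierarchy: a reusable Query produces an
# index-specific Scorer, which feeds a HitCollector.  My::Query,
# My::HitCollector, and all method names are hypothetical.
my $query     = My::Query->new( field => 'content', term => 'foo' );
my $scorer    = $query->make_scorer($index_reader);   # index-specific
my $collector = My::HitCollector->new;
while ( defined( my $doc_num = $scorer->next_match ) ) {
    $collector->collect( $doc_num, $scorer->tally );
}
```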

I started along the path of trying to subclass the existing Scoring
classes to make them work the way I wanted, but it wasn't working
well. Line by line the existing code is great to work with, but the
overall hierarchy hardcodes TF/IDF scoring at a fairly deep level.
While it's been slow, I've been much happier trying to factor this out
rather than overriding it class by class. And by redoing it, I'm
getting a much clearer understanding about how the current setup
works.

I've got tons of questions, but I'll limit myself to two for now:

1) Is there a good reason to keep BooleanScorer at the C level, rather
than moving it up into Perl? Its main function is to create the real
scorer out of the Boolean components (ANDScorer, ORScorer, etc) and it
seems like this real scorer could be assembled just as well by the
query processor. And having this all in Perl would give people a
better example of how to create their own custom Query parsers.

2) The current ORScorer calls Tally on its subscorers at the same time
it is skipping through documents, rather than at the end of the phase.
Is this a good practice that I should emulate? My instinct is that
it would be inefficient for certain types of queries:
((expensive-phrase OR expensive-phrase) AND rare-filter). Or is this
less problematic than it seems? If it's fine in general, then I'd be
tempted to combine the Next and Tally stages more generally.

Nathan Kurz
nate@verse.com

ps. I like the direction of KinoSearch::Simple, particularly the
integration of the indexing and searching. I'm tempted to think that
rather than calling it 'Simple', you should just call it 'KinoSearch'
and eventually have it be the main API.
more abstract interfaces to kinosearch
On Mon, Jul 02, 2007 at 01:35:45PM -0700, Marvin Humphrey wrote:
> >Is this true even when (like me) you are only interested in matching?
>
> In theory the (unfinished) MatchPosting class is supposed to help out
> with situations like yours. However, because it doesn't store token
> position, it doesn't support phrase matching, and maybe it needs to
> be rethought.

I *think* this meant "no, SHOULD won't help you", but I can't tell. Is that
true? :)

> No, there's no reason. I guess I thought the capabilities were
> implied by the class name. Looks like usability testing has revealed
> a flaw! ;)

BooleanQuery implies OR, but SHOULD doesn't (to me). I found this fairly
frustrating specifically *because* the class name implied that I could use it to
do OR, but I can't, because I don't want to sort by score.

> There are a lot of good databases out there. KS shouldn't aspire to compete
> with PostgreSQL.

No, and my use case is actually augmenting PostgreSQL. Pg (at least with the
hardware I have) doesn't search people's email fast enough. It is good at
things like referential integrity and constraints on the data. It isn't good
at looking through 250G of mail in under a second.

> KinoSearch is always going to be optimized for the use case of a
> large number of queries against a single view of an index.

I don't think I'm asking that it be anything else, and I think that is a
perfectly sensible optimization to make.

> I don't think we'll have to make a choice between matching alone and
> matching with scoring, though. It should be possible to support both
> without compromise.

That's sort of my point. Right now, it seems difficult to me to get matching
alone, because the documentation isn't there to explain to me how to get the
various different behaviors I want.

When I asked about RangeFilter and how to get the same behavior, but as a
Query, you said "Oh, you could plug X and Y and Z together, but that wouldn't
get you scoring". What I took away from that was that the pieces were there to
do what I wanted, but I didn't know it because a) I don't know anything about
IR theory and b) the documentation didn't tell me so, because it seems to
assume that you want scoring. Does that seem unfair?

Here is how I (naively) think the classes might make more sense:

* A class per document selector (term, phrase, range, boolean, ...)
* A 'Query' class combining selector + whatever makes up Query behavior
(scoring?)
* A 'Filter' class combining selector + whatever makes up Filter behavior (bit
vector caching?)
* explanations in specific selectors about why they can or can't be used for
either Query or Filter: Range can't be used with Query because there's no way
to know how it should contribute to scoring
* documentation on how to write new selectors, and on what things make a
particular selector suitable for Query or Filter
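That naive decomposition might be sketched like so. This is a Python sketch of the *proposed* design, not anything in KS today; every class, method, and attribute name here is invented:

```python
class TermSelector:
    """A selector knows only how to match documents -- no scoring,
    no caching."""

    scorable = True  # a term's rarity can feed tf/idf scoring

    def __init__(self, field, term):
        self.field, self.term = field, term

    def matching_docs(self, reader):
        # assume the reader maps (field, term) -> a sorted doc list
        return reader.postings(self.field, self.term)


class RangeSelector:
    scorable = False  # no principled way a range match affects score

    def __init__(self, field, lo, hi):
        self.field, self.lo, self.hi = field, lo, hi

    def matching_docs(self, reader):
        return reader.range_postings(self.field, self.lo, self.hi)


class Filter:
    """Filter behavior: a selector plus a cached bit set per reader."""

    def __init__(self, selector):
        self.selector = selector
        self._cache = {}

    def bits(self, reader):
        key = id(reader)
        if key not in self._cache:
            self._cache[key] = frozenset(self.selector.matching_docs(reader))
        return self._cache[key]


class Query:
    """Query behavior: a selector plus scoring; unscorable selectors
    are rejected up front, which documents the Range restriction."""

    def __init__(self, selector, boost=1.0):
        if not selector.scorable:
            raise TypeError("selector can't contribute to scoring")
        self.selector, self.boost = selector, boost
```

The point of the sketch is that "can this selector be a Query?" becomes a visible property of the selector itself rather than folklore.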

I can try to explain this better later when I'm not late for dinner. :)

hdp.
more abstract interfaces to kinosearch [ In reply to ]
On Mon, Jul 02, 2007 at 05:22:04PM -0400, Hans Dieter Pearcey wrote:
> I can try to explain this better later when I'm not late for dinner. :)

With a little time to think about it, I can now sum up better what I would find
really helpful.

I don't mind reading documentation and source. KinoSearch is more impenetrable
to me than most libraries, though, both because it's got so much C (making
random Dumpering difficult) and because there are a lot of different classes.

A basic overview of how the different components of searching work together
would make it much easier for me to understand each individual piece. I'm
aware of Filters, Queries, Scorers, Posting, Collectors, and so on, but for
anything past Filter and Query I have only a vague idea what they do; even for
Filter and Query I don't understand the difference much beyond "filters cache
and queries don't".

I'd really enjoy using all the pieces in the different combinations to do what
I want. I don't think I want KinoSearch to do more than it's designed to. In
some ways, the difficulty I've had grasping the details has been more
frustrating specifically because it is clearly very modular; if it were a big
chunk of unmalleable code, I'd give up and move on.

If you don't have the time to go into great detail, that's understandable.
Even a general sketch would help a lot.

Also, if you're interested, I wouldn't mind trying to organize anything I learn
into part of the manual, though documentation isn't my strong point.
Docs::DevGuide seems like a reasonable place for an "overview of KinoSearch
and how the classes fit together" section.

hdp.
more abstract interfaces to kinosearch [ In reply to ]
On 7/2/07, Hans Dieter Pearcey <hdp@pobox.com> wrote:
> A basic overview of how the different components of searching work together
> would make it much easier for me to understand each individual piece.

Hi Hans ---

I've been feeling much the same way, although I think the problem
isn't really with KinoSearch but with the Lucene architecture that it
is based on: it's hairy and convoluted. On the bright side, it's
pretty well documented. I found this page after I'd already spelunked
the KinoSearch code:
<http://lucene.apache.org/java/docs/scoring.html>. Apart from some
small differences in class names, it describes KinoSearch as well.
And aside from the formatting problems, the notes I made while trying
to figure out the code might help as well:
http://www.rectangular.com/pipermail/kinosearch/2007-June/001005.html

> I wouldn't mind trying to organize anything I learn
> into part of the manual, though documentation isn't my strong point.
> Docs::DevGuide seems like a reasonable place for an "overview of KinoSearch
> and how the classes fit together" section.

Not to discourage you from writing documentation, but I'm personally
of the opinion that documenting the current state is less important
than simplifying the way it works. The Lucene code is well
documented, but still (to my way of thinking) almost impenetrable.

With you trying to do scoreless matching, me trying to implement
straight proximity scoring, and Marvin on top of the current scoring
method, it seems like we ought to be able to come up with a generic
system that can be easily extended to cover most needs.

> I don't think I want KinoSearch to do more than it's designed to. In
> some ways, the difficulty I've had grasping the details has been more
> frustrating specifically because it is clearly very modular; if it were a big
> chunk of unmalleable code, I'd give up and move on.

Maybe you could send something trying to describe more exactly what
you are trying to accomplish, with less emphasis on how you are trying
to do it? My impression from reading your exchange with Marvin is
that what you want (unlike what I want) is indeed possible without any
changes to the C code or architecture, although without optimal
efficiency.

The 'SHOULD' clauses of the BooleanQuery are handled by the code in
ORScorer.c, which uses a common-English (as opposed to
logical/short-circuit) OR. All the clauses are checked, even if an
earlier clause matched. It might help to think of it as a "Some" or
"Sum" query: only documents that match at least one of the clauses
are passed on, and the sum of the subqueries is used as the total
score.
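As a language-neutral illustration of that "Sum" behavior (Python with invented names, not the actual ORScorer.c interface -- each subscorer is modeled as an iterator of (doc, score) pairs in ascending doc order):

```python
class OrScorer:
    """Doc-at-a-time OR: a doc matches if any subscorer matches it,
    and the scores of all matching subscorers are summed."""

    def __init__(self, subscorers):
        # Keep each subscorer's current (doc, score) posting.
        self.frontier = {}
        for s in subscorers:
            first = next(s, None)
            if first is not None:
                self.frontier[s] = first

    def __iter__(self):
        while self.frontier:
            # The next hit is the smallest doc any subscorer is on.
            doc = min(d for d, _ in self.frontier.values())
            score = 0.0
            for s in list(self.frontier):
                d, sc = self.frontier[s]
                if d == doc:
                    score += sc          # every matching clause contributes
                    nxt = next(s, None)  # advance past this doc
                    if nxt is None:
                        del self.frontier[s]
                    else:
                        self.frontier[s] = nxt
            yield doc, score
```

All clauses sitting on the current doc get consumed, matching or not short-circuiting never comes into it.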

To get this to work for your purposes, you either need to
change/subclass ORScorer.c to return a constant score, or get its
subqueries to return 0 for a score so that it doesn't matter how many
of them match. Changing ORScorer would require working at the C
level, but I think you should be able to get the subscores to zero by
setting 'boost' to zero somewhere.

This is a somewhat inefficient hack (as it calculates the full score
before multiplying it by 0), but from what I can tell about your goals
should work. Alternatively, if you are doing away with scoring
altogether, you might be able to use a SortCollector instead of a
TopDocCollector. I haven't played with it, but I think it completely
ignores the returned score and lets you specify your own sorting
function.
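The zeroed-boost hack amounts to making every match carry the same score; a hypothetical wrapper (again a Python sketch, not KS API) shows the intent:

```python
class ConstantScorer:
    """Keeps the wrapped scorer's matching behavior but reports a
    fixed score for every hit -- matching without ranking."""

    def __init__(self, inner, constant=1.0):
        self.inner = inner        # any iterator of (doc, score) pairs
        self.constant = constant

    def __iter__(self):
        for doc, _ignored in self.inner:
            yield doc, self.constant
```

Unlike multiplying by a zero boost, a real implementation along these lines could skip computing the inner tally entirely.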

Hope this helps,

Nathan Kurz
nate@verse.com
more abstract interfaces to kinosearch [ In reply to ]
On Jul 2, 2007, at 5:35 PM, Hans Dieter Pearcey wrote:

> A basic overview of how the different components of searching work
> together would make it much easier for me to understand each
> individual piece. I'm aware of Filters, Queries, Scorers, Posting,
> Collectors, and so on, but for anything past Filter and Query I have
> only a vague idea what they do; even for Filter and Query I don't
> understand the difference much beyond "filters cache and queries
> don't".


This is actually quite difficult to sum up succinctly without using
IR jargon. :(

Hopefully the Lucene scoring documentation Nathan pointed you at will
help. Beyond that, I think what we need is an improved and extended
tutorial and a Cookbook. My intention is to germinate these as HowTo
docs on the Wiki and move the most important ones to the official
documentation as they mature.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch [ In reply to ]
On Jul 2, 2007, at 2:16 PM, Nathan Kurz wrote:

> The hierarchy I'm working with is much flatter and more
> straightforward: a reusable Query produces an index-specific Scorer
> that returns a HitCollector:

> no Searcher,

Hmm. I don't see the advantage in that.

> no Weight objects,

Here is the rationale behind Weight, from the original commit message
in January 2003.

    a. Queries are no longer modified during a search. This makes
       it possible, e.g., to reuse the same query instance with
       multiple indexes from multiple threads.

I can see doing away with Weight. Then our queries would look more
like Plucene's.

Note that you have to modify the Query itself. If you try to wait
until creating the Scorer to perform the weighting, you run into
problems with MultiSearcher, specifically with the calculation of
IDF. The MultiSearcher object knows how common a term is across all
the sub-indexes *combined*. That's information that each individual
IndexReader doesn't have access to.

> no Similarity object,

For large corpuses, optimal results are usually only obtained if you
use different length_norm functions for short fields like "title" and
longer fields like "content". See the explanation in the docs for
LongFieldSim:

http://www.rectangular.com/kinosearch/docs/devel/KSx/Search/LongFieldSim.html

I know that length normalization doesn't matter for your application,
but it's key for standard tf/idf.

I think Doug Cutting nailed this aspect of the design. Concentrating
as much IR theory as possible in one class was absolutely the right
move, IMO. This stuff would be harder to understand, test,
experiment with, or improve if it was spread out over the whole library.

> no twisty mazes,

:)

> no delayed inits,

Any of these that can go away, I say good riddance. They are usually
artifacts of a simplified external API, though. The delayed init
spares the caller from the responsibility of invoking some init
routine manually before a loop begins.

> and no Hits->seek.

I'd love to see a patch for this one. It would be an excellent start.

> It's probably not quite generic enough for general use,
> but I think it's going to be possible to move it in that direction
> once I get it working. It's certainly simpler to comprehend, and I
> think it might end up more efficient as well.

What optimizations have you found?

> I started along the path of trying to subclass the existing Scoring
> classes to make them work the way I wanted, but it wasn't working
> well. Line by line the existing code is great to work with, but the
> overall hierarchy hardcodes TF/IDF scoring at a fairly deep level.
> While it's been slow, I've been much happier trying to factor this out
> rather than overriding it class by class. And by redoing it, I'm
> getting a much clearer understanding about how the current setup
> works.
>
> I've got tons of questions, but I'll limit myself to two for now:
>
> 1) Is there a good reason to keep BooleanScorer at the C level, rather
> than moving it up into Perl?

BooleanScorer's internal compilation phase could be handled in Perl.
There aren't any performance considerations.

A thought: what if we did away with BooleanQuery, replacing it with
ANDQuery, ORQuery, etc? I dunno that we want to open that can o'
worms when 0.20 is so close, though.

> 2) The current ORScorer calls Tally on its subscorers at the same time
> it is skipping through documents, rather than at the end of the phase.
> Is this a good practice that I should emulate? My instinct is that
> it would be inefficient for certain types of queries:
> ((expensive-phrase OR expensive-phrase) AND rare-filter).

That's true. However it's quite efficient for this:

((expensive-phrase OR expensive-phrase) AND rare-term).

That's because all the subscorers get to call Skip_To for docs that
match 'rare-term' and don't bother with building phrase matches for
docs that can't be hits.
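The doc-at-a-time win can be sketched like this (Python with invented names; `skip_to` stands in for Scorer_Skip_To):

```python
class ListScorer:
    """Toy scorer over a sorted doc list, supporting skip_to."""

    def __init__(self, docs):
        self.docs = docs

    def skip_to(self, target):
        # First matching doc >= target, or None when exhausted.
        for d in self.docs:
            if d >= target:
                return d
        return None


def and_scorer(subscorers):
    """Leapfrog intersection: every subscorer jumps straight to the
    current candidate, so an expensive clause never evaluates docs
    that a rare clause has already ruled out."""
    doc = 0
    while True:
        candidate = subscorers[0].skip_to(doc)
        if candidate is None:
            return
        agreed = True
        for s in subscorers[1:]:
            d = s.skip_to(candidate)
            if d is None:
                return
            if d != candidate:
                doc = d       # raise the floor and leapfrog again
                agreed = False
                break
        if agreed:
            yield candidate
            doc = candidate + 1
```

Putting the rare clause first means the expensive clauses only ever get asked about its (few) candidates.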

> ps. I like the direction of KinoSearch::Simple, particularly the
> integration of the indexing and searching. I'm tempted to think that
> rather than calling it 'Simple', you should just call it 'KinoSearch'
> and eventually have it be the main API.

I disagree. Searching and indexing are completely distinct tasks.

http://en.wikipedia.org/wiki/God_object

It's fine if someone like Hans creates a convenience layer on top of
KS that violates separation of concerns, but the primary classes
should not be designed that way.

PS: Family's coming to visit, so this'll be my last big missive for a
while.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
more abstract interfaces to kinosearch [ In reply to ]
On 7/3/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > no Searcher,
>
> Hmm. I don't see the advantage in that.

You certainly may be right. And to reiterate, I'm not suggesting that
these things would be good for KinoSearch as a whole, only for my
particular purposes. After I've got something that works we (you) can
consider adopting any (or none) of them. I'm playing blue-sky here.

It's not that I see a great advantage in getting rid of Searcher, only
that I want to flatten the hierarchy. I think it is of conceptual
benefit, but we'll see if the code comes out OK. I like that it
exposes the Scorers better.

> > no Weight objects,
>
> Note that you have to modify the Query itself. If you try to wait
> until creating the Scorer to perform the weighting, you run into
> problems with MultiSearcher, specifically with the calculation of
> IDF. The MultiSearcher object knows how common a term is across all
> the sub-indexes *combined*. That's information that each individual
> IndexReader doesn't have access to.

Wow, so you are making multiple round trips to the search servers? I
was guessing that a local approximation would suffice, so that each
server would be independent. I guess that's the only way to get truly
accurate weights, but I wonder if the increased precision is worth the
cost.

I don't see that this means I have to modify the Query, though. I've
thought a little bit about distributed search, but only for my
specific non-weighted case. I've been presuming (for my purposes) a
parallel set of HTTP connections to the server farm, each returning N
top hits, from which the top N are selected.
I'll think more.

> > no Similarity object,
>
> I know that length normalization doesn't matter for your application,
> but it's key for standard tf/idf.

Roger. As I said, I'm optimizing for my particular needs first. I do
like that this data is well encapsulated. Rather than saying 'no
Similarity object', perhaps it would be better to say that the
Similarity object becomes scorer specific: each set of cooperating
scorers shares a ScorerData object, which according to the needs of
that scoring system may or may not look like the current Similarity
object.

> > no delayed inits,
>
> Any of these that can go away, I say good riddance. They are usually
> artifacts of a simplified external API, though. The delayed init
> spares the caller from the responsibility of invoking some init
> routine manually before a loop begins.

Yes, in particular it's because of the incremental formation of
BooleanScorer. Rather than requiring an init call, I'm thinking that
we can just require that all the clauses be passed to the constructor
like the subscorers currently do.

> What optimizations have you found?

Nothing major, and nothing tested. I think there is a small gain by
having Scorer_Advance return a doc number directly rather than a
boolean, obviating the need for a follow-up call to Scorer_Doc. I
think I've done a slightly better job of caching the current document
positions of the subscorers.
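A sketch of that combined call (hypothetical names, not KS's actual Scorer_Advance/Scorer_Doc signatures; since valid doc numbers start at 0, a negative sentinel has to mark exhaustion):

```python
DONE = -1  # sentinel: doc numbers start at 0, so reserve a negative

class MergedAdvance:
    """advance() returns the next matching doc number directly,
    replacing the advance-then-doc() pair with a single call."""

    def __init__(self, docs):
        self._docs = iter(docs)

    def advance(self):
        return next(self._docs, DONE)
```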

But some of this isn't even compiled yet, much less benchmarked. And
I'm sure I've missed some of the more subtle optimizations you've
made, as well as royally messed up in places. Once I have something
working correctly, I'll look closer at everything you're doing and
compare them.

> A thought: what if we did away with BooleanQuery, replacing it with
> ANDQuery, ORQuery, etc? I dunno that we want to open that can o'
> worms when 0.20 is so close, though.

Yes, that is the direction I was thinking. But upon further
reflection, I'm not sure it's a good idea. I like bringing
the subscorers more into the light, but for the Proximity scoring I'm
doing I actually need a top level scorer in a position similar to that
which BooleanScorer is in now.

> > 2) The current ORScorer calls Tally on its subscorers at the same time
> > it is skipping through documents, rather than at the end of the phase.
> > Is this a good practice that I should emulate?
>
> That's true. However it's quite efficient for this:
>
> ((expensive-phrase OR expensive-phrase) AND rare-term).

You are right. A better worst-case example would have been:

(expensive-tally OR expensive-tally) AND (expensive-tally OR expensive-tally)

Assume the pessimal case where the first and-clause matches only odd
documents, and the second and-clause matches only even. In the
current code, we'd still be performing a lot of expensive Tally calls even
though in theory we don't need to perform any.

I'm still not sure if this makes a difference, though, and whether or
not I should try to keep Tally clearly distinct and after Advance. So
long as finding a match is more expensive than tallying it, it's not
going to make much of a difference.
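A toy demonstration of keeping Tally strictly after Advance (a Python sketch with invented names): in the odd/even pessimal case, no tallies fire at all, because tally() is only invoked once every clause has matched.

```python
class LazyTallyScorer:
    """Separates matching (advance) from scoring (tally), and counts
    how many expensive tally() calls actually happen."""

    def __init__(self, docs, tally_fn):
        self.docs = docs          # set of matching doc numbers
        self.tally_fn = tally_fn  # the "expensive" scoring work
        self.tally_count = 0

    def matches(self, doc):
        return doc in self.docs   # cheap membership check

    def tally(self, doc):
        self.tally_count += 1
        return self.tally_fn(doc)


def and_search(clauses, max_doc):
    """Advance first; tally only docs that survive every clause."""
    hits = []
    for doc in range(max_doc):
        if all(c.matches(doc) for c in clauses):
            hits.append((doc, sum(c.tally(doc) for c in clauses)))
    return hits
```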

> > ps. I like the direction of KinoSearch::Simple, particularly the
> > integration of the indexing and searching. I'm tempted to think that
> > rather than calling it 'Simple', you should just call it 'KinoSearch'
> > and eventually have it be the main API.
>
> I disagree. Searching and indexing are completely distinct tasks.

I appreciate your disagreement, although I think there isn't much
danger of God objects when using a Has-A
(http://en.wikipedia.org/wiki/Has-a) approach like Hans is doing.

In my mind, there is an Index, and on this Index one can perform
operations. I was suggesting that internally searching and indexing
would still be distinct, but that these operations would be accessible
through a single composite (not multiply-inherited) KinoSearch object.

Going back to the top of this message, I think I'd like to see that
Index/KinoSearch object replace the current Searcher object as a
convenience composite object. But this is a mental model question:
internally, I too want them to be clearly separated.

Nathan Kurz
nate@verse.com
more abstract interfaces to kinosearch [ In reply to ]
On Jul 3, 2007, at 1:06 PM, Nathan Kurz wrote:

> I'm playing blue-sky here.

I appreciate the challenges. :)

>> The MultiSearcher object knows how common a term is across all
>> the sub-indexes *combined*. That's information that each individual
>> IndexReader doesn't have access to.
>
> Wow, so you are making multiple round trips to the search servers?

Yes.

There's one call to each node for the doc freqs in
MultiSearcher->create_weight:

# get an aggregated doc_freq for each term
my @aggregated_doc_freqs = (0) x scalar @terms;
for my $i ( 0 .. $#$searchables ) {
    my $doc_freqs = $searchables->[$i]->doc_freqs( \@terms );
    for my $j ( 0 .. $#terms ) {
        $aggregated_doc_freqs[$j] += $doc_freqs->[$j];
    }
}

Then there's one call to each searcher for collect().

Then individual doc content gets fetched from the searcher that owns
them, for however many hits are to be displayed.


> I was guessing that a local approximation would suffice, so that each
> server would be independent.

That would be close some of the time and wildly inaccurate some of
the time.
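A toy numeric example of how wild the divergence can get, using a generic Lucene-style idf formula (not KS's exact weighting):

```python
import math

def idf(doc_freq, num_docs):
    # Generic inverse document frequency: rarer terms score higher.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A term skewed across two nodes: 1000 postings on node A but only 1
# on node B, with 100,000 docs on each node.
local_a = idf(1000, 100_000)      # node A's local estimate
local_b = idf(1, 100_000)         # node B thinks the term is very rare
global_idf = idf(1001, 200_000)   # the combined truth
```

Node B's local estimate roughly doubles the term's weight relative to the aggregated statistics, so its hits would be ranked on a different scale than node A's.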

> I guess that's only way to get truly
> accurate weights, but I wonder if the increased precision is worth the
> cost.

Very few bits are going over the wire until doc content starts flying.

> each set of cooperating
> scorers shares a ScorerData object, which according to the needs of
> that scoring system may or may not look like the current Similarity
> object.

You could do this by extending Similarity, which I find conceptually
appealing. But I can see why you'd find it cleaner to start fresh.

>> > no delayed inits,
>>
>> Any of these that can go away, I say good riddance. They are usually
>> artifacts of a simplified external API, though. The delayed init
>> spares the caller from the responsibility of invoking some init
>> routine manually before a loop begins.
>
> Yes, in particular it's because of the incremental formation of
> BooleanScorer. Rather than requiring an init call, I'm thinking that
> we can just require that all the clauses be passed to the constructor
> like the subscorers currently do.

Every place I've done that I've regretted it, and in some places
changed it -- e.g. Highlighter 0.15 vs 0.20_04.

http://www.rectangular.com/kinosearch/docs/stable/KinoSearch/Highlight/Highlighter.html
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Highlight/Highlighter.html

Jamming everything into your constructor isn't good design. Pretty
soon you want to add another parameter, and the thing starts to get
gnarly....

> Nothing major, and nothing tested. I think there is a small gain by
> having Scorer_Advance return a doc number directly rather than a
> boolean, obviating the need for a follow-up call to Scorer_Doc.

Doc numbers begin at 0, so that's not going to work without making
some changes. Might be worth trying, but a lot of stuff that depends
on the current behavior of IndexReader->max_doc would go haywire.

> A better worst-case example would have been:
>
> (expensive-tally OR expensive-tally) AND (expensive-tally OR
> expensive-tally)
>
> Assume the pessimal case where the first and-clause matches only odd
> documents, and the second and-clause matches only even. In the
> current code, we'd still be performing a lot of expensive Tally's even
> though in theory we don't need to perform any.
>
> I'm still not sure if this makes a difference, though, and whether or
> not I should try to keep Tally clearly distinct and after Advance. So
> long as finding a match is more expensive than tallying it, it's not
> going to make much of a difference.

I think ORScorer is the only place this is going to apply. It's a
valid concern, but the current BooleanScorer proceeds doc-at-a-time,
which allows it to call Scorer_Skip_To() on its subscorers. That's
not the case for the BooleanScorer used in KS 0.15.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/