Mailing List Archive

rfc: faceted search
Given my head is in this problem space already, I figure there's no
harm distilling some thoughts while the ideas are 'fresh'.

Forgive me if there are some details of KinoSearch that I have yet to
properly understand.

To provide faceted search (FS) capability for KinoSearch (KS)
requires, to quote Marvin, "massive server-side caching". We like!!

An FS class would trawl the index on startup (or even better during
index time and store with the invindex - there's an API for index
overlays... right?!?) and generate bitvectors for desired field(s)
terms - storing a 1 in the doc_num position for documents possessing
the given term. For a field with 100 facets or terms, you'd need 100
bit vectors, each at most max_docs bits long.

An FS query would wrap a regular KS query, AND'ing the query results
with each term's cached bitvector to derive a count of documents
within the wrapped query that possess that term (100 ANDs + 100
counts of bitvectors no greater than max_docs bits).
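
Roughly, in pure Perl, the counting step looks like this (very much a
sketch - the @docs structure, the 'color' facet field, and
$result_bits are invented for illustration, and deletions are
ignored):

# Build one bit vector per facet term, using plain Perl strings and
# vec().  %facet_bits maps facet term => bit string.
my %facet_bits;
for my $doc (@docs) {    # @docs: hashrefs with doc_num and color keys
    my $term = $doc->{color};
    $facet_bits{$term} = '' unless defined $facet_bits{$term};
    vec( $facet_bits{$term}, $doc->{doc_num}, 1 ) = 1;
}

# Count hits per facet term by AND'ing each cached bit string against
# the query's result bits ($result_bits, another vec()-style string).
my %facet_counts;
for my $term ( keys %facet_bits ) {
    my $and = $result_bits & $facet_bits{$term};    # bitwise string AND
    $facet_counts{$term} = unpack '%32b*', $and;    # count the set bits
}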

I have made a VERY naive implementation of this without gluing into
KinoSearch XS/C, since I confess to _barely_ grokking charmony + XS +
C beyond kindergarten level.

Facet::Counter2 (yes, I embrace version control, really) is quite
hopeless from a practical standpoint and will only count the facets
of documents returned by KS::Search::Hits, limited to num_wanted.

My next goal would be to better understand KS internals so as to
* use KS BitVectors to replace scalars and vec
* make Facet::Counter into KSx::Search::FacetQuery to collect the >0
scored results of a child query and count facets this way instead of
using KS::Search::Hits

Constructive abuse welcome.

AB
Re: rfc: faceted search
On Tue, Jul 29, 2008 at 5:33 AM, Andrew Bramble
<bramble.andrew@gmail.com> wrote:
> To provide faceted search (FS) capability for KinoSearch (KS)
> requires, to quote Marvin "massive server-side caching". We like!!

My mental model of faceted search (likely faulty) has two
requirements: to break down search results by facet, and to allow
filtering by facet. Can't this be done with just Boolean queries and
stored field values? Why does it require bitvectors and wrapped
queries?

For a simple example, presume that each 'doc' is given a color and a
size. We want to search for large, blue widgets. Can't we just
search for "size:large && color:blue && text:widget"? This seems like
it would already be pretty efficient.
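
In code, something like this - the query parser setup is omitted, and
whether the parser handles 'AND' this way is a guess on my part:

# Facet filtering as a plain Boolean query over stored/indexed fields.
my $query = $query_parser->parse('size:large AND color:blue AND widget');
my $hits  = $searcher->search( query => $query );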

And on the display side, wouldn't it be easier just to have a forward
index listing the combined facets for each doc? You'd have to buzz
over this list for the results of each new query, but it seems like it
would be easy to cache the facet counts at the application level.
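
In pseudo-Perl, with plenty of hand-waving over how the per-doc facet
list is actually fetched (@matching_doc_nums and @doc_to_facets are
stand-ins):

# Tally facet values for the docs matching the query, given a forward
# list mapping doc_num => stored facet values.
my %facet_counts;
for my $doc_num (@matching_doc_nums) {
    my $facets = $doc_to_facets[$doc_num];   # e.g. [ 'color:blue', 'size:large' ]
    $facet_counts{$_}++ for @{ $facets || [] };
}
# Cache %facet_counts at the application level, keyed on the query.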

What am I missing?

Nathan Kurz
nate@verse.com

Re: rfc: faceted search
On Jul 29, 2008, at 4:33 AM, Andrew Bramble wrote:

> A FS class would trawl the index on startup (or even better during
> index time and store with the invindex,

Caching on startup is a known successful design (used by Solr), but
storing with the invindex might work well, particularly if we can get
mmap happening properly.

(Note to Nate: mmap is something I've been considering for term
dictionaries as well, as it could potentially eliminate startup costs
for sort fields or even index Lexicons.)

> there's an API for index overlays... right?!?)

The architecture is mostly there. There's no API yet, but perhaps
this project will provide the necessary nudge. More at the bottom of
this reply...

> and generate bitvectors for desired field(s)
> terms - storing a 1 in doc_num position for documents possessing the
> given term. For a field with 100 facets or terms - you'd need 100 bit
> vectors of at most max_docs bits long.

Exactly.

> An FS query would wrap a regular KS query - AND'ing the query results
> with each term's cached bitvector to derive a count of documents
> within the wrapped query that possess that term ( 100 ANDs + 100 counts
> of bitvectors no greater than maxdoc bits ).

Yes. Something like this:

my $intersection = KinoSearch::Util::BitVector->new;
while ( my ( $field_name, $field_set ) = each %cached_bit_sets ) {
    $intersection->copy($result_set);    # start from the query's result set
    $intersection->and($field_set);      # intersect with the cached facet bits
    $facet_counts{$field_name} = $intersection->count;
}

> I have made a VERY naive implementation of this without gluing into
> KinoSearch XS/C, since I confess to _barely_ grokking charmony + XS +
> C beyond kindergarten level.

Let's just worry about rapid prototyping for now.

> My next goal would be to better understand KS internals so as to
> * use KS BitVectors to replace scalars and vec

KinoSearch::Util::BitVector has now been exposed as a public class as
of r3661. The API is similar to that of Java's BitSet.
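
Quick usage sketch, for reference (method names follow the BitSet-ish
API - double-check against the POD):

use KinoSearch::Util::BitVector;

my $bit_vec = KinoSearch::Util::BitVector->new( capacity => 10 );
$bit_vec->set(3);                  # flip the bit for doc 3
print $bit_vec->get(3), "\n";      # true if set
print $bit_vec->count, "\n";       # number of set bits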

> * make Facet::Counter into KSx::Search::FacetQuery to collect the >0
> scored results of a child query and count facets this way instead of
> using KS::Search::Hits

If all we wanted was the facet counts, we could start with application
logic something like this...

my $result_set = KinoSearch::Util::BitVector->new(
    capacity => $searcher->max_docs + 1,
);
my $bit_collector = KinoSearch::Search::HitCollector::BitCollector->new(
    bit_vector => $result_set,
);
$searcher->collect(
    query     => $query,
    collector => $bit_collector,
);

... followed by the facet count code above.

HitCollector, BitCollector, and Searcher::collect have now been
exposed as public APIs as of r3662.

I might suggest one other bullet point: we want to be thinking about
how to scale this across a search cluster from the get-go. We can get
facet counts working pretty fast using the techniques above, but only
for a single machine. It becomes more challenging with multiple nodes
because Searchable's collect() method can't be implemented to work
efficiently over a remote connection.

Perhaps opening up InvIndexer and IndexReader is the key. (We can
assume that we're extending the index with new files, though that
wouldn't strictly be necessary as the existing index files contain all
the necessary information.) Here's a rough sketch of how things might
work at index-time:

my $facet_writer = KSx::Facets2::FacetWriter->new( invindex => $invindex );
my $seg_writer   = KinoSearch::Index::SegWriter->new(
    invindex => $invindex,
);
$seg_writer->add_writer($facet_writer);
my $invindexer = KinoSearch::InvIndexer->new(
    invindex   => $invindex,
    seg_writer => $seg_writer,
);
...

And, at search-time:

my $facet_reader = KSx::Facets2::FacetReader->new(
    invindex => $invindex,
    fields   => \@facet_fields,
);
my $reader = KinoSearch::Index::IndexReader->open(
    invindex => $invindex,
);
$reader->add_readers($facet_reader);
my $searcher = KinoSearch::Searcher->new(
    reader => $reader,
);
...

while ( my $cgi = CGI::Fast->new ) {
    my $query       = $queryparser->parse( $cgi->param('q') || '' );
    my $facet_query = KSx::Facets2::FacetQuery->new(
        query  => $query,
        fields => \@facet_fields,
    );
    my $hits = $searcher->search( query => $facet_query );
    ...
}

The main idea is that the FacetQuery class would rely on an
IndexReader having a FacetReader as one of its children. The
FacetReader would be responsible for holding the cached BitVectors.
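
To make that concrete, a bare-bones FacetReader might look vaguely
like this (pure vaporware - none of it exists yet, and the method
names are invented):

package KSx::Facets2::FacetReader;
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    my $self = bless {
        invindex => $args{invindex},
        fields   => $args{fields},
        cache    => {},              # field => { term => BitVector }
    }, $class;
    # ... load or build the cached BitVectors here ...
    return $self;
}

# Hand back the cached result set for one facet value.
sub fetch_bit_vector {
    my ( $self, $field, $term ) = @_;
    return $self->{cache}{$field}{$term};
}

1;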

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: rfc: faceted search
On Tue, Jul 29, 2008 at 2:53 PM, Nathan Kurz <nate@verse.com> wrote:
> My mental model of faceted search (likely faulty) has two
> requirements: to break down search results by facet, and to allow
> filtering by facet. Can't this be done with just Boolean queries and
> stored field values? Why does it require bitvectors and wrapped
> queries?

Sorry to reply to myself, but I thought about this a bit more last
night. I'm now more sure that bit vectors and wrapped queries are
_not_ a good solution for faceted search. The bit vectors become
unworkable once you have a large number of facets (think of using
'author' as a facet on a large dataset), and the wrapped query
approach doesn't spread well to a cluster (as one would be forced to
round-trip the full list of matching docs over the net and not just
the top-n).

Instead, the problem can be solved more scalably by reusing the
existing inverted and document indexes. Really the only change
necessary is accepting that facet counting needs to be done during
the main query, and not as a later step like gathering excerpts. This
makes sense, though, as a facet count applies to the full query, and
the counts can be considered a query response just as the matching
docs are.

Instead of wrapping a query, one just uses a FacetCollector. Hit by
hit, this collector (which could be wrapping a HitCollector) steps
through the facet field's DocVector (is this where the forward index
of terms is kept?), adding up term occurrences.
These facet counts are then returned as part of the response along
with the top matches. Caching, if it is to occur, happens in a
centralized way (memcached) at the top level for the entire query,
rather than component by component within the cluster.
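
A sketch of what I mean - the collect() signature and the
doc => facet-terms lookup are both guesses on my part:

package FacetCollector;
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        inner      => $args{inner},        # wrapped HitCollector, if any
        doc_facets => $args{doc_facets},   # arrayref: doc_num => [terms]
        counts     => {},
    }, $class;
}

sub collect {
    my ( $self, $doc_num, $score ) = @_;
    $self->{counts}{$_}++ for @{ $self->{doc_facets}[$doc_num] || [] };
    $self->{inner}->collect( $doc_num, $score ) if $self->{inner};
}

sub facet_counts { return $_[0]->{counts} }

1;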

Your servant of Ockham,

Nathan Kurz
nate@verse.com

ps. I can't tell at a glance how KinoSearch currently treats fields
in its indexes. For example, does each field get its own lexicon?

Re: rfc: faceted search
On Jul 30, 2008, at 11:03 AM, Nathan Kurz wrote:

> The bit vectors become
> unworkable once you have a large number of facets (think of using
> 'author' as a facet on a large dataset),

If you anticipate needing a large number of facets with small data
sets, that's where you don't use BitVectors any more. Instead you use
a short array of sorted integers, possibly compressed -- I think Solr
uses a class called SortedVIntList for this. If we added such a
feature, we'd need to create methods to and() and or() BitVectors
against these integer lists instead of against other BitVectors.

That doesn't solve all our problems, but it does address the issue of
sparse BitVectors using too much memory.
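
For illustration, counting a facet stored as a sorted array of doc
nums against the result-set BitVector could be as simple as this
(assuming get() behaves like Java BitSet's get()):

my %facet_counts;
while ( my ( $term, $doc_nums ) = each %sorted_doc_num_lists ) {
    my $count = 0;
    for my $doc_num (@$doc_nums) {
        $count++ if $result_set->get($doc_num);
    }
    $facet_counts{$term} = $count;
}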

> and the wrapped query
> approach doesn't spread well to a cluster (as one would be forced to
> round-trip the full list of matching docs over the net and not just
> the top-n).

You never need to communicate the full list of matching docs across
the network. All you need to communicate is the top-n matching docs
and the facet counts, which isn't a problem.

How to pass those facet counts, I'm not exactly sure. We might need
to extend TopDocs or something like that. Since Solr sits on top of
Lucene, it must have its own Searcher-style class which already
knows how to pass facet counts. I wouldn't want to add such
capabilities to our core searching classes, though. Instead, I'd like
to make IndexReader and Searchable extensible so that they can handle
arbitrary Query subclasses, such as FacetQueries. We'll see how that
works out. :)

> Instead the problem can be better scalably solved by reusing the
> existing inverted and document indexes. Really the only change
> necessary is accepting that facet counting needs to be done during
> the main query, and not as a later step like gathering excepts.

Caching result sets for each facet into BitVectors or what-have-you
is important because matching a particular facet is likely to be an
expensive query. Here are the top facet counts for a search at Amazon
for "moon":

* Books (343,415)
* MP3 Downloads (35,160)
* Home & Garden (22,234)
* Music (7,086)
...

We don't want to execute 'category:books', 'category:mp3_downloads',
etc. with each search query. Therefore, we have to cache the result
sets for those facets. We can do it at search-time like so:

for my $category (qw( books mp3_downloads home_and_garden music )) {
    my $bit_vec = KinoSearch::Util::BitVector->new(
        capacity => $searcher->max_docs + 1,
    );
    my $bit_collector = KinoSearch::Search::HitCollector::BitCollector->new(
        bit_vector => $bit_vec,
    );
    my $cat_query = KinoSearch::Search::TermQuery->new(
        field => 'category',
        term  => $category,
    );
    $searcher->collect(
        query     => $cat_query,
        collector => $bit_collector,
    );
    $facet_caches{$category} = $bit_vec;
}

We could also potentially perform the same sort of search at
index-time and write out the result sets as e.g. bit-arrays. That
way, it wouldn't be necessary to execute a bunch of searches at
search-time startup, especially if we could set up special BitVectors
where the bit arrays were mmap'd against those files.

If I understand correctly, we'd get an even bigger benefit from index-
time preparation of facet result sets when there are several search
processes, since the data for the facet result sets would be cached in
shared system memory pages, instead of in per-process BitVector objects.
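
As a plain-Perl illustration of the file round trip (the file name is
invented, and the real thing would serialize BitVector's internal bit
array and ideally mmap it rather than slurp it):

# Index time: dump a facet's bit string to a file inside the invindex.
open( my $out_fh, '>:raw', "$invindex_path/facet_books.bits" )
    or die "Can't write: $!";
print $out_fh $books_bits;
close $out_fh;

# Search time: read the bit string back (mmap would avoid the copy).
open( my $in_fh, '<:raw', "$invindex_path/facet_books.bits" )
    or die "Can't read: $!";
my $cached_books_bits = do { local $/; <$in_fh> };
close $in_fh;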

> Instead of wrapping a query, one just uses a FacetCollector. Hit by
> hit, this collector (which could be wrapping a HitCollector) steps
> through the facet field's DocVector (is this where the forward index
> of terms is kept?) adding up term occurrences.

I think there's some confusion as to the role of DocVector. Each
DocVector is basically a self-contained single-document inverted
index, and retrieving a DocVector is approximately as expensive as
retrieving a Doc. Using DocVector while iterating through posting
lists isn't feasible; retrieving one DocVector for each of the top
hits after you've finished searching is.

The only thing that uses DocVector right now is the Highlighter.
DocVector will also come in handy if we get an Explain() method
happening for Query/Compiler.

> These facet counts are then returned as part of the response along
> with the top matches. Caching, if it is to occur, happens in a
> centralized way (memcached) at the top level for the entire query,
> rather than component by component within the cluster.

I can't quite wrap my head around that. Can you please elaborate?

> ps. I can't tell at a glance how KinoSearch currently treats fields
> in its indexes. For example, does each field get its own lexicon?

Yes, right now that is the case. In every segment, each field gets
its own .lex and .lexx "files". When there are a lot of fields, that
means there are a huge number of such "files", but since KS uses a
virtual file system and all those "files" are actually contained
within a single .cf file per segment, we don't run out of
descriptors. Peek at the JSON inside a .cfmeta file to get an idea of
what's going on.

Per-field lexicon files are not perfectly satisfactory, but explaining
all the reasoning would require an extended digression.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: rfc: faceted search
On Wed, Jul 30, 2008 at 1:38 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Jul 30, 2008, at 11:03 AM, Nathan Kurz wrote:
>
>> The bit vectors become
>> unworkable once you have a large number of facets (think of using
>> 'author' as a facet on a large dataset),
>
> If you anticipate needing a large number of facets with small data sets,
> that's where you don't use BitVectors any more.

Great, I think we agree here. The question might be what a 'large
number' means in this context. My guess is that the 'large number'
break point is somewhere between 10 and 100. My other guess (probably
skewed by my own interests) is that most of the interesting
applications are going to require 'large numbers' of facets, often
'very large numbers'. Since solutions that work with a large number
of facets can (IMO) work adequately for situations with small numbers
(but not vice versa), the large number solution strikes me as
primary.

> Instead you use a short
> array of sorted integers, possibly compressed -- I think Solr uses a class
> called SortedVIntList for this. If we added such a feature, we'd need to
> create methods to and() and or() BitVectors against these integer lists
> instead of against other BitVectors.

Rather than adding a separate infrastructure for doing Boolean
operations across BitVectors and SortedVIntList, I would strongly
suggest shoehorning these structures into the existing index
structures and using the existing scoring system. Just as it should
be possible for an index to use P4Delta compression, it should be
possible for an index to store information as a BitVector. This buys
you a lot more flexibility with a cleaner infrastructure. I'm also
suggesting that the current field-specific inverted index would work
fine for many applications.

>> and the wrapped query
>> approach doesn't spread well to a cluster (as one would be forced to
>> round-trip the full list of matching docs over the net and not just
>> the top-n).
>
> You never need to communicate the full list of matching docs across the
> network. All you need to communicate is the top-n matching docs and the
> facet counts, which isn't a problem.

Great. This was my misunderstanding of the proposed solution.

> How to pass those facet counts, I'm not exactly sure.

There are lots of solutions to this, and it doesn't strike me as a
worry at all. It can be solved (and changed) after the internals are
sorted out.

> Caching result sets for each facet into BitVectors or what-have-you is
> important because matching a particular facet is likely to be an expensive
> query. Here are the top facet counts for a search at Amazon for "moon":
>
> * Books (343,415)
> * MP3 Downloads (35,160)
> * Home & Garden (22,234)
> * Music (7,086)
> ...
>
> We don't want to execute 'category:books', 'category:mp3_downloads', etc.
> with each search query. Therefore, we have to cache the result sets for
> those facets.

You are right that we don't want to go through the inverted index
for each of the possible facets, but this is why I was suggesting
that we could step through the forward-indexed (doc->termlist)
DocVectors to count facets. While this could be considered expensive,
I think it would be on a par with scoring a relatively common search
term in the inverted index, say, the search for 'moon' itself. I also
think it would be more efficient than merging 100 BitVectors,
although less efficient than merging 10.

> If I understand correctly, we'd get an even bigger benefit from index-time
> preparation of facet result sets when there are several search processes,
> since the data for the facet result sets would be cached in shared system
> memory pages, instead of in per-process BitVector objects.

Yes, a fine plan. There is definitely a benefit to having these
cached in files rather than in per-process memory. Accessing these
files via mmap gives some extra benefit, but since the file system
effectively uses mmap internally, you'll get most of this
cross-process benefit using standard IO as well.

> I think there's some confusion as to the role of DocVector. Each DocVector
> is basically a self-contained single-document inverted index, and retrieving
> a DocVector is approximately as expensive as retrieving a Doc. Using
> DocVector while iterating through posting lists isn't feasible; retrieving
> one DocVector for each of the top hits after you've finished searching is.

Could it be made to be feasible? While I understand that random
access to the DocVectors will be expensive, I was hoping that
sequentially stepping through the TermVectors for a field could be
done very efficiently, especially for unanalyzed fields with a limited
vocabulary (ie, for a field used to store facets). Conceptually,
stepping over a doc->termList index doesn't strike me as inherently
any worse than stepping through the term->docList inverted index.

>> Caching, if it is to occur, happens in a
>> centralized way (memcached) at the top level for the entire query,
>> rather than component by component within the cluster.
>
> I can't quite wrap my head around that. Can you please elaborate?

I'm offering a philosophical conviction that it's best to do the
caching at the highest level possible rather than complicating the
lower levels. Rather than farming the request out to the cluster and
having each member check a local cache for facet values, I'm
suggesting that it would be better for the central dispatch to cache
the finished client response. I think this is easier to scale, makes
better use of system resources, and yields simpler and easier to
debug code.
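
For example, at the dispatch layer - Cache::Memcached details from
memory, and run_distributed_search() is a hypothetical stand-in for
whatever farms the query out and merges the results:

use Cache::Memcached;
use Digest::MD5 qw( md5_hex );

my $cache = Cache::Memcached->new( { servers => ['127.0.0.1:11211'] } );

sub handle_query {
    my ($query_string) = @_;
    my $key = 'response:' . md5_hex($query_string);

    # Serve the finished response (top hits + facet counts) from cache.
    my $response = $cache->get($key);
    return $response if defined $response;

    # Otherwise run the distributed search and cache the whole thing.
    $response = run_distributed_search($query_string);
    $cache->set( $key, $response, 300 );    # five minute TTL
    return $response;
}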

>> ps. I can't tell at a glance how KinoSearch currently treats fields
>> in its indexes. For example, does each field get its own lexicon?
>
> Yes, right now that is the case. [...]
> Per-field lexicon files are not perfectly satisfactory, but explaining all
> the reasoning would require an extended digression.

Great, that's what I was hoping. I'm not sure what the downsides
are, but having per-field lexicons means that limited-vocabulary
fields can be stored much more efficiently than if they shared a
lexicon with everyone else.

Nathan Kurz
nate@verse.com