Mailing List Archive

Coding and schema strategies for partial-match search?
The document collection to which I intend to apply KinoSearch is set up
as a file hierarchy. As it might be:

/foo/bar/1.html
/foo/bar/2.html
/foo/baz/1.html
/qux/1.html

I need to be able to restrict searches by file path, automatically
including all subdirectories of the specified directory: for example,
someone might want to search "only /foo/bar", or "only /foo" (which
would include /foo/bar and /foo/baz) - or in more complex cases "both
/foo/baz and /qux". From the documentation I can't see a way of doing
this; is there something obvious I'm missing?

The only approach that comes to mind so far is to store a series of path
components in a custom field - so /foo/bar/1.html would have
"/foo /foo/bar". But this seems ugly.
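
For concreteness, the path-components idea might look roughly like the
sketch below. It is plain Python rather than anything KinoSearch-specific,
and the function name ancestor_dirs is made up for illustration.

```python
def ancestor_dirs(path):
    """Return every ancestor directory of a file path, shortest first."""
    parts = [p for p in path.split("/") if p]
    dirs = parts[:-1]  # drop the trailing file name
    return ["/" + "/".join(dirs[:i]) for i in range(1, len(dirs) + 1)]


if __name__ == "__main__":
    # The value that would go into the custom field for one document.
    print(" ".join(ancestor_dirs("/foo/bar/1.html")))
```

A search restricted to /foo would then be an exact-term match on "/foo"
in that field.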

Roger
Coding and schema strategies for partial-match search?
On May 9, 2007, at 1:49 AM, Roger Burton West wrote:

> The document collection to which I intend to apply KinoSearch is set up
> as a file hierarchy. As it might be:
>
> /foo/bar/1.html
> /foo/bar/2.html
> /foo/baz/1.html
> /qux/1.html
>
> I need to be able to restrict searches by file path, automatically
> including all subdirectories of the specified directory: for example,
> someone might want to search "only /foo/bar", or "only /foo" (which
> would include /foo/bar and /foo/baz) - or in more complex cases "both
> /foo/baz and /qux".

I'd lean towards handling this with a phrase match against a filepath
field: '"foo bar"' for /foo/bar.

You'll want to assign a custom analyzer to your filepath field.
Don't use the stock PolyAnalyzer, because you don't want stemming.
I'd recommend a Tokenizer, possibly augmented with an LCNormalizer
(wrapping the two in a custom PolyAnalyzer) if you want
case-insensitive matching. The default Tokenizer will break
'/foo/bar/1.html' into four tokens: qw( foo bar 1 html ). You might
want to supply your own token_re customized for splitting filepaths.
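
To illustrate the difference a custom pattern makes, here is a rough
sketch in plain Python. It does not use KinoSearch: WORD_RE is only a
stand-in for a generic word tokenizer, SEGMENT_RE is a hypothetical
path-aware pattern, and lowercasing plays the part of the LCNormalizer
step.

```python
import re

WORD_RE = re.compile(r"\w+")       # stand-in for a generic word tokenizer
SEGMENT_RE = re.compile(r"[^/]+")  # hypothetical pattern: one token per path segment


def tokenize(path, token_re=WORD_RE, lowercase=True):
    """Split a filepath into tokens, optionally lowercasing it first."""
    text = path.lower() if lowercase else path
    return token_re.findall(text)


if __name__ == "__main__":
    print(tokenize("/foo/bar/1.html"))              # word-level tokens
    print(tokenize("/Foo/Bar/1.html", SEGMENT_RE))  # segment-level tokens
```

The segment-level pattern keeps "1.html" as a single token, which may or
may not be what you want for the file name itself.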

The only question then is how to anchor it so that you don't match
/home/foo/bar/. The easiest way to pull that off is to prepend a
symbolic "root" to each directory name, e.g. "/rootdirectory/foo/bar",
at both index-time and search-time.
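
A small sketch of why the root token anchors the match, again in plain
Python rather than KinoSearch. The helper names (path_tokens, has_phrase,
in_directory) are invented for this example, and the phrase match is a
simple contiguous-sublist check.

```python
ROOT = "rootdirectory"  # symbolic root token; assumed not to occur as a real directory name


def path_tokens(path):
    """Tokenize a path into lowercase segments, prefixed with the root token."""
    return [ROOT] + [p.lower() for p in path.split("/") if p]


def has_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def in_directory(doc_path, directory):
    """Phrase-match the anchored directory against the anchored document path."""
    return has_phrase(path_tokens(doc_path), path_tokens(directory))


if __name__ == "__main__":
    print(in_directory("/foo/bar/1.html", "/foo/bar"))
    print(in_directory("/home/foo/bar/1.html", "/foo/bar"))
```

Without the root token, the phrase "foo bar" would also be found inside
/home/foo/bar/1.html; with it, the phrase has to start at the top of the
path.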

You'd add the restriction to the search by creating a main
BooleanQuery with two required clauses: the user query, probably
generated by taking the output from a QueryParser, and a query
representing the filepath restriction, which would probably be a
BooleanQuery itself with multiple optional sub PhraseQueries.
There's an example of how to do something similar (limiting search by
the contents of a 'category' field) in the BooleanQuery docs.
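
The shape of that combined query can be sketched as follows. This is an
in-memory imitation in plain Python, not the BooleanQuery API: the user
query is modelled as a predicate on document text, and the filepath
restriction as "any of several anchored phrases matches". All names here
are made up for illustration.

```python
ROOT = "rootdirectory"  # symbolic root token, as in the anchoring trick


def path_tokens(path):
    return [ROOT] + [p.lower() for p in path.split("/") if p]


def has_phrase(tokens, phrase):
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(docs, user_match, directories):
    """Return paths where the user query matches AND at least one directory phrase matches.

    docs: dict mapping path -> text; user_match: predicate on the text.
    """
    phrases = [path_tokens(d) for d in directories]  # optional sub-clauses
    return sorted(
        path
        for path, text in docs.items()
        if user_match(text)                                          # required clause 1
        and any(has_phrase(path_tokens(path), p) for p in phrases)   # required clause 2
    )


if __name__ == "__main__":
    docs = {
        "/foo/bar/1.html": "apple pie",
        "/foo/bar/2.html": "pear",
        "/foo/baz/1.html": "apple tart",
        "/qux/1.html": "apple cider",
    }
    print(search(docs, lambda t: "apple" in t, ["/foo/baz", "/qux"]))
```

The "both /foo/baz and /qux" case from the original question maps onto
the list of optional phrases inside the second required clause.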

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/