Mailing List Archive: Field-specific terms vs. query filter?

Nov 30, 2007, 6:47 PM

Post #1 of 5 (3016 views)

Hi -

I'm working with an index that has a "category_id" field, and I need
to filter search results based on specific categories which the user
shouldn't be able to see. Right now I'm taking the user's original
query and appending a lot of booleans with field-specific terms, e.g.
a search for "foo" turns into something like "(foo) AND NOT
(category_id:1) AND NOT (category_id:7) AND NOT ...".
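
For illustration only, the string-building step described above might be sketched like this. It is a toy Python version, not KinoSearch code; the helper name `exclude_categories` is made up, and the resulting string would still have to be handed to whatever query parser the application uses.

```python
def exclude_categories(user_query, hidden_ids):
    """Wrap the user's query and append one AND NOT clause per hidden category."""
    clauses = [f"({user_query})"]
    clauses += [f"NOT (category_id:{cid})" for cid in hidden_ids]
    return " AND ".join(clauses)


query = exclude_categories("foo", [1, 7])
print(query)  # (foo) AND NOT (category_id:1) AND NOT (category_id:7)
```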

Is there any advantage to building up a Boolean query for just the
category_id parts, and using that as a query filter instead? I won't
necessarily be able to cache/reuse the query filter. Is there a
better way in general to do this?

Related question: Assuming the category_id field is indexed but not
analyzed, is a field-specific term going to do an exact match? I.e.
will "category_id:1" match just "1", or will it also match "10"?

Thanks!
Larry

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Field-specific terms vs. query filter?

marvin at rectangular

Dec 1, 2007, 10:20 AM

Post #2 of 5 (2888 views)

Howdy, Larry...

On Nov 30, 2007, at 6:47 PM, Larry Leszczynski wrote:

> I'm working with an index that has a "category_id" field, and I
> need to filter search results based on specific categories which
> the user shouldn't be able to see. Right now I'm taking the user's
> original query and appending a lot of booleans with field-specific
> terms, e.g. a search for "foo" turns into something like "(foo) AND
> NOT (category_id:1) AND NOT (category_id:7) AND NOT ...".
>
> Is there any advantage to building up a Boolean query for just the
> category_id parts, and using that as a query filter instead? I
> won't necessarily be able to cache/reuse the query filter.

There shouldn't be much difference, if any, between the filtering
approach and the complex-query approach. You have to build up the
same data either way; the QueryFilter just executes the "excluded"
parts of the search first and stuffs the result into a BitVector,
then applies that to the "required" part.
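
As a rough picture of that two-step process, here is a toy Python model. It does not use KinoSearch's QueryFilter or BitVector classes; the `postings` table, the document numbers, and both function names are invented for the example, and a plain integer stands in for the bit vector.

```python
# Toy inverted index: term -> set of matching document numbers (made-up data).
postings = {
    "foo": {0, 1, 2, 4},
    "category_id:1": {1},
    "category_id:7": {4},
}
num_docs = 5


def build_exclusion_bits(excluded_terms):
    """Run the excluded terms once and record the still-allowed docs as a bit mask."""
    bits = (1 << num_docs) - 1           # start with every doc allowed
    for term in excluded_terms:
        for doc in postings.get(term, ()):
            bits &= ~(1 << doc)          # clear the bit for each excluded doc
    return bits


def search(required_term, allowed_bits):
    """Return docs matching the required term whose bit is still set."""
    matches = postings.get(required_term, ())
    return sorted(d for d in matches if (allowed_bits >> d) & 1)


allowed = build_exclusion_bits(["category_id:1", "category_id:7"])
hits = search("foo", allowed)
print(hits)  # [0, 2]
```

Either way the postings for the excluded terms have to be read; the filter only changes when that work happens.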

If you were able to keep the QueryFilter objects around, then on
subsequent searches, you wouldn't have to re-run the excluded parts
of the search, but oh well. If the box you're running this on has
sufficient memory, the kernel's file cache will probably help you out
behind the scenes by keeping the index files in RAM.
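
To show what keeping the filters around would buy, here is a small Python sketch of a filter cache keyed by the set of hidden categories. Everything in it is hypothetical (the `doc_categories` table, the function names, the `stats` counter), and it only helps in a long-lived process where the cache survives between searches.

```python
# Hypothetical data: doc number -> category_id.
doc_categories = {0: 2, 1: 1, 2: 3, 3: 7}

filter_cache = {}
stats = {"builds": 0}


def compute_allowed_docs(hidden_ids):
    """Stand-in for the expensive step: running the excluded part of the search."""
    stats["builds"] += 1
    return frozenset(d for d, c in doc_categories.items() if c not in hidden_ids)


def get_filter(hidden_ids):
    """Reuse a previously built filter when the same category set comes up again."""
    key = frozenset(hidden_ids)
    if key not in filter_cache:
        filter_cache[key] = compute_allowed_docs(key)
    return filter_cache[key]


first = get_filter([1, 7])
second = get_filter([7, 1])   # same set in a different order: served from the cache
print(sorted(first), stats["builds"])  # [0, 2] 1
```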

> Is there a better way in general to do this?

Though there are some approaches that we could take for better
supporting this kind of search, they all involve caching something.
For large datasets, KS is really designed to be used with a
persistent search object. For small datasets, it doesn't matter.
Where the threshold between "large" and "small" lies depends on a lot
of variables.

> Related question: Assuming the category_id field is indexed but
> not analyzed, is a field-specific term going to do an exact match?
> I.e. will "category_id:1" match just "1", or will it also match "10"?

It will do an exact match.
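
A toy way to see why, in Python rather than KinoSearch: with no analysis, each whole field value is stored as a single term, and a term query is a lookup by equality. The `category_terms` table and the function name below are invented for the example.

```python
# Toy term dictionary for an unanalyzed field: term -> matching doc numbers.
category_terms = {
    "1": [0, 3],
    "10": [1],
    "7": [2],
}


def docs_for_term(term):
    """Look the term up by equality; no prefix or substring matching is involved."""
    return category_terms.get(term, [])


print(docs_for_term("1"))   # [0, 3]
print(docs_for_term("10"))  # [1]
```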

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Field-specific terms vs. query filter?

larryl at emailplus

Dec 1, 2007, 12:01 PM

Post #3 of 5 (2885 views)

Hey Marvin -

>> Is there any advantage to building up a Boolean query for just the
>> category_id parts, and using that as a query filter instead? I won't
>> necessarily be able to cache/reuse the query filter.
>
> There shouldn't be much difference if any between the filtering approach and
> the complex-query approach. You have to build up the same data either way;
> the QueryFilter just executes the "excluded" parts of the search first and
> stuffs the result into a BitVector, then applies that to the "required" part.
>
> If you were able to keep the QueryFilter objects around, then on subsequent
> searches, you wouldn't have to re-run the excluded parts of the search, but
> oh well. If the box you're running this on has sufficient memory, the kernel
> cache will probably help you out behind the scenes.

I figured that might be the case, thanks much for the info.

>> Related question: Assuming the category_id field is indexed but not
>> analyzed, is a field-specific term going to do an exact match? I.e. will
>> "category_id:1" match just "1", or will it also match "10"?
>
> It will do an exact match.

Cool. Is there any syntax to query an analyzed field for an exact
match? E.g. if I want to match "cream" but not "creamy"?

Larry

Re: Field-specific terms vs. query filter?

marvin at rectangular

Dec 1, 2007, 1:37 PM

Post #4 of 5 (2875 views)

On Dec 1, 2007, at 12:01 PM, Larry Leszczynski wrote:

> Is there any syntax to query an analyzed field for an exact match?
> E.g. if I want to match "cream" but not "creamy"?

What ends up in the index depends on the Analyzer assigned to that
field. The choice the KS docs shunt you towards as the easiest is a
PolyAnalyzer that's actually a series of three other analyzers.
Here's the code from PolyAnalyzer's constructor:

    # create a default set of analyzers if language was specified
    if ( !defined $args->{analyzers} ) {
        confess("Must specify either 'language' or 'analyzers'")
            unless $language;
        $args->{analyzers} = [
            KinoSearch::Analysis::LCNormalizer->new,
            KinoSearch::Analysis::Tokenizer->new,
            KinoSearch::Analysis::Stemmer->new( language => $language ),
        ];
    }

The element that reduces "creamy" to "cream" is the Stemmer. If you
take it out of the loop, then searches for "creamy" will no longer
match docs containing "cream" and vice versa. Whether that's
desirable depends on your application. It's perfectly reasonable to
index the same content twice, in one stemmed field and one unstemmed
field, so long as you have the resources to spare (you'd probably
want only one of the fields to be "stored").
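
A toy Python sketch of the index-twice idea, under loud assumptions: the field names `content` and `content_exact` are hypothetical, the index is a plain dict, and `toy_stem` is a crude stand-in that only strips a trailing "y", not the real Stemmer.

```python
def toy_stem(token):
    # Crude stand-in for a real stemmer: strip a trailing "y" from longer words.
    return token[:-1] if len(token) > 4 and token.endswith("y") else token


docs = ["creamy soup", "cream cheese"]

# Two fields over the same text: one stemmed, one left as-is.
index = {"content": {}, "content_exact": {}}
for doc_num, text in enumerate(docs):
    for token in text.lower().split():
        index["content"].setdefault(toy_stem(token), set()).add(doc_num)
        index["content_exact"].setdefault(token, set()).add(doc_num)


def lookup(field, word, stem=False):
    """Query one field, stemming the query word only for the stemmed field."""
    term = toy_stem(word) if stem else word
    return sorted(index[field].get(term, set()))


print(lookup("content", "creamy", stem=True))  # [0, 1]
print(lookup("content_exact", "cream"))        # [1]
```

The loose field serves ordinary searches, and the exact field answers the "cream but not creamy" case.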

Take the Tokenizer out of the loop too, and then searches for
"creamy" will no longer match a field whose value is "Creamy
Goodness". Take the LCNormalizer out of the loop too, and then
there's no more analysis being performed -- so only exact matches
will succeed.
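
The effect of peeling off each stage can be sketched with a toy chain in Python. This is not the KinoSearch analyzer API: the three functions are simplified stand-ins for LCNormalizer, Tokenizer, and Stemmer, and `toy_stem` in particular only strips a trailing "y".

```python
import re


def lc_normalize(texts):
    """Lowercase every string (stand-in for LCNormalizer)."""
    return [t.lower() for t in texts]


def tokenize(texts):
    """Split each string into word tokens (stand-in for Tokenizer)."""
    return [tok for t in texts for tok in re.findall(r"\w+", t)]


def toy_stem(tokens):
    """Crude stand-in for Stemmer: strip a trailing "y" from longer words."""
    return [t[:-1] if len(t) > 4 and t.endswith("y") else t for t in tokens]


def analyze(value, stages):
    """Feed the field value through each stage in order; return the indexed terms."""
    terms = [value]
    for stage in stages:
        terms = stage(terms)
    return terms


full = analyze("Creamy Goodness", [lc_normalize, tokenize, toy_stem])
no_stem = analyze("Creamy Goodness", [lc_normalize, tokenize])
no_tokens = analyze("Creamy Goodness", [lc_normalize])
raw = analyze("Creamy Goodness", [])

print(full)       # ['cream', 'goodness']
print(no_stem)    # ['creamy', 'goodness']
print(no_tokens)  # ['creamy goodness']
print(raw)        # ['Creamy Goodness']
```

A query term only matches if it equals one of the terms in the list, which is why the last case leaves nothing but exact matches on the whole value.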

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Field-specific terms vs. query filter?

larryl at emailplus

Dec 2, 2007, 10:03 AM

Post #5 of 5 (2877 views)

On Sat, 1 Dec 2007, Marvin Humphrey wrote:

>> Is there any syntax to query an analyzed field for an exact match?
>> E.g. if I want to match "cream" but not "creamy"?
>
> What ends up in the index depends on the Analyzer assigned to that
> field. The choice the KS docs shunt you towards as the easiest is a
> PolyAnalyzer that's actually a series of three other analyzers.
> Here's the code from PolyAnalyzer's constructor:
>
>     # create a default set of analyzers if language was specified
>     if ( !defined $args->{analyzers} ) {
>         confess("Must specify either 'language' or 'analyzers'")
>             unless $language;
>         $args->{analyzers} = [
>             KinoSearch::Analysis::LCNormalizer->new,
>             KinoSearch::Analysis::Tokenizer->new,
>             KinoSearch::Analysis::Stemmer->new( language => $language ),
>         ];
>     }
>
> The element that reduces "creamy" to "cream" is the Stemmer. If you
> take it out of the loop, then searches for "creamy" will no longer
> match docs containing "cream" and vice versa. Whether that's
> desirable depends on your application. It's perfectly reasonable to
> index the same content twice, in one stemmed field and one unstemmed
> field, so long as you have the resources to spare (you'd probably
> want only one of the fields to be "stored").
>
> Take the Tokenizer out of the loop too, and then searches for
> "creamy" will no longer match a field whose value is "Creamy
> Goodness". Take the LCNormalizer out of the loop too, and then
> there's no more analysis being performed -- so only exact matches
> will succeed.

Coolness, info much appreciated as usual!
