Mailing List Archive

Tokens with spaces inside
Hi all, new to the list.

I have just subscribed after searching the archives for something on
this, but could not find much. I would appreciate any pointers to relevant
threads or documentation, or failing that, any ideas that could help.
Sorry if I'm asking too-obvious questions.

My problem is this: I want my tokens to be able to contain spaces.
I'm building an index specifically for tags that can be made up
of several words, and want searches to match exactly those words. Thus,
I have built my PolyAnalyzer this way:

my $normalizer = KinoSearch::Analysis::LCNormalizer->new;
my $token_re = qr/
    \w[\w ]*   # our tokens can have word characters AND spaces.
/mxs;

my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
    token_re => $token_re,
);

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $normalizer, $tokenizer ],
);

I have built the analyzer without a stemmer as I want to search exactly
for the words that compose the terms.

Now comes the first question: Is there any way to examine what terms are
being indexed? That would help me confirm the tokenizer regex is correct
(I have tested it outside and it seems correct to me, but would like to
know how Kino uses it).
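
A standalone check of the kind mentioned above might look like the
following minimal sketch. It uses only core Perl, not KinoSearch. The
helper name tokens_for and the sample string are made up for
illustration, and it assumes the pattern is applied repeatedly over
lowercased text, which may not be what Kino does internally.

```perl
use strict;
use warnings;

# Same pattern as in the analyzer above. Under /x, whitespace inside a
# bracketed character class is still literal, so [\w ] does match a space.
my $token_re = qr/
    \w[\w ]*   # a word character, then any run of word characters and spaces
/mxs;

# Hypothetical helper: lowercase the text, then collect every match.
# This only imitates LCNormalizer followed by Tokenizer; it is not Kino code.
sub tokens_for {
    my ($text) = @_;
    my @tokens = lc($text) =~ /$token_re/g;
    return @tokens;
}

print join( '|', tokens_for('Foo Bar,baz') ), "\n";
```

Note that a token keeps any trailing spaces it picks up before a
separator, so 'foo bar , baz' would yield 'foo bar ' rather than
'foo bar'.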

Then with this analyzer I have built an index, but my searches return
nothing. To build the queries I use a QueryParser that uses the analyzer
above.

Using Data::Dumper I can examine the Query objects produced by the
parser, and it appears the parser has split the query string on the
spaces, disregarding what I (guess) should be the correct behavior,
which would be to divide the query string into what my regex says are
tokens (a word character followed by any number of word characters and
spaces).

I have previously tried a similar regex, /\w[^,]*/msx, because my
tokens are comma-separated, but had no luck either.
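
To make the difference between the two patterns concrete, here is a
small sketch in plain Perl (no KinoSearch involved). The names
matches_for and %patterns and the sample string are hypothetical and
used only for illustration.

```perl
use strict;
use warnings;

# The two candidate token patterns from this message.
my %patterns = (
    word_and_space => qr/\w[\w ]*/msx,   # word characters and spaces only
    up_to_comma    => qr/\w[^,]*/msx,    # everything up to the next comma
);

# Hypothetical helper: return every match of $re in $text.
sub matches_for {
    my ( $re, $text ) = @_;
    my @found = $text =~ /$re/g;
    return @found;
}

for my $name ( sort keys %patterns ) {
    my @found = matches_for( $patterns{$name}, 'c++ tips, perl' );
    print "$name: ", join( ' | ', map {"[$_]"} @found ), "\n";
}
```

On this sample the comma-based pattern keeps 'c++ tips' together as one
match, while the word-and-space pattern stops at the punctuation and
yields 'c', 'tips' and 'perl'.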

So, my questions: am I mistaken in expecting the query terms to
contain spaces, just as I want my tokens to? Any other suggestions on how
to solve the "terms that contain spaces" problem?


I'm pasting below a sample dump of a Query object made by a QueryParser
that uses the analyzer above.

Thanks in advance!

--R

The query is "foo bar,baz" (without the double quotes). I want the
search to be made up of "foo bar" and "baz" as terms, but the parsed
query looks different:

Query: $VAR1 = bless( {
    'clauses' => [
        bless( {
            'query' => bless( {
                'boost' => 1,
                'term'  => bless( {
                    'text'  => 'foo',
                    'field' => 'tags'
                }, 'KinoSearch::Index::Term' )
            }, 'KinoSearch::Search::TermQuery' ),
            'occur' => 'SHOULD'
        }, 'KinoSearch::Search::BooleanClause' ),
        bless( {
            'query' => bless( {
                'boost'     => 1,
                'positions' => [ 0, 1 ],
                'terms'     => [
                    bless( {
                        'text'  => 'bar',
                        'field' => 'tags'
                    }, 'KinoSearch::Index::Term' ),
                    bless( {
                        'text'  => 'baz',
                        'field' => 'tags'
                    }, 'KinoSearch::Index::Term' )
                ],
                'slop'  => 0,
                'field' => 'tags'
            }, 'KinoSearch::Search::PhraseQuery' ),
            'occur' => 'SHOULD'
        }, 'KinoSearch::Search::BooleanClause' )
    ],
    'disable_coord'    => 0,
    'boost'            => 1,
    'max_clause_count' => 1024
}, 'KinoSearch::Search::BooleanQuery' );