Mailing List Archive

API request for KS::InvIndexer -- field_value as arrayref
Hi folks.

This is a request for an API change to KS::InvIndexer. I'd like to be able to do
this:

while ( my ( $title, $content ) = each %source_docs ) {
    $invindexer->add_doc({
        title   => $title,
        content => $content, # could be arrayref or scalar string
    });
}

where the field value of each hashref key/value pair could be a scalar string
(as it is now) or an arrayref of scalar strings.

If it were an arrayref, then the pos_inc would bump by +1 for every item in the
array.

Example:

my $content = ['eats shoots and leaves', 'by the morning train'];
$invindexer->add_doc({ content => $content });

and a phrase search for "leaves by the morning train" would fail because the
pos_inc for 'by the morning train' would bump the position for 'by' out of reach
of 'leaves'.

eats  shoots  and  leaves  by  the  morning  train
0     1       2    3       5   6    7        8
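
To make the bookkeeping concrete, here's a throwaway illustration in plain
Perl (no KS calls, just the position arithmetic):

# illustration only: position assignment under the proposed arrayref rule
my @chunks = ( 'eats shoots and leaves', 'by the morning train' );
my ( $pos, @tokens ) = (0);
for my $i ( 0 .. $#chunks ) {
    $pos++ if $i > 0;    # the extra pos_inc bump between array items
    push @tokens, map { "$_=" . $pos++ } split ' ', $chunks[$i];
}
print "@tokens\n";  # eats=0 shoots=1 and=2 leaves=3 by=5 the=6 morning=7 train=8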

This API change would address the need raised in this thread:

http://www.rectangular.com/pipermail/kinosearch/2006-November/000529.html

by allowing the pos_inc to be incremented automatically in the TokenBatch.

It seems[1] to me that most of this magic would be implemented in the SegWriter
class in Perl space, by creating a new TokenBatch for each string in the array
and calling add_batch() multiple times for each $field_name. However, it
also looks like the XS for TokenBatch->new would need to be modified to accept a
pos_inc instead of using the hardcoded '1' value.
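
In rough pseudo-Perl, the SegWriter-side loop I'm imagining looks something
like this. The PostingWriter wiring is a guess on my part; only
TokenBatch->new, analyze(), and add_batch() come from the discussion above:

# sketch only -- the surrounding plumbing is guessed
my $value   = $doc->{$field_name};
my @strings = ref $value eq 'ARRAY' ? @$value : ($value);
for my $string (@strings) {
    my $batch = KinoSearch::Analysis::TokenBatch->new( text => $string );
    # (plus some way to seed a pos_inc of 2 for batches after the first)
    $batch = $analyzer->analyze($batch);
    $posting_writer->add_batch( $batch, $field_name );  # hypothetical call site
}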

I'd be happy to hack up a patch if the API is suitable.

Thoughts?

[1] "seems" because I'm having a hard time wrapping my head around some of the
magic in the interaction between TokenBatch and the Analyzer.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
API request for KS::InvIndexer -- field_value as arrayref [ In reply to ]
Peter,

What's the ultimate goal here? Is it that you want to supply pre-parsed
fields? I've been thinking about that a bit myself, because
for HTML parsing with per-position boosts, I want to store a version
with tags stripped, but the tags have to be there at parse-time to
determine boost for each token (bigger, heavier text = bigger boost).

Another possibility would be to allow TokenBatch objects as field
values rather than arrayrefs. But in either case we have the problem
of how to join them together to form the string to be stored.

> while ( my ( $title, $content ) = each %source_docs ) {
>     $invindexer->add_doc({
>         title   => $title,
>         content => $content, # could be arrayref or scalar string
>     });
> }
>
> where the field value of each hashref key/value pair could be a
> scalar string (as it is now) or an arrayref of scalar strings.
>
> If it were an arrayref, then the pos_inc would bump by +1 for every
> item in the array.

What I would really like to see here is for this to be implemented as
an Analyzer subclass, possibly published on CPAN as a plugin
within a "KinoSearchX" namespace. I want to accommodate this in such
a way that it is convenient and fast.

I am reluctant to complicate the API for InvIndexer->add_doc, though,
because it's a bottleneck that many different problems must pass
through -- like Searcher->search. It would be better design to
divide and conquer this problem and implement a solution within a
purpose-built class. Then we can work on it in isolation, or even
replace it with a second version if a better algo occurs to us --
without disrupting other KS users or cluttering the API for an
essential method.

If we need to modify some low-level aspect of KS to support such a
class, that's cool. Especially if the low-level mod can be put into
service supporting other higher-level needs.

Hmm. This gives me an idea about how to simplify add_doc. If we
resurrect KinoSearch::Document::Doc, implemented as a blessed hash
with boost stored as an inside-out member, the Doc object can carry
the boost information -- and we can eliminate the extra args to
InvIndexer->add_doc.
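
A bare-bones sketch of that shape (the constructor signature is invented
for illustration):

package KinoSearch::Document::Doc;    # sketch, not the real class
use strict;
use warnings;
use Scalar::Util qw( refaddr );

my %boost;    # inside-out member: keyed on refaddr, invisible to the hash

sub new {
    my ( $class, $fields, $doc_boost ) = @_;
    my $self = bless {%$fields}, $class;
    $boost{ refaddr $self } = defined $doc_boost ? $doc_boost : 1.0;
    return $self;
}

sub get_boost { $boost{ refaddr $_[0] } }

sub DESTROY { delete $boost{ refaddr $_[0] } }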

See where I'm going with this?

> Example:
>
> my $content = ['eats shoots and leaves', 'by the morning train'];

Where are these texts coming from? If you join them with
"A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT" you could hack up a
custom Tokenizer which recognizes that string and bumps the position
increment rather than adding a Token.
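
Sketched loosely, that might come out like the following -- with the caveat
that the TokenBatch accessors here (get_text, append) are stand-ins, not
necessarily the real API:

package SeparatorTokenizer;    # sketch only
use base qw( KinoSearch::Analysis::Tokenizer );
use KinoSearch::Analysis::TokenBatch;

sub analyze {
    my ( $self, $batch ) = @_;
    my $new_batch = KinoSearch::Analysis::TokenBatch->new;
    my $pos_inc   = 1;
    for my $word ( split /\s+/, $batch->get_text ) {   # accessor name guessed
        if ( $word eq 'A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT' ) {
            $pos_inc++;    # widen the gap instead of emitting a Token
            next;
        }
        $new_batch->append( $word, $pos_inc );         # signature guessed
        $pos_inc = 1;
    }
    return $new_batch;
}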

Then you have the same problem as me with the HTML tags, though,
because you don't want metadata like that separator polluting the
stored version. Hmm.

Are there other reasons that solution wouldn't work for you?

> [1] "seems" because I'm having a hard time wrapping my head around
> some of the magic in the interaction between TokenBatch and the
> Analyzer.

Thanks for that bit of feedback. If we can improve the
architecture/documentation of those two so that the API is easier to
grok, great.
Power is more important than ease of use, though, since relatively
few users will need to write custom Analyzer subclasses.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
API request for KS::InvIndexer -- field_value as arrayref [ In reply to ]
Marvin Humphrey scribbled on 5/2/07 9:36 PM:
>
> Peter,
>
> What's the ultimate goal here? Is it that you want to supply pre-parsed
> fields? I've been thinking about that a bit myself, because for HTML
> parsing with per-position boosts, I want to store a version with tags
> stripped, but the tags have to be there at parse-time to determine boost
> for each token (bigger, heavier text = bigger boost).
>

yes, if by 'pre-parsed fields' you mean that the content of a document has been
divvied up into distinct categories (fields, or in Swish terms, MetaNames). In
my case, it's not for boosts but for preventing false-positive phrase matches.

The ultimate goal in my suggestion was to convey basic structural positional
information about a doc's contents via the contents' data structure, rather than
explicitly setting pos_inc for each relevant token, which would require more
expert-level coding of an Analyzer subclass, or hacking around it with the
NO_SUCH_WORD_HERE approach. It just seemed like a common enough case that KS
could handle it natively. At least, I know I wasn't the first person to raise
the issue, given the original email thread from November.

So instead of this:

'eats shoots and leaves NO_SUCH_WORD_HERE by the morning train'

this:

['eats shoots and leaves', 'by the morning train']


Here's another example of what I'm talking about.

$ cat foo.html
<html>
<body>
<div>eats shoots and leaves</div>
<div>by the morning train</div>
</body>
</html>


# the long way
my @divs;
foreach my $div ($parser->parse_html('foo.html'))
{
    push(@divs, $div);
}
$invindexer->add_doc({ content => \@divs });

# the short way
$invindexer->add_doc({ content => $parser->parse_html('foo.html') });



Right now I would have to either write my own Analyzer (as you suggest below),
or do the hack I suggested in that email thread above:

$invindexer->add_doc({ content => join(' NO_SUCH_WORD_HERE ', @divs) });

That hack works, but feels a little inelegant somehow, because there's always
the risk (for example) that I'm indexing this particular mailing list archive
and so my special word appears in a legitimate context. ;)


> Another possibility would be to allow TokenBatch objects as field values
> rather than arrayrefs. But in either case we have the problem of how to
> join them together to form the string to be stored.
>

"them" == multiple TokenBatch objects for the same field? Yes, I wondered about
that. Would need something like a $tb1->add_token_batch($tb2) method.

My assumption in the original post was that calling add_batch() multiple times
on a single field name would be handled ok by the PostingWriter. But maybe not?


>> while ( my ( $title, $content ) = each %source_docs ) {
>>     $invindexer->add_doc({
>>         title   => $title,
>>         content => $content, # could be arrayref or scalar string
>>     });
>> }
>>
>> where the field value of each hashref key/value pair could be a scalar
>> string (as it is now) or an arrayref of scalar strings.
>>
>> If it were an arrayref, then the pos_inc would bump by +1 for every
>> item in the array.
>
> What I would really like to see here is for this to be implemented as an
> Analyzer subclass. Possibly to be published on CPAN as a plugin within
> a "KinoSearchX" namespace. I want to accommodate this in such a way as
> it is convenient and fast.

Sure. I actually considered that at first. I also thought of constructing a
savvy regex that would achieve the same ends using the existing Tokenizer.

But then it occurred to me that the issue (preventing false-positive phrase
matches) was probably a pretty common one for anyone indexing marked-up content,
and that perhaps KS could handle it in a Perlish way by just doing a ref() check
on the field value and then DWIMming vis-a-vis the pos_inc bump.


>
> I am reluctant to complicate the API for InvIndexer->add_doc, though,
> because it's a bottleneck that many different problems must pass through
> -- like Searcher->search. It would be better design to divide and
> conquer this problem and implement a solution within a purpose-built
> class. Then we can work on it in isolation, or even replace it with a
> second version if a better algo occurs to us -- without disrupting other
> KS users or cluttering the API for an essential method.
>

fair enough.

[snip]

> Hmm. This gives me an idea about how to simplify add_doc. If we
> resurrect KinoSearch::Document::Doc, implemented as a blessed hash with
> boost stored as an inside-out member, the Doc object can carry the boost
> information -- and we can eliminate the extra args to InvIndexer->add_doc.
>
> See where I'm going with this?
>

somewhere tidy? :) yes, I think so.


>> Example:
>>
>> my $content = ['eats shoots and leaves', 'by the morning train'];
>
> Where are these texts coming from? If you join them with
> "A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT" you could hack up a custom
> Tokenizer which recognizes that string and bumps the position increment
> rather than adding a Token.

yes. but see my points above. Hacking up a custom Tokenizer for what I'm
guessing is a common case for marked-up docs seems prohibitive for the casual
user.


>
> Then you have the same problem as me with the HTML tags, though, because
> you don't want metadata like that separator polluting the stored
> version. Hmm.

exactly.

This is why Swish stores de-tagged text (PropertyNames) and token contextual
information (MetaNames) separately, so that you can return unblemished text
chunks in results but get granular in setting boosts, position, etc.

With KS's concept of a 'field' you'd almost have to pass the text in twice,
using a namespace convention of some kind:

$invindexer->add_doc({
    token_content => $parsed_for_positions,
    store_content => $parsed_for_display,
});

and set up your Schema accordingly.
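
e.g. something like this, where the field-spec labels are stand-ins for
whatever the real Schema syntax turns out to be:

package MySchema;    # sketch; spec syntax approximated
use base qw( KinoSearch::Schema );

our %fields = (
    token_content => 'analyzed, not stored',    # positions only
    store_content => 'stored, not analyzed',    # pristine text for display
);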

In Swish you can also alias your MetaNames and PropertyNames, so you can
retrieve 'content' in search results for highlighting, but 'content' is an alias
to 'store_content' (using the example above).


>
> Are there other reasons that solution wouldn't work for you?
>

I'd certainly be content to write my own Analyzer, put it on CPAN, and use it
for Swish3 with KS. Something like KSx::Analysis::PhraseTokenizer. (Guess if I
wanted to support arrayrefs I'd have to get specific about where in the
PolyAnalyzer queue it appeared.) I was suggesting the core API change because it
seemed like a common enough case that the stock Analyzers wouldn't even need to
know about it.

(But 'common' is one of those politically charged words, as in "it's just plain
common sense to do such and such", when what we really mean is, "I want to do
such and such." :) )

Am I speaking in a void? Anyone else have an opinion on this score? (a-hem. is
this thing on?)


>> [1] "seems" because I'm having a hard time wrapping my head around
>> some of the magic in the interaction between TokenBatch and the Analyzer.
>
> Thanks for that bit of feedback. If we can improve the
> architecture/documentation of those two so that the API is easier to
> grok, great. Power is more important than ease of use, though, since
> relatively few users will need to write custom Analyzer subclasses.
>

I think the docs for each class are adequate; it's the relationship between them
that is a little murky to me. Once I got into the code and lurked about a
little, I could see this happening quite a bit:

$token_batch = $analyzer->analyze( $token_batch );

which just seemed like doublespeak till I looked at the C and XS, and then saw
how Tokens were being created, etc.

Perhaps an example in the Tutorial, or an Advanced Tutorial, showing how/why
someone would want to create their own Analyzer?

--
Peter Karman . http://peknet.com/ . peter@peknet.com
API request for KS::InvIndexer -- field_value as arrayref [ In reply to ]
Peter,

I think the general solution will be to make it possible for
Analyzers to affect the stored text. To do this, we'll have to give
them access to the document itself, and the field name. Here's one
possibility, which just extends the existing Analyzer->analyze method
by adding more args:

sub analyze {
    my ( $self, $token_batch, $doc, $field_name ) = @_;
    ...
    $doc->{$field_name} = $new_text;
    return $new_token_batch;
}

I think we can do better than that, though, as you'll see below...

On May 3, 2007, at 7:10 AM, Peter Karman wrote:

> The ultimate goal in my suggestion was to convey basic structural
> positional information about a doc's contents via the contents'
> data structure,

---->8 SNIP 8<----

> $ cat foo.html
> <html>
> <body>
> <div>eats shoots and leaves</div>
> <div>by the morning train</div>
> </body>
> </html>
>
>
> # the long way
> my @divs;
> foreach my $div ($parser->parse_html('foo.html'))
> {
>     push(@divs, $div);
> }
> $invindexer->add_doc({ content => \@divs });
>
> # the short way
> $invindexer->add_doc({ content => $parser->parse_html('foo.html') });

That would solve your specific problem of wanting to forbid phrase
matching in certain cases. However, it encodes one particular kind
of metadata using one particular convention. Forbidding phrase
matches across structural divisions is a worthy idea (: and I intend
to steal it for KinoSearch::Simple->parse_html :) but I don't think
the proposed implementation is general enough. There's no way we can
anticipate all the different kinds of metadata people might want to
pass through InvIndexer->add_doc. You haven't solved my problem of
how to pass visual text weight metadata, for example.

My inclination is to allow documents to use whatever-the-hell-they-want
as field values. Filehandles. Arrayrefs. Arbitrary objects.
Undefs. The only things really standing in the way of this now are:

* The utf8::upgrade calls performed by InvIndexer, which
  can probably be moved to individual analyzers.
* The field name verification regime, intended to thwart
  misspelled field names, which I'm not sure what to do
  about but would like to keep if possible.
* The current behavior of DocWriter/DocReader.

To facilitate this, we can add a public, overrideable method:
Analyzer->analyze_field. (Also, Analyzer->analyze should probably be
renamed to process_batch or something like that.) Here's how
LCNormalizer->analyze_field would look:

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    utf8::upgrade( $doc->{$field_name} );
    return KinoSearch::Analysis::TokenBatch->new(
        text => lc( $doc->{$field_name} ),
    );
}

This set-up would allow you to perform Swish analysis entirely within
an Analyzer. Or to pre-process everything and create a TokenBatch
later. We'd still need to add some methods to TokenBatch to fully
support what you want to do, but here's a rough outline of how things
could work:

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $divs        = $doc->{$field_name};
    my $token_batch = KinoSearch::Analysis::TokenBatch->new;

    for my $div (@$divs) {
        my $sub_batch = $self->{tokenizer}->analyze_text($div);
        $token_batch->eat($sub_batch);
    }

    # ugly, wouldn't really want to do this...
    $doc->{$field_name} = join( "\n", @$divs );

    return $token_batch;
}

Allowing KS documents to have arbitrary structure also moves us a few
steps towards the concept of an OO database, which I'd really dig.
It would also be great to allow integer or float fields in addition
to the string-type fields currently supported.

> Hacking up a custom Tokenizer for what I'm guessing is a common
> case for marked up docs seems prohibitive for the casual user.

Yes, and another problem is that KinoSearch's XS-based Tokenizer is
much faster than alternative pure-Perl implementations.

> Perhaps an example in the Tutorial, or an Advanced Tutorial,
> showing how/why someone would want to create their own Analyzer?

I think the place for this is the Analyzer documentation. Analyzer
exists to be subclassed. Right now the docs are sparse; they could
be much longer. Subclassing Analyzer is an "expert API" task, so
verbose docs are OK.

The other possibility is to add a tutorial under KinoSearch::Docs, or
even publish such a tutorial on a WikiToBeNamedLater, reserving
Analyzer's POD for concise API documentation. I lean towards
stuffing everything into Analyzer, though.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
API request for KS::InvIndexer -- field_value as arrayref [ In reply to ]
Marvin Humphrey scribbled on 5/3/07 1:51 PM:
> Peter,
>
> I think the general solution will be to make it possible for Analyzers
> to affect the stored text. To do this, we'll have to give them access
> to the document itself, and the field name.

Marvin,

As always, thanks for your thorough and thoughtful reply.

Yes, I see where you're headed with putting the power in the hands of the
Analyzers, and how my suggestion for the InvIndexer API is only a particular
solution to a more general issue.

Sounds like you are suggesting an API change, but to the Analyzer class instead,
giving it more power to affect individual fields. Sounds fine to me, especially
given some of the API specifics you mention below.


> You haven't solved my problem of how to pass
> visual text weight metadata, for example.
>

$curiosity->piqued. Can you offer an example?


> My inclination is to allow documents to use whatever-the-hell-they-want
> as field values. Filehandles. Arrayrefs. Arbitrary objects. Undefs.
> The only things really standing in the way of this now are:
>
> * The utf8::upgrade calls performed by InvIndexer, which
> can probably be moved to individual analyzers.

agreed. perhaps with a syntactically sweet wrapper in the base Analyzer class?
So analyzer methods that care could call:

$self->utf8ify( $field_value );
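
e.g. (operating on the caller's scalar in place, the way utf8::upgrade
itself does):

# candidate for the base Analyzer class
sub utf8ify {
    my $self = shift;
    utf8::upgrade( $_[0] );    # @_ aliases the caller's variable
}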


> * The field name verification regime, intended to thwart
> misspelled field names, which I'm not sure what to do
> about but would like to keep if possible.

This is the test in SegWriter's add_doc()?

again, perhaps moved to the base Analyzer class? So Analyzer subclasses could call:

$self->verify_field_name( $field_name );

of course, the Analyzer $self would need to hold a reference to $schema
internally (or does it already, by virtue of being called via
$schema->fetch_analyzer ?).
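
If it did hold one, the helper might look like this (the schema accessor
below is invented, just to show the shape):

use Carp;

# candidate for the base Analyzer class; assumes a $self->{schema} member
sub verify_field_name {
    my ( $self, $field_name ) = @_;
    croak("unknown field name: '$field_name'")
        unless $self->{schema}->fetch_fspec($field_name);  # accessor invented
}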


> * The current behavior of DocWriter/DocReader.
>
> To facilitate this, we can add a public, overrideable method:
> Analyzer->analyze_field. (Also, Analyzer->analyze should probably be
> renamed to process_batch or something like that.) Here's how
> LCNormalizer->analyze_field would look:
>
> sub analyze_field {
>     my ( $self, $doc, $field_name ) = @_;
>     utf8::upgrade( $doc->{$field_name} );
>     return KinoSearch::Analysis::TokenBatch->new(
>         text => lc( $doc->{$field_name} ),
>     );
> }
>

makes sense.

[ snip ]

> Allowing KS documents to have arbitrary structure also moves us a few
> steps towards the concept of an OO database, which I'd really dig. It
> would also be great to allow integer or float type fields as well in
> addition to the string-type fields currently supported.

yes. perhaps also ints labeled as epoch for holding datetimes? Swish does that.

[snip]

> The other possibility is to add a tutorial under KinoSearch::Docs, or
> even publish such a tutorial on a WikiToBeNamedLater, reserving
> Analyzer's POD for concise API documentation. I lean towards stuffing
> everything into Analyzer, though.

docs_in_analyzer++



--
Peter Karman . http://peknet.com/ . peter@peknet.com