Marvin Humphrey scribbled on 5/2/07 9:36 PM:
>
> Peter,
>
> What's the ultimate goal here? Is it that you want to supply pre-parsed
> fields? I've been thinking about that a bit myself, because for HTML
> parsing with per-position boosts, I want to store a version with tags
> stripped, but the tags have to be there at parse-time to determine boost
> for each token (bigger, heavier text = bigger boost).
>
Yes, if by 'pre-parsed fields' you mean that the content of a document has been
divvied up into distinct categories (fields, or in Swish terms, MetaNames). In
my case, it's not for boosts but for preventing false positive phrase matches.
The ultimate goal in my suggestion was to convey basic structural positional
information about a doc's contents via the contents' data structure, rather than
explicitly setting pos_inc for each relevant token, which would require more
expert-level coding of an Analyzer subclass, or hacking around it with the
NO_SUCH_WORD_HERE approach. It just seemed like a common enough case that KS
could handle it natively. At least, I know I wasn't the first person to raise
the issue, given the original email thread from Nov.
So instead of this:
'eats shoots and leaves NO_SUCH_WORD_HERE by the morning train'
this:
['eats shoots and leaves', 'by the morning train']
Here's another example of what I'm talking about.
$ cat foo.html
<html>
<body>
<div>eats shoots and leaves</div>
<div>by the morning train</div>
</body>
</html>
# the long way
my @divs;
foreach my $div ( $parser->parse_html('foo.html') ) {
    push @divs, $div;
}
$invindexer->add_doc({ content => \@divs });

# the short way
$invindexer->add_doc({ content => [ $parser->parse_html('foo.html') ] });
Right now I would have to either write my own Analyzer (as you suggest below),
or do the hack I suggested in that email thread above:
$invindexer->add_doc({ content => join(' NO_SUCH_WORD_HERE ', @divs) });
That hack works, but feels a little inelegant somehow, because there's always
the risk (for example) that I'm indexing this particular mailing list archive
and so my special word appears in a legitimate context. ;)
> Another possibility would be to allow TokenBatch objects as field values
> rather than arrayrefs. But in either case we have the problem of how to
> join them together to form the string to be stored.
>
"them" == multiple TokenBatch objects for the same field? Yes, I wondered about
that. Would need something like a $tb1->add_token_batch($tb2) method.
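Something like this, maybe -- purely hypothetical, since no such method exists,
and I'm guessing at the next()/append() calls from skimming the TokenBatch docs:
sub add_token_batch {
    my ( $self, $other ) = @_;
    # walk the second batch and graft its tokens onto the first
    while ( my $token = $other->next ) {    # assumes TokenBatch->next
        $self->append($token);              # assumes TokenBatch->append($token)
    }
    return $self;
}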
My assumption in the original post was that calling add_batch() multiple times
on a single field name would be handled ok by the PostingWriter. But maybe not?
>> while ( my ( $title, $content ) = each %source_docs ) {
>>     $invindexer->add_doc({
>>         title   => $title,
>>         content => $content,    # could be arrayref or scalar string
>>     });
>> }
>>
>> where the field value of each hashref key/value pair could be a scalar
>> string (as it is now) or an arrayref of scalar strings.
>>
>> If it were an arrayref, then the pos_inc would bump by +1 for every
>> item in the array.
>
> What I would really like to see here is for this to be implemented as an
> Analyzer subclass. Possibly to be published on CPAN as a plugin within
> a "KinoSearchX" namespace. I want to accommodate this in such a way as
> it is convenient and fast.
Sure. I actually considered that at first. I also thought of constructing a
savvy regex that would achieve the same ends using the existing Tokenizer.
But then it occurred to me that the issue (preventing false positive phrase
matches) was probably a pretty common one for anyone indexing marked-up content,
and that perhaps KS could handle it in a Perlish way by just doing a ref() check
on the field value and then DWIMming vis-a-vis the pos_inc bump.
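To make the intent concrete, here's a toy, self-contained illustration of the
position bookkeeping I have in mind -- pure Perl, no KS internals, and
positions_for() is just a name I made up for the sketch:

use strict;
use warnings;

# If the field value is an arrayref, leave a position gap between chunks
# so phrases can't match across chunk boundaries.
sub positions_for {
    my ($value) = @_;
    my @chunks = ref($value) eq 'ARRAY' ? @$value : ($value);
    my ( $pos, @terms ) = (0);
    for my $i ( 0 .. $#chunks ) {
        $pos++ if $i;    # the extra pos_inc bump between chunks
        push @terms, [ $_, $pos++ ] for split /\s+/, $chunks[$i];
    }
    return @terms;
}

printf( "%-8s %d\n", @$_ )
    for positions_for( [ 'eats shoots and leaves', 'by the morning train' ] );

# "leaves" lands at position 3 and "by" at position 5, so a phrase query
# for "leaves by" can no longer match across the chunk boundary.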
>
> I am reluctant to complicate the API for InvIndexer->add_doc, though,
> because it's a bottleneck that many different problems must pass through
> -- like Searcher->search. It would be better design to divide and
> conquer this problem and implement a solution within a purpose-built
> class. Then we can work on it in isolation, or even replace it with a
> second version if a better algo occurs to us -- without disrupting other
> KS users or cluttering the API for an essential method.
>
fair enough.
[snip]
> Hmm. This gives me an idea about how to simplify add_doc. If we
> resurrect KinoSearch::Document::Doc, implemented as a blessed hash with
> boost stored as an inside-out member, the Doc object can carry the boost
> information -- and we can eliminate the extra args to InvIndexer->add_doc.
>
> See where I'm going with this?
>
somewhere tidy? :) yes, I think so.
>> Example:
>>
>> my $content = ['eats shoots and leaves', 'by the morning train'];
>
> Where are these texts coming from? If you join them with
> "A_SEPARATOR_THAT_NEVER_APPEARS_IN_THE_TEXT" you could hack up a custom
> Tokenizer which recognizes that string and bumps the position increment
> rather than adding a Token.
Yes, but see my points above. Hacking up a custom Tokenizer for what I'm
guessing is a common case for marked-up docs seems prohibitive for the casual user.
>
> Then you have the same problem as me with the HTML tags, though, because
> you don't want metadata like that separator polluting the stored
> version. Hmm.
exactly.
This is why Swish stores de-tagged text (PropertyNames) and token contextual
information (MetaNames) separately, so that you can return unblemished text
chunks in results but get granular in setting boosts, position, etc.
With KS's concept of a 'field' you'd almost have to pass the text in twice,
using a namespace convention of some kind:
$invindexer->add_doc({
    token_content => $parsed_for_positions,
    store_content => $parsed_for_display,
});
and set up your Schema accordingly.
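Something like the following is what I have in mind, though I'm reconstructing
the FieldSpec bits from memory of the dev docs, so the class names and the
stored()/indexed() hooks are guesses that would need checking:

package MySchema::UnstoredText;
use base qw( KinoSearch::Schema::FieldSpec );
sub stored  { 0 }    # analyzed for positions, but not stored (guessed hook)

package MySchema::UnindexedText;
use base qw( KinoSearch::Schema::FieldSpec );
sub indexed { 0 }    # stored for display, but not analyzed (guessed hook)

package MySchema;
use base qw( KinoSearch::Schema );
use KinoSearch::Analysis::PolyAnalyzer;

our %fields = (
    title         => 'KinoSearch::Schema::FieldSpec',
    token_content => 'MySchema::UnstoredText',
    store_content => 'MySchema::UnindexedText',
);

sub analyzer {
    return KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
}

1;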
In Swish you can also alias your MetaNames and PropertyNames, so you can
retrieve 'content' in search results for highlighting, but 'content' is an alias
to 'store_content' (using the example above).
>
> Are there other reasons that solution wouldn't work for you?
>
I'd certainly be content to write my own Analyzer, put it on CPAN, and use it
for Swish3 with KS. Something like KSx::Analysis::PhraseTokenizer. (I guess if I
wanted to support arrayrefs I'd have to get specific about where in the
PolyAnalyzer queue it appeared.) I was suggesting the core API change because it
seemed like a common enough case that the stock Analyzers wouldn't even need to
know about it.
(But 'common' is one of those politically charged words, as in "it's just plain
common sense to do such and such", when what we really mean is, "I want to do
such and such." :) )
Am I speaking in a void? Anyone else have an opinion on this score? (a-hem. is
this thing on?)
>> [1] "seems" because I'm having a hard time wrapping my head around
>> some of the magic in the interaction between TokenBatch and the Analyzer.
>
> Thanks for that bit of feedback. If we can improve the
> architecture/documentation of those two so that the API is easier to
> grok, great. Power is more important than ease of use, though, since
> relatively few users will need to write custom Analyzer subclasses.
>
I think the docs for each class are adequate; it's the relationship between them
that is a little murky to me. Once I got into the code and lurked about a
little, I could see this happening quite a bit:
$token_batch = $analyzer->analyze( $token_batch );
which just seemed like doublespeak till I looked at the C and XS, and then saw
how Tokens were being created, etc.
Perhaps an example in the Tutorial, or an Advanced Tutorial, showing how/why
someone would want to create their own Analyzer?
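For what it's worth, here's the sort of skeleton I'd expect such a tutorial to
start from, and the sort of thing I'd flesh out for KSx::Analysis::PhraseTokenizer.
Treat the body as a sketch: I'm still not confident I have the TokenBatch
handling right, so the token-walking part is only a comment for now.

package KSx::Analysis::PhraseTokenizer;
use strict;
use warnings;
use base qw( KinoSearch::Analysis::Analyzer );
use KinoSearch::Analysis::Tokenizer;

sub analyze {
    my ( $self, $token_batch ) = @_;
    # let a stock Tokenizer do the actual tokenizing...
    $token_batch = KinoSearch::Analysis::Tokenizer->new->analyze($token_batch);
    # ...then walk the resulting tokens and bump pos_inc wherever a
    # chunk boundary was seen (details TBD -- see the thread above).
    return $token_batch;
}

1;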
--
Peter Karman .
http://peknet.com/ . peter@peknet.com