Mailing List Archive

Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 6:57 AM, Peter Karman wrote:
> natively supported field types make a lot of sense to me.

Hmm... I wasn't originally thinking of this as "native support", just
flexible shorthand -- but now that you mention it, I guess the change
amounts to the same thing.

In addition to the keystroke-savings, the idea was that if someone
wrote a search app in another language, it would be able to read the
invindex, see "text", and know that the field should be assigned a
particular set of characteristics.

[... mind races ...]

It would be really nice if FieldSpecs themselves were completely
serializable. After the "text" change, we have exactly one fixed
class def. I was thinking about adding text::unstored,
text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
gets ridiculous.

Instead, what if you could insert a FieldSpec class def into an
invindex, then assign field names to it?

field_specs:
  keyword:
    analyzed: 0
    stored: 0
fields:
  title: text
  body: text
  category: keyword
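A reader in any language could then resolve a field name to its characteristics with a simple two-level lookup: user-defined specs first, built-in types as fallback. A hypothetical sketch in Python (the dict literal stands in for the parsed schema file; the 'text' defaults are assumptions for illustration, not KinoSearch's actual values):

```python
# Hypothetical reader logic: resolve a field name to its FieldSpec
# characteristics.  BUILTIN_SPECS values are assumptions, not
# KinoSearch's actual defaults.
BUILTIN_SPECS = {
    "text": {"analyzed": 1, "stored": 1},
}

def resolve_field(schema, field_name):
    """Check user-defined field_specs first, then built-in types."""
    spec_name = schema["fields"][field_name]
    custom = schema.get("field_specs", {})
    if spec_name in custom:
        return custom[spec_name]
    return BUILTIN_SPECS[spec_name]

# The dict literal stands in for the parsed schema file above.
schema = {
    "field_specs": {"keyword": {"analyzed": 0, "stored": 0}},
    "fields": {"title": "text", "body": "text", "category": "keyword"},
}

print(resolve_field(schema, "category"))  # {'analyzed': 0, 'stored': 0}
```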

The trick is that while FieldSpec specifies a bunch of booleans that
are trivial to serialize, it also contains things like analyzer that
aren't... or does it?

Turns out that it's possible to write sane serialization code for all
of KinoSearch's Analyzer classes. Rough sketch:

analyzers:
  main_analyzer:
    polyanalyzer:
      language: en
  whitespace_tokenizer:
    tokenizer:
      token_re: "\S+"
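Deserialization then amounts to a dispatch on the class key. A hypothetical sketch in Python, with stand-in analyzer classes whose names mirror the sketch above but are not KinoSearch's API:

```python
import re

# Stand-in analyzer classes for illustration -- not KinoSearch's API.
class PolyAnalyzer:
    def __init__(self, language):
        self.language = language

class Tokenizer:
    def __init__(self, token_re):
        self.token_re = re.compile(token_re)
    def analyze(self, text):
        return self.token_re.findall(text)

CLASS_MAP = {"polyanalyzer": PolyAnalyzer, "tokenizer": Tokenizer}

def build_analyzers(serialized):
    """Instantiate one analyzer per named entry; each entry maps a
    class key to its constructor arguments."""
    built = {}
    for name, spec in serialized.items():
        (class_key, args), = spec.items()  # exactly one class per entry
        built[name] = CLASS_MAP[class_key](**args)
    return built

analyzers = build_analyzers({
    "main_analyzer": {"polyanalyzer": {"language": "en"}},
    "whitespace_tokenizer": {"tokenizer": {"token_re": r"\S+"}},
})
print(analyzers["whitespace_tokenizer"].analyze("a b  c"))  # ['a', 'b', 'c']
```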

We could still make it possible to extend behavior with customized
non-serializable analyzers:

custom_analyzer: "MyApp::CustomAnalyzer"

The resulting invindex just wouldn't be portable.

Let's say all of this would go into a file called schema.yaml.

If we can stuff Analyzers and FieldSpecs into a serialized Schema,
then we've solved a problem in Lucene that I'd given up on solving:
it's not possible to read a Lucene index without knowing additional
information not present in the index itself -- you have to know the
Analyzer that was used.

Unfortunately, the decision to punt on that problem, which led to the
present implementation of Schema, left KinoSearch with a nasty,
though rarely encountered defect: if you change certain aspects of
the Schema class (e.g. analyzer choice or behavior), KS can crash or
behave bizarrely. But... if the Schema is fully described by its
serialized form, that problem goes away for everyone except people
doing non-serializable custom extensions.

Another advantage: I'm pretty sure that the Schema subclass is only
needed at index-time... so it would no longer be necessary to keep
track of an extra .pm file.

This is worth doing. :)

Peter, I know Swish works off of a configuration file. What do you
think of having Schema write out something analogous to the Swish
config file during InvIndexer->finish?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/6/07 5:26 PM, Marvin Humphrey wrote:

> Peter, I know Swish works off of a configuration file. What do you
> think of having Schema write out something analogous to the Swish config
> file during InvIndexer->finish?
>

I think I am suffering a strange sense of deja vu all over again ;)

http://www.rectangular.com/pipermail/kinosearch/2006-November/000560.html

Seriously though, I think it sounds like a fine idea. Swish has 3 native
field types: text, int and date (which is really just an int that gets
output as a timestamp string). All the info about those fields is stored
in the Swish-e index header. So doing something similar in KS, with more
robust field types, makes perfect sense to me, especially when you talk
about the index format in the context of Lucy (which is what I assume
you were alluding to when you wrote about accessing the index using other
languages). Having well-defined native field types, and the ability to
de/serialize the FieldSpecs and Analyzers at play, sounds Good.

(And re: the url thread above: for the record, I like the .yml format
better than .xml; if libswish3 weren't already possessed of a full XML
parser, I would probably use .yml in Swish3 too.)

pek

--
Peter Karman . peter@peknet.com . http://www.peknet.com/


Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/6/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 6, 2007, at 6:57 AM, Peter Karman wrote:
> > natively supported field types make a lot of sense to me.
>
> It would be really nice if FieldSpecs themselves were completely
> serializable. After the "text" change, we have exactly one fixed
> class def. I was thinking about adding text::unstored,
> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
> gets ridiculous.

I fear I'm a slow student, but why is this ridiculous? The particular
names you have chosen are a little cumbersome (because you are
presuming a direct mapping to subclasses), but aren't there only a few
types one wants to support natively: text, blob, keyword, number,
maybe date, what else?

Also, my instinct (perhaps because I've only been looking at the
Scorer side) is that these field types are going to be most useful if
they have a corresponding scorer, such that you can do queries like
"keyword_field:tag text_field:word && number_field:<10". Would
recording the analyzer steps be enough to do this? If desirable, I
think it would need native semantics.

Nathan Kurz
nate@verse.com

[Apologies for not providing the position-passing stuff I promised.
Perhaps tomorrow.]

Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 11:23 PM, Nathan Kurz wrote:

>> It would be really nice if FieldSpecs themselves were completely
>> serializable. After the "text" change, we have exactly one fixed
>> class def. I was thinking about adding text::unstored,
>> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
>> gets ridiculous.
>
> I fear I'm a slow student, but why is this ridiculous? The particular
> names you have chosen are a little cumbersome (because you are
> presuming a direct mapping to subclasses), but

The above formulation assumes a no-argument constructor. All the
information about the field type's behavior is carried by the class
name. If we simply add the following class to KS, along with alias
resolution in Schema.pm...

package KinoSearch::FieldSpec::text::unanalyzed;
use parent qw( KinoSearch::FieldSpec::text );
sub analyzed { 0 }

... then it becomes possible for a user to spec a %fields hash like so:

our %fields = (
    title => 'text',
    url   => 'text::unanalyzed',
);

However, that stratagem scales poorly, because you need a unique
class name for each combination of characteristics.

If we make it possible to embed serialized FieldSpecs in an invindex,
though... we don't need to add all those subclasses to the KS core. :)

> aren't there only a few types one
> wants to support natively: text, blob, keyword, number, maybe date,
> what else?

Something like that.

'keyword' is not a very useful type, because it's so close to 'text':
various 'keyword' fields might or might not be analyzed (e.g. for
lower-casing), vectorized, or stored, so users will end up creating
their own subclasses to get the exact behavior they want anyway.

For now, I think we need only one: text. We might also add 'blob'
because it's easy and straightforward.

package KinoSearch::FieldSpec::blob;
use parent qw( KinoSearch::FieldSpec );

sub indexed    { FALSE }
sub stored     { TRUE }
sub analyzed   { FALSE }
sub vectorized { FALSE }
sub binary     { TRUE }
sub compressed { FALSE }

> Also, my instinct (perhaps because I've only been looking at the
> Scorer side) is that these field types are going to be most useful if
> they have a corresponding scorer,

Agreed.

> such that you can do queries
> like "keyword_field:tag text_field:word && number_field:<10".

That kind of query would be nice to support.

> Would recording the analyzer steps be enough to do this?

For the various number types, and for 'date' as well depending on
implementation: the existing query classes won't work well, if at all.

However, I don't think that's an immediate concern. My main goal
with serializing Schema is to make the invindex file format
self-describing, so that it becomes possible to read one without the
need for any auxiliary information.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> My main goal
> with serializing Schema is to make the invindex file format
> self-describing, so that it becomes possible to read one without the
> need for any auxiliary information.

Thanks for the explanation. I understand better now.

I think I agree with all of that, with the small exception that I
don't think you gain much by procedurally specifying the tokenizer.
I think specifying it as "tokenizer: whitespace" and letting the
reader handle the implementation is wiser than specifying a split
on "\S+".
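For instance, the serialized analyzer chain might name the tokenizer symbolically rather than carrying the regex (hypothetical syntax, extrapolated from the sketch earlier in the thread):

```yaml
analyzers:
  main_analyzer:
    tokenizer: whitespace   # symbolic name; each reader supplies its own implementation
```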

If you are trying to be language-agnostic, requiring the reader to be
able to handle what could be arbitrary expressions in a particular
regexp language seems onerous, even if it is a pretty standard one.
In particular, I can see wanting a straight C implementation using
flex rather than a regexp library.

I don't feel strongly about this, though, since if one really wants to
do this one could just do it non-portably.

Nathan Kurz
nate@verse.com

Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 5:40 PM, Peter Karman wrote:

> On 9/6/07 5:26 PM, Marvin Humphrey wrote:
>
>> Peter, I know Swish works off of a configuration file. What do
>> you think of having Schema write out something analogous to the
>> Swish config file during InvIndexer->finish?
>
> I think I am suffering a strange sense of deja vu all over again ;)
>
> http://www.rectangular.com/pipermail/kinosearch/2006-November/000560.html

The difference between then and now is that back then I didn't think
it was going to be possible to serialize a Schema well enough that
you'd not need the original class. In fact, at the time I regarded
that insight as a liberation: if you were stuck providing the
Analyzer externally, you might as well put a whole slew of stuff into
classes.

I was also ill-informed about the security of accepting regular
expressions from a potentially untrusted source: I didn't realize
that /(?{$code})/ was disabled by default. It wasn't until after
that discussion that I became familiar with the 're' pragma.

> Seriously though, I think it sounds like a fine idea. Swish has 3
> native field types: text, int and date (which is really just an int
> that gets output as a timestamp string). All the info about those
> fields is stored in the Swish-e index header. So doing something
> similar in KS, with more robust field types, makes perfect sense to
> me, especially when you talk about the index format in the context
> of Lucy (which is what I assume you were alluding to when you wrote
> about accessing the index using other languages).

I'm not necessarily talking about Lucy. What I'd like to do is write
a formal spec for the "invindex" file format, opening things up for
other apps. Like the Lucene file format spec, except usable.

Since good programming is all about designing good data structures,
formally defining the spec would be a useful exercise. It might be
nice to issue some RFCs on the Lucene list, PerlMonks, Swish list
(?), etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
On Sep 6, 2007, at 5:40 PM, Peter Karman wrote:

> (And re: the url thread above: for the record, I like the .yml
> format better than .xml; if libswish3 weren't already possessed of
> a full XML parser, I would probably use .yml in Swish3 too.)

Have you considered JSON? :)

I'm annoyed by the fact that there isn't a minimal "YAML level 1"
spec. The complete YAML spec is grievously afflicted by featuritis.

Here's the problem: right now, KS uses custom routines to read/write
a small subset of YAML. But if other implementations start using the
file format, it will be easy for them to produce something that's
valid YAML but that KS isn't prepared to handle.

This is sort-of solvable by adding a fully compliant YAML parser to
the KS dependency chain -- which naturally I intend to avoid. But
the general problem would still exist: so long as the invindex file
format specifies "YAML", any implementation would be required to have
a complete -- and thus monstrous -- YAML parser to read externally
generated invindexes reliably.
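The subset KS writes is small enough that a reader fits in a few dozen lines of any language; the trouble is everything else in the YAML spec (flow style, anchors, folded scalars) that such a reader would reject or mishandle. A hypothetical sketch of a subset reader in Python, assuming nested mappings with single-line scalar leaves and space indentation:

```python
def parse_subset(text):
    """Parse a tiny YAML subset: nested mappings whose leaves are
    single-line scalars, indentation-delimited.  No flow style, no
    anchors/aliases, no multi-line scalars -- a sketch, not real YAML."""
    root = {}
    stack = [(-1, root)]  # (indent, mapping) pairs, innermost last
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blanks and comments
        indent = len(raw) - len(raw.lstrip(" "))
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        while indent <= stack[-1][0]:
            stack.pop()  # close mappings at deeper or equal indent
        parent = stack[-1][1]
        if value:
            parent[key] = value          # scalar leaf
        else:
            parent[key] = child = {}     # nested mapping
            stack.append((indent, child))
    return root

schema_text = """\
field_specs:
  keyword:
    analyzed: 0
    stored: 0
fields:
  title: text
  category: keyword
"""
print(parse_subset(schema_text))
# -> {'field_specs': {'keyword': {'analyzed': '0', 'stored': '0'}},
#     'fields': {'title': 'text', 'category': 'keyword'}}
```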

CPAN has YAML::Tiny, which was inspired by the same sense of
revulsion I feel when perusing the YAML spec. Unfortunately, it's a
non-specific subset implementation, not a strictly defined spec.

I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
Language. The target for ASHL level 1 would be non-trivial config
files -- basically, stuff that's too complex for .ini-style pairs.
It would use YAML's indentation and its notation for hashes and
arrays, but scalars would be single-line only and the character set
would be limited to ASCII.

The problem with that idea, though, is that when you expand it
outwards to ASHL level 2, you need to add unicode escapes and
multi-line scalars. At that point, it starts to look an awful lot like
JSON, and it's hard to justify as an independent format.
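For comparison, the field_specs sketch from earlier in the thread translates to JSON with little ceremony -- unicode and multi-line handling come for free, at the cost of the brackets (hypothetical, not a proposed format):

```json
{
  "field_specs": {
    "keyword": { "analyzed": 0, "stored": 0 }
  },
  "fields": {
    "title": "text",
    "body": "text",
    "category": "keyword"
  }
}
```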

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


