Mailing List Archive

Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 6:57 AM, Peter Karman wrote:
> natively supported field types make a lot of sense to me.

Hmm... I wasn't originally thinking of this as "native support", just
flexible shorthand -- but now that you mention it, I guess the change
amounts to the same thing.

In addition to the keystroke-savings, the idea was that if someone
wrote a search app in another language, it would be able to read the
invindex, see "text", and know that the field should be assigned a
particular set of characteristics.

[... mind races ...]

It would be really nice if FieldSpecs themselves were completely
serializable. After the "text" change, we have exactly one fixed
class def. I was thinking about adding text::unstored,
text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
gets ridiculous.

Instead, what if you could insert a FieldSpec class def into an
invindex, then assign field names to it?

field_specs:
  keyword:
    analyzed: 0
    stored: 0
fields:
  title: text
  body: text
  category: keyword
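A reader in any language could then resolve a field name to its characteristics with a simple two-level lookup: user-defined specs first, built-in types as fallback. A hypothetical sketch in Python (the dict literal stands in for the parsed schema file; the 'text' defaults are assumptions for illustration, not KinoSearch's actual values):

```python
# Hypothetical reader logic: resolve a field name to its FieldSpec
# characteristics.  BUILTIN_SPECS values are assumptions, not
# KinoSearch's actual defaults.
BUILTIN_SPECS = {
    "text": {"analyzed": 1, "stored": 1},
}

def resolve_field(schema, field_name):
    """Check user-defined field_specs first, then built-in types."""
    spec_name = schema["fields"][field_name]
    custom = schema.get("field_specs", {})
    if spec_name in custom:
        return custom[spec_name]
    return BUILTIN_SPECS[spec_name]

# The dict literal stands in for the parsed schema file above.
schema = {
    "field_specs": {"keyword": {"analyzed": 0, "stored": 0}},
    "fields": {"title": "text", "body": "text", "category": "keyword"},
}

print(resolve_field(schema, "category"))  # {'analyzed': 0, 'stored': 0}
```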

The trick is that while FieldSpec specifies a bunch of booleans that
are trivial to serialize, it also contains things like analyzer that
aren't... or does it?

Turns out that it's possible to write sane serialization code for all
of KinoSearch's Analyzer classes. Rough sketch:

analyzers:
  main_analyzer:
    polyanalyzer:
      language: en
  whitespace_tokenizer:
    tokenizer:
      token_re: "\S+"
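Deserialization then amounts to a dispatch on the class key. A hypothetical sketch in Python, with stand-in analyzer classes whose names mirror the sketch above but are not KinoSearch's API:

```python
import re

# Stand-in analyzer classes for illustration -- not KinoSearch's API.
class PolyAnalyzer:
    def __init__(self, language):
        self.language = language

class Tokenizer:
    def __init__(self, token_re):
        self.token_re = re.compile(token_re)
    def analyze(self, text):
        return self.token_re.findall(text)

CLASS_MAP = {"polyanalyzer": PolyAnalyzer, "tokenizer": Tokenizer}

def build_analyzers(serialized):
    """Instantiate one analyzer per named entry; each entry maps a
    class key to its constructor arguments."""
    built = {}
    for name, spec in serialized.items():
        (class_key, args), = spec.items()  # exactly one class per entry
        built[name] = CLASS_MAP[class_key](**args)
    return built

analyzers = build_analyzers({
    "main_analyzer": {"polyanalyzer": {"language": "en"}},
    "whitespace_tokenizer": {"tokenizer": {"token_re": r"\S+"}},
})
print(analyzers["whitespace_tokenizer"].analyze("a b  c"))  # ['a', 'b', 'c']
```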

We could still make it possible to extend behavior with customized
non-serializable analyzers:

custom_analyzer: "MyApp::CustomAnalyzer"

The resulting invindex just wouldn't be portable.

Let's say all of this would go into a file called schema.yaml.

If we can stuff Analyzers and FieldSpecs into a serialized Schema,
then we've solved a problem in Lucene that I'd given up on solving:
it's not possible to read a Lucene index without knowing additional
information not present in the index itself -- you have to know the
Analyzer that was used.

Unfortunately, the decision to punt on that problem, which led to the
present implementation of Schema, left KinoSearch with a nasty,
though rarely encountered defect: if you change certain aspects of
the Schema class (e.g. analyzer choice or behavior), KS can crash or
behave bizarrely. But... if the Schema is fully described by its
serialized form, that problem goes away for everyone except people
doing non-serializable custom extensions.

Another advantage: I'm pretty sure that the Schema subclass is only
needed at index-time... so it would no longer be necessary to keep
track of an extra .pm file.

This is worth doing. :)

Peter, I know Swish works off of a configuration file. What do you
think of having Schema write out something analogous to the Swish
config file during InvIndexer->finish?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/6/07 5:26 PM, Marvin Humphrey wrote:

> Peter, I know Swish works off of a configuration file. What do you
> think of having Schema write out something analogous to the Swish config
> file during InvIndexer->finish?
>

I think I am suffering a strange sense of deja vu all over again ;)

http://www.rectangular.com/pipermail/kinosearch/2006-November/000560.html

Seriously though, I think it sounds like a fine idea. Swish has 3 native
field types: text, int and date (which is really just an int that gets
output as a timestamp string). All the info about those fields is stored
in the Swish-e index header. So doing something similar in KS, with more
robust field types, makes perfect sense to me, especially when you talk
about the index format in the context of Lucy (which is what I assume
you were alluding to when you wrote about accessing the index using other
languages). Having well-defined native field types, and the ability to
de/serialize the FieldSpecs and Analyzers at play, sounds Good.

(And re: the url thread above: for the record, I like the .yml format
better than .xml; if libswish3 weren't already possessed of a full XML
parser, I would probably use .yml in Swish3 too.)

pek

--
Peter Karman . peter@peknet.com . http://www.peknet.com/


Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/6/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 6, 2007, at 6:57 AM, Peter Karman wrote:
> > natively supported field types make a lot of sense to me.
>
> It would be really nice if FieldSpecs themselves were completely
> serializable. After the "text" change, we have exactly one fixed
> class def. I was thinking about adding text::unstored,
> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
> gets ridiculous.

I fear I'm a slow student, but why is this ridiculous? The particular
names you have chosen are a little cumbersome (because you are
presuming a direct mapping to subclasses), but aren't there only a few
types one wants to support natively: text, blob, keyword, number,
maybe date, what else?

Also, my instinct (perhaps because I've only been looking at the
Scorer side) is that these field types are going to be most useful if
they have a corresponding scorer, such that you can do queries like
"keyword_field:tag text_field:word && number_field:<10". Would
recording the analyzer steps be enough to do this? If desirable, I
think it would need native semantics.

Nathan Kurz
nate@verse.com

[Apologies for not providing the position-passing stuff I promised.
Perhaps tomorrow.]

Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 11:23 PM, Nathan Kurz wrote:

>> It would be really nice if FieldSpecs themselves were completely
>> serializable. After the "text" change, we have exactly one fixed
>> class def. I was thinking about adding text::unstored,
>> text::unanalyzed, text::unstoredunanalyzed, etc... but that quickly
>> gets ridiculous.
>
> I fear I'm a slow student, but why is this ridiculous? The particular
> names you have chosen are a little cumbersome (because you are
> presuming a direct mapping to subclasses), but

The above formulation assumes a no-argument constructor. All the
information about the field type's behavior is carried by the class
name. If we simply add the following class to KS, along with alias
resolution in Schema.pm...

package KinoSearch::FieldSpec::text::unanalyzed;
use parent qw( KinoSearch::FieldSpec::text );
sub analyzed { 0 }

... then it becomes possible for a user to spec a %fields hash like so:

our %fields = (
    title => 'text',
    url   => 'text::unanalyzed',
);

However, that stratagem scales poorly, because you need a unique
class name for each combination of characteristics.

If we make it possible to embed serialized FieldSpecs in an invindex,
though... we don't need to add all those subclasses to the KS core. :)

> aren't there only a few types one
> wants to support natively: text, blob, keyword, number, maybe date,
> what else?

Something like that.

'keyword' is not a very useful type, because it's so close to 'text':
various 'keyword' fields might or might not be analyzed (e.g. for
lower-casing), vectorized, or stored, so users will end up creating
their own subclasses to get the exact behavior they want anyway.

For now, I think we need only one: text. We might also add 'blob'
because it's easy and straightforward.

package KinoSearch::FieldSpec::blob;
use parent qw( KinoSearch::FieldSpec );

sub indexed    { FALSE }
sub stored     { TRUE }
sub analyzed   { FALSE }
sub vectorized { FALSE }
sub binary     { TRUE }
sub compressed { FALSE }

> Also, my instinct (perhaps because I've only been looking at the
> Scorer side) is that these field types are going to be most useful if
> they have a corresponding scorer,

Agreed.

> such that you can do queries
> like "keyword_field:tag text_field:word && number_field:<10".

That kind of query would be nice to support.

> Would recording the analyzer steps be enough to do this?

For the various number types, and for 'date' as well depending on
implementation: the existing query classes won't work well, if at all.

However, I don't think that's an immediate concern. My main goal
with serializing Schema is to make the invindex file format
self-describing, so that it becomes possible to read one without the
need for any auxiliary information.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On 9/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> My main goal
> with serializing Schema is to make the invindex file format
> self-describing, so that it becomes possible to read one without the
> need for any auxiliary information.

Thanks for the explanation. I understand better now.

I think I agree with all of that, with the small exception that I
don't think you gain much by procedurally specifying the tokenizer.
I think specifying it as "tokenizer: whitespace" and letting the
reader handle the implementation is wiser than specifying a split
on "\S+".
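For instance, the serialized analyzer chain might name the tokenizer symbolically rather than carrying the regex (hypothetical syntax, extrapolated from the sketch earlier in the thread):

```yaml
analyzers:
  main_analyzer:
    tokenizer: whitespace   # symbolic name; each reader supplies its own implementation
```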

If you are trying to be language-agnostic, requiring the reader to be
able to handle what could be arbitrary expressions in a particular
regexp language seems onerous, even if it is a pretty standard one.
In particular, I can see wanting a straight C implementation using
flex rather than a regexp library.

I don't feel strongly about this, though, since if one really wants to
do this one could just do it non-portably.

Nathan Kurz
nate@verse.com

Re: Serialized Schema (was KinoSearch::FieldSpec::text)
On Sep 6, 2007, at 5:40 PM, Peter Karman wrote:

> On 9/6/07 5:26 PM, Marvin Humphrey wrote:
>
>> Peter, I know Swish works off of a configuration file. What do
>> you think of having Schema write out something analogous to the
>> Swish config file during InvIndexer->finish?
>
> I think I am suffering a strange sense of deja vu all over again ;)
>
> http://www.rectangular.com/pipermail/kinosearch/2006-November/000560.html

The difference between then and now is that back then I didn't think
it was going to be possible to serialize a Schema well enough that
you'd not need the original class. In fact, at the time I regarded
that insight as a liberation: if you were stuck providing the
Analyzer externally, you might as well put a whole slew of stuff into
classes.

I was also ill-informed about the security of accepting regular
expressions from a potentially untrusted source: I didn't realize
that /(?{$code})/ was disabled by default. It wasn't until after
that discussion that I became familiar with the 're' pragma.

> Seriously though, I think it sounds like a fine idea. Swish has 3
> native field types: text, int and date (which is really just an int
> that gets output as a timestamp string). All the info about those
> fields is stored in the Swish-e index header. So doing something
> similar in KS, with more robust field types, makes perfect sense to
> me, especially when you talk about the index format in the context
> of Lucy (which is what I assume you were alluding to when you wrote
> about accessing the index using other languages).

I'm not necessarily talking about Lucy. What I'd like to do is write
a formal spec for the "invindex" file format, opening things up for
other apps. Like the Lucene file format spec, except usable.

Since good programming is all about designing good data structures,
formally defining the spec would be a useful exercise. It might be
nice to issue some RFCs on the Lucene list, PerlMonks, Swish list
(?), etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
On Sep 6, 2007, at 5:40 PM, Peter Karman wrote:

> (And re: the url thread above: for the record, I like the .yml
> format better than .xml; if libswish3 weren't already possessed of
> a full XML parser, I would probably use .yml in Swish3 too.)

Have you considered JSON? :)

I'm annoyed by the fact that there isn't a minimal "YAML level 1"
spec. The complete YAML spec is grievously afflicted by featuritis.

Here's the problem: right now, KS uses custom routines to read/write
a small subset of YAML. But if other implementations start using the
file format, it will be easy for them to produce something that's
valid YAML but that KS isn't prepared to handle.

This is sort-of solvable by adding a fully compliant YAML parser to
the KS dependency chain -- which naturally I intend to avoid. But
the general problem would still exist: so long as the invindex file
format specifies "YAML", any implementation would be required to have
a complete -- and thus monstrous -- YAML parser to read externally
generated invindexes reliably.
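The subset KS writes is small enough that a reader fits in a few dozen lines of any language; the trouble is everything else in the YAML spec (flow style, anchors, folded scalars) that such a reader would reject or mishandle. A hypothetical sketch of a subset reader in Python, assuming nested mappings with single-line scalar leaves and space indentation:

```python
def parse_subset(text):
    """Parse a tiny YAML subset: nested mappings whose leaves are
    single-line scalars, indentation-delimited.  No flow style, no
    anchors/aliases, no multi-line scalars -- a sketch, not real YAML."""
    root = {}
    stack = [(-1, root)]  # (indent, mapping) pairs, innermost last
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blanks and comments
        indent = len(raw) - len(raw.lstrip(" "))
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        while indent <= stack[-1][0]:
            stack.pop()  # close mappings at deeper or equal indent
        parent = stack[-1][1]
        if value:
            parent[key] = value          # scalar leaf
        else:
            parent[key] = child = {}     # nested mapping
            stack.append((indent, child))
    return root

schema_text = """\
field_specs:
  keyword:
    analyzed: 0
    stored: 0
fields:
  title: text
  category: keyword
"""
print(parse_subset(schema_text))
# -> {'field_specs': {'keyword': {'analyzed': '0', 'stored': '0'}},
#     'fields': {'title': 'text', 'category': 'keyword'}}
```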

CPAN has YAML::Tiny, which was inspired by the same sense of
revulsion I feel when perusing the YAML spec. Unfortunately, it's a
non-specific subset implementation, not a strictly defined spec.

I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
Language. The target for ASHL level 1 would be non-trivial config
files -- basically, stuff that's too complex for .ini-style pairs.
It would use YAML's indentation and its notation for hashes and
arrays, but scalars would be single-line only and the character set
would be limited to ASCII.

The problem with that idea, though, is that when you expand it
outwards to ASHL level 2, you need to add unicode escapes and
multi-line scalars. At that point, it starts to look an awful lot like
JSON, and it's hard to justify as an independent format.
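For comparison, the field_specs sketch from earlier in the thread translates to JSON with little ceremony -- unicode and multi-line handling come for free, at the cost of the brackets (hypothetical, not a proposed format):

```json
{
  "field_specs": {
    "keyword": { "analyzed": 0, "stored": 0 }
  },
  "fields": {
    "title": "text",
    "body": "text",
    "category": "keyword"
  }
}
```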

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


