Mailing List Archive

Re: Serialized Schema
On Sep 7, 2007, at 1:24 PM, Nathan Kurz wrote:

> On 9/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>> My main goal with serializing Schema is to make the invindex file
>> format self-describing, so that it becomes possible to read one
>> without the need for any auxiliary information.
>
> Thanks for the explanation. I understand better now.
>
> I think I agree with all of that, with the small exception that I
> don't think you gain much by procedurally specifying the tokenizer. I
> think specifying it as "tokenizer: whitespace" and letting the reader
> handle the implementation is wiser than specifying a split on "\S+".

There are a couple of ways to do that.

We could have the Tokenizer constructor accept a "type" parameter,
which would then map to a particular implementation. But that
offers no advantage over a second, clearer option...

We could create a suite of officially sanctioned Tokenizer classes.
WhitespaceTokenizer, WordCharTokenizer, etc. However... few of these
are truly useful, and just about all the useful ones can be
implemented using a regex-based Tokenizer. Indeed, that is why KS
has only one Tokenizer class, while Lucene has several. A single
regex-based Tokenizer like the one we have now offers the greatest
combination of flexibility, power, and simplicity of implementation.

The problem we face now, though, is how to specify the token_re
argument portably.

By the way, I suspect it was only a brain-hiccup on your part, but
specifying a token_re of "\S+" is not the same as a split -- it's
actually the inverse. The regex is used to match the tokens
themselves rather than the separators between tokens.
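For the record, here's the distinction in a throwaway Perl sketch
(illustrative only, not KS code):

my $text = "  a few short tokens";

# token_re style: the regex matches the tokens themselves.
my @matched = $text =~ /\S+/g;     # ('a', 'few', 'short', 'tokens')

# split style: the regex matches the separators between tokens.
my @split = split /\s+/, $text;    # ('', 'a', 'few', 'short', 'tokens')

The two agree except at the edges -- note the leading empty string
that split produces when the text begins with whitespace.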

> If you are trying to be language-agnostic, requiring the reader to be
> able to handle what could be arbitrary expressions in a particular
> regexp language seems onerous, even if it is a pretty standard one.

I thought about this for a while. Perl-compatible regular expression
syntax is very widespread. There isn't an official standard which
freezes the syntax a la POSIX, which is somewhat dissatisfying
because it means you can't guarantee compliance. But for practical
purposes, common regexes ought to be portable.

> In particular, I can see wanting a straight C implementation using
> flex rather than a regexp library.

> I don't feel strongly about this, though, since if one really wants to
> do this one could just do it non-portably.

Yes. A flex-based tokenizer for a C implementation would be cool, it
just wouldn't be part of the official list of blessed Analyzers.
Someone could release a KSx version to CPAN if so motivated -- but
back-compat issues would be handled as an independent project outside
of the KS core.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
On 9/29/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> A single
> regex-based Tokenizer like the one we have now offers the greatest
> combination of flexibility, power, and simplicity of implementation.

Simple tokenizers work for the things I want to do, but I'm not
sure a regex is really that generally useful. How many useful
regular expressions are we talking about here? Also, tasks like
tokenizing Asian languages seem like they would be hard with just a
regex. There was someone who wrote to the list asking about doing
that a while ago.

> By the way, I suspect it was only a brain-hiccup on your part, but
> specifying a token_re of "\S+" is not the same as a split -- it's
> actually the inverse.

Indeed. I partially reworded, didn't reread, and produced nonsense.

> I thought about this for a while. Perl-compatible regular expression
> syntax is very widespread.

Agreed in retrospect. If you are specifying a regex, this is a fine
way to do it. More generally, I'm fine with the path you suggest; I'm
just not sure the generality of the regex approach actually produces
much gain. I have no objection to it in principle.

> Yes. A flex-based tokenizer for a C implementation would be cool, it
> just wouldn't be part of the official list of blessed Analyzers.

I'm thinking about it mostly on the search end, rather than the
indexing end, with a flex-based tokenizer reading a file produced by
another system. Instead of being a different blessed Analyzer, it
would just be another implementation capable of handling the same
named tokenization scheme.

> I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
> Language.

I don't think this is a good use of your time, unless this spec is
explicitly written to define a subset of JSON or YAML. Although you
will be happy writing your own ASHL interpreter, most others who would
be using your format for some other purpose would likely prefer to use
an existing interpreter. If the point comes when some other
implementation using your file format gains sufficient popularity that
you need to support the full spec, count it as a success!

Nathan Kurz
nate@verse.com

Re: Serialized Schema
On Sun, Sep 30, 2007 at 03:56:21PM -0600, Nathan Kurz wrote:
> Simple tokenizers work for the things I want to do, but I'm not
> sure a regex is really that generally useful. How many useful
> regular expressions are we talking about here?

There's the KS default:

qr/\w+(?:'\w+)*/

Then there's WhiteSpaceTokenizer:

qr/\S+/
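Both are just different token_re arguments to the same class. A quick
sketch against the KS Tokenizer constructor:

use KinoSearch::Analysis::Tokenizer;

# The stock word-character tokenizer...
my $word_tokenizer = KinoSearch::Analysis::Tokenizer->new(
    token_re => qr/\w+(?:'\w+)*/,
);

# ... and a whitespace tokenizer via the same API.
my $whitespace_tokenizer = KinoSearch::Analysis::Tokenizer->new(
    token_re => qr/\S+/,
);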

Then there's the Lucene StandardTokenizer, which is implemented using javacc.
Plucene emulates it with regexes:

# Don't blame me, blame the Plucene people!
my $alpha      = qr/\p{IsAlpha}+/;
my $apostrophe = qr/$alpha('$alpha)+/;
my $acronym    = qr/$alpha\.($alpha\.)+/;
my $company    = qr/$alpha(&|\@)$alpha/;
my $hostname   = qr/\w+(\.\w+)+/;
my $email      = qr/\w+\@$hostname/;
my $p          = qr/[_\/.,-]/;
my $hasdigit   = qr/\w*\d\w*/;
my $num        = qr/\w+$p$hasdigit|$hasdigit$p\w+
                   |\w+($p$hasdigit$p\w+)+
                   |$hasdigit($p\w+$p$hasdigit)+
                   |\w+$p$hasdigit($p\w+$p$hasdigit)+
                   |$hasdigit$p\w+($p$hasdigit$p\w+)+/x;

=head2 token_re

The regular expression for tokenising.

=cut

sub token_re {
    qr/
        $apostrophe | $acronym | $company | $hostname | $email | $num
        | \w+
    /x;
}

(: That first comment is in the source. :)

For simplicity, KS has stayed away from offering that as a stock item, but you
can see how it would be useful.

Also, it's not uncommon to see messages to the Lucene users' list from someone
wanting to know how to tweak StandardTokenizer for a specific problem domain.
Variants on StandardTokenizer are possible with a regex-based tokenizer -- but
not with a named Tokenizer subclass.

> Also, tasks like tokenizing Asian languages seem like they would be hard
> with just a regex.

That's right. But tokenizing Asian languages, particularly Japanese, is
frightfully difficult and complex -- so core KS isn't really the right place
for such a tokenizer, and it shouldn't be part of the file format.

> I'm thinking about it mostly on the search end, rather than the
> indexing end, with a flex-based tokenizer reading a file produced by
> another system. Instead of being a different blessed Analyzer, it
> would just be another implementation capable of handling the same
> named tokenization scheme.

Yes, I can see how that would be handy.

Nothing would stop you, though, from implementing such a named tokenizer. You
just have to make sure that both implementations know how to handle it:

analyzer: "My::Custom::Tokenizer"

That, as opposed to this:

analyzer:
  tokenizer:
    token_re: "\S+"

All I'm saying is that to fully support the invindex file format, you have to
support a small set of Analyzers. (Not coincidentally, they're the ones that
are in KS right now.)
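A conforming reader's dispatch could be quite small. Here's a
hypothetical sketch -- analyzer_from_spec and the spec layout are
illustrative, nothing's finalized:

# Hypothetical sketch only; names and spec layout are illustrative.
sub analyzer_from_spec {
    my $spec = shift;
    if ( !ref $spec ) {    # named scheme: "My::Custom::Tokenizer"
        eval "require $spec; 1" or die "can't load $spec: $@";
        return $spec->new;
    }
    elsif ( my $tok = $spec->{tokenizer} ) {    # spec'd regex
        return KinoSearch::Analysis::Tokenizer->new(
            token_re => qr/$tok->{token_re}/,
        );
    }
    die "unrecognized analyzer spec";
}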

> > I'm tempted to write a formal spec called "ASHL" -- Array Scalar Hash
> > Language.
>
> I don't think this is a good use of your time, unless this spec is
> explicitly written to define a subset of JSON or YAML.

Agreed. I don't really want that task. I want the YAML people to define YAML
Level 1 so KS can use it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
Marvin Humphrey wrote on 9/29/07 5:32 PM:
>
> On Sep 6, 2007, at 5:40 PM, Peter Karman wrote:
>
>> (And re: the url thread above: for the record, I like the .yml format
>> better than .xml; if libswish3 weren't already possessed of a full XML
>> parser, I would probably use .yml in Swish3 too.)
>
> Have you considered JSON? :)
>
> I'm annoyed by the fact that there isn't a minimal "YAML level 1" spec.
> The complete YAML spec is grievously afflicted by featuritis.
>
> Here's the problem: right now, KS uses custom routines to read/write a
> small subset of YAML. But if other implementations start using the file
> format, it will be easy for them to produce something that's valid YAML
> but that KS isn't prepared to handle.
>

Sounds like what you want isn't an official subset of the language, but rather
something like an SGML document type definition (called (overload overload) a
schema in XML parlance). Just an official declaration of what constitutes a
legal KS header. When you write a sentence in English, you need a subject and a
predicate. Otherwise it isn't a sentence. You don't have to define a subset of
English to compose a sentence. But you do need to meet certain requirements.

Maybe the analogy is that the KS header supports the 'simple sentence'.

http://en.wikipedia.org/wiki/Sentence_%28linguistics%29

--
Peter Karman . http://peknet.com/ . peter@peknet.com

Re: Serialized Schema
On Oct 4, 2007, at 5:26 PM, Peter Karman wrote:

> Sounds like what you want isn't an official subset of the language,
> but rather something like an SGML document type definition (called
> (overload overload) a schema in XML parlance). Just an official
> declaration of what constitutes a legal KS header.

I started messing around with defining what aspects of YAML are key.
(http://www.rectangular.com/kinosearch/wiki/FileFormat).

Is there an XSD schema for the Swish format? I haven't written one
before, but being able to follow a schema for writing schemas
(overload overload overload) appeals to me.

A switch to XML for KS metadata file serialization might be in
order. It was kind of a toss-up between the two contenders. But
when I brought this up on the Lucene list a while ago, people were
like "YAML, what's that?". And Swish uses XML. Might be time to go
with the flow. (Switching wouldn't even be disruptive, since we'd
just look for the segments_XXX.whatever file and parse it according
to the extension.)
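The reader-side dispatch would be trivial. A hypothetical sketch,
where the parse_* routines are stand-ins:

# Hypothetical sketch; the parse_* routines are stand-ins.
my %parser_for = (
    yml  => \&parse_yaml_subset,
    xml  => \&parse_xml_subset,
    json => \&parse_json,
);
my ($meta_file) = grep {/^segments_\w+\./} @invindex_files;
my ($extension) = $meta_file =~ /\.(\w+)$/;
my $metadata    = $parser_for{$extension}->($meta_file);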

I'd kind of like to stick with using a minimal custom parser rather
than adding a full-on XML parser as a dependency. That means placing
restrictions on the XML akin to those I laid out for YAML. You know
whether spec'ing those sorts of restrictions is something XSD is set
up to handle?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Serialized Schema
On 10/04/2007 08:46 PM, Marvin Humphrey wrote:
>
> On Oct 4, 2007, at 5:26 PM, Peter Karman wrote:
>
>> Sounds like what you want isn't an official subset of the language,
>> but rather something like an SGML document type definition (called
>> (overload overload) a schema in XML parlance). Just an official
>> declaration of what constitutes a legal KS header.
>
> I started messing around with defining what aspects of YAML are key.
> (http://www.rectangular.com/kinosearch/wiki/FileFormat).
>
> Is there an XSD schema for the Swish format? I haven't written one
> before, but being able to follow a schema for writing schemas (overload
> overload overload) appeals to me.

No, Swish-e 2.x doesn't use XML to store the header, and Swish3 doesn't have an
official Schema. Yet. Probably will though, when I can get back to that project.

>
> A switch to XML for KS metadata file serialization might be in order.
> It was kind of a toss up between the two contenders. But when I
> brought this up on the Lucene list a while ago, people were like "YAML,
> what's that?". And Swish uses XML. Might be time to go with the flow.
> (Switching wouldn't even be disruptive, since we'd just look for the
> segments_XXX.whatever file and parse it according to the extension.)
>

well, Java folk are notoriously XML-centric, so that doesn't surprise me.

XML vs YAML got discussed here before. I'm not convinced you need XML; it's
probably a little harder to read than YAML, but XML does have wider adoption at
this point in history. Guess it in part depends on (1) how hard it is to write
your own parser for either, and (2) if you have any philosophical agenda to
promote.


> I'd kind of like to stick with using a minimal custom parser rather than
> adding a full-on XML parser as a dependency. That means placing
> restrictions on the XML akin to those I laid out for YAML. You know
> whether spec'ing those sorts of restrictions is something XSD is set up
> to handle?
>

You can definitely restrict the kind of XML allowed with a Schema. See
http://www.w3schools.com/schema/schema_elements_ref.asp for example.

You might just simplify it to the point where you don't allow any attributes,
limited nesting of elements, require all lowercase element names, etc. That
would make parsing much simpler. The hardest thing I have found in rolling my
own XML parser is tracking the nesting. If you take a SAX approach this is less
important, but if you take a DOM approach, then it's harder to do.
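For instance, a minimal hypothetical XSD sketch along those lines --
element names invented for illustration, attribute-free and lowercase
throughout:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="schema">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="analyzer" type="xs:string" minOccurs="0"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>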

Then again, libxml2 is pretty widely available on all *nix systems and there
are Win32 ports available too...

--
Peter Karman . peter@peknet.com . http://peknet.com/


Re: Serialized Schema
On Oct 5, 2007, at 6:17 AM, Peter Karman wrote:

> I'm not convinced you need XML; it's
> probably a little harder to read than YAML, but XML does have wider
> adoption at
> this point in history.

You're right that we don't need XML. The framework XSD provides
is nice, but it's more than we require.

Furthermore, while I'm confident that I could write a basic round-
trip parser for handling KinoSearch-specific XML, I'm not confident
that I could write a water-tight spec and a parser that's guaranteed
to handle all possible corner cases generated by conforming
applications.

Using YAML presents slightly different problems. The full YAML spec
is sadly bloated; declaring that the InvIndex file format uses "YAML"
means that everyone who implements it fully needs a full-blown YAML
parser. To avoid that, we might want to limit allowable constructs,
but since there's no YAML equivalent of XSD, we would have to add our
own ad hoc restrictions. That might be doable, but seems hackish and
fiddly.

Time to consider a third alternative: "All InvIndex metadata files
use UTF-8 encoded JSON."

The JSON spec is tiny compared to XML and YAML, but it's sufficient.
It has an official RFC (<http://www.ietf.org/rfc/rfc4627.txt>), and
we probably don't need to impose any additional constraints beyond
specifying the UTF-8 encoding and referring to the RFC -- though an
ASCII-only limitation might be worth considering.
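As a strawman, a serialized schema might then look something like
this -- field names and layout purely illustrative, not a finalized
format:

{
  "format": 1,
  "fields": {
    "title":   { "analyzer": { "tokenizer": { "token_re": "\\w+(?:'\\w+)*" } } },
    "content": { "analyzer": { "tokenizer": { "token_re": "\\S+" } } }
  }
}

(The doubled backslashes are just JSON string escaping, which the RFC
pins down precisely.)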

Leaving everything to the JSON spec itself would impose the
requirement for a full-blown JSON parser on all fully conforming
apps. However, that's less onerous than a YAML or XML parser, and it
would still be possible to write a miniature subset parser a la the
current YAML parser in KinoSearch devel.

> Guess it in part depends on (1) how hard it is to write
> your own parser for either,

There are several JSON parsers on CPAN. One of them seems to stand
out: JSON::XS. <http://search.cpan.org/perldoc?JSON%3A%3AXS> The
author, Marc Lehmann, is gruff, but knowledgeable. Our two main
concerns are that the JSON be correct and that the distro build
reliably. Glancing over the documentation, the test results, the
Changes file, and some of the code, it looks to be suitable for
adding as a dependency.
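For instance, a round-trip sketch using the documented JSON::XS OO
interface:

use JSON::XS;

# ascii() forces \uXXXX escapes for anything non-ASCII; canonical()
# sorts hash keys so output is deterministic. Both are nice
# properties for a file format.
my $coder = JSON::XS->new->ascii->canonical->pretty;
my $json  = $coder->encode( { format => 1, fields => [ 'title', 'content' ] } );
my $data  = $coder->decode($json);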

> and (2) if you have any philosophical agenda to
> promote.

My goal is to write an inverted index file format spec that is easy to
implement and easy to extend. Whether metadata gets encoded as YAML,
XML, or JSON is incidental.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
Marvin Humphrey wrote on 11/3/07 4:06 PM:

> There are several JSON parsers on CPAN. One of them seems to stand out:
> JSON::XS. <http://search.cpan.org/perldoc?JSON%3A%3AXS> The author,
> Marc Lehmann, is gruff, but knowledgeable. Our two main concerns are
> that the JSON be correct and that the distro build reliably. Glancing
> over the documentation, the test results, the Changes file, and some of
> the code, it looks to be suitable for adding as a dependency.
>

I really like the JSON::Syck (and accompanying YAML::Syck) parser. libsyck is
BSD-licensed and very fast.

--
Peter Karman . http://peknet.com/ . peter@peknet.com

Re: Serialized Schema
On Nov 6, 2007, at 4:08 AM, Peter Karman wrote:

> I really like the JSON::Syck (and accompanying YAML::Syck) parser.
> libsyck is
> BSD-licensed and very fast.

The JSON::XS documentation has a "comparison" section where it runs
through the available alternatives, including JSON::Syck. Here's the
gripe list for JSON::Syck 0.21:

* Very buggy (often crashes).

* Very inflexible (no human-readable format supported, format pretty much
  undocumented. I need at least a format for easy reading by humans and a
  single-line compact format for use in a protocol, and preferably a way to
  generate ASCII-only JSON texts).

* Completely broken (and confusingly documented) Unicode handling (unicode
  escapes are not working properly, you need to set ImplicitUnicode to
  I<different> values on en- and decoding to get symmetric behaviour).

* No roundtripping (simple cases work, but this depends on whether the
  scalar value was used in a numeric context or not).

* Dumping hashes may skip hash values depending on iterator state.

* Unmaintained (maintainer unresponsive for many months, bugs are not
  getting fixed).

* Does not check input for validity (i.e. will accept non-JSON input and
  return "something" instead of raising an exception. This is a security
  issue: imagine two banks transferring money between each other using
  JSON. One bank might parse a given non-JSON request and deduct money,
  while the other might reject the transaction with a syntax error. While
  a good protocol will at least recover, that is extra unnecessary work
  and the transaction will still not succeed).

JSON::Syck is at 0.26 now, so it *is* maintained and some bugs are
getting fixed. Looking at the Changes file, though, I don't see any
mention of the rest.

Also, JSON::XS 1.52 has a better CPAN Testers report than YAML::Syck
0.99 (the distro that contains JSON::Syck): 55 passes and 5 N/As,
versus 94 passes and 11 failures.

I think the main underlying difference is that libsyck has to deal
with YAML while JSON::XS is just JSON, and JSON is a much easier spec
to implement than YAML. Douglas Crockford, author of the JSON spec,
got KISS right -- and the YAML people didn't.

Something to bear in mind while writing the InvIndex file format spec.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Serialized Schema
On 11/07/2007 01:26 PM, Marvin Humphrey wrote:
>
> On Nov 6, 2007, at 4:08 AM, Peter Karman wrote:
>
>> I really like the JSON::Syck (and accompanying YAML::Syck) parser.
>> libsyck is
>> BSD-licensed and very fast.
>
> The JSON::XS documentation has a "comparison" section where it runs
> through the available alternatives, including JSON::Syck. Here's the
> gripe list for JSON::Syck 0.21:

you've convinced me. :)

--
Peter Karman . peter@peknet.com . http://peknet.com/


Re: Serialized Schema
On Nov 7, 2007, at 12:01 PM, Peter Karman wrote:

> you've convinced me. :)

Cool. :)

The more I think of JSON as an independent data serialization
language rather than as "a subset of JavaScript", the happier I am
with using it as part of the InvIndex spec. It fills exactly the
same niche I wanted to fill with ASHL. The only thing I dislike is
that it's comment-less, but that doesn't matter in this case.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


