Mailing List Archive

adding a proximity scorer - Boilerplater
On Jun 14, 2007, at 10:30 PM, Nathan Kurz wrote:
> Is there a way I can do this with the BoilerPlater vtable stuff? I
> haven't actually figured out how ORScorer_tally() actually gets called
> from Scorer_Tally().

Scorer_tally ... is a function.
Scorer_Tally ... is a method.

If you call Scorer_tally(foo_scorer), it will execute Scorer_tally in
Scorer.c.

If you call Scorer_Tally(foo_scorer), it will execute FooScorer_tally
in FooScorer.c.

Take a look at the code BoilerPlater generates. Here's how
Scorer_Tally is defined:

    #define Kino_Scorer_Tally(self) \
        (self)->_->tally((kino_Scorer*)self)

self->_ is a vtable pointer. <http://en.wikipedia.org/wiki/Vtable>
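
To make the function/method distinction concrete, here is a minimal sketch of vtable dispatch in plain C. The struct layout and the names ScorerVTable, SCORER, FOOSCORER and the hits member are assumptions for illustration only; this is not the code BoilerPlater actually generates.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Scorer Scorer;

/* Hypothetical, simplified vtable: one slot per overridable method. */
typedef struct ScorerVTable {
    const char *class_name;
    int (*tally)(Scorer *self);
} ScorerVTable;

struct Scorer {
    const ScorerVTable *_;   /* vtable pointer, as in self->_ */
    int hits;                /* illustrative member, not from KinoSearch */
};

/* Functions: lowercase names, each always runs its own body. */
int Scorer_tally(Scorer *self)    { assert(self != NULL); return self->hits; }
int FooScorer_tally(Scorer *self) { assert(self != NULL); return self->hits * 2; }

/* Method: capitalized name, looks the function up in the object's vtable. */
#define Scorer_Tally(self) ((self)->_->tally((Scorer*)(self)))

static const ScorerVTable SCORER    = { "Scorer",    Scorer_tally };
static const ScorerVTable FOOSCORER = { "FooScorer", FooScorer_tally };
```

Given a Scorer whose vtable pointer is &FOOSCORER, Scorer_Tally(s) runs FooScorer_tally, while Scorer_tally(s) still runs the base function.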

To subclass OrScorer, create MyOrScorer.c and MyOrScorer.h following
the guidelines documented in the POD for devel/lib/BoilerPlater.pm.
All you'll need is a constructor and a MyOrScorer_tally function.
The rest will inherit.

(I'll reply to the rest of your message later, but I wanted to
expedite this bit.)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
adding a proximity scorer - Boilerplater [ In reply to ]
On 6/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> To subclass OrScorer, create MyOrScorer.c and MyOrScorer.h following
> the guidelines documented in the POD for devel/lib/BoilerPlater.pm.
> All you'll need is a constructor and a MyOrScorer_tally function.
> The rest will inherit.

Got it, and it seems to Build correctly. But I'll find out for sure
when I actually write something that calls it!

First subclass subquestion: why do I actually need a constructor? Is
the class_name field actually used at runtime? And if I do need a
constructor, is it possible to piggyback on top of the constructor for
the class I'm inheriting from? Something like:

    MyORScorer *self = STEAL(MyORScorer, ORScorer_new(sim, subscorers));
    (or BLESS, or RECREATE, RECLASS, whatever)

I don't understand the innards of your vtable implementation well
enough to know quite what this would need to do, but if it is possible
it seems likely you've already done it somewhere but I just haven't
found where.

Nathan Kurz
nate@verse.com
adding a proximity scorer - Boilerplater [ In reply to ]
On Jun 15, 2007, at 1:52 PM, Nathan Kurz wrote:

> On 6/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>> To subclass OrScorer, create MyOrScorer.c and MyOrScorer.h following
>> the guidelines documented in the POD for devel/lib/BoilerPlater.pm.
>> All you'll need is a constructor and a MyOrScorer_tally function.
>> The rest will inherit.
>
> Got it, and it seems to Build correctly.

Wow, hot damn.

The BoilerPlater stuff came out well, but it wasn't and isn't really
designed to be a public API. It arose out of necessity because the
faked-up inheritance schemes that Dave Balmain was using with Ferret
and I was using with KS 0.15 were messy and scaled poorly. The
design was hashed out last fall on the Lucy developer's list.

> First subclass subquestion: why do I actually need a constructor?

In theory you could write an XS function that "reblessed" the object
by changing the vtable pointer to point at a different one.

MODULE = KinoSearch    PACKAGE = KinoSearch::Search::MyOrScorer

kino_MyORScorer*
new(unused, sim, subscorers)
    SV *unused;
    kino_Similarity *sim;
    kino_VArray *subscorers;
CODE:
    RETVAL = kino_ORScorer_new(sim, subscorers);
    RETVAL->_ = &KINO_MYORSCORER; /* <-------- rebless */
OUTPUT: RETVAL

I've opted never to do that mainly because I want XS code to be
limited to glue whenever possible. XS is powerful, but it's nasty
and esoteric.

Take a look at RichPostingScorer_new() -- it does exactly the same
thing as that XS function above, but within a dedicated C constructor.

> Is the class_name field actually used at runtime?

Yes, absolutely. self->_->class_name is used all over the place,
particularly when crossing the Perl/C boundary.

Note that class_name is a member of the vtable, and not a member of
the object struct. That means we don't have to waste space in every
object with an extra pointer to the class name -- but also that the
class name of an object is fixed.
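
A small sketch of that trade-off, assuming a stripped-down vtable that holds nothing but the class name. The VTable and Obj types and the Obj_class_name helper are hypothetical; only the self->_->class_name access pattern comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct VTable {
    const char *class_name;   /* stored once per class, not per object */
} VTable;

typedef struct Obj {
    const VTable *_;          /* the only per-object class bookkeeping */
    int payload;              /* illustrative member */
} Obj;

static const VTable SCORER   = { "KinoSearch::Search::Scorer" };
static const VTable ORSCORER = { "KinoSearch::Search::ORScorer" };

/* Look the class name up through the vtable, as self->_->class_name does. */
const char *Obj_class_name(const Obj *self) {
    assert(self != NULL && self->_ != NULL);
    return self->_->class_name;
}
```

Every object sharing a vtable reports the same string, and the only way to change what an object reports is to point it at a different vtable.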

> And if I do need a
> constructor, is it possible to piggyback on top of the constructor for
> the class I'm inheriting from? Something like:
>
> MyORScorer *self = STEAL(MyORScorer, ORScorer_new(sim, subscorers));
> (or BLESS, or RECREATE, RECLASS, whatever)

Cool idea, but it would have to look slightly different, because of
the limitations of C syntax. It would have to be either a function,
or a multi-line macro like this:

    #define KINO_RECLASS(var, obj, type, vtable) \
        type* var = (type*)(obj); \
        var->_ = &(vtable)
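
For comparison, here is a rough sketch of the function alternative, which keeps the single-expression feel of the STEAL idea. Everything in it is an assumption for illustration: the reclass() helper, the simplified VTable, the num_subscorers member, and the choice to have the subclass embed the parent struct as its first member. It only works when the vtable pointer is the first member and the subclass adds no new members.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct VTable { const char *class_name; } VTable;

typedef struct ORScorer {
    const VTable *_;      /* vtable pointer must be the first member */
    int num_subscorers;   /* illustrative member */
} ORScorer;

/* The subclass adds no members, so it simply wraps the parent struct. */
typedef struct MyORScorer { ORScorer base; } MyORScorer;

static const VTable ORSCORER   = { "ORScorer" };
static const VTable MYORSCORER = { "MyORScorer" };

ORScorer *ORScorer_new(int num_subscorers) {
    ORScorer *self = malloc(sizeof *self);
    assert(self != NULL);
    self->_ = &ORSCORER;
    self->num_subscorers = num_subscorers;
    return self;
}

/* Overwrite the leading vtable pointer and hand the same object back. */
void *reclass(void *obj, const VTable *vtable) {
    assert(obj != NULL);
    *(const VTable **)obj = vtable;
    return obj;
}

MyORScorer *MyORScorer_new(int num_subscorers) {
    return reclass(ORScorer_new(num_subscorers), &MYORSCORER);
}
```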

> I don't understand the innards of your vtable implementation well
> enough to know quite what this would need to do, but if it is possible
> it seems likely you've already done it somewhere but I just haven't
> found where.

Yes. Dynamic subclassing is supported via DynVirtualTable and the
CREATE_SUBCLASS macro defined in MemManager.h.

The implementation is fairly complex, which is unfortunate, because
it doesn't accomplish very much -- it just allows the object to be
associated with an arbitrary class name. :\ The feature is used by
Schema, FieldSpec and Similarity to allow users to subclass via Perl
without knowing anything about the underlying C objects.

Ideally, our discussion will result in an improvement upon that
scheme that will allow you to write your ORScorer subclass without
touching BoilerPlater. Something like this:

package MyORScorer;
use base qw( KinoSearch::Search::ORScorer );

__PACKAGE__->register_c_method( tally => 'my_tally' );

use Inline => C << 'END_C';

kino_Tally*
my_tally(kino_OrScorer *self) {
/* ... */
}

END_C

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
adding a proximity scorer - Boilerplater [ In reply to ]
On 6/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> Wow, hot damn.

I'm continuing along, and things seem to be going well. I have a
subclassed Parser, Query, and Scorer that do very little other than
inherit from their counterparts and print some junk to prove they are
being called, but it's exciting that they exist. BoilerPlater seems
solid and flexible so far.

> The BoilerPlater stuff came out well, but it wasn't and isn't really
> designed to be a public API. It arose out of necessity because the
> faked-up inheritance schemes that Dave Balmain was using with Ferret
> and I was using with KS 0.15 were messy and scaled poorly. The
> design was hashed out last fall on the Lucy developer's list.

I read through some of that when I was trying to get my bearings.
I'll try to read some more. My impression so far is that you've got a
great implementation of a lousy API, and leaving Lucene in the dust is
definitely the right plan.

That sounds too harsh: there's a lot of good thought in Lucene, but
it's a little too much accretive thought and too little reductive
thought. I'd certainly prefer a clearer code path and fewer of the
twisty mazes.

> Cool idea, but it would have to look slightly different, because of
> the limitations of C syntax. It would have to be either a function,
> or a multi-line macro like this:

I'm going with this for now, which seems reasonable to me:

/* ADOPT is an alternative to CREATE for a subclass constructor that wants to
   start where the parent constructor left off. For example:
     SubClass *SubClass_new(parent_args, new_arg) {
         ADOPT(self, Parent_new(parent_args), SubClass, SUBCLASS);
         self->var = new_arg;
         return self;
     }
*/
#define ADOPT(var, instance, type, vtable) \
    type *var = (type *) (instance); \
    var->_ = &(vtable); \
    var = KINO_REALLOCATE(var, 1, type);

Does it seem like this would work?
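
A self-contained sketch of how such a macro behaves, under stated assumptions: REALLOCATE is a stand-in built on plain realloc() (the real KINO_REALLOCATE is not shown in this thread), the VTable is reduced to a class name, and the subclass struct repeats the parent's members in the same order before adding its own.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct VTable { const char *class_name; } VTable;

typedef struct Parent   { const VTable *_; int base; } Parent;
/* The subclass repeats the parent's members in order, then adds its own. */
typedef struct SubClass { const VTable *_; int base; int var; } SubClass;

static const VTable PARENT   = { "Parent" };
static const VTable SUBCLASS = { "SubClass" };

/* Stand-in for KINO_REALLOCATE, whose real definition is not shown here. */
#define REALLOCATE(ptr, count, type) \
    ((type *)realloc((ptr), (count) * sizeof(type)))

#define ADOPT(var, instance, type, vtable) \
    type *var = (type *)(instance);        \
    var->_ = &(vtable);                    \
    var = REALLOCATE(var, 1, type)

Parent *Parent_new(int base) {
    Parent *self = malloc(sizeof *self);
    assert(self != NULL);
    self->_ = &PARENT;
    self->base = base;
    return self;
}

SubClass *SubClass_new(int base, int new_arg) {
    ADOPT(self, Parent_new(base), SubClass, SUBCLASS);
    assert(self != NULL);     /* realloc can fail */
    self->var = new_arg;      /* grown memory is uninitialized until set */
    return self;
}
```

Two things to keep in mind with this approach: the parent's members survive the realloc, but any member the subclass adds holds garbage until the constructor assigns it, and the macro has to sit at a point where a declaration is legal.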

> Ideally, our discussion will result in an improvement upon that
> scheme that will allow you to write your ORScorer subclass without
> touching BoilerPlater. Something like this:
>
> package MyORScorer;
> use base qw( KinoSearch::Search::ORScorer );
>
> __PACKAGE__->register_c_method( tally => 'my_tally' );
>
> use Inline => C << 'END_C';
>
> kino_Tally*
> my_tally(kino_OrScorer *self) {
> /* ... */
> }
>
> END_C

That seems like a great goal. For now I'm happy writing C. Perhaps
more useful for most people would be the ability to override a
BoilerPlated C method with a Perl function, with it automatically
wrapped in just enough C to push the args. You aren't already doing
this anywhere, are you?

Personally, though, I'd probably rather see a greater split between
the Perl and the C. I love them both individually, but I'd be more
comfortable with a standard C library (libidf?) with a Perl wrapper
and a clearly defined boundary. I guess I think that would be both
clearer and potentially faster*. But it sure did feel slick to be
able to overlay a single function in C!


Goodnight!

--nate

* Yes, I read your testing about the negligible effect of the class
finalization. But it's the function overhead that worries me, not the
lookup. Being addicted to speed, I drool over the speedup possible if you
flattened the scoring loop into something inline, especially if you were
going directly over the mmap'd indexes.
adding a proximity scorer - Boilerplater [ In reply to ]
On Jun 16, 2007, at 12:05 AM, Nathan Kurz wrote:

> I'm going with this for now, which seems reasonable to me:
>
> /* ADOPT is an alternative to CREATE for a subclass constructor that
>    wants to start where the parent constructor left off. For example:
>      SubClass *SubClass_new(parent_args, new_arg) {
>          ADOPT(self, Parent_new(parent_args), SubClass, SUBCLASS);
>          self->var = new_arg;
>          return self;
>      }
> */
> #define ADOPT(var, instance, type, vtable) \
>     type *var = (type *) (instance); \
>     var->_ = &(vtable); \
>     var = KINO_REALLOCATE(var, 1, type);
>
> Does it seem like this would work?

Your approach is probably cleaner than the one I've taken, which has
been to create two functions: Parent_new() and Parent_init_base().

An example taken from BitVector is pasted below my sig. DelDocs,
which is a subclass of BitVector, supplies its own memory chunk to
BitVec_init_base() instead of reallocating an instance returned by
BitVec_new() -- avoiding the expense of double allocation.

However, I can see doing things your way by default, then going to
the two-function setup only for performance-critical constructors.
There aren't very many of those -- and DelDocs_new() probably isn't
one of them. :)
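
A rough sketch of the two-function pattern, for readers without the source at hand. This is not the actual BitVector code: the member names, the simplified VTable, and the choice to embed the parent struct as the subclass's first member are all assumptions. Only the function names BitVec_new(), BitVec_init_base() and DelDocs_new() come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct VTable { const char *class_name; } VTable;

typedef struct BitVector {
    const VTable *_;
    unsigned capacity;        /* number of bits */
    unsigned char *bits;
} BitVector;

/* Sketch-only layout: the subclass embeds the parent as its first member. */
typedef struct DelDocs {
    BitVector base;
    unsigned del_count;
} DelDocs;

static const VTable BITVECTOR = { "BitVector" };
static const VTable DELDOCS   = { "DelDocs" };

/* Fill in the parent's members in memory the caller already owns. */
void BitVec_init_base(BitVector *self, unsigned capacity) {
    self->capacity = capacity;
    self->bits = calloc(capacity / 8 + 1, 1);
    assert(self->bits != NULL);
}

/* Allocate a plain BitVector, then delegate to init_base. */
BitVector *BitVec_new(unsigned capacity) {
    BitVector *self = malloc(sizeof *self);
    assert(self != NULL);
    self->_ = &BITVECTOR;
    BitVec_init_base(self, capacity);
    return self;
}

/* One allocation sized for the subclass: no realloc needed. */
DelDocs *DelDocs_new(unsigned capacity) {
    DelDocs *self = malloc(sizeof *self);
    assert(self != NULL);
    self->base._ = &DELDOCS;
    BitVec_init_base(&self->base, capacity);
    self->del_count = 0;
    return self;
}
```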

What I really want to do, though, is figure out how to implement KS
objects using the inside-out object model. I believe that this is
the key to realizing the goals I've laid out for subclassing Scorer.

    int
    SubClass_new(parent_args, new_arg) {
        int me = Parent_new(parent_args);
        allocate_and_store(me, new_arg);
        return me;
    }

It can't be quite so simple, but that's the general idea.
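
One way to picture the idea in C, as a deliberately naive sketch: the object is just an integer id, and each class keeps its own storage keyed by that id. The fixed-size arrays, the MAX_OBJECTS limit and the accessor names are assumptions made to keep the example short; a real implementation would need growable storage and a way to release ids.

```c
#include <assert.h>

#define MAX_OBJECTS 64   /* fixed capacity keeps the sketch short */

/* Each class keeps its own per-object storage, keyed by object id. */
static int parent_base[MAX_OBJECTS];
static int subclass_var[MAX_OBJECTS];
static int next_id = 0;

int Parent_new(int base) {
    assert(next_id < MAX_OBJECTS);
    int me = next_id++;
    parent_base[me] = base;
    return me;
}

int SubClass_new(int base, int new_arg) {
    int me = Parent_new(base);
    subclass_var[me] = new_arg;   /* the allocate_and_store step, simplified */
    return me;
}

/* Member access becomes a table lookup rather than a struct offset. */
int Parent_get_base(int me)  { return parent_base[me]; }
int SubClass_get_var(int me) { return subclass_var[me]; }
```

The subclass never needs to know the parent's layout, which is the attraction; the price is that an object's data is spread across separate tables.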

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
adding a proximity scorer - Boilerplater [ In reply to ]
On 6/16/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> Your approach is probably cleaner than the one I've taken, which has
> been to create two functions: Parent_new() and Parent_init_base().

Being more efficient, your way is the way I would approach it if I were
designing from scratch. Having class_new() do the allocation and then
call class_init() is my standard for C. It feels
funny to do it that way in Perl, though, since the allocation is
usually just an empty hash ref. And if the goal is to blur the
distinction between C and Perl, a common strategy for both would be
good.

My first (well, second) instinct was that the realloc() wouldn't be
that expensive, since allocations are often oversized to begin with,
and even if changing the real size is necessary, resizing the last
object allocated might be optimized. But a quick test in a
loop made it seem like the realloc call is just about as expensive as
the original malloc.

> However, I can see doing things your way by default, then going to
> the two-function setup only for performance-critical constructors.
> There aren't very many of those -- and DelDocs_new() probably isn't
> one of them. :)

And for most of the performance critical sections, one is going to
avoid dynamic allocation anyway. The intersection of performance
critical sections and those that call constructors should be small.

> What I really want to do, though, is figure out how to implement KS
> objects using the inside-out object model. I believe that this is
> the key to realizing the goals I've laid out for subclassing Scorer.

I hadn't read about that model until you mentioned it here. I can see
why the foreign object inheritance ability would be very useful. I
don't know if it would be a good fit for a performance sensitive
section in C, however. If the object is already a hash table, you
aren't losing much in performance. But if the object is a C struct,
looking up elements is currently zero-cost. Having to dereference (or
even call a function) every time you need an element seems both
expensive and at odds with any sort of compile time optimization. And
the lack of locality of an object seems like it would be very bad
for staying within processor cache. Are there implementations that
would get around this?

I fear that until I'm more familiar with the KS code as it exists, I'm
not going to be very useful in design discussions. Hopefully I'll
have some better ideas once I've spent some more time working in
the system.

Nathan Kurz
nate@verse.com