Hi Marvin --

In your last response, you asked why it was bad to have term scorers
be specific to TF/IDF scoring. I didn't do a good job of answering
clearly. I've been thinking about it more, and hopefully I can do a
better job of explaining why this time.

The problem is that while it is easy to subclass a term scorer, it's
difficult to get that subclassed scorer to be used. In the general
case you don't know what sort of posting format you are dealing with,
so you ask the posting list to generate a scorer for you. How do you
subclass something that is generated for you?

Here are some alternatives I've considered:

I could bypass PostingList->make_scorer, as PhraseScorer does, and
work directly on the Posting. This feels presumptuous, and reliant on
Posting internals. I'm confused by your comfort in having
PhraseScorer do so.

I could call PostingList->make_scorer, and then twiddle with the
VTable so that the scorer calls my custom tally function. This feels
tricksy, and my custom tally function would still have to forcibly
cast the Posting into something specific.

I could try to wrap the posting-specific scorer with my own, and never
call its tally, but the tally I want needs access to posting->impact
(among other things), so I'd have to peek at the internals of the
posting anyway.

I could change the architecture so that rather than passing a Sim
object, I pass a Tally object with a run method. Then Scorer_Tally()
is changed to call scorer->tally->run(scorer, tally). Then I
wouldn't have to subclass the generated scorer, as it would already be
customized.

(I like that this further encapsulates the particular scoring scheme,
but it would be complex, and would still require the internals of the
Scorer to be exposed, if only to the Tally object. Either it would
need an ISA() check and a cast, or some way of interrogating the
Scorer as to whether certain features are available.)
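
To make the last alternative concrete, here is a rough sketch in C of
what I have in mind. Apart from Scorer_Tally(), the run method, and
posting->impact, every name and struct layout below (Scorer_make,
FreqTally_run, ImpactTally_run, has_impact, weight) is made up for
illustration and is not the current code. The frequency-based scheme
is a deliberately simplified stand-in, not real TF/IDF.

```c
/* Hypothetical layouts, for illustration only. */
typedef struct Posting {
    unsigned doc_id;
    unsigned freq;        /* term frequency in this document */
    int      has_impact;  /* nonzero if this posting format stores an impact */
    float    impact;      /* precomputed per-posting weight, if present */
} Posting;

typedef struct Scorer Scorer;
typedef struct Tally  Tally;

/* The scoring scheme lives here instead of in a Scorer subclass. */
struct Tally {
    float (*run)(const Scorer *scorer, const Tally *tally);
    float weight;         /* scheme-specific parameter (assumed) */
};

/* Stand-in for whatever PostingList->make_scorer would return. */
struct Scorer {
    const Posting *posting;   /* current posting */
    const Tally   *tally;     /* injected when the scorer is made */
};

/* Hypothetical constructor: the posting list would pick the scorer
 * type, while the caller picks the Tally. */
static Scorer Scorer_make(const Posting *posting, const Tally *tally) {
    Scorer scorer = { posting, tally };
    return scorer;
}

/* Scorer_Tally() no longer knows the scoring scheme; it delegates. */
static float Scorer_Tally(const Scorer *scorer) {
    return scorer->tally->run(scorer, scorer->tally);
}

/* Simplified frequency-based scheme. */
static float FreqTally_run(const Scorer *scorer, const Tally *tally) {
    return (float)scorer->posting->freq * tally->weight;
}

/* Impact-based scheme: it has to ask whether the feature is available,
 * and falls back to the frequency-based scheme when it is not. */
static float ImpactTally_run(const Scorer *scorer, const Tally *tally) {
    const Posting *posting = scorer->posting;
    if (!posting->has_impact) {
        return FreqTally_run(scorer, tally);
    }
    return posting->impact * tally->weight;
}
```

Swapping schemes then means handing a different Tally to whatever
makes the scorer, with no subclassing of the generated scorer. Note
that ImpactTally_run still reaches into the Posting directly, which is
exactly the exposure I'm uneasy about.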

Anyway, is there a solution you would recommend here? I feel like I
must be missing something obvious. I'd like to have a path that would
continue to work even as I move to other posting formats, particularly
the mmap() approach I mentioned.

Thanks again!

Nathan Kurz
nate@verse.com