Mailing List Archive

KSx-IndexManager 0.001
I've just uploaded KSx-IndexManager 0.001. As the version number indicates,
this isn't very polished yet.

I wrote this because I wanted to call fewer methods in code that dealt with
invindexes (both reading and writing). Compare eg/invindex.pl to KinoSearch's
sample/invindex.plx to get an idea of what IndexManager covers.

The big missing feature I have in mind is the concept of automatic data
collection; I want to tell the IndexManager to 'load from scratch' or 'update'
and have it know how to find the documents.

However, I also want to actually release this, and I'm running out of momentum,
so I figured I'd get a preliminary version onto CPAN and hope that it generates
interest.

hdp.
KSx-IndexManager 0.001 [ In reply to ]
On 7/1/07, Hans Dieter Pearcey <hdp@pobox.com> wrote:
> I've just uploaded KSx-IndexManager 0.001. As the version number indicates,
> this isn't very polished yet.

Either it hasn't propagated yet, or I'm looking in the wrong spot.
Would you have a direct link?

Nathan Kurz
nate@verse.com
KSx-IndexManager 0.001 [ In reply to ]
On Sun, Jul 01, 2007 at 12:19:46PM -0600, Nathan Kurz wrote:
> On 7/1/07, Hans Dieter Pearcey <hdp@pobox.com> wrote:
> > I've just uploaded KSx-IndexManager 0.001. As the version number indicates,
> > this isn't very polished yet.
>
> Either it hasn't propagated yet, or I'm looking in the wrong spot.
> Would you have a direct link?

It hasn't propagated yet. I put up a copy here:

http://vex.pobox.com/~hdp/KSx-IndexManager-0.001.tar.gz

It's in a git repo that isn't public yet. I should set up gitweb soon.

hdp.
KSx-IndexManager 0.001 [ In reply to ]
On Jul 1, 2007, at 10:37 AM, Hans Dieter Pearcey wrote:

> I've just uploaded KSx-IndexManager 0.001.

I see that this module is now visible on search.cpan.org. (: ...and
so is KSx::Searcher::Abstract. :)

Cool stuff! The way it combines both indexing and searching reminds
me of Ferret::Index [that's a Ruby class for those of you unfamiliar
with Ferret] and Plucene::Simple (upon which KinoSearch::Simple is
loosely based). It's more like Ferret::Index, though, in that it
doesn't hide the interfaces of the classes it replaces -- you still
need to grok Schema, InvIndexer, Searcher, Hits, and implicitly,
FieldSpec and PolyAnalyzer in order to use it.

> I wrote this because I wanted to call fewer methods in code that dealt
> with invindexes (both reading and writing).

I can certainly see how using it will result in less code. You and I
appear to have different, but complementary ideas about class design,
and I'm cheesed to see your alternative approach. :)

KinoSearch has a few convenience methods here and there, notably
Schema->open and friends. For the most part, though, I try to avoid
them.

* Fewer moving parts, so fewer things can go wrong.
* Less work before users feel confident that they've grokked an
entire class and acquire a sense of mastery over it.

Convenience methods are a burden if you don't actually use them, and
different people often have different ideas about what's convenient.
I think KinoSearch itself needs to stay streamlined and low-level
enough that people such as yourself have no trouble assembling KS
components into larger tools.

However, I also find a lot of what you've done attractive. Those
write() and append() methods are pretty slick! On some level, it
would be nice to add them to InvIndexer itself, along with add_docs()
-- but having them available via your module distro is even better!

> Compare eg/invindex.pl to KinoSearch's
> sample/invindex.plx to get an idea of what IndexManager covers.

I think that file is actually missing from your distro. It's easy
for me at least to see what's going on, but can you please forward
some sample code to the list?

In addition to the minimalist apps, I'd like to see an example of how
you would use KSx::IndexManager::Plugin::Partition to choose one
invindex from among many at both index-time and search-time. KS has
MultiSearcher, but it doesn't provide a stock answer for how to
multiplex indexing; perhaps you've come up with a good model.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
KSx-IndexManager 0.001 [ In reply to ]
On Sun, Jul 01, 2007 at 04:24:29PM -0700, Marvin Humphrey wrote:
> I think KinoSearch itself needs to stay streamlined and low-level
> enough that people such as yourself have no trouble assembling KS
> components into larger tools.

I agree, which is why I wrote these as separate distributions rather than
suggesting that they be patches to KinoSearch.

> >Compare eg/invindex.pl to KinoSearch's
> >sample/invindex.plx to get an idea of what IndexManager covers.
>
> I think that file is actually missing from your distro. It's easy
> for me at least to see what's going on, but can you please forward
> some sample code to the list?

Oops. 0.002 has the eg/ directory in it.

http://vex.pobox.com/~hdp/KSx-IndexManager-0.002.tar.gz

(or CPAN, soon; it's uploaded already)

> In addition to the minimalist apps, I'd like to see an example of how
> you would use KSx::IndexManager::Plugin::Partition to choose one
> invindex from among many at both index-time and search-time. KS has
> MultiSearcher, but it doesn't provide a stock answer for how to
> multiplex indexing; perhaps you've come up with a good model.

They're solving different problems. Partition is for when you index e.g. by
user, where users will never search each other's data. MultiSearcher is for
aggregation.

Suppose you have:

My::Manager->add_plugins(Partition => { method => 'id' });

Then:

my $mgr = My::Manager->new({ root => "/index" });

while (... get some input ...) {
    my $user = User->get(... based on input, say id => 42);
    $mgr->context($user);
    my $hits = $mgr->search(%args); # searches /index/42
}

And in another program:

my $mgr = My::Manager->new({ root => "/index" });

for my $user (@users) {
    $mgr->context($user);
    $mgr->append([ $user->get_new_docs ]); # adds to /index/<ID>
}
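
The path mapping implied by the comments above (root plus the value of the configured method) can be sketched in plain Perl. This is a minimal illustration, not the plugin's actual code: partition_path and My::User are hypothetical names, and the check on the key is an added assumption to keep a context value from escaping the root directory.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of KSx::IndexManager: build the invindex
# directory for a context object by calling the configured method on it.
sub partition_path {
    my ($root, $method, $context) = @_;
    my $key = $context->$method;

    # Assumption for this sketch: only accept simple names as keys, so a
    # value such as "../x" cannot point outside the root.
    die "partition key must be a plain name\n"
        unless defined $key && $key =~ /\A[\w.-]+\z/ && $key !~ /\A\.+\z/;

    return File::Spec->catdir($root, $key);
}

# Stand-in for the User class used in the example above.
package My::User {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub id  { $_[0]{id} }
}

my $user = My::User->new(id => 42);

# On a Unix-like system this prints /index/42
print partition_path('/index', 'id', $user), "\n";
```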

hdp.
KSx-IndexManager 0.001 [ In reply to ]
> They're solving different problems. Partition is for when you index e.g. by
> user, where users will never search each other's data. MultiSearcher is for
> aggregation.

On the partition stuff: let's say I have two types of searches. User
searches only search within a specific user's data, and global searches
search everything. Would Partition be appropriate for this?

KSx-IndexManager 0.001 [ In reply to ]
On Mon, Jul 02, 2007 at 08:53:39AM -0700, Mike Wexler wrote:
> On the partition stuff. Lets say I have two types of searches. User
> searches, which only search within a specific user. And global searches
> that search everything. Would partition be appropriate for this?

It depends a great deal on what you're doing. The Partition plugin is just a
convenient way to access multiple invindexes that share a single schema. It's
written with the assumption that you can use MultiSearcher for anything that
needs to read from multiple partitions, which may or may not be true for you
(see MultiSearcher's docs for its limitations).
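
To make the two kinds of search concrete, here is a naive sketch of what a global search has to do on top of per-user partitions: collect hits from each one and merge them by score. It is not MultiSearcher's API. merge_hits and the hit hashes are hypothetical, and the merge assumes that scores from different partitions are directly comparable, which may not hold in practice.

```perl
use strict;
use warnings;

# Hypothetical merge step: each argument after $limit is an array ref of
# hits, where a hit is { id => ..., score => ... }. Highest scores first,
# ties broken by id so the order is stable.
sub merge_hits {
    my ($limit, @per_partition) = @_;
    my @all = sort { $b->{score} <=> $a->{score} or $a->{id} cmp $b->{id} }
              map  { @$_ } @per_partition;
    $#all = $limit - 1 if @all > $limit;
    return @all;
}

# Made-up results from two partitions, /index/42 and /index/43.
my @user_42 = ({ id => 'a', score => 0.9 }, { id => 'b', score => 0.4 });
my @user_43 = ({ id => 'c', score => 0.7 });

# A user search would read one list; a global search merges all of them.
my @global = merge_hits(2, \@user_42, \@user_43);
print join(', ', map { "$_->{id}:$_->{score}" } @global), "\n";
# prints: a:0.9, c:0.7
```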

hdp.
KSx-IndexManager 0.001 [ In reply to ]
On Jul 2, 2007, at 8:12 AM, Hans Dieter Pearcey wrote:
> 0.002 has the eg/ directory in it.

Nice.

> Partition is for when you index e.g. by user, where users will never
> search each other's data. MultiSearcher is for aggregation.

Before you aggregate with MultiSearcher, you have to distribute index
writing. That phase looks identical.

It's not a terribly hard problem to solve. I think we need a HowTo,
though... so I've started one.

http://www.rectangular.com/kinosearch/wiki/ScalingUp

> for my $user (@users) {
> $mgr->context($user);
> $mgr->append([ $user->get_new_docs ]); # adds to /index/<ID>
> }

One issue with this architecture is that each InvIndexer consumes a
fair amount of memory.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
KSx-IndexManager 0.001 [ In reply to ]
On Tue, Jul 03, 2007 at 01:07:05AM -0700, Marvin Humphrey wrote:
> > for my $user (@users) {
> > $mgr->context($user);
> > $mgr->append([ $user->get_new_docs ]); # adds to /index/<ID>
> > }
>
> One issue with this architecture is that each InvIndexer consumes a
> fair amount of memory.

When using ->write and ->append, invindexers aren't cached (except for the
duration of that particular call). Each call automatically calls ->finish, and
then the invindexer falls out of scope.
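
The lifecycle described here can be illustrated with a stand-in class. Mock::Indexer and append_docs are hypothetical and only record what happens to them. They are not the KinoSearch or IndexManager classes, but they show one indexer being opened, finished, and released per call, so that no more than one is alive at a time.

```perl
use strict;
use warnings;

# Stand-in for an invindexer; NOT the real class. It logs its lifecycle.
package Mock::Indexer {
    our @log;
    sub new {
        my ($class, $path) = @_;
        push @log, "open $path";
        return bless { path => $path, n => 0 }, $class;
    }
    sub add_doc { $_[0]{n}++ }
    sub finish  { push @log, "finish $_[0]{path} ($_[0]{n} docs)" }
    sub DESTROY { push @log, "release $_[0]{path}" }
}

# Hypothetical equivalent of one append call: the indexer is a lexical,
# so it is destroyed as soon as the sub returns.
sub append_docs {
    my ($path, $docs) = @_;
    my $indexer = Mock::Indexer->new($path);
    $indexer->add_doc($_) for @$docs;
    $indexer->finish;
    return scalar @$docs;
}

append_docs("/index/$_", [ 'doc1', 'doc2' ]) for 42, 43;
print "$_\n" for @Mock::Indexer::log;
# prints:
#   open /index/42
#   finish /index/42 (2 docs)
#   release /index/42
#   open /index/43
#   finish /index/43 (2 docs)
#   release /index/43
```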

Originally I had intended that write and append could figure out per-document
which invindex that document was intended for. It ended up being too twisty
inside, though, so I settled for the explicit context-setting.
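
With explicit context-setting, the per-document routing moves into the calling code. One way to do that, sketched here in plain Perl, is to group a mixed batch by partition key first and then make one call per group. group_by_partition and the user_id field are assumptions for illustration, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: bucket documents by the value of one field so that
# each partition only needs to be opened once per batch.
sub group_by_partition {
    my ($key, @docs) = @_;
    my %groups;
    push @{ $groups{ $_->{$key} } }, $_ for @docs;
    return \%groups;
}

my @docs = (
    { user_id => 42, title => 'first'  },
    { user_id => 43, title => 'second' },
    { user_id => 42, title => 'third'  },
);

my $groups = group_by_partition('user_id', @docs);
for my $id (sort { $a <=> $b } keys %$groups) {
    # In real code, this is where the context would be set and the whole
    # group appended in one call; here the grouping is just printed.
    printf "%s: %s\n", $id, join(', ', map { $_->{title} } @{ $groups->{$id} });
}
# prints:
#   42: first, third
#   43: second
```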

hdp.