Mailing List Archive: Roadmap .30 and Scorers

Roadmap .30 and Scorers

bramble.andrew at gmail

Jul 22, 2008, 5:28 PM

Post #1 of 7 (10800 views)

Hello,

After getting useful results, and fast, with KinoSearch .20, I began looking at
ways to narrow results further using field-specific refinements, e.g. having
CPAN metadata indexed and being able to slice into it by a license field.
Might it be possible for a Scorer (I think it's a scorer) to compute, from
within the set of matched results, the total frequency of tokens from a
given field? To use the CPAN example again: rather than choosing to search
for "date parser" and license:artistic, might the initial search for
"date parser" return the matching results AND a structure describing that, of
100 matched documents, the field 'license' breaks down to perl=50,
artistic=30, gpl=10, bsd=5, apache=5?
One could then repeat the original search, adding 'license:perl' to
narrow the search to only the 50 matching documents.
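
For illustration, the breakdown asked about here can be sketched in a few
lines. This is Python rather than Perl/XS and uses no KinoSearch API; the
`facet_counts` and `narrow` helpers and the sample documents are made up for
the example. It only shows the counting step over an already-matched result
set.

```python
from collections import Counter


def facet_counts(matched_docs, field):
    """Count how often each value of `field` occurs among the matched docs."""
    counts = Counter()
    for doc in matched_docs:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    return counts


def narrow(matched_docs, field, value):
    """Keep only the matched docs whose `field` equals `value`."""
    return [doc for doc in matched_docs if doc.get(field) == value]


# Hypothetical result set for "date parser": 100 docs with the license
# breakdown used in the post.
breakdown = {"perl": 50, "artistic": 30, "gpl": 10, "bsd": 5, "apache": 5}
matched = [
    {"dist": f"{lic}-{i}", "license": lic}
    for lic, n in breakdown.items()
    for i in range(n)
]

counts = facet_counts(matched, "license")
print(dict(counts))
print(len(narrow(matched, "license", "perl")))
```

A real engine would read the field values from the index rather than from
in-memory dicts, which is why the per-hit loop is the part that would likely
want to live in C.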

Since this would require reading/examining each matched record, I would
guess this belongs in XS/C rather than Perl.

Is it wishful thinking? Or might this be possible with subclassable
scorers/hit collectors?

++KinoSearch

Andrew

Re: Roadmap .30 and Scorers

justin at devuyst

Jul 22, 2008, 5:57 PM

Post #2 of 7 (10404 views)

Hello,

I was playing around with indexing and searching CPAN with KinoSearch
recently myself. Could you elaborate on what your plans are? I'd
like to move on to something else if someone else is already doing
what I would like to see happen.

Basically my goal is to make searchable, in one place, everything
known about modules on the CPAN. Whether KinoSearch can fit the
whole bill or just part of the bill I'm still not sure of.

Thanks,
jdv


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Roadmap .30 and Scorers

bramble.andrew at gmail

Jul 22, 2008, 7:11 PM

Post #3 of 7 (10397 views)

Justin,

Plans? Plans? I'm still struggling for ideas :) Don't stop on my account.
I never finish anythi...

KinoSearch was one of several approaches to indexing and slicing CPAN data.
My original hack was not a text-based search but rather an index of
META.yml information against distributions, allowing for queries like
requires:Test::More (you'll never guess the headslap effect of
'set_heed_colons' when doing this with QueryParser) or

requires: Test::More
license: apache

The original naive implementation used Graph and some list utils for
intersections.
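
A rough sketch of that kind of metadata index, written in Python with plain
sets instead of Graph and list utilities. The `MetaIndex` class and the
distribution names are hypothetical, not the original code; the point is only
that each field:value pair maps to a set of distributions and a multi-field
query is the intersection of those sets.

```python
from collections import defaultdict


class MetaIndex:
    """Maps (field, value) pairs to the set of distributions that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, dist, meta):
        """Index one distribution's metadata; values may be a string or a list."""
        for field, values in meta.items():
            if isinstance(values, str):
                values = [values]
            for value in values:
                self._postings[(field, value)].add(dist)

    def query(self, **criteria):
        """Return the dists matching every field=value criterion (logical AND)."""
        sets = [self._postings.get(item, set()) for item in criteria.items()]
        if not sets:
            return set()
        return set.intersection(*sets)


# Made-up metadata for three distributions.
index = MetaIndex()
index.add("Foo-Bar", {"requires": ["Test::More", "Moose"], "license": "apache"})
index.add("Date-Thing", {"requires": ["Test::More"], "license": "perl"})
index.add("Baz", {"requires": ["DateTime"], "license": "apache"})

# Equivalent of:  requires: Test::More  /  license: apache
print(index.query(requires="Test::More", license="apache"))
```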

To be honest, I'm working more towards a product search engine than a CPAN
index in particular. CPAN data was safe to work on at home ... business
data, not so safe.

What pushed me towards KinoSearch was seeing some results from the
evo.com beta linked from rectangular.com. evo.com appears to have the
functionality I'm thinking of, where the results have computed 'refinements'
for categories like brand that are presumably document fields. See
http://www.evo.com/search?q=cooking&tag=Lead-Free


Re: Roadmap .30 and Scorers

marvin at rectangular

Jul 22, 2008, 7:31 PM

Post #4 of 7 (10401 views)

On Jul 22, 2008, at 5:28 PM, Andrew Bramble wrote:

> To use the CPAN example again, rather than choosing to search for
> "date parser" and license:artistic , might the initial search for
> "date parser" return the matching results AND a structure describing
> that of 100 matched documents, the field 'license' breaks down to
> perl=50, artistic=30, gpl=10, bsd=5, apache=5.
> One could then repeat the original search , adding 'license:perl'
> to narrow the search to only the 50 matching documents.

Some people call this "faceted search". We discussed it at...

<http://www.gossamer-threads.com/lists/kinosearch/discuss/2911>

> Since this would required reading/examining each matched record I
> would guess this belongs in the XS/C rather than perl.

Probably, you'd want to do rapid prototyping in Perl, then port to C/XS.

The hitch is there isn't a formal C API yet. The fundamentals of the
OO design are done and ought to work great, but we need the surface
level stuff.

Whatever we come up with for the C API will be clunky and weird, but
we're just going to have to accept that. We aren't doing serious
language design, and if we pretend that we are, we'll never finish.

> Is it wishful thinking ? or might this be possible with subclassable
> scorers/hit collectors.

Absolutely, it's possible. It doesn't belong in core KS, but it's an
ideal extension, and I look forward to supporting whoever takes on the
task of writing it.
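
To make the hit-collector idea concrete, here is a prototype-level sketch in
Python. None of these classes are KinoSearch's; `HitCollector`,
`TopDocsCollector` and `FacetCollector` are stand-ins invented for the
example, assuming only that a search loop calls `collect()` once per matching
document. The facet collector wraps another collector, tallies a field value
for every hit, and passes the hit along unchanged.

```python
from collections import Counter


class HitCollector:
    """Stand-in base class: the search loop calls collect() for each match."""

    def collect(self, doc_id, score):
        raise NotImplementedError


class TopDocsCollector(HitCollector):
    """Keeps the `size` highest-scoring hits as (score, doc_id) pairs."""

    def __init__(self, size):
        self.size = size
        self.hits = []

    def collect(self, doc_id, score):
        self.hits.append((score, doc_id))
        self.hits.sort(reverse=True)
        del self.hits[self.size:]


class FacetCollector(HitCollector):
    """Wraps another collector and counts one field's value for every hit."""

    def __init__(self, inner, field_values):
        self.inner = inner
        self.field_values = field_values  # doc_id -> field value (assumed lookup)
        self.counts = Counter()

    def collect(self, doc_id, score):
        self.counts[self.field_values[doc_id]] += 1
        self.inner.collect(doc_id, score)


# Made-up data: license per doc, and the (doc_id, score) pairs a query matched.
license_of = {0: "perl", 1: "artistic", 2: "perl", 3: "gpl"}
matches = [(0, 1.5), (2, 0.7), (3, 2.0)]

top = TopDocsCollector(size=2)
facets = FacetCollector(top, license_of)
for doc_id, score in matches:
    facets.collect(doc_id, score)

print(top.hits)
print(dict(facets.counts))
```

Note that the counts cover every matched document, not just the top hits that
get returned, which is what the refinement list needs.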

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Roadmap .30 and Scorers

marvin at rectangular

Jul 22, 2008, 7:33 PM

Post #5 of 7 (10393 views)

On Jul 22, 2008, at 5:57 PM, Justin DeVuyst wrote:

> Basically my goal is to make searchable, in one place, everything
> known about modules on the CPAN. Whether KinoSearch can fit the
> whole bill or just part of the bill I'm still not sure of.

KS can certainly handle the indexing/search aspect, but it would be
one tool among several in your toolbox. You'd also need CPAN::Mini, a
POD extractor, etc. In fact, for setting up a bare-bones CPAN
search, extracting the desirable data from all those archives is a
bigger up-front problem than the fundamental KS stuff.

Later on, tweaking the index and application design so that the
search's usefulness surpasses search.cpan.org, kobesearch, and site-
limited Googling becomes the challenge.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Roadmap .30 and Scorers

bramble.andrew at gmail

Jul 22, 2008, 7:42 PM

Post #6 of 7 (10428 views)

>
> also need CPAN::Mini, a POD extractor, etc. In fact, for setting up a
> bare-bones CPAN search, extracting the desirable data from all those
> archives is a bigger up-front problem than the fundamental KS stuff.
>

I must agree: if you're even considering indexing CPAN with KS, then
CPAN::Mini and CPAN::Mini::Extract, plus the code to turn these into, say, a
document per '.pm', require a heap of work with things like POD extractors,
PPI, etc.

Re: Roadmap .30 and Scorers

justin at devuyst

Jul 22, 2008, 9:33 PM

Post #7 of 7 (10406 views)

Thanks guys, but I'm already past that point. What I've been
working on lately is attempting to aggregate data from CPAN
and other places into the index in a useful way. But while
I'm doing this I'm asking myself if KS can handle very specific
queries. Maybe something like "give me a highly rated (doc
boost based on cpanratings data) XML module that's cited as a
prereq at least a few times and has been uploaded in the past
2 years or has few bugs". Of course KS would have an easier
time with "xml", but that's no fun :)

Clearly KinoSearch is a great choice for indexing all the big
text like POD, code, reviews, etc. My latest thoughts lean
towards some sort of mix between KS and a more standard DB
approach. I'm not sure how the mixing would happen, though.

Maybe it'd be better to keep them separate: KS for the simple
and more intelligent searching, and the DB for more detailed
dumb searches.
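
One way the split could look, sketched in Python with the standard sqlite3
module. Everything here is an assumption for illustration: `text_search` is a
toy substring match standing in for the full-text engine, the `dist` table
and its columns are invented, and the example reads the query above as
"rated AND cited AND (recent OR few bugs)". The text engine produces
candidates, and the database applies the structured conditions.

```python
import sqlite3


def text_search(docs, term):
    """Toy stand-in for the full-text engine: dists whose text mentions term."""
    term = term.lower()
    return [name for name, text in docs.items() if term in text.lower()]


def structured_filter(conn, candidates, min_rating, min_dependents,
                      uploaded_since, max_bugs):
    """Apply the 'dumb' structured conditions to the text-search candidates."""
    if not candidates:
        return []
    placeholders = ",".join("?" * len(candidates))
    sql = f"""
        SELECT name FROM dist
        WHERE name IN ({placeholders})
          AND rating >= ?
          AND dependents >= ?
          AND (uploaded >= ? OR open_bugs <= ?)
        ORDER BY rating DESC
    """
    params = [*candidates, min_rating, min_dependents, uploaded_since, max_bugs]
    return [row[0] for row in conn.execute(sql, params)]


# Made-up text and metadata for four distributions.
docs = {
    "XML-Fast": "Fast XML parser",
    "XML-Old": "Legacy XML writer",
    "XML-Stable": "Small, stable XML helper",
    "Date-Thing": "Date parsing",
}
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dist (name TEXT PRIMARY KEY, rating REAL, "
    "dependents INTEGER, uploaded TEXT, open_bugs INTEGER)"
)
conn.executemany(
    "INSERT INTO dist VALUES (?, ?, ?, ?, ?)",
    [
        ("XML-Fast", 4.5, 12, "2008-03-01", 9),
        ("XML-Old", 4.8, 40, "2003-01-15", 25),
        ("XML-Stable", 4.2, 5, "2004-06-01", 1),
        ("Date-Thing", 5.0, 100, "2008-05-01", 0),
    ],
)

candidates = text_search(docs, "xml")
results = structured_filter(conn, candidates, min_rating=4.0, min_dependents=3,
                            uploaded_since="2006-07-22", max_bugs=2)
print(results)
```

Dates are stored as ISO strings so that plain string comparison orders them
correctly. The obvious cost of this split is that relevance scores from the
text side are lost unless they are carried along with the candidate list.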

The faceted searching sounds cool except for that one problem.

Any ideas?

Thanks,
jdv
