Mailing List Archive: Does it make sense to add Bayesian search to lucent?

Does it make sense to add Bayesian search to lucent?

Sep 19, 2002, 11:54 AM

Post #1 of 6 (847 views)

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).

http://www.wired.com/wired/archive/8.02/autonomy_pr.html

It gives what I believe is a simplified version of Baye's theorem:

P(t|y) = [ P(y|t)P(t) ] / P(y)

P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.

In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.

To calculate the right side you do the following:

P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).

P(y) is done the same way.

P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".

Then voilla, you can calculate it out.

I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.

Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.

I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms

#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).

Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].

And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.

I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com
where not what I was looking for.

thx,
Dave

PS
I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.

PPS

Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.

99.42% ayuda
99.42% customers
99.42% packet
92.36% disk
92.06% caen
92.06% gopher
92.06% kiosk
92.06% setting
89.79% beginner
89.54% rexx
89.54% soluciones
88.68% buscador
88.68% columns
88.68% input
88.68% lining
88.68% tyneside
87.23% anesthesia
87.23% fortran
85.49% alternativa
85.49% cease
85.49% iomega
85.49% linus
85.49% respect
85.49% rh
85.49% streams
82.46% beehive
82.46% mastering
82.46% techtv
80.81% demos
80.81% fs
80.78% sybase
80.17% tk
79.59% bizkaia
79.59% delaney
79.59% lilac
79.59% millions
78.83% todos
76.87% avi
76.87% centurion
76.87% emirates
76.87% ett
76.87% fidonet
76.87% libri
76.87% taken
76.87% zmir
75.08% monterrey
74.57% tipps
74.29% clic
74.29% forensics
74.29% rampage
74.29% wr
73.80% modem
71.83% ee
71.83% freiburger
71.83% mic
71.83% poder
71.80% plug
71.58% apps
70.37% joins
69.93% pluto
69.70% unix
69.50% azur
69.50% dilemma
69.50% lis
69.50% meal
69.50% yhdistys
68.33% wan
67.27% balancing
67.27% broadcasters
67.27% diverse
67.27% embrace
67.27% finances
67.27% processors
67.08% drivers
65.97% cluster
65.30% penguin
65.29% proxy
65.15% faithful
65.15% macs
65.15% touts
65.12% hat
64.14% kde
64.13% administrators
63.85% tuxedo
63.13% feds
63.13% handicapped
63.13% ids
63.13% readies
63.13% waikato
61.21% improving
61.21% inventor
61.21% spiele
61.21% stavanger
60.28% powered
59.84% tome
59.80% checklist
59.37% announces
59.37% bursa
59.37% cali
59.37% chennai
59.37% curve
59.37% washtenaw
59.16% beginners
58.53% nutshell
58.48% benchmark
57.61% beos
57.61% modules
57.61% nella
57.61% shut
55.93% alford
55.93% australien
55.93% promoting
55.93% tivoli
55.54% parallel
54.52% enthusiasts
54.32% choosing
54.32% pretoria
54.32% scheduling
54.32% schwabach
54.32% szczecin
54.32% unicode
53.85% intranet
52.99% backup
52.78% latvian
52.78% opengl
52.78% processor
52.78% vendors
52.12% tcp
51.72% armada
51.30% longer
51.30% weinheim

Re: Does it make sense to add Bayesian search to lucent? [ In reply to ]

jackpark at thinkalong

Sep 19, 2002, 12:17 PM

Post #2 of 6 (823 views)

Permalink

My response to your question "did I reach the right list?" is: Yes. I've
mentioned Latent Semantic Analysis on this list before. Bayes, I think, is
just as valid and important. I am personally eager to see something like
Lucene serve a variety of purposes in knowledge space.

FWIW, in order to avoid the one-size-fits-all bloated application that
Lucene could easily become, I am wondering if there is a way to perform a
major refactoring of the Lucene concept into a society of agents, all
coordinated by, say, tuplespace.

Just a few EUROs...
Jack

At 11:54 AM 9/19/2002 -0700, you wrote:
>I'm not convinced I understand Bayesian search/logic/probabilities.
>I thought this was a great article about the company Autonomy in Wired
>and it gives a kind of technical intro to it and surprisingly for such
>magazines, there's even a math equation it in(!).
>
>http://www.wired.com/wired/archive/8.02/autonomy_pr.html
>
>It gives what I believe is a simplified version of Baye's theorem:
>
>P(t|y) = [ P(y|t)P(t) ] / P(y)
>
>P(t|y) is the probability that 't' implies 'y'.
>P(t) is the probability of 't' happening.
>
>In the context of a search engine I think 't' and 'y' are
>two different words - the equation says that if you're searching
>for 't', the probability that 'y' is relevant is, well, the right side
>of the equation.
>
>To calculate the right side you do the following:
>
>P(t) is the prob of 't' occurring in a document, thus searching with the
>query "+t" and
>calling Hits.length() returns this (well, you divide the the # of docs
>in the collection
>IndexReader.numDocs() to get the probability).
>
>P(y) is done the same way.
>
>P(y|t), I believe, is done as follows. Is it the prob of both words
>occurring
>divided by the number of docs that y occurs, so it's the count for "+y
>+t" divided
>by the count for "+y".
>
>Then voilla, you can calculate it out.
>
>I believe then that the way Lucene can be used to support this kind of
>logic is
>once the index is built go thru all terms (IndexReader.terms()) and
>compute
>P(t|y) for every pair of words indexed.
>
>Then if someone queries on 't' you do something similar to what a fuzzy
>search does - you
>form a larger query and boost the other words based on their probability
>(i.e. P(t|y)).
>The result is that you might get
>docs you didn't expect i.e. docs without 't' (the term you were
>searching on) which is
>where this becomes valuable - it can find docs w/o any of your search
>terms as you search terms
>in a sense imply that other words are relevant.
>
>I have 2 test programs:
>[1] Given a pair of words and an index this app computes P(t|y)
>[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
>all other terms
>
>#1 runs fast enough while #2 is similar to what would be done in
>production, and takes
>forever on a large index (several hours on my 1GB/3Million record index
>of the dmoz.org content).
>
>Now for the questions - is the above generally right i.e.
>(a) my descr of how to calculate P(t|y)
>(b) my descr of the need to precalc P(t|y) for all pairs of words in the
>index, and thus
>it must take forever to do this in a large index [(b1) does Autonomy do
>this?].
>
>And the one other thing - I'm still checking my work but in some cases
>P(t|y) is > 1.0, which
>seems to make no sense as probabilities must be between 0 and 1 , but in
>the equation of
>P(t) is high and P(y) is low then maybe this "makes sense" and thus the
>equation is not always in
>the [0...1] range.
>
>I'm also interested in more, not-too-technical info on Bayesian search.
>Various docs I found on citeseer.com
>where not what I was looking for.
>
>thx,
> Dave
>
>PS
> I hope I hit the relevant mailing list - sorry if this is outside the
>dev charter.
>
>
>PPS
>
>Here's the output of my test program when I asked it to calcluate P(t|y)
>for t='linux'
>going over index of dmoz (index contains title+desc). It seem to make
>some sense - the
>case I was interested in was y=penguin (the linux mascot) which ranks in
>at 65%, thus
>if you searched on linux you might be (pleasantly?) suprised to get docs
>with pictures
>of baby penguins or whatnot. Output is all words < 100% and > 50%. I
>have not audited
>this extensively, esp as to why ayuda is so high(means help in
>Spanish..). Might be that
>small #'s of matches gives scores too high.
>
>
>
> 99.42% ayuda
> 99.42% customers
> 99.42% packet
> 92.36% disk
> 92.06% caen
> 92.06% gopher
> 92.06% kiosk
> 92.06% setting
> 89.79% beginner
> 89.54% rexx
> 89.54% soluciones
> 88.68% buscador
> 88.68% columns
> 88.68% input
> 88.68% lining
> 88.68% tyneside
> 87.23% anesthesia
> 87.23% fortran
> 85.49% alternativa
> 85.49% cease
> 85.49% iomega
> 85.49% linus
> 85.49% respect
> 85.49% rh
> 85.49% streams
> 82.46% beehive
> 82.46% mastering
> 82.46% techtv
> 80.81% demos
> 80.81% fs
> 80.78% sybase
> 80.17% tk
> 79.59% bizkaia
> 79.59% delaney
> 79.59% lilac
> 79.59% millions
> 78.83% todos
> 76.87% avi
> 76.87% centurion
> 76.87% emirates
> 76.87% ett
> 76.87% fidonet
> 76.87% libri
> 76.87% taken
> 76.87% zmir
> 75.08% monterrey
> 74.57% tipps
> 74.29% clic
> 74.29% forensics
> 74.29% rampage
> 74.29% wr
> 73.80% modem
> 71.83% ee
> 71.83% freiburger
> 71.83% mic
> 71.83% poder
> 71.80% plug
> 71.58% apps
> 70.37% joins
> 69.93% pluto
> 69.70% unix
> 69.50% azur
> 69.50% dilemma
> 69.50% lis
> 69.50% meal
> 69.50% yhdistys
> 68.33% wan
> 67.27% balancing
> 67.27% broadcasters
> 67.27% diverse
> 67.27% embrace
> 67.27% finances
> 67.27% processors
> 67.08% drivers
> 65.97% cluster
> 65.30% penguin
> 65.29% proxy
> 65.15% faithful
> 65.15% macs
> 65.15% touts
> 65.12% hat
> 64.14% kde
> 64.13% administrators
> 63.85% tuxedo
> 63.13% feds
> 63.13% handicapped
> 63.13% ids
> 63.13% readies
> 63.13% waikato
> 61.21% improving
> 61.21% inventor
> 61.21% spiele
> 61.21% stavanger
> 60.28% powered
> 59.84% tome
> 59.80% checklist
> 59.37% announces
> 59.37% bursa
> 59.37% cali
> 59.37% chennai
> 59.37% curve
> 59.37% washtenaw
> 59.16% beginners
> 58.53% nutshell
> 58.48% benchmark
> 57.61% beos
> 57.61% modules
> 57.61% nella
> 57.61% shut
> 55.93% alford
> 55.93% australien
> 55.93% promoting
> 55.93% tivoli
> 55.54% parallel
> 54.52% enthusiasts
> 54.32% choosing
> 54.32% pretoria
> 54.32% scheduling
> 54.32% schwabach
> 54.32% szczecin
> 54.32% unicode
> 53.85% intranet
> 52.99% backup
> 52.78% latvian
> 52.78% opengl
> 52.78% processor
> 52.78% vendors
> 52.12% tcp
> 51.72% armada
> 51.30% longer
> 51.30% weinheim
>
>
>

---------------------------------------------------------------------------
XML Topic Maps: Creating and Using Topic Maps for the Web.
Addison-Wesley, ISBN 0-201-74960-2.

http://www.nexist.org/wiki/User0Blog

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Re: Does it make sense to add Bayesian search to lucent? [ In reply to ]

m.vanwijland at quicknet

Sep 19, 2002, 12:59 PM

Post #3 of 6 (832 views)

Permalink

Dave,

this is very interesting. I have done something similair with Mutual
Information statistics.
The trick is to build a semantic network. First, you go through all the
terms and perform
a NEAR search within a preset window. Next you calculate the probability,
and word
pairs with a higher than 0,6 probability are added to the semantic network.

This network can then be referenced when performing a search. Simply search
the
term in the semantic network, and get all related words. Next add them to
the query
and set the boost factor properly (for instance, the entered term boost this
times 2, and the others
can be boosted related to the probability of the ord pair).

There is a believe that for documents within a fixed domain need to rebuild
a semantic network
only when there is a substantial addition of documents to the collection.

Maybe we can swap some code, or in case anyone else is interested, I can
submit my
Mutual Information code to the list? But it is based on my Dutch Analyzer
and dutch stopwords.

Doug, you have much experience in this area of statistics and clustering, I
recall some documents
on Gather/Scatter. Is Lucene going this way or will Lucene remain a Free
Text search engine?
What are the plans for the near future of Lucene?

regards,

maurits van wijland

----- Original Message -----
From: "Spencer, Dave" <dave@lumos.com>
To: <lucene-dev@jakarta.apache.org>
Sent: Thursday, September 19, 2002 8:54 PM
Subject: Does it make sense to add Bayesian search to lucent?

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).

http://www.wired.com/wired/archive/8.02/autonomy_pr.html

It gives what I believe is a simplified version of Baye's theorem:

P(t|y) = [ P(y|t)P(t) ] / P(y)

P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.

In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.

To calculate the right side you do the following:

P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).

P(y) is done the same way.

P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".

Then voilla, you can calculate it out.

I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.

Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.

I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms

#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).

Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].

And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.

I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com
where not what I was looking for.

thx,
Dave

PS
I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.

PPS

Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.

99.42% ayuda
99.42% customers
99.42% packet
92.36% disk
92.06% caen
92.06% gopher
92.06% kiosk
92.06% setting
89.79% beginner
89.54% rexx
89.54% soluciones
88.68% buscador
88.68% columns
88.68% input
88.68% lining
88.68% tyneside
87.23% anesthesia
87.23% fortran
85.49% alternativa
85.49% cease
85.49% iomega
85.49% linus
85.49% respect
85.49% rh
85.49% streams
82.46% beehive
82.46% mastering
82.46% techtv
80.81% demos
80.81% fs
80.78% sybase
80.17% tk
79.59% bizkaia
79.59% delaney
79.59% lilac
79.59% millions
78.83% todos
76.87% avi
76.87% centurion
76.87% emirates
76.87% ett
76.87% fidonet
76.87% libri
76.87% taken
76.87% zmir
75.08% monterrey
74.57% tipps
74.29% clic
74.29% forensics
74.29% rampage
74.29% wr
73.80% modem
71.83% ee
71.83% freiburger
71.83% mic
71.83% poder
71.80% plug
71.58% apps
70.37% joins
69.93% pluto
69.70% unix
69.50% azur
69.50% dilemma
69.50% lis
69.50% meal
69.50% yhdistys
68.33% wan
67.27% balancing
67.27% broadcasters
67.27% diverse
67.27% embrace
67.27% finances
67.27% processors
67.08% drivers
65.97% cluster
65.30% penguin
65.29% proxy
65.15% faithful
65.15% macs
65.15% touts
65.12% hat
64.14% kde
64.13% administrators
63.85% tuxedo
63.13% feds
63.13% handicapped
63.13% ids
63.13% readies
63.13% waikato
61.21% improving
61.21% inventor
61.21% spiele
61.21% stavanger
60.28% powered
59.84% tome
59.80% checklist
59.37% announces
59.37% bursa
59.37% cali
59.37% chennai
59.37% curve
59.37% washtenaw
59.16% beginners
58.53% nutshell
58.48% benchmark
57.61% beos
57.61% modules
57.61% nella
57.61% shut
55.93% alford
55.93% australien
55.93% promoting
55.93% tivoli
55.54% parallel
54.52% enthusiasts
54.32% choosing
54.32% pretoria
54.32% scheduling
54.32% schwabach
54.32% szczecin
54.32% unicode
53.85% intranet
52.99% backup
52.78% latvian
52.78% opengl
52.78% processor
52.78% vendors
52.12% tcp
51.72% armada
51.30% longer
51.30% weinheim

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: Does it make sense to add Bayesian search to lucent? [ In reply to ]

dave at lumos

Sep 19, 2002, 1:18 PM

Post #4 of 6 (841 views)

Permalink

Thanks for info, sounds interesting.
Yes, I'd like to see what you've done.
I'll try to post my 2 Bayesian "test" programs + my dmoz.org indexer
[.the later
in case people want a 1 GB index to play with]. Will prob take a day or
two
to get this stuff together.

-----Original Message-----
From: Maurits van Wijland [mailto:m.vanwijland@quicknet.nl]
Sent: Thursday, September 19, 2002 12:59 PM
To: Lucene Developers List
Cc: Doug Cutting; Spencer, Dave
Subject: Re: Does it make sense to add Bayesian search to lucent?

Dave,

this is very interesting. I have done something similair with Mutual
Information statistics.
The trick is to build a semantic network. First, you go through all the
terms and perform
a NEAR search within a preset window. Next you calculate the
probability,
and word
pairs with a higher than 0,6 probability are added to the semantic
network.

This network can then be referenced when performing a search. Simply
search
the
term in the semantic network, and get all related words. Next add them
to
the query
and set the boost factor properly (for instance, the entered term boost
this
times 2, and the others
can be boosted related to the probability of the ord pair).

There is a believe that for documents within a fixed domain need to
rebuild
a semantic network
only when there is a substantial addition of documents to the
collection.

Maybe we can swap some code, or in case anyone else is interested, I can
submit my
Mutual Information code to the list? But it is based on my Dutch
Analyzer
and dutch stopwords.

Doug, you have much experience in this area of statistics and
clustering, I
recall some documents
on Gather/Scatter. Is Lucene going this way or will Lucene remain a Free
Text search engine?
What are the plans for the near future of Lucene?

regards,

maurits van wijland

----- Original Message -----
From: "Spencer, Dave" <dave@lumos.com>
To: <lucene-dev@jakarta.apache.org>
Sent: Thursday, September 19, 2002 8:54 PM
Subject: Does it make sense to add Bayesian search to lucent?

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).

http://www.wired.com/wired/archive/8.02/autonomy_pr.html

It gives what I believe is a simplified version of Baye's theorem:

P(t|y) = [ P(y|t)P(t) ] / P(y)

P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.

In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.

To calculate the right side you do the following:

P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).

P(y) is done the same way.

P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".

Then voilla, you can calculate it out.

I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.

Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.

I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms

#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).

Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].

And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.

I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com
where not what I was looking for.

thx,
Dave

PS
I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.

PPS

Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.

99.42% ayuda
99.42% customers
99.42% packet
92.36% disk
92.06% caen
92.06% gopher
92.06% kiosk
92.06% setting
89.79% beginner
89.54% rexx
89.54% soluciones
88.68% buscador
88.68% columns
88.68% input
88.68% lining
88.68% tyneside
87.23% anesthesia
87.23% fortran
85.49% alternativa
85.49% cease
85.49% iomega
85.49% linus
85.49% respect
85.49% rh
85.49% streams
82.46% beehive
82.46% mastering
82.46% techtv
80.81% demos
80.81% fs
80.78% sybase
80.17% tk
79.59% bizkaia
79.59% delaney
79.59% lilac
79.59% millions
78.83% todos
76.87% avi
76.87% centurion
76.87% emirates
76.87% ett
76.87% fidonet
76.87% libri
76.87% taken
76.87% zmir
75.08% monterrey
74.57% tipps
74.29% clic
74.29% forensics
74.29% rampage
74.29% wr
73.80% modem
71.83% ee
71.83% freiburger
71.83% mic
71.83% poder
71.80% plug
71.58% apps
70.37% joins
69.93% pluto
69.70% unix
69.50% azur
69.50% dilemma
69.50% lis
69.50% meal
69.50% yhdistys
68.33% wan
67.27% balancing
67.27% broadcasters
67.27% diverse
67.27% embrace
67.27% finances
67.27% processors
67.08% drivers
65.97% cluster
65.30% penguin
65.29% proxy
65.15% faithful
65.15% macs
65.15% touts
65.12% hat
64.14% kde
64.13% administrators
63.85% tuxedo
63.13% feds
63.13% handicapped
63.13% ids
63.13% readies
63.13% waikato
61.21% improving
61.21% inventor
61.21% spiele
61.21% stavanger
60.28% powered
59.84% tome
59.80% checklist
59.37% announces
59.37% bursa
59.37% cali
59.37% chennai
59.37% curve
59.37% washtenaw
59.16% beginners
58.53% nutshell
58.48% benchmark
57.61% beos
57.61% modules
57.61% nella
57.61% shut
55.93% alford
55.93% australien
55.93% promoting
55.93% tivoli
55.54% parallel
54.52% enthusiasts
54.32% choosing
54.32% pretoria
54.32% scheduling
54.32% schwabach
54.32% szczecin
54.32% unicode
53.85% intranet
52.99% backup
52.78% latvian
52.78% opengl
52.78% processor
52.78% vendors
52.12% tcp
51.72% armada
51.30% longer
51.30% weinheim

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

AW: Does it make sense to add Bayesian search to lucent? [ In reply to ]

Kai.Kramer at web

Sep 19, 2002, 2:33 PM

Post #5 of 6 (831 views)

Permalink

>
> Maybe we can swap some code, or in case anyone else is
> interested, I can submit my Mutual Information code to the
> list? But it is based on my Dutch Analyzer and dutch stopwords.
>
I think this is a good idea.

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: Does it make sense to add Bayesian search to lucent? [ In reply to ]

otis_gospodnetic at yahoo

Sep 26, 2002, 3:22 PM

Post #6 of 6 (824 views)

Permalink

I'd like to see all the code mentioned :)

Also, recently I read an article about a CBR tool called Selection
Engine, in one of the recent JavaPRO issues. The project if on SF.net.
Doug mentioned refactoring Lucene to allow the similarity computation
to be more pluggable, I believe.
I wonder if that would allow you and others to plug in different
components for calculating similarity, like the CBR tool mentioned,
this Bayesian approach, etc.

Otis

--- "Spencer, Dave" <dave@lumos.com> wrote:
> Thanks for info, sounds interesting.
> Yes, I'd like to see what you've done.
> I'll try to post my 2 Bayesian "test" programs + my dmoz.org indexer
> [.the later
> in case people want a 1 GB index to play with]. Will prob take a day
> or
> two
> to get this stuff together.
>
> -----Original Message-----
> From: Maurits van Wijland [mailto:m.vanwijland@quicknet.nl]
> Sent: Thursday, September 19, 2002 12:59 PM
> To: Lucene Developers List
> Cc: Doug Cutting; Spencer, Dave
> Subject: Re: Does it make sense to add Bayesian search to lucent?
>
>
> Dave,
>
> this is very interesting. I have done something similair with Mutual
> Information statistics.
> The trick is to build a semantic network. First, you go through all
> the
> terms and perform
> a NEAR search within a preset window. Next you calculate the
> probability,
> and word
> pairs with a higher than 0,6 probability are added to the semantic
> network.
>
> This network can then be referenced when performing a search. Simply
> search
> the
> term in the semantic network, and get all related words. Next add
> them
> to
> the query
> and set the boost factor properly (for instance, the entered term
> boost
> this
> times 2, and the others
> can be boosted related to the probability of the ord pair).
>
> There is a believe that for documents within a fixed domain need to
> rebuild
> a semantic network
> only when there is a substantial addition of documents to the
> collection.
>
> Maybe we can swap some code, or in case anyone else is interested, I
> can
> submit my
> Mutual Information code to the list? But it is based on my Dutch
> Analyzer
> and dutch stopwords.
>
> Doug, you have much experience in this area of statistics and
> clustering, I
> recall some documents
> on Gather/Scatter. Is Lucene going this way or will Lucene remain a
> Free
> Text search engine?
> What are the plans for the near future of Lucene?
>
> regards,
>
> maurits van wijland
>
>
> ----- Original Message -----
> From: "Spencer, Dave" <dave@lumos.com>
> To: <lucene-dev@jakarta.apache.org>
> Sent: Thursday, September 19, 2002 8:54 PM
> Subject: Does it make sense to add Bayesian search to lucent?
>
>
> I'm not convinced I understand Bayesian search/logic/probabilities.
> I thought this was a great article about the company Autonomy in
> Wired
> and it gives a kind of technical intro to it and surprisingly for
> such
> magazines, there's even a math equation it in(!).
>
> http://www.wired.com/wired/archive/8.02/autonomy_pr.html
>
> It gives what I believe is a simplified version of Baye's theorem:
>
> P(t|y) = [ P(y|t)P(t) ] / P(y)
>
> P(t|y) is the probability that 't' implies 'y'.
> P(t) is the probability of 't' happening.
>
> In the context of a search engine I think 't' and 'y' are
> two different words - the equation says that if you're searching
> for 't', the probability that 'y' is relevant is, well, the right
> side
> of the equation.
>
> To calculate the right side you do the following:
>
> P(t) is the prob of 't' occurring in a document, thus searching with
> the
> query "+t" and
> calling Hits.length() returns this (well, you divide the the # of
> docs
> in the collection
> IndexReader.numDocs() to get the probability).
>
> P(y) is done the same way.
>
> P(y|t), I believe, is done as follows. Is it the prob of both words
> occurring
> divided by the number of docs that y occurs, so it's the count for
> "+y
> +t" divided
> by the count for "+y".
>
> Then voilla, you can calculate it out.
>
> I believe then that the way Lucene can be used to support this kind
> of
> logic is
> once the index is built go thru all terms (IndexReader.terms()) and
> compute
> P(t|y) for every pair of words indexed.
>
> Then if someone queries on 't' you do something similar to what a
> fuzzy
> search does - you
> form a larger query and boost the other words based on their
> probability
> (i.e. P(t|y)).
> The result is that you might get
> docs you didn't expect i.e. docs without 't' (the term you were
> searching on) which is
> where this becomes valuable - it can find docs w/o any of your search
> terms as you search terms
> in a sense imply that other words are relevant.
>
> I have 2 test programs:
> [1] Given a pair of words and an index this app computes P(t|y)
> [2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
> all other terms
>
> #1 runs fast enough while #2 is similar to what would be done in
> production, and takes
> forever on a large index (several hours on my 1GB/3Million record
> index
> of the dmoz.org content).
>
> Now for the questions - is the above generally right i.e.
> (a) my descr of how to calculate P(t|y)
> (b) my descr of the need to precalc P(t|y) for all pairs of words in
> the
> index, and thus
> it must take forever to do this in a large index [(b1) does Autonomy
> do
> this?].
>
> And the one other thing - I'm still checking my work but in some
> cases
> P(t|y) is > 1.0, which
> seems to make no sense as probabilities must be between 0 and 1 , but
> in
> the equation of
> P(t) is high and P(y) is low then maybe this "makes sense" and thus
> the
> equation is not always in
> the [0...1] range.
>
> I'm also interested in more, not-too-technical info on Bayesian
> search.
> Various docs I found on citeseer.com
> where not what I was looking for.
>
> thx,
> Dave
>
> PS
> I hope I hit the relevant mailing list - sorry if this is outside
> the
> dev charter.
>
>
> PPS
>
> Here's the output of my test program when I asked it to calcluate
> P(t|y)
> for t='linux'
> going over index of dmoz (index contains title+desc). It seem to make
> some sense - the
> case I was interested in was y=penguin (the linux mascot) which ranks
> in
> at 65%, thus
> if you searched on linux you might be (pleasantly?) suprised to get
> docs
> with pictures
> of baby penguins or whatnot. Output is all words < 100% and > 50%. I
> have not audited
> this extensively, esp as to why ayuda is so high(means help in
> Spanish..). Might be that
> small #'s of matches gives scores too high.
>
>
>
> 99.42% ayuda
> 99.42% customers
> 99.42% packet
> 92.36% disk
> 92.06% caen
> 92.06% gopher
> 92.06% kiosk
> 92.06% setting
> 89.79% beginner
> 89.54% rexx
> 89.54% soluciones
> 88.68% buscador
> 88.68% columns
> 88.68% input
> 88.68% lining
> 88.68% tyneside
> 87.23% anesthesia
> 87.23% fortran
> 85.49% alternativa
> 85.49% cease
> 85.49% iomega
> 85.49% linus
> 85.49% respect
> 85.49% rh
> 85.49% streams
> 82.46% beehive
> 82.46% mastering
> 82.46% techtv
> 80.81% demos
> 80.81% fs
> 80.78% sybase
> 80.17% tk
> 79.59% bizkaia
> 79.59% delaney
> 79.59% lilac
> 79.59% millions
> 78.83% todos
> 76.87% avi
> 76.87% centurion
> 76.87% emirates
> 76.87% ett
> 76.87% fidonet
> 76.87% libri
> 76.87% taken
> 76.87% zmir
> 75.08% monterrey
> 74.57% tipps
> 74.29% clic
> 74.29% forensics
> 74.29% rampage
> 74.29% wr
> 73.80% modem
> 71.83% ee
> 71.83% freiburger
> 71.83% mic
> 71.83% poder
> 71.80% plug
> 71.58% apps
> 70.37% joins
> 69.93% pluto
> 69.70% unix
> 69.50% azur
> 69.50% dilemma
> 69.50% lis
> 69.50% meal
> 69.50% yhdistys
> 68.33% wan
> 67.27% balancing
> 67.27% broadcasters
> 67.27% diverse
> 67.27% embrace
> 67.27% finances
> 67.27% processors
> 67.08% drivers
> 65.97% cluster
> 65.30% penguin
> 65.29% proxy
> 65.15% faithful
> 65.15% macs
> 65.15% touts
> 65.12% hat
> 64.14% kde
> 64.13% administrators
> 63.85% tuxedo
> 63.13% feds
> 63.13% handicapped
> 63.13% ids
> 63.13% readies
> 63.13% waikato
> 61.21% improving
> 61.21% inventor
> 61.21% spiele
> 61.21% stavanger
> 60.28% powered
> 59.84% tome
> 59.80% checklist
> 59.37% announces
> 59.37% bursa
> 59.37% cali
> 59.37% chennai
> 59.37% curve
> 59.37% washtenaw
> 59.16% beginners
> 58.53% nutshell
> 58.48% benchmark
> 57.61% beos
> 57.61% modules
> 57.61% nella
> 57.61% shut
> 55.93% alford
> 55.93% australien
> 55.93% promoting
> 55.93% tivoli
> 55.54% parallel
> 54.52% enthusiasts
> 54.32% choosing
> 54.32% pretoria
> 54.32% scheduling
> 54.32% schwabach
> 54.32% szczecin
> 54.32% unicode
> 53.85% intranet
> 52.99% backup
> 52.78% latvian
> 52.78% opengl
> 52.78% processor
> 52.78% vendors
> 52.12% tcp
> 51.72% armada
> 51.30% longer
> 51.30% weinheim
>
>
>
>
>
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
>

__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>