I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).
http://www.wired.com/wired/archive/8.02/autonomy_pr.html
It gives what I believe is a simplified version of Baye's theorem:
P(t|y) = [ P(y|t)P(t) ] / P(y)
P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.
In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.
To calculate the right side you do the following:
P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).
P(y) is done the same way.
P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".
Then voilla, you can calculate it out.
I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.
Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.
I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms
#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).
Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].
And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.
I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com
where not what I was looking for.
thx,
Dave
PS
I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.
PPS
Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.
99.42% ayuda
99.42% customers
99.42% packet
92.36% disk
92.06% caen
92.06% gopher
92.06% kiosk
92.06% setting
89.79% beginner
89.54% rexx
89.54% soluciones
88.68% buscador
88.68% columns
88.68% input
88.68% lining
88.68% tyneside
87.23% anesthesia
87.23% fortran
85.49% alternativa
85.49% cease
85.49% iomega
85.49% linus
85.49% respect
85.49% rh
85.49% streams
82.46% beehive
82.46% mastering
82.46% techtv
80.81% demos
80.81% fs
80.78% sybase
80.17% tk
79.59% bizkaia
79.59% delaney
79.59% lilac
79.59% millions
78.83% todos
76.87% avi
76.87% centurion
76.87% emirates
76.87% ett
76.87% fidonet
76.87% libri
76.87% taken
76.87% zmir
75.08% monterrey
74.57% tipps
74.29% clic
74.29% forensics
74.29% rampage
74.29% wr
73.80% modem
71.83% ee
71.83% freiburger
71.83% mic
71.83% poder
71.80% plug
71.58% apps
70.37% joins
69.93% pluto
69.70% unix
69.50% azur
69.50% dilemma
69.50% lis
69.50% meal
69.50% yhdistys
68.33% wan
67.27% balancing
67.27% broadcasters
67.27% diverse
67.27% embrace
67.27% finances
67.27% processors
67.08% drivers
65.97% cluster
65.30% penguin
65.29% proxy
65.15% faithful
65.15% macs
65.15% touts
65.12% hat
64.14% kde
64.13% administrators
63.85% tuxedo
63.13% feds
63.13% handicapped
63.13% ids
63.13% readies
63.13% waikato
61.21% improving
61.21% inventor
61.21% spiele
61.21% stavanger
60.28% powered
59.84% tome
59.80% checklist
59.37% announces
59.37% bursa
59.37% cali
59.37% chennai
59.37% curve
59.37% washtenaw
59.16% beginners
58.53% nutshell
58.48% benchmark
57.61% beos
57.61% modules
57.61% nella
57.61% shut
55.93% alford
55.93% australien
55.93% promoting
55.93% tivoli
55.54% parallel
54.52% enthusiasts
54.32% choosing
54.32% pretoria
54.32% scheduling
54.32% schwabach
54.32% szczecin
54.32% unicode
53.85% intranet
52.99% backup
52.78% latvian
52.78% opengl
52.78% processor
52.78% vendors
52.12% tcp
51.72% armada
51.30% longer
51.30% weinheim
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).
http://www.wired.com/wired/archive/8.02/autonomy_pr.html
It gives what I believe is a simplified version of Baye's theorem:
P(t|y) = [ P(y|t)P(t) ] / P(y)
P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.
In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.
To calculate the right side you do the following:
P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).
P(y) is done the same way.
P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".
Then voilla, you can calculate it out.
I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.
Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.
I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms
#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).
Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].
And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.
I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com
where not what I was looking for.
thx,
Dave
PS
I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.
PPS
Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.
99.42% ayuda
99.42% customers
99.42% packet
92.36% disk
92.06% caen
92.06% gopher
92.06% kiosk
92.06% setting
89.79% beginner
89.54% rexx
89.54% soluciones
88.68% buscador
88.68% columns
88.68% input
88.68% lining
88.68% tyneside
87.23% anesthesia
87.23% fortran
85.49% alternativa
85.49% cease
85.49% iomega
85.49% linus
85.49% respect
85.49% rh
85.49% streams
82.46% beehive
82.46% mastering
82.46% techtv
80.81% demos
80.81% fs
80.78% sybase
80.17% tk
79.59% bizkaia
79.59% delaney
79.59% lilac
79.59% millions
78.83% todos
76.87% avi
76.87% centurion
76.87% emirates
76.87% ett
76.87% fidonet
76.87% libri
76.87% taken
76.87% zmir
75.08% monterrey
74.57% tipps
74.29% clic
74.29% forensics
74.29% rampage
74.29% wr
73.80% modem
71.83% ee
71.83% freiburger
71.83% mic
71.83% poder
71.80% plug
71.58% apps
70.37% joins
69.93% pluto
69.70% unix
69.50% azur
69.50% dilemma
69.50% lis
69.50% meal
69.50% yhdistys
68.33% wan
67.27% balancing
67.27% broadcasters
67.27% diverse
67.27% embrace
67.27% finances
67.27% processors
67.08% drivers
65.97% cluster
65.30% penguin
65.29% proxy
65.15% faithful
65.15% macs
65.15% touts
65.12% hat
64.14% kde
64.13% administrators
63.85% tuxedo
63.13% feds
63.13% handicapped
63.13% ids
63.13% readies
63.13% waikato
61.21% improving
61.21% inventor
61.21% spiele
61.21% stavanger
60.28% powered
59.84% tome
59.80% checklist
59.37% announces
59.37% bursa
59.37% cali
59.37% chennai
59.37% curve
59.37% washtenaw
59.16% beginners
58.53% nutshell
58.48% benchmark
57.61% beos
57.61% modules
57.61% nella
57.61% shut
55.93% alford
55.93% australien
55.93% promoting
55.93% tivoli
55.54% parallel
54.52% enthusiasts
54.32% choosing
54.32% pretoria
54.32% scheduling
54.32% schwabach
54.32% szczecin
54.32% unicode
53.85% intranet
52.99% backup
52.78% latvian
52.78% opengl
52.78% processor
52.78% vendors
52.12% tcp
51.72% armada
51.30% longer
51.30% weinheim