Mailing List Archive

Googlifying lucene querys
Hello,

Despite the confusing subject ;) my question is simple. I'm just
trying out Lucene for the first time and would like to know how one
would go about implementing a search on the index with the same logic
that Google uses.
For example, if the user input is "george bush white house", how
do I easily construct a query that searches for ALL of the words above? If I
have understood correctly, passing the search string above to the
QueryParser creates a query that searches for ANY of the words above.

Thanks for any help,

Jari Aarniala
foo@welho.com



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Googlifying lucene querys [ In reply to ]
+george +bush +white +house


--
Ian.

RE: Googlifying lucene querys [ In reply to ]
> +george +bush +white +house

Well, that's pretty obvious even for me :) If you have separate words,
just tokenize the string and add a plus in front of each of the words.
But what I'm trying to do here is this:

Let's say I have a more complicated query, say

'george bush "white house"'

There you have two separate words, "george" and "bush" and then
"white house" enclosed in quotes. If I use a piece of simple
tokenization code, the above query becomes

+george +bush +"white +house"

See what I mean? That won't work as expected.
Anyway, I'm still a bit confused about the inner workings of Lucene,
so maybe I'll come up with something myself.
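
A minimal sketch of a quote-aware version of that tokenization, assuming
plain Java with no Lucene dependency. The class and method names
(RequireAllTerms, tokenize, requireAll) are hypothetical, and the sketch
ignores input that already contains operators such as +, -, AND or OR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns raw user input into a query string where every term is required. */
final class RequireAllTerms {

    // Matches either a double-quoted phrase or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    /** Splits the input into bare words and quoted phrases, keeping each phrase in one piece. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Prefixes every token with '+' so that the query parser treats it as required. */
    static String requireAll(String input) {
        StringBuilder query = new StringBuilder();
        for (String token : tokenize(input)) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(token);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Prints: +george +bush +"white house"
        System.out.println(requireAll("george bush \"white house\""));
    }
}
```

The resulting string can then be handed to the QueryParser as usual; the
quoted phrase stays in one piece and gets a single plus sign.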

Jari Aarniala
foo@welho.com



Re: Googlifying lucene querys [ In reply to ]
Hi Jari,

Lucene is designed as an API with its components broken out separately, so
that a developer can assemble whatever custom behavior an application needs.

One part of Lucene is the QueryParser. The QueryParser takes a search string
and, following the grammar in the current QueryParser.jj implementation,
turns it into a Lucene Query. This is meant to be a good solution for most
people, but it is just a sample of what can be done.

In the current implementation of QueryParser,

'george bush "white house"'
will create an OR query of
george OR bush OR "white house"
Basically, the default is an OR between words unless otherwise specified.

You can also use other boolean operators, such as AND and NOT, for example:
'george AND bush OR "white house" NOT ford'

Lucene and the current QueryParser support:
- wildcards with the * character
- single-character replacement with the ? character
- fuzzy searches with the ~ character after a single-word term
- proximity searches (just added to QueryParser) with ~3 after a phrase
  term
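
To make the difference between the two wildcard characters concrete, here
is a small illustration in plain Java. It is not Lucene's own code: it
simply expresses the same matching rules as a regular expression, and the
class name WildcardDemo is made up for this example.

```java
import java.util.regex.Pattern;

/** Illustration only: wildcard semantics expressed as a regular expression, not Lucene's own code. */
final class WildcardDemo {

    /** Converts a wildcard term into a regex: '*' = any run of characters, '?' = exactly one. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                // Quote everything else so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));   // true: '?' stands for the single 's'
        System.out.println(matches("te?t", "tempt"));  // false: '?' cannot cover two characters
        System.out.println(matches("te*t", "tempt"));  // true: '*' covers "mp"
    }
}
```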

Again, you can create your own QueryParser to create your desired
implementation.

I hope this helps.

--Peter




On 2/23/02 8:19 AM, "Jari Aarniala" <foo@welho.com> wrote:

>> +george +bush +white +house
>
> Well, that's pretty obvious even for me :) If you have separate words,
> just tokenize the string and add a plus in front of each of the words.
> But what I'm trying to do here is this:
>
> Let's say I have a more complicated query, say
>
> 'george bush "white house"'
>
> There you have two separate words, "george" and "bush" and then
> "white house" enclosed in quotes. If I use a piece of simple
> tokenization code, the above query becomes
>
> +george +bush +"white +house"
>
> See what I mean? That won't work as expected.
> Anyway, I'm still a bit confused about the inner workings of Lucene,
> so maybe I'll come up with something myself.
>
> Jari Aarniala
> foo@welho.com
RE: Googlifying lucene querys [ In reply to ]
In the Lucene build that we've got (2/21) the question mark does not do a
single-character replace. Does anyone know why? We're using the
StandardAnalyzer and the default QueryParser.

-----Original Message-----
From: Peter Carlson [mailto:carlson@bookandhammer.com]
Sent: Saturday, February 23, 2002 5:23 PM
To: Lucene Users List
Subject: Re: Googlifying lucene querys


RE: Googlifying lucene querys [ In reply to ]
If you put the title in a separate field from the contents, and search both
fields, matches in the title will usually be stronger, without explicit
boosting. This is because the scores are normalized by the length of the
field, and the title tends to be much shorter than the contents. So even
without boosting, title matches usually come before contents matches.
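
A rough numeric illustration of that effect, assuming the normalization
factor is 1/sqrt(number of terms in the field), which is the formula I
understand Lucene's default similarity to use. The field lengths below are
invented for the example:

```java
/** Illustration of field-length normalization, assuming a 1/sqrt(numTerms) factor. */
final class LengthNormDemo {

    /** Shorter fields get a larger factor: 1 / sqrt(number of terms in the field). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int titleTerms = 4;       // hypothetical title length
        int contentsTerms = 2500; // hypothetical body length
        System.out.println("title norm:    " + lengthNorm(titleTerms));    // 0.5
        System.out.println("contents norm: " + lengthNorm(contentsTerms)); // 0.02
    }
}
```

With these made-up lengths, a single match in the title is weighted 25
times more heavily than a single match in the contents, before any
explicit boost is applied.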

Doug

> -----Original Message-----
> From: Spencer, Dave [mailto:dave@lumos.com]
> Sent: Monday, February 25, 2002 10:22 AM
> To: Lucene Users List
> Subject: RE: Googlifying lucene querys
RE: Googlifying lucene querys [ In reply to ]
I'm pretty sure Google gives priority to words appearing in the
title and URL.

I believe section 4.2.5 says this here:
http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf
from here:
http://citeseer.nj.nec.com/brin98anatomy.html

So you have to have Lucene store the title as a separate field.

This is then the query you'd have if, like me, you boost the title by a
factor of 5 and the URL by a factor of 2 (the caret means "boost"):

+(title:george^5.0 url:george^2.0 contents:george) +(title:bush^5.0
url:bush^2.0 contents:bush) +(title:white^5.0 url:white^2.0
contents:white) +(title:house^5.0 url:house^2.0 contents:house)
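
A query string of that shape can be generated mechanically. The following
is only a sketch in plain Java: the class BoostedQueryBuilder and its
build method are hypothetical names, the words are assumed to be already
tokenized, and no escaping of special characters is attempted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that expands plain words into a boosted multi-field query string. */
final class BoostedQueryBuilder {

    /** fieldBoosts maps a field name to its boost; a boost of 1.0 is written without a caret. */
    static String build(String[] words, Map<String, Float> fieldBoosts) {
        StringBuilder query = new StringBuilder();
        for (String word : words) {
            if (query.length() > 0) {
                query.append(' ');
            }
            // Each word becomes one required group that may match in any of the fields.
            query.append("+(");
            boolean first = true;
            for (Map.Entry<String, Float> field : fieldBoosts.entrySet()) {
                if (!first) {
                    query.append(' ');
                }
                first = false;
                query.append(field.getKey()).append(':').append(word);
                if (field.getValue() != 1.0f) {
                    query.append('^').append(field.getValue());
                }
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>(); // keeps insertion order
        boosts.put("title", 5.0f);
        boosts.put("url", 2.0f);
        boosts.put("contents", 1.0f);
        System.out.println(build(new String[] {"george", "bush", "white", "house"}, boosts));
    }
}
```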


-----Original Message-----
From: Ian Lea [mailto:ian.lea@blackwell.co.uk]
Sent: Saturday, February 23, 2002 8:15 AM
To: Lucene Users List
Subject: Re: Googlifying lucene querys


RE: Googlifying lucene querys [ In reply to ]
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
>
> You cannot, in general, structure a Lucene query such that it will yield
> the same document rankings that Google would for that (query, document
> set). The reason for this is that Google employs a scoring algorithm that
> includes information about the topology of the pages (i.e., how the
> pages are linked together). (An overview of what Google does in this
> regard may be found at http://www.google.com/technology/index.html .)
> Thus, in order to get Lucene to do "what Google does", you'd have to
> rewrite large chunks of it.

I don't agree with your conclusion: you would not have to re-write much of
Lucene to incorporate this sort of information. To my understanding, Google
uses linking information as a factor in scoring. Thus every document in the
index has a factor computed from its links that is multiplied into its
score.

Lucene already keeps a factor per document that is multiplied into its
score, but one that is computed from the document's length, not its links.
Thus, once one has computed link scores, to add them to Lucene we just need
to permit applications to affect this factor, with something like a
Document.setBoost(float) method. The representation of the per-document
factor would also need to change a little internally. It is currently
stored as a single byte, and multiplying in an arbitrary factor would cause
overflow. But enlarging it to 16 bits would be a small change.

So adding such a capability would require re-writing only a very small chunk
of Lucene. Computing a link-based factor would also take some code, but
that's writing, not re-writing.
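
To illustrate the storage point, here is a sketch in plain Java. The
fixed-point encoding below (step size 1/255, values clamped to the
available range) is invented purely for this example and is not how
Lucene actually encodes the factor; all names are hypothetical.

```java
/** Sketch of a per-document score factor; names and the encoding are hypothetical, not Lucene's. */
final class DocFactorDemo {

    /** Combined factor: the length-based norm multiplied by an application-supplied boost. */
    static float factor(float lengthNorm, float linkBoost) {
        return lengthNorm * linkBoost;
    }

    /**
     * Stores a factor as an unsigned fixed-point code with the given number of bits,
     * using 1/255 as the step size. Values beyond the representable range are clamped.
     */
    static int encode(float factor, int bits) {
        int max = (1 << bits) - 1;
        int code = Math.round(factor * 255f);
        return Math.max(0, Math.min(max, code));
    }

    /** Turns a stored code back into a factor. */
    static float decode(int code) {
        return code / 255f;
    }

    public static void main(String[] args) {
        float boosted = factor(0.5f, 5.0f); // length norm 0.5, link boost 5 -> 2.5
        // With 8 bits the largest representable factor is 1.0, so 2.5 is lost.
        System.out.println("8 bits:  " + decode(encode(boosted, 8)));
        // With 16 bits there is room for the boosted value.
        System.out.println("16 bits: " + decode(encode(boosted, 16)));
    }
}
```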

Doug

RE: Googlifying lucene querys [ In reply to ]
You cannot, in general, structure a Lucene query such that it will yield
the same document rankings that Google would for that (query, document
set). The reason for this is that Google employs a scoring algorithm that
includes information about the topology of the pages (i.e., how the
pages are linked together). (An overview of what Google does in this
regard may be found at http://www.google.com/technology/index.html .)
Thus, in order to get Lucene to do "what Google does", you'd have to
rewrite large chunks of it.

Joshua

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

RE: Googlifying lucene querys [ In reply to ]
On Mon, 25 Feb 2002, Doug Cutting wrote:

> > From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
> >
> > You cannot, in general, structure a Lucene query such that it will yield
> > the same document rankings that Google would for that (query, document
> > set). The reason for this is that Google employs a scoring algorithm that
> > includes information about the topology of the pages (i.e., how the
> > pages are linked together). (An overview of what Google does in this
> > regard may be found at http://www.google.com/technology/index.html .)
> > Thus, in order to get Lucene to do "what Google does", you'd have to
> > rewrite large chunks of it.
>
> I don't agree with your conclusion: you would not have to re-write
> much of Lucene to incorporate this sort of information. To my
> understanding, Google uses linking information as a factor in scoring.
> Thus every document in the index has a factor computed from its links
> that is multiplied into its score.

It's not quite as simple as you're making it sound. Google's PageRank
algorithm is recursive: the "authority" of a page (a factor in determining
its rank) is determined in part by the authority of the pages that link to
it, and by the authority of the pages that link to each of the pages that
link to it, and so on.
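
The recursion is usually resolved by iterating until the values settle. A
minimal sketch of that idea in plain Java, assuming the published PageRank
recurrence with a damping factor of 0.85, a made-up four-page link graph,
and no pages without outgoing links (this is certainly not what Google
runs in production):

```java
import java.util.Arrays;

/** Minimal power-iteration sketch of the published PageRank recurrence. */
final class PageRankDemo {

    /**
     * links[i] lists the pages that page i links to. Every page is assumed to have
     * at least one outgoing link, so no special handling of dangling pages is needed.
     */
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                // A page passes its current rank on in equal shares to the pages it links to.
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical four-page web: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

Page 2, which is linked to by three pages, ends up with the highest value,
and page 3, which nobody links to, with the lowest. The resulting numbers
would be computed offline and then used as the per-document factor.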

> Lucene already keeps a factor per document that is multiplied into its
> score, but one that is computed from the document's length, not its
> links. Thus, once one has computed link scores, to add them to Lucene
> we just need to permit applications to affect this factor, with
> something like a Document.setBoost(float) method. The representation
> of the per-document factor would also need to change a little
> internally. It is currently stored as a single byte, and multiplying
> in an arbitrary factor would cause overflow. But enlarging it to 16
> bits would be a small change.
>
> So adding such a capability would require re-writing only a very small
> chunk of Lucene. Computing a link-based factor would also take some
> code, but that's writing, not re-writing.

On reflection, I think you're right (although I am not certain that some
other aspect of the algorithm might not require more extensive re-writing;
I'd have to check). On that basis, I retract my prefix. :)

If the original poster were interested in exactly replicating Google's
results, though, he's probably out of luck, simply because while PageRank
is a published algorithm, there have been revisions to the Google
architecture that I am reasonably sure have not been published.

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




RE: Googlifying lucene querys [ In reply to ]
I think this document is very interesting:
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

peter

> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> Sent: Monday, February 25, 2002 7:15 PM
> To: 'Lucene Users List'
> Subject: RE: Googlifying lucene querys