Mailing List Archive

Lucene and the numbers (again!)
Dear mates:

Yeah, I know that I'm not very original, and maybe the FAQ can resolve my problems, but I didn't find any real help there. OK... here we go:

The problem is... I have a J2EE application working with Lucene, and the basic searching works properly, but when I try to use a query with a number, Lucene gives me back all my indexed documents and I don't understand why.

I'm using the SimpleAnalyzer. Say, for example, I have a document with a field named "name". How can I search for a name like '456'?

Thank you for your help and advice!!!

__________________________
David Bonilla Fuertes
THE BIT BANG NETWORK
http://www.bit-bang.com
Profesor Waksman, 8, 6º B
28036 Madrid
SPAIN
Tel.: (+34) 914 577 747
Móvil: 656 62 83 92
Fax: (+34) 914 586 176
__________________________
Re: Lucene and the numbers (again!)
David,


> Yeah I know that I'm not very original and maybe the FAQ can resolve
> my problems but I didn't find any real help there. Ok... here we go:

This is indeed a FAQ, and it also comes up often on the list, if
you check the archives.

Come to think of it, where *are* the archives now? I'm looking
at http://www.mail-archive.com/ and I don't see the more recent
(post-move-to-jakarta) postings there. The archive seems to end on
October 5th. Are we using a new archive now? Are the messages from
the old archive there?

> The problem is... I have a J2EE application working with LUCENE and
> the basic searching works properly but when I try to use a query
> with a number, Lucene gives me back all my indexed documents and I
> don't understand why.
>
> I'm using the SimpleAnalyzer. If I have for example a document with
> a field named "name". How can I search for example a name like '456'?

If you look into... hm, well, I was going to say if you look into
the API docs, but it's not that simple. I remember somebody in the
past saying (on this list) to simply use StandardAnalyzer instead of
StopAnalyzer. I don't know if this works (I'll have to take time
later this afternoon and check the source code - this is one of the
things that's been on my to-do list for a month or so, but I've been
preoccupied with other areas of my project).

I guess I should note the following details:

StandardAnalyzer is not listed in the "Package
org.apache.lucene.analysis" page of the API docs (from the
lucene-1.2-rc2 checkout), just SimpleAnalyzer and StopAnalyzer.

When I track down the API docs for StandardAnalyzer and compare them,
neither StandardAnalyzer nor StopAnalyzer says anything about numbers.

Nor do StandardFilter or StopFilter mention numbers.

Searching for "numeric" turns up nothing; searching for "number" turns
up:

"27. How does Lucene handle numbers and special characters ?

This depends of the analyzer you are using for indexing an searching."



I checked out the Lucene FAQ source in the past, with the intent
of going through it and checking for typos, etc., as a good way to
force myself to read it all as well as contributing something back to
Lucene. I think that was from the pre-jakarta days; I should probably
get a fresh, jakarta-based checkout. Is all of that stuff still in
the "website" checkout?

Steven J. Owens
puff@darksleep.com


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Lucene and the numbers (again!)
FWIW, I had a similar problem a while back (wanting to index the "model #"
in my index and then let people who searched for a model # get what they were
expecting). So I had to stop using the stock LowerCaseTokenizer, and
I made up my own (cutting and pasting all of the stuff from
LowerCaseTokenizer), minus the removal of numerics.

My analyzer is:

public class PorterAnalyzer extends Analyzer
{
    public TokenStream tokenStream(Reader reader)
    {
        // lower-cased letter/digit tokens -> stop-word removal -> Porter stemming
        return new PorterStemFilter(
            new StopFilter(new LowerCaseNumericTokenizer(reader),
                           StopAnalyzer.ENGLISH_STOP_WORDS));
    }
}

My tokenizer is as follows (still haven't recompiled with the jakarta
lucene):

package mypackage.index;

import java.io.Reader;
import com.lucene.analysis.Tokenizer;
import com.lucene.analysis.Token;

/* This class is identical to the com.lucene.analysis.LowerCaseTokenizer
 * class except that it will not trim off numbers
 */
public final class LowerCaseNumericTokenizer extends Tokenizer {
    public LowerCaseNumericTokenizer(Reader in) {
        input = in;
    }

    private int offset = 0, bufferIndex = 0, dataLen = 0;
    private final static int MAX_WORD_LEN = 255;
    private final static int IO_BUFFER_SIZE = 1024;
    private final char[] buffer = new char[MAX_WORD_LEN];
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];

    public final Token next() throws java.io.IOException {
        int length = 0;
        int start = offset;
        while (true) {
            final char c;

            offset++;
            if (bufferIndex >= dataLen) {           // refill the I/O buffer
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            }

            if (dataLen == -1) {                    // end of input
                if (length > 0)
                    break;                          // return the last token
                else
                    return null;
            } else
                c = ioBuffer[bufferIndex++];

            if (Character.isLetterOrDigit(c)) {     // if it's a letter or digit
                if (length == 0)                    // start of token
                    start = offset - 1;

                buffer[length++] = Character.toLowerCase(c);  // buffer it
                if (length == MAX_WORD_LEN)         // buffer overflow!
                    break;
            } else if (length > 0)                  // at non-letter/digit w/ chars
                break;                              // return 'em
        }

        return new Token(new String(buffer, 0, length), start, start + length);
    }
}

> -----Original Message-----
> From: Steven J. Owens [mailto:puffmail@darksleep.com]
> Sent: Tuesday, November 13, 2001 2:07 PM
> To: Lucene Users List; David Bonilla
> Subject: Re: Lucene and the numbers (again!)
RE: Lucene and the numbers (again!)
I had the same problem (searching for Alfa 147 [it's a very cool car]).

SimpleAnalyzer uses LowerCaseTokenizer. This "divides text at
non-letters and converts them to lower case." (source: API docs)
Since digits are non-letters, they are treated as separators and
thrown away, so the string "147" produces no tokens at all.
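
To see what splitting at non-letters does to a string like "Alfa 147", here is a minimal stand-alone sketch in plain Java. It does not use any Lucene classes; the class name TokenSplitDemo and its split method are made up for this illustration, and it only imitates the rule quoted from the API docs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo; it does not use any Lucene classes.
final class TokenSplitDemo {

    // Splits text into lower-cased tokens. With keepDigits == false only
    // letters count as token characters (the "divide at non-letters" rule);
    // with keepDigits == true digits count as token characters too.
    static List<String> split(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean tokenChar = keepDigits ? Character.isLetterOrDigit(c)
                                           : Character.isLetter(c);
            if (tokenChar) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // a separator ends the token collected so far
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Alfa 147", false)); // prints [alfa]
        System.out.println(split("Alfa 147", true));  // prints [alfa, 147]
        System.out.println(split("147", false));      // prints []
    }
}
```

With the letters-only rule the digits act purely as separators, so a bare "147" yields an empty token list and the number never makes it into the index or the query.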

I managed to use numbers in queries after switching to StandardAnalyzer
on both the query side and the index side. Try it!
(It is in the org.apache.lucene.analysis.standard package.)
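
Why the analyzer has to match on both sides can be sketched with a toy in-memory index in plain Java. Nothing here is Lucene API: SameAnalyzerDemo, LETTERS_ONLY and LETTERS_AND_DIGITS are hypothetical names, and the two "analyzers" are just simple stand-ins for a letters-only tokenizer and one that also keeps digits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical toy index; none of these names come from Lucene.
final class SameAnalyzerDemo {

    // Lower-cases text and splits it at every non-token character.
    static List<String> tokens(String text, boolean keepDigits) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean ok = keepDigits ? Character.isLetterOrDigit(c) : Character.isLetter(c);
            if (ok) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    // Stand-ins for a letters-only analyzer and a digit-preserving one.
    static final Function<String, List<String>> LETTERS_ONLY = t -> tokens(t, false);
    static final Function<String, List<String>> LETTERS_AND_DIGITS = t -> tokens(t, true);

    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Function<String, List<String>> indexAnalyzer;

    SameAnalyzerDemo(Function<String, List<String>> indexAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
    }

    // Records which document ids contain each analyzed term.
    void add(int docId, String text) {
        for (String term : indexAnalyzer.apply(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns ids of documents that contain every analyzed query term;
    // a query that analyzes to no terms matches nothing in this sketch.
    Set<Integer> search(String query, Function<String, List<String>> queryAnalyzer) {
        Set<Integer> result = null;
        for (String term : queryAnalyzer.apply(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        SameAnalyzerDemo index = new SameAnalyzerDemo(LETTERS_AND_DIGITS);
        index.add(1, "Alfa 147");
        index.add(2, "Alfa 156");
        System.out.println(index.search("147", LETTERS_AND_DIGITS)); // prints [1]
        System.out.println(index.search("147", LETTERS_ONLY));       // prints []
    }
}
```

If the index had been built with the letters-only stand-in instead, the term "147" would never have been stored, so changing the analyzer only on the query side would not help either; the documents would need to be re-indexed.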

peter

> -----Original Message-----
> From: Steven J. Owens [mailto:puffmail@darksleep.com]
> Sent: Tuesday, November 13, 2001 8:07 PM
> To: Lucene Users List; David Bonilla
> Subject: Re: Lucene and the numbers (again!)