Mailing List Archive

StrictAnalyzer Proposal
Everyone -

I was told to post any proposed Lucene changes/additions to the list, so
here goes:

(1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am working
on. StandardAnalyzer does not filter enough words for my liking.
Basically, all I did was add to the STOP_WORDS array. The stop words I added
are based on the default values in SQL Server 2000's text indexing. (Source
code below.)

(2) I would also like to propose a change to StandardTokenizer so that it
supports strings with leading and/or trailing commas, such as "therefore," and
",ice,". Currently StandardTokenizer is not returning any results for some
of my most basic searches because of commas adjacent to words.

Comments, suggestions, questions?

Thanks,
Alan


import org.apache.lucene.analysis.*;
// StandardTokenizer and StandardFilter live in the standard subpackage.
import org.apache.lucene.analysis.standard.*;
import java.io.Reader;
import java.util.Hashtable;

/** Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link
 * LowerCaseFilter} and {@link StopFilter}. */
public final class StrictAnalyzer extends Analyzer {
  private Hashtable stopTable;

  /** An array containing some common English words that are not usually
   * useful for searching, plus single digits, single letters and "$". */
  public static final String[] STOP_WORDS = {
    "0","1","2","3","4","5","6","7","8","9",
    "$",
    "about", "after", "all", "also", "an", "and",
    "another", "any", "are", "as", "at", "be", "because",
    "been", "before", "being", "between", "both", "but",
    "by","came","can","come","could","did","do","does",
    "each","else","for","from","get","got","has","had",
    "he","have","her","here","him","himself","his","how",
    "if","in","into","is","it","its","just","like","make",
    "many","me","might","more","most","much","must","my",
    "never","now","of","on","only","or","other","our","out",
    "over","re","said","same","see","should","since","so",
    "some","still","such","take","than","that","the","their",
    "them","then","there","these","they","this","those","through",
    "to","too","under","up","use","very","want","was","way","we",
    "well","were","what","when","where","which","while","who","will",
    "with","would","you","your",

    "a","b","c","d","e","f","g","h","i","j","k","l","m",
    "n","o","p","q","r","s","t","u","v","w","x","y","z"
  };

  /** Builds an analyzer. */
  public StrictAnalyzer() {
    this(STOP_WORDS);
  }

  /** Builds an analyzer with the given stop words. */
  public StrictAnalyzer(String[] stopWords) {
    stopTable = StopFilter.makeStopTable(stopWords);
  }

  /** Constructs a {@link StandardTokenizer} filtered by a {@link
   * StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopTable);
    return result;
  }
}

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: StrictAnalyzer Proposal
Hello,

> (1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am
> working on. StandardAnalyzer does not filter enough words for my liking.
> Basically all I did was add to the STOP_WORDS array. The stop words I
> added are based on the default values in SQL Server 2000's text indexing.
> (Source code below)

The change seems simple and looks fine to me. If nobody objects by
tonight, I'll commit it.
In the future, I'd recommend using explicit imports (not import ....*;).

> (2) I would also like to propose a change to StandardTokenizer which
> supports strings with a trailing and/or leading comma(s) such as
> "therefore," and ",ice,". Currently StandardTokenizer is not returning
> any results for some of my most basic searches because of commas
> adjacent to words.
>
> Comments, suggestions, questions?

Hm, shouldn't the commas be filtered out by one of the analyzers at both
indexing and searching time? Are you using StopAnalyzer?
Please also see http://www.jguru.com/faq/view.jsp?EID=538308
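
The point about running the same analysis at indexing and searching time
can be sketched in plain Java. This is only an illustration, not Lucene's
tokenizer: the class SymmetricAnalysisSketch and its analyze and matches
methods are hypothetical names, and splitting on every non-alphanumeric
character is an assumed rule chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not Lucene's StandardTokenizer.
final class SymmetricAnalysisSketch {

    // Split on runs of characters that are neither letters nor digits,
    // drop empty pieces, and lower-case what remains.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // The document and the query go through the same analyze() call, so
    // punctuation next to a word is stripped on both sides.
    static boolean matches(String document, String query) {
        return analyze(document).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        System.out.println(analyze(",ice,"));                                   // prints [ice]
        System.out.println(matches("Therefore, the ice melted.", "therefore,")); // prints true
    }
}
```

Because both sides are normalized identically, a query for "therefore,"
reduces to the term "therefore" and finds the indexed word.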

Otis



Re: StrictAnalyzer Proposal
On Wed, 20 Feb 2002, Otis Gospodnetic wrote:

> > (1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am
> > working on. StandardAnalyzer does not filter enough words for my
> > liking. Basically all I did was add to the STOP_WORDS array. The stop
> > words I added are based on the default values in SQL Server 2000's
> > text indexing. (Source code below)
>
> The change seems simple and looks fine to me. If nobody complains
> until tonight I'll commit it.

As Dmitry said, it seems to me that adding classes to a project which
differ from one another only in static data is poor software engineering
practice, and probably confusing to users. Since StopAnalyzer has a
constructor which allows users to specify their own arrays of stop words,
I'm not sure what the benefit of StrictAnalyzer is.
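
A minimal sketch of that argument, in plain Java rather than Lucene code:
one filter class takes its stop words through the constructor, so a
stricter list is just different data passed to the same class. All names
here (ConfigurableStopFilterSketch, DEFAULT_STOP_WORDS, STRICT_STOP_WORDS)
are hypothetical, and the two short arrays are made-up samples, not
Lucene's actual defaults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are Lucene classes.
final class ConfigurableStopFilterSketch {
    // Illustrative sample lists only.
    static final String[] DEFAULT_STOP_WORDS = {"a", "an", "and", "the", "of", "to"};
    static final String[] STRICT_STOP_WORDS =
        {"a", "an", "and", "the", "of", "to", "about", "after", "also"};

    private final Set<String> stopWords;

    // The stop word list is constructor data, so no subclass is needed per list.
    ConfigurableStopFilterSketch(String[] stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    // Returns the tokens that are not stop words, in their original order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("notes", "about", "the", "index");
        System.out.println(new ConfigurableStopFilterSketch(DEFAULT_STOP_WORDS).filter(tokens));
        // prints [notes, about, index]
        System.out.println(new ConfigurableStopFilterSketch(STRICT_STOP_WORDS).filter(tokens));
        // prints [notes, index]
    }
}
```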

On the other hand, I do think that providing a repository of alternative
prefabricated stop word arrays would be useful to users. I suggest the
following:

(1) Create an area on the Lucene website to host a repository of such
things. (Does Lucene have a 'contributions' ftp site?)
(2) Leave StopAnalyzer as is, to avoid confusing people who upgrade to
the new version, but include a link in the documentation to the
aforementioned repository.
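
One way a user might consume a list from such a repository is sketched
below, assuming the lists were published as plain text with one word per
line and '#' marking comment lines. That file format, the class name
StopWordListLoaderSketch and the load method are all assumptions made for
illustration; nothing like this is specified in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the one-word-per-line format is an assumption.
final class StopWordListLoaderSketch {

    // Reads one stop word per line, skipping blank lines and '#' comments,
    // and returns the words lower-cased as a String[].
    static String[] load(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String sample = "# sample list\n" + "der\n" + "Die\n" + "\n" + " das \n";
        System.out.println(Arrays.toString(load(new StringReader(sample))));
        // prints [der, die, das]
    }
}
```

The resulting array could then be handed to any analyzer constructor that
accepts a String[] of stop words.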

Joshua

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.






Re: StrictAnalyzer Proposal
> I suggest the following:
>
> (1) Create an area on the Lucene website to host a repository of such
> things. (Does Lucene have a 'contributions' ftp site?)
> (2) Leave StopAnalyzer as is, to avoid confusion by people upgrading
> to the new version, but include a link in the documentation to the
> aforementioned repository.

I think these and Dmitry's points are good, so I won't be committing
anything. A good stop word list would still be useful, and not just for
English!

Otis

