Everyone -
I was told to post any proposed Lucene changes/additions to the list, so
here goes:
(1) I rewrote StandardAnalyzer as StrictAnalyzer for the project I am working
on, because StandardAnalyzer does not filter enough words for my liking.
Basically all I did was add to the STOP_WORDS array; the stop words I added
are based on the default noise-word list in SQL Server 2000's full-text
indexing. (Source code below.)
(2) I would also like to propose a change to StandardTokenizer so that it
handles strings with leading and/or trailing comma(s), such as "therefore,"
and ",ice,". Currently StandardTokenizer returns no results for some of my
most basic searches because of commas adjacent to words.
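For illustration, one way to work around the comma problem without changing
StandardTokenizer itself is to strip leading and trailing commas from the raw
token text before it enters the analysis chain. The sketch below is
self-contained and hypothetical (CommaTrimmer and trimCommas are illustrative
names I made up, not part of the Lucene API):

```java
/**
 * Hypothetical helper: strips any leading and trailing commas from a token
 * string, so "therefore," and ",ice," become "therefore" and "ice".
 */
public class CommaTrimmer {

  /** Returns the token with all leading and trailing commas removed. */
  public static String trimCommas(String token) {
    int start = 0;
    int end = token.length();
    // Advance past leading commas.
    while (start < end && token.charAt(start) == ',') {
      start++;
    }
    // Back up past trailing commas.
    while (end > start && token.charAt(end - 1) == ',') {
      end--;
    }
    return token.substring(start, end);
  }

  public static void main(String[] args) {
    System.out.println(trimCommas("therefore,")); // therefore
    System.out.println(trimCommas(",ice,"));      // ice
  }
}
```

A fix inside StandardTokenizer's grammar would be cleaner, of course; this is
just the shape of the transformation being proposed.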
Comments, suggestions, questions?
Thanks,
Alan
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*; // StandardTokenizer, StandardFilter
import java.io.Reader;
import java.util.Hashtable;

/** Filters {@link StandardTokenizer} with {@link StandardFilter}, {@link
 * LowerCaseFilter} and {@link StopFilter}. */
public final class StrictAnalyzer extends Analyzer {

  private Hashtable stopTable;

  /** An array containing some common English words that are not usually
   * useful for searching. */
  public static final String[] STOP_WORDS = {
    "0","1","2","3","4","5","6","7","8","9",
    "$",
    "about","after","all","also","an","and",
    "another","any","are","as","at","be","because",
    "been","before","being","between","both","but",
    "by","came","can","come","could","did","do","does",
    "each","else","for","from","get","got","has","had",
    "he","have","her","here","him","himself","his","how",
    "if","in","into","is","it","its","just","like","make",
    "many","me","might","more","most","much","must","my",
    "never","now","of","on","only","or","other","our","out",
    "over","re","said","same","see","should","since","so",
    "some","still","such","take","than","that","the","their",
    "them","then","there","these","they","this","those","through",
    "to","too","under","up","use","very","want","was","way","we",
    "well","were","what","when","where","which","while","who","will",
    "with","would","you","your",
    "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s",
    "t","u","v","w","x","y","z"
  };

  /** Builds an analyzer with the default stop words. */
  public StrictAnalyzer() {
    this(STOP_WORDS);
  }

  /** Builds an analyzer with the given stop words. */
  public StrictAnalyzer(String[] stopWords) {
    stopTable = StopFilter.makeStopTable(stopWords);
  }

  /** Constructs a {@link StandardTokenizer} filtered by a {@link
   * StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopTable);
    return result;
  }
}
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>