Mailing List Archive

Is there a StrlenFilter yet?
Use case: you want to protect yourself against pathological docs,
such as one with a string of a million consecutive characters. Any
normal tokenizer will treat this as one big token, but there's probably
no point in indexing a token that is a million characters long.
One example is indexing a mailing list that could contain uuencoded
attachments: there could be lots of junk lines 72 or so chars long.

Anyway - I've attached a possible impl.

Discussion question: let's say the filter is told to only return
tokens <= 5 chars long (note: I think 16 or so would be more realistic
for most docs; this is just for the sake of example).

What if there is a token 6 chars long, i.e. longer than the limit?
Say it is "abcdef".

Then either:

[a] we ignore "abcdef" and assume it is garbage
or
[b] we return "abcde" and "bcdef", i.e. all 5-char substrings
of it, so that someone who wants to search on the 6-char string
sort of still can (at least w/ a carefully chosen query...hmmm..).
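
To make the two choices concrete, here is a rough sketch on plain strings, with no Lucene dependency. The class LongTokenSketch and its methods dropIfTooLong and windows are hypothetical names made up for this illustration; they are not part of Lucene or of the filter posted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of Lucene): two ways to treat an over-long term. */
public final class LongTokenSketch {

    private LongTokenSketch() {}

    /** Option [a]: drop any term longer than max, keep everything else. */
    public static List<String> dropIfTooLong(String term, int max) {
        if (term.length() > max) {
            return Collections.emptyList();
        }
        return Collections.singletonList(term);
    }

    /** Option [b]: replace an over-long term with every substring of length max. */
    public static List<String> windows(String term, int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive");
        }
        if (term.length() <= max) {
            return Collections.singletonList(term);
        }
        List<String> out = new ArrayList<>();
        // Slide a window of width max across the term, one character at a time.
        for (int start = 0; start + max <= term.length(); start++) {
            out.add(term.substring(start, start + max));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dropIfTooLong("abcdef", 5)); // prints []
        System.out.println(windows("abcdef", 5));       // prints [abcde, bcdef]
    }
}
```

One thing this makes obvious: under [b] a term of length n turns into n - max + 1 pieces, so the million-char string would still put close to a million terms into the index, which is roughly what the filter was meant to prevent.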

Anyway here's some code.
If popular it could be put into StandardAnalyzer.

--------
package com.tropo.lucene;

import java.io.IOException;
import org.apache.lucene.analysis.*;

/**
 * Removes words that are too long or too short from the stream.
 */
public final class StrlenFilter
    extends TokenFilter
{
    /**
     * Build a filter that removes words that are too long or too
     * short from the text. Both bounds are inclusive.
     */
    public StrlenFilter(TokenStream in, int min, int max)
    {
        input = in;
        this.min = min;
        this.max = max;
    }

    /**
     * Returns the next input Token whose termText() is the right
     * length, or null at end of stream.
     */
    public final Token next() throws IOException
    {
        // return the first token whose length is within [min, max]
        for (Token token = input.next(); token != null; token = input.next())
        {
            final int len = token.termText().length();
            if (len >= min && len <= max)
                return token;
            // note: else we ignore it, but should we index each part of it?
        }
        // reached EOS -- return null
        return null;
    }

    private final int min;
    private final int max;
}



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>