Mailing List Archive

Changing type of the tokens generated by pattern tokenizer
I am creating a custom Pattern Tokenizer to change the type of the
generated tokens. My incrementToken() method looks like the code below:

public boolean incrementToken() {
  if (index >= str.length()) return false;
  clearAttributes();
  if (group >= 0) {
    // match a specific group
    while (matcher.find()) {
      index = matcher.start(group);
      final int endIndex = matcher.end(group);
      if (index == endIndex) continue;
      termAtt.setEmpty().append(str, index, endIndex);
      offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
      // Changing Token Type based on the pattern matcher
      Pattern pattern = Pattern.compile("\\p{Alnum}+");
      Matcher matcher = pattern.matcher(input.toString());
      boolean matchFound = matcher.find();
      if (matchFound) {
        typeAttribute.setType("some_random_type".toLowerCase());
      }
      return true;
    }
  }
}

I'm trying to change the type of a generated token, using typeAttribute,
whenever the token matches a particular regex. Here, I am using the
pattern "\p{Alnum}+", so every alphanumeric token should get the new
type.

Currently, I am getting the token as:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "word", "position" : 0 } ]

I want the above token to be like:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "some_random_type", "position" : 0 } ]

Since the token matches the pattern "\p{Alnum}+", its type should be
changed to the type passed to typeAttribute.setType().

However, the code I have written gives every token the type
"some_random_type". Even a token that does not match the pattern
"\p{Alnum}+" gets the type "some_random_type".

How can I make only the tokens that match the pattern "\p{Alnum}+" get
the type "some_random_type"?

Re: Changing type of the tokens generated by pattern tokenizer
Hi,
You pass input.toString() to the matcher. This is the entire source
character stream to be tokenized, not the current token, and I think
this leads to the result you saw.
If you'd like to match the pattern against a specific token (a substring
of the input), I think you should give that substring of the input string
to the matcher, as termAtt.append() does in your code.
Also, I'd suggest including "^" and "$" in your regex to avoid
unintentional matches.
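
The suggestion above can be sketched with plain java.util.regex, outside
of Lucene. This is only an illustration under assumptions: the class name
PerTokenTypeDemo, the whitespace-splitting token pattern "\S+", and the
"term/type" output format are hypothetical, and "word" simply stands in
for the default type seen in the output above. The point is that the
type check runs against the current token's substring, with an anchored
pattern, rather than against the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: it mimics the tokenizer loop with java.util.regex only.
final class PerTokenTypeDemo {
    // Anchored with ^ and $ so that only a fully alphanumeric token matches.
    // Compiled once and reused instead of being recompiled for every token.
    private static final Pattern ALNUM = Pattern.compile("^\\p{Alnum}+$");

    /** Returns "term/type" strings for each non-empty match of tokenPattern in str. */
    static List<String> tokenize(String str, Pattern tokenPattern) {
        List<String> out = new ArrayList<>();
        Matcher matcher = tokenPattern.matcher(str);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end();
            if (start == end) continue; // skip empty matches, as in the original loop
            // Check only the current token's substring, not the whole input.
            CharSequence term = str.subSequence(start, end);
            String type = ALNUM.matcher(term).find() ? "some_random_type" : "word";
            out.add(term + "/" + type);
        }
        return out;
    }

    public static void main(String[] args) {
        // "foo-bar" contains '-', so it keeps the assumed default type "word".
        System.out.println(tokenize("testing foo-bar 42", Pattern.compile("\\S+")));
    }
}
```

In the tokenizer itself, the equivalent change would presumably be to run
the type pattern over the range (index, endIndex) of str instead of over
input.toString().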

Tomoko


On Tue, May 3, 2022 at 16:47, dishant sharma <dishantsharma0903@gmail.com> wrote:

> [...]

Re: Changing type of the tokens generated by pattern tokenizer
As an alternative to writing a custom tokenizer, you can use the built-in
PatternTypingFilter, which does exactly this (it sets a token's type
based on whether the token matches some regex).

https://lucene.apache.org/core/9_1_0/analysis/common/org/apache/lucene/analysis/pattern/PatternTypingFilter.html
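
To illustrate the general idea of typing tokens in a separate step after
tokenization, here is a minimal sketch using only java.util.regex. It
does not use the Lucene class or its API: the class name TypingRulesDemo,
the "number" rule, and the "word" fallback are hypothetical, and the
first-matching-rule-wins order is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a typing step; this is NOT the Lucene PatternTypingFilter API.
final class TypingRulesDemo {
    // Ordered rules: the first pattern that matches a token decides its type.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        RULES.put(Pattern.compile("^\\p{Digit}+$"), "number");           // hypothetical extra rule
        RULES.put(Pattern.compile("^\\p{Alnum}+$"), "some_random_type"); // rule from the question
    }

    /** Returns the type of the first matching rule, or "word" when no rule matches. */
    static String typeFor(CharSequence token) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(token).find()) {
                return rule.getValue();
            }
        }
        return "word"; // assumed default type, as seen in the question's output
    }

    public static void main(String[] args) {
        for (String token : new String[] {"42", "testing", "foo-bar"}) {
            System.out.println(token + " -> " + typeFor(token));
        }
    }
}
```

Keeping the typing rules apart from the tokenization loop means the
tokenizer can stay unmodified; the exact constructor and rule format of
the real filter should be taken from the Javadoc linked above.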

On Tue, May 3, 2022 at 3:47 AM dishant sharma
<dishantsharma0903@gmail.com> wrote:
> [...]
