Mailing List Archive: lucene 4.10.4 punctuation

Aug 25, 2021, 9:33 AM

Post #1 of 4 (414 views)

Hello,

I'm part of a team that maintains
http://exist-db.org/exist/apps/homepage/index.html
It's an open source XML database, and we use Lucene 4.10.4.
I'm trying to support punctuation in our search feature.
Is there an analyzer that provides that, or a way to do it with the 4.10.4 API?

Thanks,
Younes

RE: lucene 4.10.4 punctuation [ In reply to ]

uwe at thetaphi

Aug 25, 2021, 9:43 AM

Post #2 of 4 (414 views)

Hi,

You should explain to us exactly what you want to do: How do you want to search, and what do your documents look like? Why is it important to match on punctuation, and what should this matching look like?

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: lucene 4.10.4 punctuation [ In reply to ]

younes at evolvedbinary

Aug 26, 2021, 3:06 AM

Post #3 of 4 (414 views)

Hi, thanks for getting back to me so quickly.
To give some context, there are two things we would like to be able to
do:

1. We want to have the option to be able to search on terms that
include punctuation. So for example, if we have the two texts: "they
sent an S.O.S from", and "she wrote SOS, but she meant Soz", the user
may want to search for the acronym of 'Save our Souls', which would be
"S.O.S", and in this instance they only want to match the first text,
i.e. "they sent an S.O.S from", and not the second.

2. We want to have the option to make our searches case-sensitive. By
default, I think in Lucene with the StandardAnalyzer everything is
converted to lower-case at both index and search time. Instead we want
upper/lower case to be important, so that for example the texts "Hello
said bob", "mike says hello to bob", "The project HeLlO" and "HELLO is
the acronym for" are all different texts, and if the user were to
search for "Hello" they would only match one of the texts, i.e. "Hello
said bob".
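
To make the second requirement concrete, here is a minimal plain-Java sketch of the matching we have in mind. It does not use Lucene at all; the class name CaseSensitivityDemo and the method search are made up for illustration, and splitting on whitespace is only an assumption that stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: plain Java, no Lucene classes involved.
final class CaseSensitivityDemo {

    static final List<String> TEXTS = Arrays.asList(
            "Hello said bob",
            "mike says hello to bob",
            "The project HeLlO",
            "HELLO is the acronym for");

    // Returns the texts that contain the query as a whole whitespace-separated term.
    // When lowercase is true, both the query and the terms are lowercased first,
    // which is the effect a lowercasing analyzer has at index and search time.
    static List<String> search(String query, boolean lowercase) {
        String q = lowercase ? query.toLowerCase(Locale.ROOT) : query;
        List<String> hits = new ArrayList<>();
        for (String text : TEXTS) {
            for (String term : text.split("\\s+")) {
                String t = lowercase ? term.toLowerCase(Locale.ROOT) : term;
                if (t.equals(q)) {
                    hits.add(text);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Case-sensitive: only "Hello said bob" contains the exact term "Hello".
        System.out.println(search("Hello", false));
        // Lowercased: all four texts contain the term "hello".
        System.out.println(search("Hello", true).size());
    }
}
```

The case-sensitive call is the behaviour we want; the lowercased call is what we believe we get today.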

Does that help?

RE: lucene 4.10.4 punctuation [ In reply to ]

trevor at castingthevoid

Aug 26, 2021, 3:31 AM

Post #4 of 4 (414 views)

Hi

You want to write your own analyzer that does not lowercase terms and that splits terms at non-alphanumeric characters, keeping each punctuation character as a term of its own. You'd use the same analyzer for indexing and for searching. Thus, when building the index, S.O.S is indexed as the five terms S . O . S, and if you search for S.O.S you search for the five consecutive terms S . O . S. If you don't split terms like this, then a word at the end of a sentence (for example "bob.") will be indexed as a different term from the same word within a sentence ("bob").
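
To illustrate the idea without any Lucene classes, here is a minimal plain-Java sketch. The names PunctuationTermsDemo, terms and matches are invented for this example, and the "consecutive terms" check is only a stand-in for what a phrase search over the index would do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: plain Java, no Lucene classes involved.
final class PunctuationTermsDemo {

    // Emits each run of letters, digits or underscores as one term and every
    // other non-whitespace character as a single-character term. Case is kept.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                out.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    // True when the query's terms occur consecutively within the text's terms.
    static boolean matches(String text, String query) {
        return Collections.indexOfSubList(terms(text), terms(query)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(terms("S.O.S"));                                        // [S, ., O, ., S]
        System.out.println(matches("they sent an S.O.S from", "S.O.S"));           // true
        System.out.println(matches("she wrote SOS, but she meant Soz", "S.O.S"));  // false
        System.out.println(matches("mike says hello to bob.", "bob"));             // true
    }
}
```

The last call shows the point about sentence endings: because the full stop becomes its own term, "bob" is still found when it is followed by punctuation.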

So something like the following, which is adapted from an application I have running here (note that I'm using Lucene 8.6.3, so you will need to make the appropriate adjustments for 4.10.4):

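import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MyAnalyzer extends Analyzer {

    // Lucene 8.x signature. In Lucene 4.x, createComponents also takes a Reader,
    // which is passed to the tokenizer's constructor.
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace only, so punctuation and case survive tokenization.
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        // Split each whitespace-delimited token further at punctuation.
        TokenStream result = new MyTokenFilter(src);
        return new TokenStreamComponents(src, result);
    }
}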

and

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class MyTokenFilter extends TokenFilter {
    private final CharTermAttribute termAttr;
    private final PositionIncrementAttribute posIncAttr;
    // Sub-tokens of the current input token that have not been emitted yet.
    private final ArrayList<String> termStack;
    // Attribute state of the input token the sub-tokens came from.
    private AttributeSource.State current;

    public MyTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        termStack = new ArrayList<>();
        termAttr = addAttribute(CharTermAttribute.class);
        posIncAttr = addAttribute(PositionIncrementAttribute.class);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Drop anything left over from a previous use of this stream.
        termStack.clear();
        current = null;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Refill the stack from the next input token; skip tokens that yield nothing.
        while (termStack.isEmpty()) {
            if (!input.incrementToken()) {
                return false;
            }
            termStack.addAll(Arrays.asList(myTokens(termAttr.toString())));
            current = captureState();
        }

        // Emit the next sub-token with the attributes of the input token it came
        // from. Each sub-token gets its own position, so a search for S.O.S can
        // match the five terms consecutively.
        String part = termStack.remove(0);
        restoreState(current);
        termAttr.setEmpty().append(part);
        posIncAttr.setPositionIncrement(1);
        return true;
    }

    // Splits a whitespace-free token into runs of letters, digits and underscores,
    // plus one single-character token for every other character.
    public static String[] myTokens(String t) {
        List<String> tokenlist = new ArrayList<>();
        StringBuilder next = new StringBuilder();

        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            // Compare as a char: "_".equals(c) would box c to a Character and never match.
            if (Character.isLetterOrDigit(c) || c == '_') {
                next.append(c);
            } else {
                if (next.length() > 0) {
                    tokenlist.add(next.toString());
                    next.setLength(0);
                }
                // Whitespace shouldn't be possible here because the input stream
                // has already been tokenized on whitespace; ignore it if it occurs.
                if (!Character.isWhitespace(c)) {
                    tokenlist.add(String.valueOf(c));
                }
            }
        }
        if (next.length() > 0) {
            tokenlist.add(next.toString());
        }
        return tokenlist.toArray(new String[0]);
    }
}
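
One detail worth checking in a splitter like myTokens is how the underscore is compared. A test of the form "_".equals(c) with a char c compiles, but the char is boxed to a Character and String.equals returns false for anything that is not a String, so the underscore would be treated as punctuation. The small plain-Java sketch below shows the difference; the class and method names are made up for illustration.

```java
// Hypothetical illustration only: plain Java, no Lucene classes involved.
final class UnderscoreCheckDemo {

    // The char is autoboxed to a Character, which is never equal to a String,
    // so this returns false for every input, including '_'.
    static boolean viaStringEquals(char c) {
        return "_".equals(c);
    }

    // Comparing char to char behaves as intended.
    static boolean viaCharCompare(char c) {
        return c == '_';
    }

    public static void main(String[] args) {
        System.out.println(viaStringEquals('_')); // false
        System.out.println(viaCharCompare('_'));  // true
    }
}
```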

Cheers
T
