Mailing List Archive: text highlighting problem

Mar 11, 2007, 4:21 AM

Post #1 of 3 (2685 views)

Hi!
I have a problem with text highlighting. I am using a special
library that handles text reduction/canonization (the language is
Polish). The reduced words have no inflection (which is a useful feature
for searching). The problem appears when I try to perform text
highlighting: the highlighting tags are not in the correct position
(they are shifted), for example:
this is s<br>ome tex</br> t to highlight.
The canonization library is integrated with the analyzer in the
following manner:
public final class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        MyCanonizer textCanonizer = new MyCanonizer();
        TokenStream ts = new StandardTokenizer(
            textCanonizer.peformCanonization(reader));
        return ts;
    }
}

Could anybody explain why the highlights are shifted and/or how to solve
the problem?

Thanks,
JanK

Re: text highlighting problem

sarowe at syr

Mar 12, 2007, 7:35 AM

Post #2 of 3 (2508 views)

Hi Jan,

It sounds like your "canonizer" reduces the number of Java characters in
the input. Have you performed character decomposition, so that
each diacritic is a separate Java character, before the canonizer does
its work? If so, the character offsets recorded by StandardTokenizer
will not match the original input text.

If the above is true:

1. You could implement character decomposition[1], as well as diacritic
stripping (which I assume is part of its job) in your canonizer, instead
of in some other pre-processing step. This way, the character counts
will remain the same before and after the canonizer does its work.
(This may require you to add a character composition step to your
pre-processing.)

2. If the input text cannot be changed (that is, it must remain
decomposed), then you could have your canonizer put one space per
diacritic after each word containing one, instead of just stripping
diacritics. In this way, word boundaries will remain in the same positions.

Hope it helps,
Steve

[1] http://unicode.org/reports/tr15/
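
The two options above can be sketched with the JDK alone. This is only an
illustration under assumptions: the class and method names are hypothetical,
java.text.Normalizer (Java 6+) stands in for whatever decomposition step is
actually in use, and the real canonizer may behave differently.

```java
import java.text.Normalizer;

// Hypothetical helper; only java.text.Normalizer and Character are real JDK APIs.
final class DiacriticOffsets {
    // Decompose: each diacritic becomes a separate combining char (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Option 1 helper: recompose (NFC) so an accented letter is one char again
    // wherever a precomposed form exists.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Plain stripping: removes combining marks, so the text gets shorter
    // and every later offset shifts left.
    static String strip(String decomposed) {
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    // Option 2: strip combining marks, but append one space per removed mark
    // at the end of the word, so the next word starts at the same offset.
    static String stripAndPad(String decomposed) {
        StringBuilder out = new StringBuilder();
        int pending = 0; // marks removed from the current word
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                pending++;
                continue;
            }
            if (!Character.isLetterOrDigit(c)) {
                // The word has ended: emit the padding before the separator.
                for (; pending > 0; pending--) out.append(' ');
            }
            out.append(c);
        }
        for (; pending > 0; pending--) out.append(' ');
        return out.toString();
    }
}
```

With the padded variant the overall length and the start offset of every
word match the decomposed input; only the end offset of a padded word moves
left, which does not matter as long as highlighting uses the token
offsets recorded against the padded text.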


Re: text highlighting problem

hossman_lucene at fucit

Mar 12, 2007, 1:41 PM

Post #3 of 3 (2505 views)

: MyCanonizer textCanonizer = new MyCanonizer();
: TokenStream ts = new
: StandardTokenizer(textCanonizer.peformCanonization(reader));
: return ts;

: Could anybody say why the highlights are shifted and/or how to solve the
: problem ?

the highlights are shifted because the positions the highlighter knows
about are not the same positions as in your source reader; they are the
positions in the reader returned by textCanonizer.peformCanonization(reader)
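
The effect can be reproduced without Lucene. In this rough sketch the
canonizer is a made-up stand-in (it just drops a trailing "ing" from each
word) and all names are hypothetical; the point is only that offsets found
in the canonized text land in the wrong place in the original.

```java
// Hypothetical demo class; no Lucene classes are used.
final class ShiftedHighlightDemo {
    // Made-up stand-in for the canonizer: drops a trailing "ing" from each word.
    static String canonize(String text) {
        return text.replaceAll("ing\\b", "");
    }

    // Wraps the [start, end) character range of text in <b> tags.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start) + "<b>" + text.substring(start, end)
                + "</b>" + text.substring(end);
    }

    // Finds the term in the CANONIZED text, then applies those offsets to the
    // ORIGINAL text. This is the mismatch described above.
    static String highlightViaCanonizedOffsets(String original, String term) {
        String canonized = canonize(original);
        int start = canonized.indexOf(term);
        if (start < 0) {
            return original; // term not present, nothing to highlight
        }
        return highlight(original, start, start + term.length());
    }

    public static void main(String[] args) {
        // Prints: searchi<b>ng t</b>ext highlighting
        System.out.println(
            highlightViaCanonizedOffsets("searching text highlighting", "text"));
    }
}
```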

you would probably be better off implementing your special logic as either a
TokenFilter, in which case you just modify the text but leave the position
info alone, or as a Tokenizer that emits Tokens in which you have already
modified the text, but for which you record the original positions.

(not understanding what exactly Canonization is makes it hard to know if
the TokenFilter approach will work for you, but if it does it's probably
the simplest/most reusable)
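
The filter idea can be sketched in plain Java. Everything here is an
assumption for illustration: Tok, tokenize, canonizeWord and the
"drop a trailing ing" rule are hypothetical and are not Lucene's API. In
Lucene the same shape would be a TokenFilter wrapped around the tokenizer
that reads the original reader.

```java
import java.util.ArrayList;
import java.util.List;

// Library-free sketch: tokenize the ORIGINAL text, then change only token text.
final class OffsetPreservingSketch {
    // Minimal token: term text plus [start, end) offsets into the original input.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Tokenizer step: split on non-letter/digit chars, recording original offsets.
    static List<Tok> tokenize(String original) {
        List<Tok> toks = new ArrayList<Tok>();
        int i = 0;
        int n = original.length();
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(original.charAt(i))) i++;
            int start = i;
            while (i < n && Character.isLetterOrDigit(original.charAt(i))) i++;
            if (i > start) toks.add(new Tok(original.substring(start, i), start, i));
        }
        return toks;
    }

    // Made-up stand-in for per-word canonization.
    static String canonizeWord(String w) {
        return w.endsWith("ing") ? w.substring(0, w.length() - 3) : w;
    }

    // Filter step: replace each token's text, leave its offsets untouched.
    static List<Tok> canonizeTokens(List<Tok> in) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : in) out.add(new Tok(canonizeWord(t.text), t.start, t.end));
        return out;
    }

    // Highlights the first token whose canonized text equals the term,
    // using the offsets recorded against the original text.
    static String highlight(String original, String term) {
        for (Tok t : canonizeTokens(tokenize(original))) {
            if (t.text.equals(term)) {
                return original.substring(0, t.start) + "<b>"
                        + original.substring(t.start, t.end) + "</b>"
                        + original.substring(t.end);
            }
        }
        return original;
    }
}
```

Because the offsets never pass through the canonized text, searching for
the reduced form "highlight" still wraps the full original word
"highlighting".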

-Hoss