Mailing List Archive: about bigram based word segment

about bigram based word segment

Sep 12, 2002, 6:43 PM

Post #1 of 3 (292 views)

> I don't know any Asian languages, but from earlier experiments I
> remember that bigram tokenization could sometimes hurt matching, e.g.:
>
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
>
If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searching for "w1w2" and for "w2w1" will return the same results, won't it?

With bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
the character-order problem above is avoided.
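
As a rough illustration of the point above, here is a minimal Python sketch (not Lucene code). The helper names `ngrams` and `matches_all` are hypothetical, and each list element stands for one Chinese character. It assumes the query terms are combined with AND.

```python
def ngrams(chars, n):
    """Return the overlapping n-grams of a character sequence as tuples."""
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def matches_all(doc_chars, query_chars, n):
    """AND semantics: every n-gram of the query must occur somewhere in the document."""
    doc_terms = set(ngrams(doc_chars, n))
    query_terms = ngrams(query_chars, n)
    return bool(query_terms) and all(t in doc_terms for t in query_terms)


doc = ["w1", "w2", "w3"]

# Single-character tokens: character order is lost, so both queries match.
unigram_forward = matches_all(doc, ["w1", "w2"], 1)
unigram_reversed = matches_all(doc, ["w2", "w1"], 1)

# Bigram tokens: only the query in document order matches.
bigram_forward = matches_all(doc, ["w1", "w2"], 2)
bigram_reversed = matches_all(doc, ["w2", "w1"], 2)

print(unigram_forward, unigram_reversed, bigram_forward, bigram_reversed)
```

With single-character tokens both the forward and the reversed query match the document; with bigram tokens only the forward query does.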

According to the statistics, bigram-based word segmentation returned the best results, but it needs the QueryParser to combine query terms with an "AND" relation by default.
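
Why the AND default matters can be sketched in a few lines of plain Python. This is an illustration only, not the Lucene QueryParser; `bigrams` and `search` are hypothetical names and the three documents are made up. With OR, any document that shares a single bigram with the query is returned.

```python
def bigrams(chars):
    """Return the set of overlapping bigrams of a character sequence."""
    return {tuple(chars[i:i + 2]) for i in range(len(chars) - 1)}


def search(docs, query_chars, operator="AND"):
    """Return the indices of documents whose bigrams satisfy the query under AND or OR."""
    query_terms = bigrams(query_chars)
    if not query_terms:
        return []  # a query shorter than two characters yields no bigrams
    combine = all if operator == "AND" else any
    return [i for i, doc in enumerate(docs)
            if combine(t in bigrams(doc) for t in query_terms)]


docs = [
    ["w1", "w2", "w3"],        # contains the whole query string
    ["w5", "w2", "w3", "w6"],  # shares only the bigram w2w3 with the query
    ["w7", "w8"],              # unrelated
]
query = ["w1", "w2", "w3"]

and_hits = search(docs, query, "AND")
or_hits = search(docs, query, "OR")
print(and_hits, or_hits)
```

Under AND only the first document is returned; under OR the second one comes back as well, although it contains only part of the query.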

You can try bigram-based word segmentation at http://search.163.com in the category search and the news search (the web page search is powered by Google).
Google's Chinese language analysis is provided by Basistech and uses dictionary-based word segmentation:
http://www.basistech.com/products/language-analysis/cma.html

Che, Dong

Re: about bigram based word segment

murzaku at yahoo

Sep 13, 2002, 5:46 AM

Post #2 of 3 (282 views)

--- Che Dong <chedong@hotmail.com> wrote:
> If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
> then searching for "w1w2" and for "w2w1" will return the same results, won't
> it?

That wouldn't be the case if you quote the two characters (and thereby
submit a "phrase query"). But this discussion would be more
appropriate in the user group...
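
A rough sketch of why quoting helps, assuming a positional index over single-character tokens. This is plain Python, not Lucene's API; `build_positions` and `phrase_match` are hypothetical names.

```python
from collections import defaultdict


def build_positions(chars):
    """Map each single-character token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, ch in enumerate(chars):
        positions[ch].append(pos)
    return positions


def phrase_match(chars, phrase):
    """True if the tokens of `phrase` occur at consecutive positions, in order."""
    if not phrase:
        return False
    positions = build_positions(chars)
    return any(
        all(start + offset in positions.get(tok, ())
            for offset, tok in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )


doc = ["w1", "w2", "w3"]
forward = phrase_match(doc, ["w1", "w2"])
reverse = phrase_match(doc, ["w2", "w1"])
print(forward, reverse)
```

Because the positions are checked, the phrase "w1w2" matches while "w2w1" does not, even though both characters are indexed as separate tokens.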

=====
__________________________________
alex@lissus.com -- http://www.lissus.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Re: about bigram based word segment

hchen at intumit

Sep 13, 2002, 9:07 AM

Post #3 of 3 (285 views)

I think there's another flaw with the bigram approach when the query
consists of 3+ characters. For example, a query of w1w2w3 would also match
text such as w1w2w4w2w3, because that text contains both bigrams w1w2 and
w2w3. Currently I do unigram tokenization and automatically turn CJK
searches into phrase queries, but performance could take a hit in
large-scale situations.
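
The false match can be reproduced with a small Python sketch (an illustration only, not Lucene code; `bigrams`, `bigram_and_match` and `unigram_phrase_match` are hypothetical names). It assumes the bigram query terms are combined with AND and that positions are ignored.

```python
def bigrams(chars):
    """Return the set of overlapping bigrams of a character sequence."""
    return {tuple(chars[i:i + 2]) for i in range(len(chars) - 1)}


def bigram_and_match(doc, query):
    """AND over bigram terms, ignoring positions: every query bigram must occur in the doc."""
    return bigrams(query) <= bigrams(doc)


def unigram_phrase_match(doc, query):
    """Phrase semantics over single characters: the query must occur as a contiguous run."""
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


doc = ["w1", "w2", "w4", "w2", "w3"]
query = ["w1", "w2", "w3"]

false_positive = bigram_and_match(doc, query)
phrase_hit = unigram_phrase_match(doc, query)
print(false_positive, phrase_hit)
```

The bigram AND query matches because w1w2 and w2w3 both occur somewhere in the text, while the unigram phrase match correctly rejects it.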
