> I don't know any Asian languages but from earlier experimentations, I
> remember that some time bigram tokenization could hurt matching, e.g.:
>
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
>
If Chinese is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searches for "w1w2" and "w2w1" will return the same results, won't they?
Bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
avoids this character-order problem.
According to the statistics, bigram-based word segmentation returned the best results, but it needs the queryParser to parse queries with the "AND" relation by default.
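
A minimal sketch of the point in Python, not Lucene code. The helpers ngrams() and matches_all() are hypothetical names made up for this illustration, the letters A, B, C stand in for w1, w2, w3, and "AND" matching is modelled as "every query token must appear among the document's tokens":

```python
def ngrams(text, n):
    """Return overlapping character n-grams; text no longer than n stays one token."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matches_all(doc, query, n):
    """AND semantics: every query token must be among the document's tokens."""
    doc_tokens = set(ngrams(doc, n))
    return all(tok in doc_tokens for tok in ngrams(query, n))


doc = "ABC"  # stands in for w1w2w3

# Single-character segmentation: character order is lost,
# so "AB" and "BA" both match.
unigram_ab = matches_all(doc, "AB", 1)
unigram_ba = matches_all(doc, "BA", 1)

# Bigram segmentation: "AB" matches, "BA" does not.
bigram_ab = matches_all(doc, "AB", 2)
bigram_ba = matches_all(doc, "BA", 2)

# The concern in the quoted text: a one-character query finds
# nothing in a bigram-only index.
bigram_b = matches_all(doc, "B", 2)
```

In this toy model the single-character index cannot tell "AB" from "BA", while the bigram index can, at the cost of missing the one-character query "B".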
You can try bigram-based word segmentation at http://search.163.com in the category search and news search (the web page search is powered by Google).
Google's Chinese language analysis is provided by Basistech, using dictionary-based word segmentation:
http://www.basistech.com/products/language-analysis/cma.html
Che, Dong