Mailing List Archive: high precision CompoundWordTokenFilter

Hey,

I'm trying to build a high-precision decompounder. I tried the Dictionary- and Hyphenation- CompoundWordTokenFilters with a German dictionary (the one prepared by Uwe Schindler [1]), but noticed one class of false positives that worries me. When I decompound "Klavierkonzert" [2] I get ["klavier", "vier", "konzert"]. I understand why "vier" shows up, since it is in the dictionary, but it is wrong here because "klavier" already covers those characters.

I checked the documentation and found the `onlyLongestMatch` param, thinking it might solve my problem, but unfortunately it doesn't work that way. Lucene's `testHyphenationCompoundWordsDELongestMatch` demonstrates how it works [3]: the "basketball" match excludes the potential "basket" match, which starts at the same position, but not "ball" (similarly, "klavier" does not exclude "vier").
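To make that behaviour concrete, here is a minimal sketch of how I understand the matching. It is a simplified model, not Lucene's actual code: the class `LongestMatchSketch` and the method `subwords` are made up for illustration, it checks every start position rather than only hyphenation points, and it assumes "longest match" is decided per start position, which is what the test in [3] suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of per-start-position dictionary matching.
// Hypothetical names; this is NOT the Lucene implementation.
class LongestMatchSketch {

    static List<String> subwords(String word, Set<String> dict,
                                 int minSubwordSize, boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            String longest = null;
            for (int end = start + minSubwordSize; end <= word.length(); end++) {
                String candidate = word.substring(start, end);
                if (!dict.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    // Candidates grow with 'end', so the last hit is the longest
                    // one for this start position.
                    longest = candidate;
                } else {
                    out.add(candidate);
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("klavier", "vier", "konzert",
                                  "basket", "basketball", "ball");
        // "vier" survives because it starts at a different offset than "klavier".
        System.out.println(subwords("klavierkonzert", dict, 2, true));
        // "basket" is dropped (same start as "basketball"), "ball" is not.
        System.out.println(subwords("basketball", dict, 2, true));
    }
}
```

In this model the flag only arbitrates between matches that share a start offset, so an overlapping match that begins later in the word is never suppressed.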

Things are even more complicated because I'm not sure which language I'm dealing with: German is quite frequent in the corpus, but other languages occur too. (Also, the documents are really short and sometimes contain words from multiple languages.)

I thought about applying the dictionary-based CompoundWordTokenFilters to the corpus, taking the top N compound words, manually filtering out false positives, and translating the results into a SynonymMap for SynonymFilter. Even the top 1k rules like that should significantly improve recall for my users without sacrificing precision. But I'm thinking that maybe there's a better way.

What if I used a CompoundWordTokenFilter that only emits a sequence of dictionary items that reconstruct the given word when concatenated? That would solve the "klavier" problem: there's no "kla" word in the dictionary, and even if there were, "klavier" is longer than "kla" and "vier" and would take precedence. Would that make sense?
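A rough sketch of the idea, outside of Lucene's TokenFilter API so that it stays self-contained. The class `FullCoverDecompounder` and the method `decompose` are hypothetical names, and the strategy (try the longest dictionary word first at each position, backtrack when the rest of the word cannot be covered) is just one way to read "longer takes precedence":

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: emit subwords only if they concatenate to the whole word.
class FullCoverDecompounder {

    // Returns the parts of a full cover, or an empty list when the word cannot
    // be split into at least two dictionary words.
    static List<String> decompose(String word, Set<String> dict, int minSubwordSize) {
        List<String> parts = new ArrayList<>();
        boolean[] deadEnd = new boolean[word.length() + 1];
        if (cover(word, 0, dict, minSubwordSize, parts, deadEnd) && parts.size() > 1) {
            return parts;
        }
        return Collections.emptyList();
    }

    private static boolean cover(String word, int start, Set<String> dict,
                                 int minSubwordSize, List<String> parts,
                                 boolean[] deadEnd) {
        if (start == word.length()) {
            return true; // every character is covered
        }
        if (deadEnd[start]) {
            return false; // already known: the suffix from here has no cover
        }
        // Longest candidate first, so "klavier" is tried before "kla".
        for (int end = word.length(); end - start >= minSubwordSize; end--) {
            String candidate = word.substring(start, end);
            if (!dict.contains(candidate)) {
                continue;
            }
            parts.add(candidate);
            if (cover(word, end, dict, minSubwordSize, parts, deadEnd)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        deadEnd[start] = true;
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("kla", "klavier", "vier", "konzert");
        System.out.println(decompose("klavierkonzert", dict, 2));
    }
}
```

With that dictionary the sketch yields "klavier" and "konzert" and never emits "vier", because "vier" is not part of any sequence that covers the whole word once "klavier" is chosen. One consequence of the strict cover requirement is that a word with even a single character not explained by the dictionary gets no subwords at all.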

[1] https://github.com/uschindler/german-decompounder
[2] https://en.wiktionary.org/wiki/Klavierkonzert#German
[3] https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/analysis/common/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java#L81