Mailing List Archive

Partial Word Match
Hi Marvey,

I'm now definitely sure that there's no compound word splitter for German
available, not in Perl nor in any other language. There are a lot of
theories and drafts, but they are so complex that I don't even understand
the functions and methods they use for compound word splitting.

So, can you tell me if there is (or will be) a way to add a
"partial word" match search option to KinoSearch? For example, when I search
for "like" I would also find the words "likes", "likewise", "unlike" and so
on (OK, "likes" would be turned into "like" if you use stemming, but you get
the idea). I'm not talking about wildcards now, just about partial word
match. Would this cause the same problems as wildcards, or would it be
possible?

Best regards,

Marc

P.S. The fixes, like the one for the negative search operator bug, are all
in the trunk too, I suspect, right?
Partial Word Match [ In reply to ]
Marc Elser scribbled on 8/22/06 12:35 AM:
> I'm not talking about wildcards now, just about partial word
> match. Would this cause the same problems as wildcards or would it be
> possible?
>

I don't grasp the (subtle?) difference between a partial word match and a wildcard.
--
Peter Karman . http://peknet.com/ . peter@peknet.com
Partial Word Match [ In reply to ]
> Marc Elser scribbled on 8/22/06 12:35 AM:
>> I'm not talking about wildcards now, just about partial word
>> match. Would this cause the same problems as wildcards or would it be
>> possible?
>>
>
> I don't grasp the (subtle?) difference between a partial word match and
> a wildcard.

That's very easy: a wildcard, let's say "l*ness", would find "loneliness"
or "lightness".

A partial word match for "ness" would match "usefulness", "tiredness" or
"business", so there's no wildcard in it.

The partial word match (as the name suggests) matches not the whole word
but only part of it, but that part must be an exact match. With wildcards,
on the other hand, you have multiple parts which must match the word: in
the above example the word must start with an 'l', followed by anything,
and end with 'ness'. If you wanted to write the partial word match as a
wildcard it would be "*ness*", but as I said there would be nothing in
between. And of course wildcards can be much more complex, using multiple
wildcards, for example "hi*there*people*". OK, there's no such word as
"hihowdytheremypeoplegoing" which would match this wildcard, but you get
the idea.
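
To put the difference in regex terms, here's a little illustration I hacked
up (this has nothing to do with how KinoSearch works internally, it's just
to show the two kinds of matching):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @words = qw(usefulness tiredness business loneliness lightness likewise);

    # partial word match for "ness": the exact fragment anywhere in the word
    my @partial  = grep { /ness/ }      @words;

    # wildcard "l*ness": starts with 'l', anything in between, ends in 'ness'
    my @wildcard = grep { /^l.*ness$/ } @words;

    print "partial:  @partial\n";   # usefulness tiredness business loneliness lightness
    print "wildcard: @wildcard\n";  # loneliness lightness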

So maybe there's a possibility to do a partial word match in KinoSearch
without too much speed loss, so I wonder what Marvin will say. The problem
matters most in German, or in other languages like Swedish and numerous
others, where you have a lot of compound words; partial word matching would
greatly increase the number of hits when searching. But I can imagine it's a
nightmare too, because the words in the index don't have to start with the
part you're looking for; they can, but you don't know.
Partial Word Match [ In reply to ]
On Aug 21, 2006, at 10:35 PM, Marc Elser wrote:

> I'm now definitely sure that there's no compound word splitter for
> German available, not in Perl nor in any other language. There are
> a lot of theories and drafts, but they are so complex that I don't
> even understand the functions and methods they use for compound
> word splitting.

I imagine the problem is basically the same as it is in English --
how do you tell which words are compound words? Should "basically"
match against "basic" and "ally"?

But then, Japanese is a much harder nut to crack, and apparently
dictionary-based tokenizers are the way to go there. I understand
that "MeCab" is one.

http://mecab.sourceforge.jp/
http://search.cpan.org/dist/Text-MeCab/
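
I haven't tried it myself, but going by the Text::MeCab synopsis, tokenizing
a string looks something like this (untested, so take it as a sketch):

    use Text::MeCab;

    my $text  = "...";                 # some Japanese text to tokenize
    my $mecab = Text::MeCab->new();

    # parse() returns a linked list of nodes; walk it and print each token.
    for ( my $node = $mecab->parse($text); $node; $node = $node->next ) {
        printf( "%s\n", $node->surface );
    }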

> So, can you tell me if there is (or will be) a way to add a
> "partial word" match search option to KinoSearch? For example,
> when I search for "like" I would also find the words "likes",
> "likewise", "unlike" and so on (OK, "likes" would be turned into
> "like" if you use stemming, but you get the idea). I'm not talking
> about wildcards now, just about partial word match. Would this
> cause the same problems as wildcards, or would it be possible?

The reason that Lucene doesn't allow wildcards at the front of
strings (I think this is still true for the default Lucene
QueryParser) is that you have to iterate over every term in the
index, performing a substring search to see if it matches, which is
quite expensive. The same problem would exist with the "partial
word" search, as you describe it. Imagine consulting every entry in
the index in the back of a large book to see whether it contained the
string "like".


Marvin Humphrey

--
I'm looking for a part time job.