Mailing List Archive

compound word splitting
Hi Marvin,

I was googling some more about compound word splitting, and maybe
there's a solution which could work for KinoSearch too.

There's a program called TSearch V2
(http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/), a PostgreSQL
extension which enhances Postgres by adding inverted fulltext search
indexes and new functions. One of those functions is 'lexize', to which
you pass the encoding and a word; it returns the compound's parts if
you have a dictionary which is tagged for compound words. There are
dictionaries for Swedish, German, and other languages, although I don't
know whether all of them carry the compound tags.

The extension is written in C and the code is not too big. I wonder if
you could take a look at it and decide whether it would be possible to
create a new Analyzer for KinoSearch, maybe Analyzer::CompoundSplitter.
That would be really great, and it would work for all available
dictionaries with compound tagging. There's also a helper script which
can create ispell dictionaries out of myspell dictionaries (the ones
from OpenOffice), and another which can tag dictionaries for compound
words.

Cheers,

Marc
compound word splitting [ In reply to ]
On Aug 24, 2006, at 12:37 AM, Marc Elser wrote:
> I was googling some more about compound word splitting, and maybe
> there's a solution which could work for KinoSearch too.
>
> There's a program called TSearch V2
> (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/), a
> PostgreSQL extension which enhances Postgres by adding inverted
> fulltext search indexes and new functions. One of those functions is
> 'lexize', to which you pass the encoding and a word; it returns the
> compound's parts if you have a dictionary which is tagged for
> compound words. There are dictionaries for Swedish, German, and
> other languages, although I don't know whether all of them carry the
> compound tags.

I had a look, but I bounced off of the TSearch source code, which is
sparsely commented and uses some PostgreSQL stuff I'm unfamiliar
with. I'm not sure which files to look at, and the files I did look
at I didn't grok.

I tried googling this subject myself, and I came up with this page:

http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/

It would be possible to write a compound splitter based on longest
match. Start with a lexically sorted array of words that can be
substrings. Proceed character by character through the token text,
finding the longest possible match using binary search. If you find
a match, lop it off the token text and start again from there. Index
multiple tokens at one position by setting the position increment
argument for TokenBatch::append() to 0. (This technique is also how
you would implement a synonym analyzer.) It wouldn't be perfect, for
the reasons discussed on that page, but it would be better than nothing.
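
For concreteness, here's a rough Perl sketch of that loop. The helper
names and the greedy, no-backtracking strategy are just my
illustration, not anything that ships with KinoSearch:

    use strict;
    use warnings;

    # Binary search a lexically sorted arrayref for an exact match.
    sub wordlist_contains {
        my ( $sorted, $target ) = @_;
        my ( $lo, $hi ) = ( 0, $#$sorted );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            my $cmp = $sorted->[$mid] cmp $target;
            return 1 if $cmp == 0;
            if   ( $cmp < 0 ) { $lo = $mid + 1 }
            else              { $hi = $mid - 1 }
        }
        return 0;
    }

    # Find the longest known word that prefixes $text, or return undef.
    sub longest_match {
        my ( $text, $wordlist ) = @_;
        for my $len ( reverse 1 .. length($text) ) {
            my $prefix = substr( $text, 0, $len );
            return $prefix if wordlist_contains( $wordlist, $prefix );
        }
        return undef;
    }

    # Greedy splitter: lop off the longest match and start again.
    sub decompound {
        my ( $text, $wordlist ) = @_;
        my @parts;
        while ( length $text ) {
            my $match = longest_match( $text, $wordlist )
                or return ();    # no coherent split
            push @parts, $match;
            substr( $text, 0, length($match) ) = '';
        }
        return @parts > 1 ? @parts : ();
    }

With a wordlist of ( 'mann', 'schnee' ),
decompound( 'schneemann', [ 'mann', 'schnee' ] ) returns
( 'schnee', 'mann' ). Since it never backtracks, it can miss splits
that a smarter search would find, which is part of why it wouldn't be
perfect.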

I don't have time to write, test, or support something like this, but
I'd be happy to continue discussing the design on list. My primary
interest lies in providing the most elegant abstract framework
possible within which such things can be implemented.


Marvin Humphrey

--
I'm looking for a part time job.
compound word splitting [ In reply to ]
Hi Marvin,

> It would be possible to write a compound splitter based on longest
> match. Start with a lexically sorted array of words that can be
> substrings. Proceed character by character through the token text,
> finding the longest possible match using binary search. If you find a
> match, lop it off the token text and start again from there. Index
> multiple tokens at one position by setting the position increment
> argument for TokenBatch::append() to 0. (This technique is also how you
> would implement a synonym analyzer.) It wouldn't be perfect, for the
> reasons discussed on that page, but it would be better than nothing.
This sounds great and not too complex, but I have one big question: is
the stuff I need to do all in the Perl section of KinoSearch, or does
it involve writing XS/C code? Because if it does, there's no point in
me even starting, because I don't know C.

And what is a lexically sorted array???
>
> I don't have time to write, test, or support something like this, but
> I'd be happy to continue discussing the design on list. My primary
> interest lies in providing the most elegant abstract framework possible
> within which such things can be implemented.
No problem. If I can implement it, I will try, but I also have to do
some more googling about how to find binding chars in compound words
(like the 's' in German 'Arbeitszimmer'); as far as I know there are
only a few.

Best regards,

Marc
compound word splitting [ In reply to ]
On Aug 24, 2006, at 12:30 PM, Marc Elser wrote:

> Is the stuff I need to do all in the Perl section of KinoSearch, or
> does it involve writing XS/C code? Because if it does, there's no
> point in me even starting, because I don't know C.

It's possible to implement a decompound() function using Perl, *test*
it with Perl, then refactor it for improved speed using C if need
be. I wouldn't even start thinking about refactoring, though, until
1) you have a decent set of tests and 2) you've determined that
decompound() is a bottleneck.

=head2 decompound

    my @substrings = decompound( $token_text, \@possible_substrings );

C<decompound> breaks a string into a set of substrings if possible. It
returns an empty list if no coherent set of substrings can be found.

=cut
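
A starter test file for that interface could be as small as this
sketch (the toy wordlist and the German examples are mine, and it
assumes a decompound() along the lines of the one sketched earlier in
the thread has been loaded):

    use strict;
    use warnings;
    use Test::More tests => 3;

    # Toy lexically sorted wordlist; a real one would come from a
    # compound-tagged dictionary. Assumes decompound() is defined.
    my @possible_substrings = sort qw( bericht mann schnee wetter );

    is_deeply( [ decompound( 'schneemann', \@possible_substrings ) ],
        [ 'schnee', 'mann' ], 'two-part compound' );
    is_deeply( [ decompound( 'wetterbericht', \@possible_substrings ) ],
        [ 'wetter', 'bericht' ], 'another compound' );
    is_deeply( [ decompound( 'xyzzy', \@possible_substrings ) ],
        [], 'no coherent split yields an empty list' );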


The fastest way to implement this would be to combine it with
Tokenizer functionality.

    sub analyze {
        my ( $self, $source_batch ) = @_;
        my $dest_batch = KinoSearch::Analysis::TokenBatch->new;

        # $separator_re, $token_re, @possible_substrings, and
        # decompound() are assumed to be defined elsewhere;
        # bytes::length() requires that bytes.pm be loaded, e.g. via
        # "use bytes ();".
        while ( $source_batch->next ) {
            my $source_text = $source_batch->get_text;

            my $orig_length = bytes::length($source_text);
            while (1) {
                # break source text into tokens, find start and end
                $source_text =~ s/$separator_re//;
                my $start = $orig_length - bytes::length($source_text);
                last unless $source_text =~ s/($token_re)//;
                my $end        = $orig_length - bytes::length($source_text);
                my $token_text = $1;

                # add main token with position increment of 1
                $dest_batch->append( $token_text, $start, $end, 1 );

                # add each subtoken at the same position as the main
                # token by using a position increment of 0
                my @substrings
                    = decompound( $token_text, \@possible_substrings );
                for my $substring (@substrings) {
                    $dest_batch->append( $substring, $start, $end, 0 );
                }
            }
        }

        return $dest_batch;
    }
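
If you wanted to play with that, a hypothetical packaging might look
like the following. The subclass name is invented, and I'm going from
memory on the InvIndexer constructor, so treat it as a sketch:

    package MyApp::CompoundSplitter;
    use strict;
    use warnings;
    use bytes ();
    use base qw( KinoSearch::Analysis::Analyzer );

    # $separator_re, $token_re, @possible_substrings, decompound(),
    # and the analyze() method from above would live here.

    package main;

    # Hand it to an indexer like any other analyzer:
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/invindex',
        create   => 1,
        analyzer => MyApp::CompoundSplitter->new,
    );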

Anybody who's interested in a synonym analyzer need only replace the
decompound() function with a find_synonyms() function. :)
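
A find_synonyms() with the same contract might be as simple as this
sketch (the synonym table is invented, of course):

    # Hypothetical find_synonyms(): same shape as decompound(), but
    # backed by a lookup table instead of a sorted wordlist.
    my %synonyms = (
        quick => [ 'fast', 'speedy' ],
        auto  => [ 'car' ],
    );

    sub find_synonyms {
        my $token_text = shift;
        my $syns = $synonyms{ lc $token_text } or return ();
        return @$syns;
    }

Each synonym then gets appended with a position increment of 0, just
like the subtokens.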

> And what is a lexically sorted array???

That's just a more general way of saying "alphabetically".
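
In Perl terms, it's what sort() gives you by default, since sort()
compares strings with cmp:

    my @sorted = sort qw( wetter schnee bericht mann );
    # -> ( 'bericht', 'mann', 'schnee', 'wetter' )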

> No problem. If I can implement it, I will try, but I also have to do
> some more googling about how to find binding chars in compound words
> (like the 's' in German 'Arbeitszimmer'); as far as I know there are
> only a few.

I'll be curious to learn more about the theory on this, so please
report any good sites you turn up.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/