Mailing List Archive

OT - Algorithm blues...
Hi,
I know this is really not the place for this, but there are some great
folks on this list... ;-)
The situation is this. I need some way to identify random length
deviations in two random length strings. An example:

ABC123
ABC234

Here the one and the four would be highlighted, because they are the
supposed anomalies - 23 is a substring which we are guessing is
supposed to be there, so is not highlighted. The elements of the
strings will not be unique (ie, aa11111 is ok), and we may well get
strings of different lengths (ie, abcd1234 and abc234, with d and 1
highlighted) we might also get stuff like
a. ABC123
b. ABC2312
In which case the b. 23 should be left as good and the one in a. and
both the one and two (last two chars) should be marked as bad.

So is there a nifty wee algorithm to do this, and I am just too stupid
to get it, or is it my boss who is expecting a little much for a
couple of hours work...?
Cheers
Antoine


--
G System, The Evolving GUniverse - http://www.g-system.at

--
gentoo-user@gentoo.org mailing list
Re: OT - Algorithm blues... [ In reply to ]
On Thu, 7 Oct 2004 13:16:38 +0200, Antoine <melser.anton@gmail.com> wrote:
> Hi,
> I know this is really not the place for this, but there are some great
> folks on this list... ;-)

I would disagree..... I think this list is a good place to get
questions answered for just about anything. If people don't think it's
the place, they just won't answer you. All sorts of questions on the
forums are answered that aren't gentoo directly related, and this list
is similar. That's why I love this community so much!!!

.....as long as you put [OT], I don't think people will mind too
much..... then again, there are those who feel just the opposite than
I do about this :(

-Josh

--
gentoo-user@gentoo.org mailing list
Re: OT - Algorithm blues... [ In reply to ]
On Thu, 7 Oct 2004 13:16:38 +0200 Antoine <melser.anton@gmail.com>
wrote:
| The situation is this. I need some way to identify random length
| deviations in two random length strings. An example:
<snip description of the diff algorithm>

You're looking at it the wrong way. The easy way to do it is to find the
longest common subsequence, discard it, and repeat. See Algorithm::Diff
(perl) or http://www.rubynet.org/mirrors/diff/ (ruby) for code, or any
reasonably recent computer science algorithms book for how to do longest
common subsequence (often known as LCS) quickly.

Sidenote: LCS is often a hell of a lot faster if you first manually trim
off any common elements at the start and end.

--
Ciaran McCreesh : Gentoo Developer (Sparc, MIPS, Vim, Fluxbox)
Mail : ciaranm at gentoo.org
Web : http://dev.gentoo.org/~ciaranm
Re: OT - Algorithm blues... [ In reply to ]
On Thu, 7 Oct 2004 13:16:38 +0200, Antoine <melser.anton@gmail.com> wrote:
> Hi,
> I know this is really not the place for this, but there are some great
> folks on this list... ;-)
> The situation is this. I need some way to identify random length
> deviations in two random length strings. An example:
>
> ABC123
> ABC234
>
> Here the one and the four would be highlighted, because they are the
> supposed anomalies - 23 is a substring which we are guessing is
> supposed to be there, so is not highlighted. The elements of the
> strings will not be unique (ie, aa11111 is ok), and we may well get
> strings of different lengths (ie, abcd1234 and abc234, with d and 1
> highlighted) we might also get stuff like
> a. ABC123
> b. ABC2312
> In which case the b. 23 should be left as good and the one in a. and
> both the one and two (last two chars) should be marked as bad.
>
> So is there a nifty wee algorithm to do this, and I am just too stupid
> to get it, or is it my boss who is expecting a little much for a
> couple of hours work...?

Hmmm, I did this problem in my Computer Science course in college. ;-)
If you can assure me that I'm not doing your homework, I'd be glad to
send you some source code I wrote to deal with this. It takes...hmm, I
can't remember the time complexity, but the space complexity is O(n*m)
where n and m are the sizes of the strings. You can make it O(2*n) by
making the time complexity another multiple of m.

Actually, after looking at the code, this was a special version of the
algorithm which matched RNA sequences, so it looked at groups of 3
characters, but the algorithm will work with minimal changes for
single chars.
--
paperCrane --Justin Patrin--

--
gentoo-user@gentoo.org mailing list