Mailing List Archive

Fuzzy matching of postal addresses
I have a problem that is suspect isn't unusual and I'm looking to see if
there is any code available to help. I've Googled without success.

Basically, I have two databases containing lists of postal addresses and
need to look for matching addresses in the two databases. More
precisely, for each address in database A I want to find a single
matching address in database B.

I'm 90% of the way there, in the sense that I have a simplistic approach
that matches 90% of the addresses in database A. But the extra cases
could be a pain to deal with!

It's probably not relevant, but I'm using ZODB to store the databases.

The current approach is to loop over addresses in database A. I then
identify all addresses in database B that share the same postal code
(typically less than 50). The database has a mapping that lets me do
this efficiently. Then I look for 'good' matches. If there is exactly
one I declare a success. This isn't as efficient as it could be, it's
O(n^2) for each postcode, because I end up comparing all possible pairs.
But it's fast enough for my application.

The problem is looking for good matches. I currently normalise the
addresses to ignore some irrelevant issues like case and punctuation,
but there are other issues.

Here are just some examples where the software didn't declare a match:

1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

The challenge is to fix some of the false negatives above without
introducing false positives!

Any pointers gratefully received.

--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2005-01-18, Andrew McLean <spam-trap-095@at-andros.demon.co.uk> wrote:
> I have a problem that is suspect isn't unusual and I'm looking to see if
> there is any code available to help. I've Googled without success.

I have done something very similar (well, near as dammit identical,
actually), unfortunately, I did this in Perl[1].

I can give you some general pointers of the way to go about this
type of thing, but can't actually provide any code, as it is at work.

> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

In my implementation this is the exact same setup, database A
(OS/RM addresspoint data) contained a metric shitload of addresses,
with addresses to be matched against them coming from client supplied
data.

> I'm 90% of the way there, in the sense that I have a simplistic approach
> that matches 90% of the addresses in database A. But the extra cases
> could be a pain to deal with!

There is no way to guarantee 100% accuracy if one (or both) of the
datasets is collated from a non-authoratative source, ie the actual
homeowners. ;)

> It's probably not relevant, but I'm using ZODB to store the databases.

> The current approach is to loop over addresses in database A. I then
> identify all addresses in database B that share the same postal code
> (typically less than 50). The database has a mapping that lets me do
> this efficiently. Then I look for 'good' matches. If there is exactly
> one I declare a success. This isn't as efficient as it could be, it's
> O(n^2) for each postcode, because I end up comparing all possible pairs.
> But it's fast enough for my application.

OK, this is a good start, the first thing to do is to clean the data,
especially the postcodes.

something along the lines of: (pseudocode, can't remember the exact python
implementation I did)

postcode = resultarray[postcode]
len = length(postcode)
for (i = 0; i < len; i++):
# if the fourth character is a digit or a space append it to the string
if (i == 3 && postcode[i] =~ /(\d|\s)/:
cleanPostcode .= postcode[i]
# if this isn't the fourth character and it is an aplhanumeric character
# append it to the string
else if (postcode[i] =~ /(\w|\d):
cleanPostcode .= postcode[i]

That puts all the postcodes into the format that the Royal Mail uses.

Then search on postcode ;)

The next thing I did was to split each individual word out fom
it field, and matched that in a specific order (I found starting
with elements such as house name, number, and street name
was the best approach).

If a word matched it was assigned a score (1 for a exact match, 0.7 for
a metaphone match and 0.6 for a soundex macth IIRC), and when the searching
was finished I took the resulting score and divided it by the number of
word elements.

If that score was higher than any of the prevous scores then it was put
in a given variable. If there where a number of equally good(bad?) matches
then they were appended onto an array, and if there was no clear winner
yb the time that the last of the records for that postcode had been process
it spat out a multiple choice list.

The trick is picking a threshold level below which no matches are
put into the DB, even if they are the best scoring. (I used a threshold
of 0.3

This can be refined, the current, extremely baroque, perl script that
does this currently drops out certain values from the data array if
there is an exact match with certain fields, such as house name.
It doesn't reduce the value of the integer that the result is divided
by though, thus favouring results that return an exact match on a couple
of given fields.

> The challenge is to fix some of the false negatives above without
> introducing false positives!
>
> Any pointers gratefully received.

Hope this is a) understandable, and b) useful ;)

FWIW, the perl script (an I would expect a similarly implemented python
script to perform about as well) running a somewhat flaky set of
user collated data against the Royal Mail Addresspoint data managed
a 75% hit rate, with an additional 5% requiring user intervention,
and as near as I have been able to ascertain a >1% false positive
count, from a dataset of nearly 17,000 addresses.

With cleaner and more up to date data I would expect the results
to be noticably better.

[1] It is still my main language, I don't use python enough to
think in it as easily as I think in perl ;)

- --
James jamesk[at]homeric[dot]co[dot]uk

"Consistency is the last resort of the unimaginative."
- -- Bob Hope
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFB7FurqfSmHkD6LvoRAgdVAJ4t2HCaT52qbuqyT5yN59X+az0ZQwCfZgOH
L5nTnPj+TF95Z+FCM65CzV0=
=UkeW
-----END PGP SIGNATURE-----
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew McLean wrote:

> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation,
> but there are other issues.


I'd do a bit more extensive normalization. First, strip off the city
through postal code (e.g. 'Beaminster, Dorset, DT8 3SS' in your
examples). In the remaining string, remove any punctuation and words
like "the", "flat", etc.

> Here are just some examples where the software didn't declare a match:

And how they'd look after the transformation I suggest above:

> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

1 Brantwood
BEECHES 1 BRANTWOOD

> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

2 Bethany House Broadwindsor Road
2 BETHANY HOUSE

> Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

Penthouse Old Vicarage 1 Clay Lane
PENTHOUSE OLD VICARAGE 1 CLAY LANE

> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

St Johns Presbytery Shortmoor
PRESBYTERY SHORTMOOR

> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

Pinnacles White Sheet Hill
PINNACLES WHITESHEET HILL


Obviously, this is not perfect, but it's closer. At this point, you
could perhaps say that if either string is a substring of the other,
you have a match. That should work with all of these examples except
the last one. You could either do this munging for all address
lookups, or you could do it only for those that don't find a match in
the simplistic way. Either way, you can store the Database B's
pre-munged address so that you don't need to constantly recompute
those. I can't say for certain how this will perform in the false
positives department, but I'd expect that it wouldn't be too bad.

For a more-detailed matching, you might look into finding an algorithm
to determine the "distance" between two strings and using that to
score possible matches.

Jeff Shannon
Technician/Programmer
Credit International

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
"Andrew McLean" <spam-trap-095@at-andros.demon.co.uk> wrote in message
news:96B2w2E$HF7BFwSq@at-andros.demon.co.uk...
>I have a problem that is suspect isn't unusual and I'm looking to see if
>there is any code available to help. I've Googled without success.

There isn't any publically availible code that I'm aware of.
Companies that do a good job of address matching regard
that code as a competitive advantage on a par with the
crown jewels.

> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More precisely,
> for each address in database A I want to find a single matching address in
> database B.
>
> I'm 90% of the way there, in the sense that I have a simplistic approach
> that matches 90% of the addresses in database A. But the extra cases could
> be a pain to deal with!

>From a purely pragmatic viewpoint, is this a one-off, and how many
non-matches do you have to deal with? If the answers are yes,
and not all that many, I'd do the rest by hand.

> It's probably not relevant, but I'm using ZODB to store the databases.

I doubt if it's relevant.

> The current approach is to loop over addresses in database A. I then
> identify all addresses in database B that share the same postal code
> (typically less than 50). The database has a mapping that lets me do this
> efficiently. Then I look for 'good' matches. If there is exactly one I
> declare a success. This isn't as efficient as it could be, it's O(n^2) for
> each postcode, because I end up comparing all possible pairs. But it's
> fast enough for my application.
>
> The problem is looking for good matches. I currently normalise the
> addresses to ignore some irrelevant issues like case and punctuation, but
> there are other issues.

I used to work on a system that had a reasonably decent address
matching routine. The critical issue is, as you suspected, normalization.
You're not going far enough. You've also got an issue here that doesn't
exist in the States - named buildings.

>
> Here are just some examples where the software didn't declare a match:
>
> 1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
> THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

The first line is a street address, the second is a named building and a
street
without a house number. There's no way of matching this unless you know
that The Beaches doesn't have flat (or room, etc.) numbers and can move the
1 to being the street address. On the other hand, this seems to be a
consistent problem in your data base - in the US, the street address must
be associated with the street name. No comma is allowed between the two.

> Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
> 2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

The first is a flat, house name and street name, the second is a number
and a house name. Assuming that UK postal standards don't allow
more than one named building in a postal code, this is easily matchable
if you do a good job of normalization.

> Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
> PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

The issue here is to use the words "flat" and "the" to split the flat
name and the house name. Then the house number is in the wrong
part - it shoud go with the street name. See the comment above.
>
> St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
> THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

This one may not be resolvable, unless there is only one house name
with "presbytery" in it within the postal code. Notice that "the" should
probably be dropped when normalizing.

> The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
> PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF

Spelling correction needed.

> The challenge is to fix some of the false negatives above without
> introducing false positives!
>
> Any pointers gratefully received.

If, on the other hand, this is a repeating problem that's simply going
to be an ongoing headache, I'd look into commercial address correction
software. Here in the US, there are a number of vendors that have
such software to correct addresses to the standards of the USPS.
They also have data bases of all the legitimate addresses in each
postal code. They're adjuncts of mass mailers, and they exist
because the USPS gives a mass mailing discount based on the
number of "good" addresses you give them.

I don't know what the situation is in the UK, but I'd be surprised
if there wasn't some availible address data base, either commercial
or free, possibly as an adjunct of the postal service.

The later, by the way, is probably the first place I'd look. The
postal service has a major interest in having addresses that they
can deliver without a lot of hassle.

Another place is google. The first two pages using "Address
Matching software" gave two UK references, and several
Australian references.

John Roth
>
> --
> Andrew McLean

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew McLean <spam-trap-095@at-andros.demon.co.uk> wrote:
> I have a problem that is suspect isn't unusual and I'm looking to
> see if there is any code available to help. I've Googled without success.
>
> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

Have a look at http://febrl.sf.net

It is a prototype probabilistic record linkage (fuzzy matching) engine, in Python, which
includes, inter alia, a hidden Markov model address parser. It might be overkill or (at
this stage of its development) too slow for your application. The next version (ina
month or two) will include a geocoding engine which could also be used for what you
want to do, I suspect.

Tim C

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew> I'm 90% of the way there, in the sense that I have a simplistic
Andrew> approach that matches 90% of the addresses in database A. But
Andrew> the extra cases could be a pain to deal with!

Based upon the examples you gave, here are a couple things you might try to
reduce the size of the difficult comparisons:

* Remove "the" and commas as part of your normalization process

* Split each address on white space and convert the resulting list to a
set, then consider the size of the intersection with other addresses
with the same postal code:

>>> a1 = "St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL".upper().replace(",", "")
>>> a1
"ST JOHN'S PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL"
>>> a2 = "THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL".upper().replace(",", "").replace("THE ", "")
>>> a2
'PRESBYTERY SHORTMOOR BEAMINSTER DORSET DT8 3EL'
>>> a1 == a2
False
>>> sa1 = set(a1.split())
>>> sa2 = set(a2.split())
>>> len(sa1)
8
>>> len(sa2)
6
>>> len(sa1 & sa2)
6

Skip
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Ermmm ... only remove "the" when you are sure it is a whole word. Even
then it's a dodgy idea. In the first 1000 lines of the nearest address
file I had to hand, I found these: Catherine, Matthew, Rotherwood,
Weatherall, and "The Avenue".

Ermmm... don't rip out commas (or other punctuation); replace them with
spaces. That way "SHORTMOOR,BEAMINSTER" doesn't become one word
"SHORTMOORBEAMINSTER".


A not-unreasonable similarity metric would be float(len(sa1 & sa2)) /
len(sa1 | sa2). Even more reasonable would be to use trigrams instead
of words -- more robust in the presence of erroneous insertion or
deletion of spaces (e.g. Short Moor and Bea Minster are plausible
variations) and spelling errors and typos. BTW, the OP's samples look
astonishingly clean to me, so unlike real world data.

Two general comments addressed to the OP:
(1) Your solution doesn't handle the case where the postal code has
been butchered. e.g. "DT8 BEL" or "OT8 3EL".
(2) I endorse John Roth's comments. Validation against an address data
base that is provided by the postal authority, using either an
out-sourced bureau service, or buying a licence to use
validation/standardisation/repair software, is IMHO the way to go. In
Australia the postal authority assigns a unique ID to each delivery
point. This "DPID" has to be barcoded onto the mail article to get bulk
postage discounts. Storing the DPID on your database makes duplicate
detection a snap. You can license s/w (from several vendors) that is
certified by the postal authority and has batch and/or online APIs. I
believe the situation in the UK is similar. At least one of the vendors
in Australia is a British company. Google "address deduplication
site:.uk"
Actually (3): If you are constrained by budget, pointy-haired boss or
hubris to write your own software (a) lots of luck (b) you need to do a
bit more research -- look at the links on the febrl website, also
Google for "Monge Elkan", read their initial paper, look at the papers
referencing that on citeseer; also google for "merge purge"; also
google for "record linkage" (what the statistical and medical
fraternity call the problem) (c) and have a damn good look at your data
[like I said, it looks too clean to be true] and (d) when you add a
nice new wrinkle like "strip out 'the'", do make sure to run your
regression tests :-)
Would you believe (4): you are talking about cross-matching two
databases -- don't forget the possibility of duplicates _within_ each
database.


HTH,
John

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
You can't even get anywhere near 100% accuracy when comparing
"authoritative sources" e.g. postal authority and the body charged with
maintaining a database of which streets are in which electoral district
-- no, not AUS, but close :-)

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew McLean wrote:
> I have a problem that is suspect isn't unusual and I'm looking to see if
> there is any code available to help. I've Googled without success.
>
> Basically, I have two databases containing lists of postal addresses and
> need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

I had a similar problem to solve a while ago. I can't give you my code,
but I used this paper as the basis for my solution (BibTeX entry from
http://citeseer.ist.psu.edu/monge00adaptive.html):

@misc{ monge-adaptive,
author = "Alvaro E. Monge",
title = "An Adaptive and Efficient Algorithm for Detecting
Approximately Duplicate
Database Records",
url = "citeseer.ist.psu.edu/monge00adaptive.html" }

There is a lot of literature--try a google search for "approximate
string match"--but very little publically available code in this area,
from what I could gather. Removing punctuation, etc., as others have
suggested in this thread, is _not_sufficient_. Presumably you want to
be able to match typos or phonetic errors as well. This paper's
algorithm deals with those problems quite nicely,

--
--------------------------------------------------------------------
Aaron Bingham
Application Developer
Cenix BioScience GmbH
--------------------------------------------------------------------

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
You might find these at least periperally useful:
<http://www.brunningonline.net/simon/blog/archives/001291.html>
<http://www.brunningonline.net/simon/blog/archives/001292.html>

They refer to address formatting rather than de-duping - but
normalising soulds like a useful first step to me.

--
Cheers,
Simon B,
simon@brunningonline.net,
http://www.brunningonline.net/simon/blog/
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
I think you guys are missing the point. All you would need to add to
get a 'probable match' is add another search that goes through the 10%
that didnt get matched and do a "endswith" search on the data. From the
example data you showed me, that would match a good 90% of the 10%,
leaving you with a 1% that must be hand matched. You would have to
combine this idea with Jeff Shannon's idea to make it work more
efficiently.

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Thanks for all the suggestions. There were some really useful pointers.

A few random points:

1. Spending money is not an option, this is a 'volunteer' project. I'll
try out some of the ideas over the weekend.

2. Someone commented that the data was suspiciously good quality. The
data sources are both ones that you might expect to be authoritative. If
you use as a metric, having a correctly formatted and valid postcode, in
one database 100% the records do in the other 99.96% do.

3. I've already noticed duplicate addresses in one of the databases.

4. You need to be careful doing an endswith search. It was actually my
first approach to the house name issue. The problem is you end up
matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".

I am tempted to try an approach based on splitting the address into a
sequence of normalised tokens. Then work with a metric based on the
differences between the sequences. The simple case would look at
deleting tokens and perhaps concatenating tokens to make a match.

--
Andrew McLean
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
John Machin wrote:
> Ermmm ... only remove "the" when you are sure it is a whole word.
Even
> then it's a dodgy idea. In the first 1000 lines of the nearest
address
> file I had to hand, I found these: Catherine, Matthew, Rotherwood,
> Weatherall, and "The Avenue".
>

Partial apologies: I wasn't reading Skip's snippet correctly -- he had
"THE ", I read "THE". Only "The Avenue" is a problem in the above list.
However Skip's snippet _does_ do damage in cases where the word ends in
"the". Grepping lists of placenames found 25 distinct names in UK,
including "The Mythe" and "The Wrythe".

Addendum: Given examples in the UK like "Barton in the Beans" (no
kiddin') and "Barton-on-the-Heath", replacing "-" by space seems
indicated.

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
"Andrew McLean" <spam-trap-095@at-andros.demon.co.uk> wrote in message
news:$aq2IQPQ8X7BFwRy@at-andros.demon.co.uk...
> Thanks for all the suggestions. There were some really useful pointers.
>
> A few random points:
>
> 1. Spending money is not an option, this is a 'volunteer' project. I'll
> try out some of the ideas over the weekend.
>
> 2. Someone commented that the data was suspiciously good quality. The data
> sources are both ones that you might expect to be authoritative. If you
> use as a metric, having a correctly formatted and valid postcode, in one
> database 100% the records do in the other 99.96% do.
>
> 3. I've already noticed duplicate addresses in one of the databases.
>
> 4. You need to be careful doing an endswith search. It was actually my
> first approach to the house name issue. The problem is you end up matching
> "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".
>
> I am tempted to try an approach based on splitting the address into a
> sequence of normalised tokens. Then work with a metric based on the
> differences between the sequences. The simple case would look at deleting
> tokens and perhaps concatenating tokens to make a match.

It's been a while since I did this stuff. The trick in dealing
with address normalization is to parse the string backwards,
and try to slot the pieces into the general pattern.

In your case, the postal code, district (is that what it's called
in the UK?) and city seem to be fine, it's when you get to the
street (with or without house number), building name and flat
or room number that there's a difficulty.

We always had a list of keywords that could be trusted
to be delimiters. In your examples, "the" should be pretty
reliable in indicating a building name. Of course, that might
have some trouble with Tottering on the Brink.

John Roth
>
> --
> Andrew McLean

--
http://mail.python.org/mailman/listinfo/python-list
Re: Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew McLean <spam-trap-095@at-andros.demon.co.uk> wrote:
>
> Thanks for all the suggestions. There were some really useful pointers.
>
> A few random points:
>
> 1. Spending money is not an option, this is a 'volunteer' project. I'll
> try out some of the ideas over the weekend.
> ...
> I am tempted to try an approach based on splitting the address into a
> sequence of normalised tokens. Then work with a metric based on the
> differences between the sequences. The simple case would look at
> deleting tokens and perhaps concatenating tokens to make a match.

Do please have a look at the Febrl project at http://febrl.sf.net

We would be most interested to learn how well its HMM address parser works well for
all those "quaint" English addresses, and its Fellegi-Sunter probabilistic matching
engine should give good results on your data (or use the simpler deterministic engine if
you like). Provided that your data are not too large (eg more than a few hundred
thousand records), Febrl should work fairly well. We'd be pleased to get any feedback
you may have.

Tim C
--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew McLean wrote:
> Thanks for all the suggestions. There were some really useful pointers.
>
> A few random points:
[snip]
> 4. You need to be careful doing an endswith search. It was actually my
> first approach to the house name issue. The problem is you end up
> matching "12 Acacia Avenue, ..." with "2 Acacia Avenue, ...".

Is that really a problem? That looks like a likely typo to me. I guess
it depends on your data set. In my case, the addresses were scattered
all over the place, with relatively few in a given city, so the
likelyhood of two addresses on the same street in the same town was very
low. We /wanted/ to check for this kind of 'duplication'.

Note that endswith will not deal with 'Avenue' vs. 'Ave.', but I supose
a normalization phase could take care of this for you. The Monge
algorithm I pointed you to takes care of this pretty nicely.

--
--------------------------------------------------------------------
Aaron Bingham
Application Developer
Cenix BioScience GmbH
--------------------------------------------------------------------

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Andrew,

> Basically, I have two databases containing lists of postal addresses
and
> need to look for matching addresses in the two databases. More
> precisely, for each address in database A I want to find a single
> matching address in database B.

What percent of addresses in A have a unique corresponding address in
B? (i.e. how many addresses will have some match in B?)

This is a standard document retrieval task. Whole books could be
written about the topic. (In fact, many have been).

I suggest you don't waste your time trying to solve this problem from
scratch, and instead capitalize on the effort of others. Hence, my
proposal is pretty simple:
1. Regularize the punctuation of the text (e.g. convert it all to
uppercase), since it is uninformative and---at best---a confounding
variable.
2. Use a free information retrieval package to find matches.
e.g. LEMUR: http://www-2.cs.cmu.edu/~lemur/

In this case, a "document" is an address in Database B. A "query" is an
address in Database A. (Alternately, you could switch A and B to see if
that affects accuracy.)

Good luck.

Joseph

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Tim,

do you think Ferbel can parse properly with non English data-sets? I
mean do you think it will work properly with data they include non
English characters as well? As we live in Europe, we have to solve such
a problems here :)))) If the software needs some changes, I am ready,
according to your suggestions, to try to do it.

Regards

Petr Jakes

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
McBooCzech wrote:

>Tim,
>
>do you think Ferbel can parse properly with non English data-sets?
>
The official name for the project is "Febrl" (freely-extensible
biomedical record linkage) but perhaps "Furball" would be better name,
given its focus on fuzziness (if that is not a contradiction in terms).
My cat could cough up a suitable graphical logo, no doubt.

> I
>mean do you think it will work properly with data they include non
>English characters as well? As we live in Europe, we have to solve such
>a problems here :)))) If the software needs some changes, I am ready,
>according to your suggestions, to try to do it.
>
>
The underlying method (hidden Markov models) should work with any kind
of data. If the non-English characters can be read as normal Python
strings then it will definitely work. If you need to use Unicode
strings, it may work, but that is not something we have tested -
although it is on the TO-DO list. Peter Christen could comment further
on that (off-line). You would need to supply your own tagging look-up
tables of course, and bootstrap your own models as it unlikely that the
sample models for Australian addresses will work well for Czech
addresses. But none of that should be too hard.

Tim C

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
Sorry for my "Ferbl typo". For the local anti-smoking campaign I am
trying to link some addresses which contain following "linkable"
informations (data fields) only:

RECORD_ID, Street + No., City, Post code,

All data are now w/o Unicode characters. Do you think it possible to
try to link it with Febrl w/o deep code modification?

I did try to link our data but the result is just a plenty of warning
messages but no links. What is your suggestion? Please understand I do
not want to bother you with my questions. I am just asking you your
comments or pointers before I will try to dig in to the code. You
probably know some "tricks" in data organization or something like
that, which can be much easier then code digging.

I can send our CSVs to you (they are small, just about 3204 records in
the A data-set and about 1241 records in the B data-set) and a log as
well.

I have tried to organized oru files as following:
FEBRL reqirements : Our data
==============================
'rec_id': RECORD_ID,
'given_name': ""
'surname': ""
'street_num': ""
'address_part_1': ""
'address_part_2': Street + No.
'suburb': City
'postcode': Post code
'state': ""
'date_of_birth': ""
'soc_sec_id': ""

Thanks for your answer and suggestions

Petr

--
http://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy matching of postal addresses [ In reply to ]
McBooCzech wrote:

>Sorry for my "Ferbl typo".
>
No offense was caused, I was just joking.

>For the local anti-smoking campaign
>
As a public health epidemiologist, that's the sort of application of our
project I like to see! And judging by the reports by Martin Mcgee and
colleagues at the European Public Health Observatory, such campaigns are
sorely needed in most of eastern Europe. I am sure Vaclav Havel is
convinced of the need for anti-smoking campaigns.

>I am
>trying to link some addresses which contain following "linkable"
>informations (data fields) only:
>
>RECORD_ID, Street + No., City, Post code,
>
>All data are now w/o Unicode characters. Do you think it possible to
>try to link it with Febrl w/o deep code modification?
>
>
Yes.

>I did try to link our data but the result is just a plenty of warning
>messages but no links. What is your suggestion? Please understand I do
>not want to bother you with my questions. I am just asking you your
>comments or pointers before I will try to dig in to the code. You
>probably know some "tricks" in data organization or something like
>that, which can be much easier then code digging.
>
>I can send our CSVs to you (they are small, just about 3204 records in
>the A data-set and about 1241 records in the B data-set) and a log as
>well.
>
>I have tried to organized oru files as following:
> FEBRL reqirements : Our data
> ==============================
> 'rec_id': RECORD_ID,
> 'given_name': ""
> 'surname': ""
> 'street_num': ""
> 'address_part_1': ""
> 'address_part_2': Street + No.
> 'suburb': City
> 'postcode': Post code
> 'state': ""
> 'date_of_birth': ""
> 'soc_sec_id': ""
>
>Thanks for your answer and suggestions
>
>
We would be pleased to assist - but off-list. Please send the exact text
of the error messages (the log) directly to Peter Christen and myself in
the first instance. Further discussion of these applictaon-specific
issues is not appropriate for the general Python list - but you could
report back to the Python list on your overall experience after we have
solved the problems with you.

Regards,

Tim C

--
http://mail.python.org/mailman/listinfo/python-list