Soundex - Phonetic search algorithm

Ralph Schwegler · August 28, 2007

Dear left side brainers!

Has anyone implemented a phonetic search algorithm, like the one described here:

SoundEx

Or is there any other algorithm used and well known?

I could need it to clean up a database with 300k+ addresses.

Thank you for any help or pointer.

Fenton · August 28, 2007

Mikhail Edoshin wrote an article on this:

http://edoshin.skeletonkey.com/2006/01/soundex_and_mir.html#more

example file:

http://edoshin.skeletonkey.com//Soundex.fp7.zip

Fenton · August 28, 2007

I don't really know that it will be a great deal of help with addresses. From my experience (tedious hours) the most important part of an address, especially when you're looking for duplicates, is the numbers at the first, a few letters of the street name afterwards, and the zip code (if available). The least important part is the end, "Rd", "Road", "St", "Street", etc.. If you leave that off you will get a lot more matches.

You will also get some false positives however. So you still have to visually inspect. What I do is to create a 1-2 row portal, in a list view, based on a self-relationship, omitting the current record (≠ serial ID). That way I can see the match, and delete it if it is a true duplicate (button in the portal).

If you start with a self-relationship on the full address (100% duplicates) and person's name, then delete all those (not the original!). Then take the end off the address (St., Street, etc.), and use that in a self-relationship. That will get a bunch more. If the number is the same, the street name is the same, the zip code is the same, and the person's name is the same (problematic however), but the abbreviation at the end is different, you can be pretty sure it's a duplicate.

Then try just the number and the first few letters of the street name. That will find a few, with misspellings. These have to be closely visually inspected. There won't be all that many.

Alternatively you can try and "standardize" all the abbreviations. But with 300k you'd be amazed at how many different endings there are. Finding them all and Replacing is tedious. Unless you find a way to automate this operation.

This is a winnowing series of operations, fewer and fewer duplicates; until you can't find any more.

Vaughan · August 29, 2007

"Has anyone implemented a phonetic search algorithm"

No, but it SOUNDS like fun! :

Vaughan · August 29, 2007

Ralph and Fenton: thanks for posting the links to Soundex. Another whole area of learning that I have been unaware of.

So much to learn, so little time! :

BruceJ · August 29, 2007

These phonetic matching techniques are a lifesaver, but they are really most useful if built into the infancy of the application as a preventative tool. To apply them profolactically is very tedious, but hopefully it's a one time process. No matter how careful you are - it's invitable that you'll either delete records you assume are duplicate or leave duplicated unidentified - or both. Weigh your pros and cons and include the value or damage of the loss of data and go from there.

LaRetta · August 29, 2007

Why is phonetic not spelled like it sounds?

:backtotopic:

Genx · August 29, 2007

That's like saying why is dyslexia such a strange looking word... or why is antidisestablishmentarianism (probably how you spell it) so damned redundant?

Its the english language, it seems that the words themselves often illustrate something. E.g. antidis and so on represents the tediousness and round-about ways of the politicians that many of us hold so near and dear to our hearts. Dyslexia couldn't possibly be an easy word for dyslexic people and Phoentic isn't spell fonetikaly.

As to the topic.

It would be an interesting thing to see that algorithm work reliabely especially in a high volume of 300k records. Try create a stored calc field that follows the soundex algorithm and let us know if you have any success with it.

The actual algorithm coded in the bottom looks harder than it is. Ignore the C that loops through the string as an array and skip down to javascript which just does basic replacement functions - it'll make a lot more sense.

comment · August 29, 2007

IIRC, Bob Weaver also showed a Soundex implementation somewhere in this thread:

http://fmforums.com/forum/showtopic.php?tid/147438/

Perhaps it should be pointed out that (a) Soundex was designed to work with names, and (: there are variants of the algorithm for different languages - you shouldn't use the North-American variant if the majority of your population has Swedish names, for example.

antdis and so on represents the tediousness and round-about ways of the politicians that many of us hold so near and dear to our hearts.

Søren, is that you?

Genx · August 29, 2007

Lol no, I'm just in a mood for whining about politics in general given the millions of dollars being wasted on crappy pre-electoral advertising campaigns.

Sign In

Soundex - Phonetic search algorithm

Recommended Posts

Ralph Schwegler

Fenton

Fenton

Vaughan

Vaughan

BruceJ

LaRetta

Genx

comment

Genx

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information