
Recommended Posts

Posted

Hello All:

I'm working with a non-profit which buys mailing lists from other non-profits and imports them into their contact db in FMP. Quite often they end up with duplicates, some of which aren't exact duplicates (due to typing errors, formatting differences, etc.). I'm hoping someone out there has a good strategy for eliminating duplicate entries, as it's bothering me a bit, and they already have literally thousands of duplicates in their system...

Thanks in advance

-Stanley

Posted

If I had a silver bullet solution for reliably identifying duplicate addresses, I sure wouldn't post it here... I'd patent it and live off the royalties!

From my limited experience with bulk mail, duplicate and dead addresses are part of the deal. Most address-selling companies have guarantees regarding maximum error rates and the like, or at least they should.

I worked on one system a year or so ago. They would dump all address information every six months and import a whole new set each time they did a mail-out.

Posted

What version of FileMaker are you using? With version 6.0 I had great results using the 'Update matching records in current found set' option on import. To use this, you must identify a matching value; maybe you could use a concatenation of First and Last Name, or something like that. Anyway, to use this option you set the match field to import with a double-headed arrow. In this case it will UPDATE the existing matching record, but will not ADD a new or duplicate record. Note: the database must have at least one record for this to work, and you should also select the option to 'Add remaining records'.
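
For example (purely a sketch; the field names are placeholders for whatever yours are called), the match field could be a calculation field along the lines of:

   Lower ( Trim ( First Name ) & " " & Trim ( Last Name ) )

so that stray spaces and capitalization differences don't stop two otherwise identical names from matching.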

I have had difficulty getting this to work with FileMaker 7. For my systems that run in 7, I have implemented a Duplicate Relationship Calculation field. I found this in the FM 7 help files by searching on Duplicate Records. I won't describe it in full here, as the help file covers everything you need to quickly get rid of any dups, but the gist is sketched below.
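
From memory, the technique comes down to a self-relationship that matches the table to a second occurrence of itself on the match field, plus a calculation that flags every record after the first. Roughly (names here are placeholders, not the help file's exact example):

   If ( SerialID = SameContact::SerialID ; "Unique" ; "Duplicate" )

Because a reference through the relationship returns the value from the first matching record, only the first record in each group evaluates to "Unique"; a find for "Duplicate" in that field rounds up everything else.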

Posted

Most of the problem is *identifying* duplicate records in the first place, not updating or deleting them.

Should two different people at the same address be considered duplicates?

I dunno. You have to decide. It's a business rule.

What about the same person at two different addresses?

I dunno either. You decide. Business rule.

Then there is the slew of addresses that differ by abbreviation (Road, Rd, Rd., etc.), all of which are different but *mean* the same thing. How do you program all that into a database? I dunno.

I recently worked on a database of people that has a "nationality" field. Users had free rein regarding data entry into this field. So something simple like "China" had...

China

Peoples Republic of China

People's Republic of China

PRC

P.R.C.

P.R.C

P R C

PRoC

PR China

P.R China

P.R.China

PR of China

Chinese

... plus a multitude of variations caused by bad typing. In the end I made a script that sorted out 90% of the records... but that left 10% of 50,000 records to sort out by hand, which was a full day's work. Or full night, as it turned out.
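
For anyone curious, a big Substitute() calculation will knock out most of that kind of variation. A rough sketch for the China example, in FM 7 syntax (the field name, the list of substitutions and their order are only illustrative and have to be built from your own data):

   Substitute ( Upper ( Trim ( Nationality ) ) ;
      [ "." ; "" ] ;
      [ " " ; "" ] ;
      [ "'" ; "" ] ;
      [ "PEOPLESREPUBLICOFCHINA" ; "CHINA" ] ;
      [ "PROFCHINA" ; "CHINA" ] ;
      [ "PRCHINA" ; "CHINA" ] ;
      [ "PROC" ; "CHINA" ] ;
      [ "PRC" ; "CHINA" ] ;
      [ "CHINESE" ; "CHINA" ]
   )

Everything that doesn't reduce cleanly is the part you end up fixing by eye.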

I charge for this kind of data clean-up by the hour.

Posted

I can't tell you how many hours I've spent looking for duplicates in crappy data. One little trick I've found is to only use the first part of the address. As Vaughn says, the ends of addresses are pretty much a nightmare to match exactly, and for a basic duplicate test they are not necessary. If you use the numbers, and only part of the rest, you will match quite a few more, with very few false positives. Use a calculation.
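
As a sketch (the field name is just an example), that calculation can be along the lines of:

   GetAsNumber ( Address 1 ) & "|" &
   Left ( Filter ( Upper ( Address 1 ) ; "ABCDEFGHIJKLMNOPQRSTUVWXYZ" ) ; 4 )

which keeps the street number plus the first few letters of the street name, so "123 Main Road" and "123 Main Rd." both come out as "123|MAIN".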

Another technique is to show the results of such a self-relationship match in a list view. Exclude the current record from the match by adding a criterion to the self-relationship (serial ID not equal to self-relationship::serial ID). Then you will only see a match when there is a duplicate. This makes it very easy to quickly check a duplicate to see if it is false, as you can see both at once (or several, but there's usually only 1, if that).
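
In FM 7 terms that is a two-criteria relationship from the table to a second occurrence of itself, roughly (names are placeholders):

   Contacts::AddressMatch  =  Dupes::AddressMatch
   Contacts::SerialID      ≠  Dupes::SerialID

A portal or a Go to Related Record on that relationship then shows only the *other* records sharing the match value, never the one you're looking at.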

In other words, instead of trying to go for absolutely equal matches (you do those 1st), loosen up your matches a little, and make it very easy to see and edit or delete the matches.

Posted

Thanks, guys; all good points, especially Vaughan's about multiple ways of entering the same data - this is the kind of thing I'm running into. It's all about finding what to human eyes is a duplicate, but what to the computer is clearly NOT a duplicate. Fenton's trick of having a self-relationship showing dupes but not the current record is excellent - I may stick that in a portal, tell the client to pay attention, and build a method to delete the records shown in the portal (roughly along the lines sketched below).
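
Something like this, perhaps, assuming a portal on a "Dupes" self-relationship as Fenton describes, and only run after a human has actually eyeballed the portal (FM 7 script steps; names are placeholders):

   Loop
      Exit Loop If [ Count ( Dupes::SerialID ) = 0 ]
      Go to Portal Row [ First ]
      Delete Portal Row [ No dialog ]
   End Loop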

Regarding SD Pete's point about matching fields - there's a big problem with that. There is no way to know clearly whether the data being imported is newer than the data in the db, meaning you don't know if you want to replace the existing data with the match or discard the match... best to let a human figure that out, as that is what they're paid to do.

Other than that, if I have to charge by the hour and clean it up manually, I will just have to do that.

Thanks

-Stanley

Posted

Comment:

Fascinating stuff. Thanks.

-Stanley

  • 5 months later...
Posted

Hi,

I just stumbled across this post, and even though it seems rather late to add to this topic, I'm posting my own strategy for anyone still interested.

I base my duplicates on the address, since it seems wasteful and unprofessional to send multiple mailings to one address (in most cases), even if the recipients differ. It does make sense, though, to send to a single person at different addresses if you can't determine the most appropriate one.

So, I have a concatenated field that grabs the first 4 characters of the address and the zip, and I search that field for duplicates.
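
The field itself is just (field names are whatever yours happen to be):

   Left ( Address ; 4 ) & Zip

and a find with the ! (duplicates) operator in that field turns up the sets to review.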
