stanley Posted May 2, 2005 Hello All: I'm working with a non-profit that buys mailing lists from other non-profits and imports them into its contact db in FMP. Quite often they end up with duplicates, some of which aren't exact duplicates (due to typing errors, formatting differences, etc.). I'm hoping someone out there has a good strategy for eliminating duplicate entries, as it's bothering me a bit, and they already have literally thousands of duplicates in their system... Thanks in advance -Stanley
Vaughan Posted May 3, 2005 If I had a silver-bullet solution for reliably identifying duplicate addresses, I sure wouldn't post it here... I'd patent it and live off the royalties! From my limited experience with bulk mail, duplicate and dead addresses are part of the deal. Most address-selling companies have guarantees regarding maximum error rates, or at least they should. I worked on one system a year or so ago: they would dump all address information every 6 months and import a whole new set each time they did a mail-out.
sdpetemc Posted May 3, 2005 What version of FileMaker are you using? With version 6.0 I had great results using the 'Update matching records in current found set' option on import. To use this, you must identify a matching value; maybe you could use a concatenation of First and Last Name, or something like that. You then set the match field to import with a double-sided arrow. In this case the import will UPDATE data in a matching record, but will not ADD a new, duplicate record. Note: the database must have at least one record for this to work, and you should also select the option to 'Add remaining records'. I have had difficulty getting this to work with FileMaker 7. For my systems that run in 7, I have implemented a Duplicate Relationship Calculation field instead. I found this in the FM 7 help files by searching on Duplicate Records; I won't describe it here, as the help file covers everything you need to quickly get rid of any dups.
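A minimal sketch of that kind of match-key calculation, assuming the contact fields are named NameFirst and NameLast (adjust to your own schema; if your version won't let you match an import against a calculation field, use a plain text field with this as an auto-enter calculation instead):

  // Match key for 'update matching records' imports: normalizes case and stray spaces
  Lower ( Trim ( NameFirst ) & " " & Trim ( NameLast ) )

Import against this field with the double-sided arrow and two "John Smith" rows will update rather than duplicate each other. It won't catch typos, of course.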
Vaughan Posted May 3, 2005 Most of the problem is *identifying* duplicate records in the first place, not updating or deleting them. Should two different people at the same address be considered duplicates? I dunno. You have to decide. It's a business rule. What about the same person at two different addresses? I dunno either. You decide. Business rule. Then there is the slew of addresses that differ by abbreviation (Road, Rd, Rd. etc.), all of which are different but *mean* the same. How do you program all that into a database? I dunno. I recently worked on a database of people that has a "nationality" field. Users had free rein over data entry into this field, so something simple like "China" had...

China
Peoples Republic of China
People's Republic of China
PRC
P.R.C.
P.R.C
P R C
PRoC
PR China
P.R China
P.R.China
PR of China
Chinese

... plus a multitude of variations caused by bad typing. In the end I made a script that sorted out 90% of the records, but that left 10% of 50,000 records to sort out by hand, which was a full day's work. Or a full night, as it turned out. I charge for this kind of data clean-up by the hour.
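This isn't Vaughan's actual script, but a rough sketch of the kind of normalization pass that knocks out the variations listed above, assuming the field is named Nationality (FM 7 calculation syntax):

  Let ( n = Upper ( Substitute ( Nationality ; ["." ; ""] ; ["'" ; ""] ; [" " ; ""] ) ) ;
  Case (
    n = "CHINA" or n = "CHINESE" or n = "PRC" or n = "PROC" or
    n = "PRCHINA" or n = "PROFCHINA" or n = "PEOPLESREPUBLICOFCHINA" ;
    "China" ;
    Nationality   // anything unrecognized is left untouched for a human to sort out
  ) )

Stripping periods, apostrophes, and spaces first collapses "P.R.C.", "P R C", and "PRC" into one token, which keeps the Case list short. Bad typing ("Chnia" etc.) still falls through to manual clean-up, which is where the 10% comes from.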
Fenton Posted May 3, 2005 I can't tell you how many hours I've spent looking for duplicates in crappy data. One little trick I've found is to only use the first part of the address. As Vaughan says, the ends of addresses are pretty much a nightmare to match exactly, and for a basic duplicate test they are not necessary. If you use the street number and only part of the rest, you will match quite a few more records, with very few false positives. Use a calculation. Another technique is to show the results of such a self-relationship match in a list view. Remove the current record's ID from the match by excluding it in the self-relationship (serial ID not equal to self-relationship::serial ID); then you will only see a match for a genuine duplicate. This makes it very easy to quickly check a duplicate to see if it is false, as you can see both records at once (or several, but there's usually only 1, if that). In other words, instead of trying to go only for absolutely equal matches (do those first), loosen up your matches a little and make it very easy to see, edit, or delete the matches.
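For example (field and table occurrence names here are assumptions, not Fenton's actual schema), the loose match key could be a stored calculation like:

  // Street number plus the first few letters of the street, ignoring the messy tail
  Left ( Trim ( Address ) ; 8 ) & "|" & Zip

and the FM 7 self-join would pair two predicates, so a record never matches itself:

  Contacts::MatchKey = Contacts 2::MatchKey
  Contacts::SerialID ≠ Contacts 2::SerialID

A portal (or list view) through that relationship then shows only the suspected duplicates of the current record.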
stanley Posted May 3, 2005 (Author) Thanks, guys; all good points, especially Vaughan's about multiple ways of entering the same data - this is exactly the kind of thing I'm running into. It's all about finding what to human eyes is a duplicate, but what to the computer is clearly NOT a duplicate. Fenton's trick of a self-relationship showing dupes but not the current record is excellent - I may stick that in a portal, tell the client to pay attention, and build a method to delete the records shown in the portal. Regarding SD Pete's point about matching fields: there's a big problem with that. There is no way to know whether the data being imported is newer than the data already in the db, meaning you don't know whether to replace the existing data with the match or discard the match... best to let a human figure that out, as that's what they're paid to do. Other than that, if I have to charge by the hour and clean it up manually, I'll just have to do that. Thanks -Stanley
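One possible sketch of that delete method, assuming the self-join above is named "Contacts by MatchKey" (standard FM 7 script steps, written loosely; this deletes records, so have a human confirm the portal contents before running it):

  Go to Related Record [ Show only related records ; From table: "Contacts by MatchKey" ]
  Delete All Records [ No dialog ]

The Go to Related Record step isolates just the duplicates as the found set (the current record is excluded by the ≠ predicate in the relationship), so the delete removes only those.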
comment Posted May 3, 2005 Perhaps this could be of interest to you: http://www.fmforums.com/threads/showflat...true#Post136270
Kas Posted October 24, 2005 Hi, I just stumbled across this post, and even though it seems rather late to add to this topic, I'm posting my own strategy for anyone still interested. I base my duplicates on the address, since it seems wasteful and unprofessional to send multiple mailings to one address (in most cases), even if the recipients differ. It does make sense, though, to send to a single person at different addresses if you can't determine the most appropriate one. So, I have a concatenated field that grabs the first 4 characters of the address and the zip, and I search that field for duplicates.
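As a FileMaker calculation (field names assumed), that concatenated key would be something like:

  // First 4 characters of the street address plus the ZIP
  Left ( Address ; 4 ) & Zip

Store and index it, and a find on this field using the ! (duplicates) operator in Find mode turns up all the candidate records in one pass.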