Duplicate records/store fronts

February 20, 200422 yr

Hey folks,

I'm looking at working with a new client. They've got a contacts db that contains company and people info. They've had a problem with staff inadvertently hitting CMD D instead of CMD F and creating duplicate records and then not deleting the duplicate, so over time, the duplicate record starts to look different from the original because sometimes the original may get updated and sometimes the duplicate may get updated. Add to that nightmare the fact that some companies have multiple locations/store fronts, but of course there is no ID that associates multiple store fronts to a single owner. Store fronts can look very much like a duplicated record that has been modified from the original, but is in fact, not a duplicate.

My issues are: what's a good direction to start thinking with respect to finding duplicate records and determining which one should be kept, and how to merge the data from the duplicate(s) into each original record while at the same time not accidentially merging a store/front into the "master company" record?

Right now I'm not looking for specifics, just more of some ideas to get the juices flowing.

February 20, 200422 yr

Basically, a lot of work. My last big job had over 10,000 records entered into a "flat-file" containing both customers and jobs, everything entered manually, lots of spelling mistakes, abbreviations, etc.. It took a long time, both myself and an inside helper, to get it more or less straight. I'm not saying it's not worth doing; it's critical for any real database functionality; I'm just saying to protect yourself and make it clear that this is hourly work.

It is especially difficult if a latter entry might be a valid separate "customer" record (which should be a separate file from the "many" jobs (or whatever). In my case this was sometimes true, but only if the address was different; there are several Burger Kings, etc..

I used several methods. Basically you match up what you can automatically, giving them a unique CustomerID (from a file), then create methods that optimize viewing the problem records, so that you can manually match them.

The first part, fixing obvious matches, you know. Here's some ideas for not so obvious matches.

Create a "partial" key, using only part of the fields. I used something like:

Left (CompanyName, 6) and Left(Address, 4)

The reason is that most typo errors occur after the first few letters, especially in Addresses (St., St, Street, nothing).

Build a self-relationship and a portal on that, with just a few rows, so you can view the records as a list, but still see possible matches, with whatever fields you need to make a decision.

The reason for the portal is that you have a Found set of non-ID'd records. What you really want is to see whether a previously ID'd record matches this "fuzzy" relationship.

Search for unfixed records having more than 1 match. Sort by Name. Look at the matches. See if they're really different. If not, give matching records the unique ID of the first one. If there's no other matching record, mark the record for "New," so it can be imported into Customer file.

[Enable "drag-and-drop" editing. Make ID field "Select contents"]

When you're all done you should have unique records in a Customer file, and every record in the "many" file should have its correct CustomerID.

But there's still the data problem, ie., some records had the phone number, others didn't. I used a brute force method, looping through, after sorting by CustomerID, setting data into globals, populating the field only if it was empty in later records. At the end you have all possible data in the last record for each customer. This is then imported into the Customer file.

The file is then a bloated pig. So delete customer data that is not needed in the file for functionality (or change to related calculation). Then Save A Copy Compressed. You then have a lean file with the right data.

February 20, 200422 yr

Author

Thanks Fenton. Creating the partial key isn't something I'd considered. I've encountered the problem of company and people records being in the same file along with the problem of store fronts, but because there were only about 1000 records in that db, I had my client manually go through and eliminate duplicates (better to pay a staff member to spend the time on that project than a consultant), but this db contains about 60,000 records, so asking for that isn't reasonable.

This will be hourly work, and because of that my client is looking for as quick a fix as possible.