Techniques for Merging Duplicates

August 1, 200619 yr

I have a database with a non-trivial number of records which are non-identical duplicates, but should be combined or cleaned up in some way. For example, a record for "John Johnson 11110043" and "John Q Johnson 11120043" that turn out, upon further examination, to be the same person, mistakenly entered twice with an error in their primary ID # and a missing middle name.

My question is, are there any "tried & true" ways of dealing with this. I've thought about some different ways:

Combine the relevant data from record A and record B into a new record C, with a new ID number. Mark A and B as "obsolete" or "unused".
Combine the relevant data from record A and record B into a new record C with a new ID number. Delete A and B.
Combine the relevant data from record A and record B into a new record C. Export A and B to a "deleted records" table, then delete A and B.
Combine the relevant data from record A and record B into the existing record B (same ID number). Delete A.
Etc.

I'm torn between the ideas of keeping all old data forever (even bad, duplicate data), and cleaning & pruning the data so that the current set is as "pure" as possible.

It seems like this is a common problem that others have dealt with. Any advice?

August 1, 200619 yr

Author

To clarify: My question is not about Detecting the duplicate records (that problem, though difficult, is basically solved). My question is about what to do with the duplicates once they are discovered.

Added complexity: Both Record A and Record B may have related employment data that should be saved i the merge process. (e.g. a person works for a while as John Johnson, leaves for a bit, then returns and works under the new name / ID. The merged record ideally should include their entire work history from both records.