Identifying "duplicate" contacts

August 5, 200421 yr

Apologies in advance if this is not the right forum - please recommend a better one -)

This must come up fairly often: how do you identify "duplicates" in a file of people containing name, address, city, state, zip, and maybe telephone #? My definition of "duplicate" is inherently loose - it tolerates a certain amount of discrepancy due to abbreviations, typos, and omitted info, as well as discrepancies due to a change of address. Thus, any two records would have a certain *probability* of referring to the same individual. A human would make the final call, but I want something that would greatly narrow the field beforehand.

This sort of logic would seem to be well beyond the ability of FM scripting to implement, so I thought there might be a plug-in that did this kind of task.

Thanks,

Chap

August 6, 200421 yr

You can create a calculated field that aggregates all of the criteria you specify for possible dupes. Say you want to look for all records that have the same first and last name. Create a calc field that concatenates these fields ( or any others you want), then create a self join based on that field (say you named the relationship "dupeFinder" as below). Another calc field would then be used to identify the records:

Case((IsEmpty(FirstName) or IsEmpty(LastName));"Empty";

Case(

Count(dupeFinder::FullName) = 1; "Unique";

Case(

GetAsNumber(contactID) = GetAsNumber(dupeFinder::contactID); "Original"; "Duplicate"

)

If the count function returns the value "1", that means there is only one value with a match (the record itself) Otherwise it compares the unique ID field of the current record with the first one in the the related set. If it matches, then the first record created must equal the current record, thus it's the original one, and all others will be flagged duplicate by default.

To relax or constrain your criteria, just change how your calc field concatenates the fields of interest.

(Like if you wanted to use just the first initial and the last name)

Hope this helps.

Dana

August 6, 200421 yr

Author

Thanks, Dana

It turns out there's been a sizeable amount of research done on the subject of "approximate string matching". There are some pretty complex algorithms that can assign a probable match rating that tolerates a certain amount of discrepancy such as letter inversions, substitutions, and omissions. It would be interesting to start with keys such as you describe, and then do "approximate matching" to expand the matching set to include minor typos & abbreviations.

Maybe there's a product here! -)

Chap