Locating Similar Records

March 19, 201412 yr

Hello,

I have a contacts solution that may contain similar names but not identical (perhaps). My goal is to locate possible duplicate records. What would be the best way of searching through a list of more than 1000+ contacts and display possible "duplicates" for review? These are contacts that may have been created twice and not actually duplicated. So they all have a unique id. I've seen scripts that use the Levansthein (or something like that) method for comparing similar text. Any thoughts?

Thanks and regards!

Stephen

March 19, 201412 yr

Hi Stephen,

Do you have their emails or phone numbers by chance? :-)

Otherwise it can get complex. Are the names at least in their separate fields of FirstName and LastName or is it a single string? If a single string, your first priority would be to split them into their proper fields.

Also, do you have their address? Many times their state or zip can be used in conjunction with a last name etc.

But we need to know more about the data itself to assist.

March 19, 201412 yr

Here is one possible way to detect duplicates that uses the relationship graph.

detecting duplicates.zip

March 19, 201412 yr

Hi Kris,

Providing a sample is nice. A few things: Your example is a multi-predicate (many keys) and not an equijoin (as noted in your file) which is =. Also, if you are comparing multiple records in this self-join, you will need to eliminate the unique ID from that join.

The benefit of using a relationship and a self-join is that you can place a portal of the self-join on your layout and review all duplicates for a single person in one screen so it is certainly valuable. This technique is helpful once we are sure the data resides in separate fields but it will not match if they are not exact. And if they are exact then concatenating the fields and performing a find for ! would find those duplicates as well.

Phone numbers and emails actually work better than names, if available. For those records which are similar and not resolved by phone number or email, a custom function and human eyeballs usually are required.

March 19, 201412 yr

LaRetta..

Perhaps i erred in my terminology in the file.. What I meant to say is that all you have to do is add a member to the join with the equality symbol to add criteria that will make the detection of duplicates "tighter". if one were to place a calc that asks if the record is a duplicate i.e Data::is_duplicte = If ( duplicates::pk ; true ; false ) it should allow finding all records where is_duplicate = True and delete them in one fell swoop with assurance that the uniques will be preserved and the duplicates deleted. I used this technique a few years ago and it worked like a charm. Now weather i reproduced it exactly right from memory is debatable.

March 19, 201412 yr

Author

Hi Stephen,

Do you have their emails or phone numbers by chance? :-)

Otherwise it can get complex. Are the names at least in their separate fields of FirstName and LastName or is it a single string? If a single string, your first priority would be to split them into their proper fields.

Also, do you have their address? Many times their state or zip can be used in conjunction with a last name etc.

But we need to know more about the data itself to assist.

I do have fields for emails and phone numbers. The name strings are also separated into FirstName and LastName. I would like to use a relationship so it could display the possible records for removal, merging, etc.

March 19, 201412 yr

Then you can use a self join as Kris suggests. Also I noticed that the primary key (in the file indicated as PK) ≠ and not = ; my mistake, Kris, it's hard to see that little slash sometimes. :-)

By using the 'not equal', only records that match other than the one being viewed will appear in the portal. Place the portal rows tall enough to hold more information about the person so you can compare ALL of the record. Sometimes the current record has information that the duplicate doesn't and vice versa.

Also keep in mind that, if even one of the fields used in the join is empty (on one side or both), it will not match so even if both the parent and the related have an empty phone for example, the relationship will fail.

Good luck on your project!