Customer Data Matching

DataCAD · September 22, 2008

I'm new to this forum, so pardon me if I am posting this in the wrong area.

I have been using Filemaker for about ten years, and I'm basically self-taught. I am confortable creating databases, relationships, layouts, calculations and scripts. I've been able to muddle through on my own... until now.

Ten years ago we began using one of the Filemaker example templates as our order entry system. The mistake was not setting up a true customer database at that time. Now, 10 years and 30,000 sales records later, I need to create a customer database. This means finding all matching sales records for each customer, based on name, company, address, or some combination.

Does anyone have experience going through this process who could point me to resources?

Thank you for any advice.

Fitch · September 22, 2008

Is phone number an option? That or the address should be "unique enough." Let's say it's phone. Even better, let's make a calc field PhoneCalc: Filter(phone;"0123456789"). This will compensate for most data-entry variations.

1. Create a CustomerID field

2. Find all

3. Sort by PhoneCalc

4. Export the data, summarized by PhoneCalc

The exported file is your customer list. I.e., you should have one record per phone number.

5. Import the list into a new table.

6. Replace the CustomerID with serial numbers

7. Create a customer::sales relationship by PhoneCalc

8. In the sales table, replace the CustomerID with the related customer ID

Does that make sense? Once you're done you can get rid of the PhoneCalc relationship and standardize on the reliable CustomerID field for relating sales to customers.

DataCAD · September 22, 2008

Thanks. That seems like a common sense approach. It would easily find most matches, but since these records have been accumulating for 10 years the data every field (including phone number) has changed at least once. There is probably no way to avoid manually checking the record matches anyway, but I was hoping there is some kind of fuzzy logic comparison (probabilistic matching)which I could perform using a combination of calculated fields and a script which would return a match probability factor.

Do you (or does anyone in FMPro Land) know anything about implementing Dice's Coeffient, Levenshtein Distance, or the Jaccard Index?

comment · September 22, 2008

Do you (or does anyone in FMPro Land) know anything about implementing Dice's Coeffient, Levenshtein Distance, or the Jaccard Index?

I don't think such methods would be very helpful. Do you have a lot a instances where Smith has been entered as Simth or Snith? Not to mention that it would be terribly inefficient: you would need to compute the distance between every possible pair of records.

Perhaps Soundex might suit you better - hard to say without knowing what kind of problems your data has, and what is your ultimate goal - for example, are sales to be associated with companies, or with individual contacts in those companies (who might by now work for another company), and so on.

Fitch · September 22, 2008

You can't escape manual error checking here. But in my experience, a company will tend to keep its phone number even when it moves to a new address. Unless as Michael points out, you're talking about individual contacts that change within a company.

DataCAD · September 23, 2008

Thanks again for the ideas. Depending on your point of view this is either a challenging problem or a big pain in the... neck. Personally, I like puzzles, and I'm sure there must be a reasonable solution for this.

Ten years worth of records means phone numbers, company names, individual contact names, addresses & email addresses have changed at one time or another. This is not always true, but is is common enough that I have to deal with it. Believe me, if I thought I could get away with using the phone number I won't bother posting here.

Here is one fairly representative example. Over the past 10 years, this customer has 16 sales records.

Some records have the name: Howard Taylor

Other records have the name: Howard G. Taylor

Some have the company name: HGT Architect

Other records use the company: HGT Architects

Some records have no company name at all.

There are at least 2 different street addresses, and several different phone numbers. His email address has changed even more frequently.

Add to this, there is no address standardization in our database (you'll find "Street" or "St"). Also, individual contact names actually change frequently with some firms. It might be as simple as "Steve" vs "Steven", or it could be an entirely diffenrent name. Company names also change more frequently than you might expect. Architectural partnerships form and dissolve quicker than a marriage in the 70's.

The more sales records involved, the more likely some piece of data has changed over time. Of the 6 or 8 fields useful in identifying the customer, often one or more changes from one sales record to the next (for a give customer).

I suppose I could try doing s "strict" match using the phone number and see how many individual "orphaned" records I have left over, but what will happen is I'll end up with thousands of disconnected clusters, and many clients will end up with multiple customer ID's.

DataCAD · September 23, 2008

Our sales records are usually based on company name, but many of our customers do not have a company name. Many of are customers are in the database both ways (with and without). Firm names also change more often than I like. There are many cases where the individual contact leaves a firm, but the software license (what we sell) stays with the firm anyway.

comment · September 23, 2008

It might be easier to adopt a policy of inclusion, rather than exclusion, for the first part. For example, matching the records by a return-separated list of last name, company name, phone number, e-mail, etc. This may err by clustering together some false positives that a human would need to split in the next stage.

DataCAD · September 23, 2008

So, if I'm understanding you correctly, if the data in ANY of those fields match, then the records are considered a match. Then clean up the false matches later.

Definitely worth considering. Especially since all that fuzzy logic stuff is probably WAY over my head anyway. :P

comment · September 23, 2008

So, if I'm understanding you correctly, if the data in ANY of those fields match, then the records are considered a match.

Yes. This could be fine-tuned by including/excluding elements, and/or by adding some 'smart' processing (e.g. include every word of address that has 7 or more characters) - until you get a reasonable starting point.

I would also include 'cascade clustering', i.e. if record 1 matches record 20, and record 20 matches record 100 (but record 1 and record 100 do not match each other), then all three would be clustered together nevertheless.

DataCAD · September 23, 2008

That all makes sense. Thanks for helping me get back on track. It sounds like you can do this stuff in your sleep.

I'm still open to any other ideas on the subject.

comment · September 23, 2008

LOL, don't get me wrong: this is going to take a LOT of work - and human intervention at some stage will be inevitable.

Another point to keep in mind: make sure you have a back-tracking mechanism in place. If you decide to split a cluster, you need a way to link the resulting records back to their originating sales.

DataCAD · September 23, 2008

Another good point. I had considered the need for the ability to manually merge two customers and their associated sales data, but I hadn't considered the need to be able to spilt them apart. This is the kind of stuff I usually only figure out after doing it wrong. Thanks again.

Sign In

Customer Data Matching

Recommended Posts

DataCAD

Fitch

DataCAD

comment

Fitch

DataCAD

DataCAD

comment

DataCAD

comment

DataCAD

comment

DataCAD

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information