Advanced Fuzzy Search Function

racemysunfish · January 16, 2008

Hello, To start I'm really sorry if this is in the wrong forum :

I need some help with some pseudocode.

My client has a filemaker database filled with business records, with fields labeled "phone number" "business address" "city" "zip" etc.

I have an excel file filled with other records with the same named fields. I need to compare these two databases together and figure out which entries are NOT already in the filemaker database.

Problem: Matching records do NOT have the exact same information. For instance:

"Joes Pizza Parlor" and "Joe's Pizza" are in fact the same business, but the records are slightly off.

"1156 N. Willow" and "1156 Willow street" should be the same record.

Obviously I can't rely on simple searches to match the two files. Can someone point me in the right direction on how to compare slightly different records?

The matches don't all have to be EXACTLY exact, but every record that falls through the cracks loses my client money.

Søren Dyhr · January 16, 2008

I can't promise that it will get all but some might show up by approaching it this way.

--sd

vicinity.zip

comment · January 16, 2008

I don't get it: if the first character matches, there is a match. If it doesn't, there won't be a match, no matter how many more characters you add - so why keep adding them?

I think we need a better statement of purpose here. Obviously, the best that can be done is to present suspected duplicates for inspection by a human. What would constitute a suspected duplicate is a conceptual problem that needs to be solved before it can become a computation problem. The two examples given are hardly enough to suggest a suitable approach. Off the bat, I would say either one of these could be considered:

• common word (or common sequence of n words)

• common sequence of n characters

• Soundex

Søren Dyhr · January 16, 2008

is a conceptual problem that needs to be solved before it can become a computation problem.

Humans are indeed much better than machines here, but I disagree here, both context and purpose are pretty well described but the idea of letting machinery be in charge of it is kind of making promises which can't be fulfilled!

I don't get it: if the first character matches, there is a match.

Provided the the remaining fields also are in the vicinity - I do not believe in non humanized processing so my suggestion is merely a aid to the process, not that it gets particular much unless the postcode/zip is like in the UK where each small gathering of houses shares one unique code.

But the mission as such needs to get scrutinized here, can it really pay off to let a developer waste his time think up a nowhere near system/algorithm ... couldn't it be wise if he just occupied himself with the returns from the campaigns.

So it all falls back on how efficiently the postcode system seems to be.

--sd

comment · January 16, 2008

Provided the the remaining fields also are in the vicinity

I must be missing something. You have a repeating field. The first repetition returns the first character. The second repetition returns the first two characters. Why do we need the second repetition? Is there a combination of two strings that will be matched by the second repetition, but not the first?

comment · January 16, 2008

both context and purpose are pretty well described

I'm afraid I cannot agree. The problem here is not (yet) how to get the answer, but which question to ask. I can get "Joes Pizza Parlor" and "Joe's Pizza" to match in twenty different ways. But each way will have a different set of false positives and failed matches. There is no way to evaluate them without either seeing a lot more of the data or getting the OP to formulate some rules.

LaRetta · January 16, 2008

Phone number (stripped to numbers) and first word of name can assist here. Also, zip and first word of name. But it all is a guess regardless. And even if they are the same, if they vary at ALL (in the address or phone or?) then which of the records takes priority, FM or Excel? And if FM has an address and Excel doesn't but Excel has a phone and FM doesn't? Yes, rules are required. You will also get into combining issues. All you can do is provide comparison portals with ability to shift data from one side to the other (per field) with a human verifying each one.

corn · January 16, 2008

I built a system that does this for people but unfortunately I can't distribute it. The gist of it is that there are two parts to the question: acquisition and discrimination.

The "fuzzy search" is the acquisition part. You want to find a group of records that match in a loose sense of the word. I attacked this by defining different match criteria for each attribute. For example, looking at address I might consider only the numeric part of the street and the zip code to make my match. For phone number I might consider only the last four digits. Some of the matches are done using functions in a proprietary plug-in created specifically for the task. The combined set of records from each of the individual match criteria form the set of potential matches.

The discrimination part involves scoring the potential matches on an open-ended scale based on how "close" they are to the record in question. This was done using another plug-in function. I then present the potential matches in a portal sorted by the score. For automatic operation you can define minimum and maximum threshold for the score that defines a "match." The system automatically matches records below the minimum score threshold and adds records above the maximum score threshold - the remainder are presented for the user to manually inspect.

Sign In

Advanced Fuzzy Search Function

Recommended Posts

racemysunfish

Søren Dyhr

comment

Søren Dyhr

comment

comment

LaRetta

corn

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information