Fuzzy string comparison to find duplicates based on text strings that could vary slightly?

Steven Cappiello · April 16, 2012

Hi there,

I am trying to clean up a database of journal articles. I am particularly interested in identifying potential duplicates by comparing the titles of the articles.

I am thinking that a calculation could compare two text strings and return a "% similar" value, so that the following two strings would be flagged as potential duplicates:

"Deep-inspiration breath-hold PET/CT of the thorax"

"Deep-inspiration breath-hold PET-CT of the thorax"

As would these:

"∆p in the thorax precipitates asynchromatic sarcoma"

"delta-p in the thorax precipitates asynchromatic sarcoma"

However, this pair would get a low "% similar" rating:

"Alpha beta in the left quadrant"

"Geriatric sarcoma unhinged"

I've looked around and have not found a solution for this in the FMForums, nor have I found a custom function, though I suspect this problem has been addressed by someone before.

Does anyone know of a method I might be able to employ?

Thanks in advance for any information you can provide.

:)

Lee Smith · April 16, 2012

One way would be to substitute out the characters that vary, like this:

Substitute ( Title ;
    [ "/" ; " " ] ;
    [ "-" ; " " ]
)

and then search on the result using the find operator for duplicate values (!).

Lee
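For anyone who wants to try the same idea outside FileMaker, here is a minimal Python sketch of it: normalize each title the way the Substitute() call above does, then group titles whose normalized forms match. The function names normalize and find_duplicates are made up for illustration, and the case-folding step is an extra assumption that is not part of the calculation above.

```python
from collections import defaultdict


def normalize(title):
    """Replace "/" and "-" with spaces, mirroring the Substitute() call."""
    return title.replace("/", " ").replace("-", " ")


def find_duplicates(titles):
    """Group titles whose normalized forms are identical."""
    groups = defaultdict(list)
    for title in titles:
        # casefold() makes the match case-insensitive (an added assumption).
        groups[normalize(title).casefold()].append(title)
    # Keep only groups with more than one member, similar in spirit
    # to what a duplicate-values find returns.
    return [group for group in groups.values() if len(group) > 1]


titles = [
    "Deep-inspiration breath-hold PET/CT of the thorax",
    "Deep-inspiration breath-hold PET-CT of the thorax",
    "Alpha beta in the left quadrant",
]
duplicates = find_duplicates(titles)
```

This only catches titles that become exactly equal after substitution, so every variation has to be anticipated in normalize.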

comment · April 16, 2012

It seems that in your case "similarity" would increase in proportion to the number of words common to both titles?
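A rough sketch of that word-overlap idea, in Python rather than as a FileMaker calculation: split each title into a set of words and score the pair by the share of words they have in common (the Jaccard index). The names words and word_similarity are hypothetical, and treating punctuation as a word separator is an assumption.

```python
import re


def words(title):
    """Lower-case the title and return its set of alphanumeric words."""
    return set(re.findall(r"\w+", title.lower()))


def word_similarity(a, b):
    """Percentage of distinct words shared by both titles (Jaccard index)."""
    words_a, words_b = words(a), words(b)
    if not words_a and not words_b:
        return 100.0  # two empty titles are treated as identical
    shared = words_a & words_b
    combined = words_a | words_b
    return 100.0 * len(shared) / len(combined)


same = word_similarity(
    "Deep-inspiration breath-hold PET/CT of the thorax",
    "Deep-inspiration breath-hold PET-CT of the thorax",
)
different = word_similarity(
    "Alpha beta in the left quadrant",
    "Geriatric sarcoma unhinged",
)
```

Here the PET/CT pair shares every word and the last pair shares none. Word order and repeated words are ignored, which may or may not matter for article titles.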

Russell Barlow · April 16, 2012

"Does anyone know of a method I might be able to employ?"

Fuzzy string comparisons can be handled with regular expressions, and there are a few different plugin options for doing it. The Wikipedia article has some places to start; ScriptMaster, the bBox plugin, and many others would give you the ability to leverage regex. I personally have not gone to this degree with FileMaker, but I have done some similar things using Ruby.

Ocean West · April 17, 2012

You may need to explore this and the associated wiki articles: http://www.briandunning.com/cf/965

Steven Cappiello · April 17, 2012

Thank you all. The Levenshtein calculation on Brian Dunning's page seems to be along the lines of what I had in mind. LOL, I would never have searched for "Levenshtein".

Thanks for all the useful feedback. I will see what I can come up with and may post it back here for review.
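To make the "% similar" idea concrete, here is a minimal Python sketch of a Levenshtein-based score. It is not the custom function from the page linked above, just the textbook edit-distance algorithm. The names levenshtein and percent_similar are illustrative, and dividing by the longer string's length is only one of several possible ways to turn a distance into a percentage.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def percent_similar(a, b):
    """Scale the edit distance to 0-100 using the longer string's length
    (an assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


score = percent_similar(
    "Deep-inspiration breath-hold PET/CT of the thorax",
    "Deep-inspiration breath-hold PET-CT of the thorax",
)
```

The two PET/CT titles differ by a single substitution, so they score very high, while unrelated titles score low. Where to draw the "potential duplicate" cutoff is a judgment call that would need tuning against real data.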

psmithw · April 19, 2012

My first thought was the Levenshtein calculation too. But I've done this manually before (and prefer the manual process in most situations). Using "Insert from Index" in Find mode, then "Insert from Index" and "Replace All" on the found set, you can get through them pretty quickly (as long as you only have a couple hundred strings to merge), plus you have human intelligence doing the matching.

Just think twice each time before you click the "Replace All" button.
