Steven Cappiello Posted April 16, 2012

Hi there,

I am trying to clean up a database of journal articles. I am particularly interested in identifying potential duplicates by comparing article titles. I am thinking that a calculation could compare two text strings and return a "% similar" value, so that the following two strings would be flagged as a potential duplicate:

"Deep-inspiration breath-hold PET/CT of the thorax"
"Deep-inspiration breath-hold PET-CT of the thorax"

As would these:

"∆p in the thorax precipitates asynchromatic sarcoma"
"delta-p in the thorax precipitates asynchromatic sarcoma"

However, this pair would get a low "% similar" rating:

"Alpha beta in the left quadrant"
"Geriatric sarcoma unhinged"

I've looked around and have not found a solution for this in the FMForums, nor have I found a custom function, though I suspect this problem has been addressed by someone before. Does anyone know of a method I might be able to employ?

Thanks in advance for any information you can provide. :)
Lee Smith Posted April 16, 2012

One way would be to substitute out the punctuation characters, like

Substitute ( Title ; [ "/" ; " " ] ; [ "-" ; " " ] )

and then search the result for duplicates using the ! operator.

Lee
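For anyone who wants to try the idea outside FileMaker first, here is a minimal Python sketch of the same approach: replace "/" and "-" with spaces, then group titles whose cleaned-up form is identical. The function names are made up for illustration, and the lowercasing and whitespace collapsing are extra assumptions that go beyond the Substitute call above.

```python
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Replace "/" and "-" with spaces, collapse whitespace, lowercase."""
    for ch in "/-":
        title = title.replace(ch, " ")
    # Lowercasing and whitespace collapsing are assumptions added here;
    # the Substitute call in the post only swaps the two characters.
    return " ".join(title.split()).lower()


def find_duplicates(titles):
    """Return lists of titles that share the same normalized form."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
        "Alpha beta in the left quadrant",
    ]
    for group in find_duplicates(sample):
        print(group)
```

This only catches titles that become exactly equal after cleanup, so it would flag the PET/CT pair but not the "∆p" versus "delta-p" pair.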
comment Posted April 16, 2012

It seems that in your case "similarity" would increase in proportion to the number of words common to both titles?
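One possible way to turn that observation into a number is a word-overlap ratio: shared words divided by all distinct words in the two titles (often called a Jaccard index). The Python sketch below is only an illustration of that idea, not something taken from this thread; the function names and the choice to split on non-word characters are assumptions.

```python
import re


def title_words(title: str) -> set:
    """Lowercase the title and split it into a set of words."""
    # Assumption: anything that is not a letter, digit or underscore
    # separates words, so "PET/CT" and "PET-CT" both give {"pet", "ct"}.
    return set(re.findall(r"\w+", title.lower()))


def word_overlap(a: str, b: str) -> float:
    """Shared words divided by all distinct words, from 0.0 to 1.0."""
    words_a, words_b = title_words(a), title_words(b)
    if not words_a and not words_b:
        return 1.0  # two empty titles are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print(word_overlap(
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
    ))
    print(word_overlap(
        "Alpha beta in the left quadrant",
        "Geriatric sarcoma unhinged",
    ))
```

With this definition the PET/CT pair scores 1.0 and the "Alpha beta" pair scores 0.0. Word order and repeated words are ignored, which may or may not be acceptable for article titles.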
Russell Barlow Posted April 16, 2012

"Does anyone know of a method I might be able to employ?"

Fuzzy string comparisons can be handled with regular expressions, and there are a few different plugin options for doing it. The Wikipedia article has some places to start; ScriptMaster, the bBox plugin, and many others would give you the ability to leverage regex. I personally have not gone to this degree with FileMaker, but I have done some similar things using Ruby.
Ocean West Posted April 17, 2012

You may need to explore this and the associated wiki articles: http://www.briandunning.com/cf/965
Steven Cappiello Posted April 17, 2012

Thank you all. The Levenshtein calculation on Brian Dunning's page seems to be along the lines of what I had in mind. LOL, I would never have searched for "Levenshtein".

Thanks for all the useful feedback. I will see what I can come up with and may post it back here for review.
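For reference, here is a rough Python sketch of how a Levenshtein edit distance can be turned into a "% similar" value. It is a generic textbook-style dynamic-programming version, not the custom function from the page linked above, and scaling by the longer string's length is just one reasonable assumption for the percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def percent_similar(a: str, b: str) -> float:
    """Assumed scaling: 100 * (1 - distance / length of longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # prints 3
    print(percent_similar(
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
    ))
```

The two PET/CT titles differ by a single character out of 49, so they come out at roughly 98% similar, while unrelated titles score far lower. Where to set the "potential duplicate" cutoff is a judgment call that would need testing against real data.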
psmithw Posted April 19, 2012

My first thought was the Levenshtein calculation too. But I've done this manually before (and prefer the manual process in most situations). Using "Insert from Index" in Find mode, and "Insert from Index" and "Replace All" on the found set, you can get through them pretty quickly (as long as you only have a couple hundred strings to merge), plus you have human intelligence doing the matching. Just think twice each time before you click the "Replace All" button.