April 16, 2012 Hi there, I am trying to clean up a database of journal articles. I am particularly interested in identifying potential duplicates by comparing article titles. I am thinking that a calculation could compare two text strings and return a "% similar" value, so that the following two strings would be flagged as potential duplicates:
"Deep-inspiration breath-hold PET/CT of the thorax"
"Deep-inspiration breath-hold PET-CT of the thorax"
As would these:
"∆p in the thorax precipitates asynchromatic sarcoma"
"delta-p in the thorax precipitates asynchromatic sarcoma"
However, this pair would get a low % similar rating:
"Alpha beta in the left quadrant"
"Geriatric sarcoma unhinged"
I've looked around and have not found a solution for this in the FMForums, nor have I found a custom function, though I suspect this problem has been addressed by someone before. Does anyone know of a method I might be able to employ? Thanks in advance for any information you can provide. :)
April 16, 2012 One way would be to substitute out the characters, e.g. Substitute ( Title ; [ "/" ; " " ] ; [ "-" ; " " ] ), and then search on the result using the find operator for duplicates ( ! ). Lee
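For anyone who wants to try the substitute-then-find-duplicates idea outside FileMaker first, here is a minimal Python sketch of it. The function names are made up for illustration, and the lowercasing and whitespace collapsing are assumptions added on top of the two substitutions in the post above.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Replace "/" and "-" with spaces, mirroring the Substitute() call above.

    Lowercasing and collapsing repeated spaces are extra assumptions,
    not part of the original suggestion.
    """
    cleaned = title.lower().replace("/", " ").replace("-", " ")
    return re.sub(r"\s+", " ", cleaned).strip()


def find_duplicate_groups(titles):
    """Group titles whose normalized forms are identical.

    Returns only the groups with more than one member, which is roughly
    what a find with the duplicates operator would return.
    """
    groups = defaultdict(list)
    for title in titles:
        groups[normalize_title(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
        "Alpha beta in the left quadrant",
    ]
    for group in find_duplicate_groups(sample):
        print(group)
```

This only catches titles that become identical after normalization. A pair like "∆p" versus "delta-p" still differs after the substitutions, so it would need either more substitution rules or a real similarity score.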
April 16, 2012 It seems that in your case "similarity" would increase in proportion to the number of words common to both titles?
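One way to put a number on "words common to both titles" is the share of distinct words that the two titles have in common (a Jaccard index). The Python sketch below is only an illustration of that idea; the function names and the choice to split on non-word characters are assumptions, not something from this thread.

```python
import re


def word_set(title):
    """Return the set of lowercase words in a title.

    Splitting on anything that is not a letter, digit, or underscore is
    an assumption; it makes "PET/CT" and "PET-CT" yield the same words.
    """
    return set(re.findall(r"\w+", title.lower()))


def word_overlap(a, b):
    """Shared words divided by all distinct words, from 0.0 to 1.0."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty titles are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print(word_overlap(
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
    ))
    print(word_overlap(
        "Alpha beta in the left quadrant",
        "Geriatric sarcoma unhinged",
    ))
```

A word-based score ignores word order and treats a one-letter typo as a completely different word, so it is best read as a rough first filter.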
April 16, 2012 "Does anyone know of a method I might be able to employ?" Fuzzy string comparisons can be handled with regular expressions, and there are a few different plugin options for doing it. The Wikipedia article has some places to start; ScriptMaster, the bBox plugin, and many others would give you the ability to leverage regex. I personally have not gone to this degree with FileMaker, but I have done some similar things using Ruby.
April 17, 2012 You may need to explore this and the associated wiki articles: http://www.briandunning.com/cf/965
April 17, 2012 (Author) Thank you all. The Levenshtein calculation on Brian Dunning's page seems to be along the lines of what I had in mind. LOL, I would never have searched for "Levenshtein". Thanks for all the useful feedback. I will see what I can come up with and may post it back here for review.
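For reference, here is a minimal Python sketch of how a Levenshtein (edit) distance can be turned into the "% similar" value asked for at the top of the thread. This is not the custom function from Brian Dunning's page; the function names and the choice to divide by the longer string's length are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Count the single-character inserts, deletes, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def percent_similar(a, b):
    """Scale the edit distance to 0-100, where 100 means identical.

    Dividing by the longer length is one common convention, assumed here.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    print(percent_similar(
        "Deep-inspiration breath-hold PET/CT of the thorax",
        "Deep-inspiration breath-hold PET-CT of the thorax",
    ))
```

The two PET/CT titles differ by a single character, so they score close to 100. Comparing every title against every other one grows with the square of the record count, so on a large table it may help to normalize first and only score the records that remain.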
April 19, 2012 My first thought was the Levenshtein calculation too. But I've done this manually before (and prefer the manual process in most situations). Using "Insert from Index" in Find mode, and then "Insert from Index" and "Replace All" on the found set, you can get through them pretty quickly (as long as you only have a couple hundred strings to merge), plus you have human intelligence doing the matching. Just think twice each time before you click the "Replace All" button.