cjaeger Posted April 18, 2003 Posted April 18, 2003 Hi there, any idea on how to implement a "fuzzy" find? Or find similar based on certain criteria) The problem is that I have a product database with thousands of misspelled records, all in one field, and would like to find duplicates or similar records. I tried ranking with multikeys (replace(textfield;" ";"q|"), giving me every word as a key for a portal, and a calc which would add up the occurrences of every word of the original record in the related record. (sorted portal). painfully slow and too many finds. Then I added a filter which would filter out common words like "q|theq|", "q|andq|", "q|mmq|". How do I filter out words under a certain length? And what would be the best way to concenate hyphenated words (german words can be very long ....). And yes, I did a lot of regexing in BBedit before importing the almost 100000 records. but it now appears that I get those files on a regular basis, so manually cleaning them up is not an option.
LiveOak Posted April 18, 2003 Posted April 18, 2003 I'm not sure from your description exactly what problems you are trying to solve. You didn't give any examples of the sorts of the kinds of errors (other than misspellings) you are trying to fix. How are duplicates defined? For misspellings, I would use the spell checker with an enhanced dictionary to make corrections. -bd
cjaeger Posted April 18, 2003 Author Posted April 18, 2003 I'm not sure from your description exactly what problems you are trying to solve. You didn't give any examples of the sorts of the kinds of errors (other than misspellings) you are trying to fix. How are duplicates defined? For misspellings, I would use the spell checker with an enhanced dictionary to make corrections. -bd misspellings: I generated a word list from all 100000 records, that gave me over 9000 different words , almost 2.5 Million words total - Technical descriptions with a lot of numbers in between - Nothing you can solve with FileMaker spell check. duplicates: evey word matches every word -> 100% duplicate (possibly) I also have a match quota of over 100% when I have duplicate words in either of the list. As I forgot to mention, yes, I divide the finds by count of words ..., so it looks somewhat like this: Main File: A quick brown fox jumps over the lazy dog once. Portal: The quick brown fox jumps over the lazy dog once. 100% dup "the", missing "a" The quick brown fox jumps over a lazy dog once 100% real dup A quick bron fox jumps over the lazy dog once 90% spell error in bron Fox the brown dog hunts the fox 70% (not 60) The brown fox 30% The house 10% In the original file, there are many duplicates, duplicate words, hard-hyphened words. As I said, the import of the records is on a regular basis (with 2000 out of 100000 records changed each time). The changed records are not flagged, though. Nor will the corrections made in Filemaker carried over to the source system. The records originate from 10 different companies having a similar product range, thus there are duplicates but cannot be identified by sku or manufacturers sku (the first is not related between companies, the latter not available). The FM Application targets at identifying identical products, and similar products by keyword. Also some cleaning up and correction will be performed in Filemaker manually. The final outcome will be a unified product range for all the companies.
danjacoby Posted April 18, 2003 Posted April 18, 2003 The problem here is that computers are stupid. Except for the truly state-of-the-art, high end computers (which you and I combined couldn't afford), they're incapable of doing anything that resembles a value judgment, such as when two words are almost the same... ...unless they're specifically programmed for such. Spell checkers are programmed this way, which is why, WITH AN ENHANCED DICTIONARY (as Oak says, emphasis mine) so that you can add technical terms, you can correct the slight misspellings. Now, you can program self-relationships based on any number of variables using calc fields, and they can speed up your search, but you're always going to bump up against the GIGO law -- Garbage In, Garbage Out. This limitation on computers is one reason why I use value lists as often as possible, in order to cut down on misspellings. I had a client who had a DB with several different spellings of various names in one field, and it took a long time to boil the list down. That's life.
LiveOak Posted April 18, 2003 Posted April 18, 2003 This is a good example of why products should be handled by product number with the description coming via relationship from a central product catalog. Are any product numbers in use? Could you just enter product numbers, dump the existing descriptions and pickup new ones from a central source? As to sorting through the description problem, this is real research grade stuff. The promised artificial intelligence products which sold a bunch of IPO's in the mid 80's never arrived. Rather than editing descriptions, I would create a product catalog and assign a product number and standard description to each unique item and then code each of the 100,000 records with the appropriate part number. If you truly have 100,000 separate, unique products, this won't help. -bd
cjaeger Posted April 19, 2003 Author Posted April 19, 2003 This is a good example of why products should be handled by product number with the description coming via relationship from a central product catalog. Are any product numbers in use? Could you just enter product numbers, dump the existing descriptions and pickup new ones from a central source? -bd Central Product catalog is exactly what will we the final outcome of the project. The imput is from formerly separate companies and different systems. It's just one of the problems you face when companies merge .... The other approach i am pursuing was "nomalizing" the data. Since all descriptions include technical data, i try to extract this data into separate fields and then compare baesd on tech specs. Problem is, different people use different abbreviations diameter 5 mm 5mm
LiveOak Posted April 19, 2003 Posted April 19, 2003 Good luck. This is a tough problem for which I have never seen a simple solution. The only lesson is when setting up a catalog file think about these sorts of things. Establish standards for descriptions and proof the descriptions for errors and standardization. -bd
Recommended Posts
This topic is 7888 days old. Please don't post here. Open a new topic instead.
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now