shady Posted October 15, 2006 Posted October 15, 2006 I have a database of 100,000 records. the records vary from random letters to words to words with random characters. Anyone know if it is possible to compare these records to a list of words from a dictionary and "FIND" all records that have dictionary entries in them?
CoZiMan Posted October 15, 2006 Posted October 15, 2006 Yes. Create the two table relationship: X_Recordtxt >= Dict_Recordtxt AND X_Recordtxt <= Dict_Recordtxt Should return the found set of matching entries in a portal set to this relationship. K?
LaRetta Posted October 15, 2006 Posted October 15, 2006 Shady, you have Advanced. You can use a custom function to create calculation which explodes the field containing the multiple words. Then join this calc field on = to your dictionary word field. If you don't want to use an exploded CF, you can hardcode it thus: LeftWords ( theField ; 1 ) & ¶ & MiddleWords ( theField ; 2 ; 1 ) & ¶ & MiddleWords ( theField ; 3 ; 1 ) & ¶ & ... etc
shady Posted October 15, 2006 Author Posted October 15, 2006 Thanks. It's too early in the morning for me to absorb this. Can you lay this out in easy terms of the logic of each step? I'm not following the syntax at this point. So the possibilities for a record are: 1> "word" (record is a match as is) 2> "abcword" (record has a word in it but also some random characters either before or after word) 3> "ncufubf" (record has no matching words in it) 4> "fbbaorhur" (want to skip any one character matches like "a" and "i") So I'd like to "find" all records that include 1> and 2> Thanks!
LaRetta Posted October 15, 2006 Posted October 15, 2006 abcword is a problem because it is not separated by space. It means you will have to explode (and considering you want to exclude single characters), as: ab abc abcw abcwo abcwor abcword bc bcw bcwo bcwor bcword cw cwo cwor cword wo wor word This requires an exploded permutation calculation and requires a custom function. Search here on Forums for Ender's Exploded Permutations (two custom functions together which handle it). There are variations of the technique around. I'll hunt for some as well (later today if you haven't found any) but I'm just heading out the door.
Fenton Posted October 16, 2006 Posted October 16, 2006 Here is a somewhat bloated little file that tests some text against all the words that begin with "a" (9074 -). It uses several Custom Functions (by others), including one that seems like it might work for this, but doesn't. It is fairly simple to find text that matches any of the whole words. But, as LaRetta says, finding "partials" requires a CF producing every permutation of the text. Which is long. It could alternatively be done by a script, which passes the word to the other table, and loops through all the records. But with my tables it was much slower. It could also be done with a Find using wildcards; which turns out to be about 75% of the speed of the double-loop script above. WordMatch.fp7.zip
shady Posted October 16, 2006 Author Posted October 16, 2006 Thanks. This looks like it may be tougher than I thought. I wish there was some kind of function that could do a simple compare each record to the dictionary table and mark each record that has a match or partial match. Doing permutations on each record makes sense but i was hoping for a macro or ready made script to do it. I'll look through the section you mentioned and see if I can find something close.
shady Posted October 16, 2006 Author Posted October 16, 2006 I'm using 5.5 and couldn't open your attachment. can you save backwards to 5.5? I tried to change my profile but can't find the setting to change it.
Ender Posted October 16, 2006 Posted October 16, 2006 I'm using 5.5 and couldn't open your attachment. That's not what you indicated in your first post. It's kind of hard to help if you change the requirements. can you save backwards to 5.5 No, .fp7 files are not backward compatible with prior versions, and recursive custom functions are not available in FM6 and below.
Fenton Posted October 16, 2006 Posted October 16, 2006 You could solve it by using a modified* version of my Partials wFind script in my file. It doesn't use a CF either; though it's kind of slow. *Mine uses a script Variable, but you could just use a global field instead. As Ender says, you said 8 Advanced, so I used the modern method. You could download the trial version of 8.5. But I'll paste the script here. I think it'll work the same. You could ignore my "View As" steps, but they make it faster. Words: table with the dictionary words, in 1 field, 1 record per word MyWords: table with your text field to test, 1 field, 1 record each #Mark Partials wFind script Freeze Window Go to Layout [ “Words” (Words_) ] Show All Records Go to Record/Request/Page [ First ] View As [ View as Form ] Loop Set Field [ MyWords:_gWord; Value:Words_::Word_ ] # Go to Layout [ “MyWords” (MyWords) ] Enter Find Mode [ ] Set Field [ MyWords::Text_; "*" & MyWords::_gWord & "*" ] Set Error Capture [ On ] Perform Find [ ] If [ Get (LastError) = 0 ] Replace Field Contents [ MyWords::MarkPartial; Replace with calculation: 1 ] [ No dialog ] End If # Go to Layout [ “Words” (Words_) ] Go to Record/Request/Page [ Next; Exit after last ] End Loop Go to Record/Request/Page [ First ] View As [ View as List ] # Go to Layout [ “MyWords” (MyWords) ] Perform Find [ Specified Find Requests: Find Records; Criteria: MyWords::MarkPartial: “1” ] [ Restore ] View As [ View as List ]
Fenton Posted October 16, 2006 Posted October 16, 2006 P.S. I neglected to put that first: Go To Record/Request [First] in my file, in either of those Partial Loop scripts. It should be there. I didn't notice, 'cause I always began on the 1st. But that's just negligent [but not criminally so, I hope -] P.S.2 Change the ";" to "," for FileMaker 5/6, unless you're European
BobWeaver Posted October 16, 2006 Posted October 16, 2006 ...So the possibilities for a record are: 1> "word" (record is a match as is) 2> "abcword" (record has a word in it but also some random characters either before or after word) 3> "ncufubf" (record has no matching words in it) 4> "fbbaorhur" (want to skip any one character matches like "a" and "i") I'm not sure which dictionary you are planning to use, but example 4 contains the words "a" and "or." So, that raises a few questions: 1. Do you need to skip words smaller than a certain number of characters? 2. If an entry contains more than one word, do you need to extract them both? And, in the correct order? 3. What about an entry like "xyzcattle" which contains "cat" and "cattle" but aren't really two separate words? And one final question: Has anyone stopped to consider how many comparisons will need to be done to match multiple permutations of 100,000 records with a dictionary which would typically have 30,000 entries? I hope you have a very fast computer. (Hint: multiply 30,000 x 100,000 x (average no of permutations of each entry).) Fenton's method of using wildcard searches from the dictionary table may be the most efficient, but could have problems with the situation where there are two or more words embedded into an entry as in my questions 2 and 3 above. Even then, I expect that it would have to do about 30,000 find operations.
shady Posted October 16, 2006 Author Posted October 16, 2006 I'm not sure which dictionary you are planning to use, but example 4 contains the words "a" and "or." So, that raises a few questions: 1. Do you need to skip words smaller than a certain number of characters? 2. If an entry contains more than one word, do you need to extract them both? And, in the correct order? 3. What about an entry like "xyzcattle" which contains "cat" and "cattle" but aren't really two separate words? And one final question: Has anyone stopped to consider how many comparisons will need to be done to match multiple permutations of 100,000 records with a dictionary which would typically have 30,000 entries? I hope you have a very fast computer. (Hint: multiply 30,000 x 100,000 x (average no of permutations of each entry).) Fenton's method of using wildcard searches from the dictionary table may be the most efficient, but could have problems with the situation where there are two or more words embedded into an entry as in my questions 2 and 3 above. Even then, I expect that it would have to do about 30,000 find operations. - Okay, I'll add in more info. - Only care about words 2 characters or more. so yes, my "fbbaorhur" is not a good example. Sorry. - Only need to flag records with any instance of a word that matches dictionary table. Don't need to extract any words or do any logic to them, just need to flag the record as containing a word. - I currently do it manually now so even if it took my Dell notebook an hour, it'll be heaven sent. Plus my dictionary table has been trimmed down to words I'm looking for. So closer to 5,000 words in it.
comment Posted October 16, 2006 Posted October 16, 2006 Perhaps you should do it the other way around, then. Set a global gWord field to the first word in your dictionary. Go to your word table and replace the contents of the Marked field with the calculation: Marked or PatternCount ( word , gWord ) Set gWord to the next word in the dictionary. This takes "only" 5,000 iterations (not counting the Replace... step that has to go thru 100k records on each round).
Recommended Posts
This topic is 6670 days old. Please don't post here. Open a new topic instead.
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now