Compare records to dictionary

shady · October 15, 2006

I have a database of 100,000 records. the records vary from random letters to words to words with random characters.

Anyone know if it is possible to compare these records to a list of words from a dictionary and "FIND" all records that have dictionary entries in them?

CoZiMan · October 15, 2006

Yes. Create the two table relationship:

X_Recordtxt >= Dict_Recordtxt

AND X_Recordtxt <= Dict_Recordtxt

Should return the found set of matching entries in a portal set to this relationship.

K?

comment · October 15, 2006

:qwery:

a ≤ b AND a ≥ b

equals:

a = b

LaRetta · October 15, 2006

Shady, you have Advanced. You can use a custom function to create calculation which explodes the field containing the multiple words. Then join this calc field on = to your dictionary word field. If you don't want to use an exploded CF, you can hardcode it thus:

LeftWords ( theField ; 1 ) & ¶ &

MiddleWords ( theField ; 2 ; 1 ) & ¶ &

MiddleWords ( theField ; 3 ; 1 ) & ¶ &

... etc

shady · October 15, 2006

Thanks. It's too early in the morning for me to absorb this. Can you lay this out in easy terms of the logic of each step? I'm not following the syntax at this point. So the possibilities for a record are:

1> "word" (record is a match as is)

2> "abcword" (record has a word in it but also some random characters either before or after word)

3> "ncufubf" (record has no matching words in it)

4> "fbbaorhur" (want to skip any one character matches like "a" and "i")

So I'd like to "find" all records that include 1> and 2>

Thanks!

LaRetta · October 15, 2006

abcword is a problem because it is not separated by space. It means you will have to explode (and considering you want to exclude single characters), as:

ab

abc

abcw

abcwo

abcwor

abcword

bc

bcw

bcwo

bcwor

bcword

cw

cwo

cwor

cword

wo

wor

word

This requires an exploded permutation calculation and requires a custom function. Search here on Forums for Ender's Exploded Permutations (two custom functions together which handle it). There are variations of the technique around. I'll hunt for some as well (later today if you haven't found any) but I'm just heading out the door.

Fenton · October 16, 2006

Here is a somewhat bloated little file that tests some text against all the words that begin with "a" (9074 -). It uses several Custom Functions (by others), including one that seems like it might work for this, but doesn't.

It is fairly simple to find text that matches any of the whole words. But, as LaRetta says, finding "partials" requires a CF producing every permutation of the text. Which is long.

It could alternatively be done by a script, which passes the word to the other table, and loops through all the records. But with my tables it was much slower. It could also be done with a Find using wildcards; which turns out to be about 75% of the speed of the double-loop script above.

WordMatch.fp7.zip

shady · October 16, 2006

Thanks. This looks like it may be tougher than I thought. I wish there was some kind of function that could do a simple compare each record to the dictionary table and mark each record that has a match or partial match.

Doing permutations on each record makes sense but i was hoping for a macro or ready made script to do it.

I'll look through the section you mentioned and see if I can find something close.

shady · October 16, 2006

I'm using 5.5 and couldn't open your attachment. can you save backwards to 5.5?

I tried to change my profile but can't find the setting to change it.

Ender · October 16, 2006

I'm using 5.5 and couldn't open your attachment.

That's not what you indicated in your first post. It's kind of hard to help if you change the requirements.

can you save backwards to 5.5

No, .fp7 files are not backward compatible with prior versions, and recursive custom functions are not available in FM6 and below.

Fenton · October 16, 2006

You could solve it by using a modified* version of my Partials wFind script in my file. It doesn't use a CF either; though it's kind of slow.

*Mine uses a script Variable, but you could just use a global field instead. As Ender says, you said 8 Advanced, so I used the modern method.

You could download the trial version of 8.5. But I'll paste the script here. I think it'll work the same.

You could ignore my "View As" steps, but they make it faster.

Words: table with the dictionary words, in 1 field, 1 record per word

MyWords: table with your text field to test, 1 field, 1 record each

#Mark Partials wFind script

Freeze Window

Go to Layout [ “Words” (Words_) ]

Show All Records

Go to Record/Request/Page [ First ]

View As [ View as Form ]

Loop

Set Field [ MyWords:_gWord; Value:Words_::Word_ ]

#

Go to Layout [ “MyWords” (MyWords) ]

Enter Find Mode [ ]

Set Field [ MyWords::Text_; "*" & MyWords::_gWord & "*" ]

Set Error Capture [ On ]

Perform Find [ ]

If [ Get (LastError) = 0 ]

Replace Field Contents [ MyWords::MarkPartial; Replace with calculation: 1 ] [ No dialog ]

End If

#

Go to Layout [ “Words” (Words_) ]

Go to Record/Request/Page [ Next; Exit after last ]

End Loop

Go to Record/Request/Page [ First ]

View As [ View as List ]

#

Go to Layout [ “MyWords” (MyWords) ]

Perform Find [ Specified Find Requests: Find Records; Criteria: MyWords::MarkPartial: “1” ] [ Restore ]

View As [ View as List ]

Fenton · October 16, 2006

P.S. I neglected to put that first:

Go To Record/Request [First]

in my file, in either of those Partial Loop scripts. It should be there. I didn't notice, 'cause I always began on the 1st. But that's just negligent [but not criminally so, I hope -]

P.S.2 Change the ";" to "," for FileMaker 5/6, unless you're European

BobWeaver · October 16, 2006

...So the possibilities for a record are:
1> "word" (record is a match as is)

2> "abcword" (record has a word in it but also some random characters either before or after word)

3> "ncufubf" (record has no matching words in it)

4> "fbbaorhur" (want to skip any one character matches like "a" and "i")

I'm not sure which dictionary you are planning to use, but example 4 contains the words "a" and "or." So, that raises a few questions:

1. Do you need to skip words smaller than a certain number of characters?

2. If an entry contains more than one word, do you need to extract them both? And, in the correct order?

3. What about an entry like "xyzcattle" which contains "cat" and "cattle" but aren't really two separate words?

And one final question:

Has anyone stopped to consider how many comparisons will need to be done to match multiple permutations of 100,000 records with a dictionary which would typically have 30,000 entries? I hope you have a very fast computer. (Hint: multiply 30,000 x 100,000 x (average no of permutations of each entry).)

Fenton's method of using wildcard searches from the dictionary table may be the most efficient, but could have problems with the situation where there are two or more words embedded into an entry as in my questions 2 and 3 above. Even then, I expect that it would have to do about 30,000 find operations.

shady · October 16, 2006

I'm not sure which dictionary you are planning to use, but example 4 contains the words "a" and "or." So, that raises a few questions:

1. Do you need to skip words smaller than a certain number of characters?

2. If an entry contains more than one word, do you need to extract them both? And, in the correct order?

3. What about an entry like "xyzcattle" which contains "cat" and "cattle" but aren't really two separate words?

And one final question:

Has anyone stopped to consider how many comparisons will need to be done to match multiple permutations of 100,000 records with a dictionary which would typically have 30,000 entries? I hope you have a very fast computer. (Hint: multiply 30,000 x 100,000 x (average no of permutations of each entry).)

Fenton's method of using wildcard searches from the dictionary table may be the most efficient, but could have problems with the situation where there are two or more words embedded into an entry as in my questions 2 and 3 above. Even then, I expect that it would have to do about 30,000 find operations.

- Okay, I'll add in more info.

- Only care about words 2 characters or more. so yes, my "fbbaorhur" is not a good example. Sorry.

- Only need to flag records with any instance of a word that matches dictionary table. Don't need to extract any words or do any logic to them, just need to flag the record as containing a word.

- I currently do it manually now so even if it took my Dell notebook an hour, it'll be heaven sent. Plus my dictionary table has been trimmed down to words I'm looking for. So closer to 5,000 words in it.

comment · October 16, 2006

Perhaps you should do it the other way around, then. Set a global gWord field to the first word in your dictionary. Go to your word table and replace the contents of the Marked field with the calculation:

Marked or PatternCount ( word , gWord )

Set gWord to the next word in the dictionary.

This takes "only" 5,000 iterations (not counting the Replace... step that has to go thru 100k records on each round).

Sign In

Compare records to dictionary

Recommended Posts

shady

CoZiMan

comment

LaRetta

shady

LaRetta

Fenton

shady

shady

Ender

Fenton

Fenton

BobWeaver

shady

comment

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information