FilterValues functionality with partial matches or wildcards

Hammerton · January 11, 2013

I have built the following calculation to compare free text with a library of key words and count the occurrences of the library words in the free text. This works great, but...

The library lists for some words are stems, like "happi" to capture "happier" and "happiness". The solution below works fine for exact matches between target text words and whole (non-stem) library words. Is there a way I can capture the partial stem matches, either within the framework I have or in some other way? Text samples will run up to 1,000 words, and libraries may have as many as 500 words.

Let([
t = text1 ;
r = search_library::sad ;
adj_t = Substitute ( t ; ", " ; ¶ ) ;
adj_t = Substitute ( t ; "; " ; ¶ ) ;
adj_r = List ( r )
];
ValueCount ( FilterValues ( adj_t ; adj_r ) )
)

bruceR · January 11, 2013

Not an answer to the question but why are there two different calcs for adj_t ?

Right now you only get the last result.

Perhaps you should be using this?

adj_t = Substitute ( t ; [ ", " ; ¶] ; [ "; " ; ¶] );
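In context, the whole calculation would then look roughly like this. It is only the original calculation with the two adj_t lines merged into one; the field names text1 and search_library::sad are taken from the question as posted.

```
Let ( [
  t = text1 ;
  r = search_library::sad ;
  // one Substitute call; the two search/replace pairs are applied in order
  adj_t = Substitute ( t ; [ ", " ; ¶ ] ; [ "; " ; ¶ ] ) ;
  adj_r = List ( r )
] ;
  // count the values in adj_t that also appear as whole values in adj_r
  ValueCount ( FilterValues ( adj_t ; adj_r ) )
)
```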

comment · January 12, 2013

On 1/11/2013 at 10:38 PM, Hammerton said:
The solution below works fine for exact matches between target text words and whole (non-stem) library words. Is there a way I can capture the partial stem matches?

I am not sure I understand your question. I am quite sure you cannot use FilterValues() with wildcards, though sometimes it can be used to pass "values that begin with ..." or "values that end with ...". Another option is to "explode" one or both sides of the comparison, so that "happiness" becomes "happiness¶happines¶happine¶happin¶happi¶happ¶hap", for instance. I think the task here needs to be defined better than by a single example.
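To make the "explode" idea concrete, here is a minimal sketch of a recursive custom function. The name ExplodeWord and its minLength parameter are made up for illustration; nothing like it is built in, so it would have to be defined as a custom function first.

```
// Hypothetical custom function: ExplodeWord ( word ; minLength )
// Returns the word followed by each shorter left-hand prefix,
// one per line, stopping at a prefix of minLength characters.
If ( Length ( word ) <= minLength ;
  word ;
  word & ¶ & ExplodeWord ( Left ( word ; Length ( word ) - 1 ) ; minLength )
)
```

Called as ExplodeWord ( "happiness" ; 3 ), this should produce the list shown above. Passing that result as the first parameter of FilterValues(), with the library list as the second, would then keep only the prefixes that are library entries.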

David Jondreau · January 12, 2013

Sounds like you've got two blocks of text: one is the source, with a bunch of words, and the other is a library with another bunch of words. You want to go through each word in the source and see if it, or a version of it, appears in the library?

I think you're going to need a recursive custom function using the PatternCount() function.

Of course, the English language is super-complex and it'll be hard to do real pattern matching. Happier and happiness both have the happi- stem. But happy does not. happ would be the stem, but that would then catch happenings and happenstance.
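As a rough sketch of what such a recursive custom function could look like: the name CountStemMatches and both parameter names are hypothetical, and it assumes both lists are return-separated and that library entries are bare stems with no trailing wildcard character.

```
// Hypothetical custom function: CountStemMatches ( wordList ; stemList )
// wordList: return-separated words from the free text
// stemList: return-separated library entries, each treated as a prefix
Let ( [
  stem = GetValue ( stemList ; 1 ) ;
  rest = RightValues ( stemList ; ValueCount ( stemList ) - 1 ) ;
  // the leading ¶ anchors the stem to the start of a word
  hits = If ( IsEmpty ( stem ) ; 0 ; PatternCount ( ¶ & wordList ; ¶ & stem ) )
] ;
  // add the hits for this stem to the hits for the remaining stems
  hits + If ( ValueCount ( rest ) > 0 ; CountStemMatches ( wordList ; rest ) ; 0 )
)
```

It would be called with the word list and the library list, for example CountStemMatches ( adj_t ; adj_r ) inside the original Let. Two caveats with this approach: PatternCount() is not case-sensitive, and a word that begins with more than one library entry (say both "happ" and "happi") is counted once for each.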

comment · January 12, 2013

It seems this is more complex than one would think at first sight:

http://en.wikipedia.org/wiki/Stemming

Hammerton · January 14, 2013

Thanks to you all.

Bruce - I was unaware that the adj_t calculation would be performed only once. I assumed that the first one would deal with "," delimited text and the second would deal with ";" semi-colon delimited text. That is not a big problem. We are comfortable in our abilities to convert the target text into a list of words.

To the anonymous commentator - Not sure how to describe the task more clearly. I have text samples that I am converting into a long list of words, CR-separated, in a single FM text field. I wish to check the occurrence of each word in this target text list against a standardized library of words. The current library contains about 4500 words, which make up about 80 different libraries. About 75% of the words are entered as stems with a wildcard, like accept* for accepted, accepting, and acceptable, or happi* for happier, happiest, and happiness (happy appears on the list as its own entry). I can solve the problem with brute force by simply adding all the words that are currently being captured by a single stem with the wildcard, but that's like 3500 stems that need to be addressed. An automated solution would be preferable. Thanks for the link too.

David - I was thinking along those lines as well. I do not have a lot of experience with writing custom functions, and even less with working with lists. As I note above, I have total confidence that my libraries are constructed to avoid omission of matches; that is, the list contains happy as its own word and happi* to cover the others.

If you happen to have an example of a function that reads, evaluates, and operates on list items, I am pretty sure I can use PatternCount to do the job.

Thanks to all.

bruceR · January 14, 2013

Yes, the first instance of adj_t would do what you want. Briefly.

But since the second instance makes no reference to the first instance, it "re-declares" adj_t completely.

To get the results you intended you need to use the calculation I suggested.

The multi-operation Substitute method is something you should learn in any case.

If you are going to use your sequential method, you would need to properly refer to adj_t from the first operation in the second operation:

adj_t = Substitute ( t ; ", " ; ¶ ) ;

adj_t = Substitute ( adj_t ; "; " ; ¶ ) ;

Hammerton · January 14, 2013

Ah. I see. Thank you.

comment · January 16, 2013

On 1/14/2013 at 12:14 AM, Hammerton said:

Thanks to you all.

Not sure how to describe the task more clearly. I have text samples that I am converting into a long list of words, CR-separated, in a single FM text field. I wish to check the occurrence of each word in this target text list against a standardized library of words.

You need to refine the definition of a match. If the word "happier" is supposed to match the library entry "happi" AND you are sure there cannot be another library entry of "happ", then exploding "happier" as I explained above should do the job, I think?

---

BTW, I am no more anonymous than you or any other member; "comment" is a screen name, as is "Hammerton". :tongue:

Edited January 16, 2013 by comment
