Harvest a list of Substitutions

Quito · May 23, 2019

Hi,

I've created a few thousand (or more) substitutions where a code is appended to a phrase, such as:

Substitute ( text ;

["DIABETES MELLITUS, TYPE II";TextStyleAdd("Diabetes mellitus, type II (E119)";Bold+ HighlightYellow)];

["ALZHEIMER'S DISEASE";TextStyleAdd("Alzheimer's Disease (F009)";Bold+ HighlightYellow)];

... and so on.

I'd like to harvest the codes that have been added to the text and list them, first alphabetically, then numerically. Using the previous example the resulting harvested file should look like:

Date | ID | Code 1 | Code 2

2019-05-23 | HGTY87654 | (E119) | (F009)

Currently, I'm harvesting the codes by hand but, as the database grows about a million records every 3 months, it's slowing down FM and the tediousness of it is wearing me out. The records are pulled from an AS400 i9 DB2 using DBeavers' JDBC to populate FM17a.

There's probably a much simpler way to catch the codes. I can imagine it's possible do three (or four) chores in one go (Substitute; TextStyleAdd; Export Code to another field, Remove duplicates if found). I'm open up to any suggestions.

Best regards,

Daniel

comment · May 23, 2019

What does the original text look like? Are there multiple entries in the field, and if so, what separates them?

Quito · May 25, 2019

Thanks Comment,

The original text is UTF-8 Uppercase. I believe the maximum number of characters allowed per field on the AS400/9i is 82.

In the screenshot, the first field holds the initial text. The text can be substituted by one or many substitution filters. The last one should harvest the code between parenthesis and send it to a plain text field.

Indeed, it is possible to have multiple substitutions in a single field. Please see the second screenshot.

All the very best,

Daniel

comment · May 25, 2019

It seems like you are in a very inconvenient situation. Performing "a few thousand (or more) substitutions" is certainly not the optimal way to process data. I would look for some other way to determine the presence of the symptoms you're looking for. But this could work only if the input follows some rules, and you know what these rules are.

For example, in your second sample you could separate the given symptoms to individual lines by substituting the "+" with a carriage return. Then you could lookup the codes from a table, instead of hard-coding them in a calculation formula. Or perhaps you could use a scripted process to create a related record for each symptom.

Question:
Do you actually need the substituted text, with the codes inserted? Or is it only an intermediate step to getting the codes?

Quito · June 13, 2019

On 5/25/2019 at 6:25 PM, comment said:

It seems like you are in a very inconvenient situation. Performing "a few thousand (or more) substitutions" is certainly not the optimal way to process data. I would look for some other way to determine the presence of the symptoms you're looking for. But this could work only if the input follows some rules, and you know what these rules are.

For example, in your second sample you could separate the given symptoms to individual lines by substituting the "+" with a carriage return. Then you could lookup the codes from a table, instead of hard-coding them in a calculation formula. Or perhaps you could use a scripted process to create a related record for each symptom.

Question:
Do you actually need the substituted text, with the codes inserted? Or is it only an intermediate step to getting the codes?

Hi Comment,

Sorry for the delay in responding. I agree, the software does what it's intended to do, yet it'll reach a performance hit real soon. I must find a better way to process the data. The input has no rules, so I'm creating rules on the fly (catching phrases, misspellings, and key words).

The example you provided sounds viable for my task. Could you please elaborate a bit further? I know how to perform Lookups in Excel and Google Spreadsheets, but I'm a bit lost on how to set it up in FileMaker. I've seen a bit in the FMForum but it still surpasses me.

Regarding your question: the tool also performs as a visual checker, as it's necessary to determine if a specific symptom belongs to the patient (or is otherwise being pulled from the family history), if its current (or a past reason for the patient approaching the hospital), or if it's a greedy code error. Your question rings true, as I should harvest the codes only from a few specific fields but the problem is that the needed information is dispersed among all fields and large timestamp ranges.

comment · June 13, 2019

Have look at the attached demo. It does two things:

First, it looks for the occurrence of known patterns in the given text. When any of the known patterns is found, the corresponding code is retrieved. This part is accomplished using a looping script. I could have used a custom function for this, but I am not sure how well this would perform with an unstored calculation going over a large list of patterns at every screen redraw..

The second step is highlighting the found patterns. This is done in a calculation field, so that the original is preserved. Here I am using a pair of custom functions (could be just one, but I also wanted to preserve the original case, so I did not use Substitute()).

Most importantly, the file separates data and logic: the patterns and the corresponding codes are kept in a table and both the script and the calculation access them there, with no hard-coding of data in calculation formulas..

MultiPatternCount+Hilite.fmp12

Quito · July 4, 2019

Hi Comment,

I tried your demo and ran into the following problem: Codes are not unique (as I have dozens of synonyms for each code) but patterns are. So, in your Patterns table I removed Unique from Code and placed Unique on Pattern.

Nonetheless, when I click on the Get Codes button, it recovers any pattern that coincides with the first word. Please see the attached screenshot.

Best regards,

Daniel

comment · July 4, 2019

My method tests each pattern individually and in isolation. Since the text contains the pattern "Alpha", the corresponding code 101 is retrieved. If you want to test if the found occurrence is within an occurrence of another, longer pattern, that's going to be MUCH more complicated.

comment · July 5, 2019

I've given this some more thought.

If you sort the patterns by their length and work from the longest to the shortest, and work on a copy of the original text in a variable, you can substitute each found pattern with some arbitrary string in order to prevent it from being found again using a shorter pattern. The arbitrary string is required to prevent accidental joining of adjacent words into a new pattern.

Alternatively, you could go over the original text word-by-word and see if there are any patterns that begin with this word. If yes, check if there are any patterns that begin with this word and the following one - and so on, until you get a negative answer in which case you go one step back. But this would be a lot of work to implement in Filemaker.

Quito · July 5, 2019

Yes, I was thinking of something similar. I'm worried it might be overkill, though.

Currently, with my massive substitution list there is a 2 second delay when going from one full record to the next. I added tab controls to sort the two dozen or so portal fields and discovered that if I select an "empty" tab control, delay is reduced to zero. I added visual indicators to know beforehand which tab control is empty when I need to see portions of the full records in "Quick mode". This will give me some more time to think it through.

Thanks Comment, I'll look further into your solution over the next couple of months and post again afterwards.

Best regards,

Daniel

comment · July 5, 2019

4 hours ago, Quito said:

there is a 2 second delay when going from one full record to the next.

IIUC, your records are imported and not modified further. If so, eliminate the unstored calculation and populate the field using a script immediately after import.

Quito · September 10, 2021

On 7/5/2019 at 9:22 AM, comment said:

I've given this some more thought.

If you sort the patterns by their length and work from the longest to the shortest, and work on a copy of the original text in a variable, you can substitute each found pattern with some arbitrary string in order to prevent it from being found again using a shorter pattern. The arbitrary string is required to prevent accidental joining of adjacent words into a new pattern.

Alternatively, you could go over the original text word-by-word and see if there are any patterns that begin with this word. If yes, check if there are any patterns that begin with this word and the following one - and so on, until you get a negative answer in which case you go one step back. But this would be a lot of work to implement in Filemaker.

Hi Comment,

So finally I was able to study your reply and replace my "8000 substitutions + CF" with your solution. It takes a little longer to process, as it needs to populate the code list but I'm fine with that as the end result is much much better now. Although the patterns are sorted in descending order, I'm not too sure that they are being processed that way. Maybe the sort needs to be scripted as well?

I tried placing quotation marks to each pattern in hopes of forcing "Match phrase from word start" but it didn't work.

Could you please elaborate a bit further on your solution "substitute each found pattern with some arbitrary string in order to prevent it from being found again using a shorter pattern"? It seems a plausible way of forcing "Match whole word" to MultiPatternCount+Hilite.fmp12. Does this mean that the arbitrary string in the "copy of the original text in a variable" would become a jumble of characters or repeated words? It sounds very interesting as the proof of correct alignment would be the codes themselves. Cool!

Maybe I can make it recursive. Something like "If first letter of pattern is Bold+Hilite, skip"

Thank you. I really appreciate it!

Best regards,

Daniel

Edited September 10, 2021 by Quito

comment · September 10, 2021

Wow, how long has it been? More than two years. I am afraid I am having trouble recalling what all this was about.

IIUC, you are asking about nested substitutions performed by the Substitute() function. The file I posted here does not use the Substitute() function, hence my confusion. AFAICT, my remarks were concerned with your method, not mine.

So, assuming we are indeed talking about nested substitutions, let's point out two potential problems, with some very arbitrary examples:

1. A searchString is contained in another searchString

input: you are young ;
required replacements: you ==> we ; young ==> old ;
expected result: we are old

Substitute ( input ; 
[ "you" ; "we" ] ;
[ "young" ; "old" ] 
)

returns: we are weng

Here the solution is to substitute the longer searchString first:

Substitute ( input ; 
[ "young" ; "old" ] ; 
[ "you" ; "we" ] 
)

returns the expected result of we are old

2. A searchString is contained in the replaceString of another substitution pair

input: John is red ;
required replacements: John ==> Fred ; red ==> blue ;
expected result: Fred is blue

Substitute ( input ; 
[ "John" ; "Fred" ] ; 
[ "red" ; "blue" ] 
)

returns: Fblue is blue

This is the problem I suggested solving by using intermediary arbitrary strings to avoid the collision:

Substitute ( input ; 
[ "John" ; "pattern001" ] ; 
[ "red" ; "pattern002" ] ;
[ "pattern001" ; "Fred" ] ; 
[ "pattern002" ; "blue" ] 
)

returns the expected result of Fred is blue

The strings must be such that cannot appear in the input nor be a part of another intermediate string, searchString or replaceString. This is easily accomplished by using characters that cannot appear in the input as wrappers - for example:

Substitute ( input ; 
[ "John" ; "«John»" ] ; 
[ "red" ; "«red»" ] ;
[ "«John»" ; "Fred" ] ; 
[ "«red»" ; "blue" ] 
)

which is also easier to follow.

Edited September 11, 2021 by comment

Quito · September 10, 2021

That's brilliant! And very useful for other projects I have that are still using Substitute. Thanks!

Will start another post to continue the train of thought of MultiPatternCount+Hilite.fmp12

Best regards,

Daniel

Sign In

Harvest a list of Substitutions

Recommended Posts

Quito

comment

Quito

comment

Quito

comment

Quito

comment

comment

Quito

comment

Quito

comment

Quito

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information