CF - to get a list of all the text items...

Simon K · November 25, 2007

Hi,

I have a record that contains a field of "body" text. I need a custom function that will extract all the text items enclosed in square brackets into a separate list, e.g.:

Body_text = "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling."

The result would be a paragraph-delimited list containing:

"s1t1"¶"s1t2"¶"s2y1"

Does anybody know of such a function (or a similar one), or have any pointers on how to create such a CF?

Many thanks

Edited November 25, 2007 by Guest

Agnès · November 25, 2007

Hello,

You can do this with a recursive function or with CustomList ( ) (http://www.briandunning.com/cf/747), using this calculation:

Let ( [
    $BodyText = "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling"
] ;
    CustomList ( 1 ; PatternCount ( $BodyText ; "[" ) ;
        "Let ( [
            PosL = Position ( $BodyText ; \"[\" ; 1 ; [n] ) + 1 ;
            PosR = Position ( $BodyText ; \"]\" ; PosL ; 1 )
        ] ;
            Middle ( $BodyText ; PosL ; PosR - PosL )
        )"
    )
)
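For readers without FileMaker at hand, the idea behind this calculation can be sketched in Python. This is a model of the logic only, not FileMaker code, and `nth_position` and `extract_by_position` are hypothetical names introduced for illustration. For each n from 1 to the number of "[", it finds the n-th "[", then the first "]" after it, and takes the text in between:

```python
def nth_position(text, needle, n):
    """Return the 0-based index of the n-th occurrence of needle, or -1."""
    index = -1
    for _ in range(n):
        index = text.find(needle, index + 1)
        if index == -1:
            return -1
    return index


def extract_by_position(body_text):
    """Collect the text between each "[" and the next "]" after it."""
    items = []
    for n in range(1, body_text.count("[") + 1):
        pos_l = nth_position(body_text, "[", n) + 1   # first character after "["
        pos_r = body_text.find("]", pos_l)            # next "]" from there
        if pos_r == -1:
            break  # assumption: stop quietly when a closing bracket is missing
        items.append(body_text[pos_l:pos_r])
    return items


sample = "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling."
print("\n".join(extract_by_position(sample)))
```

The loop runs once per "[", so the amount of work depends on the number of bracketed items rather than on the number of words.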

Agnès

Hmm... or finally, without a CF, perhaps this calculation:

Let ( [
    BodyText = Substitute ( "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling" ; [ " " ; ¶ ] ) ;
    Result = FilterValues ( BodyText ; Substitute ( ¶ & BodyText & ¶ ; [ ¶ ; "¶##" ] ; [ "¶##[" ; "¶[" ] ) )
] ;
    Substitute ( Result ; [ "[" ; "" ] ; [ "]" ; "" ] )
)
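As a rough Python model of this second calculation (not FileMaker code; `extract_by_words` is a hypothetical name): the text is split into words at spaces, only the words that begin with "[" survive the filter, and the brackets are then stripped. The sketch assumes that no ordinary word in the text already starts with "##".

```python
def extract_by_words(body_text):
    """Keep the space-separated words that start with "[", minus brackets."""
    words = body_text.split(" ")
    kept = [word for word in words if word.startswith("[")]
    return [word.replace("[", "").replace("]", "") for word in kept]


sample = "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling"
print(extract_by_words(sample))

# Because the split happens at spaces, an item containing a space is cut short:
print(extract_by_words("see [a b] now"))
```

In this model, anything glued to a bracketed word (for example a trailing comma) stays attached to the extracted item, so the approach suits text where each item stands alone as a word.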

Edited November 25, 2007 by Guest
.....

comment · November 25, 2007

Or:


Let (
    temp = Evaluate (
        Substitute (
            Quote ( "]" & text & "[" ) ;
            [ "]" ; Quote ( " & /*" ) ] ;
            [ "[" ; Quote ( " */" ) & "¶" ]
        )
    ) ;
    MiddleValues ( temp ; 2 ; ValueCount ( temp ) - 1 )
)

Adapted from here:

http://www.fmforums.com/forum/showtopic.php?tid/149507/post/149690/#149690
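To see why this works, it can help to look at the expression string that Evaluate receives: everything outside the brackets ends up inside /* ... */ comments, and only the bracketed items remain as quoted literals, each preceded by a ¶. The Python sketch below builds that string and then imitates the evaluation very roughly. It is a model, not FileMaker code; `build_expression`, `evaluate_model` and `extract_items` are hypothetical names, and it assumes the text contains no double quotes or backslashes and that the brackets are properly paired.

```python
import re


def build_expression(text):
    """Build the expression string that Evaluate would receive."""
    # Assumption: no double quotes or backslashes in text, so the escaping
    # normally done by Quote() can be ignored here.
    expression = '"' + "]" + text + "[" + '"'
    # Each "]" closes the current literal and opens a comment.
    expression = expression.replace("]", '" & /*"')
    # Each "[" closes the comment and starts a new literal beginning with ¶.
    expression = expression.replace("[", '" */"¶')
    return expression


def evaluate_model(expression):
    """Rough stand-in for Evaluate: drop comments, join the string literals."""
    without_comments = re.sub(r"/\*.*?\*/", "", expression, flags=re.DOTALL)
    return "".join(re.findall(r'"([^"]*)"', without_comments))


def extract_items(text):
    temp = evaluate_model(build_expression(text))
    # temp starts and ends with ¶, so drop the empty first and last pieces.
    return temp.split("¶")[1:-1]


print(build_expression("a [x] b"))
print(extract_items("Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling."))
```

For the short input `a [x] b`, the built expression is `"" & /*"a " */"¶x" & /*" b" */"¶"`, which reduces to `"" & "¶x" & "¶"` once the comments are ignored.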

Agnès · November 25, 2007

Yep, I like the */ trick. Thanks a lot for the calculation and the link.

(I took the liberty of editing mine to remove the final "".)

Agnès

Hmm... I like Evaluate, Substitute and... tests...

I tested the three calculations (Intel Mac, Tiger; timing against the number of "["):

If the text has more than 4524 "[", the calculation with Evaluate does not return a result (with 4523, it works, in 4 seconds).

The calculation with FilterValues takes 24 seconds to return a result!!! (The text contains 18000 words; perhaps that is the reason for this timing.)

And finally, CustomList (): 5000 "[" in 5 seconds, and it works up to 18700 "[", because the calculation iterates over the number of "[" and not the word count.

I hope I am not annoying you with my tests; I am wary of the limits of Evaluate.

OK, I know, 4525 "[" is a lot.

Agnès

Edited November 25, 2007 by Guest
tests......

comment · November 25, 2007

No, it's interesting, but you are missing a result for a recursive function. I am aware of some of the limits - I believe Substitute() is the limiting factor.

You might find this interesting too:

http://www.fmforums.com/forum/showtopic.php?tid/187248/post/253049/#253049

Agnès · November 25, 2007

"No it's interesting, but you are missing a result for a recursive function. I am aware of some of the limits - I believe Substitute() is the limiting factor."

I am not sure; Substitute is really fast, even when substituting 50000 values...

I changed the calculation with FilterValues:

Let ( [
    BodyText = Substitute ( "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling" ; [ ¶ ; " " ] ; [ "[" ; "##¶" ] ; [ "]" ; "¶##" ] ) ;
    sub = Substitute ( BodyText ; "##" ; "" )
] ;
    FilterValues ( BodyText ; sub )
)

With 5000 "[" and 18000 words, the result now comes back in 5 seconds... Perhaps FilterValues doesn't like ## and other symbols, or too many values (?)
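The revised calculation can be modelled in Python as follows (a sketch of the logic only, with "\n" standing in for ¶; `extract_by_markers` is a hypothetical name). Each bracketed item is put on a line of its own, every line of surrounding text picks up a "##" marker, and the filter keeps only the lines that are unchanged once the markers are removed. The sketch assumes the text does not already contain "##".

```python
def extract_by_markers(text):
    """Keep the lines that look the same with and without the ## markers."""
    # Returns become spaces; "[" and "]" become line breaks with a ## marker
    # on the side facing the surrounding text.
    marked = text.replace("\n", " ").replace("[", "##\n").replace("]", "\n##")
    unmarked = marked.replace("##", "")
    allowed = set(unmarked.split("\n"))
    return [value for value in marked.split("\n") if value in allowed]


sample = "Cats and dogs and [s1t1] make [s1t2] pets though they need [s2y1] handling."
print(extract_by_markers(sample))

# Edge case in this model: with no brackets there are no markers at all,
# so the whole text passes the filter unchanged.
print(extract_by_markers("no brackets here"))
```

Unlike the word-based version, this one does not split at spaces, so in this model an item such as "[a b]" comes out whole.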

No, I didn't test a recursive function; I am sure it would take more time.

Thanks for the link to the timing tests. I am going to test my CF "SwitchValues()" now.

comment · November 25, 2007

Well, then I don't know - maybe it's the combination of Evaluate ( Substitute ( ... )). I'm not talking about speed, but about a limit on the number of substitutions it will accept.

Simon K · November 25, 2007

Hi guys - thanks for the responses - now breaking them down to understand how they work (a bit mystifying at the moment, but I will get there).

In the meantime I have noticed that (using Michael's formula) if one square bracket of a pair is missing, then the calc doesn't return any values (even if there are other valid strings to extract)?

Anyway thanks again

Agnès · November 26, 2007

"(even if there are other valid strings to extract)?"

Because Evaluate hits an error and cannot return a result (if you put the calculation in the Data Viewer and remove just the Evaluate, you will see it).

For all the calculations, you can test whether the number of "[" equals the number of "]" and, if they are not equal, stop the calculation and report the error.
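A minimal Python sketch of such a guard follows (not FileMaker code; `bracket_problem` is a hypothetical name). The first test is the count comparison suggested above. The second test, which checks that the brackets also appear in the right order and are not nested, goes beyond what was suggested in the thread and is included as an assumption, because equal counts alone do not prove that the brackets pair up (for example "]x[").

```python
def bracket_problem(text):
    """Return a short error message, or None if the brackets pair up."""
    # Test suggested in the thread: the two counts must match.
    if text.count("[") != text.count("]"):
        return "unequal number of [ and ]"
    # Extra check (an assumption, not from the thread): order and nesting.
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if depth < 0 or depth > 1:
            return "brackets out of order or nested"
    return None


print(bracket_problem("a [x] b [y]"))
print(bracket_problem("a [x b [y] c"))
print(bracket_problem("a ]x[ b"))
```

Running the guard before any of the extraction calculations lets the caller report the problem instead of getting no result at all.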

"I'm not talking about speed, but about a limit on the number of substitutions it will accept."

OK. I don't know either.

Simon K · November 26, 2007

Hi Michael, Agnes,

I always feel as though I have to really "understand" whatever I put into my application, and now I do - it took a while, but the penny finally dropped, so thank you.

Just as some feedback, and in case other people need to know some of the differences, I found the following:

The calc from Agnes feels simpler and works well when you are in control of/aware of all of the characters in the body_text.

In my particular case I am working with some large HTML text blocks, and I have found that they include some "nasty" non-printing characters that I couldn't quite get my hands on.

So in this case the pure "completeness" of Michael's calc works far more efficiently. I suspect it will always require different opening and closing delimiters - which, again, in my case is no problem because we are also in charge of the originating HTML.

Many thanks

Simon

Simon K · November 27, 2007

I don't know whether I should start another thread - just in case anybody looks...

Is there a way of reducing this final list to just the unique occurrences of each value extracted from the main text?

thanks

S

Edited November 27, 2007 by Guest

comment · November 28, 2007

I think it would be better to move to a recursive function. I don't have time to do it now, but roughly the function would look something like:

ExtractUnique ( text ; startCode ; endCode ; result )

and it would look for the first occurrence of startCode in text. If found, extract the string between it and the first occurrence of endCode in text. If the string is not already in result, append it to the result.

Then call itself again with the rest of the text and the new result. If not found, return the result.
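The outline above could be sketched as follows in Python (a model of the recursion only, not a finished FileMaker custom function; the parameter names mirror the outline, and treating a missing endCode the same as "not found" is an assumption):

```python
def extract_unique(text, start_code, end_code, result=()):
    """Recursively collect the unique strings found between the two codes."""
    start = text.find(start_code)
    if start == -1:
        return list(result)              # no more start codes: done
    begin = start + len(start_code)
    end = text.find(end_code, begin)
    if end == -1:
        return list(result)              # assumption: unmatched start code ends the search
    item = text[begin:end]
    if item not in result:
        result = (*result, item)         # append only if not already present
    rest = text[end + len(end_code):]
    return extract_unique(rest, start_code, end_code, result)


body = "Cats [s1t1] and dogs [s1t2] and cats [s1t1] again [s2y1]."
print("\n".join(extract_unique(body, "[", "]")))
```

Each call handles one bracketed item and passes the remaining text along, so the recursion depth equals the number of items; in Python the default recursion limit (1000) would cap that, and very long texts would call for a loop instead.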
