Web scraping: getting text between two text strings


This topic is 2438 days old. Please don't post here. Open a new topic instead.

On 1/11/2015 at 10:57 AM, LaRetta said:

Reading up on these parsing functions in FM Help will also help, but here is a breakdown of the calculation I provided:


Let ( [
    field = Table::email ;
    begin = "<meta property=\"og:url\" content=\"" ;
    start = Position ( field ; begin ; 1 ; 1 ) + Length ( begin ) ;
    end = Position ( field ; "/>" ; start ; 1 ) - 1
] ;
    Middle ( field ; start ; end - start )
)
  • 'begin' is the opening tag string, from its opening < right up to the point where the text you wish to capture starts.
  • 'start' first uses the Position() function to count characters to where that < begins. If you put just the Position() part in the data viewer, it returns the location of the <, with a result similar to 675, being the 675th character in that field.
  • To find where your text itself starts, we must then add the Length() of the begin string. If the string is 25 characters long, the new starting position is 675 + 25 (700), which is the beginning of your desired text.
  • 'end' finds the first /> at or after the start position and steps back one character, so it might be 800.
  • Middle() then goes to start position 700 and grabs the next 'x' characters. We know 'x' because 800 - 700 means the text in the middle is 100 characters long.
  • So Middle ( email field ; 700 ; 800 - 700 ) returns the text between start and end.
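
For readers who want to check the arithmetic outside FileMaker, here is a minimal Python sketch of the same steps. It is only a mirror of the calculation, not FileMaker code: the function name text_between, the early returns for a missing marker, and the sample HTML are assumptions made for illustration. Positions are kept 1-based, as Position() and Middle() count them.

```python
def text_between(field: str, begin: str, end_marker: str = "/>") -> str:
    """Mirror the Let() calculation using FileMaker-style 1-based positions."""
    # Position ( field ; begin ; 1 ; 1 ): 1-based position of the first match, 0 if absent
    begin_pos = field.find(begin) + 1
    if begin_pos == 0:
        return ""  # guard added by this sketch; not part of the original calculation
    # start = Position (...) + Length ( begin )
    start = begin_pos + len(begin)
    # end = Position ( field ; "/>" ; start ; 1 ) - 1
    end_pos = field.find(end_marker, start - 1) + 1
    if end_pos == 0:
        return ""  # guard added by this sketch
    end = end_pos - 1
    # Middle ( field ; start ; end - start ): take (end - start) characters from start
    return field[start - 1 : start - 1 + (end - start)]


# Made-up sample data, for illustration only
sample = 'xx<meta property="og:url" content="http://example.com/a" />yy'
begin = '<meta property="og:url" content="'
print(text_between(sample, begin))  # prints http://example.com/a"
```

Note that the raw result still ends with the closing quote of the attribute. That is the kind of leftover the cleanup step discussed in this thread is meant to remove.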

While reading Help on these functions, open your data viewer and create just the 'start' variable.  In place of 'field', insert the field value, and in place of 'begin', insert the quoted string.  Then do the same with Middle().  You can watch the results and adjust your calculations accordingly.

 

What can trip you up is escaping in these types of strings.  Quotes within quotes must be preceded with a backslash ( \" ).  And then, if invisible garbage remains in your resultant text, there are various techniques to strip it; one example was my quick-and-dirty method of using LeftWords(), but you might use Substitute() or other text functions.  When stuck in a calculation which won't let you out, copy it and paste it here within code (using the <> icon) and we can help you identify why it breaks.
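
As a rough illustration of both points, the Python sketch below writes the begin string with backslash-escaped quotes and then strips the leftover characters from a raw extraction. It is an analogy under stated assumptions, not FileMaker code: the helper name strip_trailing_quote is hypothetical, and it uses plain string stripping rather than the LeftWords() approach mentioned above.

```python
# Each inner double quote is preceded by a backslash so it does not end the literal early.
begin = "<meta property=\"og:url\" content=\""


def strip_trailing_quote(raw: str) -> str:
    """Remove surrounding whitespace, then any closing double quote left at the end."""
    return raw.strip().rstrip('"')


print(len(begin))  # prints 33
print(strip_trailing_quote('http://example.com/a" '))  # prints http://example.com/a
```

Checking the length of the escaped string is a quick way to confirm that the backslashes were read as escapes and not counted as characters.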

 

And welcome back.

Hi LaRetta,

Following your advice to look at the calculation suggested by Comment, I changed your

Middle ( field ; start ; end - start )

to

Trim ( Substitute ( Middle ( field ; start ; end - start ) ; Char ( 10 ) ; "" ) )

and was able to remove the leading and trailing line breaks from my text (Char ( 10 ) is the line feed character).
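
For comparison, here is a small Python sketch of what that wrapped expression does, assuming the extracted text contains line feeds and spaces. The function name clean is hypothetical and the sample strings are made up.

```python
def clean(raw: str) -> str:
    """Rough Python equivalent of Trim ( Substitute ( raw ; Char ( 10 ) ; "" ) )."""
    # Substitute ( raw ; Char ( 10 ) ; "" ): removes every line feed, wherever it occurs
    no_line_feeds = raw.replace("\n", "")
    # Trim (): removes leading and trailing spaces only
    return no_line_feeds.strip(" ")


print(clean("\n  http://example.com/a \n"))  # prints http://example.com/a
```

One design point worth keeping in mind: the substitution removes line feeds in the middle of the text as well, and it leaves carriage returns (Char ( 13 )) untouched, so a value containing those would need an additional Substitute().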

Thank you and best regards,

Daniel
