FMForums.com

Web scraping: getting text between two text strings


http://www.episcopalchurch.org/parish/all-saints-episcopal-church-duncan-ok

http://www.episcopalchurch.org/parish/all-saints-episcopal-church-briarcliff-manor-ny

http://www.episcopalchurch.org/parish/all-saints-episcopal-church-greensboro-nc

 

I have included 3 sample URLs containing the information I am trying to scrape.

 

I would like to open the URLs within FileMaker's web viewer and scrape them for this information.

I have included 3 samples because they vary. I would simply be happy to get the text between

the title of the church and "see map: Google Maps".

 

Basically there are a lot of variables. Sometimes there are paragraphs of text in the middle of where I would like to select.

 

I am interested mostly in getting these fields populated if the web page has them

 

Name of Church

Address

City

State

zip

Clergy

Website

Email

Phone

Facebook

Twitter

 

But I would be very happy to get the text scraped from the title to "see map: Google Maps";

then it would be a matter of parsing the individual fields.

Thanks for the help.
I included a graphic of what I want to copy on a sample page.

 

 


Here's something you could use as your starting point:

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "content" ) ;
prefix = "<span class=\"locality\">" ;
suffix = "</span>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Middle ( text ; start ; end - start )
)

This extracts the City part of the address.

 

You need to examine the page source in order to find "anchors" for each data item you want to extract.
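For anyone wanting the same technique outside FileMaker, here is the Position/Middle idea sketched in Python (the function name `extract_between` is my own, for illustration only; the span class comes from the pages linked above):

```python
def extract_between(text, prefix, suffix):
    """Return the text between the first prefix anchor and the next suffix anchor."""
    start = text.find(prefix)
    if start == -1:
        return ""          # prefix anchor not found
    start += len(prefix)
    end = text.find(suffix, start)
    if end == -1:
        return ""          # suffix anchor not found
    return text[start:end]

sample = '<span class="locality">Duncan</span>'
print(extract_between(sample, '<span class="locality">', '</span>'))  # Duncan
```

The FileMaker calculation works the same way: Position() locates the two anchors and Middle() cuts out the span between them.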

Edited by comment

  • Author

I'm Lost

When I try the formula you gave above, I name my web viewer, but the prefix is highlighted and I get a "file not found" error.

I need to make a trial file and see what I can do. But I haven't been able to begin.

I will keep trying to work with your formula

Thanks

Hi hownow,

 

I'm Lost

When I try the formula you gave above I name my webviewer but the Prefix is highlighted and I get the error file not found.

I need to make a trial file and see what I can do. But I haven't been able to begin.

I will keep trying to work with your formula

Thanks

 

You have discovered the Self taught method of learning. :)

 

Keep in mind, you can always post a file that you are playing with so we can see first hand what you are doing.

 

This really helps when we discuss your question using your layouts, field names, scripts, relationship graphs, etc.

 

Good luck,

 

Lee

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "source" ) 
 

Simple copy/paste issue ... add a semi-colon at the end of this line.  

... and corrected 'past' to 'paste' on mine, LOL.

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "source" ) 
 

Another issue: substitute "source" with "content".

Remember to give a name to your Web Viewer (  this example works if the web viewer name is "yourWebViewer" ).

That calculation must be unstored.

 

You could by-pass the web viewer approach using the script step: Insert From URL

Arrgh, I need new glasses ... Thanks, LaRetta and Daniele.

  • Author

I tried it and made a file to experiment which I am enclosing (parsing and scraping)

I tried to just get it to enter the city with the script "copy and parse", but it doesn't send anything to the field "City".

 

That is all I have, and it doesn't seem to work.

Thanks for starting me and all the corrections

 

parsing and scraping.fmp12.zip

Change the script calculation with:

 

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix = "<span class=\"locality\">" ;
suffix = "</span>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Middle ( text ; start ; end - start )
)

  • Author

Thank you

I got the State from that. I just tried to do a <div class> for the address, but it pastes it with a big space before it, so I am doing something wrong...

I enclosed it with that modification. If you do the script you can see what it does.

 

I can't seem to get the other areas. I am sure it is somehow the same principle, but I don't know what I am missing.

I sort of have the Name (but somehow it adds a lot more stuff than I need and I don't know why),

the Address (a lot of white space),

City is fine,

State is fine,

Country is fine, so

Here is the amended file.

Please help me figure out what I am missing. I get the principles of the example, but I can't figure out the others. I need help with those.

thanks

4parsing and scraping #4.fmp12.zip

Which is the name that you want to extract from this HTML part:

 

<title>All Saints&#039; Episcopal Church, Greensboro, NC | Episcopal Church</title>


 

To get the address change the calculation to:

 

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix = "<div class=\"street-address\">" ;
suffix = "</div>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )
)

  • Author

Hi and Thanks so much

I just want the name of the church. The problem is that sometimes there is an apostrophe ('), which takes the form of &#039;

But it is not always there.

I have the rest of the address so just the church name is important in the Name field

 

I am placing the source code here to help discuss the other fields.

The Email (if there is one), the Clergy, and the Website fields are the most important, so I will include the two code snippets here. When I try them they don't work.

 

for the Website the code around it is:

</div>
<div class="field field-type-text field-field-website">
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Website:&nbsp;</div>
<a href="http://www.allsts.org/">http://www.allsts.org/</a> </div>
</div>

 

 

for the Email it is

 

<div class="field field-type-text field-field-email">
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Email:&nbsp;</div>
<a href="mailto:[email protected]">[email protected]</a> </div>
</div>

 

 

For the CLERGY it is

 

<div class="field-label-inline-first">
Clergy:&nbsp;</div>
The Rev. Kurt Wiesner </div>
</div>
</div>

 

 

I don't understand any of these when I try them.

If I get these done I will be able to apply to others

I thought the only way to web scrape was to get the text between two areas and then parse it in another field. But this way is so much more effective if I could just get it!

 

So thanks for answering and all your help.

The problem is that sometimes there is an apostrophe ('), which takes the form of &#039;

But it is not always there.

 

 

 

Actually, it takes the form of &#039; and you can use the Substitute() function to replace it with an actual apostrophe (along with other HTML entities that you may find).
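For the curious, in Python the standard library's `html.unescape` decodes all HTML entities at once; the chained-Substitute comment below is a sketch of the FileMaker equivalent, not code from this thread:

```python
import html

raw = "All Saints&#039; Episcopal Church"
print(html.unescape(raw))  # All Saints' Episcopal Church

# A FileMaker sketch chaining replacement pairs in one Substitute():
# Substitute ( text ; [ "&#039;" ; "'" ] ; [ "&amp;" ; "&" ] ; [ "&nbsp;" ; " " ] )
```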

 

 

 

I am not sure why you're having problems with the other items. For example, for CLERGY you could use:

prefix = "Clergy:&nbsp;</div>" ;
suffix = "</div>" ;
  • Author

I am having problems with those others. I applied the prefix = "Clergy:&nbsp;</div>" ; suffix = "</div>" ; that you gave me, plus the trim function

Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )

and that worked.

 

If the code for any field is not there (missing), how is that handled? Is there any way to instruct FileMaker not to capture anything if nothing is there? (Sometimes there is no clergy or website or email; sometimes there is.)

  • Author

I am enclosing the amended file to show what I can get and what I can't get

I cannot figure out Email and Web address 

Those are the most important information.

 

I also don't know what to do to avoid bringing pure code into the field if that item is not present on the website.

You can see it if you try some of the records. Some sites do not contain websites or clergy or email; they seem to have all the other fields. I need to address that.

5parsing and scraping #5.fmp12.zip

Some sites do not contain websites or clergy or email; they seem to have all the other fields. I need to address that.

I'm wondering why you are using

Insert Calculated Result (pastes the result of a calculation into the current field in the current record)

instead of

Set Field (Replaces the entire contents of the specified field in the current record with the result of a calculation)

I can count on one hand the times since the release of version 7 that I have used Insert Calculated Result.

The example file doesn't show any email, Twitter, or Facebook data.

I would use the "Hide object when" setting if the empty fields are annoying to you. For the fields use IsEmpty ( Self ), and for the field names IsEmpty ( Table::email ), etc.

HTH

Lee

  • Author

Thank you, Lee, for that insight. I didn't really know the difference.

I need to add an example that has the facebook or twitter data. But that isn't so important to me.

I had considered what you are saying about the IsEmpty function, but my problem is still trying to get in the web address and the email address, which is my point for doing this whole thing.

But since the HTML is different, I haven't a clue about how to do it. Everything I have tried doesn't work.

Thanks for your input.

Much appreciated.

If the code for any field is not there (missing), how is that handled? Is there any way to instruct FileMaker not to capture anything if nothing is there? (Sometimes there is no clergy or website or email; sometimes there is.)

 

I thought all these sites were using the same template. If some "fields" can be missing, then use the following pattern:

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "content" ) ;
prefix = "<span class=\"locality\">" ;
suffix = "</span>" ;
pos = Position ( text ; prefix ; 1 ; 1 ) ;
start = pos + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Case ( pos ; Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) ) )
)

This will return nothing when the source HTML doesn't contain the prefix.
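The Case ( pos ; … ) test corresponds to checking whether the prefix anchor was found at all. A Python sketch of that guard (names are illustrative, not from the thread):

```python
def extract_optional(text, prefix, suffix):
    """Mirror the Case() guard: return '' when the prefix anchor is absent."""
    pos = text.find(prefix)
    if pos == -1:
        return ""                      # field missing on this page
    start = pos + len(prefix)
    end = text.find(suffix, start)
    return text[start:end].replace("\n", "").strip()

page = 'City: <span class="locality">Greensboro</span>'
print(extract_optional(page, '<span class="locality">', '</span>'))    # Greensboro
print(repr(extract_optional(page, 'Clergy:&nbsp;</div>', '</div>')))   # ''
```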

  • Author

Outstanding. That is a big load off my mind. I am going to get right on that.

Thanks so much. That will get rid of all those massive code captures.

How would I also add the code Raybaudi gave me earlier? That took care of a long white space.

This is what that was:

 

Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )

How would I add the code Raybaudi gave me earlier as well
 
It's already there.
  • Author

Ah okay. thanks

If you have any suggestions about getting the email or the website field, I WOULD BE SO HAPPY.

I don't know how to figure that out. I have parsed the name field to remove the apostrophe, and everything else is working.

Go ahead. Make my NEW YEAR'S EVE, LOL.

 

Anyway Happy and

Blessed New Year to you and thanks for all the Help

This is a wonderful site.

  • Author

Here is what I am using to try to scrape the source code to get the web address.

 

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix = "Website:&nbsp;</div>
<a href=" ;
suffix = "</a>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )
)

 

This gives me a huge amount of the source code and I don't know why.

I don't understand how to apply the anchors when the tags are different in the HTML source.

I can't find out where to find these answers.

 

This is the same type of code for both the email and the web address, but inherently they are <a href> tags. How do I differentiate them from the simpler ones?

prefix = "Website:&nbsp;</div>

<a href=" ;

 

The reason why this doesn't work for you is that an actual carriage return within a calculation formula is read as a space. If there is a carriage return in the original HTML, you need to write it as ¶, i.e.

prefix = "Website:&nbsp;</div>¶<a href=" ;

If the new line character is a linefeed rather than a carriage return, you will need to use:

prefix = "Website:&nbsp;</div>" & Char (10) & "<a href=" ;
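The same distinction can be verified quickly in any language: the anchor string must contain the exact newline character the HTML uses. A Python sketch (the page-source string is abbreviated from the HTML quoted above):

```python
page_source = 'Website:&nbsp;</div>\n<a href="http://www.allsts.org/">'

# A carriage-return anchor fails against linefeed HTML:
print('Website:&nbsp;</div>\r<a href=' in page_source)   # False
# The linefeed anchor matches:
print('Website:&nbsp;</div>\n<a href=' in page_source)   # True
```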
  • Author

Hi and thank you but that didn't work.

I have enclosed my trial file so you can see what it does.

 

You can just use command 1 to activate the script.

Website loads the whole page.

 

 

It is only a little bit of help I need to finish this. If someone could just look over this file and see what happens when you use the Command-1 script... It gives a whole page of source, and all I need is the website and the email, and I just don't know how to do that one thing.

Please help.

Thank you

6parsing and scraping #6.fmp12.zip

  • Author

I wish someone would help me. It is so easy for you and so hard for me. I am getting exhausted and learning nothing because it keeps failing. I asked if someone would check the file in my last post, but since last week no one has downloaded it and no one has looked at it. Very discouraging. This should be a good example that teaches a lot of people important skills. I am just a novice asking questions of people with experience.

OK?

I'd love to help, but this is well beyond me; sorry :-(

You were instructed in message 16 to use Set Field instead of Insert Calculated Result.

You're still not doing that, which makes it harder to troubleshoot.

  • Author

OK, I did that so it can be figured out more easily. It didn't seem to change anything.

It is enclosed

 

7parsing and scraping #7.fmp12.zip

Take a look at this approach. You use variables to declare the text, prefix, and suffix, then calculate the result and set the field.

It is easier to change and test the prefix and suffix values.

Also, by storing the text in a field, you can examine it more easily.

You don't have a simple problem to solve.

6parsing and scraping MOD.fmp12.zip

  • Author

Thank you

I looked over your modifications (thank you) and I see the logic, as you pointed out, of making it easier to change and examine. I am sorry that, as a novice, I don't quite understand the principles of taking that to the next level.

I tried to do the address and the website the way you did it and failed at both, and I am not sure what I am doing wrong.

If you would look at the changes I made, you might be able to correct my understanding. I tried the address and the website.

I have a lot of this kind of thing to do so it is vital I understand it.

Thank you again

 

Ps I added a bigger website content field so it would be easier to see.

6parsing and scraping MOD 2 Website email name address.fmp12.zip

You do have to work accurately.

 

On the address, you omitted the first quote.

And, it's a div not a span.

[two screenshots attached]

Even more accuracy problems.

 

The format of this method is:

 

set the prefix

set the suffix

get a calculated result

set the target field to the result

 

Compare what you are doing with Zip, Country, and Website.

ALSO: I strongly suggest you get FileMaker Pro Advanced, if you do not have it.

(Or update your profile if you do have it)

That aids enormously in troubleshooting.

[screenshot attached]

  • Author

Thank you Bruce. I do have advanced.

I put in the missing quotation mark and it somehow gives me a whole page of developer code.

I am sorry but I don't understand why this happens.

In the file I am uploading now you can see that with any record by pressing command 1 for activating the script.

But when I added the " mark where you said it was omitted it still brings all that page source into the address field.

The email field is a disaster too. I am sorry I am so thick but I am trying hard.

parsing and scraping MOD 3.fmp12.zip

Really. Pay attention. Look again at BOTH IMAGES on message 31.

Regarding website:

There are four steps.

In Zip, you do steps 1, 2, 3, 4.

In Country, you do steps 2, 3.

In Website, you do steps 1, 2, 4.

  • Author

I see the two images you sent, and I corrected the span to a div, and it still gives me the same result.

Thank you Bruce. I do have advanced.

 

hownow, there has been a ton of time spent by others to help you. In order to get good help, you need to provide us with the most thorough picture of what is happening. This begins with your basic profile. There are a lot of differences between developing in the client version of FileMaker and being able to use the tools available in the Advanced version.

 

You need to update your profile to reflect your current FileMaker version. Use this link MY PROFILE

***WHAT GIVES YOU THE SAME RESULT?***

 

You have at least three problems going on: address, country, and website.

Note that there is no instance of "website" in the content.

  • Author

I meant the address wasn't working. I am only going to tackle one at a time.

I upgraded my profile

 

So I have only tried to do one thing at a time

In this file I changed the span to a div and included the quotation mark.

I am only trying to get the address field to work.

 

8mod scraping and parsing (address change.fmp12.zip

No; actually LOOK at the address field content. Click in the field.

 

It IS working. It has leading spaces, tabs, etc.

 

More attention to detail.

Hi

 

I upgraded my profile

Your profile does NOT reflect a change? Perhaps you missed the button to save changes? Or did I misunderstand your post.

  • Author

Thank you for that. You are right. But why the spaces and all?

I didn't see it.

So I would have to make another script step to correct the extra space?

I will do that.

I thank you.

I will work on the others. I have an emergency

I will post when I try the next field

 

 

  • FM Application: 13 Advanced

GREAT!!!

"Why the spaces and all."

I suggest you take a shot at explaining that.

  • Author

I don't know. I was using a trim function the way we were doing it before. I am not knowledgeable about the reason there is so much white space. My next thought was to parse it somehow, but as to correcting why it started that way, I don't know.

Keep trying; keep LOOKING at your data and your process until you CAN explain it.

I haven't been following this thread. After 43 posts, I had given up, and Bruce has been more than patient. However, try this on the address field:

Trim ( LeftWords ( Table::address ; WordCount ( Table::address ) ) )

The thing is... Trim() does not remove carriage returns, nor does it remove other hidden characters. One caveat about using LeftWords() as I've presented it is that it drops word-delimiter characters such as $, #, =, ¶.

 

This dropping word-delimiters would only happen from the beginning of the field (in the case of LeftWords) or from the end (in the case of RightWords).
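For comparison, Python's strip() removes all leading and trailing whitespace at once, including spaces, tabs, linefeeds, and carriage returns (the address string below is invented for illustration):

```python
raw_address = "\n\t   123 Main Street\r\n   "
print(repr(raw_address.strip()))  # '123 Main Street'
```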

  • Author

Thank you, LaRetta. That part worked, but Bruce has me trying to do this another way. I am now trying to figure out the website and email, which have been my most important requests to find out how to do.

No, I am not trying to have you do it another way.

 

I am trying to get you to look at what you are doing, look at the data, perform accurate and complete work, and understand how this is working.

Explaining things to others is a good way to develop an understanding.

 

If you had been able to do that, you would have explained that the original data captured from the web site contains all these spaces, returns, etc.

  • Author

I'm trying, really trying.
