web scraping getting text between two text strings.

hownow · December 29, 2014

Â

I have included 3 sample urls containing information I am trying to scrape

Â

I would like to open the URLS within filemakers' web browser and scrape them for this information

I have included 3 samples because they vary. I would simply be happy to get the text between

The title of the church and "see map: Google Maps"

Â

Basically there are a lot of variables. Sometimes there are paragraphs of text in the middle of where I would like to select.

Â

I am interested mostly in getting these fields populated if the web page has them

Â

Name of Church

Address

City

State

zip

Clergy

Website

Email

Phone

Facebook

Twitter.

Â

But I would be very happy to get the text scraped from the title to the "see map: Google Maps"

then it would be a matter of parsing the individual fields.

Thanks for help
I included a graphic of what I want to copy on a sample page

Â

comment · December 29, 2014

Here's something you could use as your starting point:

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "content" ) ;
prefix = "<span class="locality">" ;
suffix = "</span>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Middle ( text ; start ; end - start )
)

This extracts the City part of the address.

You need to examine the page source in order to find "anchors" for each data item you want to extract.

Edited December 30, 2014 by comment

hownow · December 30, 2014

I'm Lost

When I try the formula you gave above I name my webviewer but the Prefix is highlighted and I get the error file not found.

I need to make a trial file and see what I can do. But I haven't been able to begin.

I will keep trying to work with your formula

Thanks

Lee Smith · December 30, 2014

Hi hownow,

I'm Lost

When I try the formula you gave above I name my webviewer but the Prefix is highlighted and I get the error file not found.

I need to make a trial file and see what I can do. But I haven't been able to begin.

I will keep trying to work with your formula

Thanks

You have discovered the Self taught method of learning.

Keep in mind, you can always post a file that you are playing with so we can see first hand what you are doing.

This really helps when we discuss your question using your layouts, field names, scripts, relationship graphs, etc.

Good luck,

Lee

LaRetta · December 30, 2014

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "source" )

Simple copy/paste issue ... add a semi-colon at the end of this line.

... and corrected 'past' to 'paste' on mine, LOL.

Raybaudi · December 30, 2014

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "source" )

Another issue: substitute "source" with "content".

Remember to give a name to your Web Viewer ( this example works if the web viewer name is "yourWebViewer" ).

That calculation must be unstored.

You could by-pass the web viewer approach using the script step: Insert From URL

comment · December 30, 2014

Arrgh, I need new glasses ... Thanks, LaRetta and Daniele.

hownow · December 30, 2014

I tried it and made a file to experiment which I am enclosing (parsing and scraping)

I tried to just get it to enter the city with the script "copy and parse" but It doesn't send anything to the field "city"

That is all I have and It doesn't seem to work.

Thanks for starting me and all the corrections

parsing and scraping.fmp12.zip

Raybaudi · December 30, 2014

Change the script calculation with:

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix = "<span class="locality">" ;
suffix = "</span>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Middle ( text ; start ; end - start )
)

hownow · December 30, 2014

Thank you

I got the State from that. I just tried to do a <div class> for the address. but it pastes it with a big space before it so I am doing something wrong...

I enclosed it with that modification. If you do the script you can see what it does.

I can't seem to get the other areas . I am sure it is somehow the same principal but I don't know what I am missing

I sort of have the Name (But somehow it adds a lot more stuff than I need and I don't know why,

the address ( lot of white space)

City is fine

State is fine

Country is fine so

Here is the amended file.

Please help me figure out what I have missing. I get the principles of the example. But I can't figure out the others. I need help with those.

thanks

4parsing and scraping #4.fmp12.zip

Raybaudi · December 31, 2014

Wich is the name that you want to extract from this HTML part:

<title>All Saints' Episcopal Church, Greensboro, NC | Episcopal Church</title>

:question:

To get the address change the calculation to:

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix ="<div class="street-address">" ;
suffix = "</div>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )
)

hownow · December 31, 2014

Hi and Thanks so much

I just want the name of the church , The problem is that sometimes there is a "'" apostrophe and that takes the form of" &#039"

But it is not always there.

I have the rest of the address so just the church name is important in the Name field

I am placing the source code here to help discuss the other fields.

the EMAIL if there is one and the Clergy and the Website fields are the most important. So I will include to codes here. When I try them they don't work.

for the Website the code around it is:

</div>
<div class="field field-type-text field-field-website">
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Website: </div>
<a href="http://www.allsts.org/">http://www.allsts.org/</a> </div>
</div>

for the Email it is

<div class="field field-type-text field-field-email">
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Email: </div>
<a href="mailto:[email protected]">[email protected]</a> </div>
</div>

For the CLERGY it is

<div class="field-label-inline-first">
Clergy: </div>
The Rev. Kurt Wiesner </div>
</div>
</div>

I don't understand any of these when I try them.

If I get these done I will be able to apply to others

I thought the only way to webscrape was to get the text between to areas and then parse it in another field. But this way is so much more effective if I could just get it!

So thanks for answering and all your help.

comment · December 31, 2014

The problem is that sometimes there is a "'" apostrophe and that takes the form of" &#039"
But it is not always there.

Actually, it takes the form of ' and you can use the Substitute() function to replace it with an actual apostrophe (along with other HTML entities that you may find).

I am not sure why you're having problems with the other items. For example, for CLERGY you could use:

prefix = "Clergy:&nbsp;</div>" ;
suffix = "</div>" ;

hownow · December 31, 2014

I am having problems with those others. I applied the prefix = "Clergy: </div>" ;
suffix = "</div>" ; that you gave me and the trim function

Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )
) and that worked.

If any code for fields are not there (missing) how is that handled? Is there any code to instruct filemaker not to capture if nothing is there? ( sometimes there is no clergy or website or email . Sometimes there is....

hownow · December 31, 2014

I am enclosing the amended file to show what I can get and what I can't get

I cannot figure out Email and Web address

Those are the most important information.

I also don't know what to do to eliminate bringing in pure code into the field if that item is not present in the website.

You can see it if you try some of the records. Some sites do not contain websites or clergy or email.. they seem to have all the other fields. I need to address that.

5parsing and scraping #5.fmp12.zip

Lee Smith · December 31, 2014

Some sites do not contain websites or clergy or email.. they seem to have all the other fields. I need to address that.

I'm wondering why you are using?

Insert Calculated Result (Pastes the result of a calculation into the current field in the current record)

instead of

Set Field (Replaces the entire contents of the specified field in the current record with the result of a calculation)

I can count on one hand since the release of version 7 that I have used Insert Calculated Result.

The examples file doesn’t show any emails, Twitter or Facebook data.

I would use the Hide object when if the empty fields are annoying to you. For the fields use IsEmpty (Self) and for the field names IsEmpty (Table::email) etc.

HTH

Lee

hownow · December 31, 2014

Thank you Lee for that insight , I didn't really know the difference.

I need to add an example that has the facebook or twitter data. But that isn't so important to me.

I had considered what you are saying about the ISEmpty function but my problem is still trying to get in

the web address and the email address which is my point for doing this whole thing.

But since the HTML is different , I haven't a clue about how to do it. Everything I have tried doesn't work.

Thanks for your input.

Much appreciated.

comment · December 31, 2014

If any code for fields are not there (missing) how is that handled? Is there any code to instruct filemaker not to capture if nothing is there? ( sometimes there is no clergy or website or email . Sometimes there is....

I thought all these sites were using the same template. If some "fields" can be missing, then use the following pattern =

Let ( [
text = GetLayoutObjectAttribute ( "yourWebViewer" ; "content" ) ;
prefix = "<span class="locality">" ;
suffix = "</span>" ;
pos = Position ( text ; prefix ; 1 ; 1 ) ;
start = pos + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Case ( pos ; Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" )  )  )
)

This will return nothing when the source HTML doesn't contain the prefix.

hownow · December 31, 2014

Outstanding. That is a big load off my mind. I am going to get right on that.

Thanks so much. That will rid all those massive code captures.

How would I add the code Raybaudi gave me earlier as well because that took care of

a long white space.

This is what that was:

Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )

comment · December 31, 2014

How would I add the code Raybaudi gave me earlier as well

It's already there.

hownow · December 31, 2014

Ah okay. thanks

If you have any suggestions about getting the email or the website field . I WOULD BE SO HAPPY.

I Don't know how to figure that out. I have parsed the name field to rid the apostrophe and everything else is working.

Go ahead. Make my NEW YEARS EVE LOL

Anyway Happy and

Blessed New Year to you and thanks for all the Help

This is a wonderful site.

hownow · January 1, 2015

Here is what I am using to try to scrape the source code to get the web address.

Let ( [
text = GetLayoutObjectAttribute ( "webwindow" ; "content" ) ;
prefix = "Website: </div>
<a href=" ;
suffix = "</a>" ;
start = Position ( text ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
end = Position ( text ; suffix ; start ; 1 )
] ;
Trim ( Substitute ( Middle ( text ; start ; end - start ) ; Char ( 10 ) ; "" ) )
)

This gives me a huge amount of the source code and I don't know why

I don't understand the difference in how to apply the tags that are different in HTML source.

I can't find out where to find these answers.

This is the same type of code for both the email and the web address. But inherently they are a href tags. But how do I differentiate them from the simpler ones?

comment · January 2, 2015

prefix = "Website: </div>

<a href=" ;

The reason why this doesn't work for you is that an actual carriage return within a calculation formula is read as a space. If there is a carriage return in the original HTML, you need to write it as ¶, i.e.

prefix = "Website:&nbsp;</div>¶<a href=" ;

If the new line character is a linefeed rather than a carriage return, you will need to use:

prefix = "Website:&nbsp;</div>" & Char (10) & "<a href=" ;

hownow · January 2, 2015

Hi and thank you but that didn't work.

I have enclosed my trial file so you can see what it does.

You can just use command 1 to activate the script.

Website loads the whole page.

It is only a little bit of help I need to finish this . If someone could just look over this file and see what happens when you use the command 1 script .... It gives a whole page of source and all I need is the website and the email and I just don't know how to do that one thing.

Please help.

Thank you

6parsing and scraping #6.fmp12.zip

hownow · January 7, 2015

I wish someone would help me. It is so easy for you and so hard for me. I am getting exhausted and learning nothing because it keeps failing . I asked if someone would check the file in my last post but since last week no one has downloaded it and no one has looked at it. Very discouraging. This should be a good example that teaches a lot of people important skills. I am just a novice asking questions from people with experience.

OK?

MikeKD · January 7, 2015

I'd love to help, but this is well beyond me; sorry :-(

bruceR · January 7, 2015

You were instructed at message 16 to use set field instead of insert calculated result.

You're still not doing that; which makes it harder to troubleshoot.

hownow · January 8, 2015

ok I did that so it can be figured out easier . It didnt seem to change anything.

It is enclosed

7parsing and scraping #7.fmp12.zip

bruceR · January 8, 2015

Take a look at this approach. You use variables to declare the text, prefix, and suffix, then calculate the result and set the field.

It is easier to change and test the prefix and suffix values.

Also, by storing the text in a field, you can examine it more easily.

You don't have a simple problem to solve.

6parsing and scraping MOD.fmp12.zip

hownow · January 8, 2015

Thank you

I looked it over (your modifications) (thank you) and I see the logic, as you pointed out, of making it easier to change and examine. I am sorry, as a novice, that I don't quite understand the principles of taking that to this next level for me.

I tried to do the address and the website (as the way you did it) and failed at both( in your way of doing things) and I am not sure what I am doing wrong.

If you would look at the changes I made you might be able to correct my understanding. I tried the address and the website.

I have a lot of this kind of thing to do so it is vital I understand it.

Thank you again

Ps I added a bigger website content field so it would be easier to see.

6parsing and scraping MOD 2 Website email name address.fmp12.zip

bruceR · January 8, 2015

You do have to work accurately.

Â

On the address, you omitted the first quote.

And, it's a div not a span.

bruceR · January 8, 2015

Even more accuracy problems.

Â

The format of this method is:

Â

set the prefix

set the suffix

get a calculated result

set the target field to the result

Â

Compare what you are doing with Zip; country; and website.

ALSO: I strongly suggest you get FileMaker Pro Advanced; if you do not have it.

(Or update your profile if you do have it)

That aids enormously in troubleshooting.

hownow · January 8, 2015

Thank you Bruce. I do have advanced.

I put in the quotation omission and it somehow gives me a whole page of developer code.

I am sorry but I don't understand why this happens.

In the file I am uploading now you can see that with any record by pressing command 1 for activating the script.

But when I added the " mark where you said it was omitted it still brings all that page source into the address field.

The email field is a disaster too. I am sorry I am so thick but I am trying hard.

parsing and scraping MOD 3.fmp12.zip

bruceR · January 8, 2015

Really. Pay attention. Look again at BOTH IMAGES on message 31.

bruceR · January 8, 2015

Regarding website:

There are four steps.

In ZIPS you do step 1,2,3,4

In Country, you do step 2, 3

In website, you do step 1,2,4

Sign In

web scraping getting text between two text strings.

Recommended Posts

Create an account or sign in to comment

Create an account

Sign in

Important Information