
web scraping getting text between two text strings.


This topic is 2439 days old. Please don't post here. Open a new topic instead.


Thank you Bruce. I do have advanced.

 

hownow, there has been a ton of time spent by others to help you. In order to get good help, you need to provide us with the most thorough picture of what is happening. This begins with your basic profile. There are a lot of differences between developing in the client version of FileMaker and being able to use the tools available in the Advanced version.

 

You need to update your profile to reflect your current FileMaker version. Use this link MY PROFILE

Link to comment
Share on other sites

***WHAT GIVES YOU THE SAME RESULT?***

 

You have at least three problems going on. Address; county; website.

Note that there is no instance of "website" in the content.

Link to comment
Share on other sites

No; actually LOOK at the address field content. Click in the field.

 

It IS working. It has leading spaces, tabs, etc.

 

More attention to detail.

Link to comment
Share on other sites

Thank you for that. You are right. Why the spaces and all?

I didn't see it.

So I would have to make another script step to correct the extra space?

I will do that.

I thank you.

I will work on the others. I have an emergency

I will post when I try the next field

Link to comment
Share on other sites

I don't know. I was using a Trim function the way we were doing it before. I do not know why there is so much white space. My next thought was to parse it somehow, but as for correcting why it started that way, I don't know.

Link to comment
Share on other sites

I haven't been following this thread.  After 43 posts, I had given up and Bruce has been more than patient.  However, try this on the address field:

Trim ( LeftWords ( Table::address ; WordCount ( Table::address ) ) )

The thing is ... Trim() does not remove carriage returns nor does it remove other hidden characters.  One caveat about using LeftWords() as I've presented is that it drops word-delimiter characters such as $, #, =, ¶.

 

This dropping word-delimiters would only happen from the beginning of the field (in the case of LeftWords) or from the end (in the case of RightWords).
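
If losing delimiter characters is a concern, another option is to swap the likely invisible characters for plain spaces and then apply Trim(). The following is only a sketch: it assumes the stray characters are returns, tabs, or non-breaking spaces, which nothing in this thread confirms, and it reuses the Table::address field from the calculation above.

```
Trim (
  Substitute ( Table::address ;
    [ ¶ ; " " ] ;            // carriage returns become spaces
    [ Char ( 9 ) ; " " ] ;   // tabs become spaces
    [ Char ( 160 ) ; " " ]   // non-breaking spaces become spaces
  )
)
```

Note that returns inside the address are also turned into spaces. To find out which character is actually at the front of the field, Code ( Left ( Table::address ; 1 ) ) in the Data Viewer shows its Unicode code point, and that value can be added to the list as another Char ( n ) pair.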

Link to comment
Share on other sites

No, I am not trying to have you do it another way.

 

I am trying to get you to look at what you are doing, look at the data, perform accurate and complete work, and understand how this is working.

Explaining things to others is a good way to develop an understanding.

 

If you had been able to do that, you would have explained that the original data captured from the web site contains all these spaces, returns, etc.

Link to comment
Share on other sites

In each instance of this there is a beginning point and an ending point through which the text is captured/scraped.

It is like a sandwich at a buffet of many things, and I only want the contents of this sandwich.

I identify the sandwich in very specific locations. For me it is like a foreign language I can't speak.

I am trying to ask with correct grammar of which I have no understanding. I have been seeking the way to speak this to correctly get what I need. I have learned to speak the prefix of "<span> and the end of (suffix) <span> and I can get that content.

I am now speaking blindly and rattling on to the best of my ability and through the help of others, I get to understand the grammar of <div> and <div> .... People at the buffet await my discovery of their words in their language but every command I try doesn't follow the same grammar and I blunder.... Because to get the name of the food I still have to understand the grammar of <title>

But especially hard to speak are the mysterious references to website and email, which as a novice I have tried to address, but to no avail. The sandwich is more like a club sandwich with three slices of bread! I don't know how to use this grammar because the suffix and prefix vary and I do not know the rules. So when I ask wrongly I get spooned large amounts of wrong things. I only know the principle, but this language is beyond me. I can only repeat what others teach me. I am getting to where I am afraid to ask.

Link to comment
Share on other sites

That seems like a good start. Web scraping is notoriously difficult. It certainly helps to learn the tools (scripting and calculation etc) but you may be assuming that the sandwich is structured consistently.

 

There is absolutely no guarantee of that and no guarantee that what you figure out today will work tomorrow.
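
Because the page structure can change, it may help to have the calculation return nothing when the markers are missing, rather than a large block of the wrong text. A minimal sketch, assuming a hypothetical field Table::PageSource that holds the scraped page and using the title tag purely as an illustration:

```
Let ( [
  source = Table::PageSource ;   // hypothetical field holding the page source
  prefix = "<title>" ;
  suffix = "</title>" ;
  prefixPos = Position ( source ; prefix ; 1 ; 1 ) ;
  startPos = prefixPos + Length ( prefix ) ;
  endPos = Position ( source ; suffix ; startPos ; 1 )
] ;
  // Position returns 0 when the search string is not found
  If ( prefixPos > 0 and endPos > 0 ;
    Middle ( source ; startPos ; endPos - startPos ) ;
    ""
  )
)
```

An empty result is then a signal to go back and look at the page source, instead of wrong data being stored silently.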

Link to comment
Share on other sites

Is it better, then, to capture the whole text and collect data from between two text strings? That was my original intention. My topic was web scraping: getting text between two text strings. So suppose I have a word like "seconds" and another text string like "another way of saying", as in the example below.

 

 

"He was merely seconds away from finding out how to get an email. Another way of saying he was never going to get his project done."

 

How can I capture just "away from finding out how to get an email."? Which is the most reliable and useful way to do this?

Link to comment
Share on other sites

It feels like you keep putting this back on us - that we do not answer you correctly.  I do not like that at all.  It is your responsibility not ours ... if you would study what we've given you in more depth, you can figure it out.  Yes, it is difficult work and why we get paid the big bucks.  I know FileMaker makes it look easy but it is not. However, it IS easy to replicate what we give you and learn from it.  It does not appear that you are learning any of the text parsing techniques we are presenting.

 

Having said that, the concept remains the same ... find the string you wish to parse, find its beginning, find its end and grab everything in between.  Going by your file (and again, I do not have time to really delve into this), you have a large block of text in your email field.  To find the website from it:

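Let ( [
field = Table::email ;
// quotes inside the text constant are escaped with a backslash
begin = "<meta property=\"og:url\" content=\"" ;
// position of the first character after the begin string
start = Position ( field ; begin ; 1 ; 1 ) + Length ( begin ) ;
// position just before the first /> that follows
end = Position ( field ; "/>" ; start ; 1 ) - 1
] ;
Middle ( field ; start ; end - start )
)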

So find the data you want in the text, go backwards until you find its opening tag, put the tag in the begin portion, then look for the first closing tag past that.
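
The same pattern can be tried on the plain-text example asked about earlier in the thread. This is a sketch meant for the Data Viewer; the variable names are made up, and the spaces included in the prefix and suffix are a choice made here so that the result has no leading or trailing space.

```
Let ( [
  source = "He was merely seconds away from finding out how to get an email. Another way of saying he was never going to get his project done." ;
  prefix = "seconds " ;
  suffix = " Another way of saying" ;
  // first character after the prefix
  startPos = Position ( source ; prefix ; 1 ; 1 ) + Length ( prefix ) ;
  // first character of the suffix, searching only past the prefix
  endPos = Position ( source ; suffix ; startPos ; 1 )
] ;
  Middle ( source ; startPos ; endPos - startPos )
)
```

With this sample text the result should be: away from finding out how to get an email.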

 

I have done ZERO web scraping (just haven't had the need) but it still is just parsing text, from what I can gather.  Now go back up to post #2 where Comment provided you EXACTLY the principle you needed to get this done.

Link to comment
Share on other sites

BTW, the calculation I presented you in post #47 was because Trim() was not removing the beginning spaces and carriage return.  Those were obviously not regular spaces but other hidden garbage (for lack of a proper word).  So by using LeftWords() as I had, it ignored all the beginning invisible junk FM does not consider a word.  But you didn't even ask me why it worked!  I was showing you how to eliminate "all that white space at the beginning".  Sometimes external text can hold garbage characters and you'll need to address that.

 

Also, you said another time that you couldn't get out of the calculation dialog.  You should have copy/pasted your calculation here for us to review instead of again throwing up your hands and telling us something we suggested didn't work.  Over 50 posts here where it should have only taken 5-6 at most.  You expect the perfect answer be given you and YOU must do the work to get there - not us.

Link to comment
Share on other sites

Also note that if a quote character appears inside a quoted text constant, as in this case, you must escape it by placing a backslash (\) before the quote character.  Again, if you get stuck, copy your calculation which is throwing an error, tell us exactly where it is highlighting the error and we can help you fix your calculation.  That does NOT mean what we gave you was incorrect ... only that the specific text string you are using probably contains a character which must be escaped.

Link to comment
Share on other sites

Thank you for your answers, all of you. I am sorry, after I retired 23 years ago, because of health issues, I became heavily medicated. I was a kind, award winning educator but my thought processes have faded. I cannot think as I used to. I never meant any hostility or disrespect to anyone. I have tried my best but I think it is time for me to stop trying as Bruce suggested. I am sorry I put all of you who are experts through such a waste of your time.

I do, however, take exception to the implication that I am not doing my part. I ask because I don't know. I worked very very hard to try to get this right. Sometimes I didn't know what I was looking for and I am glad people made me look harder. BUT I DID LOOK HARDER. I made 31 files (builds) and didn't get it. Just because you tell me something doesn't mean I am not trying hard. I do know how it can be frustrating. I had some slow kids but I never let them know that. One of them is now a very accomplished musician and I taught him to play his instrument.

 

I am giving up on this now. I am too old and sickly. I wanted to do something with the last part of my life. Keep going full strength with newcomers, and be patient in your endeavors with others.  I valued your thoughts and admired your advice but it is too late for me. I am just a novice and I need to bow out. FileMaker is too difficult. I thought there was a simple solution to get the text between two text strings within all the page source to extract the email and the web address. Sorry to all of you for your time and effort on my part. Of course I know your intentions are good. Good Bye and God Bless

Link to comment
Share on other sites

Come on chap; don't give up. You're already doing stuff that's miles more advanced than what I'm trying to tackle!

 

It sounds like you're getting frustrated with yourself for not understanding bits and for making silly typo / type mistakes. LaRetta / Bruce etc will always help to the best of their ability, but sometimes their hands are tied when they don't have enough information.

 

If I'm honest; I'd have no confidence in my own ability to find the text you're looking for, but that's because it's a difficult task, not because I'm an idiot (though I am a music teacher too.....)

 

Best wishes,

Mike

Link to comment
Share on other sites

Don't worry about it, hownow.  Take a break then come back and try the suggestions again.  We wouldn't be hanging in there with you if we didn't care.

 

I've quit this business a thousand times because of the same types of frustrations ... I think we all have.  :-)

Link to comment
Share on other sites

Thanks Mike and LaRetta  and of course to all of you who have helped. 

I was very tired, I tried for over a week and got stuck. Most of all stuck in my own head. Sorry for that.

I took a long nap LOL and feel refreshed. Every effort is appreciated. I have a better handle on my problem now.

Again thanks, I will be glad to continue to ask questions because I have many. FileMaker is such an enduring program and, like everyone, I am as excited as I can be about the new versions every time. I want to remain part of this community.

So thanks everyone. I will get it sooner or later. I just need to pace myself.

Link to comment
Share on other sites

Reading up in FM Help on these parsing functions will help also but here is a breakdown of the calculation I provided:

Let ( [
field = Table::email ;
begin = "<meta property=\"og:url\" content=\"" ;
start = Position ( field ; begin ; 1 ; 1 ) + Length ( begin ) ;
end = Position ( field ; "/>" ; start ; 1 ) - 1
] ;
Middle ( field ; start ; end - start )
)
  • 'begin' string is the beginning tag string from its opening < clear through to the point of your text you wish to capture.
  • 'start' uses the Position() function to 'count characters' to where the < starts so if you put this in the data viewer, it will produce the location of the <, with a result similar to 675, being the 675th character in that field.
  • to find the character start of your text within, we must then add the Length() of that begin string so if the string is 25 in length, the NEW starting position is 675 + 25 (700) to get to the beginning of your desired text.
  • 'end' finds the first /> immediately after the begin string so it might be 800.
  • Then the calculation Middle()  says to go to the start position 700 and grab everything for the next 'x' characters.  We know 'x' because 800-700 means the text in the middle is 100 characters long.
  • So Middle ( email field ; 700 ; 800 - 700 ) returns text between start and end.
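
To make the arithmetic in the list above concrete, here is a sketch that can be pasted into the Data Viewer. The sample text is made up. Unlike the calculation above, it uses the closing quote of the content attribute as the end marker, which is an assumption about how the tag is written on the real page.

```
Let ( [
  // made-up sample standing in for the page source
  field = "<head><meta property=\"og:url\" content=\"http://example.com/\" /></head>" ;
  begin = "<meta property=\"og:url\" content=\"" ;
  // begin is found at character 7 and is 33 characters long, so start = 40
  start = Position ( field ; begin ; 1 ; 1 ) + Length ( begin ) ;
  // the first quote character at or after position 40 is at 59
  end = Position ( field ; "\"" ; start ; 1 )
] ;
  // 59 - 40 = 19 characters, starting at character 40
  Middle ( field ; start ; end - start )
)
```

With this sample the result should be http://example.com/ with no trailing quote or space. Replacing the last line with just start or just end shows the intermediate numbers.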

While reading Help on these functions, open your Data Viewer and create just the 'start' variable.  In place of 'field', insert the field value and in place of 'begin', insert the quoted string.  Then do the same with Middle().  You can watch the results and adjust your calculations accordingly.

 

What can trip you up is escaping in these types of strings.  Quotes within quotes must be preceded with a backslash (\).  And then, if invisible garbage remains in your resultant text, there are various techniques to strip it.  One example was my quick-and-dirty method of using LeftWords(), but you might use Substitute() or other text functions.  When stuck in a calculation which won't let you out, copy it and paste it here within code (using the <> icon) and we can help you identify why it breaks.

 

And welcome back.

Link to comment
Share on other sites

By the way, please use Comment's calculation in post #2.  I see he lists it more clearly for you than mine.  You can simply change the prefix and suffix for each piece you need to extract. :-)

Link to comment
Share on other sites

Thank you for all of that . That is going to be my Sunday reading.

I am having trouble when I ask to look for a text string ">

When I put it in quotes for a text constant  it gives me an error message.     I was trying "">"

 

How can I write that to incorporate it into a calculation?
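
For reference, a quote character inside a text constant needs a backslash in front of it, so the two characters in question are written as "\">" in a calculation. A small sketch for the Data Viewer, using a made-up sample string and made-up variable names:

```
Let ( [
  // made-up sample text
  source = "<a href=\"http://example.com/\">Visit</a>" ;
  // a quote character followed by > written as a text constant
  marker = "\">" ;
  markerPos = Position ( source ; marker ; 1 ; 1 )
] ;
  // the five characters that follow the marker
  Middle ( source ; markerPos + Length ( marker ) ; 5 )
)
```

With this sample the result should be Visit. Writing "">" fails because the second quote closes the text constant early.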

Link to comment
Share on other sites

Thank you Bruce. I will be studying what you have done. I need to understand more about variables.  What LaRetta told me about the \ mark helped me tremendously. I had no idea how to do that. Your work has shown me a lot to study further. When I really get it, if it is okay, I will ask you some more questions to help me clarify. Thanks so much

Link to comment
Share on other sites

For practice I tried to do a couple other items.

I was able to make the phone work and the Fax work

But using the model for Twitter I couldn't get Facebook

and Diocese is a little different.

Those are the only two I can't get

But I was able to get those other 2

Link to comment
Share on other sites
