greenfields

extract url from html body


I have exported the HTML contents of a web viewer as text to a text field in another layout.

I am trying to write a script that goes through the text and extracts the URLs from it.

All URLs in the text begin with http, and each one is enclosed in double quotes (" ").

I am running a looping script that starts at the first word. If its first 4 characters are http (Left ( $thisword ; 4 ) = "http"), I set the current word plus the next 20 into another variable, which is exported to another field.

The current word then increments by 1, and the loop stops when the current word is the last word.

This is ugly! I wonder if anyone can help me get the exact text of each URL between the " " separators.

Any help much appreciated!

Basically, how do I exit a loop when the current word is followed by a " separator?


Why don't you search for the n-th occurrence of "http", extract from there up to the first following quote, then increase n by 1?


Sorry, I've never used these functions. Would you mind elaborating?

An example would be superb!

Thanks for the input.


Roughly:

Loop
  Set Variable [ $i ; $i + 1 ]
  Exit Loop If [ $i > PatternCount ( text ; "http" ) ]
  Set Variable [ $url ; <> ]
  Perform Script [ New URL Record ; parameter: $url ]
End Loop

where the <> placeholder would be this calculation:

Let ( [
  start = Position ( text ; "http" ; 1 ; $i ) ;
  end = Position ( text ; "\"" ; start ; 1 )
] ;
  Middle ( text ; start ; end - start )
)
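For anyone who wants to check the logic outside FileMaker, here is the same idea as a minimal Python sketch. It is only an illustration, not FileMaker code: the function name extract_urls and its parameters are made up for this example, and it assumes, as in the original question, that every URL starts with http and ends at the next double quote.

```python
def extract_urls(text: str, marker: str = "http", quote: str = '"') -> list[str]:
    """Collect every substring that starts with `marker` and runs up to the next `quote`.

    Hypothetical helper for illustration; assumes each URL is closed by a quote.
    """
    urls = []
    search_from = 0
    while True:
        # Find the next occurrence of the marker at or after the current position.
        start = text.find(marker, search_from)
        if start == -1:
            break  # no more candidates
        # Find the first quote that follows the marker.
        end = text.find(quote, start)
        if end == -1:
            break  # no closing quote, so the candidate is ignored
        urls.append(text[start:end])
        # Continue searching after the closing quote.
        search_from = end + 1
    return urls


if __name__ == "__main__":
    sample = '<a href="http://a.example/x">A</a> <img src="https://b.example/i.png">'
    for url in extract_urls(sample):
        print(url)
```

One difference from the loop above: this sketch resumes the search after the closing quote, so an "http" that appears inside an already extracted URL (for example in a query string) is not counted again. Counting occurrences with PatternCount would produce an extra, partial entry in that case.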


To extract URLs from HTML, ASP, PHP, text, and similar documents, there is a good script posted at http://www.biterscripting.com/SS_URLs.html .

To use it, do the following. (With high-speed internet, the entire process, including installation, should take no more than a couple of minutes.)

1. Download and install biterscripting from http://www.biterscripting.com .

2. Start biterscripting and enter the following command.

script "http://www.biterscripting.com/Download/SS_AllSamples.txt"

(biterscripting can execute scripts directly from a web site)

3. Now you are ready to use the SS_URLs script to extract URLs. This is done with the following command.

script "C:/Scripts/SS_URLs.txt" URL("http://....")

The above will extract the URLs referenced in that web page. Or:

script "C:/Scripts/SS_URLs.txt" URL("C:/....")

The above will extract the URLs referenced in that local file.

Hope this helps.

Patrick

