This topic is 6015 days old. Please don't post here. Open a new topic instead.

Posted

Hi guys, first post here. I am looking for a way to parse each instance of "friendid=870987" in HTML code I have scraped with the web browser. This is my first foray into parsing data, and after several failed attempts I'm getting a little discouraged. Any help would be much appreciated.

Dave

An example would be pulling this data from the following HTML:

[*]

Steve McQueen

s_0c2c1a7319f1a14f5c76789004aeb00c.jpg>>>>>>

184040237 Steve McQueen

also making a new record for each unique found id.

Thanks

Posted

Seems like you can search for the text pattern of:

Posted

Assuming there is some consistency here (hard to tell from a single example), you could try:

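Let ( [
// position of the search string (here, the ID from the example)
pos = Position ( text ; "friendid=184040237" ; 1 ; 1 ) ;
// the wanted text starts right after the 2nd ">" that follows the ID
start = Position ( text ; ">" ; pos ; 2 ) + 1 ;
// ... and ends just before the next "<"
end = Position ( text ; "<" ; start ; 1 )
] ;
// return nothing if the search string was not found
Case ( pos ; Middle ( text ; start ; end - start ) )
)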

also making a new record for each unique found id

I don't think you gave us enough information to make a suggestion.

Posted (edited)

thank you so much guys,

I still can't seem to work it out.

Here is a better example:

How would I pull each instance of the 6-8 digit strings found after "friendid=" in the following HTML source I scraped from the web viewer?

I really appreciate your help as I've been trying to figure this out for a couple of weeks to no avail. Thanks

D

Edited (Sorry)

Edited by Guest
Posted

Please use an attachment next time.

Since the number of occurrences of the search string is unknown, you need some sort of recursive procedure. It seems like a script would be most suitable here. See a rough sketch in the attached file. It extracts the IDs and names as lists - I presume you'd eventually want to create records from the parsed data. Another thing that needs to be added is the replacement of HTML entities with the corresponding characters.

ParseFriends.fp7.zip

Posted

I promise I won't drop 30-page source files into the forum again as well.

Well, yes, especially when they contain profanity. :)

Posted

I've been looking at the code, but I'm not sure exactly how the script is finding the name. It also seems that it doesn't always work, but it's still a great script. Any chance of a short explanation? Also, what direction would you head in if you wanted to make each of the numbers a new record with a related name?

Thanks!

Posted

The script looks for the n-th occurrence of "friendid=" in the text. Once it has found it, it extracts the ID which begins 9 characters (the length of "friendid=") further and ends with the first quotation mark. Similarly, the name begins after the first following carriage return and ends with another carriage return. After extracting the ID and the name, n is increased by 2 (because there is another paragraph containing "friendid=" in the same block).

This should work as long as the text is consistent in following the rules assumed above.
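
For readers who want to see those rules spelled out, here is a minimal sketch in Python rather than FileMaker script steps. It is not the attached script. The function name `parse_friends` and the sample text are made up for illustration, and it assumes exactly the rules above: the ID ends at the first quotation mark, the name sits between the next two carriage returns, and every block contains "friendid=" twice.

```python
def parse_friends(text):
    """Collect (id, name) pairs following the rules described above."""
    marker = "friendid="
    # Offsets of every occurrence of the marker, in order.
    hits = []
    pos = text.find(marker)
    while pos != -1:
        hits.append(pos)
        pos = text.find(marker, pos + 1)

    friends = []
    # Step by 2: each block is assumed to contain the marker twice.
    for pos in hits[::2]:
        id_start = pos + len(marker)               # 9 characters past the match
        id_end = text.find('"', id_start)          # ID ends at the first quotation mark
        if id_end == -1:
            break                                  # text stops following the assumed rules
        name_start = text.find("\r", id_end) + 1   # name begins after the next carriage return
        name_end = text.find("\r", name_start)     # ... and ends at the following one
        if name_start == 0 or name_end == -1:
            break
        friends.append((text[id_start:id_end], text[name_start:name_end]))
    return friends


# Hypothetical sample: two blocks, each mentioning its ID twice.
sample = (
    '<a href="http://example.com/view?friendid=111111">\r'
    'Alice\r'
    '<a href="http://example.com/view?friendid=111111">pic</a>\r'
    '<a href="http://example.com/view?friendid=2222222">\r'
    'Bob\r'
    '<a href="http://example.com/view?friendid=2222222">pic</a>\r'
)
print(parse_friends(sample))
```

If a page mentions the ID once or three times per block, the step of 2 is the first thing that breaks, so that assumption is worth checking against each new page.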

To create records, put the extracted ID and name into $variables (instead of the Set Field[] steps), go to a layout of the contacts table, create a new record, set the ID and the Name fields to the corresponding $variables, return to the original layout and continue with the loop.

Or, if the text is in a global field in the contacts table, just create a new record at each iteration, set the fields, commit the record and continue with the loop.
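
The record-creation step can be pictured with a small Python sketch, again only as an illustration. A list of dictionaries stands in for the contacts table, and `build_records` is a hypothetical name. Skipping IDs that were already seen is an assumption taken from the original request for one record per unique ID.

```python
def build_records(pairs):
    """Turn (id, name) pairs into one record per unique ID (first name wins)."""
    records = {}
    for friend_id, name in pairs:
        # Only create a "record" the first time an ID shows up.
        if friend_id not in records:
            records[friend_id] = {"id": friend_id, "name": name}
    return list(records.values())


# Hypothetical parsed data with one repeated ID.
parsed = [("111111", "Alice"), ("2222222", "Bob"), ("111111", "Alice again")]
for record in build_records(parsed):
    print(record["id"], record["name"])
```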

Posted

Could you be more specific? In your file, there are 84 occurrences of "friendid=" in the text - and the script manages to extract exactly 42 IDs and 42 names. Which name is not followed by a carriage return?

Posted

In the example I gave, your script worked perfectly. Unfortunately, I am designing the program to extract IDs from HTML pages that sometimes have varying syntax. Because of this, the script sometimes gives me the names along with a large amount of other data. I am well on my way to understanding this, though, so I may be able to figure it out on my own. In the previously attached fp7 file you can see the extra data I am receiving in addition to the IDs. Please let me know if there is any way I can repay you for your help. You have been a lifesaver.

D

Posted

Please let me know if there is any way I can repay you for your help.

Cash, credit card, Jewelry, Ferrari ... many things work very well. :laugh2: And Comment's real name is LaRetta and you can send the payment here.

Posted

Sorry, I must have been looking at the wrong file.

What can I say - this code doesn't follow the same rules as the one before. It's easy to adapt the script to this (see attached), but the question is: will it hold for the next page you want to scrape? Not to mention that the site may change the coding at any time.

ParseFriends2.fp7.zip

Posted (edited)

Got that down, wow!

Now I'm trying to get fancy by looking for a couple more pieces of data, mainly the word following the Genre: tag after the found string. You can see the example, as well as the exact HTML page I will be parsing (by pressing search), in my attached example.

Oh, and I will try to stop bothering you guys. I've almost got this nailed.

ParseFriends3.fp7.zip

Edited by Guest
Posted (edited)

Here is a better example: how would you parse a simple HTML string that included friendid=:)?:?,Genre,City,Fans, with a comma as the delimiter?

I am still not completely clear on how to use Position to find new delimiters after the found search string. Thanks

Edited by Guest
Posted (edited)

The Position function has start and occurrence parameters which can be very useful here. Suppose your text is:

... friendid=A, GenreA, CityA, FansA, friendid=B, GenreB, CityB, FansB, friendid=C, GenreC, CityC, FansC, ...friendid=Z, GenreZ, CityZ, FansZ ...

and you want to extract the city of the n-th friend. The friend's data begins at:

Position ( text ; "friendid=" ; 1 ; n )

and his/her city is between the following 2nd and 3rd commas:

Let ( [
// start of the n-th friend's data
anchor = Position ( text ; "friendid=" ; 1 ; n ) ;
// first character after the 2nd ", " that follows the anchor
start = Position ( text ; ", " ; anchor ; 2 ) + 2 ;
// position of the 3rd ", " that follows the anchor
end = Position ( text ; ", " ; anchor ; 3 )
] ;
Middle ( text ; start ; end - start )
)
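
To experiment with the start and occurrence parameters outside FileMaker, the same calculation can be mimicked in Python. This is only a sketch: `position` is a hypothetical helper that imitates the 1-based result of Position for positive occurrence values (it does not handle negative occurrences, which search backwards in FileMaker), and `city_of` is an illustrative name.

```python
def position(text, search, start=1, occurrence=1):
    """1-based index of the given occurrence of `search` at or after `start`; 0 if absent."""
    idx = max(start, 1) - 1          # convert to a 0-based offset
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1                     # resume one character past this match
    return idx                       # the last increment makes this 1-based


def city_of(text, n):
    """City of the n-th friend: the text between the 2nd and 3rd ', ' after the anchor."""
    anchor = position(text, "friendid=", 1, n)
    second = position(text, ", ", anchor, 2)
    third = position(text, ", ", anchor, 3)
    if not (anchor and second and third):
        return ""                    # friend or delimiters not found
    # Same span as Middle ( text ; second + 2 ; third - (second + 2) ), in 0-based slicing.
    return text[second + 1:third - 1]


sample = "friendid=A, GenreA, CityA, FansA, friendid=B, GenreB, CityB, FansB, "
print(city_of(sample, 1), city_of(sample, 2))
```

Unlike the calculation above, the sketch returns an empty string when the n-th friend or one of the delimiters is missing.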

Edited by Guest
Added spaces after commas, because it was messing with the forums index view
Posted

hi comment,

That makes sense. If the delimiter is something other than a comma, will this still work (if you enter the symbols used as the delimiter), or does the anchor apply only to delimiters that are the same each time? Would you use a Replace function to replace the separators?

Thank you so much for holding my hand through this. Here is a specific example of how the text that I am going to parse looks right now. (i have finally narrowed down exactly what data I need to parse).

Please don't hesitate to let me know someway I can repay you for your help. I am very grateful.

html_to_parse.txt

Posted

In the above example, the anchor is established with no regard to the delimiters. After that is done, you look for delimiters that follow the current anchor, using the anchor position as the start parameter. The delimiter can be anything; you could even use a variable for this, e.g.:

Let ( [
// start of the $i-th "record"
anchor = Position ( text ; $searchString ; 1 ; $i ) ;
// first character after the $j-th delimiter that follows the anchor
start = Position ( text ; $delimiter ; anchor ; $j ) + Length ( $delimiter ) ;
// position of the next delimiter
end = Position ( text ; $delimiter ; anchor ; $j + 1 )
] ;
Middle ( text ; start ; end - start )
)

If your script pre-defines the variables $searchString, $delimiter, $i and $j, you can then use this as a generic calculation to extract a specific "field" ($j) from a specific "record" ($i). This would be quite similar to writing a custom function with the same four parameters.
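
As a rough picture of what such a four-parameter function might look like, here is a Python sketch. The names `nth_find` and `extract_field` and the pipe-delimited sample are hypothetical, and the sketch uses 0-based offsets internally, so it is an analogue of the calculation and not a FileMaker custom function.

```python
def nth_find(text, sub, start, n):
    """0-based offset of the n-th occurrence (n >= 1) of `sub` at or after `start`, or -1."""
    idx = start - 1
    for _ in range(n):
        idx = text.find(sub, idx + 1)
        if idx == -1:
            return -1
    return idx


def extract_field(text, search_string, delimiter, i, j):
    """Text between the j-th and (j+1)-th delimiter after the i-th search string."""
    anchor = nth_find(text, search_string, 0, i)
    if anchor == -1:
        return ""                              # no i-th "record"
    first = nth_find(text, delimiter, anchor, j)
    if first == -1:
        return ""                              # no j-th delimiter after the anchor
    nxt = text.find(delimiter, first + 1)      # the (j+1)-th delimiter
    if nxt == -1:
        return ""
    return text[first + len(delimiter):nxt]


sample = "friendid=A|GenreA|CityA|FansA|friendid=B|GenreB|CityB|FansB|"
print(extract_field(sample, "friendid=", "|", 2, 1))
print(extract_field(sample, "friendid=", "|", 1, 3))
```

Because the delimiter is a parameter, the same function covers commas, pipes, or a multi-character separator without any prior substitution.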

Here is a specific example of how the text that I am going to parse

It seems to be an rtf file? Not that that should make a difference, but the name indicates html.

Posted (edited)

Okay, I know you guys are probably tired of my questions, but I still can't get this to work properly. I am having a beautiful time parsing the HTML, but my code breaks down after name and ID.

How would you parse the 3rd and 4th pieces of data from this line of HTML?

I promise I will leave you guys alone once I figure this out. I have learned so much, but I still can't get the genre, city and plays to parse correctly.


"Panic At The Disco
[b]Genre:[/b] Rock / Big Beat / Techno
[b]Location:[/b] Las Vegas, Nevada
[b]Last Update:[/b]
23 May 2008, 17:20
[b]Plays:[/b] 202,107,106
[b]Views:[/b] 40,554,775
[b]Fans:[/b] 1,411,855
"

thanks heaps.

Edited by Guest
Added Code tag
Posted

Note:

Please do not post HTML directly in the message - it gets modified by the forum software. Use a 'code' tag, or an attachment.

This is again a different problem, but the technique is the same: I would look for the first occurrence of "Location:", starting from the anchor (i.e. the position of "friendID=" in the text). That would give me the starting point.

The ending point seems to be the first occurrence of "
", starting from the above starting point.

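The label-based lookup described above could be sketched in Python as follows. Everything here is illustrative: `value_after_label` is a hypothetical helper, the sample text is invented, and the terminator is assumed to be a line break because that is how it appears in this post. In the real page source it may be an HTML tag instead.

```python
def value_after_label(text, label, anchor=0, terminator="\n"):
    """Text between the first `label` at or after `anchor` and the next `terminator`."""
    label_at = text.find(label, anchor)
    if label_at == -1:
        return ""                         # label not found after the anchor
    start = label_at + len(label)
    end = text.find(terminator, start)
    if end == -1:
        end = len(text)                   # no terminator: take the rest of the text
    return text[start:end].strip()


# Hypothetical, simplified page text with two friends.
sample = (
    "friendid=111\nBand One\nGenre: Rock / Big Beat / Techno\n"
    "Location: Las Vegas, Nevada\nPlays: 10\n"
    "friendid=222\nBand Two\nGenre: Pop\n"
    "Location: Austin, Texas\nPlays: 20\n"
)
anchor = sample.find("friendid=222")      # 0-based anchor of the second friend
print(value_after_label(sample, "Location:", anchor))
print(value_after_label(sample, "Genre:", anchor))
```

Searching by label instead of counting delimiters keeps working when commas appear inside a value, as in "Las Vegas, Nevada".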