how would i parse this

Medusa Productions · May 27, 2008

Hi guys, first post here. I am looking for a way to parse each instance of "friendid=870987" in HTML code I have scraped with the web browser. This is my first foray into parsing data, and after several failed attempts I'm getting a little discouraged. Any help would be much appreciated.

Dave

an example would be pulling this data from the following html---->

[*]

Steve McQueen

>>>>>>

184040237 Steve McQueen

also making a new record for each unique found id.

Thanks

mr_vodka · May 27, 2008

Seems like you can search for the text pattern of:

comment · May 27, 2008

Assuming there is some consistency here (hard to tell from a single example), you could try:

Let ( [

pos = Position ( text ; "friendid=184040237" ; 1 ; 1 ) ;

start = Position ( text ; ">" ; pos ; 2 ) + 1 ;

end = Position ( text ; "<" ; start ; 1 )

] ;

Case ( pos ; Middle ( text ; start ; end - start ) )

)
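
For anyone who wants to trace this logic outside FileMaker, here is a minimal Python sketch of the same three steps. The helpers `position` and `middle` are hypothetical stand-ins written for this example to imitate Position and Middle (1-based positions, 0 when nothing is found; this sketch compares case-sensitively), and the HTML string is invented, since the real markup is not visible in the thread.

```python
def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1  # convert the 1-based start to a 0-based index
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1  # 1-based position of this match; the next search resumes here
    return idx


def middle(text, start, length):
    """Hypothetical stand-in for FileMaker's Middle (1-based start)."""
    return text[start - 1:start - 1 + length]


# Invented markup; the real page source is not shown in the thread.
page = '<a href="profile.cfm?friendid=184040237"><img src="pic.jpg">Steve McQueen</a>'

pos = position(page, "friendid=184040237")   # where the known ID sits
start = position(page, ">", pos, 2) + 1      # just past the 2nd ">" after it
end = position(page, "<", start, 1)          # the next "<" closes the name
name = middle(page, start, end - start) if pos else ""
print(name)
```

If the real markup has a different number of tags between the ID and the name, the occurrence argument (2 here) is the value to adjust.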

also making a new record for each unique found id

I don't think you gave us enough information to make a suggestion.

Medusa Productions · May 31, 2008

thank you so much guys,

I still can't seem to work it out.

Here is a better example:

How would I pull each instance of the 6-8 digit strings found after "friendid=" in the following HTML source I scraped from the web viewer?

I really appreciate your help as I've been trying to figure this out for a couple of weeks to no avail. Thanks

D

Edited (Sorry)

Edited June 1, 2008 by Guest

comment · June 1, 2008

Please use an attachment next time.

Since the number of occurrences of the search string is unknown, you need some sort of recursive or looping procedure. It seems like a script would be most suitable here. See a rough sketch in the attached file. It extracts the IDs and names as lists - I presume you'd eventually want to create records from the parsed data. Another thing that needs to be added is the replacement of HTML entities with the corresponding characters.

ParseFriends.fp7.zip
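
The attached file is not reproduced in this thread, so the following is only a rough Python sketch of the general idea, not the script itself: keep asking for the next occurrence of the marker until none is left, and decode HTML entities afterwards. `position` is a hypothetical helper imitating FileMaker's Position, `all_ids` is a name made up for this example, and the sample string is invented. `html.unescape` is Python's standard-library entity decoder.

```python
import html


def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1
    return idx


def all_ids(text, marker="friendid="):
    """Collect the run of digits that follows every occurrence of marker."""
    ids = []
    n = 1
    while True:
        pos = position(text, marker, 1, n)   # n-th occurrence, 0 when exhausted
        if pos == 0:
            break
        start = pos + len(marker)            # 1-based position of the first digit
        end = start
        while end <= len(text) and text[end - 1].isdigit():
            end += 1
        ids.append(text[start - 1:end - 1])
        n += 1
    return ids


# Invented sample with two links; the real page is much longer.
sample = 'x?friendid=870987">Tom &amp; Jerry</a> y?friendid=184040237">Steve McQueen</a>'

print(all_ids(sample))
print(html.unescape("Tom &amp; Jerry"))
```

The loop stops on its own once the marker is no longer found, so the number of occurrences never has to be known in advance.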

Medusa Productions · June 1, 2008

You did it! That is so awesome. Thank you, thank you. I just learned so much from that, I can't thank you enough. And I promise I won't drop 30-page source files into the forum again.

YES!

mr_vodka · June 1, 2008

I promise I won't drop 30-page source files into the forum again.

Well yes, especially when they contain profanity. :)

Medusa Productions · June 2, 2008

I've been looking at the code, but I'm not sure exactly how the script is finding the name. It also seems that it doesn't always work, but it's still a great script. Any chance of a short explanation? Also, what direction would you head in if you wanted to make each of the numbers a new record with the related name?

Thanks!

Medusa Productions · June 2, 2008

Here is the error as it is occurring in my solution. I think the error is due to my original example, not your coding. Thanks in advance.

ParseFriends.fp7.zip

comment · June 2, 2008

The script looks for the n-th occurrence of "friendid=" in the text. Once it has found it, it extracts the ID which begins 9 characters (the length of "friendid=") further and ends with the first quotation mark. Similarly, the name begins after the first following carriage return and ends with another carriage return. After extracting the ID and the name, n is increased by 2 (because there is another paragraph containing "friendid=" in the same block).

This should work as long as the text is consistent in following the rules assumed above.
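
To make those rules concrete, here is a minimal Python sketch that follows them step by step. It is an illustration of the description above, not a transcription of the attached script. `position` and `middle` are hypothetical helpers imitating FileMaker's Position and Middle, `parse_friends` is a made-up name, `"\r"` stands in for the carriage return, and the sample text is invented so that each block contains "friendid=" twice.

```python
def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1
    return idx


def middle(text, start, length):
    """Hypothetical stand-in for FileMaker's Middle (1-based start)."""
    return text[start - 1:start - 1 + length]


def parse_friends(text, marker="friendid="):
    ids, names = [], []
    n = 1
    while True:
        pos = position(text, marker, 1, n)              # n-th occurrence
        if pos == 0:
            break
        id_start = pos + len(marker)                    # 9 characters further
        id_end = position(text, '"', id_start, 1)       # first quotation mark
        name_start = position(text, "\r", id_end, 1) + 1  # after the next return
        name_end = position(text, "\r", name_start, 1)    # up to the one after
        ids.append(middle(text, id_start, id_end - id_start))
        names.append(middle(text, name_start, name_end - name_start))
        n += 2                                          # skip the block's 2nd link
    return ids, names


# Invented sample: two blocks, each mentioning its friendid twice.
text = (
    '<a href="view?friendid=870987">\rTom Smith\r</a>\r'
    '<a href="mail?friendid=870987">msg</a>\r'
    '<a href="view?friendid=184040237">\rSteve McQueen\r</a>\r'
    '<a href="mail?friendid=184040237">msg</a>\r'
)

ids, names = parse_friends(text)
print(ids)
print(names)
```

The sketch has no guards for text that breaks the assumed rules; a missing quotation mark or carriage return would give a wrong slice, which matches the caveat about consistency.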

To create records, put the extracted ID and name into $variables (instead of the Set Field[] steps), go to a layout of the contacts table, create a new record, set the ID and the Name fields to the corresponding $variables, return to the original layout and continue with the loop.

Or, if the text is in a global field in the contacts table, just create a new record at each iteration, set the fields, commit the record and continue with the loop.
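
As a rough Python sketch of that loop, a list of dictionaries can stand in for the contacts table, with one "record" created per extracted pair. Skipping IDs that were already seen is an assumption added here because the original question asked for a record per unique ID; `create_records` and the field names are made up for the example.

```python
def create_records(pairs):
    """Build one record per unique ID from (id, name) pairs, keeping first seen."""
    records = []
    seen = set()
    for friend_id, name in pairs:
        if friend_id in seen:      # assumption: duplicates are not wanted
            continue
        seen.add(friend_id)
        # Stand-in for: New Record, Set Field [ID], Set Field [Name], Commit
        records.append({"ID": friend_id, "Name": name})
    return records


# Invented sample data with one repeated ID.
pairs = [
    ("870987", "Tom Smith"),
    ("184040237", "Steve McQueen"),
    ("870987", "Tom Smith"),
]

records = create_records(pairs)
print(records)
```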

comment · June 2, 2008

Where is the error?

Medusa Productions · June 2, 2008

Sorry, it's not an error so much as that the name doesn't always follow (or isn't always followed by) a carriage return. I can't thank you enough for your help and advice.

D

comment · June 2, 2008

Could you be more specific? In your file, there are 84 occurrences of "friendid=" in the text - and the script manages to extract exactly 42 IDs and 42 names. Which name is not followed by a carriage return?

Medusa Productions · June 3, 2008

In the example I gave, your script worked perfectly. Unfortunately, I am designing the program to extract IDs from HTML pages that sometimes have varying syntax. Because of this, the script sometimes gives me the names along with a large amount of other data. I am well on my way to understanding this, though, so I may be able to figure it out on my own. In the previously attached fp7 file you can see the extra data I am receiving in addition to the IDs. Please let me know if there is any way I can repay you for your help. You have been a lifesaver.

D

LaRetta · June 3, 2008

Please let me know if there is any way I can repay you for your help.

Cash, credit card, Jewelry, Ferrari ... many things work very well. :laugh2: And Comment's real name is LaRetta and you can send the payment here.

comment · June 3, 2008

Sorry, I must have been looking at the wrong file.

What can I say - this code doesn't follow the same rules as the one before. It's easy to adapt the script to this (see attached), but the question is: will it hold for the next page you want to scrape? Not to mention that the site may change the coding at any time.

ParseFriends2.fp7.zip

comment · June 3, 2008

LOL, you're just jealous of all the money I made from selling Field Factory.

Medusa Productions · June 3, 2008

What can I say. You are a godsend. If you need any audio production work, hit me up.

Medusa Productions · June 3, 2008

bing bang! I'm making really good progress now and think I understand what the script is doing. I'm going to try and add the create new record aspect. Wish me luck.

Medusa Productions · June 3, 2008

got that down wow!

Now I'm trying to get fancy by looking for a couple more pieces of data, mainly the word following the Genre: tag after the found string. You can see the example, as well as the exact HTML page I will be parsing (by pressing search), in my attached example.

oh and i will try and stop bothering you guys. I've almost got this nailed.

ParseFriends3.fp7.zip

Edited June 3, 2008 by Guest

Medusa Productions · June 5, 2008

Here is a better example: how would you parse a simple HTML string that included friendid=?:?,Genre,City,Fans, with a comma as the delimiter?

I am still not completely clear on how to use Position to find new delimiters after the found search string. Thanks

Edited June 5, 2008 by Guest

comment · June 5, 2008

The Position function has start and occurrence parameters which can be very useful here. Suppose your text is:

... friendid=A, GenreA, CityA, FansA, friendid=B, GenreB, CityB, FansB, friendid=C, GenreC, CityC, FansC, ...friendid=Z, GenreZ, CityZ, FansZ ...





and you want to extract the city of the n-th friend. The friend's data begins at:



Position ( text ; "friendid=" ; 1 ; n )



and his/her city is between the following 2nd and 3rd commas:


Let ( [

anchor = Position ( text ; "friendid=" ; 1 ; n ) ;

start = Position ( text ; ", " ; anchor ; 2 ) + 2 ; 

end = Position ( text ; ", " ; anchor ; 3 ) 

] ;

Middle ( text ; start ; end - start )

)
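
The same calculation can be checked outside FileMaker with a minimal Python sketch. `position` and `middle` are hypothetical helpers that imitate Position and Middle (1-based, 0 when not found), `city_of` is a made-up name, and the sample text is a shortened version of the one above.

```python
def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1
    return idx


def middle(text, start, length):
    """Hypothetical stand-in for FileMaker's Middle (1-based start)."""
    return text[start - 1:start - 1 + length]


def city_of(text, n):
    """City of the n-th friend: the text between the 2nd and 3rd ', ' after it."""
    anchor = position(text, "friendid=", 1, n)
    start = position(text, ", ", anchor, 2) + 2   # skip the 2-character delimiter
    end = position(text, ", ", anchor, 3)
    return middle(text, start, end - start)


text = (
    "friendid=A, GenreA, CityA, FansA, "
    "friendid=B, GenreB, CityB, FansB, "
    "friendid=C, GenreC, CityC, FansC"
)

print(city_of(text, 2))
```

Because both delimiter searches start at the anchor, the commas belonging to earlier friends are never counted.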

Edited June 5, 2008 by Guest
Added spaces after commas, because it was messing with the forums index view

Medusa Productions · June 5, 2008

hi comment,

That makes sense. If the delimiter is something other than a comma, will this still work (if you enter the symbols used as the delimiter), or does the anchor apply only to delimiters that are the same each time? Would you use a replace function to replace the separators?

Thank you so much for holding my hand through this. Here is a specific example of how the text that I am going to parse looks right now. (i have finally narrowed down exactly what data I need to parse).

Please don't hesitate to let me know someway I can repay you for your help. I am very grateful.

html_to_parse.txt

comment · June 5, 2008

In the above example, the anchor is established with no regard to the delimiters. After that is done, you look for delimiters that follow the current anchor, using the anchor position as the start parameter. The delimiter can be anything, you could even use a variable for this, e.g.


Let ( [

anchor = Position ( text ; $searchString ; 1 ; $i ) ;

start = Position ( text ; $delimiter ; anchor ; $j ) + Length ( $delimiter )  ; 

end = Position ( text ; $delimiter ; anchor ; $j + 1 ) 

] ;

Middle ( text ; start ; end - start )

)

If your script pre-defines the variables $searchString, $delimiter, $i and $j, you can then use this as a generic calculation to extract a specific "field" ($j) from a specific "record" ($i). This would be quite similar to writing a custom function with the same four parameters.
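
As a sketch of what such a four-parameter function might look like, here is a Python version. `extract_field`, `position`, and `middle` are names invented for this example (the last two imitate FileMaker's Position and Middle), the "|" delimiter and sample text are made up, and the guards that return an empty string when something is not found are an addition that the calculation above does not have.

```python
def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1
    return idx


def middle(text, start, length):
    """Hypothetical stand-in for FileMaker's Middle (1-based start)."""
    return text[start - 1:start - 1 + length]


def extract_field(text, search, delimiter, i, j):
    """Return "field" j of "record" i, or "" when it cannot be located."""
    anchor = position(text, search, 1, i)
    if anchor == 0:
        return ""
    before = position(text, delimiter, anchor, j)      # delimiter opening field j
    after = position(text, delimiter, anchor, j + 1)   # delimiter closing field j
    if before == 0 or after == 0:
        return ""
    start = before + len(delimiter)
    return middle(text, start, after - start)


# Invented sample using "|" instead of a comma.
text = "friendid=A|GenreA|CityA|FansA|friendid=B|GenreB|CityB|FansB|"

print(extract_field(text, "friendid=", "|", 1, 2))
print(extract_field(text, "friendid=", "|", 2, 3))
```

Note that the last field of the last record is only found because the sample ends with a trailing delimiter; without it the closing delimiter search fails and the sketch returns an empty string.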

Here is a specific example of how the text that I am going to parse

It seems to be an rtf file? Not that that should make a difference, but the name indicates html.

Medusa Productions · June 5, 2008

Thanks, it is going to be an HTML file, but I copied it into a text document.

thanks for your help I will try that now.

D

Medusa Productions · June 5, 2008

I know this is a basic question, but are spaces regarded as characters?

Lee Smith · June 5, 2008

Yes

comment · June 5, 2008

Mostly yes (but depends on the context of the question).

Medusa Productions · June 5, 2008

dang,

i still can't figure out what I'm doing wrong. Your name solution works perfectly though.

If you get a chance take a look at this. Thanks for all your help.

ParseFriends3.fp7.zip

comment · June 5, 2008

I don't get your file. I have taken my previous file and modified it to create new records instead of piling up lists. Hopefully you'll be able to adapt this.

ParseFriends2b.fp7.zip

Medusa Productions · June 6, 2008

okay I know you guys are probably tired of my questions, but I still can't get this to work properly. I am having a beautiful time parsing the html but my code breaks down after name and id.

How would you parse the 3rd and 4th pieces of data from this line of HTML?

I promise I will leave you guys alone once I figure this out. i have learned so much but still can't get the genre, city and plays to parse correctly.


"Panic At The Disco

										
[b]Genre:[/b] Rock / Big Beat / Techno
[b]Location:[/b] Las Vegas, Nevada
[b]Last Update:[/b]
23 May 2008, 17:20
[b]Plays:[/b] 202,107,106
[b]Views:[/b] 40,554,775
[b]Fans:[/b] 1,411,855
							"

thanks heaps.

Edited June 6, 2008 by Guest
Added Code tag

comment · June 6, 2008

Note:

Please do not post HTML directly in the message - it gets modified by the forum software. Use a 'code' tag, or an attachment.

This is again a different problem, but the technique is the same: I would look for the first occurrence of "Location:", starting from the anchor (i.e. the position of "friendID=" in the text). That would give me the starting point.
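
A minimal Python sketch of that anchor-then-label search follows. `position` and `middle` are hypothetical helpers imitating FileMaker's Position and Middle, `field_after_label` is a made-up name, and the sample text (including the friendid value) is invented. The sketch also has to decide where the value ends; it assumes the value runs to the next line break, which is a guess about the page rather than something the thread confirms.

```python
def position(text, search, start=1, occurrence=1):
    """Hypothetical stand-in for FileMaker's Position: 1-based, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1
    return idx


def middle(text, start, length):
    """Hypothetical stand-in for FileMaker's Middle (1-based start)."""
    return text[start - 1:start - 1 + length]


def field_after_label(text, anchor, label, terminator="\n"):
    """Value following the first label at or after anchor, up to terminator."""
    label_pos = position(text, label, anchor, 1)
    if label_pos == 0:
        return ""
    start = label_pos + len(label)
    end = position(text, terminator, start, 1)
    if end == 0:                      # no terminator: take the rest of the text
        end = len(text) + 1
    return middle(text, start, end - start).strip()


# Invented sample; the friendid value and layout are assumptions.
page = (
    'friendid=12345">Panic At The Disco\n'
    "Genre: Rock / Big Beat / Techno\n"
    "Location: Las Vegas, Nevada\n"
    "Plays: 202,107,106\n"
)

anchor = position(page, "friendid=", 1, 1)
print(field_after_label(page, anchor, "Location:"))
print(field_after_label(page, anchor, "Genre:"))
```

The same function handles Genre, Plays, or Fans by changing only the label, as long as each label appears after the anchor of the friend it belongs to.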

The ending point seems to be the first occurrence of "
", starting from the above starting point.

Medusa Productions · June 6, 2008

sorry,

Thanks, that is what I thought I was doing; it must be some kind of simple mistake. I'll keep on it. And yeah, no more HTML.

Medusa Productions · June 6, 2008

I finally got it. Thank you thank you.
