Medusa Productions Posted May 27, 2008 Hi guys, first post here. I am looking for a way to parse each instance of "friendid=870987" in HTML code I have scraped with the web browser. This is my first foray into parsing data, and after several failed attempts I'm getting a little discouraged. Any help would be much appreciated. Dave. An example would be pulling this data from the following HTML ----> [*] Steve McQueen >>>>>> 184040237 Steve McQueen, and also making a new record for each unique ID found. Thanks
comment Posted May 27, 2008 Assuming there is some consistency here (hard to tell from a single example), you could try:
Let ( [ pos = Position ( text ; "friendid=184040237" ; 1 ; 1 ) ; start = Position ( text ; ">" ; pos ; 2 ) + 1 ; end = Position ( text ; "<" ; start ; 1 ) ] ; Case ( pos ; Middle ( text ; start ; end - start ) ) )
"also making a new record for each unique found id": I don't think you gave us enough information to make a suggestion.
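For readers who want to trace the logic of that calculation outside FileMaker, here is a minimal Python sketch of the same steps. The helpers `position` and `middle` are rough stand-ins written for this example (they are not FileMaker functions), and the sample markup is hypothetical, since the original page source did not survive in the thread.

```python
def position(text, search, start=1, occurrence=1):
    """Rough stand-in for FileMaker's Position(): 1-based index, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1  # convert to 1-based and step past this match's first character
    return idx


def middle(text, start, count):
    """Rough stand-in for FileMaker's Middle(): empty for a non-positive count."""
    return text[start - 1:start - 1 + count] if count > 0 else ""


def name_for_id(text, friend_id):
    # Find the ID, skip to just after the 2nd ">" that follows it,
    # and read up to the next "<".
    pos = position(text, "friendid=" + friend_id)
    if pos == 0:
        return ""
    start = position(text, ">", pos, 2) + 1
    end = position(text, "<", start)
    return middle(text, start, end - start)


# Hypothetical markup, assumed only for illustration.
sample = '<a href="http://example.com/?friendid=184040237"><b>Steve McQueen</b></a>'
print(name_for_id(sample, "184040237"))
```

As in the calculation, the result depends entirely on the name sitting after the second ">" that follows the ID; different markup would need a different occurrence number.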
Medusa Productions Posted May 31, 2008 Author (edited) Thank you so much, guys. I still can't seem to work it out. Here is a better example: how would I pull each instance of the 6-8 digit number strings found after "friendid=" in the following HTML source I scraped from the Web Viewer? I really appreciate your help, as I've been trying to figure this out for a couple of weeks to no avail. Thanks, D. Edited (sorry). Edited June 1, 2008 by Guest
comment Posted June 1, 2008 Please use an attachment next time. Since the number of occurrences of the search string is unknown, you need some sort of recursive procedure. It seems like a script would be most suitable here. See a rough sketch in the attached file. It extracts the IDs and names as lists - I presume you'd eventually want to create records from the parsed data. Another thing that needs to be added is the replacement of HTML entities with the corresponding characters. ParseFriends.fp7.zip
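The attached FileMaker script is not reproduced in the thread, so the following is only a Python sketch of the general idea described in the post: keep searching for the next "friendid=" until none is left, collect each unique ID, and decode HTML entities in extracted text. The function name `extract_ids` and the sample markup are assumptions made for this illustration; `html.unescape` is from Python's standard library.

```python
import html


def extract_ids(text, marker="friendid="):
    """Collect each unique run of digits that follows the marker, in order."""
    ids = []
    start = 0
    while True:
        hit = text.find(marker, start)
        if hit == -1:
            break  # no more occurrences
        begin = hit + len(marker)
        end = begin
        while end < len(text) and text[end].isdigit():
            end += 1
        friend_id = text[begin:end]
        if friend_id and friend_id not in ids:
            ids.append(friend_id)
        start = end  # always past the marker, so the loop makes progress
    return ids


# Hypothetical markup, assumed only for illustration.
sample = (
    '<a href="?friendid=870987">pic</a> '
    '<a href="?friendid=870987">Some Band</a> '
    '<a href="?friendid=184040237">Tom &amp; Jerry</a>'
)
print(extract_ids(sample))
print(html.unescape("Tom &amp; Jerry"))
```

A plain loop is used here rather than recursion; either works when the number of occurrences is unknown in advance.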
Medusa Productions Posted June 1, 2008 Author You did it! That is so awesome. Thank you, thank you. I just learned so much from that; I can't thank you enough. I promise I won't drop 30-page source files into the forum again, either. YES!
mr_vodka Posted June 1, 2008 "I promise I won't drop 30 page sources files into the forum again as well." Well yes, especially when they contain profanity. :)
Medusa Productions Posted June 2, 2008 Author I've been looking at the code, but I'm not sure exactly how the script is finding the name. It also seems that it doesn't always work, but it's still a great script. Any chance of a short explanation? Also, what direction would you head in if you wanted to make each of the numbers a new record with a related name? Thanks!
Medusa Productions Posted June 2, 2008 Author Here is the error as it is occurring in my solution. I think the error is due to my original example, not your coding. Thanks in advance. ParseFriends.fp7.zip
comment Posted June 2, 2008 The script looks for the n-th occurrence of "friendid=" in the text. Once it has found it, it extracts the ID, which begins 9 characters (the length of "friendid=") further and ends with the first quotation mark. Similarly, the name begins after the first following carriage return and ends with another carriage return. After extracting the ID and the name, n is increased by 2 (because there is another paragraph containing "friendid=" in the same block). This should work as long as the text is consistent in following the rules assumed above. To create records, put the extracted ID and name into $variables (instead of the Set Field[] steps), go to a layout of the contacts table, create a new record, set the ID and the Name fields to the corresponding $variables, return to the original layout and continue with the loop. Or, if the text is in a global field in the contacts table, just create a new record at each iteration, set the fields, commit the record and continue with the loop.
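The rules described in that post can be sketched in Python as follows. This is not the attached FileMaker script, only an illustration of the same steps under the same assumptions: the ID ends at the first quotation mark, the name sits between the next two carriage returns, and every block contains "friendid=" twice. The names `nth_index` and `parse_friends` and the sample text are hypothetical, and a list of dictionaries stands in for the new records.

```python
CR = "\r"  # carriage return, the paragraph separator assumed here
MARKER = "friendid="


def nth_index(text, search, n):
    """0-based index of the n-th occurrence of search, or -1 if there is none."""
    idx = -1
    for _ in range(n):
        idx = text.find(search, idx + 1)
        if idx == -1:
            return -1
    return idx


def parse_friends(text):
    records = []
    n = 1
    while True:
        hit = nth_index(text, MARKER, n)
        if hit == -1:
            break
        id_start = hit + len(MARKER)       # 9 characters past the match
        id_end = text.find('"', id_start)  # ID ends at the first quotation mark
        if id_end == -1:
            break
        name_start = text.find(CR, id_end)  # name begins after the next return
        if name_start == -1:
            break
        name_start += 1
        name_end = text.find(CR, name_start)  # and ends at the following return
        if name_end == -1:
            break
        records.append({"id": text[id_start:id_end],
                        "name": text[name_start:name_end]})
        n += 2  # skip the second "friendid=" in the same block
    return records


# Hypothetical page source shaped to follow the rules above.
sample = (
    '<a href="http://example.com/?friendid=111111">\r'
    'Alice Example\r'
    '</a><a href="http://example.com/?friendid=111111">photo</a>\r'
    '<a href="http://example.com/?friendid=222222">\r'
    'Bob Example\r'
    '</a><a href="http://example.com/?friendid=222222">photo</a>\r'
)
for record in parse_friends(sample):
    print(record)
```

If a page breaks any of these rules (for example, a name that is not followed by a carriage return), the extracted name will run on into unrelated text, which matches the symptom reported later in the thread.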
Medusa Productions Posted June 2, 2008 Author Sorry, not an error so much as it appears that the name doesn't always follow (or isn't always followed by) a carriage return. I can't thank you enough for your help and advice. D
comment Posted June 2, 2008 Could you be more specific? In your file, there are 84 occurrences of "friendid=" in the text - and the script manages to extract exactly 42 IDs and 42 names. Which name is not followed by a carriage return?
Medusa Productions Posted June 3, 2008 Author In the example I gave, your script worked perfectly. Unfortunately, I am designing the program to extract IDs from HTML pages that sometimes have varying syntax. Because of this, it seems that sometimes the script gives me the names along with a large amount of other data. I am well on my way to understanding this, though, so I may be able to figure it out on my own. In the previously attached fp7 file you can see the extra data I am receiving in addition to the IDs. Please let me know if there is any way I can repay you for your help. You have been a lifesaver. D
LaRetta Posted June 3, 2008 "Please let me know if there is anyway I can repay you for your help." Cash, credit card, jewelry, Ferrari ... many things work very well. And Comment's real name is LaRetta and you can send the payment here.
comment Posted June 3, 2008 Sorry, I must have been looking at the wrong file. What can I say - this code doesn't follow the same rules as the one before. It's easy to adapt the script to this (see attached), but the question is: will it hold for the next page you want to scrape? Not to mention that the site may change the coding at any time. ParseFriends2.fp7.zip
comment Posted June 3, 2008 LOL, you're just jealous of all the money I made from selling Field Factory.
Medusa Productions Posted June 3, 2008 Author What can I say? You are a godsend. If you need any audio production work, hit me up.
Medusa Productions Posted June 3, 2008 Author Bing bang! I'm making really good progress now and think I understand what the script is doing. I'm going to try to add the create-new-record aspect. Wish me luck.
Medusa Productions Posted June 3, 2008 Author (edited) Got that down, wow! Now I'm trying to get fancy by looking for a couple more pieces of data, mainly the word following the Genre: tag after the found string. You can see the example, as well as the exact HTML page I will be parsing (by pressing search), in my attached example. Oh, and I will try to stop bothering you guys. I've almost got this nailed. ParseFriends3.fp7.zip Edited June 3, 2008 by Guest
Medusa Productions Posted June 5, 2008 Author (edited) Here is a better example: how would you parse a simple HTML string that included friendid=?:?,Genre,City,Fans, with a comma as the delimiter? I am still not completely clear on how to use Position to find new delimiters after the found search string. Thanks. Edited June 5, 2008 by Guest
comment Posted June 5, 2008 (edited) The Position function has start and occurrence parameters which can be very useful here. Suppose your text is:
... friendid=A, GenreA, CityA, FansA, friendid=B, GenreB, CityB, FansB, friendid=C, GenreC, CityC, FansC, ...friendid=Z, GenreZ, CityZ, FansZ ...
and you want to extract the city of the n-th friend. The friend's data begins at:
Position ( text ; "friendid=" ; 1 ; n )
and his/her city is between the following 2nd and 3rd commas:
Let ( [ anchor = Position ( text ; "friendid=" ; 1 ; n ) ; start = Position ( text ; ", " ; anchor ; 2 ) + 2 ; end = Position ( text ; ", " ; anchor ; 3 ) ] ; Middle ( text ; start ; end - start ) )
Edited June 5, 2008 by Guest. Added spaces after commas, because it was messing with the forums index view
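To see the anchor idea work on concrete data, here is a small Python sketch that mirrors the calculation step by step. The helpers `position` and `middle` are approximations of the FileMaker functions written for this example, and `city_of_friend` is a hypothetical name.

```python
def position(text, search, start=1, occurrence=1):
    """Rough stand-in for FileMaker's Position(): 1-based index, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1  # convert to 1-based and step past this match's first character
    return idx


def middle(text, start, count):
    """Rough stand-in for FileMaker's Middle(): empty for a non-positive count."""
    return text[start - 1:start - 1 + count] if count > 0 else ""


def city_of_friend(text, n):
    anchor = position(text, "friendid=", 1, n)  # where the n-th friend begins
    if anchor == 0:
        return ""
    start = position(text, ", ", anchor, 2) + 2  # just past the 2nd ", "
    end = position(text, ", ", anchor, 3)        # the 3rd ", "
    return middle(text, start, end - start)


text = ("friendid=A, GenreA, CityA, FansA, "
        "friendid=B, GenreB, CityB, FansB, "
        "friendid=C, GenreC, CityC, FansC, ")
print(city_of_friend(text, 2))
```

The delimiter occurrences are counted from the anchor, not from the start of the text, which is why the same two numbers (2 and 3) work for every friend.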
Medusa Productions Posted June 5, 2008 Author Hi comment, that makes sense. If the delimiter is something other than a comma, will this still work if you enter the symbols used as the delimiter, or does the anchor apply only to delimiters that are the same each time? Would you use a Replace function to replace the separators? Thank you so much for holding my hand through this. Here is a specific example of how the text that I am going to parse looks right now (I have finally narrowed down exactly what data I need to parse). Please don't hesitate to let me know some way I can repay you for your help. I am very grateful. html_to_parse.txt
comment Posted June 5, 2008 In the above example, the anchor is established with no regard to the delimiters. After that is done, you look for delimiters that follow the current anchor, using the anchor position as the start parameter. The delimiter can be anything; you could even use a variable for this, e.g.
Let ( [ anchor = Position ( text ; $searchString ; 1 ; $i ) ; start = Position ( text ; $delimiter ; anchor ; $j ) + Length ( $delimiter ) ; end = Position ( text ; $delimiter ; anchor ; $j + 1 ) ] ; Middle ( text ; start ; end - start ) )
If your script pre-defines the variables $searchString, $delimiter, $i and $j, you can then use this as a generic calculation to extract a specific "field" ($j) from a specific "record" ($i). This would be quite similar to writing a custom function with the same four parameters.
"Here is a specific example of how the text that I am going to parse": It seems to be an rtf file? Not that that should make a difference, but the name indicates html.
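The four-parameter version translates naturally into a function. The Python sketch below follows the calculation literally; `extract_field`, `position` and `middle` are names made up for this illustration, and the " | " delimiter and sample text are assumptions chosen to show that the delimiter need not be a comma.

```python
def position(text, search, start=1, occurrence=1):
    """Rough stand-in for FileMaker's Position(): 1-based index, 0 if not found."""
    idx = start - 1
    for _ in range(occurrence):
        idx = text.find(search, idx)
        if idx == -1:
            return 0
        idx += 1  # convert to 1-based and step past this match's first character
    return idx


def middle(text, start, count):
    """Rough stand-in for FileMaker's Middle(): empty for a non-positive count."""
    return text[start - 1:start - 1 + count] if count > 0 else ""


def extract_field(text, search_string, delimiter, i, j):
    """Return "field" j of "record" i, counting delimiters from the anchor."""
    anchor = position(text, search_string, 1, i)
    if anchor == 0:
        return ""  # there is no i-th record
    found = position(text, delimiter, anchor, j)
    if found == 0:
        return ""  # there is no j-th delimiter after the anchor
    start = found + len(delimiter)
    end = position(text, delimiter, anchor, j + 1)
    return middle(text, start, end - start)


text = ("friendid=1 | Rock | Las Vegas | 100 | "
        "friendid=2 | Jazz | Chicago | 200 | ")
print(extract_field(text, "friendid=", " | ", 2, 2))
```

The two early returns are an addition for this sketch; the original calculation has no such guards and relies on the text being well formed.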
Medusa Productions Posted June 5, 2008 Author Thanks. It is going to be an HTML file, but I copied it into a text document. Thanks for your help; I will try that now. D
Medusa Productions Posted June 5, 2008 Author I know this is a basic question, but are spaces regarded as characters?
comment Posted June 5, 2008 Mostly yes (but it depends on the context of the question).
Medusa Productions Posted June 5, 2008 Author Dang, I still can't figure out what I'm doing wrong. Your name solution works perfectly, though. If you get a chance, take a look at this. Thanks for all your help. ParseFriends3.fp7.zip
comment Posted June 5, 2008 I don't get your file. I have taken my previous file and modified it to create new records instead of piling up lists. Hopefully you'll be able to adapt this. ParseFriends2b.fp7.zip
Medusa Productions Posted June 6, 2008 Author (edited) Okay, I know you guys are probably tired of my questions, but I still can't get this to work properly. I am having a beautiful time parsing the HTML, but my code breaks down after name and ID. How would you parse the 3rd and 4th pieces of data from this line of HTML? I promise I will leave you guys alone once I figure this out. I have learned so much but still can't get the genre, city and plays to parse correctly. "Panic At The Disco [b]Genre:[/b] Rock / Big Beat / Techno [b]Location:[/b] Las Vegas, Nevada [b]Last Update:[/b] 23 May 2008, 17:20 [b]Plays:[/b] 202,107,106 [b]Views:[/b] 40,554,775 [b]Fans:[/b] 1,411,855 " Thanks heaps. Edited June 6, 2008 by Guest. Added Code tag
comment Posted June 6, 2008 Note: Please do not post HTML directly in the message - it gets modified by the forum software. Use a 'code' tag, or an attachment. This is again a different problem, but the technique is the same: I would look for the first occurrence of "Location:", starting from the anchor (i.e. the position of "friendID=" in the text). That would give me the starting point. The ending point seems to be the first occurrence of "", starting from the above starting point.
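The label-based variant can be sketched in Python as below. Two things are assumed here and are not from the thread: the end marker, because the actual closing string was stripped from the post by the forum software, and the ID at the start of the sample. The sketch uses the "[b]...[/b]" markers exactly as they appear in the quoted sample line; `field_after` is a hypothetical name.

```python
def field_after(text, anchor_marker, n, label, end_marker):
    """Text between label and end_marker, searching from the n-th anchor."""
    anchor = -1
    for _ in range(n):  # locate the n-th anchor (0-based indexes throughout)
        anchor = text.find(anchor_marker, anchor + 1)
        if anchor == -1:
            return ""
    label_at = text.find(label, anchor)  # first label at or after the anchor
    if label_at == -1:
        return ""
    start = label_at + len(label)
    end = text.find(end_marker, start)   # first end marker after the start
    if end == -1:
        end = len(text)                  # last field: read to the end of the text
    return text[start:end].strip()


# The ID is made up; the rest is the sample line as quoted in the thread.
sample = (
    'friendid=12345678">Panic At The Disco [b]Genre:[/b] Rock / Big Beat / Techno '
    '[b]Location:[/b] Las Vegas, Nevada [b]Last Update:[/b] 23 May 2008, 17:20 '
    '[b]Plays:[/b] 202,107,106 [b]Views:[/b] 40,554,775 [b]Fans:[/b] 1,411,855 '
)
print(field_after(sample, "friendid=", 1, "[b]Location:[/b]", "[b]"))
```

One caveat of this approach: if a record lacks the label, the search continues into the next record and returns that record's value, so a stricter version would also bound the search by the position of the following anchor.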
Medusa Productions Posted June 6, 2008 Author Sorry, thanks. That is what I thought I was doing; it must be some kind of simple mistake. I'll keep on it. And yeah, no more HTML.
Medusa Productions Posted June 6, 2008 Author I finally got it. Thank you, thank you.