Tony Diaz Posted June 18, 2020

Trying to capture text from the web viewer: I want to get "On-Line Systems".

Usually this kind of value comes right after "Published by", but since this one is a hyperlink, there is varied content between that marker and the actual string I want:

;">Published by</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line Systems</a></div><div

With this script I can capture:

"<div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line Systems"

# PUBLISHER
Set Variable [ $prefix ; Value: ">Published by</div>" ]
Set Variable [ $suffix ; Value: "</a></div><div" ]
Set Variable [ $start ; Value: Position ( $text ; $prefix ; 1 ; 1 ) ]
If [ $start ]
    Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text ; $suffix ; start ; 1 ) ] ; Middle ( $text ; start ; end - start ) ) ]
    Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [Char(10); ""] ; ["'"; ""] )) ]
    Set Field [ Moby_Games::Publisher ; $result ]
End If

I'm not having any success narrowing that down so that I can parse for:

">Published by</div>" {any characters} "<a href="https://www.mobyg" {any characters} "\">"

I'm getting a huge blob, as if it's matching a whole lot more between $prefix and $suffix. Can I have it stop at the first instance of $suffix?

Their API does not seem to expose all the data that the web page for the same listing/record shows, so I have to get a few things from the HTML source instead.
comment Posted June 18, 2020

1 hour ago, Tony Diaz said:
">Published by</div>" {any characters} "<a href="https://www.mobyg" {any characters} "\">"

1. Start by finding the position of ">Published by</div>" (marker 1).
2. Then find the position of the first "<a href="https://www.mobyg", starting from marker 1 (marker 2).
3. Next, find the position of the first ">", starting from marker 2.
4. Make that your start and look for the position of the "</a>" suffix from there.
Tony Diaz Posted June 19, 2020 Author

I get what you're saying, but I couldn't figure out how to structure it. I also found a custom function that looked like it would work right after the Substitute / Trim step, but I couldn't get it to do anything either:

Let ([
    searchStringSplit = Substitute(searchString; "*"; ¶);
    beginString = GetValue(searchStringSplit; 1);
    endString = GetValue(searchStringSplit; 2);
    lenBegin = Length(beginString);
    lenEnd = Length(endString);
    ptBegin = Position(text; beginString; 1; 1);
    ptEnd = Position(text; endString; ptBegin + lenBegin; 1) + lenEnd;
    lenAll = ptEnd - ptBegin;
    keepText = Middle(text; ptBegin + lenBegin; lenAll - (lenBegin + lenEnd));
    modText = Left(text; ptBegin - 1) & Substitute(replaceString; "*"; keepText);
    remainText = Middle(text; ptEnd; 9999999)
];
    Case(
        lenBegin > 0 and lenEnd > 0 and ptBegin > 0 and ptEnd > lenEnd;
        modText & SubstituteWildcardRange(remainText; searchString; replaceString);
        text
    )
)

...and I ended up with this laughable abomination (it's got to be completely silly):

# PUBLISHER
Set Variable [ $prefix ; Value: ">Published by</div>" ]
Set Variable [ $suffix ; Value: "></div><div style=\"font-size:" ]
Set Variable [ $start ; Value: Position ( $text ; $prefix ; 1 ; 1 ) ]
If [ not IsEmpty ($start) ]
    Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text ; $suffix ; start ; 1 ) ] ; Middle ( $text ; start ; end - start ) ) ]
    Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [" "; " "] ; ["><a"; " "] ; [Char(10); ""] ; ["'"; ""] )) ]
    // Set Variable [ $result ; Value: SubstituteWildcardRange ( $result ; "*" ; "12*34" ) ]
    Set Variable [ $text2 ; Value: $result ]
    Set Variable [ $prefix ; Value: "\">" ]
    Set Variable [ $suffix ; Value: "</a" ]
    Set Variable [ $start ; Value: Position ( $text2 ; $prefix ; 1 ; 1 ) ]
    If [ not IsEmpty ($start) ]
        Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text2 ; $suffix ; start ; 1 ) ] ; Middle ( $text2 ; start ; end - start ) ) ]
        Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [Char(10); ""] ; ["'"; ""] )) ]
    End If
    Set Field [ Moby_Games::Publisher ; $result ]
End If

It takes the first result and replaces any occurrences of "><" (there should only be one anyway), so that it then finds the first closing bracket and strips from there.

I also read that using negative numbers in Position would make it start the search from the end of the string, but I wasn't getting anywhere with those either, and it still involved a second embedded If.

I also realize that I didn't actually need the second If [ ], but for the sake of consistency ...
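On the negative-number idea: in FileMaker's Position function it is the fourth parameter (occurrence) that can be negative, which makes the scan run backward from the start position. The following is only a minimal sketch, assuming $result still holds the first capture described earlier in the thread (the blob that ends with the wanted text and whose last ">" closes the <a ...> tag); the names blob and lastGt are made up for illustration.

```
Let ( [
    // Assumption: $result is the first capture, ending in the wanted text
    blob = $result ;
    // Start at the last character and scan backward for the nearest ">"
    lastGt = Position ( blob ; ">" ; Length ( blob ) ; -1 )
] ;
    // Keep everything to the right of that ">"
    // (if no ">" is found, lastGt is 0 and the whole blob is returned)
    Right ( blob ; Length ( blob ) - lastGt )
)
```

With the capture shown in the first post, this should leave only the link text, without a second prefix/suffix pass.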
Wim Decorte Posted June 19, 2020

1 hour ago, Tony Diaz said:
I get what you're saying, but I wasn't figuring out how to structure it.

Not sure what you're saying here... @comment's answer is spot on. It seems like you're over-thinking the problem. Don't try to solve it as a one-step thing; do a two-step process as per comment's answer. First find the node that contains "Published by", which will give you a small chunk of text, and then search within that chunk for the href part. That way you don't need to care at all about what text sits between "Published by" and the start of the href.
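The two-step idea can be sketched as a single calculation. This is only an illustration under stated assumptions: HTML stands for whatever field or variable holds the page source, the variable names (mark, chunk, link, tagEnd) are invented here, and the chunk is assumed to end at the first "</a>" after "Published by".

```
Let ( [
    // Step 1: isolate the small chunk that follows "Published by"
    prefix = ">Published by</div>" ;
    mark = Position ( HTML ; prefix ; 1 ; 1 ) ;
    chunkStart = mark + Length ( prefix ) ;
    chunkEnd = Position ( HTML ; "</a>" ; chunkStart ; 1 ) ;
    chunk = Middle ( HTML ; chunkStart ; chunkEnd - chunkStart ) ;

    // Step 2: search only inside that chunk for the link
    link = Position ( chunk ; "<a href=" ; 1 ; 1 ) ;
    tagEnd = Position ( chunk ; ">" ; link ; 1 )
] ;
    // Return the link text, or nothing if any marker was not found
    If ( mark and chunkEnd and link and tagEnd ;
        Middle ( chunk ; tagEnd + 1 ; Length ( chunk ) )
    )
)
```

Because the second search runs against the chunk rather than the whole page, a later "<a href=" elsewhere in the source cannot be matched by accident.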
comment Posted June 19, 2020

Here is a simple example. Given the following HTML snippet:

;">Published by</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line Systems</a></div><div

this calculation:

Let ( [
    mark1 = Position ( HTML ; ">Published by</div>" ; 1 ; 1 ) ;
    mark2 = Position ( HTML ; "<a href=\"https://www.mobyg" ; mark1 ; 1 ) ;
    mark3 = Position ( HTML ; ">"; mark2 ; 1 ) ;
    start = mark3 + 1 ;
    end = Position ( HTML ; "</a>" ; mark3 ; 1 )
] ;
    If ( mark3 ; Middle ( HTML ; start ; end - start ) )
)

will return:

On-Line Systems