Trying to capture text from the web viewer:

[Screenshot of the page in the web viewer]

I want to get "On-Line Systems"
Usually this kind of value comes right after the "Published by" label, but because this one is a hyperlink, there is variable markup between the label and the actual string I want.

;">Published by</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line&nbsp;Systems</a></div><div

 

With this script I can capture
"<div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line&nbsp;Systems"

 

# PUBLISHER
Set Variable [ $prefix ; Value: ">Published by</div>" ] 
Set Variable [ $suffix ; Value: "</a></div><div" ] 
Set Variable [ $start ; Value: Position ( $text ; $prefix ; 1 ; 1 ) ] 
If [ $start ] 
Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text ; $suffix ; start ; 1 ) ] ; Middle ( $text ; start ; end - start ) ) ] 
Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [Char(10); ""] ; ["&#039;"; ""] )) ] 
Set Field [ Moby_Games::Publisher ; $result ] 
End If

I'm not having any success with minimizing that to where I can parse for:

">Published by</div>" {any characters} "<a href="https://www.mobyg" {any characters} "\">"

I'm getting a -huge- blob, as if it's matching a whole lot more between $prefix and $suffix. Can I have it stop at the first instance of $suffix?

Their API does not seem to expose all the data that the web page for the associated listing/record shows, so I must get a few things from the HTML source instead.

 

1 hour ago, Tony Diaz said:

">Published by</div>" {any characters} "<a href="https://www.mobyg" {any characters} "\">"

Start by finding the position of ">Published by</div>" (marker 1). Then find the position of the first "<a href="https://www.mobyg", searching from marker 1 (marker 2). Next, find the position of the first ">", searching from marker 2. The text you want starts right after that; look for the position of the "</a>" suffix from there to find where it ends.

 


I get what you're saying, but I wasn't figuring out how to structure it.

I also found a custom function which looks like it would have worked right after the substitute / trim that I couldn't seem to get to do anything either:

Let ([
searchStringSplit = Substitute(searchString; "*"; ¶); 
beginString = GetValue(searchStringSplit; 1); 
endString = GetValue(searchStringSplit; 2); 
lenBegin = Length(beginString);
lenEnd = Length(endString);
ptBegin = Position(text; beginString; 1; 1);
ptEnd = Position(text; endString; ptBegin + lenBegin; 1) + lenEnd;
lenAll = ptEnd - ptBegin; 
keepText = Middle(text; ptBegin + lenBegin; lenAll - (lenBegin + lenEnd)); 
modText = Left(text; ptBegin - 1) & Substitute(replaceString; "*"; keepText);
remainText = Middle(text; ptEnd; 9999999)
]; 

Case(
lenBegin > 0 and lenEnd > 0 and 
ptBegin > 0 and ptEnd > lenEnd;

modText & 
SubstituteWildcardRange(remainText; searchString; replaceString)

 ; text))
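
A note on the function as posted, traced by hand rather than run: it splits searchString at "*" and needs literal text on both sides of the asterisk. A searchString of just "*" gives two empty halves, so lenBegin and lenEnd are both 0 and the Case falls through to returning text unchanged. A minimal sketch of calls that should match, assuming the function is installed as SubstituteWildcardRange ( text ; searchString ; replaceString ) exactly as above; the sample string is a shortened, made-up stand-in for the over-long capture:

```filemaker
Let ( [
  // Hypothetical sample: a shortened version of the over-long capture
  sample = "<div style=\"font-size: 90%;\"><a href=\"https://www.mobygames.com/company/sierra-entertainment-inc\">On-Line&nbsp;Systems" ;
  // Drop the whole opening anchor tag: begin text "<a href=", end text ">"
  noAnchor = SubstituteWildcardRange ( sample ; "<a href=*>" ; "" ) ;
  // Drop the remaining opening div tag the same way
  noDiv = SubstituteWildcardRange ( noAnchor ; "<div*>" ; "" )
] ;
  noDiv
)
```

By that trace, the first call leaves the div tag followed by the link text, and the second call leaves On-Line&nbsp;Systems. Each pattern removes every matching range, so this only suits chunks that are already small.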

 

...and I ended up with this laughable abomination: (It's got to be completely silly)

# PUBLISHER
Set Variable [ $prefix ; Value: ">Published by</div>" ] 
Set Variable [ $suffix ; Value: "></div><div style=\"font-size:" ] 
Set Variable [ $start ; Value: Position ( $text ; $prefix ; 1 ; 1 ) ] 
If [ not IsEmpty ($start) ] 
     Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text ; $suffix ; start ; 1 ) ] ; Middle ( $text ; start ; end - start ) ) ] 
     Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [" "; " "] ; ["><a"; " "] ; [Char(10); ""] ; ["'"; ""] )) ] 
     // Set Variable [ $result ; Value: SubstituteWildcardRange ( $result ; "*" ; "12*34" ) ] 
     Set Variable [ $text2 ; Value: $result ] 
     Set Variable [ $prefix ; Value: "\">" ] 
     Set Variable [ $suffix ; Value: "</a" ] 
     Set Variable [ $start ; Value: Position ( $text2 ; $prefix ; 1 ; 1 ) ] 
     If [ not IsEmpty ($start) ] 
          Set Variable [ $result ; Value: Let ( [ start = $start + Length ( $prefix ) ; end = Position ( $text2 ; $suffix ; start ; 1 ) ] ; Middle ( $text2 ; start ; end - start ) ) ] 
          Set Variable [ $result ; Value: Trim( Substitute( TrimAll($result; 0;0 ) ; [Char(10); ""] ; ["'"; ""] )) ] 
     End If
     Set Field [ Moby_Games::Publisher ; $result ] 
End If

Take the first result, replace any occurrences of "><" (should only be one anyway) so that it will then find the first closing bracket and strip from there.

I also read that using negative numbers in Position would make it start the search from the end of the string, but I wasn't getting anywhere with those either, and it still involved a second embedded If. I also realize that I didn't actually need the second If [ ], but for the sake of consistency ...
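
On the negative-number idea: in Position it is the fourth parameter, occurrence, that can be negative, and a negative value makes the scan run backward from the start position rather than from the end of the string. A minimal sketch of how that could apply here, assuming the page source is in $text as in the scripts above and that the link text itself contains no ">" character (the variable names inside the Let are made up for illustration):

```filemaker
Let ( [
  labelPos = Position ( $text ; ">Published by</div>" ; 1 ; 1 ) ;
  // First closing anchor tag after the label
  closePos = If ( labelPos ; Position ( $text ; "</a>" ; labelPos ; 1 ) ) ;
  // Scan backward from there to the ">" that ends the opening <a ...> tag
  openPos = If ( closePos ; Position ( $text ; ">" ; closePos ; -1 ) )
] ;
  // Return the text between that ">" and "</a>", or nothing if a marker is missing
  If ( openPos ; Middle ( $text ; openPos + 1 ; closePos - openPos - 1 ) )
)
```

This needs no second prefix/suffix pass, because whatever markup sits between the label and the link is skipped by searching backward from the closing tag.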

1 hour ago, Tony Diaz said:

I get what you're saying, but I wasn't figuring out how to structure it.

Not sure what you're saying here... @comment's answer is spot on.

Seems like you're overthinking the problem. Don't try to solve it in one step; use a two-step process as per Comment's answer. First find the node that contains "Published by", which gives you a small chunk of text, and then search within that chunk for the href part. That way you don't need to care at all what text sits between "Published by" and the start of the href.
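
The two steps could be sketched as a single calculation along these lines. This is only an illustration, assuming the page source is in $text as in the scripts above and that the publisher block ends at the first "</div>" after the label; the names inside the Let are made up:

```filemaker
Let ( [
  // Step 1: isolate the small chunk that follows the "Published by" label
  label = ">Published by</div>" ;
  labelPos = Position ( $text ; label ; 1 ; 1 ) ;
  chunkStart = labelPos + Length ( label ) ;
  chunkEnd = Position ( $text ; "</div>" ; chunkStart ; 1 ) ;
  chunk = Middle ( $text ; chunkStart ; chunkEnd - chunkStart ) ;
  // Step 2: inside the chunk, take what lies between the end of <a ...> and </a>
  anchorPos = Position ( chunk ; "<a href=" ; 1 ; 1 ) ;
  textStart = Position ( chunk ; ">" ; anchorPos ; 1 ) + 1 ;
  textEnd = Position ( chunk ; "</a>" ; textStart ; 1 )
] ;
  // Only return a value when every marker was actually found
  If ( labelPos and chunkEnd and anchorPos and textEnd ;
    Middle ( chunk ; textStart ; textEnd - textStart )
  )
)
```

Because the second search runs only inside the chunk, a match further down the page cannot be picked up by accident.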


Here is a simple example:

Given the following HTML snippet:

;">Published by</div><div style="font-size: 90%; padding-left: 1em; padding-bottom: 0.25em;"><a href="https://www.mobygames.com/company/sierra-entertainment-inc">On-Line&nbsp;Systems</a></div><div

this calculation:

Let ( [ 
mark1 =  Position ( HTML ; ">Published by</div>" ; 1 ; 1 ) ; 
mark2 =  Position ( HTML ; "<a href=\"https://www.mobyg" ; mark1 ; 1 ) ; 
mark3 = Position ( HTML ; ">"; mark2 ; 1 ) ; 
start = mark3 + 1 ; 
end = Position ( HTML ; "</a>" ; mark3 ; 1 ) 
] ; 
If ( mark3 ; 
Middle ( HTML ; start ; end - start ) 
) 
)

will return:

On-Line&nbsp;Systems
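
The result still carries the HTML entity, and the calculation does not check whether the earlier markers or the closing tag were found. A hedged variation that guards each marker and decodes a few entities might look like this; the entity list is an assumption ("&#039;" appears in the original script, "&amp;" is added as a likely candidate), not a complete decoder:

```filemaker
Let ( [
  mark1 = Position ( HTML ; ">Published by</div>" ; 1 ; 1 ) ;
  // Each later marker is searched only if the previous one was found
  mark2 = If ( mark1 ; Position ( HTML ; "<a href=\"https://www.mobyg" ; mark1 ; 1 ) ) ;
  mark3 = If ( mark2 ; Position ( HTML ; ">" ; mark2 ; 1 ) ) ;
  end = If ( mark3 ; Position ( HTML ; "</a>" ; mark3 ; 1 ) ) ;
  raw = If ( end ; Middle ( HTML ; mark3 + 1 ; end - mark3 - 1 ) )
] ;
  // Decode "&amp;" last so an escaped ampersand is not decoded twice
  Substitute ( raw ; [ "&nbsp;" ; " " ] ; [ "&#039;" ; "'" ] ; [ "&amp;" ; "&" ] )
)
```

For the snippet above this should give On-Line Systems with a plain space, and an empty result when any marker is missing.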

 
