Parse the HTML


stuj1026

This topic is 4582 days old. Please don't post here. Open a new topic instead.

Hi All

I have been playing with the web viewer and have visited the FileMaker website, and it seems you can do something called web scraping.

I bring up a page in the web viewer and, in another field, am able to retrieve the source code for the web page being displayed using GetLayoutObjectAttribute ( "web object name" ; "content" )




Above is a small sample of the HTML.

I now need to extract the profession of the doctor. I can get the starting position of the word "profession" using the Position function, but now need to extract everything that falls between the [b] and the [/b] after that position. Any thoughts?

Really need some help on this one..

Thanks Stu


How about this:

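Let ( [
  // Position of the opening tag
  var1 = Position ( SourceText ; "[b]" ; 1 ; 1 ) ;
  // Position of the first closing tag after it
  var2 = Position ( SourceText ; "[/b]" ; var1 ; 1 ) ;
  // Length of the text between the two tags (3 = Length ( "[b]" ))
  var3 = var2 - var1 - 3
] ;
  Middle ( SourceText ; var1 + 3 ; var3 )
)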


Ok I got it!!

I built a custom function called WebScrape

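Let ( [
  // Position of the label (e.g. "Profession") in the raw HTML
  text_to_find = Position ( Source_Text ; Extract_Text ; 1 ; 1 ) ;
  // Everything from the label onward
  Extraction1 = Middle ( Source_Text ; text_to_find ; Length ( Source_Text ) ) ;
  LenBegTag = Length ( Beg_Tag ) ;
  // First opening tag after the label, then the first closing tag after that
  Extraction2 = Position ( Extraction1 ; Beg_Tag ; 1 ; 1 ) ;
  Extraction3 = Position ( Extraction1 ; End_Tag ; Extraction2 + LenBegTag ; 1 )
] ;
  // Length runs from the end of the opening tag to the start of the closing tag
  Middle ( Extraction1 ; Extraction2 + LenBegTag ; Extraction3 - Extraction2 - LenBegTag )
)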

-------------------------------------------

Source_Text is the raw HTML.

Extract_Text is the label to look for, which in my case is Profession.

Beg_Tag is [b].

End_Tag is [/b].

So this extracts from the raw HTML the text that falls between [b] and [/b] following the word Profession.

Stu
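
For readers who want to check the logic outside FileMaker, the same three steps (find the label, find the next opening tag, take everything up to the next closing tag) can be sketched in Python. This is only an illustration, not the custom function itself: the name web_scrape and the sample strings are made up for the example.

```python
def web_scrape(source_text: str, extract_text: str, beg_tag: str, end_tag: str) -> str:
    """Return the text between beg_tag and end_tag that follows extract_text.

    Returns an empty string if the label or either tag is not found.
    """
    # Step 1: locate the label (e.g. "Profession") in the raw HTML.
    label_at = source_text.find(extract_text)
    if label_at == -1:
        return ""

    # Step 2: locate the first opening tag at or after the label.
    open_at = source_text.find(beg_tag, label_at)
    if open_at == -1:
        return ""
    value_start = open_at + len(beg_tag)

    # Step 3: locate the first closing tag after the opening tag.
    close_at = source_text.find(end_tag, value_start)
    if close_at == -1:
        return ""

    # The value runs from the end of the opening tag to the start of the closing tag.
    return source_text[value_start:close_at]


# Hypothetical sample text, using the tags as they are written in this thread.
sample = "Name: [b]Jane Doe[/b] Profession: [b]Cardiologist[/b]"
print(web_scrape(sample, "Profession", "[b]", "[/b]"))
```

Measuring the length from the end of the opening tag, rather than subtracting the closing tag's length, keeps the result correct when the two tags differ in length by more than one character, for example an opening tag that carries attributes.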


  • 4 weeks later...

I recently posted an HTMLtoText custom function at http://www.briandunning.com/filemaker-custom-functions/, which can convert a whole web page, or a larger part thereof, from HTML to plain text while preserving bold, italic and bullet formatting. A demo file with this custom function may be downloaded from this post.

Best regards,

Mogens Brun

DemoHTMLtoText.fp7.zip


  • 3 weeks later...
  • 2 weeks later...
  • 1 month later...

The HTMLtoText custom function at Brian Dunning's site has been updated to 1.04. You may use the web viewer, Troi URL or Fusion TCPdirect to capture the web page, and then parse the source HTML to formatted text. A demo file can be downloaded from this post.

The custom function now uses three global fields. The reason for this is as follows:

In FileMaker a text expression may be (1) a constant, or "literal", text string, (2) a field reference (to either a normal or a global field), or (3) a calculated combination of (1) and (2).

Re (1): A constant is entered between quotes in a Set Field script step, in a custom function, or similar. FileMaker's internal text editor filters characters entered from the keyboard, so only a subset of the possible ASCII characters can be expressed. For example, you can't enter a line feed (ASCII = 010).

Re (2): Text in a field may be entered from the keyboard as in (1), by import from other files, or by import from a web viewer. The last two methods make it possible for any ASCII value to occur in a field. The ASCII value NULL (0) seems to provoke a crash in some circumstances, and other ASCII values can cause other problems. These characters can't be entered in a constant/literal text string; they must be pasted into a field if you want to get rid of them through a substitution or similar.

This is why HTMLtoText has to use two global fields to represent character values that cannot be typed between quotation marks as literal text. The characters in question are ASCII = 010 and ASCII = 063, which I found often caused problems in HTML parsing. The list should perhaps be extended with more characters, among them ASCII = 000.

There is an alternative method: Fusion's TCPdirect plug-in allows you to express any character value. By using this plug-in you can avoid using globals to store special character values.

Re (3): Everything above applies to this case as well.

DemoHTMLtoText_1.04.zip


  • 5 months later...

Hi Mogens,

I finally got around to looking at your files. I have a project that I think the View will work great for. When I tested your files, I had trouble with the second one. For some reason your second file times out with both Brian Dunning's site and FMForum, whereas your original file works as intended. What am I missing with the second file?

Lee


  • 3 years later...
  • 1 year later...

Similarly, I need code that can parse XML and show it as plain text in the web viewer. I work with XML files. I use XML to call some external APIs, which return their responses in XML. I need that XML response to be shown as plain text in a web viewer. Any help would be highly appreciated.

Right now, instead of parsing it to normal text, I have decided to show the XML itself as data in the web viewer. I have tried the following code

$XML_ResponseFinal =
"<html>
<body>
" & ¶ &
Substitute ( $XML_Response ; [ "<" ; "&lt" ] ; [ ">" ; "&gt" ] )
& ¶ &
"</body>
</html>"

to show the raw XML in the web viewer (escaped so that it is treated as plain text rather than as markup), but without success. The web viewer just shows a blank page.

I am using Set Web Viewer [Object Name: "webviewer"; URL:"data:text/html;"$XML_ResponseFinal]

Thanks

Pushkraj
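
Two details in the snippet above may be worth checking, though neither is confirmed in this thread as the cause of the blank page. HTML character references are normally written with a trailing semicolon (&lt; and &gt;), and an ampersand in the XML should be escaped as well. A data URL also separates the media type from the data with a comma (data:text/html,) rather than a semicolon. The sketch below shows the intended string handling in Python rather than FileMaker calculation syntax, purely to make each step explicit. The function name xml_as_data_url is made up for the example, and percent-encoding the page is an assumption that may not be needed in the web viewer.

```python
from html import escape
from urllib.parse import quote


def xml_as_data_url(xml_response: str) -> str:
    """Build a data URL that displays xml_response as plain text."""
    # Escape & first, then < and >, so the XML is shown rather than interpreted.
    escaped = escape(xml_response, quote=False)

    # <pre> keeps the line breaks and indentation of the original XML.
    page = "<html><body><pre>" + escaped + "</pre></body></html>"

    # A data URL has the form data:<media type>,<data>; note the comma.
    # Percent-encoding is an assumption here, to keep special characters safe.
    return "data:text/html," + quote(page)


print(xml_as_data_url("<reply><status>ok</status></reply>"))
```

In FileMaker terms, the corresponding changes would be to substitute "&" first, to use "&lt;" and "&gt;" in the Substitute call, and to join "data:text/html," to the variable with the & operator.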


WARNING! Danger, Will Robinson! "Scraping" can lead to problems if the site decides to change its format (and they do!). If you get XML content, then yes, import the XML. You may need an XSLT to transform it to FMPXMLRESULT (the grammar used for FileMaker XML import).
