Parse the HTML

stuj1026 · July 13, 2006

Hi All

I have been playing with the web viewer and have visited the FM website, and it seems you can do something called web scraping.

I bring up a page in the web viewer and, in another field, am able to retrieve the source code of the web page being displayed using GetLayoutObjectAttribute ( "web object name" ; "content" ).

Above is a small sample of the HTML.

I now need to extract the profession of the doctor. I can get the starting position of the profession using the Position function, but now need to extract everything that falls between the [b] and the [/b] after the position of the profession. Any thoughts?

Really need some help on this one..

Thanks Stu

Edited August 26, 2006 by Guest
added code markup

Ballycroy · July 13, 2006

How about this:

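Let ( [
    // Start of the opening tag; "[b]" is 3 characters long
    var1 = Position ( SourceText ; "[b]" ; 1 ; 1 ) ;
    // Start of the closing tag
    var2 = Position ( SourceText ; "[/b]" ; 1 ; 1 ) ;
    // Number of characters between the two tags
    var3 = var2 - var1 - 3
] ;
    Middle ( SourceText ; var1 + 3 ; var3 )
)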

Edited August 26, 2006 by Guest
added code markup

stuj1026 · July 14, 2006

Ok I got it!!

I built a custom function called WebScrape

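Let ( [
    // Position of the label (e.g. "Profession") in the raw HTML
    text_to_find = Position ( Source_Text ; Extract_Text ; 1 ; 1 ) ;
    // Everything from the label to the end of the source
    Extraction1 = Middle ( Source_Text ; text_to_find ; Length ( Source_Text ) ) ;
    LenBegTag = Length ( Beg_Tag ) ;
    // Start of the first opening tag after the label
    Extraction2 = Position ( Extraction1 ; Beg_Tag ; 1 ; 1 ) ;
    // Start of the first closing tag after that opening tag
    Extraction3 = Position ( Extraction1 ; End_Tag ; Extraction2 + LenBegTag ; 1 )
] ;
    // Text between the tags: starts after the opening tag, stops before the closing tag
    Middle ( Extraction1 ; Extraction2 + LenBegTag ; Extraction3 - Extraction2 - LenBegTag )
)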

-------------------------------------------

Source_Text is the raw HTML.

Extract_Text is the label to search for, which in my case is Profession.

Beg_Tag is [b].

End_Tag is [/b].

So this extracts from the raw HTML the information that falls between [b] and [/b] following the word Profession.

Stu

Edited August 26, 2006 by Guest
added code markup

Philip · August 7, 2006

I like that custom function. Would you give an example with the parameters put into it?

for example,

WebScrape(SourceField ; "Profession " ; "")

That would be so helpful! Thanks.
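
A call along the lines Philip asks about might look like the sketch below. It assumes the custom function was defined with its parameters in the order stuj1026 lists them (Source_Text ; Extract_Text ; Beg_Tag ; End_Tag), which the thread does not confirm, and it uses a made-up literal string in place of a real page source.

```
// Hypothetical sample text standing in for the scraped page source
Let (
    html = "Name: [b]Dr. Smith[/b] Profession: [b]Cardiologist[/b]" ;
    // Assumed parameter order: source ; label ; opening tag ; closing tag
    WebScrape ( html ; "Profession" ; "[b]" ; "[/b]" )
)
```

With this input the function finds "Profession", then the first [b] after it, and returns Cardiologist. In a real file the first parameter would be the field that holds the web viewer content, for example SourceField as in the question above.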

MogensBrun · August 7, 2006

I recently posted an HTMLtoText custom function at http://www.briandunning.com/filemaker-custom-functions/, which can convert a whole web page, or a larger part thereof, from HTML to plain text while preserving bold, italic and bullet formatting. A demo file with this custom function may be downloaded from this post.

Best regards,

Mogens Brun

DemoHTMLtoText.fp7.zip

bruceR · August 26, 2006

Very nice. Looks like you need to add to the list of substitutions:

[ "�" ; "®" ] ;

Edited August 26, 2006 by Guest

MogensBrun · September 6, 2006

The HTMLtoText custom function at http://www.briandunning.com/filemaker-custom-functions/ has been updated. You may use the web viewer or Troi URL to fetch the page you want to parse from HTML to text. A demo file can be downloaded from this post.

Søren Dyhr · October 18, 2006

Hi Mogens

couldn't you explain the reasoning behind the use of global fields, which is required by your CF??

--sd

Lee Smith · October 18, 2006

I don't see a demo file at Brian's site, and the link to your site did not work.

Lee

Søren Dyhr · October 21, 2006

I'll let him know!

--sd

MogensBrun · October 24, 2006

The HTMLtoText custom function at Brian Dunning's site has been updated to 1.04. You may use the web viewer, Troi URL or Fusion TCPdirect to capture the web page, and then parse the source HTML to formatted text. A demo file can be downloaded from this post.

The custom function now uses three global fields. The reason for this is as follows:

In FileMaker a text expression may either be (1) a constant - or "literal" - text string, (2) a field reference (either a normal or global field), or (3) a calculated combination of (1) and (2).

Ad. (1) A constant is entered between quotes in a Set Field script step, in a custom function, or similar. FileMaker's internal text editor filters keyboard-entered characters, so only a subset of the possible ASCII characters can be expressed. For example, you can't enter a line feed (ASCII = 010).

Ad. (2) Text in a field may be entered through the keyboard (as in 1), by import from other files, or by import from a web viewer field. The last two methods make it possible for any ASCII value to occur in a field. The ASCII value NULL (0) seems to provoke a crash in some circumstances. Other ASCII values can cause other problems. These characters can't be entered in a constant/literal text string, but must be pasted into a field if you want to get rid of them through a substitution or similar.

This is the reason I have to use two global fields in HTMLtoText to represent character values that cannot be typed between quotation marks as literal text. These are ASCII = 010 and ASCII = 063, which I found often caused problems in HTML parsing. The list should perhaps be extended with more characters, among them ASCII = 000.

There is an alternative method: Fusion's TCPdirect plug-in allows you to express any character value. So by using this plug-in you can avoid using globals to store special character values.

Ad. (3) All above applies to this point.

DemoHTMLtoText_1.04.zip

Lee Smith · April 20, 2007

Hi Mogens,

I finally got around to looking at your files. I have a project that I think the View will work great for. When I tested your files, I had trouble with the second one. For some reason your second file times out using Brian Dunning's site and FMForum. However, your original file works as intended. What am I missing on the second file?

Lee

apathyisafad · July 26, 2010

Thank you! Your custom function solved a huge problem for me!

Pushkraj · September 30, 2011

Similarly, I need code that can parse XML and show it as plain text in the web viewer. I work with XML files: I use XML to call some external APIs, which return their responses in XML. I need that XML response shown as plain text in a web viewer. Any help would be highly appreciated.

Right now, instead of parsing it to normal text, I have decided to show the XML itself as data in the web viewer. I have just tried the following code

$XML_ResponseFinal =
"<html>
<body>
" & ¶ &
Substitute ( $XML_Response ; [ "<" ; "&lt" ] ; [ ">" ; "&gt" ] )
& ¶ &
"</body>
</html>"

to show the raw XML in the web viewer (escaped so that it is treated as normal text and not as markup), but without success. The web viewer just shows a blank page.

I am using Set Web Viewer [Object Name: "webviewer"; URL:"data:text/html;"$XML_ResponseFinal]
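
A possible reason for the blank viewer, offered as a sketch and not a confirmed diagnosis: a data URL separates the media type from the payload with a comma, not a semicolon; the prefix has to be joined to the variable with the & operator; and HTML entities end with a semicolon (&lt; and &gt;). The calculation below assumes $XML_Response holds the API response, as in the post above. The pre tags are an addition to keep the line breaks visible.

```
// Value for $XML_ResponseFinal: a complete data URL for the web viewer
"data:text/html," &
"<html><body><pre>" &
Substitute ( $XML_Response ;
    [ "&" ; "&amp;" ] ;    // escape ampersands first so the entities below survive
    [ "<" ; "&lt;" ] ;
    [ ">" ; "&gt;" ]
) &
"</pre></body></html>"
```

With the variable built this way, the script step would take the variable alone as its URL, for example Set Web Viewer [ Object Name: "webviewer" ; URL: $XML_ResponseFinal ]. Responses that contain characters such as # or % may need further encoding before they display correctly in a data URL.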

Thanks

Pushkraj

comment · September 30, 2011

Why don't you import the XML response?

beverly · October 4, 2011

WARNING! danger Will Robinson! "scraping" can lead to problems if the site decides to change the format (and they do!). If you get XML content, then yes, import the XML. You may need an XSLT to get it transformed to FMPXMLRESULT (used for FM xml import).
