Newbie tried to build an Amazon web scraper... here's what happened

SB-Books · January 14, 2015

Hello,

I am trying to learn FM and at the same time build myself an ebook library tool that will scrape Amazon.com to pull data like title, author, rating, etc. -- based on the user (me!) submitting a book product ID -- which Amazon calls an "ASIN."

I have searched online for a similar solution or template, but have only found bits and pieces of a solution so I am trying to build one myself -- and hopefully learn FM in the process.

Here's what I have been able to accomplish so far. Don't laugh...

1) Create a field into which a user can enter an Amazon ASIN (like "B00FYW9VHC")

2) Use that data to generate an Amazon image URL

3) Use that image URL and InsertFromURL to pull a book cover image from Amazon and put it into an image container (wow!)

4) Use the entered Amazon ASIN to generate a WebViewer object (named "amzn") which displays the Amazon product page (like http://www.amazon.com/exec/obidos/tg/detail/-/B00FYW9VHC)

Here's what I haven't been able to figure out:

1) How to pull the HTML source from the WebViewer object and put it into another field. I know this has something to do with GetLayoutObjectAttribute, but I haven't been able to figure out the correct script. Currently, I have something that looks like this:

 Set Field [LibraryTest::webcontent;GetLayoutObjectAttribute("amzn";"content")]

where webcontent is a text field and amzn is the name of my WebViewer. This clearly isn't working, though.

2) Once I am able to get the HTML source code into a field, I'm not sure how to parse it in order to grab data elements like Title, Author, Price, Description, Rating, etc.

If anyone has done this already or has some advice for me, I'd be very grateful.

Thanks in advance,

SB

ggt667 · January 14, 2015

I do this every day. I must admit I do not use FileMaker for the parsing part: I use Import Records with an XML source, and what I import is actually the output of a script. The script produces FMPXMLRESULT, is very efficient, and can run on the server.

I have built the script that generates the XML for different projects and environments in:

Shell script

PHP, as per this blog post: http://wethecomputerabusersamongst.blogspot.com/2013/10/execute-php-script-from-filemaker-with.html

NodeJS

PhantomJS

curl / tidy -asxml / XSLT

I also know people who made these parsers in Ruby and Perl.

If you do the parsing inside FileMaker, you will probably use Position() and Middle(), compute some offsets, and take certain things for granted. This will be very fragile when Amazon changes its output.
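To make the fragility concrete, here is a minimal shell sketch of the same "find a marker, take what follows" style of extraction. The function name `between`, the sample markup, and the `productTitle` and `price` markers are all made up for illustration; nothing here reflects Amazon's actual HTML.

```shell
#!/bin/sh
# Print the text between the first occurrence of a start marker and the
# next occurrence of an end marker. Returns 1 if either marker is missing.
between() {
  text=$1
  start=$2
  end=$3
  rest=${text#*"$start"}            # drop everything up to and including the start marker
  if [ "$rest" = "$text" ]; then    # start marker not found
    return 1
  fi
  value=${rest%%"$end"*}            # keep only what precedes the end marker
  if [ "$value" = "$rest" ]; then   # end marker not found
    return 1
  fi
  printf '%s\n' "$value"
}

# Made-up sample markup; the id and class names are hypothetical.
html='<div><span id="productTitle">Example Book</span><span class="price">$4.99</span></div>'

between "$html" '<span id="productTitle">' '</span>'
between "$html" '<span class="price">' '</span>'

# If the site renames a tag or attribute, the marker no longer matches:
between "$html" '<h1 id="title">' '</h1>' || echo "marker not found"
```

The last call shows the weak point: the extraction fails as soon as the page markup no longer contains the exact marker string.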

Fetching the HTML source does not work until the progress bar has completed. You may have to pause for 5 seconds after changing the URL before you execute GetLayoutObjectAttribute( "amzn"; "content" ).
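A fixed pause can be replaced by polling until the source looks complete. In FileMaker that would presumably be a loop with a short pause that exits once the fetched content contains the end of the page. The same check is sketched below in shell, assuming the page source is being saved to a file. The function name `wait_for_html` and the use of a closing `</html>` tag as the "finished" signal are illustrative assumptions, not a tested recipe.

```shell
#!/bin/sh
# Poll a file until it contains a closing </html> tag.
# $1 = file, $2 = maximum attempts (default 10), $3 = seconds between attempts (default 1)
wait_for_html() {
  file=$1
  tries=${2:-10}
  delay=${3:-1}
  while [ "$tries" -gt 0 ]; do
    if grep -qi '</html>' "$file" 2>/dev/null; then
      return 0                      # page looks complete
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1                          # gave up: still incomplete
}

# Demonstration with a temporary file standing in for the saved page source.
page=$(mktemp)
printf '<html><body>loading' > "$page"
wait_for_html "$page" 2 0 || echo "not loaded yet"
printf '</body></html>\n' >> "$page"
wait_for_html "$page" 2 0 && echo "loaded"
rm -f "$page"
```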

Lee Smith · January 14, 2015

Hi SB, and welcome to the FM Forums,

Scraping a Web Site is not as simple as it may seem.

See what another member is currently going through in his two threads: Link and Link

If you are under the impression that FileMaker is easy to learn and use, then you need to understand where that impression comes from. For a simple thing like a Rolodex or contact database, this could be true. However, to develop something more robust, you can look forward to a lot of time and effort learning the fundamentals, and the more you want out of your solution, the more time you will be investing.

Before you start slapping together a file, you need to learn FileMaker and its way of doing things.

Start by studying the User Manual, Help files, and Starter files. You need to learn how to create fields, layouts, scripts, tables, relationships, etc., and see how they work together. The Starter files can help, so go under the hood of the files and find out how they tick. In Layout mode, check out the buttons, popovers, tabs, etc.

Since this is a new solution that you are starting from scratch, you should prepare an ERD (Entity Relationship Diagram), not to be confused with the Relationship Graph in FileMaker, and see how it helps you determine the structure of all of these things.

Since you have identified your skill level as Beginner (I’ll take that to mean new to database design and FileMaker), be prepared to spend a lot of time and effort learning both of these things. There are some excellent resources available, so let us know if you need some recommendations.

You might want to look at a commercial product I use called "Delicious Library 3": Link

Lee

ggt667 · January 14, 2015

If you are under the impression that FileMaker is easy to learn and use, then you need to understand where that impression comes from. For a simple thing like a Rolodex or contact database, this could be true. However, to develop something more robust, you can look forward to a lot of time and effort learning the fundamentals, and the more you want out of your solution, the more time you will be investing.

I'd say there are horses for courses: FileMaker has no real competitors in making reports, and it's very quick for building a GUI.

But when it comes to parsing HTML? There are many options that I would prefer over native FileMaker. I would rather have something predigested into native FMPXMLRESULT; I often think the best solution is to pick the best tool from each toolbox.

If SB-Books is only on Mac, there is a full set of Unix tools available.

Example script (just typed off the top of my head, not tested): /usr/local/bin/fetchurlandmakefilemakersource.sh

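---

#!/bin/sh
# Usage: fetchurlandmakefilemakersource.sh URL
cd /Library/WebServer/Documents/ || exit 1

# Name the output files after the last path segment of the URL
name=$(basename "$1")

# Fetch the page: -s hides the progress meter, -L follows redirects
curl -sL "$1" -o "$name.html"

# Convert the HTML to well-formed XHTML, modifying the file in place (-m)
tidy -q -i -asxml -wrap 0 -m "$name.html"

# Optional step: the stylesheet can be applied during the FileMaker import instead
xsltproc amzn2fmpxmlresult.xslt "$name.html" > "$name.fmpxmlresult.xml"

---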

You should be able to call this script as follows: fetchurlandmakefilemakersource.sh http://www.amazon.com/exec/obidos/tg/detail/-/B00FYW9VHC

Then import the result using the regular FileMaker XML import, if Web Sharing is turned on on the Mac.

All that has to be done is to write the mapping, namely amzn2fmpxmlresult.xslt.

It should be a matter of one XPath select per tag you would like to map to a field.
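As a rough idea of what such a mapping might look like, the sketch below writes a one-field stylesheet and, if xsltproc is installed, runs it against a made-up XHTML sample. The function name `write_stylesheet`, the field name `title`, and the `productTitle` id are assumptions for illustration only; a real page would need its own XPath expressions, and the FMPXMLRESULT skeleton should be checked against FileMaker's documentation before relying on it.

```shell
#!/bin/sh
# Write a minimal XHTML-to-FMPXMLRESULT stylesheet to the path given as $1.
write_stylesheet() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
      <ERRORCODE>0</ERRORCODE>
      <PRODUCT BUILD="" NAME="" VERSION=""/>
      <DATABASE DATEFORMAT="" LAYOUT="" NAME="" RECORDS="1" TIMEFORMAT=""/>
      <METADATA>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="title" TYPE="TEXT"/>
      </METADATA>
      <RESULTSET FOUND="1">
        <ROW MODID="0" RECORDID="1">
          <!-- One XPath select per field; this id is hypothetical -->
          <COL><DATA><xsl:value-of select="normalize-space(//x:span[@id='productTitle'])"/></DATA></COL>
        </ROW>
      </RESULTSET>
    </FMPXMLRESULT>
  </xsl:template>
</xsl:stylesheet>
EOF
}

dir=$(mktemp -d)
write_stylesheet "$dir/amzn2fmpxmlresult.xslt"

# Made-up XHTML standing in for the tidy output.
cat > "$dir/sample.xhtml" <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span id="productTitle"> Example Book </span>
</body></html>
EOF

# Run the transform only when xsltproc is installed.
if command -v xsltproc >/dev/null 2>&1; then
  xsltproc "$dir/amzn2fmpxmlresult.xslt" "$dir/sample.xhtml"
fi
```

Each additional field would need one more FIELD entry under METADATA and one more COL in the ROW, in the same order.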

Lee Smith · January 14, 2015

I understood your first post.

eos · January 14, 2015

I understood your first post.

:laugh:

Wim Decorte · January 14, 2015

4) Use the entered Amazon ASIN to generate a WebViewer object (named "amzn") which displays the Amazon product page (like http://www.amazon.com/exec/obidos/tg/detail/-/B00FYW9VHC)

The proper way to do this is to use Amazon's web API, not scrape the web page. They will change their layout very frequently and that will break your scraping part. Use the developer tools that Amazon has and you will get the data in a standard format.

ggt667 · January 15, 2015

The proper way to do this is to use Amazon's web API.

The same methods and links I already gave you will still be valid,

yet the task will be simpler with an official API at hand.

SB-Books · January 22, 2015

Thanks very much for the good feedback. I will dig into some of those resources, but UNIX, PHP, and setting up web servers are way beyond my capabilities.

Also, as far as I can tell, the official Amazon API will not return the current price of a Kindle book -- which makes it a non-starter for my ebook library tool.

Thnx again

ggt667 · October 8, 2015

Using the approach I gave you in the link above, you can basically run these steps:

1) Go to the page of your liking using a tool that renders the page

2) Take a screenshot; steps 1 and 2 can both be done like this (using PhantomJS): https://github.com/ariya/phantomjs/blob/master/examples/render_multi_url.js

3) OCR the screenshot, e.g. using Tesseract or OCRopus

4) Search for $, e.g. using grep on the output
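Step 4 might look something like the sketch below, assuming the OCR step has already produced plain text. The function name `find_prices` and the sample text are made up for illustration, and the pattern only covers simple dollar amounts.

```shell
#!/bin/sh
# Print every dollar amount (such as $4.99 or $12) found on standard input.
find_prices() {
  grep -oE '\$[0-9]+(\.[0-9]{2})?'
}

# Made-up text standing in for the OCR output of a product-page screenshot.
ocr_text='Example Book
Kindle Edition
Kindle Price: $4.99
Print List Price: $12.00
You save 58%'

printf '%s\n' "$ocr_text" | find_prices
```

OCR output can misread characters (for example "S" for "$"), so the matches would still need a sanity check before being stored as a price.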

Edited October 8, 2015 by ggt667

SB-Books · October 8, 2015

That's a very interesting and unique approach, GGT. Thank you!
