Importing library catalogue XML




Hi All

Yes, yet another XSL cry for help. ;)

I have a couple of Filemaker databases for managing scanned images of library collection items, which currently import metadata from Excel spreadsheets. This process is extremely inefficient (for those preparing the XLS) and suffers from numerous data quality issues. We've recently been provided with a few extra URL schemas for specific lookups and an XML feed of an individual record. I've hacked together a means of importing the metadata (scraping the source code of the XML), but it's not pretty. Importing the XML directly is obviously much more sensible, but this is outside of my current experience.

Here is a sample data XML

I've yet to figure out the exact mappings to other metadata standards, but either way, what I want to extract from the XML is a separate record for each SUBFIELDDATA with the following fields:

VARFLD/HEADER/TAG                      -> TAG

VARFLD/HEADER/SEQUENCENUM              -> SEQUENCENUM

VARFLD/MARCINFO/MARCTAG                -> MARCTAG

VARFLD/MARCSUBFIELD/SUBFIELDINDICATOR  -> SUBFIELDINDICATOR

VARFLD/MARCSUBFIELD/SUBFIELDDATA       -> SUBFIELDDATA

Any help would be greatly appreciated, and I'm happy to credit any contributors.

Cheers

Ben


This wouldn't be too difficult - except your XML document doesn't pass validation against its own DTD. You can see the errors if you use a validator such as:

http://www.xmlvalidation.com/

FWIW, I am attaching an XSLT stylesheet. It will work if you remove the DOCTYPE declaration from your XML document.

unimelb.zip


This wouldn't be too difficult - except your XML document doesn't pass validation against its own DTD. You can see the errors if you use a validator such as:

http://www.xmlvalidation.com/

FWIW, I am attaching an XSLT stylesheet. It will work if you remove the DOCTYPE declaration from your XML document.

Many thanks for this. Changing the DOCTYPE will require requesting the vendor to make a change.... hmmm... but for Filemaker on a desktop I can always import into a text field, strip it out and export to a temporary XML file. Not sure yet what I could do with Filemaker Go, but I have a temporary hack that scrapes the title and author from the HTML record for quick verification, which is enough for mobile use.

I'm looking at using an iPhone/iPod Touch as a barcode scanner to create lists of items being sent to us for scanning rather than them filling in spreadsheets. Quicker for them, makes it possible to simplify check in/out of items for everyone and gives us the info we need to get the full metadata without any extra annotations :)

Ben


Changing the DOCTYPE will require requesting the vendor to make a change...

It's not so much a question of changing the DOCTYPE. The problem is that the content of the XML does not conform to the DTD declared in the DOCTYPE. They should either produce XML according to their own specifications or change the DTD - or remove the declaration altogether.


Continuing with my workaround... I've looked at various ways of importing into a text field from a calculated URL. Being on Windows I tend to use DOS batch files for different things. In this case:

  • Export .BAT to have wget download the XML to a text file with a fixed name "xrecord.TXT" in a directory where it's the only file.
  • Run the .BAT
  • Import the folder to a global text field, strip out the DOCTYPE so Filemaker won't explode
  • Export field as plain text (XML with only the field contents) to "xrecord.XML"
  • Import "xrecord.XML"

As I already have the RECORDKEY in a field (used to generate the URL to fetch the XML) I can auto-enter it, keeping this part nice and simple. Alternatively I could download a whole batch using the RECORDKEY for the file name, doing all of the processing after the file downloads have finished...

I wrestled with the XSL (yes you're allowed to laugh but at least I'm learning to fish ;) ) and finished up with this change which gets all of the SUBFIELD data.

<xsl:for-each select="IIIRECORD/VARFLD/MARCSUBFLD">
  <ROW MODID="" RECORDID="">
    <COL><DATA><xsl:value-of select="../HEADER/TAG"/></DATA></COL>
    <COL><DATA><xsl:value-of select="../HEADER/SEQUENCENUM"/></DATA></COL>
    <COL><DATA><xsl:value-of select="../MARCINFO/MARCTAG"/></DATA></COL>
    <COL><DATA><xsl:value-of select="SUBFIELDINDICATOR"/></DATA></COL>
    <COL><DATA><xsl:value-of select="SUBFIELDDATA"/></DATA></COL>
  </ROW>
</xsl:for-each>
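For completeness, that for-each has to sit inside a stylesheet whose output is FileMaker's FMPXMLRESULT import grammar. A rough sketch of the surrounding skeleton (the five FIELD declarations simply mirror the five columns; attribute values are left blank):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
      <ERRORCODE>0</ERRORCODE>
      <PRODUCT BUILD="" NAME="" VERSION=""/>
      <DATABASE DATEFORMAT="" LAYOUT="" NAME="" RECORDS="" TIMEFORMAT=""/>
      <METADATA>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="TAG" TYPE="TEXT"/>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="SEQUENCENUM" TYPE="TEXT"/>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="MARCTAG" TYPE="TEXT"/>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="SUBFIELDINDICATOR" TYPE="TEXT"/>
        <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="SUBFIELDDATA" TYPE="TEXT"/>
      </METADATA>
      <RESULTSET FOUND="">
        <!-- the for-each shown above goes here, emitting one ROW per MARCSUBFLD -->
      </RESULTSET>
    </FMPXMLRESULT>
  </xsl:template>
</xsl:stylesheet>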

I'd love to have a solution that works with Filemaker Go but I have plenty to work with for now. A bit of data cleansing and "all" I need to do is figure out a set of rules to map our usage of MARC21 to DC, IPTC and PRISM... :mellow:

Thanks for helping me past this hurdle.

Ben


I think that if you're already using OS-level scripting to download the file/s, you could also have it do the pre-processing.
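For example, wget could fetch the record and a findstr pass could drop the DOCTYPE in the same batch file. A minimal sketch, assuming the DOCTYPE declaration sits on a line of its own:

wget "http://cat.lib.unimelb.edu.au/xrecord=[RECORDKEY]" -O xrecord.TXT
rem keep every line except the one starting with <!DOCTYPE
findstr /v /b /c:"<!DOCTYPE" xrecord.TXT > xrecord.XML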

I wrestled with the XSL (yes you're allowed to laugh but at least I'm learning to fish ;) ) and finished up with this change which gets all of the SUBFIELD data.

Why would I laugh? I hadn't noticed there can be multiple MARCSUBFLD elements in a VARFLD, and there's nothing wrong with your correction.


Continuing with my workaround... I've looked at various ways of importing into a text field from a calculated URL. Being on Windows I tend to use DOS batch files for different things. In this case:

  • Export .BAT to have wget download the XML to a text file with a fixed name "xrecord.TXT" in a directory where it's the only file.
  • Run the .BAT
  • Import the folder to a global text field, strip out the DOCTYPE so Filemaker won't explode
  • Export field as plain text (XML with only the field contents) to "xrecord.XML"
  • Import "xrecord.XML"

Would you post a copy of the .BAT you're using?


I think that if you're already using OS-level scripting to download the file/s, you could also have it do the pre-processing.

I had overlooked that for now... partly trying to keep it as portable as possible. A Perl script may be an option for our internal uses... but I'm hoping that I can make a good enough case from the benefits of re-using the data that we can push for the DOCTYPE to be fixed or removed.

Why would I laugh? I hadn't noticed there can be multiple MARCSUBFLD elements in a VARFLD, and there's nothing wrong with your correction.

Not so much at the correction, but that I had to wrestle with a number of dumb ideas (read up on nesting for-each) to get to the final solution... Definitely need to learn more XML.

Would you post a copy of the .BAT you're using?

To simplify things a bit I keep the apps and data in folders with the database. It's not essential, but it helps to make things more portable later on without requiring the user to know too much about what they're doing.

* Database

* CMD\wget.exe

* DATA\TXT files

Anyway... in this case I have the RECORDKEY already (scraped from the web page of the catalogue entry). The commandline to download the XML as TXT is simply:

wget http://cat.lib.unimelb.edu.au/xrecord=[RECORDKEY] -O ../DATA/[RECORDKEY].txt
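A batch version of the same thing might look roughly like this (keys.txt is just a hypothetical name for a text file with one RECORDKEY per line; the .BAT is assumed to live in the CMD folder next to wget.exe):

@echo off
rem Sketch only: download the XML for every record key listed in keys.txt
cd /d "%~dp0"
for /f %%K in (keys.txt) do (
    wget --tries=3 "http://cat.lib.unimelb.edu.au/xrecord=%%K" -O "..\DATA\%%K.txt"
)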

To import I have two fields, IIIRECORD for the XML content and FILENAME to match the downloaded file to the records on import.

I have a lot of documentation to write at work for the next couple of months, so I'm only working on this in my spare time at the moment. Once I get it to a usable form I'll make it publicly available. Could be useful for people trying to get data from other libraries using the same catalogue system.


..................

Anyway... in this case I have the RECORDKEY already (scraped from the web page of the catalogue entry). The commandline to download the XML as TXT is simply:

wget http://cat.lib.unimelb.edu.au/xrecord=[RECORDKEY] -O ../DATA/[RECORDKEY].txt

..................

Well after struggling quite a bit I guess I'll give up for now.

I was hoping to use this to do a simple download and save a file, but I guess it's above my knowledge level... lol

:B

I wanted:

https://sellercentral.amazon.com/gp/ssof/knights/csv-download.html?ie=UTF8&itemsPerPage=10000&searchType=advanced&currentPage=1&quantityType=sellable&condition=All

.... but my cmd just runs and runs and then throws an error. Guessing it's the URL........

}:(

Oh well


Well after struggling quite a bit I guess I'll give up for now.

I was hoping to use this to do a simple download and save a file, but I guess it's above my knowledge level... lol

:B

I wanted:

https://sellercentral.amazon.com/gp/ssof/knights/csv-download.html?ie=UTF8&itemsPerPage=10000&searchType=advanced&currentPage=1&quantityType=sellable&condition=All

.... but my cmd just runs and runs and then throws an error. Guessing it's the URL........

}:(

Oh well

Your URL requires an authenticated, encrypted session, so you'll need a bit more than just the URL in this case. There are a few examples in the wget manual (PDF in the /man folder of wget).


Your URL requires an authenticated, encrypted session, so you'll need a bit more than just the URL in this case. There are a few examples in the wget manual (PDF in the /man folder of wget).

Thank you !! Well...... sort of... lol. At least that pointed me in a correct direction!

I've gotten close after playing with it for the last 2 hrs. Learned a bit along the way.

I can connect and it even saved a file. It turns out to be the web page asking me to sign in but at least it's close.....

Hopefully it's something I can fix tomorrow after looking at it again. I have the 'http://user%40domain:pass@' correct I think....

wget --no-check-certificate https://user_name%40domain.com:[email protected] ...........

I wonder if I need to replace an '_' in the email address and another in my password ????

<_<

Well 'night all .... until tomorrow!!


Tested out the download > import > export > import process on a list of the RECORDKEYS we've digitised already. (I work in a digitisation service so we're caught in the middle between collections and the digital repository)

* Batch downloading was OK, although our catalogue timed out occasionally... one benefit of using wget is that it retries failed connections.

* The rest of the processing was nice and fast :)

  - import the folder of TXT files into a text field; an auto-enter calculation strips out the DOCTYPE (a sketch of such a calculation follows this list)
  - loop
  - transfer the text of the corrected XML to a global text field in an unrelated table with only 1 record
  - export the unrelated table to an XML file
  - import the XML
  - next record
  - end loop

* Identified a 2% error rate in the manual entry of our existing RECORDKEYS, so there's another benefit of getting this working: eliminating manually created spreadsheets. We don't use this field in our workflows, we just pass it on to the digital repository, where it probably breaks something.
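For anyone wanting to replicate this, a sketch of what such an auto-enter calculation could look like (it assumes the declaration has no internal subset, i.e. it ends at the first ">" after "<!DOCTYPE"):

Let ( [
  ~start = Position ( Self ; "<!DOCTYPE" ; 1 ; 1 ) ;
  ~end   = Position ( Self ; ">" ; ~start ; 1 )
] ;
  If ( ~start > 0 and ~end > 0 ;
    Replace ( Self ; ~start ; ~end - ~start + 1 ; "" ) ;
    Self
  )
)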

Now it's time to talk to people on the other sides of our workflows to see what else we can do with this framework.


Thank you !! Well...... sort of... lol. At least that pointed me in a correct direction!

I've gotten close after playing with it for the last 2 hrs. Learned a bit along the way.

I can connect and it even saved a file. It turns out to be the web page asking me to sign in but at least it's close.....

Hopefully it's something I can fix tomorrow after looking at it again. I have the 'http://user%40domain:pass@' correct I think....

wget --no-check-certificate https://user_name%40domain.com:[email protected] ...........

I wonder if I need to replace an '_' in the email address and another in my password ????

<_<

Well 'night all .... until tomorrow!!

Also have a look at the --post-data=string / --post-file=file section in the wget manual. That authentication method possibly doesn't apply in this case. I haven't tried this yet, so that's about as far as I can help.
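If it comes to that, the general pattern with wget is to POST the sign-in form once, save the session cookies, then reuse them for the real request. A very rough sketch, inside a .BAT (the sign-in URL and the "email"/"password" field names are only placeholders that would have to be read from the site's actual sign-in page, and Amazon may well require more than this):

rem Sketch only: form-based login, then an authenticated download
rem (percent signs are doubled because this runs inside a .BAT)
wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt ^
     --post-data "email=user%%40domain.com&password=secret" ^
     "https://sellercentral.amazon.com/SIGN-IN-URL" -O signin.html

wget --no-check-certificate --load-cookies cookies.txt ^
     "https://sellercentral.amazon.com/gp/feedback-manager/view-all-feedback.html" -O test.txt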


Also have a look at the --post-data=string / --post-file=file section in the wget manual. That authentication method possibly doesn't apply in this case. I haven't tried this yet, so that's about as far as I can help.

Not being a programmer I go by trial and error until I can figure it out ...... not helping much at the moment!

}:(

I thought it might be easier to just try to get a feedback page but alas ... (I once knew.... but I digress).... nope.

I tried all kinds of --post combinations but can only get the sign-in page.

This is the basic .bat I started with, in case anyone who has an Amazon account can and wants to help.............


wget --secure-protocol=auto --no-check-certificate https://sellercentral.amazon.com/gp/feedback-manager/view-all-feedback.html?ie%3DUTF8%26sortType%3DsortByDate%26dateRange%3D%26descendingOrder%3D1 -O test.txt

If not thanks anyway..... it's been fun as always!


  • 4 years later...

Ah those were the days.  FWIW in the end I gave up on fixing the XML and wrote a script to insert the XML into a text field and then scrape the data fields out recursively. Extremely crude, but it made the solution completely self-contained so it works on mobile devices as well.  Ended up presenting it at a conference: http://www.vala.org.au/component/docman/?task=doc_download&gid=482&Itemid=269


That may be, but I was in a situation where direct import of the XML would instantly crash Filemaker. The calculations/scripts I set up are reasonably flexible in that I only need to copy a few lines of script and define the text for the start and end tags. The other nice thing (from a scraping perspective) is that the start and end "tags" can be any text. I've used this a number of times to quickly pull bits of data buried in ugly HTML code and even in OCR'ed documents of structured text.

But for XML it's definitely only a last resort if all direct import options have been exhausted.
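In case it's useful to anyone, the core of that kind of between-tags extraction is a calculation roughly like this ("source", "startTag" and "endTag" are hypothetical field names; a full version would repeat this for each occurrence of the start tag):

Let ( [
  ~tagStart = Position ( source ; startTag ; 1 ; 1 ) ;
  ~from     = ~tagStart + Length ( startTag ) ;
  ~to       = Position ( source ; endTag ; ~from ; 1 )
] ;
  If ( ~tagStart > 0 and ~to > 0 ;
    Middle ( source ; ~from ; ~to - ~from ) ;
    ""
  )
)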

Capture.PNG

