MS Word extraction

nopposan · May 16, 2008

Hello all.

I've just realized that if I can extract a few fields of data from about 2000 MS Word documents and get that data into a format that can be imported into my FM Pro database, I can save the many hours of data entry that would otherwise be needed to fill the database with archived data.

Please share any ideas you have on the subject.

The fields would be: Gender (text, limited to value list), MedicalRecordNumber (text), SubmitterID (text), SecondSubmitterID (text), AccessionNumber (our ID, text), and ProvidedClinicalInfo (text entries of observed abnormalities or suspected syndromes separated by commas). I figure if I can get the information that's in these documents into a CSV file or Excel spreadsheet, then I can import it into FM Pro; then I can use a script tied to a GUI layout to take the "ProvidedClinicalInfo" and use it to populate the clinical data fields that are used to categorize the type of abnormality. For example, the script might take "micrognathia" from the "ProvidedClinicalInfo" field and copy it to the "FacialAbnormalities" field -- terms that haven't been added to the script's lists yet could be turned over to staff, who would decide which list they belong in.
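
To make the term-sorting idea concrete, here is a rough sketch of the lookup as a shell script rather than a FileMaker script. Only "micrognathia" and "FacialAbnormalities" come from the description above; the function names, the "polydactyly" entry, the "LimbAbnormalities" field, and the "UNSORTED" marker are invented for illustration.

```shell
# Hypothetical lookup: map one clinical term to a category field name.
# Only micrognathia -> FacialAbnormalities comes from the post.
categorize() {
  case "$1" in
    micrognathia) echo "FacialAbnormalities" ;;
    polydactyly)  echo "LimbAbnormalities" ;;   # invented example entry
    *)            echo "UNSORTED" ;;            # unknown term: hand to staff
  esac
}

# Split a comma-separated ProvidedClinicalInfo value and print
# "term<TAB>category" for each non-empty term.
sort_terms() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r term; do
    # Trim leading and trailing whitespace around the term.
    term=$(printf '%s' "$term" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    if [ -n "$term" ]; then
      printf '%s\t%s\n' "$term" "$(categorize "$term")"
    fi
  done
}

sort_terms "micrognathia, polydactyly, hypotonia"
```

Anything that comes back marked UNSORTED would be the list handed to staff for a decision.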

Thanks for any help you can offer.

mz123 · May 16, 2008

How are the Word documents formatted?

Fenton · May 16, 2008

Do you possibly have access to a Mac? Because using AppleScript you could create copies of all those files as plain text files. Then you could Import them all into FileMaker as one operation. You would still need to parse out the data.

Alternatively you could write a Word macro to save each as a text file on a PC, but I wouldn't know how to run it on the whole folder of files.

nopposan · May 16, 2008

Yes, I do have access to Macs here; biologists tend to love Macs and this place is full of both.

However, although I love the Mac interface, I'm a new convert to Linux. I'm sure I can find a Linux tool that would convert the MS Word files to text in batch mode.

Fenton, what would be the general process flow you're alluding to?

Step 1: Convert thousands of MS Word docs to text.

Step 2: Import to FileMaker Pro. (How?)

Step 3: ? . . .

Thanks again. You all have really helped a lot.

Fenton · May 17, 2008

Yes, you can probably find a Linux "antiword" command line tool. There is a "textutil" tool for Mac OS (10.4?), which works on Word files also. Or you can just use AppleScript and TextEdit, which can open a Word file and get its text.

Once the files are just text, you have a couple options.

1. One is to parse/extract the text using AppleScript and/or command line tools, such as grep and cut.

AppleScript can run Unix command line tools, using: do shell script "the command goes here". You can mix AppleScript variables in there too (outside the quotes). You can set AppleScript variables to the resulting values, then later set them into FileMaker fields, all with AppleScript.

You would likely be parsing the files one at a time, so as to put the values in separate records.

2. Or, you can just use the FileMaker Import Folder command. Though it's usually used for image files, it also supports importing a folder full of text files. The contents of the file go into a single FileMaker field, line returns and all. Then you could use a script with a loop, and FileMaker text functions to parse the field.

3. Or some unholy mixture of the two methods above :-]

You could do #3 in different ways. But there are also a few plug-ins which add grep capability to FileMaker. None free though, I don't think. There is a set of Custom Functions that are free however, somewhere....
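
As a minimal sketch of option 1, the snippet below pulls single values out of one converted text file with grep and cut. It assumes each field sits on its own line as "Label: value", which is a guess since no sample document has been posted; the `get_field` name and the sample contents are made up.

```shell
# Build a stand-in for one converted report (the layout is assumed).
sample=$(mktemp)
printf 'Gender: F\nMedicalRecordNumber: 12345\nAccessionNumber: A-0001\n' > "$sample"

# Print the value after "Label:" on the first line that starts with that label.
# Usage: get_field FILE LABEL
get_field() {
  grep "^$2:" "$1" | head -n 1 | cut -d ':' -f 2- | sed 's/^[[:space:]]*//'
}

gender=$(get_field "$sample" "Gender")
mrn=$(get_field "$sample" "MedicalRecordNumber")
echo "$gender,$mrn"

rm -f "$sample"
```

From AppleScript, the same pipeline could be passed to do shell script and the result set into a FileMaker field, one file per record.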

Edited May 17, 2008 by Guest
textutil, not rtf2txt

Richard Rönnbäck · May 17, 2008

Could you post a sample Word document?

nopposan · May 19, 2008

Thanks, Fenton. I think I'll go with the first example; I'll use some Linux software to convert the docs to text and then parse them. Maybe we'll parse with Perl as there's someone here who's familiar with writing Perl scripts.

As far as posting a sample report document, I'm a bit nervous about that as I don't want to be advertising the name of my company and I certainly don't want to accidentally reveal patient information. (Word docs are notoriously insecure.) Perhaps I could "anonymize" a report and convert it to pdf first in order to give you an idea of the layout. Would you like me to do that?

nopposan · May 19, 2008

Woops. Complication. The patient ID, birth date, gender, and outside medical record numbers are all given in an MS Word table that isn't read by the text converter in OpenOffice.org. 'Probably means that most doc-to-text converters won't read the table, I guess.

Ugh! I'll do a little research and ask around about this.

Brudderman · May 19, 2008

You don't say what version of Word you are using, but years ago there was a batch conversion wizard included with Word. It would process a batch of files, saving them in a different format. The macro as such doesn't appear to be there any more, but it does seem to be on the Microsoft support site:

http://support.microsoft.com/kb/826174/

If you can get this installed to work with your version of Word, it will probably do what you want.

James

www.james-mc.com

nopposan · May 19, 2008

Ahah!

catdoc -a -f ascii FILEname.doc > Newfile.txt

The catdoc software is able to convert the table to in-line tab-delimited text. Cool!
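
Once a table row arrives as tab-delimited text, turning it into CSV for import is a small step. The sketch below assumes one table row per line with cells separated by tabs; the `tsv_to_csv` name and the sample row (including its cell order) are invented, not taken from a real report.

```shell
# Convert tab-delimited lines on stdin to CSV, quoting every field
# and doubling any embedded double quotes.
tsv_to_csv() {
  awk -F '\t' '{
    for (i = 1; i <= NF; i++) {
      gsub(/"/, "\"\"", $i)
      printf "%s\"%s\"", (i > 1 ? "," : ""), $i
    }
    printf "\n"
  }'
}

# Stand-in for one table row as catdoc might emit it (cell order is assumed).
printf 'A-0001\tF\t12345\tSmith, J\n' | tsv_to_csv
```

Quoting every field keeps commas inside a cell from being read as column breaks on import.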

'Also found something called wvText that, if you install elinks, makes a pretty-print text file with the table borders printed with * | - + like on old dot-matrix printers. Tehe! But I don't want that 'cause it will just make parsing more difficult later. Neat though.

This is a big project for an amateur though, and there's other work to be tended to. So, to be continued. . .

nopposan · May 19, 2008

Wahoo!

for i in *.doc ; do catdoc -a -f ascii "$i" > "${i/%doc/txt}" ; done

Just cd into the directory and issue that command. All FileName.doc's are converted to FileName.txt.
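
A somewhat more defensive variant of the same loop, as a sketch: it quotes file names so spaces are safe, does nothing when the directory holds no .doc files, and reports any file the converter fails on. The `convert_docs` name and the optional converter argument are additions for illustration; the default is the same catdoc call as above.

```shell
# Convert every .doc in a directory to a .txt alongside it.
# Usage: convert_docs DIR [CONVERTER ARGS...]
# Default converter: catdoc -a -f ascii
convert_docs() {
  dir=$1
  shift
  if [ "$#" -eq 0 ]; then
    set -- catdoc -a -f ascii
  fi
  for f in "$dir"/*.doc; do
    [ -e "$f" ] || continue        # no .doc files: the glob stays literal
    out="${f%.doc}.txt"            # FileName.doc -> FileName.txt
    if ! "$@" "$f" > "$out"; then
      echo "failed: $f" >&2
      rm -f "$out"                 # do not leave partial output behind
    fi
  done
}
```

Running `convert_docs .` inside the directory does the same job as the one-liner. Passing a stand-in converter, for example `convert_docs . cat`, is a way to try the loop on copies before pointing it at the real archive.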

Whoopee!
