extract data from pdf cells

Noél Dubau · November 7, 2012

Hello,

I have a PDF file which consists of a table like a spreadsheet; every cell can have one or more lines. Using iText and ScriptMaster I can read it line by line, but that doesn't help me enough.

Is there a way to parse a PDF and get the content of every cell?

I'm attaching the PDF and the text file I can get now...

Thanks for your help

Noël

PDF_INE.txt

PDF_INE.pdf

qube99 · November 7, 2012

I looked at it in Acrobat Pro and I don't think it's either a data file, a table or a form. It doesn't appear to be in any of the PDF data formats.

Noél Dubau · November 7, 2012

Thanks for having taken a look at my question.

When I check in Acrobat X Pro I get the information you can see in the attached screen copy. It seems to be a PDF?? I only modified the original in Acrobat to change names.

Don't understand...

Noël

qube99 · November 7, 2012

FMP will normally import structured PDF data just fine but I don't see any data structure in Acrobat Pro. I don't know a way to import this into FMP as data without some elaborate scripting. Maybe someone else can help.

john renfrew · November 7, 2012

Noel

The answer is really complex, as you have to have a text extraction strategy which 'knows' what you are trying to parse.

Any library will give you back something more or less like that, because how you interpret the lines on the page along with the text and numbers is a matter of semantics, not of how the document is constructed.

The PDF just says: draw a line from here to here, draw another from here to here, draw this glyph in this font here, and put this one at this place. The instructions may appear in a jumbled order, as the job of the 'reader' application is to reconstruct the document by doing what it is told.

It is us who impart meaning to the relative position of the marks on the page.

And the concepts of white space and tabs do not exist in PDF either.

So the simple answer is that you will have to parse this yourself, either in Groovy first or in FM, for it to be of any use.

John

Interestingly, this is what I got out of the file from iText, which is nowhere near as comprehensive as yours:

noel.txt

qube99 · November 8, 2012

You're probably thinking of PostScript, which is what PDF is based on. But PDF has a bunch of extra features, including tables. You can extract a PDF table for other uses (FMP, Excel, Word, HTML, etc.). Alas, the PDF provided didn't have a PDF table in it.

fseipel · November 8, 2012

If you 'Save as Excel' (a function available in Acrobat X Pro, which from your message you have), you will generate a cleanly formatted file that you can import into FileMaker and that preserves the row/column structure present in the PDF file. If this needs to run on computers with only Reader, conversion is still possible, but you have to use Adobe's online service and buy a subscription, i.e. it's not included as it is in Acrobat Pro. You will still need to import into a temporary table, since there are several rows before the table begins; the script can simply skip rows until it hits the headings row.
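As a rough illustration of the "skip until the headings row" step, here is a minimal Python sketch. It assumes the exported sheet has already been read into a list of rows (each row a list of cell strings) and that the headings row can be recognised by the text of its first cell. The function name from_header and the label "Name" are hypothetical, not taken from the actual file.

```python
def from_header(rows, header_first_cell):
    """Return the headings row and everything below it.

    rows: list of rows, each a list of cell strings.
    header_first_cell: text expected in the first cell of the headings row
    (a hypothetical label; adjust it to the real file).
    """
    for i, row in enumerate(rows):
        # Ignore empty rows and compare the first cell without padding.
        if row and row[0].strip() == header_first_cell:
            return rows[i:]
    # No headings row found: nothing to import.
    return []


if __name__ == "__main__":
    sheet = [
        ["Some report title"],
        [],
        ["Name", "Town"],
        ["Dupont", "Paris"],
    ]
    for row in from_header(sheet, "Name"):
        print(row)
```

The same logic could be written as a FileMaker script loop over the temporary table, deleting records until the headings record is reached.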

Automating this process would be possible with AppleScript, OLE, etc. If you don't have a huge number of files to process, simply saving as Excel and importing with an FM script ought to work. This may not be a great solution if there are thousands of files to process. Of course, you can also parse the text back into rows/columns.

Noél Dubau · November 8, 2012

Hello,

Thanks to John and qube99 for their "lessons", which I appreciate! Thanks even more to Frank Seipel for that solution, which works very well but requires the full version of Acrobat... I just tried with Adobe Reader on my Mac: the export to Excel is only possible for those with a paid account :hmm: Indeed, AS or OLE would have been the solution!

Noël

PS: Frank, your Amazon script is still in use! Thanks again!

qube99 · November 8, 2012

"If you 'Save as Excel' (a function available in Acrobat X Pro ..."

Acrobat Pro 7 doesn't have that.

fseipel · November 9, 2012

qube99: v7 is circa 2005 so that is a rather old version; I think they added export to Excel in ~v9 circa 2008. Current version is 11. I'm still using 10.

Hi again Noél,

The OLE isn't too bad; the JavaScript Guide in the Acrobat SDK covers this pretty well. I had to do some work with footers/headers a while back. Sorry if my original message wasn't very clear about this not being included in Reader (any version).

This VBScript will convert a PDF to Excel. I'm assuming one could convert this to Java or Groovy and run it directly in ScriptMaster; alternatively, build it into a global field or variable, export it to a file, and run the VBS that way.

To convert to SM you'd probably need JACOB for COM communications.

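' Requires Acrobat Pro (not Reader) on Windows.
Set AcroApp = CreateObject("AcroExch.App")
Set PDDoc = CreateObject("AcroExch.PDDoc")
' Open returns True when the file could be opened.
If PDDoc.Open("C:\PDF_INE.pdf") Then
    AcroApp.Show
    ' The JavaScript object exposes Acrobat's saveAs method.
    Set jso = PDDoc.GetJSObject
    ' "com.adobe.acrobat.xlsx" is the conversion ID for an Excel workbook.
    jso.SaveAs "C:\temp.xlsx", "com.adobe.acrobat.xlsx"
    PDDoc.Close
End If
AcroApp.CloseAllDocs
AcroApp.Hide
AcroApp.Exit
' Release the COM objects.
Set jso = Nothing
Set PDDoc = Nothing
Set AcroApp = Nothing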

For example, save it as test.vbs in Notepad. Of course, FM could execute this directly or by using a run shell script command.

Noél Dubau · November 9, 2012

Hello Frank

Thanks for that VB script. Before I try it, and because I'm not very well informed about this: will this solution run on both Mac and Windows?

Have a good day !

Noël

Oops: I hadn't seen "I'm assuming one could convert this to Java or Groovy and run directly in ScriptMaster", which is certainly the right way.

fseipel · November 9, 2012

Noél, that will run only on Windows. For Mac you'll need to use AppleScript. Of course, Acrobat Pro 9.0 or above is still required for this approach. If converted to SM using the JACOB COM library, it will still run only on Windows, because Macs don't use COM.

Noél Dubau · November 9, 2012

OK ! Thanks !

Noël

fseipel · November 11, 2012

Noël, one other comment on this: you can actually determine where the lines begin and end with iText by parsing through the tokens of the content stream. You should find 'l' for line or 're' for rectangle. I would think you could then simply keep two arrays, one for the horizontal line positions and one for the vertical line positions. The FilteredTextRenderListener class, combined with a RegionTextRenderFilter that accepts a rectangle parameter, could then be used to extract the text in each 'cell', albeit with a speed penalty from the repeated calls. This would have the advantage of being cross-platform and of not requiring Acrobat Pro or the conversion subscription.
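To make the "two arrays" idea concrete, here is a minimal Python sketch of the grouping step only. It does not call iText. It assumes the line positions and the text fragments, each as an (x, y, text) tuple with one fragment per line of text, have already been extracted by some other means. The names cell_index and group_into_cells are hypothetical, and y is assumed to increase upward as in PDF user space.

```python
from bisect import bisect_right
from collections import defaultdict


def cell_index(pos, boundaries):
    """Index of the interval in sorted `boundaries` containing pos, else None."""
    i = bisect_right(boundaries, pos) - 1
    if i < 0 or i >= len(boundaries) - 1:
        return None
    return i


def group_into_cells(fragments, x_lines, y_lines):
    """Map (row, col) to the text found in that cell.

    fragments: iterable of (x, y, text) tuples, one per line of text.
    x_lines / y_lines: positions of the vertical / horizontal rules.
    Row 0 is the top row, because y grows upward in PDF coordinates.
    """
    xs = sorted(set(x_lines))
    ys = sorted(set(y_lines))
    cells = defaultdict(list)
    for x, y, text in fragments:
        col = cell_index(x, xs)
        band = cell_index(y, ys)
        if col is None or band is None:
            continue  # fragment lies outside the grid
        row = len(ys) - 2 - band
        cells[(row, col)].append((x, y, text))
    # Inside a cell, read top to bottom, then left to right.
    return {
        key: "\n".join(t for _, _, t in sorted(frags, key=lambda f: (-f[1], f[0])))
        for key, frags in cells.items()
    }


if __name__ == "__main__":
    frags = [(10, 80, "Name"), (110, 80, "Dupont"), (110, 70, "Jean")]
    print(group_into_cells(frags, [0, 100, 200], [0, 50, 100]))
```

Grouping all fragments in a single pass like this avoids re-reading the page once per cell, at the cost of having to collect the fragment coordinates first.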

Also, some of the free tools, such as xpdf's pdftotext, include command-line switches such as -layout to preserve the physical layout upon conversion. That typically preserves the table structure to some extent (text is not perfectly aligned, and there will be a variable number of spaces between columns in successive rows), but at least it includes the carriage returns.
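Because the number of spaces between columns varies from row to row, one simple way to read such layout-preserved text is to split each line on runs of two or more spaces. The Python sketch below assumes that values inside a cell never contain two consecutive spaces and that no cell is empty (an empty cell would shift the columns that follow it). The function names are hypothetical.

```python
import re


def split_layout_line(line):
    """Split one layout-preserved line into fields on runs of 2+ whitespace."""
    return [field for field in re.split(r"\s{2,}", line.strip()) if field]


def parse_layout_text(text):
    """Return a list of rows (lists of fields), skipping blank lines."""
    return [split_layout_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = (
        "Name          Town      Age\n"
        "Dupont        Paris      42\n"
        "\n"
        "Martin Paul     Lyon    37\n"
    )
    for row in parse_layout_text(sample):
        print(row)
```

Cells that span several lines will still come out as several rows, so a later pass would have to merge them, for instance by treating a row with missing leading columns as a continuation of the previous one.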

Noél Dubau · November 11, 2012

Frank

Thank you for trying to help me with this problem; I understand the ideas you suggest for solving it, but I don't have the level required to apply this advice.

I'm going to search Google for that class and these tools; perhaps a sample will open my eyes!

Thanks once more

Noël
