Noél Dubau Posted November 7, 2012
Hello, I have a PDF file which consists of a table like a spreadsheet; every cell can have one or more lines. Using iText and ScriptMaster I can read it line by line, but that doesn't help me enough. Is there a way to parse a PDF and get the content of every cell? I attach the PDF and the text file I can get now. Thanks for your help. Noël
Attachments: PDF_INE.txt PDF_INE.pdf
qube99 Posted November 7, 2012
I looked at it in Acrobat Pro and I don't think it's a data file, a table or a form. It doesn't appear to be in any of the PDF data formats.
Noél Dubau Author Posted November 7, 2012
Thanks for having taken a look at my question. When I check in Acrobat X Pro I get the information you can see in the attached screen copy. It seems to be a PDF?? I had only modified the original in Acrobat to change names. I don't understand... Noël
qube99 Posted November 7, 2012
FMP will normally import structured PDF data just fine, but I don't see any data structure in Acrobat Pro. I don't know of a way to import this into FMP as data without some elaborate scripting. Maybe someone else can help.
john renfrew Posted November 7, 2012
Noel, the answer is really complex, as you have to have a text extraction strategy which 'knows' what you are trying to parse. Any library will give you back something more or less like that, because how you interpret the lines on the page with text and numbers is a semantic question, not a matter of how the document is constructed.
The PDF just says draw a line from here to here, draw another from here to here, draw this glyph in this font here, and put this one at this place, and the instructions may appear jumbled up in order, as the job of the 'reader' application is to reconstruct the document by doing what it is told. It is we who impart meaning to the relative position of the marks on the page. And the concepts of white space and tabs do not exist in PDF either.
So the simple answer is that you will have to parse this yourself, either in Groovy first or in FM, for it to be of any use. John
Interestingly, this is what I got out of the file from iText, which is nowhere near as comprehensive as yours.
Attachment: noel.txt
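To make the point about jumbled drawing instructions concrete, here is a minimal sketch in Python (not iText or Groovy) of how positioned text fragments could be put back into reading order. The `(x, y, text)` tuples, the function name `group_into_lines` and the `y_tolerance` value are all assumptions for illustration; a real extraction library would supply the coordinates.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into text lines, top to bottom."""
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    # PDF y coordinates grow upward, so sort by descending y to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current line's baseline: same line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the fragments left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Hypothetical fragments, deliberately out of order.
fragments = [(200, 700, "Qty"), (50, 700.5, "Name"),
             (50, 680, "Bolt"), (200, 679.5, "12")]
print(group_into_lines(fragments))
```

This only recovers lines, not cells; deciding which fragments belong to which cell still needs knowledge of the table's ruling lines or column positions.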
qube99 Posted November 8, 2012
You're probably thinking of PostScript, which PDF is based on. But PDF has a bunch of extra features, including tables. You can extract a PDF table for other uses (FMP, Excel, Word, HTML, etc.). Alas, the PDF provided didn't have a PDF table in it.
fseipel Posted November 8, 2012
If you 'Save as Excel' (a function available in Acrobat X Pro, which, from your message, you have), you will generate a cleanly formatted file you can import into FileMaker that preserves the row/column structure present in the PDF file. If this needs to run on computers with only Reader, conversion is still possible, but you have to use Adobe's online service and buy a subscription, i.e. it's not included as it is in Acrobat Pro.
You will still need to import into a temporary table, since there are several rows before the table begins, so the script will simply skip rows until it hits the headings row. Automating this process would be possible with AppleScript, OLE, etc. If you don't have a huge number to process, simply saving as Excel and importing using an FM script ought to work. This may not be a great solution if there are thousands of files to process. Of course, you can also parse the text back into rows/columns.
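The "skip until the headings row" step can be sketched outside FileMaker as well. The following Python sketch assumes the exported sheet has already been read into a list of rows (for example with the standard `csv` module after saving as CSV); the function name `rows_after_header` and the header text `"Name"` are hypothetical, since the real column headings depend on the file.

```python
def rows_after_header(rows, header_marker):
    """Yield one dict per data row following the first row that contains header_marker."""
    header = None
    for row in rows:
        cells = [c.strip() for c in row]
        if header is None:
            # Still in the preamble: look for the headings row, skip everything else.
            if header_marker in cells:
                header = cells
            continue
        if any(cells):  # ignore completely empty rows
            yield dict(zip(header, cells))


# Hypothetical sheet: two preamble rows, a headings row, then data.
sheet = [["Report"], [], ["Name", "Qty"], ["Bolt", "12"], ["", ""], ["Nut", "7"]]
for record in rows_after_header(sheet, "Name"):
    print(record)
```

In FileMaker the same idea would be a looping script over the temporary table that deletes or skips records until the headings record is found.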
Noél Dubau Author Posted November 8, 2012
Hello, thanks to John and qube99 for their "lessons", which I appreciate! Many thanks to Frank Seipel for that solution, which works very well but requires the full version of Acrobat... I just tried with Adobe Reader on my Mac: export to Excel is possible only for those with a paid account. Indeed, AppleScript or OLE would be the solution! Noël
PS: Frank, your Amazon script is still in use! Many thanks!
qube99 Posted November 8, 2012
"If you 'Save as Excel' (a function available in Acrobat X Pro..."
Acrobat Pro 7 doesn't have that.
fseipel Posted November 9, 2012
qube99: v7 is circa 2005, so that is a rather old version; I think they added export to Excel in ~v9, circa 2008. The current version is 11. I'm still using 10.
Hi again Noél, the OLE isn't too bad; the JavaScript Guide/Acrobat SDK covers this pretty well. I had to do work with footers/headers a while back. Sorry if my original message wasn't very clear about this not being included in Reader (any version). This VBScript will convert a PDF to Excel. I'm assuming one could convert this to Java or Groovy and run it directly in ScriptMaster, or build the script in a global field or variable, export it to a file, and run the VBS that way. To convert to SM you'd probably need JACOB for COM communications.

    ' Start Acrobat and open the PDF through OLE
    Set Acroapp = CreateObject("AcroExch.App")
    Set PDDoc = CreateObject("AcroExch.PDDoc")
    result = PDDoc.Open ("C:PDF_INE.pdf")
    Acroapp.Show
    ' Use the JavaScript bridge to save the document as an Excel workbook
    Set jso = PDDoc.GetJSObject
    jso.SaveAs "C:temp.xlsx","com.adobe.acrobat.xlsx"
    ' Close the document and shut Acrobat down
    PDDOC.Close
    AcroApp.CloseAllDocs
    Acroapp.Hide
    AcroApp.Exit
    Set Acroapp=Nothing
    Set PDDOC=Nothing

E.g. save this as test.vbs in Notepad. Of course FM could execute this directly or by using a run shell script command.
Noél Dubau Author Posted November 9, 2012
Hello Frank, thanks for that VBScript. Before I try it, and because I'm not very well informed about this: will this solution run on both Mac and Windows? Have a good day! Noël
Oops: I hadn't seen "I'm assuming one could convert this to Java or Groovy and run directly in ScriptMaster", which is certainly the right way.
fseipel Posted November 9, 2012
Noél, that will run only on Windows. For Mac you'll need to use AppleScript. Of course, Acrobat Pro 9.0 or above is still required for this approach. If converted to SM using the JACOB COM library, it will still run only on Windows, because Macs don't use COM.
fseipel Posted November 11, 2012
Noël, one other comment on this: you can actually determine where the lines begin and end with iText, by parsing through the tokens of the page content. You should find 'l' for a line or 're' for a rectangle. I would think you could then simply keep two arrays, one for the horizontal line positions and one for the vertical line positions. The FilteredTextRenderListener class, combined with a RegionTextRenderFilter that accepts a rectangle parameter, could then be used to parse the text in each 'cell', albeit with a speed penalty from the repeated calls. This would have the advantage of being cross-platform and of not requiring Acrobat Pro or the conversion subscription.
Also, some of the free tools, such as xpdf's pdftotext, include command line switches such as -layout to preserve the physical layout upon conversion. Typically that somewhat preserves the table structure (text is not perfectly aligned and there will be a variable number of spaces between columns in successive rows), but at least it includes the carriage returns.
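The two-array idea can be illustrated without iText. This is a minimal Python sketch, assuming the ruling-line positions and the `(x, y, text)` fragments have already been extracted by some other means; the function name `assign_to_cells` and all coordinates are hypothetical. It buckets every fragment into the (row, column) cell whose borders enclose it, so a cell with several text lines comes back as one value.

```python
from bisect import bisect_right


def assign_to_cells(fragments, x_lines, y_lines):
    """Map (x, y, text) fragments to (row, col) cells bounded by ruling lines."""
    xs = sorted(set(x_lines))                # vertical rulings, left to right
    ys = sorted(set(y_lines), reverse=True)  # horizontal rulings, top edge first
    neg_ys = [-y for y in ys]                # ascending copy so bisect can be used
    cells = {}
    for x, y, text in fragments:
        col = bisect_right(xs, x) - 1
        row = bisect_right(neg_ys, -y) - 1
        # Keep only fragments that fall inside the grid.
        if 0 <= col < len(xs) - 1 and 0 <= row < len(ys) - 1:
            cells.setdefault((row, col), []).append((-y, x, text))
    # Inside a cell, read top to bottom, then left to right.
    return {k: " ".join(t for _, _, t in sorted(v)) for k, v in cells.items()}


# Hypothetical 2 x 2 grid; the first data cell holds two text lines.
fragments = [(50, 700, "Name"), (200, 700, "Qty"), (50, 680, "Bolt"),
             (50, 670, "M8"), (200, 675, "12"), (10, 700, "stray")]
print(assign_to_cells(fragments, x_lines=[40, 150, 300], y_lines=[720, 690, 660]))
```

Compared with calling a region filter once per cell, a single pass like this avoids the repeated-call penalty, at the cost of having to collect the fragment coordinates yourself.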
Noél Dubau Author Posted November 11, 2012
Frank, thank you for trying to help me with that problem. Even if I understand the ideas you suggest, I don't have the level required to apply this advice. I'm going to search Google for that class and these tools; perhaps a sample will open my eyes! Thanks once more. Noël
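As a small sample for the -layout route mentioned above: since layout-preserving text output separates columns by a variable number of spaces, one rough approach is to split each line wherever two or more spaces occur. This Python sketch assumes the text has already been produced by such a tool; `split_layout_line` and the `min_gap` threshold are hypothetical, and the sample line is invented.

```python
import re


def split_layout_line(line, min_gap=2):
    """Split one layout-preserved text line into column values."""
    # A run of at least min_gap spaces is treated as a column separator;
    # single spaces stay inside the value.
    return [c for c in re.split(r" {%d,}" % min_gap, line.strip()) if c]


print(split_layout_line("Bolt M8      12    steel"))
```

Note that an empty cell simply vanishes with this approach, so values can shift into the wrong column; matching against fixed column positions would be more robust for tables with blanks.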