rekates (Newbies) Posted October 24, 2011

I have text that I occasionally have to copy and paste from a newspaper PDF file. The text comes in formatted like this:

ELON—Elon University announced it has surpassed a $100 million fundraising goal in what the school calls its largest-ever fundraising effort. The private college has raised nearly $106 million in its “Ever Elon” campaign, which was launched three years ago. The college will continue collecting donations through the end of the year.

I need to find a way to strip out the "extra" carriage returns without removing the ones that are supposed to be there. Usually there is a letter followed by a carriage return, followed by a letter. But sometimes the character before the return is a comma, a hyphen, a number, etc. If a period or a quote is followed by a carriage return, that return is usually one that is supposed to be there.

I have tried pasting unformatted text into various programs, but it seems that whatever program the newspaper uses to create the PDF file hard-codes the carriage returns. I would like the text to look like this:

ELON—Elon University announced it has surpassed a $100 million fundraising goal in what the school calls its largest-ever fundraising effort. The private college has raised nearly $106 million in its “Ever Elon” campaign, which was launched three years ago. The college will continue collecting donations through the end of the year.

Has anyone dealt with a similar problem? Is there some kind of pattern count or similar function that I can use? Or some way to find each carriage return and evaluate the character preceding it before either keeping it or substituting it with nothing?

Right now I am removing all the carriage returns, but then I have to go back in and add the ones where each paragraph starts. Not a big deal for a short story, but a major pain for a long article! Any help will be greatly appreciated!
comment Posted October 25, 2011

I'd go back to the original text and examine it at the character level for any differences between line separators and paragraph separators. If they are indeed the same character, you are playing a guessing game, and there's no way you can win them all. For example, you could assume that any period, question mark or exclamation mark followed by a carriage return is the end of a paragraph, but then every sentence that happens to end at a line break would start a new paragraph, and in the worst case all your paragraphs would be one sentence long. OTOH, a list like:

a) this;
b) that;
c) another

will be mangled into one line.
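To make the guessing game above concrete, here is a minimal sketch in Python (not FileMaker, and not anything posted in this thread). It assumes the pasted text is available as a plain string. The names `count_separators`, `unwrap` and `PARAGRAPH_ENDERS` are made up for illustration, and the rule that a break is real only after a period, question mark, exclamation mark or closing quote is exactly the assumption described above, with the same weaknesses.

```python
from collections import Counter

# Characters that can act as line or paragraph separators in pasted text.
SEPARATORS = {
    "\r": "CR",
    "\n": "LF",
    "\x0b": "VT",
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}

# Assumption: a break is a genuine paragraph end only after one of these.
PARAGRAPH_ENDERS = (".", "?", "!", '"', "\u201d")


def count_separators(text):
    """Tally each separator character in text, keyed by a readable name."""
    return Counter(SEPARATORS[ch] for ch in text if ch in SEPARATORS)


def unwrap(text):
    """Join hard-wrapped lines, keeping breaks that follow PARAGRAPH_ENDERS."""
    # Normalize CRLF and bare CR so only "\n" needs to be handled.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    paragraphs = []
    current = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not current:
            current = line
        elif current.endswith("-"):
            # A trailing hyphen is glued to the next line ("largest-" + "ever").
            current += line
        else:
            current += " " + line
        # Evaluate the character preceding the break.
        if line.endswith(PARAGRAPH_ENDERS):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


sample = (
    "ELON\u2014Elon University announced it has\r"
    "surpassed a $100 million goal in its largest-\r"
    "ever effort.\r"
    "The private college has raised\r"
    "nearly $106 million."
)
print(count_separators(sample))
print(unwrap(sample))
```

If `count_separators` reports two different characters, one of them may mark the real paragraph ends and no guessing is needed. If it reports only one, `unwrap` is a best-effort heuristic: any sentence that ends exactly at a line break becomes its own paragraph, and a list whose items end in semicolons is still joined into one line.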
Lee Smith Posted October 25, 2011

Hi rekates,

Quote: "I have text that I occasionally have to copy & paste from a newspaper PDF file. The text comes in formatted like this: ELON—Elon University announced it has surpassed ...." (snip)

As comment said, parsing or extracting text can be a mixed bag.

The tool I use for this is TextWrangler, by Bare Bones Software, http://www.barebones...gler/index.html, a free text editor with grep pattern capabilities, which can come in handy when dealing with this type of situation. Once I see what your text looks like, I can supply you with the Find and Replace grep patterns you'll need to deal with it.

There are many threads on FMForum dealing with this kind of need. To research further, do a search for Parse or Extract and Text.

BTW, I went to Elon's site, captured one of the articles and pasted it into TextWrangler, and all of the paragraph returns were normal. I went to our local newspaper site and did the same process, and again all of the paragraph returns were normal.

Perhaps you could give me the link to the particular text, and I'll try it with TextWrangler and see if I get the same thing.

Lee
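No actual grep pattern was posted in this thread, so the following is only a guess at the kind of find-and-replace that could apply, written with Python's `re` module rather than TextWrangler. The patterns, the name `unwrap_with_regex`, and the choice of which characters mark a real paragraph end are all assumptions for illustration.

```python
import re

# Hypothetical pattern: a hyphen at the end of a line, between two word characters.
HYPHEN_BREAK = re.compile(r"(?<=\w)-\n(?=\w)")

# Hypothetical pattern: a line break NOT preceded by . ? ! " ” or another break.
SOFT_BREAK = re.compile(r'(?<![.?!"”\n])\n')


def unwrap_with_regex(text):
    """Replace soft line breaks with spaces, keeping breaks after end punctuation."""
    # Normalize CRLF and bare CR first so the lookbehind only ever sees "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split after a hyphen ("largest-" + "ever") without a space.
    text = HYPHEN_BREAK.sub("-", text)
    return SOFT_BREAK.sub(" ", text)


wrapped = "it has\rsurpassed its largest-\rever goal.\rThe college\rwill continue."
print(unwrap_with_regex(wrapped))
```

The negative lookbehind `(?<!...)` is what evaluates the character before each return. A grep-capable editor with the same lookbehind syntax could use a similar search pattern with a single space as the replacement, but that would need checking against the editor's own documentation.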