Removing extra carriage returns

October 24, 2011

I have text that I occasionally have to copy & paste from a newspaper PDF file. The text comes in formatted like this:

ELON—Elon University

announced it has surpassed

a $100 million fundraising

goal in what the school

calls its largest-ever

fundraising effort.

The private college has

raised nearly $106 million

in its “Ever Elon” campaign,

which was launched

three years ago. The college

will continue collecting

donations through the

end of the year.

I need to find a way to strip out the "extra" carriage returns without removing the ones that are supposed to be there. Usually there is a letter followed by a carriage return, followed by another letter. But sometimes the character before the return is a comma, a hyphen, a number, etc. If a period or a quote is followed by a carriage return, that return is usually one that is supposed to be there. I have tried pasting unformatted text into various programs, but it seems that whatever program the newspaper uses to create the PDF file hard-codes the carriage returns.

I would like the text to look like this:

ELON—Elon University announced it has surpassed a $100 million fundraising goal in what the school calls its largest-ever fundraising effort.

The private college has raised nearly $106 million in its “Ever Elon” campaign, which was launched three years ago. The college will continue collecting donations through the end of the year.

Has anyone dealt with a similar problem? Is there some kind of pattern count or similar function that I can use? Or some way to find each carriage return and evaluate the character preceding it before either keeping it or substituting it with nothing?

Right now I am removing all the carriage returns, but then I have to go back in and add the ones where each paragraph starts. Not a big deal for a short story, but a major pain for a long article!

Any help will be greatly appreciated!

October 25, 2011

I'd go back to the original text and examine it on character level for any differences between line and paragraph separators.
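One way to do that character-level check is to tally which break characters actually occur in the pasted text. The following is only a rough Python sketch; the function name `count_break_runs` and the set of characters it looks for are my own assumptions, not anything specific to the PDF in question.

```python
import re
from collections import Counter

# Characters that commonly act as line or paragraph breaks (an assumed list).
BREAK_RUN = re.compile(r"[\r\n\v\f\u2028\u2029]+")

def count_break_runs(text):
    """Count each distinct run of break characters, keyed by its code points."""
    counts = Counter()
    for match in BREAK_RUN.finditer(text):
        key = " ".join(f"U+{ord(ch):04X}" for ch in match.group())
        counts[key] += 1
    return counts

# Hypothetical sample: vertical tabs inside a paragraph, a return between paragraphs.
sample = "first line\vsecond line\rnew paragraph\vlast line"
for key, count in count_break_runs(sample).items():
    print(key, count)
```

If more than one key shows up (say, a single return in some places and a doubled return or a different code point in others), the two kinds of break can be told apart and no guessing is needed.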

If they are indeed the same, you are playing a guessing game - and there's no way you can win them all. For example, you could assume that any period, question mark or exclamation mark, followed by a carriage return is the end of a paragraph - but then all your paragraphs would be one sentence long. OTOH, a list like:

a) this;

b) that;

c) another

will be mangled into one line.
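To make the guessing game concrete, here is a minimal Python sketch of that heuristic, assuming that a line ending in a period, question mark, or exclamation mark (optionally followed by a closing quote) ends a paragraph and that every other line is joined to the next with a space. The name `join_wrapped_lines` is made up for illustration.

```python
import re

ENDERS = {".", "?", "!"}           # assumed paragraph-ending punctuation
CLOSING_QUOTES = "\"'\u201d\u2019"  # straight and curly closing quotes

def join_wrapped_lines(text):
    """Join hard-wrapped lines; start a new paragraph only after an ender."""
    lines = [ln.strip() for ln in re.split(r"\r\n|\r|\n", text)]
    paragraphs, current = [], []
    for line in lines:
        if not line:
            continue  # skip empty lines left by doubled returns
        current.append(line)
        # Look at the last character, ignoring any trailing closing quote.
        if line.rstrip(CLOSING_QUOTES)[-1:] in ENDERS:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)

wrapped = "fundraising effort.\nThe private college has\nraised nearly $106 million"
print(join_wrapped_lines(wrapped))

listing = "a) this;\nb) that;\nc) another"
print(join_wrapped_lines(listing))
```

The first sample comes out the way the original poster wants, but the list collapses into a single line, and any sentence that happens to end exactly at a line break will be split off as its own paragraph.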

October 25, 2011

Hi rekates,

I have text that I occasionally have to copy & paste from a newspaper PDF file. The text comes in formatted like this:

ELON—Elon University

announced it has surpassed

.... snip

As comment said, parsing or extracting text can be a mixed bag. The tool I use for this is TextWrangler, by Bare Bones Software (http://www.barebones...gler/index.html), a free text editor with grep pattern support, which comes in handy in this type of situation. Once I see what your text looks like, I can supply you with the Find and Replace grep patterns you'll need to deal with it.

There are many threads on FMForum dealing with this kind of need. To research further, do a search for Parse or Extract and Text.

BTW, I went to Elon's site, captured one of the articles, and pasted it into TextWrangler, and all of the paragraph returns were normal. I went to our local newspaper's site and did the same thing, and again all of the paragraph returns were normal. Perhaps you could give me the link to the particular text, and I'll try it with TextWrangler and see if I get the same result.

Lee