
This topic is 4797 days old. Please don't post here. Open a new topic instead.

Posted

I have text that I occasionally have to copy & paste from a newspaper PDF file. The text comes in formatted like this:

ELON—Elon University

announced it has surpassed

a $100 million fundraising

goal in what the school

calls its largest-ever

fundraising effort.

The private college has

raised nearly $106 million

in its “Ever Elon” campaign,

which was launched

three years ago. The college

will continue collecting

donations through the

end of the year.

I need to find a way to strip out the "extra" carriage returns without removing the ones that are supposed to be there. Usually there is a letter followed by a carriage return, followed by another letter. But sometimes the character before the return is a comma, a hyphen, a number, etc. If a period or a quote is followed by a carriage return, that return is usually one that is supposed to be there. I have tried pasting unformatted text into various programs, but it seems that whatever program the newspaper uses to create the PDF file hard-codes the carriage returns.

I would like the text to look like this:

ELON—Elon University announced it has surpassed a $100 million fundraising goal in what the school calls its largest-ever fundraising effort.

The private college has raised nearly $106 million in its “Ever Elon” campaign, which was launched three years ago. The college will continue collecting donations through the end of the year.

Has anyone dealt with a similar problem? Is there some kind of pattern count or similar function that I can use? Or some way to find each carriage return and evaluate the character preceding it before either keeping it or substituting it with nothing?

Right now I am removing all the carriage returns, but then I have to go back in and add the ones where each paragraph starts. Not a big deal for a short story, but a major pain for a long article!

Any help will be greatly appreciated!

Posted

I'd go back to the original text and examine it at the character level for any differences between line and paragraph separators.
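One way to do that check outside FileMaker is a short Python sketch like the one below. It is only an illustration: the function name `count_breaks` and the set of characters it looks for are my own assumptions, not something taken from the original text or from FileMaker. It tallies each kind of break character, so you can see whether the pasted text uses more than one kind.

```python
from collections import Counter

# Characters that commonly act as line or paragraph breaks.
# This list is an assumption; extend it if your text uses something else.
BREAK_LABELS = {
    "\r": "CR (U+000D)",
    "\n": "LF (U+000A)",
    "\x0b": "VT (U+000B)",
    "\u2028": "LINE SEPARATOR (U+2028)",
    "\u2029": "PARAGRAPH SEPARATOR (U+2029)",
}


def count_breaks(text):
    """Count each kind of break character, treating CR+LF as one unit."""
    counts = Counter()
    i = 0
    while i < len(text):
        if text[i:i + 2] == "\r\n":
            counts["CRLF"] += 1
            i += 2
            continue
        label = BREAK_LABELS.get(text[i])
        if label:
            counts[label] += 1
        i += 1
    return counts


if __name__ == "__main__":
    pasted = "first line\nsecond line\u2029next paragraph\n"
    for label, n in count_breaks(pasted).items():
        print(label, n)
```

If the tally shows two different kinds of break, one of them probably marks the real paragraphs and the job becomes a simple substitution. If it shows only one kind, you are back to guessing.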

If they are indeed the same, you are playing a guessing game - and there's no way you can win them all. For example, you could assume that any period, question mark or exclamation mark, followed by a carriage return is the end of a paragraph - but then all your paragraphs would be one sentence long. OTOH, a list like:

a) this;

b) that;

c) another

will be mangled into one line.
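To make the guessing game concrete, here is a minimal Python sketch of that rule. It assumes every break in the source is the same character, that blank lines carry no information, and that a line ending in a period, question mark, exclamation mark, or closing quote ends a paragraph. The name `rejoin_lines` and the `TERMINATORS` tuple are hypothetical, chosen only for this example.

```python
import re

# Assumed paragraph-ending characters: sentence punctuation plus
# straight and curly closing double quotes.
TERMINATORS = (".", "?", "!", '"', "\u201d")


def rejoin_lines(text):
    """Join hard-wrapped lines; start a new paragraph after a terminator."""
    # Split on any common break, trim each line, and drop empty lines.
    lines = [ln.strip() for ln in re.split(r"\r\n|\r|\n", text)]
    lines = [ln for ln in lines if ln]

    paragraphs = []
    current = []
    for ln in lines:
        current.append(ln)
        if ln.endswith(TERMINATORS):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing text that never hit a terminator
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "announced it has surpassed\n"
        "a $100 million fundraising\n"
        "goal.\n"
        "The private college has\n"
        "raised nearly $106 million."
    )
    print(rejoin_lines(sample))
```

Both failure modes show up right away. Two sentences that each happen to end a line are split into two paragraphs even if they belong together, and the a/b/c list is returned as the single line `a) this; b) that; c) another`. A line ending in a hyphen is also joined with a space, which may or may not be what you want.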

Posted

:iagree:

Hi rekates,

I have text that I occasionally have to copy & paste from a newspaper PDF file. The text comes in formatted like this:

ELON—Elon University

announced it has surpassed

.... snip

As comment said, parsing or extracting text can be a mixed bag. The tool I use for this is TextWrangler by Bare Bones Software (http://www.barebones...gler/index.html), a free text editor with grep pattern support, which can come in handy in this type of situation. Once I see what your text looks like, I can supply you with the Find and Replace grep patterns you'll need to deal with it.

There are many threads on FMForum dealing with this kind of need. To research further, do a search for Parse or Extract and Text.

BTW, I went to Elon's site, captured one of the articles, and pasted it into TextWrangler, and all of the paragraph returns were normal. I went to our local newspaper's site and did the same thing, and again all of the paragraph returns were normal. Perhaps you could give me the link to the particular text, and I'll try it with TextWrangler and see if I get the same thing.

Lee
