Parsing and concatenating imported listserv records

Christopher_Campbell · May 6, 2010

Hello all,

I have used FileMaker for many years, but only in fairly rudimentary ways. As one of a group of people doing audio preservation work, I recently downloaded the listserv archives for Studer open-reel tape recorders and found that they are extremely difficult to use as huge, continuous blocks of text. Last night I spent a few hours massaging and cleaning up the text a bit in TextWrangler, and have now imported the seven archived text files into FileMaker in the most basic mode, where FileMaker supplies simple numbered fields. The basic structure of the listserv is intact, and there are nearly 11,000 records.

I would now like to make this database reasonably usable in a conventional sense. I know how to do some basic text parsing, but in this case, for example, I can't count on the "Subject:" line to always be in a given field. So I have two questions:

1. How can I parse the record and extract some basic fields such as "Date," "Sender," "Subject," when I don't have a fixed field location for that information?

2. How can I concatenate the message "body" back into a single field, when it is now splintered into as many as 150 fields? This is the result of converting the line breaks that ended each line into tab characters, and preserving only the last line break as a record delimiter. I realize that this would clearly involve some sort of calculation that would assemble all fields falling after a field that begins "Reply-To: [email protected]", but how to do it, and then remove the individual lines is beyond my current expertise.
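For readers following along outside FileMaker, the second question can be sketched in Python. This is only an illustration of the idea, not the method used in the attached files: it assumes each record arrives as a list of field strings, that the last header field begins with "Reply-To:", and that everything after it is body text. The function name and the sample address are hypothetical.

```python
def split_header_and_body(fields, marker="Reply-To:"):
    """Return (header_fields, body_text) for one imported record.

    `fields` is the list of field values for a single record, in order.
    `marker` is the assumed prefix of the last header field.
    """
    for i, value in enumerate(fields):
        if value.startswith(marker):
            header = fields[: i + 1]
            body_fields = fields[i + 1 :]
            # Drop trailing empty fields left over from the import.
            while body_fields and not body_fields[-1].strip():
                body_fields.pop()
            # Each field was one line of the message, so rejoin with newlines.
            return header, "\n".join(body_fields)
    # No marker found: treat everything as body so nothing is lost.
    return [], "\n".join(fields)


# Hypothetical record; the address is a placeholder, not a real one.
record = ["From: a", "Subject: Heads", "Reply-To: list@example.invalid",
          "Line one", "", "Line two", "", ""]
header, body = split_header_and_body(record)
```

Once the body has been rejoined into one value, the individual line fields are no longer needed and can be dropped.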

I've attached a small sample containing a few records, and would be most grateful for any advice on how to proceed.

Studer_list_sample.fp7.zip

RodSierra · May 6, 2010

Well, from my POV, I would not import the file as a delimited file with separate records. Import or drag the entire text file into one global field, then use calculated text fields to extract the data into the format you want. Once you're satisfied the calc fields are working properly, change them to text fields and select the auto-enter option; you'll find the calc is already listed there. Doing the parsing is a matter of using the text functions.

Lee Smith · May 6, 2010

Is this a text archive, or can you view the list online?

Can you provide an address where I (we) can see the archive?

comment · May 6, 2010

You could try the attached as a starting point - however, I too think it would be best to do some more pre-processing in a text editor.

Studer_list_sample1.zip

Christopher_Campbell · May 6, 2010

As the Studer archive files span nearly ten years, the format changes slightly from chunk to chunk. However, I've saved a sample from one so that anyone interested can see one of the original formats. It's wonderful that so many of you experienced users take note of these posts so quickly!

Lee, the Studer forum is open only to accepted members, but is located here:

http://tech.groups.yahoo.com/group/STUDER/

Studer_listserv_text_sample.txt.zip

Edited May 6, 2010 by Guest

Christopher_Campbell · May 6, 2010

Comment, the work of your script is simply lovely, and gives me hope that the many former users of the Studer list will soon have an extremely useful tool. As for your suggestion about doing more pre-processing, could I please ask you to be more specific? Once I had all seven text segments sequenced in TextWrangler, I found I had some 22 million characters, 3.4 million lines and 10,623 messages of consistent inconsistency, so it was far from clear to me how I could clarify the structure without damaging or losing any information!

Thanks so much.

Lee Smith · May 6, 2010

I think you can do this all in TextWrangler. How familiar are you with Grep Patterns and Regular Expressions?

Lee

Christopher_Campbell · May 6, 2010

Lee, until last night I would have said that I was only slightly familiar with that level of searching/replacing, but I did do a quick read through the TextWrangler 3.1 user manual on grep searching, and have a close friend who is truly expert, so I would be entirely willing to go in that direction if that makes the most sense.

comment · May 6, 2010

I'll let Lee handle the details, since he knows more about grep than I, but basically it's a similar process. I believe it would be best to replace all returns with a character such as ¶ (which you will later replace in FileMaker with a return), and all tabs with spaces. Then use the constants to insert a tab in between fields and a return in between records. Do some cleanup, e.g. runs of spaces to a single space, and you're ready to import.
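The cleanup steps described above can be sketched in Python, for anyone who would rather script them than do them by hand in a text editor. This is a rough illustration under assumptions: it takes records that have already been split into field values (the hard part, which depends on the archive's format), and the function names are made up for the example.

```python
import re

PILCROW = "\u00b6"  # stands in for a return inside a field


def clean_field(text):
    """Make one field value safe for a tab-delimited import."""
    # Normalize the different line-ending styles to a single newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Tabs are reserved as the field separator, so turn them into spaces.
    text = text.replace("\t", " ")
    # Collapse runs of spaces to a single space.
    text = re.sub(r" {2,}", " ", text)
    # Remaining newlines become the placeholder, to be restored after import.
    return text.strip().replace("\n", PILCROW)


def to_import_text(records):
    """Join fields with tabs and records with returns."""
    lines = ["\t".join(clean_field(f) for f in fields) for fields in records]
    return "\n".join(lines) + "\n"


cleaned = clean_field("Tape  path\talignment\r\nsecond line")
export = to_import_text([["a", "b"], ["c", "d\ne"]])
```

After import, the placeholder character is swapped back to a return inside FileMaker, as described above.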

"the many former users of the Studer list will soon have an extremely useful tool."

LOL, isn't that like a tool for the former builders of pyramids in Egypt?

Christopher_Campbell · May 6, 2010

Comment, it's more like a tool for the guys responsible for preserving the pyramids. The estimate is that there exists something like 50 million hours of recorded audio tape, and all of it is deteriorating. Analog-to-digital converters are now of truly excellent quality, and so the race is on to preserve the most valuable of all that material. Studer decks have long been the gold standard for analog recorders, especially in the delicacy of their tape handling, so making this material accessible will be wonderful.

Christopher_Campbell · May 8, 2010

I've had some terrific support here, and now hope that someone can answer a quick TextWrangler question. I'm still cleaning up the listserv text file, and it's going very well. I've got a huge text file in which I've replaced line feeds with tabs, and displayed it so that the data is in columns. I've learned how to transpose columns when necessary, but something that I'm sure is relatively simple is eluding me in studying the User Manual. In email messages without a header, i.e. in a given line, I want to simply push all the content several columns to the right with tab characters, starting with the first column.

I can find the lines I want with a wildcard search: ^[a-z].*?\t

But obviously I can't just replace it with a string of three tabs, as that replaces the content of that line. I need an expression that will preserve whatever content is delimited by the beginning of the line (or, I suppose, the prior line feed), and the first tab, and then simply pushes it to the right three columns.
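The usual way to keep the matched text is to wrap it in a capturing group and put a reference to that group in the replacement. The sketch below shows the idea with Python's `re` module; it is an illustration only, and the exact replacement syntax in TextWrangler should be checked against its manual. The function name and sample text are invented for the example.

```python
import re

# Lines that start with a lowercase letter and contain a tab.
# The parentheses capture everything up to and including the first tab.
HEADERLESS = re.compile(r"^([a-z].*?\t)", re.MULTILINE)


def push_right(text, columns=3):
    """Prefix matching lines with `columns` tabs, keeping their content."""
    return HEADERLESS.sub(lambda m: "\t" * columns + m.group(1), text)


sample = "Date: May 6\tFrom: x\nthanks for the tip\tmore\nno tab here\n"
shifted = push_right(sample)
```

Lines beginning with a capital letter, such as header lines, are left alone because the pattern is case-sensitive, and lines without any tab are not matched at all.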

Thanks very much.

P.S. I've ordered Friedl's Mastering Regular Expressions, but don't have a copy yet.

comment · May 8, 2010

I am not an expert on this, so I'd just find \r and replace with \r\t\t\t (not grep). Then do the first line manually...

Christopher_Campbell · May 8, 2010

Comment, thanks for the suggestion. The trouble is that such a method is effectively non-selective, and I need to be able to pick and choose the lines that need moving.

Christopher_Campbell · May 23, 2010

Finally, today I was able to import all the listserv records into FileMaker. Much of the work I had to do in processing the text files before import was related to the highly irregular nature of hundreds of the records (many of which had to be formatted by hand after I got as far as possible with TextWrangler and grep). Because of this unpredictability, it was not in the end feasible to end up with the three key fields (Date, From and Subject) in definitive field positions. Would it be hard for someone to show me the format for a short sample script that would parse just one of those lines, such as "Subject," and which I could then modify using the other two search terms? It would be lovely to present the database to interested users with those three fields separated out and easy to read, with the full header still available for inspection for those that need it.

I've attached a copy of the database with four records, where you will note that in the first record, for example, the subject line falls in the 11th line of the database, and begins with "Subject:".
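The general idea of finding a header by its label rather than by its position can be sketched in Python as follows. This is not the FileMaker script from the attachments, just an illustration: it assumes a record is a list of field strings, and the function names and sample values are hypothetical.

```python
def extract_header(fields, label):
    """Return the text after 'Label:' from the first field that starts with it."""
    prefix = label + ":"
    for value in fields:
        stripped = value.lstrip()
        # Compare case-insensitively, since header capitalization can vary.
        if stripped.lower().startswith(prefix.lower()):
            return stripped[len(prefix):].strip()
    return ""  # label not present in this record


def summarize(fields):
    """Pull out the three key headers, wherever they fall in the record."""
    return {label: extract_header(fields, label)
            for label in ("Date", "From", "Subject")}


# Hypothetical record; header order and contents are made up.
record = ["Received: by ...", "From: Chris <c@example.invalid>",
          "Date: Thu, 6 May 2010", "X-Mailer: foo",
          "Subject: Re: A80 capstan", "Body line"]
summary = summarize(record)
```

One caveat with this approach: the first match wins, so a record that lacks a real "Subject:" header but has a body line starting with "Subject:" would pick up that body line instead.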

Thanks very much.

Studer_list_sample3.fp7.zip

Edited May 23, 2010 by Guest

comment · May 23, 2010

Try:

Studer_list_sample4.zip

Christopher_Campbell · May 23, 2010

Comment, your scripting is a thing of beauty. There's nothing quite like clicking a button and watching 60,000 fields (12,000 records x 5 fields) populate themselves!

Thanks to everyone who assisted, and especially Comment and Lee Smith, for all the patient assistance in making this material available.
