
Import large ASCII file




I have 20 fixed-width ASCII files with 20,000+ records each that I would like to import into FileMaker. The files are plain text with no delimiters; each record holds 200+ variables, taking up 3200 "positions" or characters.

I would like some advice on how best to move these files into FileMaker.

Thank you for your advice

Chris


Not enough information I think...

Can you post a few records? Is every record structured exactly the same?

If there are no delimiters, how do you know where each field starts and ends? Is everything fixed length?


I second Wim: what marks the end of a record, just a certain number of characters? In any case, I would suggest a good text editor, such as BBEdit or its free lite version.

http://www.barebones.com

Its "grep" search can easily find a given number of characters and add a delimiter (tab for fields, return for records). A period (.) is one character, so a 6-character field would be:

Find: ......

Replace: &\t

(& is "whatever matched")

You'd likely want to do the returns for the records first, or remember to add the tab to the number of characters; a return is \r
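If you'd rather script the conversion than grep it, here is a minimal Python sketch of the same idea (the file names and field widths are made up; substitute your own layout):

# Minimal sketch, not tested against the real files: slice each
# fixed-width line into fields and join them with tabs, the same
# transformation as the grep find/replace above.
widths = [5, 2, 1, 2, 2, 3]  # example widths only; use your layout

with open("records.txt") as src, open("records.tab", "w") as dst:
    for line in src:
        line = line.rstrip("\r\n")
        fields, pos = [], 0
        for w in widths:
            fields.append(line[pos:pos + w])
            pos += w
        dst.write("\t".join(fields) + "\n")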

BBEdit is fully AppleScriptable. BBEdit Lite is not. TextWrangler is, but it requires OS X 10.3.5, so you might want to upgrade. BBEdit Lite is so old that it was hard to even find on their web site:

http://www.barebones.com/products/bblite/index.shtml


Thanks for the insights. Here is some more information.

All of the records and disks have the same structure: fixed width, with a carriage return at the end of each record. The data files contain many padded "blank" spaces, which makes visual inspection of the records a little off-putting. But here is what it looks like:

2003PD126640010208250101S C Y00260011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ALAN J GERSON PETER GLEASON DEMOCRATIC

The "blanks" don't reproduce in copy/paste.

Here are some sample variables:

Field         Start  End  Type  Length
Election-ID       1    5     A       5
Office            6    7     A       2
County            8    8     A       1
District          9   10     A       2
Ass. Dist        11   12     A       2
Elect. Dist.     13   15     A       3

And so on.

Two additional sets of data, of similar file structure and size, will also need to be converted and linked to each other to generate reports and to match address/name data from membership lists to voter histories, election results, and turnout information by census block.

Thanks for your further thoughts.


Since the records are separated by a carriage return, the import shouldn't be an issue. Your fields can then be calculations, or set with a loop and the same calcs in a Set Field script step. Examples:

Election_ID = Trim(Middle(MyImportFieldName; 1; 5))

Office = Trim(Middle(MyImportFieldName; 6; 2))

County = Trim(Middle(MyImportFieldName; 8; 1))

and so on.....

Middle(MyField; StartPosition; NumberOfCharacters)

Trim removes leading and trailing spaces.
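If it helps to see the logic outside FileMaker, here is a rough Python sketch of what those calcs do (positions are from the sample layout; the names are just illustrative):

# Rough sketch of the Trim(Middle(...)) parse for a single record.
# Starts are 1-based, as in FileMaker's Middle() and the layout table.
layout = {
    "Election_ID": (1, 5),
    "Office": (6, 2),
    "County": (8, 1),
}

def parse_record(line):
    # line[start-1 : start-1+length] is Middle(); .strip() is Trim()
    return {name: line[start - 1:start - 1 + length].strip()
            for name, (start, length) in layout.items()}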


It's a lot of work, writing 200+ calculations like this. I would use a repeating calculation field to split out the field data, then export it as text and convert the separator to a tab character.

This way you only need to type the starts and lengths into 2 repeating fields - and even that could be imported, using the same process in reverse.
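The idea in script form is just two parallel lists driving the split; a hypothetical Python sketch, with the two lists standing in for the two repeating fields:

# Hypothetical sketch: starts and lengths are data, not 200+ formulas.
starts = [1, 6, 8, 9, 11, 13]   # 1-based starts, per the layout table
lengths = [5, 2, 1, 2, 2, 3]

def split_line(line):
    # slice out each field and join with tabs for export
    return "\t".join(line[s - 1:s - 1 + n] for s, n in zip(starts, lengths))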


It is also possible to write a fairly simple grep that will split the fields. I'm not quite sure what's going on with the names at the end, so I'm ignoring them. You just put periods for the characters, enclosed in parentheses, and use the backslashed group position in the replace. I think you have a maximum of 9 fields you can do (and the last one on the line doesn't need a tab). Easier to see than to explain (especially for those such as myself).

This is for 6 fields, and I don't know for sure that the lengths are correct. But you get the idea, and that's what matters.

Find:

(.........................) (.) (.............................................) (.............................................) (.............................................) (.............................................)

Replace:

\1\t\2\t\3\t\4\t\5\t\6\t
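The same capture-group idea carries over to any regex flavor that supports counts, so you don't have to type rows of periods; a hypothetical Python sketch for the first three sample fields:

import re

# Sketch only: one group per field, backreferences joined by tabs.
# {n} repetition counts replace the rows of periods above.
pattern = re.compile(r"^(.{5})(.{2})(.{1})")

line = "2003PD126640010208250101S C Y0026"  # shortened sample record
print(pattern.sub(r"\1\t\2\t\3", line, count=1))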


It's a lot of work, writing 200+ calculations like this. I would use a repeating calculation field to split out the field data, then export it as text and convert the separator to a tab character.

This way you only need to type the starts and lengths into 2 repeating fields - and even that could be imported, using the same process in reverse.

IMO it's about the same amount of work, as the only things that change in the calcs are the same numbers you are putting in the repeating fields. I also think it would be much easier to troubleshoot my way, but I readily admit you know more about this than I do, so I may be missing something.


I think that typing a list of numbers in sequence is a lot faster than duplicating 200 fields and editing them one at a time, trying to select the 2 numbers in the formula. Also, in case of a mistake, you have the entire list in front of you, so tracking it down should be easy.

It doesn't even have to be a repeating field - you could put the data into a text field, separated by carriage returns. So if you skipped a number, you could correct it by insert. And of course, if such a list already exists, then import, or copy and paste, should make this a breeze.
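Taken one step further, that list could generate the calculation text itself; a hypothetical sketch, assuming the list is kept as name/start/length rows:

# Sketch: emit the 200+ formulas from the layout list instead of
# typing them by hand. Field names here are from the sample layout.
layout = [("Election_ID", 1, 5), ("Office", 6, 2), ("County", 8, 1)]

for name, start, length in layout:
    print(f"{name} = Trim(Middle(MyImportFieldName; {start}; {length}))")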


All good suggestions.

I will work on using sbg2's suggestion. As a novice, I think laying out the structure so plainly will allow me to write an import script that will work. While it may be longer, I can understand it -- heh.

Thanks again. I will post the outcome on Monday.


In order to get the data into FM, I first put the "long string" into Excel. After it was imported/opened in FileMaker, I used the parsing approach as suggested. I tried out both a loop and a defined calculation, and both worked fine, though defining the field was much faster. However, I am running into a problem with the larger files, which cannot be opened in Excel or Word due to their size. When I try to open or import the original .txt file directly into FileMaker, I get an error message saying the file type is not supported.

Any thoughts?


I solved my problem. I needed to install the correct drivers.

When bringing additional files into the database, I found that defining the calculations, though time-consuming the first time, makes the importing/conversion process much faster than running a script/loop to perform the calculations.


When bringing additional files into the database, I found that defining the calculations, though time-consuming the first time, makes the importing/conversion process much faster than running a script/loop to perform the calculations.

And it has the added benefit that nearly 100% of FileMaker developers would understand this choice, so your calcs would make sense to the next person who works on the file...

One tip: beware the comma. Watch your import format -- FileMaker sometimes decides that commas are a field delimiter when you don't want it to, even in a fixed-width format.


Thanks for the file. Once FileMaker gets through importing the data (going on 3 hours now), I will look at it.

The file is 514 MB and will likely have more than 1 million records when done. I know this is a lot of data, but more than 3 hours to import the file? Generally, I use SPSS to manage data sets this large, but we want to have the data available on the web. I am running on a MacBook Pro with 1 GB of memory.

I have the file structured to import the first "long string" from which all of the other variables are calculated. Would it be faster to do the calculations via an "insert calculation" script after the initial "long string" has been imported? Seeing FM process one record at a time using the loop function led me to think that it would be slower, but I could be wrong. Is there some batch-processing function that I missed? I have five more CDs, and the idea of having to wait several hours for each one leads me to believe that I have missed some import step to speed the process along.

Thanks for your thoughts.

