
This topic is 7062 days old. Please don't post here. Open a new topic instead.

Recommended Posts

Posted

Hi,

I am new to FileMaker, and I am trying to create a database to hold and manage my research data. I frequently have to update the database with large quantities of new data, and I am hoping to find some tricks to save time and reduce errors.

Firstly, I am hoping to automate the import step instead of copying and pasting manually. The updating step always looks the same: it involves taking a text file and creating new records. The text file usually has the format:

">Alphanumericname1" <carriage return>

"Alphastring1" <carriage return>

">Alphanumericname2" <carriage return>

"Alphastring2" <carriage return>

etc..

and what I would like to do is create a new record for each name/string pair, with the alphanumeric name occupying one field, and the (name + carriage return + string) occupying the second field.
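For what it's worth, the pairing logic described above can be sketched outside FileMaker. Here is a minimal Python sketch (the field names are invented for illustration) that turns alternating name/string lines into the two desired fields per record:

```python
# Minimal sketch: split the text into non-empty lines, then pair each
# ">name" line with the string line that follows it. Field names here
# are hypothetical, not from the actual database.
def parse_pairs(text):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    for name, string in zip(lines[0::2], lines[1::2]):
        records.append({
            "field1": name,                  # the alphanumeric name
            "field2": name + "\r" + string,  # name + carriage return + string
        })
    return records

sample = ">Alphanumericname1\rAlphastring1\r>Alphanumericname2\rAlphastring2\r"
records = parse_pairs(sample)
```

This assumes the file strictly alternates a name line with a string line; a real import would need to tolerate blank separator lines.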

Secondly, each record in my dataset is currently linked to one of a smaller set of records in a related database (a many-to-one relationship), and when I import my new records, I would like to be able to specify which record in the related database they are associated with.

Lastly, I would like each record to have a unique ID that never changes, plus fields that indicate when it was created and when it was last modified. Is there an automatic way to do this? Any thoughts on any of this would be great!

Posted

There is more than one way to do this. First, how big are these files?

Less than 64,000 characters (always)? If so, you could import the entire file, using Import Folder or an AppleScript, into a global field. Then you could use a FileMaker loop to set the fields. If there are a lot of fields, it would be best to use a generic test on an ordered layout to set them. Otherwise it would require a big, ugly (and slow) nested If test.

So, how many fields? Is it always the same fields, in the same order? In that case you could just do a straight tabbed-text import and ignore the label lines. You could create a merge file out of a sample to set up the import (that is, put all the labels in a 1-line text file, separated by commas, with a dummy line under it):

">Alphanumericname1"

"Alphastring1"

Is it really exactly like this? Is there a ">" character before the "label" line? Are there quote marks?

So tell us more specifics. With text imports, the devil (or angel) is in the details.

It may also be possible to use an external tool (AppleScript, a text editor) to "munge" the text before import.

Posted

My files are almost always less than 64,000 characters, so it sounds like the import-entire-file option might work. For this project, importing data into the records would always populate only two fields, and always in the same order. However, each record will have many additional empty fields that I will fill in later. The fields do indeed look the way I typed them, with a ">" symbol before the label line. (This is part of a standard format for listing the name of a gene followed by its gene sequence. The format is called "FASTA" and is used by genetics people a lot.)

Posted

Oops, terminated the post before I meant to. I was hoping to attach a sample input file, but I am not sure how to do this.

Marcel

Posted

Edit the post and make sure 'Preview' is selected in the options. You will then see the option of attaching a Stuffed or Zipped file.

Posted

The ">" in front is good, I guess. Real XML would have been better (but harder to type). The big question is:

Is it ALWAYS the same labels and data? Do they always have the same labels, in the same order, and just leave the next line blank for empties? I'm thinking probably not. They probably only put a label when they have data.

Now that I think of it, you could import both the label and the data as just two fields in a line-items table. Remove the ">". This is proper relational structure. The idea of creating a field for each label, then parsing the data into the correct field, would also work, but it is more work.
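The line-items shape can be sketched like this in Python (a rough sketch, not the actual FileMaker script): any line starting with ">" begins a new item, and following lines become its data, so entries with missing or multi-line data are tolerated.

```python
# Sketch of the line-items idea: a ">" line is a label (stored with the
# ">" removed); any subsequent non-blank lines are attached as its data.
def to_line_items(text):
    items = []
    for line in text.splitlines():
        if line.startswith(">"):
            items.append([line[1:], ""])  # new item: label, empty data
        elif line.strip() and items:
            items[-1][1] += line          # append data to the current item
    return [tuple(item) for item in items]

items = to_line_items(">UL100499\nactg\n>UL100500\natgg\n")
```

Because each item is just a label/data pair, labels that vary from entry to entry pose no problem, which is the advantage of the relational structure described above.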

A downside to proper relational structure is some loss in speed. Reports may also be difficult if you want the many various labels across the top.

If you want to do it as many fields in one table, with a looping script, I have some ideas on a "generic" loop. It would use a match between the current field name and a list of labels to go get that data from the global field.

Posted

(I tried to post a sample file and went through all of the steps correctly, but I don't see my posting. Sorry if this is a repeat.)

Ok, here is a sample input file created in BBEdit, but recognized by Word and SimpleText. It consists of a series of gene name/gene sequence pairs, each separated by a carriage return. I often have to upload large files of this kind, with each sequence creating a new record but connecting to a smaller related database with records of where the genes came from. Each source might contribute many genes, but no gene can come from more than one source.

I tried importing my test file into a dummy database with no records, using the "Import Records" command. It gave me a dialog box showing a "from" side and a "to" side with all of the potential fields, and I was able to get an arrow to show up, mapping the sequence name from the first line of the file to the "sequence name" field in the database. I also noticed that you can scan the data in the input file from this dialog box and go to the next line, so I did this for several but not all lines in my sample input file. However, when I completed the import step, I just got 24 (the number of lines in the text file) new empty records. I also tried changing the carriage-return format of the input file from DOS to Unix, with the same result. Hmm....

Marcel

SampleInputfile.txt

Posted

Still no attached file on my posting. OK in any case, to answer the other questions, the format is always:

>sequencename <carriage return> genesequence <carriage return>

For example, reading the file it would look like this:

>UL100499

actgactgactcgggt.....etc...

>UL100500

atgggatccctgaatgacct...etc

Occasionally there is an additional carriage return after the gene sequence string, which basically separates file entries by an empty line.

I am not sure I understand all of the other import options, such as how to import into a line-items table, or as many fields in one table, and how I could use those methods to create a new record for each gene sequence. Ideally, in the above example, I would have two records, with >UL100499 in field 1 and actgactgactcgggt... in field 2 of the 1st record, and >UL100500 in field 1 and atgggatccctgaatgacct... in field 2 of the 2nd record, and so on. Removing the > would be easy enough, but then when exporting gene names and sequences I would have to put them back on for downstream applications.
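The strip-on-import / restore-on-export round trip mentioned above is cheap to do. A hedged Python sketch (function names are invented) that also ignores the occasional blank separator line:

```python
# Strip the ">" when storing records, and restore it when writing FASTA
# back out for downstream applications. Blank separator lines are skipped.
def import_records(text):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [(name.lstrip(">"), seq)
            for name, seq in zip(lines[0::2], lines[1::2])]

def export_fasta(records):
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)

recs = import_records(">UL100499\nactgactgactcgggt\n\n>UL100500\natgggatccctgaatgacct\n")
round_trip = export_fasta(recs)
```

So the ">" need not be stored at all; it can be regenerated at export time.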

I would love to take a look at any clever script that can help.

Marcel

Posted

IMHO, if you keep your research data in BBEdit, the simplest way would be to do some minor pre-processing in BBEdit. FileMaker easily imports text in tab-delimited format, where a tab separates fields and a CR separates records.

So, if you make a copy of your BBEdit document and do a series of Find/Replace:

1. Find: \r> Replace with: \n

2. Find: \r Replace with: \t

3. Find: \n Replace with: \r

The first step replaces all CR's followed by a > with linefeeds;

The second step converts all remaining CR's into tabs;

The final step restores the "real" CR's back (without the >).

An additional step to decimate multiple CR's into one could also be useful. For this, you turn grep on and:

Find: \r+ Replace with: \r

I'm sure all of this could be automated into a single grep pattern in BBEdit, but I am not really good at that.
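For comparison, the same sequence of replacements can be expressed outside BBEdit. Here is a rough Python equivalent, assuming the source file uses classic-Mac CR line endings:

```python
import re

# Mirror the BBEdit steps on a CR-delimited FASTA string:
# decimate runs of CRs, turn the CR before each ">" label into a LF,
# turn the remaining CRs into tabs, then restore the LFs as real CRs.
# Result: tab between fields, CR between records.
def munge(text):
    text = re.sub(r"\r+", "\r", text)  # decimate multiple CRs into one
    text = text.replace("\r>", "\n")   # step 1: CR + ">" -> LF (drops the ">")
    text = text.replace("\r", "\t")    # step 2: remaining CRs -> tabs
    text = text.replace("\n", "\r")    # step 3: restore the "real" CRs
    return text

out = munge(">UL100499\ractg\r\r>UL100500\ratgg\r")
```

Note that, just as with the BBEdit steps, the very first ">" (not preceded by a CR) and any trailing separator survive; a final cleanup pass could handle those.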

Posted

Here is a little example file that parses the data in FileMaker; I forgot to post it earlier. Whether you do it in FileMaker or outside, using a text editor, depends partly on how big the files are. FileMaker 6 has a 64,000-character limit on a text field. FileMaker 7 allows 2 GB per field. Quite a bit bigger (understatement), but huge fields become impractical. I tried setting a field to 17 MB worth of ValueListItems once; real slow.

Genes.zip

  • 1 month later...
Posted

Hi Fenton,

Sorry to get back to you so late on this; I was pulled away from the project for a while.

Thank you very much for your munge script - works like a charm. This is something I couldn't have done by myself! Incidentally, I was able to fix it to accept Unix files too. I think the problem was that Mac and Unix line-ending characters are different (Mac uses CR, Unix uses LF), so I just inserted a line at the head of the script that replaces Unix line feeds with FileMaker returns. Here is what I used:

Substitute ( Genes::_gText ; "
" ; "¶" )

(The first quoted argument contains a literal Unix line feed pasted into the calculation; "¶" is FileMaker's return character.)
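The line-ending normalization amounts to a single substitution; a Python sketch of the same idea (also covering DOS CR+LF endings, which came up earlier in the thread):

```python
# Normalize line endings before parsing: DOS uses CR+LF, Unix uses LF,
# and classic Mac / FileMaker's return character is CR.
def normalize(text):
    return text.replace("\r\n", "\r").replace("\n", "\r")

normalized = normalize(">UL100499\nactg\n")
```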

