Import from large ASCI file

cbum · January 3, 2005

I'm trying to import a 500MB Genbank data file into FM7, using BBedit to strip <CR>s etc, but I'm running into the following structural problems:

1. Not all records have entries for all fields

2. Many records have multiple lines for multiple entries for the same field.

The basic file structure is:

FIELDNAME1<SP><SP><SP> data;<SP>data;<SP>data <CR>

FIELDNAME2<SP><SP><SP> data;<SP>data;<SP>data;<SP>data;<SP>data <CR>

FIELDNAME2<SP><SP><SP> data;<SP>data;<SP>data; <CR>

FIELDNAME3<SP><SP><SP> data;<SP>data;<SP>data;<SP>data;<SP>data <CR>

FIELDNAME5<SP><SP><SP> data;<SP>data;<SP>data;<SP>data;<SP>data <CR>

//<CR>

Any suggestions appreciated.

Lee Smith · January 3, 2005

post a sample of your text document.

Lee

RodinBangkok · January 3, 2005

First insure you are able to see all specials via the text options menu in bbedit, then try and isolate the first occurance of a bad record import. Go back to the text file and see what specials or delimiters are different at the point the records are comprimised. Most likely you have not found all the delimiters between records and this is what is hosing your full import.

Rod

cbum · January 3, 2005

Rod,

I may not have been clear in my description, which is why I'm attaching a sample below.

The issue is not "bad" records, and I have indeed looked at the delimiters carefully. The problem is that for some records, there is no ENTRY for certain fields at all in the original, and instead of filling that with spaces etc, the record just goes to the next field(s).

Anyway, here goes:

ID Hs.2

TITLE N-acetyltransferase 2 (arylamine N-acetyltransferase)

GENE NAT2

CYTOBAND 8p22

LOCUSLINK 10

HOMOL YES

EXPRESS liver ; hepatocellular carcinoma ; corresponding non cancerous liver tissue ; small intestine ; adenocarcinoma ; mixed ; Liver ; colon ; Cell lines

GNM_TERMINUS S

CHROMOSOME 8

STS ACC=GDB:386004 UNISTS=157141

STS ACC=WIAF-2120 UNISTS=44576

STS ACC=PMC181332P1 UNISTS=271673

PROTSIM ORG=Homo sapiens; PROTGI=105377; PROTID=pir:B34585; PCT=100.00; ALN=290

PROTSIM ORG=Mus musculus; PROTGI=1703436; PROTID=sp:P50295; PCT=74.83; ALN=290

PROTSIM ORG=Rattus norvegicus; PROTGI=16758720; PROTID=ref:NP_446306.1; PCT=73.79; ALN=290

SCOUNT 4

SEQUENCE ACC=BP262787.1; NID=g52178018; CLONE=HSI08034; LID=16400; SEQTYPE=EST

SEQUENCE ACC=BP264356.1; NID=g52179587; CLONE=HSI15615; LID=16400; SEQTYPE=EST

SEQUENCE ACC=D90042.1; NID=g219415; PID=g219416; SEQTYPE=mRNA

SEQUENCE ACC=BU624903.1; NID=g23291118; CLONE=UI-H-FG1-bgl-g-02-0-UI; END=3'; LID=11914; SEQTYPE=EST; TRACE=159705553

//

ID Hs.129

TITLE Cholecystokinin A receptor

GENE CCKAR

CYTOBAND 4p15.1-p15.2

LOCUSLINK 886

HOMOL YES

EXPRESS mixed

CHROMOSOME 4

STS ACC=STS-L19315 UNISTS=19085

STS ACC=SHGC-50173 UNISTS=19745

STS ACC=D4S3157 UNISTS=46160

PROTSIM ORG=Homo sapiens; PROTGI=416772; PROTID=sp:P32238; PCT=100.00; ALN=428

PROTSIM ORG=Rattus norvegicus; PROTGI=231713; PROTID=sp:P30551; PCT=90.97; ALN=431

SCOUNT 3

SEQUENCE ACC=L19315.1; NID=g306595; PID=g306596; SEQTYPE=mRNA

SEQUENCE ACC=BC074987.2; NID=g50959868; PID=g50959869; SEQTYPE=mRNA

SEQUENCE ACC=L13605.1; NID=g306490; PID=g306491; SEQTYPE=mRNA

//

As you can see, the second record has no "GNM_TERMINUS" entry, other records a missing things like the GENE field or the STS field etc, because they have none.

Clear as mud?

cbum · January 5, 2005

No takers? :-(

Well, I've dumped the file into a FP7 file with few fields, and will deal with extracting specific data with text functions.

One issue I'm struggling with is how to efficiently expand the the set to account for the multiple entries in one field for any given record.

Specifically, in the example above, the record for Hs.129 has 3 entries for Sequence Acc.

I need to find data both based on Hs ID # and by Seq Acc #, so I probably need to duplicate the Hs.129 x2 and copy one diffrerent Acc# to each.... probably some type of loop script, right?

Lee Smith · January 5, 2005

No takers?

Not quite an accurate statement, is it?

After all, I did offer to assit you off list.

The problem of preparing text for import and/or dealing with the problems of text imports comes up from time to time. A search of the Forums will provide an abundance of information for your reading pleasure.

Do a search of [color:"blue"]ALL FORUMS, using[color:"blue"] blank in the[color:"blue"] Weeks and copy or paste these words in the[color:"blue"] Keywords as shown:

[color:"blue"] +Pars +Text, will find "parse" and "parsing" or

[color:"blue"] +Extract +Text, will find "extracting" and "extract". Some even have Enclosures.

There are a couple of Sample Files available in the Sample File Area that will also benefit you. Most, if not all, were done in previous versions, but they should convert to v7 okay.

HTH

Lee

Fenton · January 5, 2005

Lee's right. This is fairly standard, though rather large, text parsing into fields. I guess we're all a little burned out on it already :-| There are a few files on the forums. Here is one I've done, and there are others:

http://www.fmforums.com/threads/download.php?Number=115349

When I have more time I could help you a bit more with this. The basic idea is the use the "known names," which are the 1st "column" (depending on your separators) of each line to identify what field it should go into. The original text just goes all into one text field, to work on. Usually I'd then set that into a global field, and process line by line. That way the original text isn't messed with, in case (certainty) parsing doesn't work the first time.

Unfortunately, because we cannot set a field by name in FileMaker (though you can with AppleScript), you have to use the "loop through the fields on a layout" technique, or build a big nested IF script to set each field.

As far as "multiple lines of the same field," I would say those lines are related records, which should be created in another linked table. Use an auto-entered FileMaker serial ID to link them.

Now, "multiple values in a field" is a question:

EXPRESS liver ; hepatocellular carcinoma ; corresponding non cancerous liver tissue ; small intestine ; adenocarcinoma ; mixed ; Liver ; colon ; Cell lines

I would lean toward another table for this also. It could also be multiple lines in one field, or multiple repeats in a field. But really a separate table is both easier and more useful in the long run.

cbum · January 5, 2005

Lee,

I'm sorry if I misunderstood your messages, but I thought you preferred a chat set up to emails or the forum, and I don't have one.

Thanks for the search pointer, looks like that will be more efficient than what I've done so far, i.e. browse the import forum, to no avail so far.

I'm still looking for a good script to cut and paste the multiple entries in one field to multiple new records as mentioned above, but I've just started going through the search hits...

cbum · January 5, 2005

Fenton,

thanks for the link, I still trying to digest what the scripts do.

As to your comments, I have pretty much done your first suggestion, creating the FM7 file structure with one field for the Hs ID tag, one field for everything except the SEQUENCE (ACC#) data, and one field with the SEQ (ACC#) data. I can format the latter to a <CR> delimited table containing just

ACC=L19315.1

ACC=BC074987.2

ACC=L13605.1

etc., which would contain the most important data I need.

I'm all for creating a separate table, but I don't follow "auto-entered FileMaker serial ID to link them"...

re: "It could also be multiple lines in one field",

that's what I have now, and I don't think that would work well since many of the records have several hundred ACC# entries for each Hs.# and indexing etc probably would not work. (Or does it if each element in on a new line, but in the same field?)

How do I create a new table that has the link between the Hs.# and the matching ACC data entries?

PS: I hear you re: burnout, I'm sure this wheel has been invented many times already.

It's just not that easy to find the right match in these huge forums.

I guess this is in part the price of an expanding userbase... :-)

Fenton · January 5, 2005

You need a reliable serial ID to like the main record with the related records. I guess you could use your ID Hs.#. But FileMaker can also generate an auto-entered incremented serial #. Either would work. I prefer to never use imported data as a primary ID.

>How do I create a new table that has the link between the Hs.# and the matching ACC data entries?

You pass the primary ID to a field in the child table, which would then be the "foreign key," the right side of that relationship (though relationships go both ways). Since it's all scripted, this is easy enough.

So, when you got to a line with ACC, rather than setting a field in the main table, you'd capture the ID of the parent record, and go over to the child table and create a record.

I modified another one of my parsing example files (which I have about 30, mostly the same, a little different; I don't even know) to do creation of the child records. But it is, as my daughter would say, "butt simple." It can only do the child records if it is the last entries on the form.

Because the Loop for the main records is "generic"; it doesn't even know the name of the field it's setting; it's just doing whatever's on that layout that matches.

But your form has several "fields" that have mutiple rows. Some of them have mulitple values per row (record). This is not going to be easy; in fact it is going to be hard. But it is doable.

The database needs to know what to do with each of those things in the list. They are a mixture of "entities" and "fields." Each entitity needs its own table; because some have multiple "attributes," which would each be a separate field.

They are not even consistent; most multiple attributes are in the standard form: field label=

Some are not: EXPRESS liver ; hepatocellular carcinoma ; corresponding non cancerous liver tissue ;

So I think it has to know those by name (say the 1st word in the line). This means you need a list of what they are, and whether they are a child or not. In other words a table of these things, which the loops can consult for each line, so they know what to do. This would be easier than trying to write a completely generic script. There are not that many of them.

Being able to recognize the fields by name would allow you to (indirectly) set the field also by name; with some kind of nested If script. So you loop through the lines, check your "field and action" table to see what to do, then do it; set the field, create a child record and set a field (or fields, depending on how many values in that line).

And, I think I'd process the form 1 line at a time; putting it into a global (like I did with the Code child records in the example). That makes it easier to do further processing, to get multiple values out.

ParseListLoopwChild.zip

cbum · January 6, 2005

Fenton,

thanks for your file & suggestions.

In the meantime, I took care of my most pressing need by a cruder approach: I copied the Hs# field and the ACC= fields back to BBedit and replaced the tabs with <CR>, giving me a single field file with consecutive records with Hs# followed by all ACC= records belonging to that Hs#, next Hs# record with bunch of records with ACC= etc.

Hs.1

ACC= 12314

ACC= 23423

ACC= 34gfd

ACC= re543534

Hs.2

ACC=u34636

ACC=spdf6454

etc.

I then did a loop copying the Hs. to a global field (if the field contained an Hs entry), and copying the global to a new Hs. field for each record.

The script is still chugging along since yesterday, since my record count ballooned to over 5*10**6...

(thank god for OS X)

But thanks for your ideas, and I will try to adapt your scripts to working out the rest of the data.

Fenton · January 6, 2005

Well, that's one way to do. It's good that they at least put Tabs in the data; that will make parsing a little easier.

It's kind of toss up whether to pre-munge the data with BBEdit, or do it all with FileMaker. I will use BBEdit whenever I really need "grep," which isn't available in FileMaker; though someone here has written an awesome Custom Function for that (I haven't really tried it yet, but it looks great).

BBEdit is "recordable," ie., it can record your actions as an AppleScript, which you can then run from the menu (I did it as a stand-alone applet). BBEdit 8 has the ability to record a series of actions internally, called "Text Factory," then run them from a menu. (But I just tried, and the Line Processing options are limited to "Delete matching lines" only, which is kind of lame.)

I recorded an AppleScript (attached), saved as an application (no startup screen), which will process your example into its components, leaving the original complete document as is. The "parent" file has only the single lines, with the ID and the fields. Each of the "child" entities into its own text file, with IDs. It's a "droplet." Drop the original on it (which can be named anything except "Med Lines ID.txt".

Modify the file paths to match yourself (not "fej"). It's in a folder "Med Lines" on the desktop. This could have been done better, more generic; this one is simple.

I don't know how it would fare with a large file however. Test with a small sample first. This was created with BBEdit 8. It would need tweaking to work with 7 (It would not compile, so I'd have to check it; Process Lines is different in 8, no longer an Extension.)

If using 8 TURN OFF "Open In Front Window" for Open and New documents in Preferences, Documents. I don't quite know how to write it for that; could be done, but separate docs is easier.

The separate files could then be imported, each into their own table, with the ID in each. You'd still need to process the multiple fields. But if, in your original file, there are Tabs (which don't show here), then it's a pretty straight import, with a little processing to remove the "name tags." This could all be automated; you know the name & location of each of those files.

MEDLINES.zip

cbum · January 6, 2005

Thanks for the Ascript - I'll give it a try tomorrow.

I'm a bit worried about BBedit 8, though, I have found it to be very bad at memory management with large files - I constantly get out of mem messages (at least it does'nt crash) with files of a few hundred MB on a 2.5GB G5, although they advertize it as "bound by mem only"...

I had to split the file into 4 to perform some grep massaging :-(

Actually, there are no tabs in the ASCI file, just spaces and <CR>s, I had to grep that into it...

Where is that custom FM grep function? Sounds awesome.

I constantly but my head against the lack of wildcards in text string operations in FM...

Thanks!

c.

PS: the overnight loop script delivered the goods!

Fenton · January 7, 2005

First, from our own The Shadow:

http://www.fmforums.com/threads/showflat...true#Post120696

Yes, I've also run into memory problems with BBEdit, I think in 7 and 8. It will take forever to open a large XML file. I believe it's going through and checking something; there are so many preferences, maybe one could be turned off to speed this up. But I rarely deal with large XML files; mostly I'm just poking around with examples.

Ironically I updated to BBEdit 8 because it touted integration of the Tidy engine, which I use for syntax checking and indenting of xml/xsl. Only to find out they didn't integrate that part of it (which has been able to do xml for years). I called them and they were surprised it didn't work, but "no it didn't, and yes it should, and we'll look into it." That was before the latest upgrade, but no, it still doesn't work. Fortunately the BBTidy extension has been updated for OS X, and still works fine as a plug-in. And the Text Factory, though promising, is distinctly limited in its options, as yet.

I imagine this could also be done with command line, independent of applications, and probably faster. If you have to do it often that might be worth investigating.

I'm glad your overnight loop worked. I was worried you were going to see that it didn't, which wouldn't be fun.

cbum · January 7, 2005

That grep custom function - like ... wow! I hope it works.

Is that really too hard for FM to do themselves?

THanks!

cbum · January 19, 2005

I talked to some of the FM reps at Macworld about GREP functions - and they just looked at me...

If any programmer can cook up the custom functions for a good chunk of GREP, how hard can it be for FM? Is there that little demand for a capable text crunching ability?

cjaeger · January 21, 2005

FileMaker is not intended to do sophisticated grep - if your records are organized properly, you can find the stuff easily using the internal search function. That's probably the reason you want the data in FileMaker instead of using grep and Unix piped commands on a text file ...

Forget about the grep custom function, it will be very slow, as will be using FileMaker's text parsing functions.

For import such a large aamount of data, you have to do the pre-processing externally. It's always a question of "tagging" the data properly. If you are importing this stuff regularly, instead of BBedit, I would recommend using perl to generate an XML file. which in turn can be imported into FileMaker using an appropriate XSLT.

cjaeger · January 21, 2005

The grep CF is great . Just not for this purpose.

cbum · March 20, 2005

"FileMaker is not intended to do sophisticated grep"

granted, but I'm talking about a reasonable subset... why would that be slow?

cbum · March 20, 2005

"FileMaker is not intended to do sophisticated grep"

granted, but I'm talking about a reasonable subset... why would that be slow?

cbum · March 20, 2005

"FileMaker is not intended to do sophisticated grep"

granted, but I'm talking about a reasonable subset... why would that be slow?

Sign In

Import from large ASCI file

Recommended Posts

cbum

Lee Smith

RodinBangkok

cbum

cbum

Lee Smith

Fenton

cbum

cbum

Fenton

cbum

Fenton

cbum

Fenton

cbum

cbum

cjaeger

cjaeger

cbum

cbum

cbum

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information