Improve or redesign a script to reduce running time

Wardiam · December 31, 2013

Hi Everybody,

I am here again with another "rookie" problem. I have a FM database with several thousand records (the attached example has around 6100). Each record has a field "sequence" containing a sequence of letters (amino acids). I would like to find the unique sequences (no repeats), that is, the unique records. But it is not so easy, because some sequences may be fragments of others.

Therefore, I have created two scripts:

The first one, "unique partial", checks whether each FULL sequence is unique and writes "1" or "0" in a field ("unique_partial"), depending on whether the sequence appears only once or more than once, respectively. It uses a simple search:

--> Search: ["SEQUENCE"] (between quotes)

The second script, "unique_seq", selects only the records with "1" in the "unique partial" field and then searches for the sequence:

--> Search: [*"SEQUENCE"*] (between *"..."*).

In this case, if the sequence is a fragment of other sequences, the search returns several records that contain it. The script writes "0" or "1" in a field ("unique_seq"), depending on whether several records are found or not.

Finally, I am only interested in "unique" records that have the value "1" in both fields.
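For readers who want to check the logic outside FileMaker, the two passes described above can be sketched in Python. This is only an illustration under stated assumptions, not the attached scripts: the function name `strictly_unique` and the sample data are hypothetical, and the substring search is assumed to run against all records, not just the found set.

```python
from collections import Counter


def strictly_unique(sequences):
    """Return sequences flagged "1" in both passes (hypothetical helper).

    Pass 1 ("unique partial"): the full sequence occurs exactly once.
    Pass 2 ("unique_seq"): the sequence is not a fragment of any other
    sequence. Assumption: the fragment search covers all records.
    """
    counts = Counter(sequences)
    distinct = list(counts)  # each distinct sequence once
    result = []
    for seq in sequences:
        if counts[seq] != 1:
            continue  # pass 1 would write "0"
        is_fragment = any(seq != other and seq in other for other in distinct)
        if not is_fragment:
            result.append(seq)  # "1" in both fields
    return result


# Made-up sample data, not taken from the attached file.
sample = ["ACDE", "CD", "ACDE", "MKV", "KVLL"]
unique = strictly_unique(sample)
```

The fragment check still compares every sequence with every other one, so it grows quadratically with the number of records, but it runs in memory without one find request per record.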

The main problem is that the two scripts together take around 20 minutes (on my MacBook Pro) to analyze 6000 records. That is too long, because some databases have several hundred thousand records.

I would like to ask for advice on how to optimize both scripts, or on a different way to get the same result in much less time. Could I use the ExecuteSQL function? Or is it enough to redesign the scripts?

Could anyone help me?

Thank you very much... and a Happy New Year 2014!

Wardiam

Compare_SEQ.zip

comment · December 31, 2013

Is there a limit on how large or how small the length of a sequence can be?

Kris M · December 31, 2013

Not a script solution, but an easier way to calculate duplicates. Hope it helps.

Compare_SEQ (2).zip

Wardiam · December 31, 2013

Hi,

There is no limit on the length of a sequence. Kris, thank you very much. I'm going to study your solution and let you know.

Thanks,

Wardiam

Wardiam · January 1, 2014

Happy New Year,

Kris, I have tested your solution. If you compare the results of the two approaches, there is a small difference in the number of records.

With your solution I get 5628 unique sequences and with my scripts I obtain 5661 unique records, that is, a difference of 38 sequences. If you check several of these sequences, it is true that they are duplicated, but in every group of duplicates at least one of them should be considered unique; otherwise we lose that sequence. It is like the remove-duplicates function in Excel: if you sort the strings and use that function, Excel removes all duplicates but keeps the first sequence of each group. I hope that makes sense.
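The Excel-like behaviour described here (keep the first record of each duplicate group, then discard sequences that are fragments of others) can be sketched in Python as follows. This is a rough illustration with hypothetical function names and made-up data, not the logic of either attached file.

```python
def dedupe_keep_first(sequences):
    """Keep the first occurrence of each sequence, preserving input order."""
    # dict preserves insertion order, so the first copy of each key survives.
    return list(dict.fromkeys(sequences))


def drop_fragments(distinct):
    """Remove sequences that occur inside another, different sequence."""
    return [
        seq
        for seq in distinct
        if not any(seq != other and seq in other for other in distinct)
    ]


# Made-up sample data: "ACDE" is duplicated, "CD" is a fragment of "ACDE".
sample = ["ACDE", "CD", "ACDE", "MKV", "KVLL"]
kept = drop_fragments(dedupe_keep_first(sample))
```

With this ordering of steps, a duplicated sequence such as "ACDE" survives once instead of being dropped entirely, which is the difference described between the two result counts.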

Anyway, thanks for your option, it is very interesting.

Wardiam
