
This topic is 3979 days old. Please don't post here. Open a new topic instead.


Posted

Hi Everybody,

 

I am back with another "rookie" problem. I have a FileMaker database with several thousand records (the attached example has around 6100). Each record has a field "sequence" containing a sequence of letters (amino acids). I would like to find the unique sequences (no repeats), that is, the unique records. This is not straightforward, because some sequences may be fragments of others.

 

Therefore, I have created two scripts:

 

  • The first one, "unique partial", checks whether each FULL sequence is unique and writes "1" or "0" into a field ("unique_partial"), depending on whether the sequence appears only once or more than once. It uses a simple search:

 

--> Search: ["SEQUENCE"] (between quotes)

 

  • The second script, "unique_seq", selects only the records with "1" in the "unique_partial" field and then searches for the sequence:

 

--> Search: [*"SEQUENCE"*] (between *"..."*).

 

In this case, if the sequence is a fragment of another one, several records containing it are found. The script writes "0" or "1" into a field ("unique_seq"), depending on whether several records are found or only one.

 

Finally, I am only interested in the "unique" records, that is, those with the value "1" in both fields.
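
For readers who want to check the logic outside FileMaker, the two flags described above can be sketched in Python. This is only an illustration, not the FileMaker scripts themselves: the function name `flag_sequences` and the sample sequences are made up, matching is case-sensitive, and it assumes the containment search in the second step runs against all records. It is still a quadratic comparison, so it shows the rule rather than a fast method.

```python
from collections import Counter


def flag_sequences(sequences):
    """Return one (unique_partial, unique_seq) pair per input sequence.

    Hypothetical helper mirroring the two scripts:
    unique_partial is 1 if the exact sequence occurs only once;
    unique_seq is 1 if, in addition, no other record contains it.
    """
    counts = Counter(sequences)
    flags = []
    for i, seq in enumerate(sequences):
        unique_partial = 1 if counts[seq] == 1 else 0
        unique_seq = 0
        if unique_partial:
            # Assumption: the fragment search looks at every other record.
            contained = any(
                seq in other for j, other in enumerate(sequences) if j != i
            )
            unique_seq = 0 if contained else 1
        flags.append((unique_partial, unique_seq))
    return flags


# Made-up sample data: "KTAY" is a fragment of "MKTAYIAK", "GGSLV" is repeated.
records = ["MKTAYIAK", "KTAY", "GGSLV", "GGSLV", "PLWQ"]
flags = flag_sequences(records)

# Keep only the records with "1" in both fields.
unique = [s for s, (p, q) in zip(records, flags) if p == 1 and q == 1]
print(unique)
```

Note that with this rule both copies of a repeated sequence get "0", so a repeated sequence disappears from the result entirely.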

 

The main problem is that both scripts together take around 20 minutes (on my MacBook Pro) to analyze 6000 records. That is too long, because some databases have several hundred thousand records.

 

I would like to ask for advice on how to optimize both scripts, or on a different way to get the same result in much less time. Could I use ExecuteSQL functions? Or is it enough to redesign the scripts?

 

Could anyone help me?

 

Thank you very much... and a Happy New Year 2014!!!!!!!

 

Wardiam

 

 

Compare_SEQ.zip

Posted

Hi,

 

There is no length limit on the sequences. Kris, thank you very much. I'm going to study your solution and let you know.

 

Thanks,

Wardiam 

Posted

Happy New Year,

 

Kris, I have tested your solution, and comparing the results of the two approaches shows a small difference in the number of records.

 

With your solution I get 5628 unique sequences and with my scripts I get 5661 unique records, that is, a difference of 38 sequences. If you check several of these sequences, it's true that they are duplicated, but in every group of duplicates at least one record should be considered unique, because otherwise we lose that sequence. It's like the "remove duplicates" function in Excel: if you sort the strings and use that function, Excel removes all duplicates but keeps the first sequence of each group. I hope that makes sense.
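
The "keep the first of each group" behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Kris's solution or the FileMaker scripts: the function names `dedupe_keep_first` and `drop_fragments` and the sample data are hypothetical, and matching is case-sensitive. The second function assumes that fragments of longer sequences should still be discarded after the duplicates are collapsed.

```python
def dedupe_keep_first(sequences):
    """Keep the first occurrence of each exact sequence, in original order."""
    seen = set()
    kept = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept


def drop_fragments(sequences):
    """Drop every sequence that is contained in another one.

    Expects input without exact duplicates. Longest sequences are
    processed first, so each candidate only has to be compared with
    the sequences already kept. Output is ordered longest first.
    """
    kept = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Made-up sample data: "GGSLV" is repeated, "KTAY" is a fragment.
records = ["GGSLV", "MKTAYIAK", "GGSLV", "KTAY", "PLWQ"]
no_repeats = dedupe_keep_first(records)
result = drop_fragments(no_repeats)
print(no_repeats)
print(result)
```

Here one copy of the repeated sequence survives, which is the difference from flagging every member of a duplicate group with "0".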

 

Anyway, thanks for your option, it is very interesting.

 

Wardiam
