Mike Bee Posted December 31, 2007
I have a DB with ~35,000 entries; each entry has 60 fields. I am trying to automate a process to delete or omit duplicate entries based on three fields. My problem is that even with indexed fields, it is taking too long to run an automated script, since it tries to compare each record to all 35,000 entries. Does anyone have suggestions for a script or method to speed up this process?
comment Posted December 31, 2007
Define a calculation field that concatenates the three fields. Sort the records by this field, and compare each record only to the previous record's value (stored in a variable). There are other methods, too; it all depends on how often you need this and for what purpose.
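A minimal sketch of this approach as FileMaker script steps, assuming a stored calculation field Contacts::DupeKey that concatenates the three fields (the table and field names are illustrative, not from the post), with the script's sort order set to DupeKey ascending:

Sort Records [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Set Variable [ $previousKey; Value:"" ]
Loop
  # Remember whether this pass started on the last record, so the loop ends cleanly.
  Set Variable [ $onLast; Value:Get ( RecordNumber ) = Get ( FoundCount ) ]
  If [ Contacts::DupeKey = $previousKey ]
    # Same key as the previous record: a duplicate. Deleting it makes the
    # following record slide into this position, so we do not advance.
    Delete Record/Request [ No dialog ]
  Else
    Set Variable [ $previousKey; Value:Contacts::DupeKey ]
    Go to Record/Request/Page [ Next ]
  End If
  Exit Loop If [ $onLast ]
End Loop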
Søren Dyhr Posted January 1, 2008
60 fields per record is asking for problems if indexes are sprinkled arbitrarily all over the place, along with numerous interdependent calculation fields; these ought to be broken out into several tables, and exhausting scripts should be run in table occurrence groups where irrelevant TOs have been disconnected. Problems like these sometimes occur when a database is mistaken for a spreadsheet.
I rather fancied the method Comment suggests, which can be found here:
http://www.databasepros.com/FMPro?-DB=resources.fp5&-lay=cgi&-format=list.html&-FIND=+&resource_id=DBPros000669
until I realized that it's actually much slower than this:
http://sixfriedrice.com/wp/deleting-duplicate-records-in-filemaker/
--sd
comment Posted January 1, 2008
It was not my intention to suggest any particular method - only to "unsuggest" comparing each record to all 35,000 entries. Regarding speed, I would refer you to:
http://www.fmforums.com/forum/showtopic.php?tid/159484/
These tests should probably be re-examined in version 9. In any case, speed alone should not be the determining factor: there is also the question of how many resources each method requires.
Fenton Posted January 1, 2008 (edited)
The speed of any method to remove duplicates depends on the number of duplicates vs. the number of records. Since the number of duplicates is hopefully pretty small, isolating them would be a priority for me. The poster is saying that this is an operation that is going to be run repeatedly, and that the priority (in this case) is speed.
I would use the first part of Comment's idea: create a concatenated field of all three fields. This of course bloats the file, especially if you Find on the field. But that's the price you pay for speed. I would first do a Find, using the "!" special character, to find all duplicates. It would be slow the 1st time, but not afterwards.
I would then use either a "sort and compare" method - a Loop (starting at either the 1st or last record, depending on which record you want to keep), holding the data in variables to see which record is the duplicate, and resetting when the data changes - or a self-relationship on the concatenated duplicate field. In fact, I'd likely use the latter. You've already paid the disk-space price by indexing the concatenated duplicate field, so you might as well use it.
Such a self-relationship, comparing a self-related serial ID* to the local serial ID, will positively identify either the 1st or last record of the self-relationship. If the record is not the 1st (or last), then delete it. This method does not require sorting.
The self-relationship can also be used in a calculation (result Text) to produce a "Duplicate!" warning on a Form view data entry layout, as soon as the 3 fields are populated and committed.
*If you do not have a unique auto-enter serial ID in the table, you should.
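A minimal sketch of that self-relationship flag as a FileMaker calculation, assuming a self-join table occurrence Contacts_sameKey matched on the concatenated key field and an auto-enter serial field SerialID (both names are illustrative):

// Unstored calculation, result Text. With no sort order on the relationship,
// related records come back in creation order, so Contacts_sameKey::SerialID is
// the serial of the FIRST record in this duplicate group. If it differs from
// this record's own serial, this record is a later duplicate.
If ( Contacts_sameKey::SerialID ≠ Contacts::SerialID ; "Duplicate!" ; "" )

A script can loop through the records and delete any record where this flag is not empty, or the calculation can simply sit on a Form view layout as the data-entry warning Fenton describes. Sorting the relationship descending by SerialID would keep the last record of each group instead of the first.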
Søren Dyhr Posted January 1, 2008
Fenton is right here: the level of contamination really does matter. And yes, Six Fried Rice's algorithm could indeed benefit both from the "!" search Fenton suggests and from Comment's adaptation of the Fast Summaries algorithm, sans summary fields, jumping to the first record of the next group using the calculation:
Get ( RecordNumber ) + Count ( selfJoin::etc. )
My objection was primarily to stuffing a $variable 34,999 times, although that approach hardly has any impact on the number of fields involved, nor does it blur the relationship graph!
--sd
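For context, a minimal sketch of that Fast Summaries-style jump as FileMaker script steps, assuming a self-join (selfJoin) matched on the concatenated key, a serial field RecordID, and records sorted by that key (all names are illustrative):

Sort Records [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Loop
  # Count ( selfJoin::RecordID ) is the size of the current duplicate group, so
  # adding it to the current record number lands on the first record of the next group.
  Set Variable [ $groupSize; Value:Count ( selfJoin::RecordID ) ]
  # ... process the first record of this group here (e.g. keep it, omit or delete the rest) ...
  Exit Loop If [ Get ( RecordNumber ) + $groupSize > Get ( FoundCount ) ]
  Go to Record/Request/Page [ Get ( RecordNumber ) + $groupSize ] [ No dialog ]
End Loop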
comment Posted January 1, 2008
I would still reserve judgment without knowing more. I didn't get the impression that speed was the primary concern here - only that the current method was unbearably slow, trying to perform 35,000 x 35,000 comparisons (if not three times as many). A loop over a found set of 35,000 doesn't take THAT long, and performing a find for duplicates beforehand, as Fenton suggested, could very well be enough. If there's one thing I learned from the thread I linked to, it's that there is no single preferred method for all situations - not even when speed IS paramount.
Søren Dyhr Posted January 2, 2008
I didn't get the impression that speed was the primary concern here - only that the current method was unbearably slow
Are we being rhetorical here? The word "speed" made the headline, or rather the topic, here!
that there is no single preferred method for all situations
Indeed! This is why we usually ask about purpose and context, although that's exactly where it usually gets tricky:
Consulting is hard because clients are not acting rationally. They will have a problem but will never admit it, and the problem is always a people problem, no matter how technical it might seem at first ... Weinberg's advice is not to try to be rational at all costs, and not to force clients to admit their problems and fears. Consultants should be reasonable rather than rational, cultivate a paradoxical frame of mind and help clients solve their problems by themselves.
Snipped from: http://agylen.com/2005/03/21/the-secrets-of-consulting-review/
I laughed at myself; I have indeed been a fool not to realize this on my own, without reading it. I was led to it, by the way, by these excellent blog postings: http://www.sumware.net/robfm2/?p=30
Thanks, Rob Russell!
--sd
comment Posted January 2, 2008
Are we being rhetorical here? The word "speed" made the headline, or rather the topic, here!
I think we could very well be, because we are debating differences that are negligible compared to the original problem. Let's say, for argument's sake, that it takes a minute to loop through all 35,000 records and perform 35,000 comparisons. Now, I don't know exactly how the original script works, but let's say it compares each record in turn to all the remaining records in the found set. If there are no duplicates, it will require about 612,500,000 comparisons (35,000^2 / 2). If it takes one minute to perform 35,000 comparisons, this script will run for more than 12 days.
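Spelling out that estimate (just restating the post's own figures):

\[ \frac{35{,}000^2}{2} = 612{,}500{,}000 \ \text{comparisons}, \qquad \frac{612{,}500{,}000}{35{,}000 \ \text{per minute}} = 17{,}500 \ \text{minutes} \approx 12.15 \ \text{days} \]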
Mike Bee (Author) Posted January 2, 2008
Define a calculation field that concatenates the three fields. Sort the records by this field, and compare each record only to the previous record's value (stored in a variable).
I tried this; it works OK until you get over about 1,000 entries, so the more entries that come into the current found set, the slower it goes.
60 fields per record is asking for problems if indexes are sprinkled arbitrarily all over the place, along with numerous interdependent calculation fields; these ought to be broken out into several tables, and exhausting scripts should be run in table occurrence groups where irrelevant TOs have been disconnected.
I have been tailoring my scripts to work on only three fields: last name, house number and zip code. Basically I'm trying to sort lots of data entries to remove duplicate households (i.e., generating a mailing list that does not include more than one piece of mail per household). Do you think that splitting those three fields off into a separate table would have an impact?
I would use the first part of Comment's idea: create a concatenated field of all three fields. This of course bloats the file, especially if you Find on the field. But that's the price you pay for speed.
I've already made a single field based on my three search fields.
If there are no duplicates, it will require about 612,500,000 comparisons (35,000^2 / 2). If it takes one minute to perform 35,000 comparisons, this script will run for more than 12 days.
I ran a smaller set of 1,000 entries as a time trial over New Year's Day, and my fastest script took ~4 hours to run it - and that's with the data stored locally on a pretty fast computer with 10,000 rpm drives.
All in all, it seems these are my priorities:
1) perform a find to isolate the duplicate entries (sketched below)
2) store the omitted (non-duplicate) entries in a new table
3) perform the duplicate-entry comparison and removal on the found set
4) add the results of the found set to the table with the previously omitted entries
I'm going to spend today trying to get this together.
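A minimal sketch of step 1 as FileMaker script steps, assuming the concatenated field is Contacts::DupeKey (an illustrative name); the "!" find operator matches every record whose value in that field occurs more than once:

Enter Find Mode [ ]
Set Field [ Contacts::DupeKey; "!" ]
Set Error Capture [ On ]
Perform Find [ ]
# The found set now contains only the duplicate groups; the sort-and-compare
# (or self-join) deletion only has to run on this much smaller set.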
comment Posted January 2, 2008
it works OK until you get over about 1,000 entries, so the more entries that come into the current found set, the slower it goes.
That doesn't make sense. Can you post a copy of your script?
2) store the omitted (non-duplicate) entries in a new table
What's the purpose in moving non-duplicates to another table?
Søren Dyhr Posted January 2, 2008
There is something else wrong here. I made myself a found set of 35,000 records, all with unique data, and then deliberately contaminated the data twice via AppleScript, this way:
tell application "FileMaker Pro Advanced"
  duplicate (records 1334 thru 3333)
end tell
...and:
tell application "FileMaker Pro Advanced"
  duplicate (records 334 thru 2333)
end tell
My scripting dealt with the entire found set, which by the above had risen to 39,000 records. This means that 4,000 records should be omitted. I applied the two approaches combined, Six Fried Rice's as well as Comment's, with the twist that, since the relationship is already at work, why not attempt an unstored search in a layout with a portal - not something you would expect to be particularly fast, since the index has to be rebuilt each time the contamination happens. Anyhow, I never exceeded 6 minutes of processing by doing this:
Go to Layout [ "Layout #2" (pling) ]
Enter Find Mode [ ]
Set Error Capture [ On ]
Go to Portal Row [ First ]
Set Field [ pling 2::field1; "*" ]
Perform Find [ ]
Set Error Capture [ On ]
Sort Records [ Specified Sort Order: pling::field1; ascending; pling::field2; ascending; pling::field3; ascending ] [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Loop
  Set Variable [ $howMany; Value:Count ( pling 2::RecordID ) ]
  Omit Record
  Exit Loop If [ Get ( FoundCount ) = Get ( RecordNumber ) ]
  Go to Record/Request/Page [ Get ( RecordNumber ) + $howMany ] [ No dialog ]
End Loop
Show Omitted Only
Go to Layout [ original layout ]
I didn't introduce any new indexes by adding fields, since RecordID should be present anyway. I have thought of a way to make the irrelevant fields disappear from the equation: the separation model seems to give access to stored fields only if the data file behind it is closed ... and then the script could be performed in this interface file only. If others have experimented with this approach, I suggest you chime in!
--sd