Mike Bee Posted December 31, 2007
I have a DB with ~35,000 entries; each entry has 60 fields. I am trying to automate a process to delete or omit duplicate entries based on three fields. My problem is that even with indexed fields, it is taking too long to run an automated script, since it tries to compare each record to all 35,000 entries. Does anyone have suggestions for a script or method to speed up this process?
comment Posted December 31, 2007
Define a calculation field that concatenates the three fields. Sort the records by this field, and compare each record only to the previous record's value (stored in a variable). There are other methods, too; it all depends on how often you need this and for what purpose.
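A minimal sketch of this approach as FileMaker script steps, assuming a stored calculation field Contacts::DupeKey that concatenates the three fields (the table and field names are illustrative, not from the post), with the script's sort order set to DupeKey ascending:

Sort Records [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Set Variable [ $previousKey; Value:"" ]
Loop
  # Remember whether this pass started on the last record, so the loop ends cleanly.
  Set Variable [ $onLast; Value:Get ( RecordNumber ) = Get ( FoundCount ) ]
  If [ Contacts::DupeKey = $previousKey ]
    # Same key as the previous record: a duplicate. Deleting it makes the
    # following record slide into this position, so we do not advance.
    Delete Record/Request [ No dialog ]
  Else
    Set Variable [ $previousKey; Value:Contacts::DupeKey ]
    Go to Record/Request/Page [ Next ]
  End If
  Exit Loop If [ $onLast ]
End Loop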
Søren Dyhr Posted January 1, 2008
60 fields per record is asking for problems if indexes are sprinkled arbitrarily all over the place, along with numerous interdependent calculation fields; these ought to be broken out into several tables, and exhausting scripts should be run in table occurrence groups where irrelevant TOs have been disconnected. Problems like these sometimes occur when a database is mistaken for a spreadsheet.
I rather fancied the method Comment suggests, which can be found here:
http://www.databasepros.com/FMPro?-DB=resources.fp5&-lay=cgi&-format=list.html&-FIND=+&resource_id=DBPros000669
until I realized that it's actually much slower than this:
http://sixfriedrice.com/wp/deleting-duplicate-records-in-filemaker/
--sd
comment Posted January 1, 2008
It was not my intention to suggest any particular method - only to "unsuggest" comparing each record to all 35,000 entries. Regarding speed, I would refer you to:
http://www.fmforums.com/forum/showtopic.php?tid/159484/
These tests should probably be re-examined in version 9. In any case, speed alone should not be the determining factor: there is also the question of how many resources each method requires.
Fenton Posted January 1, 2008 (edited)
The speed of any method to remove duplicates depends on the number of duplicates vs. the number of records. Since the number of duplicates is hopefully pretty small, isolating them would be a priority for me. The poster is saying that this is an operation that is going to be run repeatedly, and that the priority (in this case) is speed.
I would use the first part of Comment's idea: create a concatenated field of all three fields. This of course bloats the file, especially if you Find on the field. But that's the price you pay for speed. I would first do a Find, using the "!" special character, to find all duplicates. It would be slow the 1st time, but not afterwards.
I would then use either a "sort and compare" method - a Loop (starting at either the 1st or last record, depending on which record you want to keep), holding the data in variables to see which record is the duplicate, and resetting when the data changes - or a self-relationship on the concatenated duplicate field. In fact, I'd likely use the latter. You've already paid the disk-space price by indexing the concatenated duplicate field, so you might as well use it.
Such a self-relationship, comparing a self-related serial ID* to the local serial ID, will positively identify either the 1st or last record of the self-relationship. If the record is not the 1st (or last), then delete it. This method does not require sorting.
The self-relationship can also be used in a calculation (result Text) to produce a "Duplicate!" warning on a Form view data entry layout, as soon as the 3 fields are populated and committed.
*If you do not have a unique auto-enter serial ID in the table, you should.
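A minimal sketch of that self-relationship flag as a FileMaker calculation, assuming a self-join table occurrence Contacts_sameKey matched on the concatenated key field and an auto-enter serial field SerialID (both names are illustrative):

// Unstored calculation, result Text. With no sort order on the relationship,
// related records come back in creation order, so Contacts_sameKey::SerialID is
// the serial of the FIRST record in this duplicate group. If it differs from
// this record's own serial, this record is a later duplicate.
If ( Contacts_sameKey::SerialID ≠ Contacts::SerialID ; "Duplicate!" ; "" )

A script can loop through the records and delete any record where this flag is not empty, or the calculation can simply sit on a Form view layout as the data-entry warning Fenton describes. Sorting the relationship descending by SerialID would keep the last record of each group instead of the first.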
Søren Dyhr Posted January 1, 2008
Fenton is right here: the level of contamination really does matter. And yes, Six Fried Rice's algorithm could indeed benefit both from the "!" search Fenton suggests and from Comment's adaptation of the Fast Summaries algorithm, sans summary fields, jumping to the first record of the next group using the calculation:
Get ( RecordNumber ) + Count ( selfJoin::etc. )
My objection was primarily to stuffing a $variable 34,999 times, although that approach hardly has any impact on the number of fields involved, nor does it blur the relationship graph!
--sd
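For context, a minimal sketch of that Fast Summaries-style jump as FileMaker script steps, assuming a self-join (selfJoin) matched on the concatenated key, a serial field RecordID, and records sorted by that key (all names are illustrative):

Sort Records [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Loop
  # Count ( selfJoin::RecordID ) is the size of the current duplicate group, so
  # adding it to the current record number lands on the first record of the next group.
  Set Variable [ $groupSize; Value:Count ( selfJoin::RecordID ) ]
  # ... process the first record of this group here (e.g. keep it, omit or delete the rest) ...
  Exit Loop If [ Get ( RecordNumber ) + $groupSize > Get ( FoundCount ) ]
  Go to Record/Request/Page [ Get ( RecordNumber ) + $groupSize ] [ No dialog ]
End Loop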
comment Posted January 1, 2008
I would still reserve judgment without knowing more. I didn't get the impression that speed was the primary concern here - only that the current method was unbearably slow, trying to perform 35,000 x 35,000 comparisons (if not three times as many). A loop over a found set of 35,000 doesn't take THAT long, and performing a find for duplicates beforehand, as Fenton suggested, could very well be enough. If there's one thing I learned from the thread I linked to, it's that there is no single preferred method for all situations - not even when speed IS paramount.
Søren Dyhr Posted January 2, 2008
I didn't get the impression that speed was the primary concern here - only that the current method was unbearably slow
Are we being rhetorical here? The word "speed" made the headline, or rather the topic, here!
that there is no single preferred method for all situations
Indeed! This is why we usually ask about purpose and context, although that's exactly where it usually gets tricky:
Consulting is hard because clients are not acting rationally. They will have a problem but will never admit it, and the problem is always a people problem, no matter how technical it might seem at first ... Weinberg's advice is not to try to be rational at all costs, and not to force clients to admit their problems and fears. Consultants should be reasonable rather than rational, cultivate a paradoxical frame of mind and help clients solve their problems by themselves.
Snipped from: http://agylen.com/2005/03/21/the-secrets-of-consulting-review/
I laughed at myself; I have indeed been a fool not to realize this on my own, without reading it. I was led to it, by the way, by these excellent blog postings: http://www.sumware.net/robfm2/?p=30
Thanks, Rob Russell!
--sd
comment Posted January 2, 2008
Are we being rhetorical here? The word "speed" made the headline, or rather the topic, here!
I think we could very well be, because we are debating differences that are negligible compared to the original problem. Let's say, for argument's sake, that it takes a minute to loop through all 35,000 records and perform 35,000 comparisons. Now, I don't know exactly how the original script works, but let's say it compares each record in turn to all the remaining records in the found set. If there are no duplicates, it will require about 612,500,000 comparisons (35,000^2 / 2). If it takes one minute to perform 35,000 comparisons, this script will run for more than 12 days.
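Spelling out that estimate (just restating the post's own figures):

\[ \frac{35{,}000^2}{2} = 612{,}500{,}000 \ \text{comparisons}, \qquad \frac{612{,}500{,}000}{35{,}000 \ \text{per minute}} = 17{,}500 \ \text{minutes} \approx 12.15 \ \text{days} \]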
Mike Bee (Author) Posted January 2, 2008
Define a calculation field that concatenates the three fields. Sort the records by this field, and compare each record only to the previous record's value (stored in a variable).
I tried this; it works OK until you get over about 1,000 entries, so the more entries that come into the current found set, the slower it goes.
60 fields per record is asking for problems if indexes are sprinkled arbitrarily all over the place, along with numerous interdependent calculation fields; these ought to be broken out into several tables, and exhausting scripts should be run in table occurrence groups where irrelevant TOs have been disconnected.
I have been tailoring my scripts to work on only three fields: last name, house number and zip code. Basically I'm trying to sort lots of data entries to remove duplicate households (i.e., generating a mailing list that does not include more than one piece of mail per household). Do you think that splitting those three fields off into a separate table would have an impact?
I would use the first part of Comment's idea: create a concatenated field of all three fields. This of course bloats the file, especially if you Find on the field. But that's the price you pay for speed.
I've already made a single field based on my three search fields.
If there are no duplicates, it will require about 612,500,000 comparisons (35,000^2 / 2). If it takes one minute to perform 35,000 comparisons, this script will run for more than 12 days.
I ran a smaller set of 1,000 entries as a time trial over New Year's Day, and my fastest script took ~4 hours to run it - and that's with the data stored locally on a pretty fast computer with 10,000 rpm drives.
All in all, it seems these are my priorities:
1) perform a find to isolate the duplicate entries (sketched below)
2) store the omitted (non-duplicate) entries in a new table
3) perform the duplicate-entry comparison and removal on the found set
4) add the results of the found set to the table with the previously omitted entries
I'm going to spend today trying to get this together.
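A minimal sketch of step 1 as FileMaker script steps, assuming the concatenated field is Contacts::DupeKey (an illustrative name); the "!" find operator matches every record whose value in that field occurs more than once:

Enter Find Mode [ ]
Set Field [ Contacts::DupeKey; "!" ]
Set Error Capture [ On ]
Perform Find [ ]
# The found set now contains only the duplicate groups; the sort-and-compare
# (or self-join) deletion only has to run on this much smaller set.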
comment Posted January 2, 2008
it works OK until you get over about 1,000 entries, so the more entries that come into the current found set, the slower it goes.
That doesn't make sense. Can you post a copy of your script?
2) store the omitted (non-duplicate) entries in a new table
What's the purpose in moving non-duplicates to another table?
Søren Dyhr Posted January 2, 2008
There is something else wrong here. I made myself a found set of 35,000 records, all with unique data, and then deliberately contaminated the data twice via AppleScript, this way:
tell application "FileMaker Pro Advanced"
  duplicate (records 1334 thru 3333)
end tell
...and:
tell application "FileMaker Pro Advanced"
  duplicate (records 334 thru 2333)
end tell
My scripting dealt with the entire found set, which by the above had risen to 39,000 records. This means that 4,000 records should be omitted. I applied the two approaches combined, Six Fried Rice's as well as Comment's, with the twist that, since the relationship is already at work, why not attempt an unstored search in a layout with a portal - not something you would expect to be particularly fast, since the index has to be rebuilt each time the contamination happens. Anyhow, I never exceeded 6 minutes of processing by doing this:
Go to Layout [ "Layout #2" (pling) ]
Enter Find Mode [ ]
Set Error Capture [ On ]
Go to Portal Row [ First ]
Set Field [ pling 2::field1; "*" ]
Perform Find [ ]
Set Error Capture [ On ]
Sort Records [ Specified Sort Order: pling::field1; ascending; pling::field2; ascending; pling::field3; ascending ] [ Restore; No dialog ]
Go to Record/Request/Page [ First ]
Loop
  Set Variable [ $howMany; Value:Count ( pling 2::RecordID ) ]
  Omit Record
  Exit Loop If [ Get ( FoundCount ) = Get ( RecordNumber ) ]
  Go to Record/Request/Page [ Get ( RecordNumber ) + $howMany ] [ No dialog ]
End Loop
Show Omitted Only
Go to Layout [ original layout ]
I didn't introduce any new indexes by adding fields, since RecordID should be present anyway. I have thought of a way to make the irrelevant fields disappear from the equation: the separation model seems to give access to stored fields only if the data file behind it is closed ... and then the script could be performed in this interface file only. If others have experimented with this approach, I suggest you chime in!
--sd