
Searching records for duplicates




I've got a database with a field called number, which holds a number.

There are 20,000 or so records.

If I want to find the duplicates I just search using the exclamation mark find symbol.

No problem!

Is there a way to search to keep one duplicated record, and omit the other duplicates?

Say I have 4 records with the number 365 in them.

When I search for ! it returns all 4 of those records.

Is there a way for it to return just 3 of those records, leaving one of them in with the rest of the records?

Thanks

Ben


Check out the FM101 files by Fenton Jones here. He has one labeled Duplicates that may be helpful. I couldn't find it by itself, but I got a copy of it in the FM 101 set.


Yeah, I've got a couple of duplicate loops and such in there. Not a whole lot is different in 7, except you would use compound relationships instead of concatenated keys; same principle, less cholesterol. BTW, the $20 donation was just wishful thinking. Perhaps 1 or 2 people have ever sent it, over several years; maybe not. You may safely ignore it :-]
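For example (field names made up): in 6 you'd build a calculated key on both sides, something like

   cMatchKey (calculation, text) = CustomerID & "|" & ProductID

and relate the files on cMatchKey = cMatchKey. In 7 you can define one relationship with two match-field pairs directly:

   CustomerID = OtherTable::CustomerID
   AND ProductID = OtherTable::ProductID

No helper field, same matching behavior.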


Hi Bob,

I looked at the link you provided and saw that you had uploaded three files to that thread; I wasn't sure which one you meant.

And I remember the Phone file by Fenton, and thought it might suit the need.

Hi Fenton,

Are the v5 files straight converts of the v3 files, or did you do some enhancements?

$20.00 is cheap. I owe you a lot more than that though. One of these days we will have lunch or dinner on me.

In the meantime, I would rather owe it to you than screw you out of it.

Lee



Here's an idea based on version 5 or 6 (you can do it more easily in v.7).

This script compares a global field to the data in the field that you are using to search for duplicates--let's call it tDuplicate.

Set up a global text field called gText.

Now, run a script which first finds and sorts all the duplicates. (Sort them by tDuplicate so the duplicated records will be grouped together.)

The script then goes to the first record, sets gText to the value of tDuplicate, and then omits that first record.

Next, the script runs a loop which goes through each record. As it goes, it checks to see if gText is the same as tDuplicate. If it is the same, nothing happens; if it is different, the script resets gText to the new value and omits the record. Something like this:

Perform Find [Restore]               (the stored request is ! in tDuplicate)
Sort [Restore, No dialog]            (sorted by tDuplicate)
Go to Record [First]
Set Field [gText, tDuplicate]
Omit Record
Loop
   If [gText ≠ tDuplicate]
      Set Field [gText, tDuplicate]
      Omit Record
   End If
   Go to Next Record [Exit after last record]
End Loop

So, the first time it comes to a duplicated field, it omits it, but the subsequent duplications are passed by. Then, when it comes to a new set of duplications, it omits the first one and then passes by the rest of those, etc, etc.

In case you haven't used looping scripts before, the thing to make sure of is that on the "Go to next record" script step, you check off the box that says "Exit after last record." Otherwise the loop just keeps on going and going. I always try out a looping script without the loop first. It will just work on the first two records that way and you don't waste a lot of time. The Omit Record script step just excludes the current record from the found set.


I think your Go to next record step should be inside an Else portion of your If statement. You run the risk of skipping many records otherwise.
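Roughly like this (same fields as above, and assuming you're looping over the found set of duplicates from the ! find):

Loop
   If [gText ≠ tDuplicate]
      Set Field [gText, tDuplicate]
      Omit Record
   Else
      Go to Next Record [Exit after last record]
   End If
End Loop

That way the omit itself moves you to the next record, and the explicit Go to Next Record only fires when nothing was omitted.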

I am not sure if this has been mentioned yet, but the self-relationship duplicate-flag technique is probably much faster than this, though it may depend on how many thousands of records your file contains. To do this, create a self-relationship based on your number field, then create a calculated field of serial = selfrelationship::serial. Performing a find for 1 in this field will return a found set of distinct 'number' records (the first instance of each number). Performing a find for 0 will return all the extra duplicates.
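In outline, it might look like this (the serial field and the self-relationship are from the description above; cDupeFlag is just a name I'm making up for the calculation, and the relationship is left unsorted):

Self-relationship (file related to itself):
   number = selfrelationship::number

Fields:
   serial      number, auto-entered serial (unique per record)
   cDupeFlag   calculation, number result, unstored (it references a related field):
               serial = selfrelationship::serial

For the first record created with a given number, the first related record through selfrelationship is itself, so cDupeFlag evaluates to 1; for every later duplicate it evaluates to 0. A find for 0 in cDupeFlag gives you just the extras.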


Queue, I assumed that some technique using a self relationship would be fastest too, but in the thread I referred to earlier, I tried a number of different techniques to go through the duplicates, and still found the looping script generally fastest. It does depend on the number of total records, and percentage of records that are duplicates.

The method you mentioned was proposed by Comment in that same thread, but because the field is unindexed, it is actually slower than the standard find-duplicates (using the ! symbol) method followed by a looping script. The file that I attached there includes the find on the self relationship as one of the methods. When the number of records increases above about 10,000, there is a significant difference in speed. However, I noted that FM appears to build a temporary index when doing the unindexed find, so that if you do another one immediately, it is blazing fast. But as soon as you modify any record, the index is lost and the unindexed find returns to its previous speed.

The method that I found to be fastest was a variation on the looping script which used a Count(selfjoin::IDfield) formula to skip over records in the loop. So, in the end, a self relationship does play a part in it.


I've run into similar comparisons. I needed to create a Customer file from my Invoices (data migration), so I isolated my unique Customers in a 300,000-record Invoice file (each customer had approx. 400 orders). The fields were already indexed. The self-join was lightning fast in comparison under those circumstances. I don't recall the exact time comparisons, but the self-join method took roughly 15 minutes compared to 2 hours (overall) start to finish.

The number of duplicates and indexing indeed seem to make a big difference.


