
Searching records for duplicates




I've got a database with a field called number, which holds a number.

There are 20,000 or so records.

If I want to find the duplicates I just search using the exclamation mark find symbol.

No problem!

Is there a way to search to keep one duplicated record, and omit the other duplicates?

Say I have 4 records with the number 365 in them.

When I search for ! it returns all 4 of those records.

Is there a way for it to return just 3 of those records, leaving one of them in with the rest of the records?

Thanks

Ben


Check out the FM101 files by Fenton Jones here. He has one labeled Duplicates that may be helpful. I couldn't find it by itself, but I got a copy of it in the FM 101 set.


Yeah, I've got a couple of duplicate loops and such in there. Not a whole lot is different in 7, except you would use compound relationships instead of concatenated keys; same principle, less cholesterol. BTW, the $20 donation was just wishful thinking. Perhaps 1 or 2 people have ever sent it, over several years; maybe not. You may safely ignore it :-]
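For example (field names made up): in 6 you'd build a calculated key on both sides, something like

   cMatchKey (calculation, text) = CustomerID & "|" & ProductID

and relate the files on cMatchKey = cMatchKey. In 7 you can define one relationship with two match-field pairs directly:

   CustomerID = OtherTable::CustomerID
   AND ProductID = OtherTable::ProductID

No helper field, same matching behavior.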


Hi Bob,

I looked at the link you provided and saw that you had uploaded three files to that thread; I wasn't sure which one you meant.

And I remember the Phone file by Fenton, and thought it might suit the need.

Hi Fenton,

Are the v5 files straight converts of the v3 files, or did you do some enhancements?

$20.00 is cheap. I owe you a lot more than that though. One of these days we will have lunch or dinner on me.

In the meantime, I would rather owe it to you than screw you out of it.

Lee



Here's an idea based on version 5 or 6 (you can do it more easily in v.7).

This script compares a global field to the data in the field that you are using to search for duplicates--let's call it tDuplicate.

Set up a global text field called gText.

Now, run a script which first finds and sorts all the duplicates. (Sort them by tDuplicate so the duplicated records will be grouped together.)

The script then goes to the first record, sets gText to the value of tDuplicate, and then omits that first record.

Next, the script runs a loop which goes through each record. As it goes, it checks to see if gText is the same as tDuplicate. If it is the same, nothing happens; if it is different, the script resets gText to the new value and omits the record. Something like this:

Perform Find [Restore]               (the stored request is ! in tDuplicate)
Sort [Restore, No dialog]            (sorted by tDuplicate)
Go to Record [First]
Set Field [gText, tDuplicate]
Omit Record
Loop
   If [gText ≠ tDuplicate]
      Set Field [gText, tDuplicate]
      Omit Record
   End If
   Go to Next Record [Exit after last record]
End Loop

So, the first time it comes to a duplicated field, it omits it, but the subsequent duplications are passed by. Then, when it comes to a new set of duplications, it omits the first one and then passes by the rest of those, etc, etc.

In case you haven't used looping scripts before, the thing to make sure of is that on the "Go to next record" script step, you check off the box that says "Exit after last record." Otherwise the loop just keeps on going and going. I always try out a looping script without the loop first. It will just work on the first two records that way and you don't waste a lot of time. The Omit Record script step just excludes the current record from the found set.


I think your Go to next record step should be inside an Else portion of your If statement. You run the risk of skipping many records otherwise.
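Roughly like this (same fields as above, and assuming you're looping over the found set of duplicates from the ! find):

Loop
   If [gText ≠ tDuplicate]
      Set Field [gText, tDuplicate]
      Omit Record
   Else
      Go to Next Record [Exit after last record]
   End If
End Loop

That way the omit itself moves you to the next record, and the explicit Go to Next Record only fires when nothing was omitted.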

I am not sure if this has been mentioned yet, but the self-relationship duplicate-flag technique is probably much faster than this, though it may depend on how many thousands of records your file contains. To do this, create a self-relationship based on your number field, then create a calculated field of serial = selfrelationship::serial. Performing a find for 1 in this field will return a found set of distinct 'number' records (the first instance of each number). Performing a find for 0 will return all the extra duplicates.
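In outline, it might look like this (the serial field and the self-relationship are from the description above; cDupeFlag is just a name I'm making up for the calculation, and the relationship is left unsorted):

Self-relationship (file related to itself):
   number = selfrelationship::number

Fields:
   serial      number, auto-entered serial (unique per record)
   cDupeFlag   calculation, number result, unstored (it references a related field):
               serial = selfrelationship::serial

For the first record created with a given number, the first related record through selfrelationship is itself, so cDupeFlag evaluates to 1; for every later duplicate it evaluates to 0. A find for 0 in cDupeFlag gives you just the extras.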


Queue, I assumed that some technique using a self relationship would be fastest too, but in the thread I referred to earlier, I tried a number of different techniques to go through the duplicates, and still found the looping script generally fastest. It does depend on the number of total records, and percentage of records that are duplicates.

The method you mentioned was proposed by Comment in that same thread, but because the field is unindexed, it is actually slower than the standard find-duplicates (using the ! symbol) method followed by a looping script. The file that I attached there includes the find on the self relationship as one of the methods. When the number of records increases above about 10,000, there is a significant difference in speed. However, I noted that FM appears to build a temporary index when doing the unindexed find, so that if you do another one immediately, it is blazing fast. But as soon as you modify any record, the index is lost and the unindexed find returns to its previous speed.

The method that I found to be fastest was a variation on the looping script which used a Count(selfjoin::IDfield) formula to skip over records in the loop. So, in the end, a self relationship does play a part in it.


I've run into similar comparisons. I needed to create a Customer file from my Invoices (data migration), so I isolated my unique Customers in a 300,000-record Invoice file (each customer had approx. 400 orders). The fields were already indexed. The self-join was lightning fast in comparison under those circumstances. I don't recall the exact time comparisons, but the self-join method took roughly 15 minutes compared to 2 hours (overall) start to finish.

The number of duplicates and indexing indeed seem to make a big difference.


