Have dupe records (same Acct#) show "top 1" each

OnTheOutside · May 13, 2005

Greetings, I have an ENORMOUS database (16 million records, only two related tables). "AcctNbr" ties the two together. The problem is that there are records for the same accountnumbers in the Accounts table, one was created for every year the account was open,

"AcctNbr"=1, "BillYr"=1998,

"AcctNbr"=1, "BillYr"=1999,

"AcctNbr"=1, "BillYr"=2000,

"AcctNbr"=1, "BillYr"=2001

What I need to do is this:

1) Show records with duplicate "AcctNbr" and then show only the "top 1" ("top 1" refers to the 'most recent year'. In the above example it would only show me the record for "AcctNbr"=1, "BillYr"=2001--everything else should be hidden/deleted.

2) [i can handle this part], Show all records that don't have dupe AcctNbr and preserve these so they don't get deleted when doing the above search.

It is something like FIND "AcctNbr"=!

& then

"BillYr"=%ExcludeAllButTheMostRecentYear%

Help, suggestions?

BobWeaver · May 14, 2005

You will need to use a looping script to go through the duplicates. I have attached a sample file showing how it's done.

FindDups.fp7.zip

OnTheOutside · May 14, 2005

OMG!@

Thanks so much. This is exactly what I needed. Thanks so much for the example; now I know how to do this myself.

Søren Dyhr · May 14, 2005

You will need to use a looping script to go through the duplicates.

Not sure the number of records makes it a pleasant expirience, I would suggest the method I uploaded in this thread:

http://fmforums.com/forum/showpost.php?post/133778

But the script needs a little tweaking in order not to omit single occurences as well ...or might they be there as well???

Other ideas however pops up in my mind, a portal with reversed sortorder due to the ties mentioned or the Last( function over the relation made as a scripted replace stuffing the value in a field in the Account Table???

--sd

Edited October 30, 2008 by Guest
Updated Broken Link

BobWeaver · May 14, 2005

That's an interesting technique. I have looked at a couple of alternative methods to avoid looping through all duplicates. Both techniques involved using a relationship based on the "<" operator.

One used GTRR to skip to the next group, and the other used a global key field to set the flag field via the relationship without moving between records at all. Although they both worked, they were horrendously slow compared to the standard looping script.

BobWeaver · May 15, 2005

Out of curiosity, I tried implementing the Fast Summaries method, but it is much slower than the standard looping script (unless I implemented it wrong). It seems to slow down exponentially as the number of records increases. It took 154 seconds to process 3000 records, compared to 22 seconds for the classic looping script.

It appears that the running count summary field gets very slow to calculate as the record number increases. However this gave me an idea for a variation using a Count(selfJoin::etc.) calculation to skip through records. This method proved to be fastest at 11 seconds for 3000 records, twice as fast as the classic looping script.

I have attached a file with all the methods I've tested including the two extremely slow methods I mentioned in my last post.

FindDupsVariations.fp7.zip

Søren Dyhr · May 15, 2005

I hope Mr. Horak (Comment) sees this, summaries under 7.0 is much slower than they used to be ...I'm glad you had the time to test it and tweak the idea a bit. This is why forums rock ... Given enough eyeballs as they say in sourceforge!

--sd

comment · May 16, 2005

I do see this, Mr. Dyhr. So, are we now going to put a big red X on summaries?

BTW, Bob, I think there may be a faster method - see attached.

FindDupsVariations.fp7.zip

BobWeaver · May 16, 2005

Interesting, I'm surprised that a find on an unstored calc field would be so fast. This is certainly a switch from my past experience. Good work.

On a hunch, I switched to a blank layout and re-ran my method and the classic method. In that situation, they both ran faster, and took almost the same amount of time (7 seconds for mine, 9 seconds for classic). The times given are for my Powerbook G4, 400 MHz. OSX panther.

Comment, with your method, I noticed that it usually runs in about 1/3 the time of the other methods (2 or 3 seconds), but in some situations it actually runs slower. I haven't quite figured out what the reason is. But if you run your "just find" method immediately after running my method, it will take a second or two longer (sometimes 8 or 9 seconds). It doesn't seem to be related to sort order or the number of records in the found count.

BTW, in order to meet the original requirements of this thread, the result of the find should produce the record with the latest year. This requires changing the formula for cFlag from:

Year=SameAcctNo::Year

to:

Year=Max(SameAcctNo::Year)

This slows things down somewhat, but is still fast.

I've attached another copy of the file which includes the modifications discussed.

FindDupsVariations3.fp7.zip

comment · May 16, 2005

I have noticed some unexplained phenomena as well: the first time I added Perform Find to your script, it took about a minute. Then it came back to around 6 - 7 seconds. I made the flag unindexed, then allowed indexing, but it didn't make difference any more. Mine ran at around 5 - 6 seconds, never less - and my processor is 1.25GHz, so go figure.

No change of the formula is required, IMHO. The relationship is sorted, therefore SameAcctNo::Year already is the highest related year (which should be a number field, BTW). I have never tested this thoroughly, but I tend to suspect the Max() function of being slow, and rarely use it.

Vaughan · May 16, 2005

At least with FMP 6 and earlier, when a find is performed FMP pulls the field data down from all records to the client, and then (if possible) caches it to temp disk space. It gets re-used for subsequent searches in the same session.

Exit FMP (or maybe even close the database) and the cached data is deleted.

comment · May 16, 2005

Interesting point - but then my find, which is performed on unstored field, should also take the same amount of time - at least the first time. But it didn't.

BobWeaver · May 16, 2005

No change of the formula is required, IMHO. The relationship is sorted..

Oh right. I missed that.

As for the unexplained change in execution times, it might be worthwhile doing a flush cache to disk to see if that makes a difference.

Okay I tried a flush cache, and it has no effect. But here is something interesting. If you make any change to any field data before doing the unindexed find, it takes 8-9 seconds. If you do it again immediately, it takes about 2 seconds. I suspect that as FM does the unindexed find, it builds a temporary index which it may re-use in subsequent finds as long as no changes are made to any field data. Hence the subsequent finds are fast until data is changed.

comment · May 16, 2005

That sounds reasonable. Perhaps the poster should clarify how often he intends to do this - it seems there is no single best choice for all scenarios.

BobWeaver · May 16, 2005

I also expect that the best method may depend on quantity of records, and what percentage are duplicates.

stubanz · June 15, 2005

I was just wondering guys...is there any way to do this without using a "flag" field? I used the classic script in a DB to weed out duplicates, but I ran into trouble when I wanted to further cull the records I want.

For example, first I want to narrow down the records so that their names show only if their end dates aren't duplicates, and to do this I use the classic loop and a global savedID field. If[custID=gsavedID and endDate = gsavedEndDate], and this weeds out everything I want.

Next, I want to show the customers that are only in Japan. I was thinking of using the flag field and just replacing the flags for fields that weren't in Japan. So, a record would be flagged if it was a duplicate and if it's not in Japan. The problem with this is that the flag field somehow becomes 'locked,' a problem I wasn't aware of when I started making the scripts.

I was hoping that there was a way to look through the records, and instead of flagging the ones that I didn't want to show, keeping track of the ones that I did want to show. I didn't know how to do this if I want to show a record that is has a duplicate name and ID #, with two unique end dates, and they're both in Japan.

(ie: Joe's Harleys, 43992, 5/12/06, JP 01

Joe's Harleys, 43992, 5/12/06, JP 02

Joe's Harleys, 43992, 12/12/07, JP 03

Jim's Ducatis, 43921, 3/4/06, JP 04

Joe's Harleys, 43992, 5/21/07, NZ 05)

Is it possible to show 1, 3, and 4? or is this just too complicated a task for FMP?

Any help will be greatly appreciated. I'm just so stuck right now it's not funny any more

BobWeaver · June 15, 2005

Instead of searching for 1 in the flag field, search for blank or zero (depending on which find script you use)

Sign In

Have dupe records (same Acct#) show "top 1" each

Recommended Posts

OnTheOutside

BobWeaver

OnTheOutside

Søren Dyhr

BobWeaver

BobWeaver

Søren Dyhr

comment

BobWeaver

comment

Vaughan

comment

BobWeaver

comment

BobWeaver

stubanz

BobWeaver

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Explore

Affiliate Forums

Activity

Important Information