|
Your continued generosity and support of FMForums is greatly appreciated. |
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
This is a highly optimized version of the "Unique List" function.
I was having terrible performance problems with the common "unique list" algorithm (as seen http://www.briandunning.com/cf/596 ) and similar.
The issue is that these functions use both the RightValues and CountValues functions -- both of which require filemaker to (A) separate the list into separate values, and (B) process the entire list.
I realized that when looking for unique values, we only need to find the First non-unique value -- at that point we can short-circuit the calculation.
And, since values are separated by paragraph characters, we can actually just use a character-based method which turns out to be much faster since it doesn't require turning a string into a list of separate values.
Without further ado, I present "UniqueListFast":
UPDATE: see below for an even Faster version of this function that can handle more than 10,000 unique items: http://fmforums.com/forum/showpost.php?post/359130/
Code:
UniqueListFast(theList)
// recursive function that takes a list and returns only the unique items
// This is highly optimized for speed -- instead of using CountValues and FilterValues , we use character-based comparisons using
// Position() which short-circuit the match
Let(
[ this = GetValue ( theList ; 1 ) // the value we are testing for uniqueness
;
nThis = Length(this) ; // character length of this value
nList = Length(theList) // character length of list
]
;
Case(
theList = "" ; "" ; // end of list
theList = "¶" ; "" ;// end of list
// if we find a second instance of this item? then skip this item, or include this item if unique
// note hack here: since we are searching for "¶this" we automatically will skip the 1st instance!
If ( Position( theList & ¶; ¶ & this & ¶; 1 ; 1) > 0 ; "" ; this & "¶" )
// and process rest of list recursively
& UniqueListFast( Right(theList; nList- nThis - 1) )
)
)
And the sister function which simply returns a count of the unique values
Code:
UniqueListCountFast(startCount; theList)
// recursive function that takes a list and returns the count of unique items
// This is a highly optimized version which uses Position() which can short-circuit the match
// other versions use FilterValues and RightValues which are very slow
Let(
[ this = GetValue ( theList ; 1 ) // the value we are testing for uniqueness
;
nThis = Length(this) ; // character length of this value
nList = Length(theList) // character length of list
]
;
Case(
theList = "" ; currentCount ; // end of list
theList = "¶" ; currentCount ;// end of list
// note hack here: since we are searching for "¶this" we automatically will skip the 1st instance!
// unique?
Position( theList & ¶; ¶ & this & ¶ ; 1 ; 1) > 0 ;
// not unique
UniqueListCountFast( currentCount ; Right(theList; nList- nThis - 1) ) ; // currentCount + restOfList
// unique
UniqueListCountFast( currentCount + 1 ; Right(theList; nList- nThis - 1) ) // currentCount + 1 + restOfList
)
)
Edited by xochi on 06-24-10 09:04 AM. Reason for edit: No reason given.
|
|
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
Nice but I think it needs some work. As in many other similar functions, you need to put a return on the end of the value and on the end of the list before doing your test.
UniqueListFast ("a¶ab¶abc¶abcd¶abcde")
Result:
abcde
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
This seems to do it:
Code:
/*
UniqueListFast(theList)
recursive function that takes a list and returns only the unique items
This is highly optimized for speed -- instead of using CountValues and FilterValues ,
we use character-based comparisons using Position()
which short-circuit the match
*/
Let(
[ this = LeftValues ( theList ; 1 ) // the value we are testing for uniqueness
;
nThis = Length(this) -1 ; // character length of this value
nList = Length(theList) // character length of list
]
;
Case(
theList = "" ; "" ; // end of list
theList = "¶" ; "" ;// end of list
// if we find a second instance of this item? then skip this item, or include this item if unique
// note hack here: since we are searching for "¶this" we automatically will skip the 1st instance!
If ( Position( theList & ¶; ¶ & this ; 1 ; 1) > 0 ; "" ; this )
// and process rest of list recursively
& UniqueListFast( Right(theList; nList- nThis - 1) )
)
)
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
I also think you can use a substitute function to drop out all the non-unique values, rather than always walking the entire list one item at a time.
|
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
In response to BruceR
Thanks for the fixes: I'm only testing using the List(table:field) which I suspect puts the trailing paragraph marker in. Is this the reason you changed from "GetValue" to "LeftValues"?
I'm trying for absolute pedal-to-the-medal raw speed here, so I'm not sure that Substitute() is going to help -- it probably depends on the nature of the list (e.g. if it's 10,000 records, 9900 of which are unique vs only 500 of which are unique).
In my use, the list only contains numeric primary keys, thus I don't care about case-sensitive comparisons (which are slow). And, I also know that my keys are sorted since I'm doing this via a relationship, so I don't need to scan the whole list -- once a matching key is found, we can just check the next item.
Here is the new variation:
The main optimizations here are
* using Exact()
* avoid re-building the list every iteration -- instead we just pass the whole list and the list index
When these two requirements are met (case insensitive, sorted keys, no empties) -- this new version can be 5x faster.
Code:
// Function that takes a list and returns the count of unique items
// ### This is a highly optimized version, best used for sorted keys.
// Assumptions:
// The list MUST be sorted
// The list is compared case-insensitive -- "AAA" will not match "aaa"
// The list must NOT contain any empty items (as it will terminate if it hits two empty items in a row)
//
// call the function as follows:
// UniqueListCountSortedFast(""; 1 ; 0 ; List(RelatedTable::Field))
//
Let(
[ // 'this' is the value we are testing for uniqueness
next = GetValue(theList; indx+1 ) // the next value to compare with
]
;
Case(
IsEmpty(this) and IsEmpty(next) ; currentCount ; // end of list
// the next lines are a performance hack :
// since indx keeps going higher, in a long list
// the "GetValue()" has to traverse a longer and longer list
// thus it *may* be faster to break the list into a smaller chunk every N items
indx > 500 ; // try different values here and benchmark
UniqueListCountSortedFast(this; 1; currentCount ; MiddleValues(theList; indx; 9999999) ) ;
// NOTE: Keys must be sorted and are compared byte-identical (case sensitive)
Exact( this; next) ; // keys match, so don't increment the found count
UniqueListCountSortedFast(next; indx + 1; currentCount ; theList ) ;
// not matching? this is a unique key, so increment the count and continue
UniqueListCountSortedFast(next; indx + 1; currentCount + 1; theList )
)
)
In benchmarking, I'm finding that I can do this calculation across 500,000 records, with 10 different unstored calculations in the child table, in about 20 minutes.
Update: turns out it was dying with over 10,000 items -- see below for an updated formula which can handle more items.
Using other UniqueList formulas it was so slow that I eventually gave up after it ran for 3 hours straight.
Edited by xochi on 06-23-10 09:10 PM. Reason for edit: No reason given.
|
comment
consultant
Posts: 15781

Post Rank (AVG):
FMP: 7 Advanced OS: Mac OS X Panther
Tweet This Post!
|
In response to xochi
I'm trying for absolute pedal-to-the-medal raw speed here
Have you tried using a value list?
|
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
In response to comment
I haven't tried value lists, although I do understand they might also do the trick -- the reason I didn't is that I have dozens of different fields to summarize, and if I use a valueList then I'll need to make dozens of different valueLists.
Using this technique I only need one relationship and then the rest can be done in calc fields (of which I have dozens). So it's really a design trade-off.
In your experience are value lists much faster?
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
Thanks for the fixes: I'm only testing using the List(table:field) which I suspect puts the trailing paragraph marker in. Is this the reason you changed from "GetValue" to "LeftValues"?
I'm trying for absolute pedal-to-the-medal raw speed here, so I'm not sure that Substitute() is going to help -- it probably depends on the nature of the list (e.g. if it's 10,000 records, 9900 of which are unique vs only 500 of which are unique).
Well, your claimed intent certainly doesn't match what you're actually doing.
First of all, did you try my version?
Do you understand why your version failed?
Your version failed because you were not using the whole value; you NEED to add the trailing return before testing for position. That's what leftValues does. LeftValues, rightValues, middleValues always give back a list with a trailing return.
|
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
In response to BruceR
Hi bruce,
Thanks for your comments -- I was confused by your change to LeftValue *and* adding the paragraph mark as this changed the length calc, but now I see how you did it.
I've updated the code in the original post so it should work now -- note that i kept using GetValue() rather than LeftValue so our two versions differ in how they calculate the length, but functionally I think they are now identical?
|
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
In response to xochi
The prior versions both had problems with more than 10,000 records since there's a built-in 10k (or 50k?) iteration limit in FileMaker.
Here is a version that (in my testing) can handle a much higher number of records. I've tested with a list of 400,000 items containing over 25000 unique keys.
I use a "loop unrolling" technique so that when it's processing a run of identical keys (which happens often, since the keys are sorted, remember!) it stays inside the custom function.
This one is 12 deep (since I was processing yearly data with typically no more than 12 records per person) but you could certainly edit it to go higher as needed.
In the case where the keys don't match, the code pathway is only 1-deep, so that might be a place to optimize.
Code:
UniqueListCountSortedFast12x(indx, currentCount, theList)
// Function that takes a list and returns the count of unique items
// This 12x version has an un-rolled loop inside it, which has been tested
// with a list of 400,000 items containing 24000 unique keys.
// Your mileage may vary.
//
// This is a highly optimized version for special use.
// Use at your own risk.
//
// Assumptions:
// The list MUST be sorted
// The list is compared case-insensitive (works best with number keys )
//
// call the function as follows:
// UniqueListCountSortedFast12x(1 ; 0 ; List(Table::Field))
//
Let(
[
this = GetValue(theList;indx) ; // value we are testing for
next01 = GetValue(theList;indx+1) // next value (get this once here as we use it twice)
]
;
Case(
IsEmpty(this) and IsEmpty(next01) ; currentCount ; // end of list
// performance hack : break the list up every 500 items then start back at indx=1
// we do this so that GetValue() doesn't have to traverse a super-long list.
indx > 500 ; // 500 is not magical, try other values & benchmark
UniqueListCountSortedFast12x(1; currentCount ; MiddleValues(theList; indx ;9999999) ) ;
// now the comparison -- when processing identical items it's very efficient
If(Exact(this; next01);
If(Exact(this;GetValue(theList;indx+2));
If(Exact(this;GetValue(theList;indx+3));
If(Exact(this;GetValue(theList;indx+4));
If(Exact(this;GetValue(theList;indx+5));
If(Exact(this;GetValue(theList;indx+6));
If(Exact(this;GetValue(theList;indx+7));
If(Exact(this;GetValue(theList;indx+8));
If(Exact(this;GetValue(theList;indx+9));
If(Exact(this;GetValue(theList;indx+10));
If(Exact(this;GetValue(theList;indx+11));
If(Exact(this;GetValue(theList;indx+12)); UniqueListCountSortedFast12x(indx + 12; currentCount ; theList ) ; // too many keep going in new function
UniqueListCountSortedFast12x(indx + 12; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x( indx + 11; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x( indx + 10; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x( indx + 9; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 8; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 7; currentCount + 1; theList )
);
UniqueListCountSortedFast12x(indx + 6; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 5; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 4; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 3; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 2; currentCount + 1; theList )
) ;
UniqueListCountSortedFast12x(indx + 1; currentCount + 1; theList )
)
) // end of Case
) // end of Let
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
10K/50K both true but dependent on type of recursion. The 50K you get with tail recursion.
|
BruceR
Pooh-Bah
Posts: 2218
Loc: Redmond WA
Post Rank (AVG):
FMP: 11 Advanced OS: Mac OS X Snow Leopard
Member: TechNet Skill: Expert
Certified:
DBUG SIG
Tweet This Post!
|
In response to xochi
The prior versions both had problems with more than 10,000 records since there's a built-in 10k (or 50k?) iteration limit in FileMaker.
Here is a version that (in my testing) can handle a much higher number of records. I've tested with a list of 400,000 items containing over 25000 unique keys.
I use a "loop unrolling" technique so that when it's processing a run of identical keys (which happens often, since the keys are sorted, remember!) it stays inside the custom function.
This one is 12 deep (since I was processing yearly data with typically no more than 12 records per person) but you could certainly edit it to go higher as needed.
In the case where the keys don't match, the code pathway is only 1-deep, so that might be a place to optimize.
Very nice.
|
xochi
veteran
Posts: 605
Post Rank (AVG):
FMP: 9 Advanced OS: Mac OS X Leopard
Member: TechNet Skill: Expert
Tweet This Post!
|
In response to BruceR
10K/50K both true but dependent on type of recursion. The 50K you get with tail recursion.
I *think* that my "UniqueListCountSortedFast12x" function is using Tail Recursion --
Reasoning:
A ) the final call on the stack doesn't need to evaluate the entire stack to calculate the result. It just returns "currentCount". But I may be confused on this point.
B ) I have tested this processing 12 months of personnel data in which the # of unique records is over 25,000 and the number of non-blank records is over 250,000 - so I'm pretty sure that if I was hitting the 10,000 limit, it wouldn't work (since it would probably fail around 12x10000 = 120k records). This suggests it does make full use of the 50k recursion limit. In any case, if one needs to handle more than 50,000 * 12 records, one could increase the depth of the nested If statements...
Edited by xochi on 07-05-10 09:28 AM. Reason for edit: No reason given.
|
|
Your continued generosity and support of FMForums is greatly appreciated. |
john renfrew
enthusiast
Posts: 34
Post Rank (AVG):
FMP: 11 Advanced OS: Cross Platform
Member: TechNet Skill: Advance
Tweet This Post!
|
In response to xochi
What about Scriptmaster???
One line of code
Code:
ts = new TreeSet(list.tokenize("\n")).sort()
Just run it on a list of 660000 words without hardly a blink
|
|
|