
UniqueListFast and UniqueListCountFast


This topic is 5054 days old. Please don't post here. Open a new topic instead.


This is a highly optimized version of the "Unique List" function.

I was having terrible performance problems with the common "unique list" algorithm (as seen at http://www.briandunning.com/cf/596 ) and similar functions.

The issue is that these functions use both the RightValues and CountValues functions -- both of which require FileMaker to (A) separate the list into individual values, and (B) process the entire list.

I realized that when looking for unique values, we only need to find the first duplicate of a value -- at that point we can short-circuit the calculation.

And, since values are separated by paragraph characters, we can actually just use a character-based method which turns out to be much faster since it doesn't require turning a string into a list of separate values.
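For readers who want the idea outside FileMaker syntax, here is a rough Python sketch of the character-based approach (illustrative only -- the real function below is FileMaker calc):

```python
def unique_list_fast(the_list):
    """Illustrative Python sketch of the character-based approach (the real
    function below is FileMaker calc).  Instead of splitting the list into
    values, we search for "¶value¶" as a substring of the whole list, which
    stops as soon as a duplicate is found.  A value is emitted only when no
    later duplicate exists."""
    if the_list in ("", "¶"):       # end of list
        return ""
    this, _, rest = the_list.partition("¶")
    # Searching for "¶this¶" in "the_list¶" automatically skips the current
    # (first) instance, because the first value has no leading "¶".
    has_later_duplicate = ("¶" + this + "¶") in (the_list + "¶")
    keep = "" if has_later_duplicate else this + "¶"
    return keep + unique_list_fast(rest)
```

For example, `unique_list_fast("x¶y¶x")` returns `"y¶x¶"` -- when a value repeats, the later occurrence is the one that survives.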

Without further ado, I present "UniqueListFast":

UPDATE: see below for an even faster version of this function that can handle more than 10,000 unique items: http://fmforums.com/forum/showpost.php?post/359130/

UniqueListFast(theList)

// Recursive function that takes a list and returns only the unique items.
// This is highly optimized for speed -- instead of using CountValues and
// FilterValues, we use character-based comparisons with Position(), which
// short-circuits the match.

Let(
[ this  = GetValue ( theList ; 1 ) ;  // the value we are testing for uniqueness
  nThis = Length ( this ) ;           // character length of this value
  nList = Length ( theList )          // character length of the list
] ;
Case(
  theList = ""  ; "" ;  // end of list
  theList = "¶" ; "" ;  // end of list

  // If we find a second instance of this item, skip it; otherwise include it.
  // Note the hack here: since we search for "¶this¶", we automatically skip the 1st instance!
  If ( Position ( theList & ¶ ; ¶ & this & ¶ ; 1 ; 1 ) > 0 ; "" ; this & "¶" )

  // ...and process the rest of the list recursively.
  & UniqueListFast ( Right ( theList ; nList - nThis - 1 ) )
)
)




And the sister function, which simply returns a count of the unique values:





UniqueListCountFast(currentCount; theList)

// Recursive function that takes a list and returns the count of unique items.
// This is a highly optimized version which uses Position(), which can
// short-circuit the match; other versions use FilterValues and RightValues,
// which are very slow.

Let(
[ this  = GetValue ( theList ; 1 ) ;  // the value we are testing for uniqueness
  nThis = Length ( this ) ;           // character length of this value
  nList = Length ( theList )          // character length of the list
] ;
Case(
  theList = ""  ; currentCount ;  // end of list
  theList = "¶" ; currentCount ;  // end of list

  // Note the hack here: since we search for "¶this¶", we automatically skip the 1st instance!
  // unique?
  Position ( theList & ¶ ; ¶ & this & ¶ ; 1 ; 1 ) > 0 ;
    // not unique
    UniqueListCountFast ( currentCount ; Right ( theList ; nList - nThis - 1 ) ) ;

  // unique
  UniqueListCountFast ( currentCount + 1 ; Right ( theList ; nList - nThis - 1 ) )
)
)


Nice, but I think it needs some work. As in many other similar functions, you need to put a return on the end of the value and on the end of the list before doing your test.

UniqueListFast ("a¶ab¶abc¶abcd¶abcde")

Result:

abcde


This seems to do it:


/*
UniqueListFast(theList)

Recursive function that takes a list and returns only the unique items.

This is highly optimized for speed -- instead of using CountValues and
FilterValues, we use character-based comparisons with Position(), which
short-circuits the match.
*/

Let(
[ this  = LeftValues ( theList ; 1 ) ;  // the value we are testing for uniqueness (includes its trailing ¶)
  nThis = Length ( this ) - 1 ;         // character length of this value
  nList = Length ( theList )            // character length of the list
] ;
Case(
  theList = ""  ; "" ;  // end of list
  theList = "¶" ; "" ;  // end of list

  // If we find a second instance of this item, skip it; otherwise include it.
  // Note the hack here: since "this" already ends in ¶, searching for "¶ & this"
  // means searching for "¶this¶", which automatically skips the 1st instance!
  If ( Position ( theList & ¶ ; ¶ & this ; 1 ; 1 ) > 0 ; "" ; this )

  // ...and process the rest of the list recursively.
  & UniqueListFast ( Right ( theList ; nList - nThis - 1 ) )
)
)


I also think you can use a Substitute() function to drop out all the non-unique values, rather than always walking the entire list one item at a time.


Thanks for the fixes: I'm only testing using List(table::field), which I suspect puts the trailing paragraph marker in. Is this the reason you changed from "GetValue" to "LeftValues"?

I'm trying for absolute pedal-to-the-metal raw speed here, so I'm not sure that Substitute() is going to help -- it probably depends on the nature of the list (e.g. if it's 10,000 records, 9900 of which are unique vs only 500 of which are unique).

In my use, the list only contains numeric primary keys, so I don't need case-insensitive comparisons (which are slow). And I also know that my keys are sorted, since I'm building the list via a relationship, so I don't need to scan the whole list -- once a matching key is found, we can just check the next item.

Here is the new variation:

The main optimizations here are

* using Exact()

* avoid re-building the list every iteration -- instead we just pass the whole list and the list index

When these requirements are met (sorted keys, case-sensitive comparison is acceptable, no empty items) -- this new version can be 5x faster.
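The sorted-keys idea can be sketched in Python (illustrative only, and iterative here for clarity; the FileMaker function below does the same comparison recursively):

```python
def unique_count_sorted(values):
    """Illustrative Python sketch of counting distinct values in a *sorted*
    list: because duplicates are adjacent, one exact comparison against the
    next item is enough -- no need to scan the rest of the list."""
    count = 0
    for i, value in enumerate(values):
        # Count an item only when it is the last of its run of duplicates.
        if i + 1 == len(values) or values[i + 1] != value:
            count += 1
    return count
```

For example, `unique_count_sorted(["1", "1", "2", "3", "3", "3"])` returns 3.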


UniqueListCountSortedFast(this; indx; currentCount; theList)

// Function that takes a list and returns the count of unique items.
// ### This is a highly optimized version, best used for sorted keys.
// Assumptions:
//   The list MUST be sorted.
//   The list is compared case-sensitive -- "AAA" will not match "aaa".
//   The list must NOT contain any empty items (as it will terminate if it
//   hits two empty items in a row).
//
// Call the function as follows:
//   UniqueListCountSortedFast ( "" ; 1 ; 0 ; List ( RelatedTable::Field ) )
//

Let(
[ // 'this' (a parameter) is the value we are testing for uniqueness
  next = GetValue ( theList ; indx + 1 )  // the next value to compare with
] ;
Case(
  IsEmpty ( this ) and IsEmpty ( next ) ; currentCount ;  // end of list

  // The next lines are a performance hack: since indx keeps growing, in a
  // long list GetValue() has to traverse a longer and longer list, so it
  // *may* be faster to break the list into a smaller chunk every N items.
  indx > 500 ;  // try different values here and benchmark
    UniqueListCountSortedFast ( this ; 1 ; currentCount ; MiddleValues ( theList ; indx ; 9999999 ) ) ;

  // NOTE: keys must be sorted and are compared byte-identical (case sensitive).
  Exact ( this ; next ) ;  // keys match, so don't increment the found count
    UniqueListCountSortedFast ( next ; indx + 1 ; currentCount ; theList ) ;

  // Not matching? This is a unique key, so increment the count and continue.
  UniqueListCountSortedFast ( next ; indx + 1 ; currentCount + 1 ; theList )
)
)

In benchmarking, I'm finding that I can do this calculation across 500,000 records, with 10 different unstored calculations in the child table, in about 20 minutes.

Update: it turns out this was failing with more than 10,000 items -- see below for an updated formula which can handle more items.

Using other UniqueList formulas, it was so slow that I eventually gave up after it ran for 3 hours straight.


I haven't tried value lists, although I do understand they might also do the trick -- the reason I didn't is that I have dozens of different fields to summarize, and if I use a value list then I'll need to make dozens of different value lists.

Using this technique I only need one relationship and then the rest can be done in calc fields (of which I have dozens). So it's really a design trade-off.

In your experience are value lists much faster?


Thanks for the fixes: I'm only testing using List(table::field), which I suspect puts the trailing paragraph marker in. Is this the reason you changed from "GetValue" to "LeftValues"?

I'm trying for absolute pedal-to-the-metal raw speed here, so I'm not sure that Substitute() is going to help -- it probably depends on the nature of the list (e.g. if it's 10,000 records, 9900 of which are unique vs only 500 of which are unique).

Well, your claimed intent certainly doesn't match what you're actually doing.

First of all, did you try my version?

Do you understand why your version failed?

Your version failed because you were not using the whole value; you NEED to add the trailing return before testing for Position. That's what LeftValues does. LeftValues, RightValues, and MiddleValues always give back a list with a trailing return.
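The failure mode is easy to reproduce outside FileMaker. Here is an illustrative Python check, using the `in` substring operator in place of FileMaker's Position():

```python
# Illustrative Python check of why the trailing return matters.
# Without it, "a" falsely matches the start of the longer value "ab".
the_list = "a¶ab¶abc¶abcd¶abcde"
this = "a"

# Broken test: search for "¶" + this -- matches the "¶ab" later in the list.
broken_match = ("¶" + this) in (the_list + "¶")

# Fixed test: bracket the value with returns on both sides.
fixed_match = ("¶" + this + "¶") in (the_list + "¶")

print(broken_match)  # True  -> "a" wrongly treated as a duplicate
print(fixed_match)   # False -> "a" correctly treated as unique
```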


Hi Bruce,

Thanks for your comments -- I was confused by your change to LeftValues *and* adding the paragraph mark, as this changed the length calc, but now I see how you did it.

I've updated the code in the original post so it should work now -- note that I kept using GetValue() rather than LeftValues(), so our two versions differ in how they calculate the length, but functionally I think they are now identical?


The prior versions both had problems with more than 10,000 records, since there's a built-in 10K (or 50K?) recursion limit in FileMaker.

Here is a version that (in my testing) can handle a much higher number of records. I've tested with a list of 400,000 items containing over 25000 unique keys.

I use a "loop unrolling" technique so that when it's processing a run of identical keys (which happens often, since the keys are sorted, remember!) it stays inside the custom function.

This one is 12 deep (since I was processing yearly data with typically no more than 12 records per person) but you could certainly edit it to go higher as needed.

In the case where the keys don't match, the code pathway is only 1-deep, so that might be a place to optimize.
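The run-skipping effect of the unrolled comparisons can be sketched in Python (a hypothetical helper for illustration, loop-based where the FileMaker version nests If() calls):

```python
def count_unique_runs(values, indx=0, count=0, depth=12):
    """Illustrative Python sketch (hypothetical helper, not the FileMaker
    function) of the run-skipping idea: within a single call, advance
    through up to `depth` identical neighbours before recursing, so a run
    of equal sorted keys costs only a few recursive calls."""
    if indx >= len(values):
        return count
    this = values[indx]
    # Walk forward while the next value matches, up to `depth` steps
    # (this loop plays the role of the 12 nested If() comparisons).
    steps = 1
    while steps < depth and indx + steps < len(values) and values[indx + steps] == this:
        steps += 1
    # If the run continues past `depth`, keep going without counting yet.
    run_continues = indx + steps < len(values) and values[indx + steps] == this
    return count_unique_runs(values, indx + steps,
                             count if run_continues else count + 1, depth)
```

With `depth=12`, a run of 30 identical keys costs only 3 calls instead of 30, which is why the recursion limit is reached so much later.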


UniqueListCountSortedFast12x(indx; currentCount; theList)

// Function that takes a list and returns the count of unique items.
// This 12x version has an unrolled loop inside it, which has been tested
// with a list of 400,000 items containing 24,000 unique keys.
// Your mileage may vary.
//
// This is a highly optimized version for special use.
// Use at your own risk.
//
// Assumptions:
//   The list MUST be sorted.
//   The list is compared case-sensitive with Exact() (works best with number keys).
//
// Call the function as follows:
//   UniqueListCountSortedFast12x ( 1 ; 0 ; List ( Table::Field ) )
//

Let(
[ this   = GetValue ( theList ; indx ) ;    // value we are testing for
  next01 = GetValue ( theList ; indx + 1 )  // next value (fetched once here as we use it twice)
] ;
Case(
  IsEmpty ( this ) and IsEmpty ( next01 ) ; currentCount ;  // end of list

  // Performance hack: break the list up every 500 items, then start back at indx = 1.
  // We do this so that GetValue() doesn't have to traverse a super-long list.
  indx > 500 ;  // 500 is not magical; try other values & benchmark
    UniqueListCountSortedFast12x ( 1 ; currentCount ; MiddleValues ( theList ; indx ; 9999999 ) ) ;

  // Now the comparison -- when processing runs of identical items it's very efficient.
  If ( Exact ( this ; next01 ) ;
   If ( Exact ( this ; GetValue ( theList ; indx + 2 ) ) ;
    If ( Exact ( this ; GetValue ( theList ; indx + 3 ) ) ;
     If ( Exact ( this ; GetValue ( theList ; indx + 4 ) ) ;
      If ( Exact ( this ; GetValue ( theList ; indx + 5 ) ) ;
       If ( Exact ( this ; GetValue ( theList ; indx + 6 ) ) ;
        If ( Exact ( this ; GetValue ( theList ; indx + 7 ) ) ;
         If ( Exact ( this ; GetValue ( theList ; indx + 8 ) ) ;
          If ( Exact ( this ; GetValue ( theList ; indx + 9 ) ) ;
           If ( Exact ( this ; GetValue ( theList ; indx + 10 ) ) ;
            If ( Exact ( this ; GetValue ( theList ; indx + 11 ) ) ;
             If ( Exact ( this ; GetValue ( theList ; indx + 12 ) ) ;
              UniqueListCountSortedFast12x ( indx + 12 ; currentCount ; theList ) ;  // run continues; keep going in a new call
              UniqueListCountSortedFast12x ( indx + 12 ; currentCount + 1 ; theList ) ) ;
             UniqueListCountSortedFast12x ( indx + 11 ; currentCount + 1 ; theList ) ) ;
            UniqueListCountSortedFast12x ( indx + 10 ; currentCount + 1 ; theList ) ) ;
           UniqueListCountSortedFast12x ( indx + 9 ; currentCount + 1 ; theList ) ) ;
          UniqueListCountSortedFast12x ( indx + 8 ; currentCount + 1 ; theList ) ) ;
         UniqueListCountSortedFast12x ( indx + 7 ; currentCount + 1 ; theList ) ) ;
        UniqueListCountSortedFast12x ( indx + 6 ; currentCount + 1 ; theList ) ) ;
       UniqueListCountSortedFast12x ( indx + 5 ; currentCount + 1 ; theList ) ) ;
      UniqueListCountSortedFast12x ( indx + 4 ; currentCount + 1 ; theList ) ) ;
     UniqueListCountSortedFast12x ( indx + 3 ; currentCount + 1 ; theList ) ) ;
    UniqueListCountSortedFast12x ( indx + 2 ; currentCount + 1 ; theList ) ) ;
   UniqueListCountSortedFast12x ( indx + 1 ; currentCount + 1 ; theList ) )

)  // end of Case
)  // end of Let


The prior versions both had problems with more than 10,000 records since there's a built-in 10k (or 50k?) iteration limit in FileMaker.

Here is a version that (in my testing) can handle a much higher number of records. I've tested with a list of 400,000 items containing over 25000 unique keys.

I use a "loop unrolling" technique so that when it's processing a run of identical keys (which happens often, since the keys are sorted, remember!) it stays inside the custom function.

This one is 12 deep (since I was processing yearly data with typically no more than 12 records per person) but you could certainly edit it to go higher as needed.

In the case where the keys don't match, the code pathway is only 1-deep, so that might be a place to optimize.

Very nice.



Both the 10K and 50K limits are real; which one applies depends on the type of recursion. You get the 50K limit with tail recursion.

I *think* that my "UniqueListCountSortedFast12x" function is using tail recursion --

Reasoning:

A) The final call on the stack doesn't need to evaluate the entire stack to calculate the result; it just returns "currentCount". But I may be confused on this point.

B) I have tested this processing 12 months of personnel data in which the number of unique records is over 25,000 and the number of non-blank records is over 250,000 -- so I'm pretty sure that if I were hitting the 10,000 limit, it wouldn't work (since it would probably fail around 12 x 10,000 = 120K records). This suggests it does make full use of the 50K recursion limit. In any case, if one needs to handle more than 50,000 x 12 records, one could increase the depth of the nested If statements...
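For illustration, here is the structural difference between the two recursion shapes, sketched in Python (Python itself doesn't eliminate tail calls; this only shows the accumulator pattern the higher 50K limit reportedly depends on):

```python
# Illustrative Python sketch of the two recursion shapes.

def count_building(values):
    """NOT tail-recursive: after the recursive call returns, this frame
    still has work to do (the "1 +"), so every frame must stay alive."""
    if not values:
        return 0
    return 1 + count_building(values[1:])

def count_accumulating(values, current_count=0):
    """Tail-recursive shape: the recursive call *is* the final result and
    carries the running total, the way UniqueListCountSortedFast12x carries
    currentCount; nothing is left to do in the calling frame."""
    if not values:
        return current_count
    return count_accumulating(values[1:], current_count + 1)
```

Both compute the same count; only the accumulator version has the shape where the deepest call can return the answer directly.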

Edited by Guest
Link to comment
Share on other sites
