PatternCount() vs. Position() revisited

August 16, 2006

Since this discussion last year:

http://www.fmforums.com/forum/showtopic.php?tid/151574/

I've been under the impression that Position() is faster than PatternCount() for determining the presence of a string within a larger source text field. Intuitively this makes sense: PatternCount() would need to check the entire source text field to count every occurrence, while Position() only needs to find the first one.

But when I tried setting up a test of this, I was unable to notice a difference between them.

In both FM6 and FM8, my tests show equally fast evaluations of a Set Field[] step that populates a field with the Position() or PatternCount() of the search string within the text field.

In FM8, the text field can hold a huge amount of text. I tried my test with 50,000 words, 100,000 words, and 200,000 words, and saw no difference in the time elapsed (about a second for each).

In FM6, the limit for a text field is much smaller (I got about 11,000 words in there before it maxed out). In this test, the results for both Position() and PatternCount() were nearly instantaneous (less than a second).

I'm guessing that if we were to build custom functions that scan through the text to behave like the Position() and PatternCount() functions, the theoretical speed difference would come into play. But why not here? Are my tests flawed? Is the difference noticeable in different versions or on different OSs? Or are these functions somehow optimized with some internal search algorithm that makes them fast enough to work with large text fields?
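To make the theoretical difference concrete, here is a minimal sketch in Python (not FileMaker) of two hand-written scanning functions. The names `position` and `pattern_count` are hypothetical stand-ins, and the sketch makes simplifying assumptions: matching is case-sensitive and matches are counted without overlap. It says nothing about how FileMaker implements the real functions internally, which this thread does not establish.

```python
def position(text, search):
    """Return the 1-based index of the first match, or 0 if there is none.

    The loop returns as soon as one match is found, so an early match
    means most of the text is never examined.
    """
    n, m = len(text), len(search)
    if m == 0:
        return 0
    for i in range(n - m + 1):
        if text[i:i + m] == search:
            return i + 1  # early exit on the first hit
    return 0


def pattern_count(text, search):
    """Count non-overlapping matches (an assumption of this sketch).

    The loop always runs to the end of the text, because later
    matches cannot be ruled out until everything has been scanned.
    """
    n, m = len(text), len(search)
    if m == 0:
        return 0
    count = 0
    i = 0
    while i <= n - m:
        if text[i:i + m] == search:
            count += 1
            i += m  # skip past this match
        else:
            i += 1
    return count
```

In a sketch like this, the early exit only helps when the search string actually occurs, and occurs near the start. If the string is absent or sits at the very end, both functions walk the whole text and should cost about the same.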


Attached is my test file.

WordSearch.fp5.zip

August 16, 2006

Hi Mike,

While on 7.v2, I ran a test comparing them, but the test was unstructured. I used a field with 10,000 words, and I ran it against a copy of my LineItems (500,000 records). The difference, if I recall, was approximately 10 seconds. Not much really, but 10 seconds is a long time to a User when they are waiting for the system.

How many records did you use for your test?

August 16, 2006

I arrived at the same conclusion when I tested this a couple of months ago. I have not made this public yet because it's part of a larger test, and the other part is not finished yet. Now you had to go and let half of the cat out of the bag...

August 16, 2006

I think that counts as a cat still in the bag.

Phil

August 16, 2006

I think that depends on whether you want to consider the bag as half-empty or half-full.

In any case, I am consulting Mr. Schrödinger regarding the health of the cat.

August 17, 2006

Author

From LaRetta's description, it sounded like her tests showed differences when looping through a large record set. I don't know how often this sort of thing happens in a real solution (it doesn't seem very efficient), but based on this I'm in the process of running these and other tests while looping through a large record set.

I think I see why comment hadn't posted this yet. There seem to be a lot of other things that have a greater effect on performance than merely whether the test is Position() or PatternCount(). Although it's turning into more of a can of worms than half a cat.

I should have something more definitive in a day or two.

August 17, 2006

There seem to be a lot of other things that have a greater effect on performance than merely whether the test is Position() or PatternCount().

So true, Mike. That is why the tests must be identical, each using a backup of the same file. I even reboot my system between the two tests, because I've run tests using the same open file and skewed the second result because of system resources, FM indexing on the same file, etc. The only observation I could make in my ONE test is that the larger the record set, the more obvious the difference between the two. ONE test does NOT a theory make, but it SUGGESTS there is a difference, although slight. Since FM can't measure nanoseconds, only large record sets can display these differences in countable seconds.
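The idea of repeating one tiny operation across a large record set until the total becomes measurable can be sketched outside FileMaker. The following Python example is only an illustration under stated assumptions: `build_records`, `time_over_records`, `find_first`, and `count_all` are hypothetical names, the records are synthetic filler text, and Python's built-in `str.find` and `str.count` stand in for Position() and PatternCount(). Timings from it say nothing about FileMaker itself.

```python
import time


def build_records(count, words_per_record, needle_at_start):
    """Build `count` identical synthetic records with one search hit each."""
    filler = "lorem " * words_per_record
    if needle_at_start:
        return ["needle " + filler] * count
    return [filler + "needle"] * count


def find_first(text, search):
    """Position()-like stand-in: 1-based index of the first hit, 0 if absent."""
    return text.find(search) + 1


def count_all(text, search):
    """PatternCount()-like stand-in: number of non-overlapping hits."""
    return text.count(search)


def time_over_records(fn, records, search):
    """Return the total seconds spent calling fn once per record."""
    start = time.perf_counter()
    for text in records:
        fn(text, search)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Same data for both functions, so only the function under test differs.
    records = build_records(5000, 10000, needle_at_start=True)
    for fn in (find_first, count_all):
        elapsed = time_over_records(fn, records, "needle")
        print(fn.__name__, round(elapsed, 3), "seconds")
```

Running each function against the same freshly built data mirrors the point about keeping the two tests identical. Swapping `needle_at_start` to `False` puts the hit at the end of each record, which is the case where an early exit cannot help.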