Find-mode Symbols (=, etc.) and Text-field Index Usage

Mark Scott · December 1, 2014

Hi all,

Because I was in the middle of scripting a new “Find” UI for a project, and concerned (as always) about performance, a comment in another thread that == (exact field match) searches don’t use the index caught my attention. That set me on a course of 2 days worth of testing to see which indexes are created, and used, by different text-field search strategies/operators (i.e., =, ==, " ", *, etc.). The upshot is that I was unable to find a search method that didn’t create one or both text-field index types, and then use them to speed up subsequent searches. Here are my methods and a subset of my results; I welcome any insights folks have into how searches do or don’t use indexes, as well as spotting any flaws in my methodology or conclusions.

Methods

I created a fresh file with a single table, with about 30 text fields. All fields were initially set to “Indexing: None — Automatically create indexes as needed.” This allowed each different search method/operator to be tested against a fresh field that initially had no index. One additional field was set to “Indexing: None” (with “Automatically create indexes as needed” unchecked) and one to “Indexing: All,” to serve as no- and full-index controls, respectively. Using a looping script, I then created 1,000,000 records and set the values of all text fields to the same values: the first 999,9994 records had all fields set to “John Doe,” 3 records to “Mary Smith,” two records to “$¢€ £¥π,” and one record each to “AB_1234” and “[email protected].”

Before performing initial searches, I double-checked in Manage Database to ensure that none of the fields had indexes (except for the full-index control). I then threw a different search term at each field, in turn, in Find mode. I used a “Perform Find” script that reported the elapsed time back to me in ms, recording the baseline (unindexed) search times as I went. These baseline times represented the combined time to both create the index(es) and perform the search. In the cases reported below where both types of indexes were created, that fact was made immediately obvious by the appearance of two consecutive progress bars, lasting roughly 18 and 9 seconds, respectively, during the baseline search. In contrast, when only one index type was created, only a single progress bar appeared.

Immediately after each search, I re-checked Manage Database to confirm that an index now existed for that field, noted whether it was a “Minimal” (only Word or Value) or “All” index, and, in the former cases, unchecked “Automatically create indexes as needed,” thus locking the field to the index type initially created. Back in Browse mode, it was then possible to use Insert from Index to see whether the “Minimal” index created was a Word or Value index. (Locking in the index type, first, is necessary; otherwise, Insert from Index itself actually creates the other index type.) One interesting note: In cases where only a Word index was created, Insert from Index doesn’t actually show the index at all, but, slightly counterintuitively, displays a dialog stating that the operation could not be completed. It seems that a value index must either exist or be created on the fly, in order to reach the word index, and the latter wasn’t possible due to having deselected the “create as needed” option. When only a Value index was created, in contrast, Insert from Index shows the expected value index, and grays out the “Show individual words” option (again, due to having deselected the “create as needed” option).

After all baseline searches were run once, and the respective indexes created, I re-ran the searches, each in the same (now-indexed) field it was initially run in, noting again the elapsed time. I also ran all searches against the full- and no-index control fields.

Results

The values listed in the Value Index were “$¢€ £¥π,” “AB_1234,” “John Doe,” “Mary Smith,” and “[email protected],” as expected, while the values listed in the Word Index (as confirmed in fields where both types were created) were “1234,” “AB,” “acme.com,” “Doe,” “John,” “magoo,” “Mary,” “mr,” “smith,” and “π.” (The glyphs “$," "¢," "€," "£," and "¥" [but not "π"] are "word-breaking" characters and are not indexed in the word index. FileMaker follows the Unicode collation algorithm in determining how words break and which glyphs are treated as word-breaking characters.)

The table below shows the baseline and post-indexing search times for each method, for a representative sample of the various search strategies employed. Other search strategies or combinations thereof showed similar results. Also shown is the time for the same search run against a no-index control field. The full-index control search times are not shown, but ranged between 1 and 3 ms.

In order to check the effect, if any, of caching, I periodically closed and re-opened the database, although this did not seem to noticeably affect the post-indexing search times.

Table: Search Times by Search Method/Operator.

[Edit: table looked fine when composing message, but didn't survive posting, so I'm attaching as a zipped .txt file instead.]

Table footnotes:

* All times in ms, measured on non-hosted file with 1M records, on a MacBook Pro.

** Note that the symbols ~, $, ¢, €, £, ¥ and _ are “word-breaking” symbols that are indexed in value-, but not word-indexes, while π is a “non-breaking” symbol that is indexed in both index types.

† First search in field initially set to “Indexing: None — Automatically create indexes as needed.” Search time reflects combined time to create index and perform the search.

§ Repeat search in same field, with field options left on “Indexing: Minimal” and “Automatically create indexes as needed” unchecked, thus no additional index was created or used. For searches that created both index types on baseline search, storage options were left on “Indexing: All.”

‡ Same search in field set to “Indexing: None” with “Automatically create indexes as needed” unchecked.

ø Search result = no records found.

Conclusions

A couple tentative conclusions can be drawn. First, all search-term methods tested produced, and then used, indexes, resulting in between a 1,500- and 30,000-fold faster search. This included the whole-word operator (=), the exact field-match operator (==), the general wildcard (*), and the literal-phrase operator (“ “), as well as combinations. The post-indexing time improvements were probably a best-case scenario, as the indexes were quite small (i.e., very few distinct values, by design); real-world improvements in a similarly large table would likely be less, depending on the size of the index.

Second, the type of index (value, word, or both) created is dependent on the search term/operator, not on the values in the database. In fact, some of the terms tested (e.g., “==~~”) intentionally searched for values not present at all in the database. For example, a basic (no operator) search for the value “$¢€” (consisting only of word-breaking glyphs), created only a value index; a basic search for the value “Mary” created only a word index, while the “mixed” value “AB_1234” created both types. Similarly, searching for “$¢€ £¥π” also produced both index types (data not shown), “π” being a non-breaking value treated like other letters or numbers.

As mentioned, I’d love to hear more about other people’s experience with using symbols in searches. Do these observations match or contradict yours? Do my conclusions sound reasonable? It certainly makes me feel at ease using various symbols in a scripted search routine that may end up being run against large tables.

Mark

Table - Search Times by Search Method.rtf.zip

comment · December 1, 2014

It's been a long time since I have tested this, and it's always good to test again in order to see what optimizations have newer versions brought.

What I see (in version 11) is this:

Perform Find [ Criteria: Data::Indexed: “$searchPhrase” ]     -  1 second the first time after the file is opened;
                                                                   subsequent finds are instant.

Perform Find [ Criteria: Data::Indexed: “==$searchPhrase” ]   -  5 seconds the first time after the file is opened;
                                                                   subsequent finds take between 1 to 2 seconds.

Perform Find [ Criteria: Data::Unindexed: “==$searchPhrase” ] -  5 seconds.

Draw your own conclusions.

LaRetta · December 3, 2014

Hi Mark,

I love the fact that you are exploring issues! I'll be reviewing your information this weekend when I have a block of time to give it proper consideration. Good on you for questioning and testing!!!

Mark Scott · December 5, 2014

Thanks, LaRetta. Always appreciate your savvy input; looking forward to any insight you might add here.

And thanks, Comment, for testing in FM11. Your results contrast with mine for 13, suggesting that there may have been some optimization. Your second observation—5 s (file just opened) dropping to 2 s (subsequent searches)—sounds like caching effect, and even that latter time contrasts sharply with the 1–3 ms that I was seeing for indexed "==" searches in a million-record table. Also, the fact that your third observation of a consistent 5 s for an unindexed "==" search matches my times very closely suggests that our testing setups were reasonably comparable.

Mark

Sign In

Find-mode Symbols (=, etc.) and Text-field Index Usage

Recommended Posts

Mark Scott

comment

LaRetta

Mark Scott

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information