Best Matches for a set of multiple fields

January 5, 201610 yr

From a record in Table A that includes 9 fields, I need to identify the records in Table B (of hundreds of records and include 9 corresponding fields) I can easily define a relationship that retrieves the records in Table B where the contents of the 9 fields exactly match, 9 out of 9, between the Table A query record and Table B records.

How do I include the best, but not perfect matches, e.g., 8 out of 9 of the fields exactly match.

I can script through all of the table B records and set 9 additional temporary fields with a 1 for exact matches and 0 for mismatches, but that seems cumbersome.

I'm using Pro 12, but if 14 had some features that would make this easier, that would be fine. And just to be clear, I'm not needing to handle any mismatches between the corresponding fields. The corresponding fields either match exactly, or they don't. In fact, the fields could be boolean to start with and just contain a 1 or 0 each.

What is the recommended way to determine the best match records involving a set of multiple fields?

January 5, 201610 yr

7 hours ago, bac mac said:

How do I include the best, but not perfect matches, e.g., 8 out of 9 of the fields exactly match.

I would be more comfortable answering this question if I knew what's this about. When a question is entirely about technique, one always wonders if there might not be a better solution to the underlying problem.

That said, you could use a relationship based on a calculated multikey, where each sub-key would have a wildcard for one of the values. To calculate the multikey, it would be convenient to use a repeating calculation field (on both sides of the relationship) =

Let ( [
multikey = Extend ( field1 ) & ¶ & Extend ( field2 ) & ¶ & Extend ( field3 ) ... & ¶ & Extend ( field9 ) ;
i = Get ( CalculationRepetitionNumber ) ; 
n = ValueCount (multikey ) ;
wildcard = "*¶" ;
partialkey = LeftValues ( multikey ; i -1 ) & wildcard & RightValues ( multikey ; n - i )
] ;
Substitute ( partialkey ; ¶ ; " | " )
)

Set the calculation to return Text result in 9 repetitions.

1

January 6, 201610 yr

Author

That is a very concise calculation that would do well identifying records with one mismatch and allow for that mismatch to be in any of the 9 fields. Would it also go beyond that and help evaluate less equivalent matches?

The context for this involves a table (B) of hundreds to thousands (depending on which data set we use) each representing a subject type with 0 to 9 different disabilities, e.g., inability to walk without assistance, inability to bathe without assistance, etc. For each of these subject types, we have a record that indicates the status with respect to each problem. Each problem is given a field and a fully functional subject type would have able, or 1 in each of these 9 fields. A problem with one or more of these activities would have unable or 0 as the value for the field. (Actually, right now, it uses text- able/unable, but I'm considering switching it to 1/0- thoughts on that with respect to performance?)

Then, in a separate table (A), we generate a new record with a test subject and populate the 9 activity fields in this table with the status for each of these activities. We can identify table B records that exactly match the state of the 9 activity fields in the table A record (easy via a single relationship matching each of the 9 fields between the tables). The question involves how best to identify the records in table B that include a mismatch and sorting them in order from best to worst match.

What do you think?

January 6, 201610 yr

5 hours ago, bac mac said:

The question involves how best to identify the records in table B that include a mismatch and sorting them in order from best to worst match.

Actually, these are two separate questions.

With regard to the first question - how to identify the relevant records - we need more constraints. After all, ALL records in the "other' table are mismatched to some degree. I believe the first constraint needs to be in the direction of the mismatch. If a subject with say 4 disabilities fits perfectly a profile with the same 4 disabilities, then the next best match would be either a profile with any 3 out of those 4 disabilities, or a profile with exactly the same 4 disabilities, plus one additional disability. It doesn't seem reasonable (or useful) to include both. I would also expect to have some limit to the number of allowed mismatches.

Before getting to the question of sorting, I would ask how exactly will you use the result. It's not possible to have the lists sorted permanently, because the sort order of the child records depends on the parent record (I am using relationship terms here, but the same would apply if using a find). You must select a parent record before you can sort the children - which means that the sort must be by an unstored calculation.

Edited January 6, 201610 yr by comment

January 6, 201610 yr

Author

Usually, we only need to look at the exact matches because the most common test subject, Table A record, will match records in Table B that have sufficient cases for statistical significance. Failing that, we need to look further, e.g., include matches where there is one additional or one less disability- both cases would have a score of 8/9 compared to 9/9 for a perfect match. And yes, it will be entirely dependent on the test subject. We will include a limit on the mismatches and it may be 8/9 or possibly lower.

January 6, 201610 yr

26 minutes ago, bac mac said:

We will include a limit on the mismatches and it may be 8/9 or possibly lower.

Well, you have the answer for 8/9. The thing is, it only takes 9 repetitions to calculate the key for 8/9 - but you would need 36 repetitions for 7/9 and no less than 84 repetitions for 6/9. Not to mention the increased complexity of the calculation formula. Still, the advantage here is that the keys can be stored, so this would be very fast.

Another possibility might be to store the subject's disabilities as a return-separated list in a global field or variable. Then have each profile in the other table calculate the number of matches against its own list, using ValueCount (FilterValues(...)). This, together with a count of the profile's disabilities, would give you some kind of a match index: for example, if the subject has 5 disabilities, and the profile has 4 matches out of a total of 7, than the score is 6/9.