Address matching - scoring the match to existing addresses

Adam S · June 21, 2009

Hi, all. I’m hoping someone’s able to see through this puzzle as I’m way too close to this to see the forest anymore and I may be missing something semi-obvious. Anyone have a solution or smarter way to set this up, please? (Or a reference to an existing thread I’ve missed might be helpful.)

The scenario: An order comes in from a repeat customer, which has already been matched to an existing customer record. Each order has a payer and ship-to address. We would like FM to recognize if any existing records in the addresses table match either of the addresses (payer or ship-to) received with the new order. This will let FM suggest we avoid transferring any exact duplicate addresses and, more importantly, we'd like to highlight any close matches (with a score or description of how close they are) so the user can consolidate or update any near-match addresses.

Tables:

order_data Has raw order data that’s come in, including the payer’s address and the ship-to address.

addresses Stores addresses created when orders arrive. Addresses get assigned an “address_id”

customers One record for each customer, with a unique “contact_id”

Relationships:

1. order_data to addresses, bridged by contact_id

2. addresses to order_data, bridged by contact_id

3. Several others and self-joins that haven't quite yet worked as planned.

On the order_data repeat customer layout, there’s a portal showing all addresses that match this order’s contact_id. For each portal row showing the customer's previous addresses, we'd like a score or flag on each of those showing how well each matches the payer’s address for the new order and how well each existing address matches the ship-to address. We've got a matching formula in mind so the question is where to put it and how to display the results.

This solution may have to do with how the one-to-many relationship(s) need to be set up or tunneled, including how the calc fields should work to be indexable. It may also be solved by which table holds the the calc. field(s) and in what context one is viewing the portal.

We’ve tried a few configurations and come up with either an “index missing” error or something that shows the match score from each address to the first matching order_data record for that customer instead of the one being browsed (a failure of the portal context, I believe, as the result is within a portal based solely on customer_id match).

I’m hoping to build this without a script If possible, based on one or two relationships involving calc fields. I’ve tried a test relationship based on match both contact_id and first names and that’s not working unless we use multiple relationships and portals for each part of the match.

It seems likely that the calculation field holding the match formula, i.e. how well does an existing address correspond to one of the new addresses, will be created in the addresses table rather than the order_data table and then be viewed within the context of the portal in order_data. Somehow.

Several attempts so far have failed due to context in one way or another.

I've seen a few closely related posts but haven't quite found the answer. Thoughts, please?

Vaughan · June 21, 2009

What's the aim of this system: to speed data entry, or to reduce duplicate addresses and keep the data cleaner?

I ask because generally anything that reduces the chance of duplicates involves the user having to pick through lists of possible-duplicates, and this almost always SLOWS data entry.

"This will let FM suggest we avoid transferring any exact duplicate addresses and, more importantly, we'd like to highlight any close matches (with a score or description of how close they are) so the user can consolidate or update any near-match addresses."

I'm not sure what you mean by "transferring any exact duplicate addresses". Transferring to where? What will happen if exact duplicate addresses do get transferred?

Weigh the cost of "transferring any exact duplicate addresses" with the reduction in data entry speed (and the cost of development).

Adam S · June 22, 2009

Vaughan-

Input appreciated.

The approach is semi-automated with an intentional human eye on it. Raw data comes in, gets cleaned in various ways by script and auto-enter values, and is then transferred to to other tables holding final customer and order info.

We put a bit of time into cleaning and checking addresses the first time and these will eventually become customer-manageable accounts, so it's important to avoid duplicate (or nearly duplicate) addresses. The hope is also to point the user to any subtle differences, such as a new apartment number or a cell phone number, so we can enhance the information we have.

It's not a mass data entry process with thousands of orders arriving each day where pausing would be too big a bottleneck, though hopefully some day we'll have that challenge!

Off Topic Sidenote: Saw that you're in Sydney and we're watching The Dish (the film about the radio telescope in Australia), which we paused so I could reply. Fitting.

Vaughan · June 22, 2009

What's interesting is that you're relating the addresses to the order. It's more common to relate addresses to the customer.

I cannot see any advantages in relating addresses to orders instead of customers: in fact I can only see problems which I think you're already encountering.

Hopefully LaRetta will take the baton from me here: she has a lot of expertise in systems like these not just in terms of development but also in terms of business processes and work flow. (Others do too.)

The Dish is at Parkes is several hours drive west of Sydney in the middle of nowhere. Nobody does "nowhere" better than Australia, we're full of it. Unlike other places, all of Australia except the coast is dry. Over 80% of our population live on the coast.

If you want to see Sydney, watch the original Matrix movie again. That was shot here. Nothing internationally identifiable like the Bridge or Opera House, more subtle things like the fountain at Martin Place (woman in the red dress scene) and the railway bridges near Central (when Neo gets picked up in the car and has the bug removed).

Adam S · June 22, 2009

Vaughn- The addresses are also related to the customers table and, once all is transferred, that is indeed the main relationship used. The order_data table just serves as a temporary data location and reference for later so we can see how an order was intended in case there are questions. During order transfer the raw data are turned into real customer, addresses, order, and order item info. Guess we could relate order_items to customers then to addresses; I'm not sure if there's an advantage to that tunnel but if anyone sees a plus to that, we'd be glad to know.

Fun film trivia; thanks!

LaRetta · June 22, 2009

Guess we could relate order_items to customers then to addresses; I'm not sure if there's an advantage to that tunnel but if anyone sees a plus to that, we'd be glad to know.

I haven't had chance to follow this thread so take what I say with that in mind but, in general, there is nothing wrong with accessing addresses from the Order_Items table (through Customers). The relationship model of vs. 7 and greater allows grabbing grand-children and even great-great-grand-parents through a relationship without the need to tunnel the data like the old days. That's its beauty.

For instance, if you have a relationship as: Order_Items::CustomerID -> Customers::CustomerID and then a relationship from Customers::CustomerID -> Addresses::CustomerID ... then you can place a portal of Addresses directly on your Order_Items layout, because all of that Customer's addresses ARE related to that specific Order_Items record.

Adam S · June 22, 2009

Thanks, LaRetta. That might even have simplified another step in this process.

Can I gently nudge this topic back to the original question and, if you or others have a moment, would you please chime in on that?

LaRetta · June 22, 2009

I would be happy to look but I have little time right now. If you can attach a demo file with the relevant fields (and just a few entries of sample data), I'll take a look. You don't need to actually create the relationship you need because, after all, that's why you are posting. But a good 'base' from which to work will save a lot of time and questions. :wink2:

LaRetta · June 22, 2009

order_data Has raw order data that’s come in, including the payer’s address and the ship-to address.

Specifically and utmost ... I need to know whether each portion of the address (payer AND ship-to) are separate fields or one field as multiline. If you need to compare these fields(?) to the Addresses table, I need to see what you are starting with.

Edited June 22, 2009 by Guest

Adam S · June 22, 2009

I'll try and switch gears to see if I can create a thinned sample file; not sure how that will go, but will try ASAP.

In the meantime.... The transfer holding fields for address are broken into separate component fields (first name, middle, last, street_address_1, street_address_2, etc.). There are two addresses received with each order, one for payer and one for ship-to. If they're deemed new by the user, they're created as real address records (in our addresses table) during the order transfer process. A typical field name would be order_data::transfer_shipto_address_street2, which gets transferred (and compared) to addresses::address_street2.

We considered a calc. field that concatenates the entire address and compares them, building a relationship just on that. It would only check for a precise match though and would miss opportunities to match those that are substantially the same (e.g. apt. number on line one vs. line 2 of an address) and to point the user toward differences to be checked.

Please let me know if there are other questions in the meantime. Thanks in advance.

Adam S · June 23, 2009

As requested, here's an attached sample (zipped) of three files with the relevant, related tables. They should all open with guest access. LaRetta and others, please let me know if there are any questions on these. Thanks!

Relationship_Customer_Match_Example.zip

LaRetta · June 23, 2009

I do not understand why your Addresses table also includes the Contact information and why your Contacts table also includes the address information. How far along are you in this design? Because I would suggest major changes to eliminate the redundancy (which could never been kept sychronized).

I am unsure if I can work in such a design; the best I could do is create a new file (with each table within) and show you a generic, general (and much simpler) approach to view the dups. But that will take me few days to create because I don't have a lot of time right now.

I couldn't even cut to the 'similars' issue in addresses because I can't see through (please forgive my phrase but) all the unnecessary complexity of it. I've worked a lot with online orders coming through without proper identification of customer, address etc and requiring matchup and transfer of this data into the various tables. Would it be beneficial to you for me to take the time to create an example? If you can't consider changing what you have now (because you are using it) then I won't bother but I have an aversion to offering duct tape for dam repair. I hope you understand.

And maybe others are willing to jump aboard as well. :wink2:

Adam S · June 23, 2009

LaRetta-

Hopefully it's workable but here's why you're seeing what you're seeing. This solution started with the Customers table storing addresses directly in the customer record where it shouldn't have (with two sets of address fields, buyer and shipping); the ShippingDest address fields there are legacy and can be ignored, please, as I forgot to remove them before posting. This originally matched a setup created in another database and this is, at last, our process of creating a proper, relational approach.

The addresses table intentionally contains the name associated with a given address. For a billing address someone may go by a legal name but for shipping they often choose a different name. Same for gift addresses, so if it's to a friend, there's a different name. We are, for the moment, also including the phone #s received for that address in that table until we sort out how best to track them; we often need the number tied to a given address, such as the gift recipient's number to provide to FedEx. If you've got a better suggestion on that aspect we'd love to hear it, but hopefully it won't get in the way of seeing something for the address matching question.

For address matching, the goal is to focus on seeing how well the order_data transfer fields match existing records in the addresses table, such as how well the field order_data::transfer_buyer_address_street matches any existing address_street entries in addresses. If this can be done via a calc field, our question is which table should contain it, how to build the relationship, and if it can be retrieved and displayed inside a portal within order_data showing all addresses associated with a given customer.

That make sense?

Thank you.

BruceJ · June 24, 2009

I do not understand why your Addresses table also includes the Contact information and why your Contacts table also includes the address information. How far along are you in this design? Because I would suggest major changes to eliminate the redundancy (which could never been kept sychronized).

It might be necessary to have two sources of addresses.. if a customers address changed then it would change any historical info on old orders, unless of course you also contained shipping addresses right in the order file that just looked-up the customer's address at time of shipping.

I think this is getting more complicated than it has to be.

I think the original question was how to see close or near match addresses to avoid duplication. To do this, I'd use some calc fields and relationships. Make a calc field that strips out any numbers or spaces in the street address in both the address file and invoice file. Then create a relationship that matches up these calc fields plus a zip code field. When your user enters the address, the can use a portal to see if any near matches come in and then choose one if its a match or click a button to "make new" or "update".. well that's pretty crude description, but you may get the idea I'm shooting for.

There's a lot of ways to skin a cat... but I'd do away with the separate address file and keep it all in the contacts file. I'd also keep addresses in the Orders file. Then I'd enter the address in the order.. use the portal for near matches... or... use a similar method to find the customer from the order file, use a lookup for the shipping address in the order.. and then an "update" button if the address that's pulled in needs to be updated to the customer file.

Actually ranking degrees of matching can be done, but that's a bit more sophisticated. It can be done by creating very lengthy matching calc fields and yours would be a bit more complicated unless you used separate fields for HouseNumber and StreetName... hey... that give me an idea.. that might make it quite a bit easier, it's just not a natural way of entering data, but your users will get used to it in about 3 days!

Edited June 24, 2009 by Guest

LaRetta · June 24, 2009

It might be necessary to have two sources of addresses.. if a customers address changed then it would change any historical info on old orders, unless of course you also contained shipping addresses right in the order file that just looked-up the customer's address at time of shipping.

No disagreement at all and thanks for bringing it up. :smile2:

But there are addresses in the Orders (makes sense for history purposes), addresses in the addresses table AND addresses in the Contacts table. Keeping old addresses is not a problem ... they should plant into the order OR remain in the addresses table and marked Inactive and dated but not all three locations! And it was explained to me that they understood the current redundancy.

Regardless, I was willing to do as you suggest, Bruce but I had no idea (until their last post) which addresses should be matched to which! Can you see my confusion, not to mention my overwhelming need to warn the OP that they have normalization issues?

Since you've already explained the gist of how such comparison would take place, maybe you'd create some samples in their file since I couldn't get to it for a few days yet. It would be great if you could! If not then maybe others can assist as well. Otherwise, I'll get to it when I can. :wink2:

LaRetta · June 24, 2009

There's a lot of ways to skin a cat... but I'd do away with the separate address file and keep it all in the contacts file.

Oh. I STRONGLY disagree here. What if a new address type is added? C'mon, Bruce, you can't mean this!

Adam S · June 24, 2009

1. Thanks to Bruce for helping us focus a bit on the orig. question. Other input is much appreciated, though.

2. I'm not the world's normalization expert, but the addresses really do belong in a separate table when going beyond a basic contact list. We began with a customer's main address (often billing) and their one alternative as part of the customers table. That works until they want to ship to a work address (now we're at three sets of address fields), send a gift to their sister (now we're a four), etc. The table size and efficiency starts to diminish. I so wish we'd started with a separate addresses table in the first place.

3. Back to the relationship q. Bruce, your idea sounds good though I think it hits at our original snag, which is how to show a score (whether textual or numeric) of how well an address matches. We're able to set this up showing only 100% matches (trimmed and such), or having separate relationships for each aspect (first name matches, full name matches). We have a draft calc. field with a decent placeholder matching formula and cases ready to go (it's in address_match_rating_payer in order_data if you want to see it, and eventually we'll add a bit more heuristics).

During order transfer, there does seem to be genuine need to see both FM's ideas of close matches as well as all addresses connected to that customer. The heuristics will never be perfect and the user should see all addresses to avoid duplicates and also allow us to add a new info, such as a new phone number, arriving with the order.

My hope is to include this directly in the portal relating to all customer addresses. It might, though, require a separate relationship and I think it might bring up the portal-within-a-portal impossibility. Hopefully there's a workaround, please?

If it can't go in the portal, maybe a (repeating?) field goes next to each of two the new addresses that identifies matching address_ids with their match "score." It seems less useful than putting this info in the portal with each of the existing addresses, but it's a possibility. Thoughts appreciated whenever someone has a moment. Thank you.

Adam S · July 7, 2009

Hoping someone might have a few minutes to help with this question now that July 4th is over perhaps? We're still stuck on this and would like to get the new system implemented before we train some new staff. Thanks!

comment · July 7, 2009

I am afraid that after all this your question is still not clear. IIUC, you have a structure of:

Customers -< Orders -< Addresses

so it should be quite easy to have a portal of customer's previous addresses on a layout of Orders.

Then comes the question of evaluating "how well each matches the payer’s address for the new order and how well each existing address matches the ship-to address".

You have been very vague about the nature of this comparison, but obviously such calculation must be performed on each address individually, so it must reside in the Addresses table.

Also obviously, the calculation must compare each address to a GIVEN address - and a record in Addresses cannot know which address is the given one, unless this is indicated in a global field/variable.

It follows then that the result of such calculation cannot be stored, and cannot be used as a matchfield for filtering the relationship to Addresses (at least not easily) - though it can be shown in the portal as a flag or a score.

Perhaps a scripted processing of each new order might be more suitable here.

Adam S · July 8, 2009

Thanks, Comment.

Our structure is approx. what you described, being Customers -< Order_Data -< Addresses. We have the portal set up showing any previous addresses that match the customer for a given, new order.

It sounds like the only way to go so far, barring someone's magical idea, is as you described, having a script set a global in the addresses table to say "link to this specific order_data record now," base a relationship on that global, then have the comparison formula in addresses refer to that relationship. We have this set up (and working) using the order number field in order_data as the foreign key:

addresses

g_order_number_external_to_relate (set by script)

maps to

order_data

order_number_external (unique)

The scoring calc field can now compare each address component in fields in addresses to the buyer and ship-to addresses in order_data and build a score.

This approach needs the global g_order_number_external_to_relate in addresses to be reset each time an order is reviewed, so it knows which addresses to compare. Without script triggers (we're running this in 8.5), the user will have to hit a button to set these global values; even with script triggers, it seems like extra resource use to run a script each time a field is entered, but I digress. It turns out not to be the biggest deal for this use since our staff are usually transferring addresses as part of a larger script, so we can set the global there when it gets to each order. Ideally, though, it would be more useful for this and other applications to use FM's awareness of which record the user is browsing (within order_data) to guide the resulting address score in each portal row.

The reason we thought this might be doable without setting globals was that

a) we only care about this value in the context of the order_data table (so we don't need an address match calc to have a valid value when browsing addresses, but only when browsing through the order_data entries)

it seems the order_data table could already be aware of which record we want compared because it knows which record is being browsed, and it has a portal showing candidate addresses so it knows which we'd like to score. It just seems like all the information is sitting right there and FM would have some way to piece this together.

I was hoping the calc field in addresses could somehow, in context of the portal, realize they should be calculating relative to the current order_data's record (even if they were blank or not useful if browsing addresses itself). We were also hoping to do it without scripts so they'd be dynamic as we moved records with no extra step, but I'm not sure that's possible.

BTW, I was not intending to be vague on how the addresses are scored, just hoping to be clear that our question was not about address matching heuristics; that's a whole other rabbit hole!

If anyone else (LaRetta, guessing you're still swamped, yes?) has a suggestion on how to do this absent setting a global each time one enters a record, it would be great to hear. Otherwise this might be one of those "can't be done" kinds of things.

Thank you, all.

comment · July 8, 2009

There's no way a global field will change its value in response to user browsing. Even an unstored calculation field in the child table cannot know which record is current in the parent table.

At best, you might get away with "publishing" a variable in the parent layout and letting a calculation field in the child "subscribe" to it. I am saying "might" because I don't know exactly what your relationships are and I cannot be sure you won't run into refresh issues.

Other options are restricting users to scripted navigation or using a script-triggering plugin.

Adam S · July 8, 2009

There's no way a global field will change its value in response to user browsing.

Right. I was hopeful there was some elegant way to solve the puzzle without using a global in this way. So far I can't see a way for a portal to hold calculated values that will dynamically update in this way if those values hinge on the record being browsed in the parent table.

Sign In

Address matching - scoring the match to existing addresses

Recommended Posts

Adam S

Vaughan

Adam S

Vaughan

Adam S

LaRetta

Adam S

LaRetta

LaRetta

Adam S

Adam S

LaRetta

Adam S

BruceJ

LaRetta

LaRetta

Adam S

Adam S

comment

Adam S

comment

Adam S

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information