The user interface design of ID numbers

jbante · December 12, 2011

I've been pondering the last couple days over how we negotiate the needs of databases and users in the structure of ID numbers — not necessarily primary keys, but not necessarily excluding primary keys, either — and I'm wondering who else thinks about these things and what conclusions they've come to? Other discussions on these forums have hashed over design opinions of primary keys based on what's good for the database — "always use meaningless serial numbers" or "just use universally unique IDs", etc. — but any conversation that begins with what's good for users gets diverted to what's good for databases. UUIDs are great for databases, but they suck for users because they are so long — 128 bits of data is an awful lot to have to transcribe if your barcode scanner isn't working, even with an alphanumeric encoding.

So what makes an ID scheme good for users? I've seen other discussions on these forums (does anyone else have handy links to them, for reference?) about one user-centric ID issue already: What's the best way to make an ID where users can identify the type of entity being identified? The majority opinion seems to be using an auto-entered serial number with a text prefix. The UUID length problem is another issue; IDs that are good for users should be as short as possible. Given N objects we need to uniquely identify, the theoretical minimum ID length is Ceiling ( Log ( N ) ) (assuming decimal numbers, Ceiling ( Log ( N ) / Log ( k ) ) for IDs encoded with a character set with k characters). How close can we get to this minimum while satisfying other helpful properties?

The last several days, I've been particularly concerned with devising numbers that facilitate helpful machine responses to human errors. Great techniques for detecting the most common human transcription errors have already been figured out. Does anyone use these in FileMaker on a regular basis? Another issue is copying an ID number correctly, but copying the wrong one — like copying a purchase order number when you wanted to find based on an invoice number. If the database has no invoice with that number, there will be an error, but saying that isn't very helpful. If there *is* an invoice with that number, the user will find the wrong invoice, and may be too busy to notice the difference. I think the interface should be able to look at the number before acting on it and say, "this field is for invoice numbers, but you just gave me a PO number." Many people address this with the text prefix mentioned earlier, but if the underlying field is a number field, it wont really make a difference. We can encode an "entity type key" as part of an ID, but that will make the key longer; it is worth it? Should such an encoding be human-decodable?

I don't want to confuse human-friendly IDs with business-friendly IDs. For example, many businesses demand that invoice numbers be strictly sequential for accounting purposes — this is a business process concern, not a human interface concern. Or is it?

What's a good ID to you?

Drew Sanderson · December 12, 2011

I use a backend with serialized numbers, front uses calculated text fields with text concatenated onto the serial numbers.Example for an invoice is that it has an I appended to the front of the serial number in the interface. The relationships and searches are based on the number for speed. The other problem with having text in your id for the physical relationship is that you have to use "==" & id in your searches unlike a number field. If you search I10 in a text based ID field you would get back records for I10, I100, I101, I102, ect. With a number field and searching for 10 you just get back 10. Jeremy, please excuse me being overly verbose here. I know you know this stuff, I just want to be clear for anyone else that stumbles in. :-)

jbante · December 13, 2011

Thanks, Drew! I had forgotten that caveat about finds on text-based IDs.

I had the thought over dinner that it might be worth enumerating some of the properties that make some ID schemes better than others. Then it would be easier to talk about which features can co-exist in the same scheme, which are balanced compromises, and which are mutually exclusive.

Human-centric

Short

Decomposable into short-term-memory-sized (3-5 digit) chunks.
Decomposable into a short-term-memory-sized number of chunks (3-5).

Format and handling detect transcription errors (check digit)

Database-centric

Unique within a certain scope

Unique among records in a table
Unique among records between tables in a centrally-managed schema
Unique among records in a distributed schema
Unique among records between schemata

Immutable: The ID cannot be revised.
Good performance

Creation speed
Storage (disk/bandwidth usage)
Query speed

Format-centric

Lowest common denominator character set: uses a character set that fits in the least-flexible expected storage or transmission medium.

Text fields: Unicode (omitting low-ASCII control characters)
URLs: Base 64
Code 39 barcode: Base 36
Humans: Base 32 (Base 36, but omitting homoglyphs "0", "O", "1", and "I")
Number fields: Base 10

Business-centric

Serial: If last ID was 1234, the next ID should be 1235.
Ordinal (weakly serial): If last ID was 1235, the next ID should be greater than 1235; 1237 is OK, but 1234 is not.
Representational: ID includes descriptive information about the entity, such as an invoice date included in an invoice number.

Security-centric

Non-serial: An outside observer can't guess the next ID based on the last one.
Non-representational: An outside observer can't use known data to build an ID, or use an ID to infer unknown data, such as using the MAC address encoded in a version 1 UUID to identify the machine that created it.

What other nice properties am I not thinking of?

Some of these properties are obviously mutually contradictory; just compare the security-centric properties with the business-centric properties. Other properties are strongly correlated. Small storage requirements and fast find times tend to go hand-in-hand.

Other pairs of properties call for compromises based on need unique to each system. Including a check digit in an ID adds a character to the length, which makes it less short; but one digit seems like such a small price to pay for the computer to be able to tell me that I mis-typed the number, rather than finding the wrong record.

One trade-off in particular seems to be interesting me lately: between short IDs and IDs that are unique across increasingly large scopes, as I started to explicate in my first post of this thread. In one table with 1,000 records, I can use 000-999 to identify records — 3 digits. In a schema with 1,000 records in each of 100 tables, I need at least Log ( 1000 * 100 ) = 5 digits. Off the top of my head, I can only think of one way to reliably achieve the theoretical shortest-possible ID: serial numbers generated by the same issuing authority. For example, customer numbers and item numbers generated in chronological order might go something like:

Customer 1234
Item 1235
Item 1236
Customer 1237
Item 1238

But coordination with a central authority is a bottleneck in concurrent systems, and may not be possible in distributed, intermittently connected systems. The solution to this I normally see is to combine an authority ID with a row ID. UPC numbers consist of a company ID plus a product ID. Version 1 UUIDs under the RFC 4122 standard are an extension of this idea, including as issuing authorities (1) the UUID scheme being used, (2) the machine that created the ID, (3) the "clock sequence" of the ID's timestamp, (4) the timestamp (to the max resolution of the machine) in which serial numbers were generated, and (5) the UUID variant being used (for compatibility with past schemes). Encoding scopeID & rowID seems like a slippery slope to longer IDs. Is there a best-practice way to do this? Is there another (more compact) way to achieve the same result?

Drew Sanderson · December 13, 2011

I like the usability of data chunks. 1234-5678-7876. It seems that could combine nicely with what you describe as business-centric representational and nearly all points on the database centric portion.

I have a complete re-write coming up in 2013. This topic will surely be on my mind.

jbante · December 14, 2011

Representational IDs can be helpful, but I think it's easy to create the possibility of duplicate IDs with naive representational schemes. If the ID format for a contact concatenates a company key with a person key with a fixed number of digits for each field — CCCC-PPPP — the fixed number of digits limits the number of unique entities you can distinguish for all "fields" except the left-most. Otherwise, you might get one ID (1234-567) conflicting with another ID (123-4567). I use a delimiter to distinguish between the two here, but I'm generally storing IDs in number fields for performance purposes; a number field wont make the same distinction. Several numeric-only delimiter schemes are possible that prevent this sort of conflict (I'm tinkering with one right now, in fact), but they generally aren't human-decodable like some fixed-field-length IDs. This doesn't preclude breaking up the resulting key into 3-5 digit chunks, it just means that the chunks don't necessarily represent anything.

UUIDs seem to be accepted despite having this limitation because the space of possible UUIDs is astronomically large compared to CCCC-PPPP — a UUID converts to a 38-digit decimal number. UPCs have a fixed number of digits reserved for different keys, but UPC issuing is much more tightly controlled than other ID schemes. UPS tracking numbers are also shorter than UUIDs, but duplicates are acceptable (and do occur) as long as a package is delivered before the ID is recycled.

Sign In

The user interface design of ID numbers

Recommended Posts

jbante

Drew Sanderson

jbante

Drew Sanderson

jbante

Create an account or sign in to comment

Create an account

Sign in

Browse

Site Support

Forums

Blogs

Marketplace

Activity

Important Information