Jump to content

Need to generate short unique alphanumeric hash from text string


This topic is 2595 days old. Please don't post here. Open a new topic instead.

Recommended Posts

I have a library of ~25,000 photographs, each in a (linked) container, 1 photo per record.

 

The photos have standardized ID tags of "Site Abbreviation" + "Timestamp". These text tags are unique because photos are never taken less than 1 second apart:

 

215MAIN-2014-05-02_13-24-52

152JBAR-2013-01-02_02-21-13

152JBAR-2013-01-02_02-21-12

 

I would like to generate the shortest-possible alphanumeric unique reference code for these photos using a custom function.

I plan to embed (watermark) the code into each photo so that someone can easily locate any particular photo in the database by doing a Find on the unique ref. code.

 

My ideal output would look something like a TinyURL, but with fixed length (these are made-up random examples):

 

215MAIN-2014-05-02_13-24-52  -->  F59PL2

215MAIN-2014-05-02_13-21-13  -->  RK51JF

152JBAR-2013-01-02_02-21-13  -->  S21CW9

 

Can anyone point me to existing functions that do this, or perhaps give some examples of something that would work?

 

I have investigated available hash functions (SHA / MD5,) these are too long. Adler 16 or 32 CRC is closer to what I want but still too long because it produces only digits. I want to incorporate alphabetics for compactness.

 

Thanks for your help.

Link to comment
Share on other sites

You could take an MD5 hash and truncate it to get something short enough. Bear in mind that the shorter a piece you bite off, the more likely it is you'll create a duplicate at some point, but this will be true no matter what kind of hash you use.

 

But the result you get from an MD5 hash is still hexadecimal, which does not use as large an alphabet as you might like. You could convert this to a number, then re-encode it with a larger alphabet, like all decimal digits and capital roman letters excluding characters that could be easily confused with each other (homoglyphs):
 

Let ( [
_input = "215MAIN-2014-05-02_13-24-52" ;
_md5 = GetContainerAttribute ( _input ; "MD5" ) ;
_number = NumberFromRadix ( _md5 ; "0123456789ABCDEF" ) ;
_text = RadixFromNumber ( _number ; "23456789ABCDEFGHJKLMNPQRSTWXYZ" )
] ;
Left ( _text ; 6 )
)
 
Which would give you:
 
215MAIN-2014-05-02_13-24-52 -> KC6NS4
215MAIN-2014-05-02_13-21-13 -> EYR52A
152JBAR-2013-01-02_02-21-13 -> GN7D9G
 
If these are codes that have to be human-enterable, it might be worthwhile to chunk it down into a group of letters and a group of numbers, like is the default with many car license plates (and for the same reason).
Link to comment
Share on other sites

Left ( _text ; 6 )

 

To be on the safe side, you should probably increase the alphabet to 31 characters instead of the current 30, or increase the length to 7. The capacity of 6 characters out of 30 is less than 20% compared to the 32-characters long hexadecimal MD5.

Link to comment
Share on other sites

To be on the safe side, you should probably increase the alphabet to 31 characters instead of the current 30, or increase the length to 7. The capacity of 6 characters out of 30 is less than 20% compared to the 32-characters long hexadecimal MD5.

 

Yes, I agree.

 

In case anyone seeing this thread is confused by what Comment is referring to here: The MD5 result is a 32 character string in hexadecimal (base 16), so it could be any of 16^32 different possible values. A 6-character string from an alphabet of 30 characters could be any of 30^6 possible values. 30^6 / 16^32 = 2.14e-30, WAY less than 20%, so it can't handle nearly as large a range of unique values. Adding a letter to the alphabet would give us 31^6 possible values (22% more), or adding a character would give us 30^7 (30 times more). 30^7 is still 28 orders of magnitude smaller than 16^32, but it's one way to expand the value space if necessary.

 

(The calculations in this were previously very far off due to a lexicographic error pointed out by Gilbert. Now corrected.)

Edited by jbante
Link to comment
Share on other sites

BTW, I don't see why you need to exclude all homoglyphs from the alphabet. If concerned about possible confusion between "0" and "O", you could still allow zero and substitute any entered "O". 

 

Also, are "U" and "V" really homoglyphs? I mean, outside the Roman empire?  :smile:

Link to comment
Share on other sites

BTW, I don't see why you need to exclude all homoglyphs from the alphabet. If concerned about possible confusion between "0" and "O", you could still allow zero and substitute any entered "O". 

 

Also, are "U" and "V" really homoglyphs? I mean, outside the Roman empire?  :smile:

 

Better safe than sorry. What if the codes are being used in some context outside of Filemaker where there is no automatic checking or substitution. For example, if they are being copied to paper by hand, or searched for in an Excel spreadsheet.

Link to comment
Share on other sites

Yes, I agree.

In case anyone seeing this thread is confused by what Comment is referring to here: The MD5 result is a 32 character string in hexadecimal (base 16), so it could be any of 32^16 different possible values. A 6-character string from an alphabet of 30 characters could be any of 6^30 possible values. 6^30 / 32^16 = .183, so it can't handle nearly as large a range of unique values. Adding a letter to the alphabet would give us 6^31 / 32^16 = 1.097, or adding a character would give us 7^30 / 32^16 = 18.6 — since both of these results are bigger than one, either of these changes would be enough for the shortened hash to be just as unique (more, even!) as the full 32 character MD5.

Excuse me, I'm not a math or stats whiz, but these calculations don't ring true for me --

Isn't a the number of possible permutations of a 6-character string (with an alphabet of 30 characters) 30^6 = 729,000,000 -- not 6^30. :question:

This makes more sense to me. For example if it was a 2-character string, the number of possible combinations would be 30^2 = 900 --- not 2^30.

If my revised math is correct, 729 million is a small subset of the possible MD5 hash space, so I would be smart to check for duplicates before accepting the value. (I currently have ~25,000 photos, but I could see having 1,000,000 photos some day, resulting in the chance of a collision gradually increasing towards more than 1 in 1000.

---Edit--

I tested this with a string output length of 5 characters, and in a set of 4,300 records I got 1 duplicate / collision, a roughly 1 in 10,000 probability. (4300 / 30^5)= .00018. So, I will definitely be using 6 characters starting now, with dupe checking, & structuring for the possibility that the length may need to go to 7+ characters if my library gets very large.

Edited by Gilbert Osmond
Link to comment
Share on other sites

Excuse me, I'm not a math or stats whiz, but these calculations don't ring true for me --

Isn't a the number of possible permutations of a 6-character string (with an alphabet of 30 characters) 30^6 = 729,000,000 -- not 6^30. :question:

This makes more sense to me. For example if it was a 2-character string, the number of possible combinations would be 30^2 = 900 --- not 2^30.

 

Very right. My mistake, and a very embarrassing one considering that I majored in math! Information theory (the branch of math that is perhaps most relevant to this sort of thing) is not my specialty. I'll fix my numbers...

Link to comment
Share on other sites

Isn't a the number of possible permutations of a 6-character string (with an alphabet of 30 characters) 30^6 = 729,000,000

 

Alas, you are right about that. Apparently, Jeremy and I have managed to confuse each other in turn.

 

If we disengage for a moment from the technique, and look at the essence, your current format "152JBAR-2013-01-02_02-21-12" could be expressed as "###@@@@##############" or "4 alpha characters and 17 digits". Assuming the 4 alpha characters come from an alphabet of 26, you have a capacity of 26^4 * 10^17 possible values. To achieve a similar capacity with an alphabet of 36(!) characters, you would need a string of at least 15 characters.

 

Perhaps we should ask this in another way: how many sites are there and how many dates?

Link to comment
Share on other sites

Perhaps we should ask this in another way: how many sites are there and how many dates?

 

On second thought, if we know that you aim at having ~10^6 items, then 32^4 would suit you (and so would 30^5). All you need to do is assign a serial number to each item (which you should do anyway) and convert it to base 32 (or higher). There is no random factor in effect here, and no danger of collision.

Link to comment
Share on other sites

This topic is 2595 days old. Please don't post here. Open a new topic instead.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.