Extremely Large Numeric Bases with Unicode

Previously, we discussed using Punycode for non-ASCII domain names with internationalized URLs, e.g., https://去.cc/叼

I would like to use this approach to create a URL Shortening Service, where we can create shortened URLs that use UTF-8 characters in addition to the normal ASCII characters.

Most URL shortening services use a base-62 alphanumeric key to map to a long URL. Typically, the base-62 characters include 26 uppercase letters (ABCD…), 26 lowercase letters (abcd…), and 10 digits (0123…), for a total of 62 characters. Occasionally they will include an underscore or dash, bringing you to base-64. This is all perfectly reasonable when using ASCII and trying to avoid non-printable characters.
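As a sketch of that typical scheme (the alphabet ordering and function name here are my own assumptions; services differ), base-62 encoding might look like:

```python
import string

# A common base-62 alphabet: digits, then uppercase, then lowercase.
# The ordering is an assumption; real services pick their own.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_base62(n):
    """Encode a non-negative integer as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, d = divmod(n, 62)
        digits.append(ALPHABET[d])
    return ''.join(reversed(digits))
```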

For example, using base-62 encoding, you may have a typical short URL that looks like this:


However, nowadays with modern browsers supporting UTF-8, and offering basic rendering of popular Unicode character sets (中文, 한글, etc.), we can leverage a global set of symbols!

One of the larger contiguous Unicode ranges with decent support in modern browsers is the initial CJK block; the Korean Hangul syllables block is another. Why are these interesting? Well, rather than base-62 or base-64, we can use CJK and Hangul syllables to create extremely large numeric bases.

The CJK range of 4e00 to 9fea seems to be adequately supported, as does the Hangul syllable range of ac00 to d7a3. These give us a base-20971 and a base-11172 respectively. Rather than base-62, we can offer numeric bases into the tens of thousands!
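Those base sizes are just the number of code points in each range, counting both endpoints:

```python
# Sanity-check the stated bases: count code points in each inclusive range.
cjk_base = 0x9fea - 0x4e00 + 1      # CJK range 4e00..9fea
hangul_base = 0xd7a3 - 0xac00 + 1   # Hangul syllable range ac00..d7a3
print(cjk_base, hangul_base)  # 20971 11172
```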

This would allow shortened URLs to look like this:


Taken to extremes, let's consider a really large number, like 9,223,372,036,854,775,807 (nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six billion eight hundred fifty-four million seven hundred seventy-five thousand eight hundred seven). This is the largest signed 64-bit integer on most systems. Let's see what happens when we encode this number in extremely large bases:

9,223,372,036,854,775,807 = 7M85y0N8lZa (base-62)
9,223,372,036,854,775,807 = 瑙稃瑰蜣丯 (CJK)
9,223,372,036,854,775,807 = 셁뻾뮋럣깐 (hangul)

The CJK and Hangul encodings are 6 characters shorter than their base-62 counterpart. For a URL Shortening Service, I'm not sure this will ever be useful; no one is likely to need to map nine quintillion URLs. There aren't that many URLs in existence, though there are tens of billions. Let's say we're dealing with 88 billion URLs, and look at a more reasonable large number.

88,555,222,111 = 3CG2Fy1 (base-62)
88,555,222,111 = 執洪仉 (CJK)
88,555,222,111 = 닁읛껅 (hangul)
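Those character counts follow directly from the base sizes. A small sketch (the function is mine, not from x404) that counts how many digits a number needs in a given base, by repeated division:

```python
def digits_needed(n, base):
    # Count base-`base` digits by repeated integer division (exact, no floats).
    count = 0
    while n > 0:
        n //= base
        count += 1
    return max(1, count)

print(digits_needed(9223372036854775807, 62))     # 11
print(digits_needed(9223372036854775807, 20970))  # 5
print(digits_needed(88 * 10**9, 62))              # 7
print(digits_needed(88 * 10**9, 20970))           # 3
```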

NOTE: while the character length of the CJK string is less than that of the base-62 string, each of these CJK characters takes 3 bytes in UTF-8, so the shorter-looking Unicode key can actually cost more bytes on the wire. This will not save you bandwidth, but it's worth mentioning nonetheless.
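A quick illustration of that byte-count caveat, comparing character counts to UTF-8 byte counts for the two keys above:

```python
ascii_key = '3CG2Fy1'  # base-62: 7 characters, 1 byte each in UTF-8
cjk_key = '執洪仉'      # CJK: 3 characters, 3 bytes each in UTF-8

print(len(ascii_key), len(ascii_key.encode('utf-8')))  # 7 7
print(len(cjk_key), len(cjk_key.encode('utf-8')))      # 3 9
```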

To convert a number to one of these Unicode ranges, you can use the following Python:

def encode_urange(n, ustart, uend):
    # Digits are code points in the half-open range [ustart, uend),
    # emitted least-significant first.
    base = uend - ustart
    chars = []
    while n > 0:
        n, d = divmod(n, base)
        chars.append(chr(ustart + d))
    return ''.join(chars)

def decode_urange(nstr, ustart, uend):
    # Inverse of encode_urange: the leftmost character is the
    # least significant digit.
    base = uend - ustart
    basem = 1
    n = 0
    for c in nstr:
        if not ustart <= ord(c) < uend:
            raise ValueError("{!r}: {!r} out of bounds".format(nstr, c))
        n += (ord(c) - ustart) * basem
        basem *= base
    return n

The CJK range is 4e00 to 9fea, and you can map arbitrary CJK to base-10 as follows:

>>> decode_urange('你好世界', int('4e00', 16), int('9fea', 16))
92766958466352922
>>> encode_urange(92766958466352922, int('4e00', 16), int('9fea', 16))
'你好世界'

Unicode is full of fun and interesting character sets; here are some examples that I have built into x404:

# base-10   922,111
base62:     LSR3
top16:      eewwt
CJK:        鶱丫
hangul:     쏉걒
braille:    ⣚⡴⡙
anglosaxon: ᛡᛇᛞᚻᚢ
greek:      οΒΦδ
yijing:     ䷫䷔䷫䷃
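A standalone copy of encode_urange (repeated here so the snippet runs on its own) reproduces the CJK and hangul rows of this table:

```python
def encode_urange(n, ustart, uend):
    # Digits are code points in [ustart, uend), least significant first.
    base = uend - ustart
    chars = []
    while n > 0:
        n, d = divmod(n, base)
        chars.append(chr(ustart + d))
    return ''.join(chars)

print(encode_urange(922111, 0x4e00, 0x9fea))  # CJK row: 鶱丫
print(encode_urange(922111, 0xac00, 0xd7a3))  # hangul row: 쏉걒
```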