Punycode

I would like a webapp that supports UTF-8 URLs. For example, https://去.cc/叼, where both the path and the server name contain non-ASCII characters.

The path /叼 can be handled easily with %-encodings, e.g.,

>>> import urllib.parse
>>> urllib.parse.quote('/叼')
'/%E5%8F%BC'

Note: this is just the UTF-8 byte representation of the string, with each non-ASCII byte written as %XX:

>>> bytes('/叼', 'utf8')
b'/\xe5\x8f\xbc'
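
To make that connection concrete, here is a minimal sketch that builds the %-encoded form by hand from the UTF-8 bytes. The variable names are only for illustration, and leaving every ASCII byte untouched is a shortcut that happens to work for this path; urllib.parse.quote also escapes some ASCII characters (spaces, for example), so this is not a general replacement for it.

```python
import urllib.parse

path = '/叼'
raw = path.encode('utf-8')  # the same bytes as bytes('/叼', 'utf8')

# Keep ASCII bytes as-is and write every other byte as %XX.
# (Assumption: good enough for this path, not a substitute for quote().)
by_hand = ''.join(chr(b) if b < 0x80 else '%{:02X}'.format(b) for b in raw)

print(by_hand)
print(by_hand == urllib.parse.quote(path))
```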

However, the domain name "去.cc" cannot be usefully %-encoded, because "%" is not a valid character in a hostname. The standard encoding for internationalized domain names (IDN) is punycode, in which "去.cc" becomes "xn--1nr.cc".

The "xn--" prefix is the ASCII Compatible Encoding (ACE) prefix; it marks a hostname label as punycode-encoded. Most modern web browsers and HTTP libraries handle this kind of name for you, but if you need to do it yourself, Python has a "punycode" codec. Note that the raw codec output does not include the "xn--" prefix:

>>> '去'.encode('punycode')
b'1nr'
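
How the raw punycode output relates to the prefixed hostname label can be sketched as follows. This is only an illustration: the variable names are made up, and prepending the prefix by hand skips the normalization that real IDN handling performs, so it is shown purely for comparison with Python's "idna" codec.

```python
label = '去'

# Raw punycode carries no prefix, so prepend the ACE prefix by hand.
# (Assumption: the label needs no normalization first.)
by_hand = b'xn--' + label.encode('punycode')

# The "idna" codec adds the prefix itself and leaves pure-ASCII labels alone.
via_idna = label.encode('idna')

print(by_hand, via_idna)
print('cc'.encode('idna'))
```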

In practice, we can use Python's built-in "idna" codec, which applies punycode to each label of the hostname and adds the "xn--" prefix where needed. For example, going from IRI to URI:

>>> p = urllib.parse.urlparse('https://去.cc/叼')
>>> p.netloc.encode('idna')
b'xn--1nr.cc'
>>> urllib.parse.quote(p.path)
'/%E5%8F%BC'

And going the other direction, i.e., URI to IRI:

>>> a = urllib.parse.urlparse('https://xn--1nr.cc/%E5%8F%BC')
>>> a.netloc.encode('utf8').decode('idna')
'去.cc'
>>> urllib.parse.unquote(a.path)
'/叼'
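
Putting the two directions together, a pair of helpers might look like the sketch below. The function names iri_to_uri and uri_to_iri are hypothetical, not part of urllib. The sketch assumes a URL with no userinfo, converts only the hostname and the path as in the examples above, and passes the query and fragment through untouched. It also assumes the IRI path is not already %-encoded, since quoting it again would escape the "%" signs.

```python
import urllib.parse


def iri_to_uri(iri):
    """Hypothetical helper: IDNA-encode the host and %-encode the path."""
    p = urllib.parse.urlsplit(iri)
    host = p.hostname.encode('idna').decode('ascii')
    # Re-attach the port, if any (assumption: no user:password@ part).
    netloc = host if p.port is None else '{}:{}'.format(host, p.port)
    path = urllib.parse.quote(p.path)
    return urllib.parse.urlunsplit((p.scheme, netloc, path, p.query, p.fragment))


def uri_to_iri(uri):
    """Hypothetical helper: the reverse of iri_to_uri for host and path."""
    p = urllib.parse.urlsplit(uri)
    host = p.hostname.encode('ascii').decode('idna')
    netloc = host if p.port is None else '{}:{}'.format(host, p.port)
    path = urllib.parse.unquote(p.path)
    return urllib.parse.urlunsplit((p.scheme, netloc, path, p.query, p.fragment))


uri = iri_to_uri('https://去.cc/叼')
print(uri)
print(uri_to_iri(uri))
```

Using p.hostname rather than p.netloc keeps a port number out of the IDNA step; the trade-off is that hostname is always returned in lower case.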