Markus’s Substack

Solving the Encoding Puzzle

Deciphering the Multilingual Character Mysteries

Markus
Jan 08, 2024

Have you ever encountered a perplexing situation while working with languages other than English, where specific characters mysteriously transform into question marks or incomprehensible symbols? In this article, I'll delve into this phenomenon, a common occurrence in my experience, and provide insights into its nature and potential solutions.

To begin with, it's essential to understand that all characters are stored as bytes on disk; which bytes depends on the encoding in use, such as ASCII or UTF-8. Interestingly, because characters occupy sequential positions on the code page, the regular expression range "A-z" captures both uppercase and lowercase letters, as capital letters precede lowercase ones in the sequence. Beware, though: it also captures the six punctuation characters sitting between 'Z' and 'a', so "A-Za-z" is the safer choice. Conversely, the expression "a-Z" is invalid, as it attempts to define a range starting with a higher character code and ending with a lower one, which results in a parsing error.
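The ranges above are easy to check in Python; any regex engine that builds character classes over ASCII-ordered code points behaves the same way:

```python
import re

# In ASCII, 'A'-'Z' occupy code points 65-90 and 'a'-'z' occupy 97-122,
# with six punctuation characters (91-96) sitting in between.
print(ord('A'), ord('Z'), ord('a'), ord('z'))  # 65 90 97 122

# The range [A-z] therefore matches more than just letters:
print(re.findall(r'[A-z]', 'Ab_c'))  # ['A', 'b', '_', 'c'] - note the underscore

# A reversed range like [a-Z] is rejected at compile time:
try:
    re.compile(r'[a-Z]')
except re.error as exc:
    print('invalid range:', exc)
```

The safer class `[A-Za-z]` sidesteps the punctuation gap entirely.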

Adjacent to this technicality lies a more sinister issue in the realm of encoding: a vulnerability known as homograph attacks. Because Unicode contains characters from many scripts, some of them visually indistinguishable, cybercriminals can create deceptive URLs that resemble legitimate domains. For example, a domain like apple.com can be mimicked by replacing the Latin "a" with a visually identical counterpart from another script, such as the Cyrillic "а". This subtle deception, undetectable at a glance, can lure users to malicious sites while appearing legitimate.

This type of attack, as highlighted in a 2017 article by The Hacker News, exploits the wide array of characters available in Unicode to craft URLs that visually mimic those of reputable sites, yet lead to harmful destinations. The risk is compounded by the issuance of HTTPS certificates for these deceptive domains, giving a false sense of security to users who often associate HTTPS with trustworthiness.
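A minimal sketch in Python shows why a spoofed domain renders like the real one yet compares as a different string, using the classic Latin "a" versus Cyrillic "а" pair:

```python
latin_domain = "apple.com"
spoofed_domain = "\u0430pple.com"  # first letter is Cyrillic 'а' (U+0430)

print(spoofed_domain)                  # looks just like apple.com in most fonts
print(latin_domain == spoofed_domain)  # False: the underlying code points differ
print(ord(latin_domain[0]), ord(spoofed_domain[0]))  # 97 1072
```

This is why modern browsers fall back to displaying suspicious mixed-script domains in their Punycode (`xn--...`) form rather than as rendered glyphs.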

ASCII, for instance, comprises 128 code points, of which 95 are printable characters (the rest being whitespace and control characters). My most frequent encounters have been with CP-1252, or Windows Latin, which is built on top of ISO-8859-1. Whether these names ring a bell or not, the takeaway is that when writing text, the encoding (or charset) operates in the background, typically unbeknownst to the user.
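Both counts, and the relationship between the two single-byte charsets, can be verified directly:

```python
# ASCII: code points 0-127, of which 95 (space through '~') are printable
printable = [chr(c) for c in range(128) if chr(c).isprintable()]
print(len(printable))  # 95

# CP-1252 extends ISO-8859-1 by assigning printable characters to the
# 0x80-0x9F range, which ISO-8859-1 leaves as control codes:
print(b'\x80'.decode('cp1252'))   # the euro sign
print(b'\x80'.decode('latin-1'))  # U+0080, an invisible control character
```

For every byte outside that 0x80–0x9F window, the two encodings agree, which is exactly why they are so often confused with each other.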

The character sets I mentioned above are single-byte, meaning they use 8 bits to store a single letter. In contrast, UTF-8 and UTF-16 are variable-width: UTF-8 uses one to four bytes per character, and UTF-16 uses two or four. Moreover, UTF-8 maintains compatibility with ASCII, so a letter like 'A' is represented identically in both encodings.
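Encoding a few characters to raw bytes makes the width differences concrete:

```python
# 'A' encodes to the same single byte in ASCII and UTF-8
print('A'.encode('ascii'))   # b'A'
print('A'.encode('utf-8'))   # b'A'

# Beyond the ASCII range, UTF-8 grows to multiple bytes per character
print('Õ'.encode('utf-8'))   # b'\xc3\x95' (two bytes)
print('€'.encode('utf-8'))   # b'\xe2\x82\xac' (three bytes)

# UTF-16 spends two bytes even on plain 'A' (little-endian, no BOM here;
# plain 'utf-16' would also prepend a byte-order mark)
print('A'.encode('utf-16-le'))  # b'A\x00'
```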

So, how do we end up with garbled characters or question marks? This usually happens when bytes written in one encoding are read back as a "smaller" encoding that doesn't recognize them. For example, 'Õ' occupies two bytes in UTF-8; decoded as a single-byte encoding, those bytes turn into two unrelated characters (such as 'Ã•'), while an encoder that cannot represent the character at all may substitute '?'. I once faced a situation where text stored in one encoding passed through middleware using a second encoding and was eventually displayed in a third, resulting in severely garbled output.
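Both failure modes from the paragraph above can be reproduced in two lines of Python:

```python
text = 'Õ'  # U+00D5, stored as two bytes in UTF-8: 0xC3 0x95
utf8_bytes = text.encode('utf-8')

# Decoding with the wrong single-byte charset turns the two bytes
# into two unrelated characters - classic mojibake:
print(utf8_bytes.decode('cp1252'))  # 'Ã•'

# An encoder that simply cannot represent the character
# substitutes a question mark instead:
print(text.encode('ascii', errors='replace'))  # b'?'
```

Chain two or three such mismatches, as in the middleware scenario above, and the output degrades beyond easy recognition.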

What's the solution?

© 2025 Markus Karileet