What every software engineer should know about characters/strings
How often have you seen this?
- There is some binary data, e.g. an array/list/sequence of bytes.
- Someone wants to print that binary data to the screen, i.e. see it in a human-readable format, such as text.
- They do something like `new String(byteArray)` (in Java).
- Things don’t work as expected: the output looks “weird”, or there are other unexpected results.
Many of you will be able to spot the mistake right away: If you have arbitrary binary data, and want to get a string/text representation of that, you should use something like Base64 encoding. You should not just convert it to a String by trying to convert the bytes to characters.
The TL;DR: if you have arbitrary binary data and want a String representation of it, use Base64 (which is meant for exactly this) to encode the data as a String. The reason is that a character is not necessarily a byte, and a byte is not necessarily a character.
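To make the difference concrete, here is a minimal Java sketch; the byte values and class name are arbitrary ones picked for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryToText {
    public static void main(String[] args) {
        // Arbitrary binary data (think: a hash, ciphertext, or part of an image).
        byte[] data = {(byte) 0xC0, (byte) 0xFF, 0x00, 0x41, (byte) 0x9A};

        // Risky: interprets the bytes as text. Malformed sequences silently
        // become replacement characters, and the NUL byte comes along for the ride.
        String risky = new String(data, StandardCharsets.UTF_8);
        System.out.println(risky); // replacement characters, an invisible NUL, and 'A'

        // Safe: Base64 maps any byte sequence to printable ASCII.
        String safe = Base64.getEncoder().encodeToString(data);
        System.out.println(safe); // "wP8AQZo="
    }
}
```

The Base64 string can always be decoded back to the exact original bytes; the “risky” string, in general, cannot.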
What are characters?
Characters are of course made up of bytes, which is why the idea of simply converting bytes to characters seems to make sense. However, a JPEG image is also made up of bytes, yet most people wouldn’t expect an arbitrary sequence of bytes to translate into a valid JPEG image. (At least, I hope not.) The same is true for characters.
I suspect a lot of the confusion comes from some elementary courses folks have taken, where they were directly translating or manipulating bytes as characters via an ASCII table. This is where the “convert from uppercase to lowercase by adding 32” trick came from, since bytes [65, 90] map to [A-Z] and bytes [97, 122] to [a-z]. Under this mental model, it’s easy to conflate bytes with characters, and to think that one character is exactly equal to one byte. (In fact, this is what Python 2 did, where `str` was essentially the same as `bytes`.)
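In Java, where char arithmetic really is just integer arithmetic, the trick looks like this. It is only a toy sketch, and it only works within the ASCII range:

```java
public class AsciiCase {
    public static void main(String[] args) {
        // In ASCII, 'A' is 65 and 'a' is 97: a difference of exactly 32.
        byte upper = 65;                  // 'A'
        byte lower = (byte) (upper + 32); // 'a'
        System.out.println((char) upper + " -> " + (char) lower); // A -> a
        // This only works because [A-Z] and [a-z] happen to sit 32 apart
        // in the ASCII table; it says nothing about characters outside ASCII.
    }
}
```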
However, ASCII is really better thought of as a character encoding, that is, a mapping from characters (the things you see displayed on screen) to their underlying binary representation as bytes. Many of the ASCII values are not printable characters at all; they are things like control characters or the NUL byte. So, right away, you can see that if you took an arbitrary sequence of bytes and tried to interpret them as ASCII, the result might not be something that could be printed to the screen.
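As a small illustration (Java again, with a hand-picked byte sequence), decoding bytes as ASCII happily produces characters you cannot see:

```java
import java.nio.charset.StandardCharsets;

public class AsciiBytes {
    public static void main(String[] args) {
        // Bytes 72 and 105 map to 'H' and 'i'; byte 7 is the BEL control character.
        byte[] bytes = {72, 105, 7};
        String text = new String(bytes, StandardCharsets.US_ASCII);
        System.out.println(text);          // "Hi" followed by an invisible BEL
        System.out.println(text.length()); // 3: the control character is still there
    }
}
```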
ASCII is also not the only character encoding, and it only covered 128 different values. Because of this, it covered only [A-Za-z0-9], plus some punctuation, special characters, and control characters. This was obviously not sufficient for many languages.
To address this, many software companies at the time came up with their own character encodings that were compatible with ASCII (i.e. they shared the same first 128 mappings) but could use more bits/bytes to hold a larger number of characters. This resulted in a multitude of new character encodings (ISO-8859-1, Windows-1252, and so on), many of which did not interoperate; collectively they became known as Extended ASCII.
The Unicode Consortium saw this happening, and decided that the solution was to make one universal standard that everyone would agree on. That became known as Unicode.
Unicode
The first thing to know about Unicode is that it is not a character encoding.
Unicode is a further abstraction beyond a character encoding. Unicode instead defines characters in terms of their code point, an integer usually written in hexadecimal. However, the code point alone does not tell you what the binary representation of that character will be.
Instead, you have to apply an encoding format, such as UTF-8 or UTF-16, in order to map the code point to its byte representation. Despite their names, UTF-8 and UTF-16 do not always use 8 bits and 16 bits, respectively, for each character. Instead, they are variable-width encodings, meaning that depending on the particular character/code point being encoded, a different number of bytes may be used. This means the length of a string (the number of characters) may not equal the number of bytes it takes to represent that string.
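A short Java example makes the distinction visible; the strings below are just arbitrary ones chosen for illustration:

```java
import java.nio.charset.StandardCharsets;

public class Lengths {
    public static void main(String[] args) {
        String s = "héllo"; // 'é' is code point U+00E9

        System.out.println(s.codePointCount(0, s.length()));              // 5 characters
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 6 bytes in UTF-8
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 10 bytes in UTF-16

        // Outside the Basic Multilingual Plane it gets worse: PILE OF POO (U+1F4A9)
        // is one code point, but two UTF-16 code units and four UTF-8 bytes.
        String poo = new String(Character.toChars(0x1F4A9));
        System.out.println(poo.length());                                 // 2
        System.out.println(poo.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```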
You can think of characters and their code points in Unicode as a sort of “Platonic Ideal” of what a character should be. The Unicode standard defines what a character should look like in very general terms, and also assigns it a code point value. It does not, however, define what the ultimate binary representation of that character should be, nor does it define exactly what that character should look like. For example, this is what the PILE OF POO character/emoji specification states:
dog dirt; may be depicted with or without a friendly face
This has led to somewhat varied depictions of the character.
Unicode is also extensible. In fact, the Unicode standard has been updated quite a lot since its inception. They have added many new “planes” (ranges of code points) to capture not just living languages, but also extinct writing systems such as Linear A, the script of the ancient Minoans, which to my knowledge has still not been deciphered. Emojis, the primary form of communication for many, were also added in one of these additional planes.
This has led to many of these planes being known as the “Astral Planes”. (They really went with the whole “kitchen sink” approach with Unicode.)
The addition of these extra planes (ranges of Unicode code points) is why an encoding scheme like UTF-8 needs to be variable width. There are many more characters than could be encoded in a single byte. So, some code points encode to a single byte in UTF-8, but many require more than one byte. To support this variable-length encoding, multi-byte sequences follow a strict pattern: the lead byte signals how many bytes follow, and every continuation byte must start with the bits 10. Byte sequences that don’t follow this pattern are invalid.
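As a concrete illustration in Java, the two UTF-8 bytes of 'é' show the pattern, with a lead byte starting with 110 and a continuation byte starting with 10:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bits {
    public static void main(String[] args) {
        // 'é' is code point U+00E9; UTF-8 encodes it as the two bytes 0xC3 0xA9.
        byte[] bytes = "é".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.println(Integer.toBinaryString(b & 0xFF));
        }
        // Output:
        // 11000011   (lead byte: 110 means "two-byte sequence")
        // 10101001   (continuation byte: must start with 10)
    }
}
```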
Why does this matter?
Because character encodings like UTF-8 are designed to work with specific input (Unicode code points), there is no guarantee that an arbitrary sequence of bytes will decode into printable characters. Since UTF-8 is backward compatible with ASCII, many byte values map to non-printable characters such as control characters, the NUL character, etc. On top of this there is the possibility of invalid byte sequences, as described above.
Put more succinctly, every character in Unicode can be translated to a sequence of bytes using the UTF-8 encoding. However, the reverse is not true; it is not guaranteed that an arbitrary sequence of bytes can be decoded into a valid sequence of characters. This becomes more important when you consider that UTF-8 is probably the most popular character encoding nowadays.
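In Java you can see this by using a strict CharsetDecoder, which (unlike the String constructor) refuses malformed input instead of silently papering over it; the byte values below are just deliberately invalid examples:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8, and 0x80 is a
        // continuation byte with no lead byte in front of it.
        byte[] invalid = {(byte) 0xFF, (byte) 0x80, 0x41};
        try {
            // A CharsetDecoder reports malformed input by default,
            // unlike the String constructor, which silently replaces it.
            String s = StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(invalid))
                    .toString();
            System.out.println(s);
        } catch (CharacterCodingException e) {
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```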
This is why the Javadoc for the constructor `new String(byte[] bytes)` states:
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.
When documentation states that behaviour is “unspecified”, it usually means “don’t do this, and if you did do it and something broke, don’t bother complaining to me, because I told you so.”
Furthermore, that specific constructor has the additional problem of using the machine’s default character set to decode the bytes back to characters. This is even worse, since now the results may vary across different environments. Because of that, seeing something like `new String(someByteArray)` should immediately register as a code smell.
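A small sketch of both problems follows; the bytes below are the UTF-8 encoding of '€', chosen purely for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharset {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // UTF-8 for '€'

        // Depends on the platform default charset: may print '€', 'â‚¬', or '?'.
        System.out.println(Charset.defaultCharset());
        System.out.println(new String(bytes));

        // Deterministic on every machine, but only correct because we *know*
        // these particular bytes are UTF-8 encoded text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // €
    }
}
```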
Instead, if you have arbitrary binary data, the only safe way to get a string representation is to use something like Base64/Base32, which are explicitly designed to take arbitrary binary data and convert it into a safe, printable, human-readable format. (One of the original uses of Base64 was to encode binary attachments for email, which at the time could only contain text.)
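As a closing sketch, a Base64 round trip in Java shows the point: the encoded form is plain printable text, and decoding it recovers the original bytes exactly:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        // Truly arbitrary binary data: 16 random bytes.
        byte[] original = new byte[16];
        new SecureRandom().nextBytes(original);

        // Encode to a printable ASCII string, safe for logs, JSON, email, etc.
        String encoded = Base64.getEncoder().encodeToString(original);
        System.out.println(encoded);

        // Decoding gives back exactly the same bytes; nothing is lost.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(original, decoded)); // true
    }
}
```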