Peter Chng

Why a Java String may not be a String

This is the third in a series of articles about strings that I’ve written. As you can tell, I have a (perhaps unhealthy) obsession with strings. Here are the first two, along with a quick summary of each:

  1. Why you should use Base64: If you have an arbitrary sequence of bytes and want to convert it to a string, the only safe way is to use something like Base64 encoding. This is because although all strings are sequences of bytes, not all sequences of bytes are valid strings.
  2. String length in various programming languages: Different programming languages may have different definitions of what the length of a given string is. This mostly has to do with how these languages internally represent strings.

In the second article above, I stated:

So if you just think of Java Strings as UTF-16 encoded strings, you should be fine.

I’ll have to offer a mea culpa here, as that statement is not entirely correct. Not all instances of a String in Java will represent a valid UTF-16 string. Let’s dig into the details to find out why.

Background

Before we go into the details of strings in Java, it’s worthwhile to give a quick overview of what a string is. A string is a sequence of characters, and each character represents a symbol that has been encoded to one or more bytes. The system that dictates how characters/symbols are encoded to bytes is called a character encoding.

Typically, the set of symbols is drawn from Unicode, which is essentially a set of symbols (characters) for almost every written language out there - even ones that remain undeciphered. Each Unicode character has a value associated with it called a code point. These are integer values and are typically represented as hexadecimal values with a U+ prefix. For example, the ~ character (“tilde”) has a code point value of U+007E.

These integer code point values do not represent the actual byte values of the character; you should just think of them as a numerical unique identifier for the character. A Unicode character encoding is then used to map from these code points to the actual bytes.
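
To make the distinction concrete, here is a minimal sketch (using only standard library methods, with java.nio.charset.StandardCharsets assumed to be imported) showing that the code point of ~ is just the number 0x7E, while the bytes that represent it depend on the chosen encoding:

public static void codePointVersusBytesExample() {
    // The code point of '~' is just an integer identifier: U+007E, i.e. 126.
    final int codePoint = "~".codePointAt(0);
    System.out.println(Integer.toHexString(codePoint)); // Prints "7e"

    // The bytes used to store '~' depend on the character encoding chosen.
    final byte[] utf8Bytes = "~".getBytes(StandardCharsets.UTF_8);     // One byte: 0x7E
    final byte[] utf16Bytes = "~".getBytes(StandardCharsets.UTF_16LE); // Two bytes: 0x7E, 0x00
    System.out.println(utf8Bytes.length + " vs " + utf16Bytes.length); // Prints "1 vs 2"
}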

[Figure: Unicode transformation to bytes]

In the case of Java strings, they are represented as sequences of Java characters (the primitive char type), with each Java character being 16 bits long. Since not all Unicode code point values fit into 16 bits, the UTF-16 encoding is used to map Unicode characters onto these 16-bit Java characters.

UTF-16 encodes each Unicode code point as one or two 16-bit code units, and each 16-bit code unit corresponds to a Java character. This means that a single Unicode character (one code point) may be represented by two Java characters (two code units)!
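
As a quick illustration (a minimal sketch; I’ve arbitrarily picked U+1F600, a grinning-face emoji, as the example character), a code point outside the 16-bit range becomes two Java char values, which is why String.length() can disagree with the number of code points:

public static void codeUnitsVersusCodePointsExample() {
    // U+1F600 lies outside the 16-bit range, so Character.toChars()
    // returns two chars (a surrogate pair).
    final char[] chars = Character.toChars(0x1F600);
    final String s = new String(chars);

    System.out.println(s.length());                      // Prints 2 (number of code units)
    System.out.println(s.codePointCount(0, s.length())); // Prints 1 (one Unicode character)
}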

UTF-16 strings impose additional requirements for validity

Recall that not all sequences of bytes will be valid strings; in the same vein, not all sequences of UTF-16 code units (aka Java char values) will be valid UTF-16 strings. However, because Java’s String type has a constructor that accepts a char[] and doesn’t do any validation, you can create an instance of a String that is not a valid UTF-16 string. This is because the String type is more-or-less just a wrapper around an array of Java characters.

Let’s take a look at some valid and invalid UTF-16 sequences in Java, and how they are handled. First, a valid UTF-16 sequence where a surrogate pair (two char values) is used to represent a single Unicode character:

public static void validUtf16Example() {
    // The G-clef has code point of U+1D11E: https://www.compart.com/en/unicode/U+1D11E
    // It is encoded in UTF-16 as a "surrogate pair".
    // These are the two characters {'\uD834', '\uDD1E'}.
    // We represent these using escape sequences to directly specify the code unit values.
    final char[] data = {'\uD834', '\uDD1E', ' ', 'g', '-', 'c', 'l', 'e', 'f'};
    final String output = new String(data);

    // Prints out: "𝄞 g-clef"
    System.out.println(output);
}

In the above example, the surrogate pair {'\uD834', '\uDD1E'} supplies the two UTF-16 code units that encode the musical “G-clef” symbol (U+1D11E). The first code unit is known as the high surrogate, and the second as the low surrogate. All high surrogates come from the range 0xD800-0xDBFF and all low surrogates from the range 0xDC00-0xDFFF; in any valid UTF-16 sequence, a high surrogate must be immediately followed by a low surrogate.
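
The standard library exposes these ranges through the Character class, so you can inspect a surrogate pair yourself. Here is a minimal sketch using the G-clef pair from above:

public static void surrogatePairInspectionExample() {
    // The two code units of the G-clef surrogate pair.
    final char high = '\uD834';
    final char low = '\uDD1E';

    System.out.println(Character.isHighSurrogate(high)); // Prints true (in 0xD800-0xDBFF)
    System.out.println(Character.isLowSurrogate(low));   // Prints true (in 0xDC00-0xDFFF)

    // Combining the pair recovers the original code point, U+1D11E.
    System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // Prints "1d11e"
}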

To create some invalid UTF-16, we can simply drop the low surrogate to create an unpaired surrogate, which constitutes an invalid UTF-16 sequence:

public static void invalidUtf16Example() {
    // If we only include the first (high) surrogate from the G-clef pair {'\uD834', '\uDD1E'},
    // this would be invalid UTF-16. But the String constructor is fine with this!
    final char[] data = {'\uD834', ' ', 'g', '-', 'c', 'l', 'e', 'f'};
    final String output = new String(data);

    // When you try to print out this invalid string, you will probably see: "? g-clef"
    // The unpaired high surrogate cannot be encoded to the console's charset, so it typically
    // gets replaced with a '?', though the exact output depends on your terminal.
    System.out.println(output);
}

You can see that the String(char[]) constructor will happily let you create a String from an array of characters that forms an invalid UTF-16 sequence, meaning that you can create an instance of a String that is not a valid UTF-16 string. While you can create such a string, using it can be fraught with danger - even just trying to print it out can yield weird results.
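
For example, encoding such a string to bytes silently replaces the unpaired surrogate. Here is a minimal sketch; on the JDK versions I’ve tried, String.getBytes() replaces characters it cannot encode with '?' (byte 0x3F):

public static void invalidStringGetBytesExample() {
    final String invalid = new String(new char[] {'\uD834'}); // Just the unpaired high surrogate
    final byte[] utf8Bytes = invalid.getBytes(StandardCharsets.UTF_8);

    // The unpaired surrogate cannot be encoded, so it is silently replaced.
    // On the JDKs I've tried, this prints: [63], i.e. a single '?' byte.
    System.out.println(Arrays.toString(utf8Bytes));
}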

Validating UTF-16 sequences

If the String constructor won’t verify that we have a valid UTF-16 sequence, then what will?

One option is to use an instance of the CharsetEncoder class to try to encode a sequence of Java char values into its underlying byte representation. This encode operation performs strict validation and will reject an invalid sequence of Java characters:

public static void verifyValidUtf16Example() throws CharacterCodingException {
    final char[] data = {'\uD834', '\uDD1E', ' ', 'g', '-', 'c', 'l', 'e', 'f'};
    // UTF_16LE is the little-endian (LE) variant of UTF-16.
    final CharsetEncoder utf16encoder = StandardCharsets.UTF_16LE.newEncoder();
    final ByteBuffer byteBuffer = utf16encoder.encode(CharBuffer.wrap(data));

    // Get the bytes representing the UTF-16 encoding.
    // Note: ArrayUtils.toObject() is from Apache Commons Lang; it boxes the byte[] into a Byte[]
    // so that Arrays.asList() produces a list of bytes rather than a single-element list.
    System.out.println(Arrays.asList(ArrayUtils.toObject(byteBuffer.array())));
}

If the input sequence of characters is not valid UTF-16, the CharsetEncoder.encode() method will throw a CharacterCodingException, as in the following example:

public static void verifyInvalidUtf16Example() throws CharacterCodingException {
    final char[] data = {'\uD834', ' ', 'g', '-', 'c', 'l', 'e', 'f'};
    final CharsetEncoder utf16encoder = StandardCharsets.UTF_16LE.newEncoder();

    // This will throw a MalformedInputException (sub-class of CharacterCodingException)
    // due to the unpaired high surrogate.
    final ByteBuffer byteBuffer = utf16encoder.encode(CharBuffer.wrap(data));
    System.out.println(Arrays.asList(ArrayUtils.toObject(byteBuffer.array())));
}
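
If you’d rather not deal with exceptions, a CharsetEncoder can also be configured to replace malformed input instead of reporting it. A minimal sketch of that alternative (CodingErrorAction comes from java.nio.charset):

public static void replaceInsteadOfThrowExample() throws CharacterCodingException {
    final char[] data = {'\uD834', ' ', 'g', '-', 'c', 'l', 'e', 'f'};
    final CharsetEncoder utf16encoder = StandardCharsets.UTF_16LE.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // No exception here: the unpaired high surrogate is replaced with the
    // encoder's default replacement bytes instead.
    final ByteBuffer byteBuffer = utf16encoder.encode(CharBuffer.wrap(data));
    System.out.println(byteBuffer.remaining() + " bytes encoded");
}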

Conclusion

A Java String, while consisting of a sequence of primitive char values (aka UTF-16 code units), may not be a valid UTF-16 string. This is because not every possible sequence of char values forms a valid UTF-16 string.

However, the String class doesn’t impose restrictions around this. Instead, the String class should be thought of as a wrapper around an array of characters. If you want to ensure that a sequence of Java characters actually is a valid UTF-16 string, you should use something like CharsetEncoder to verify that.
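
As a convenience, you can wrap this check in a small helper. This is a minimal sketch (the method name isValidUtf16 is mine, not a standard API) that relies on CharsetEncoder.canEncode(), which returns false rather than throwing when the sequence cannot be encoded cleanly:

// A hypothetical helper, not part of the standard library: returns true only if
// the String's char sequence forms valid UTF-16 (i.e. no unpaired surrogates).
public static boolean isValidUtf16(final String s) {
    // canEncode() returns false if encoding would require throwing or replacing.
    return StandardCharsets.UTF_16LE.newEncoder().canEncode(s);
}

For instance, this helper would return true for the valid G-clef string from earlier and false for the one containing the unpaired high surrogate.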