Peter Chng

How long is that string?

What is the length of this string?

"Hello World!"

This isn’t a trick question. In pretty much any programming language, it will be 12. Java is no exception:

public static void simpleStringLengthIntroExample() {
    // 12 characters, as expected.
    final String helloWorld = "Hello World!";
    System.out.println(helloWorld.length());
}

But what if we use international characters? (You can do this directly in a string literal, since Java source code supports Unicode and most source files are encoded as UTF-8.)

public static void basicInternationalStringLengthExample() {
    // 2 characters, as expected.
    final String international = "你好";
    System.out.println(international.length());
}

In this case, the answer is still as expected: 2.

What if we move to more esoteric characters, such as symbols from Cuneiform?

public static void ancientScriptExample() {
    // Cuneiform script: This single "character" (or sign) seems to mean "donkey"
    final String ancient = "𒀲";
    // Even though the above is a single Unicode code point, the string length is 2!
    System.out.println(ancient.length());
}

In this case, a single “character” turns out to have a String length of 2!

And, what about emojis? (Emojis are actually characters under the Unicode standard, and are not the same as pure images)

public static void emojiExample() {
    // This is the "Thumbs Up" in a dark skin tone, which looks like a single character:
    // https://emojipedia.org/thumbs-up-dark-skin-tone/
    final String emoji = "👍🏿";
    // However, its string length is 4!
    System.out.println(emoji.length());
}

In this case, a single emoji (Thumbs Up: Dark Skin Tone) has a String length of 4!

What is going on here?

Strings in Java

Let’s look at Strings in Java, since the above examples are in Java 13. As is the case with many (but not all!) programming languages, characters are not bytes, and bytes are not characters. So, a String of length N might not be represented by exactly N bytes.

But we weren’t talking about byte length above; we were talking about String length, which seems pretty well-defined. Since a String is a sequence of characters, the length of that String should just be the number of characters.

So the question then becomes: What is a character? In Java, the char data type is one of eight primitive data types, and it has a length of 16 bits. The official Oracle documentation states that:

The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

However, Unicode has far more than 2^16 code points, each of which can represent a character. (They are grouped into planes, each of which has 2^16 code points.) So how is a Java char able to represent this much larger range of Unicode characters?

The answer is that a char does not store a Unicode code point. Instead, each char stores one code unit of the UTF-16 encoding of a code point. UTF-16 is a variable-width encoding that encodes each code point as either one or two 16-bit values, called code units. Most of the characters you’d normally use map to a single 16-bit code unit, but some more esoteric characters encode to two.
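
Here is a minimal sketch that makes this concrete, using the standard Character.charCount() and Character.toChars() methods (the wrapper method name is my own, in the style of the examples above):

public static void codeUnitCountExample() {
    // BMP code points like '你' (U+4F60) need only one UTF-16 code unit...
    System.out.println(Character.charCount(0x4F60));  // 1
    // ...but supplementary code points like our Cuneiform sign (U+12032) need two.
    System.out.println(Character.charCount(0x12032)); // 2
    // toChars() gives the actual UTF-16 encoding: here, a surrogate pair.
    final char[] units = Character.toChars(0x12032);
    System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D808 DC32
}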

So basically, some Unicode characters actually map to two Java characters. This is Java’s internal representation of those Unicode characters, and when you call String.length(), you’re getting back the length of the String in terms of the number of Java characters, not the number of Unicode characters.

This is explained, somewhat confusingly, in the Javadoc. In the description for String.length(), we see:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.

Note that they said “Unicode code units” and not “Unicode code points”. The difference is explained in the Javadoc for Character:

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.

So if you just think of Java Strings as UTF-16 encoded strings, you should be fine.
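
If what you actually want is the number of Unicode code points rather than UTF-16 code units, String already exposes that through codePointCount(). A minimal sketch (again, the wrapper method name is my own):

public static void codeUnitsVersusCodePointsExample() {
    final String ancient = "𒀲";
    final String emoji = "👍🏿";
    // length() counts UTF-16 code units...
    System.out.println(ancient.length()); // 2
    System.out.println(emoji.length());   // 4
    // ...while codePointCount() counts Unicode code points.
    System.out.println(ancient.codePointCount(0, ancient.length())); // 1
    System.out.println(emoji.codePointCount(0, emoji.length()));     // 2
}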

Now that we understand how Strings are represented, we can use that knowledge to explain the interesting results from above.

  1. For the String "Hello World!", all of the characters come from the Basic Multilingual Plane (BMP), which means they all fit into a single 16-bit value.
    So, the number of Unicode characters is equal to the String length.
  2. Same goes for the String "你好"; these characters are also from the BMP.
  3. Our Cuneiform symbol, "𒀲", is from the higher Supplementary Multilingual Plane (SMP), which under UTF-16 encodes to two 16-bit values, giving us a Java String length of 2.
  4. The Emoji example of "👍🏿" is even more complicated. Our Thumbs Up: Dark Skin Tone is actually composed of two Unicode characters: the “Thumbs Up” emoji followed by the “Dark Skin Tone” modifier. (The details of emoji modifier sequences are described in the Unicode emoji standard.)
    Each of these Unicode characters comes from the SMP, meaning they each encode to two 16-bit values, giving us a total of four UTF-16 code units, and therefore a Java String length of 4. (The sketch after this list walks through this breakdown.)
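
Here is a small sketch that walks the emoji’s code points with String.codePoints(), confirming the breakdown in item 4 (the wrapper method name is my own):

public static void emojiCodePointsExample() {
    final String emoji = "👍🏿";
    // Walk the Unicode code points: the base emoji, then the skin tone modifier.
    emoji.codePoints().forEach(cp ->
            System.out.printf("U+%04X (%d UTF-16 code units)%n", cp, Character.charCount(cp)));
    // Prints:
    // U+1F44D (2 UTF-16 code units)
    // U+1F3FF (2 UTF-16 code units)
}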

Note that things get even more complicated with Zero-Width Joiner (ZWJ) emoji sequences, which use a joining character (the ZWJ) to bind multiple emoji together so that they render as a single emoji. For example, “Family: Woman, Man, Boy, Girl” is made up of four emoji: Woman, Man, Boy, and Girl, with a ZWJ between each one. This makes the sequence consist of seven Unicode code points, but it renders as a single emoji, looking like this: 👩‍👨‍👦‍👧
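
Here is a sketch of how that ZWJ sequence looks to Java; the string is built from explicit UTF-16 escapes so its structure is visible (the wrapper method name is my own):

public static void zwjEmojiExample() {
    // "Family: Woman, Man, Boy, Girl": four emoji joined by three ZWJs (U+200D).
    final String family = "\uD83D\uDC69\u200D"  // Woman (U+1F469) + ZWJ
            + "\uD83D\uDC68\u200D"              // Man   (U+1F468) + ZWJ
            + "\uD83D\uDC66\u200D"              // Boy   (U+1F466) + ZWJ
            + "\uD83D\uDC67";                   // Girl  (U+1F467)
    System.out.println(family.length());                           // 11 UTF-16 code units
    System.out.println(family.codePointCount(0, family.length())); // 7 code points
}

So even counting code points doesn’t tell you how many “characters” a user will actually see; user-perceived characters (grapheme clusters) are yet another layer on top of this.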

Aside: Why are Strings or characters in Java implemented like this?

When Java was first released, way back in 1996, the Unicode standard did not have as many characters as it does today. It was initially believed that 2^16 distinct values would be enough for any character anyone would ever want to use, and so when Java was being designed, a char was set to 16 bits. Initially, these char values essentially stored the Unicode code point (actually, the UCS-2 encoding, which was basically the same), but as Unicode expanded to multiple “planes”, the need arose to support these additional characters. In Java 5, the internal character encoding of Strings was changed from UCS-2 to UTF-16 to support them; this was probably done so that a char could remain 16 bits wide.

What about other programming languages?

So far, we’ve seen how Java treats the length of various Unicode characters, but what about other programming languages? Is there agreement or consensus on what the length of strings containing these characters should be? Let’s take a look at a few: (The selection of which is biased toward my own experience.)

JavaScript

Tested under Chrome Version 80.0.3987.87 (Official Build) (64-bit):

let helloWorld = "Hello World!";
helloWorld.length
> 12
let international = "你好";
international.length
> 2
let ancient = "𒀲";
ancient.length
> 2
let emoji = "👍🏿";
emoji.length
> 4

JavaScript appears to follow the same conventions as Java 13. This is because String.length is defined to be:

The length property of a String object contains the length of the string, in UTF-16 code units.

This is identical to Java’s definition.

Python 3

Tested on Python 3.7.4 (default, Sep 7 2019, 18:27:02) [Clang 10.0.1 (clang-1001.0.46.4)] on darwin

>>> helloWorld = "Hello World!"
>>> len(helloWorld)
12
>>> international = "你好"
>>> len(international)
2
>>> ancient = "𒀲"
>>> len(ancient)
1
>>> emoji = "👍🏿"
>>> len(emoji)
2

For Python 3, things are different. While the two examples from the BMP are the same, the examples with characters from the SMP differ. In particular, the length of our single Cuneiform symbol is 1, and the length of the emoji is 2; both are half of what Java and JavaScript report.

This is because in Python 3, strings (the str type) represent sequences of Unicode characters (or code points), not some encoding like UTF-16. In this sense, Python 3 strings are “pure”: they refer directly to Unicode characters and not to some encoding of them. Note that the internal representation of a Python 3 string in bytes could be one of several encodings, but the default view you get of a string is not of the bytes, or even of the encoding, but of the Unicode characters or code points.

This is completely different from Python 2, where the str type was essentially a sequence of bytes, and there was a separate unicode type that essentially became str in Python 3.

I won’t go over a separate Python 2 example, since you really shouldn’t be using it.

Go

As tested on the Go Playground:

package main

import "fmt"

func main() {
	helloWorld := "Hello World!"
	fmt.Println(len(helloWorld)) // 12

	international := "你好"
	fmt.Println(len(international)) // 6

	ancient := "𒀲"
	fmt.Println(len(ancient)) // 4

	emoji := "👍🏿"
	fmt.Println(len(emoji)) // 8
}

The results are almost completely different; only the simple "Hello World!" example gives the same string length as the other languages.

As you might imagine, the different results are explained by how Go represents strings. Go uses UTF-8 as the encoding for its string literals, but unlike Python 3, it also treats strings as plain sequences of bytes. As the Go blog’s post on strings, bytes, and runes puts it:

In Go, a string is in effect a read-only slice of bytes.

When you take the “length” of a string in Go, what you are actually getting is the number of bytes needed to encode that string in UTF-8. When you iterate over a string in Go using a regular for loop, you get access to each byte; but recall from before that a character is not necessarily a byte, and a byte is not necessarily a character, so be careful how you treat these bytes! From the same blog post:

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation.

Instead, if you want to iterate over a string and get each Unicode code point (which Go calls a rune), you should use a range loop, like the following:

package main

import "fmt"

func main() {
	helloWorld := "Hello World!"
	fmt.Printf("Length of '%s' is %v\n", helloWorld, len(helloWorld)) // 12
	for i, runeValue := range helloWorld {
		fmt.Printf("%#U at byte index %v\n", runeValue, i)
	}

	international := "你好"
	fmt.Printf("Length of '%s' is %v\n", international, len(international)) // 6
	for i, runeValue := range international {
		fmt.Printf("%#U at byte index %v\n", runeValue, i)
	}

	ancient := "𒀲"
	fmt.Printf("Length of '%s' is %v\n", ancient, len(ancient)) // 4
	for i, runeValue := range ancient {
		fmt.Printf("%#U at byte index %v\n", runeValue, i)
	}

	emoji := "👍🏿"
	fmt.Printf("Length of '%s' is %v\n", emoji, len(emoji)) // 8
	for i, runeValue := range emoji {
		fmt.Printf("%#U at byte index %v\n", runeValue, i)
	}
}

This results in the following output:

Length of 'Hello World!' is 12
U+0048 'H' at byte index 0
U+0065 'e' at byte index 1
U+006C 'l' at byte index 2
U+006C 'l' at byte index 3
U+006F 'o' at byte index 4
U+0020 ' ' at byte index 5
U+0057 'W' at byte index 6
U+006F 'o' at byte index 7
U+0072 'r' at byte index 8
U+006C 'l' at byte index 9
U+0064 'd' at byte index 10
U+0021 '!' at byte index 11
Length of '你好' is 6
U+4F60 '你' at byte index 0
U+597D '好' at byte index 3
Length of '𒀲' is 4
U+12032 '𒀲' at byte index 0
Length of '👍🏿' is 8
U+1F44D '👍' at byte index 0
U+1F3FF '🏿' at byte index 4

When you use a range loop over a string like this, it decodes the UTF-8 encoded bytes into runes one at a time, where each rune is a Unicode code point. Since rune is an alias for the int32 type, there is more than enough room in a rune to store any Unicode code point.

Go’s justification for this design choice is as follows:

Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.

I’ll leave it to you to decide whether this justification makes sense, but it is worth noting that it is markedly different from the other languages covered here.

Conclusion

The main point here is that it is important to understand how strings are represented in the programming language you are using. Although the examples I’ve given here could be considered edge cases, not properly understanding how string length relates to the number of characters in the string can lead to bugs which can be hard to diagnose, or even worse, security issues.

This is especially important if you intend to start supporting Unicode in your application for whatever reason, whether internationalization or just because you want emojis. If you’ve only been dealing with “normal” characters up to this point, your concept of a string’s length might not actually match reality, but you won’t have noticed because you haven’t yet encountered these more esoteric Unicode characters.