Saturday, January 28, 2017

When String.length() Lies

In most languages, developers are accustomed to a string object with a length or size method that returns the number of characters in the string. In Java that expectation breaks for certain Unicode characters, including some emoji. The reason is that a Java String is stored as a sequence of 16-bit UTF-16 code units (the char type), and a single code unit can only represent 65,536 distinct values. Characters whose code points are above U+FFFF do not fit in 16 bits, so they cannot be expressed in a single String character and are stored as a pair of chars instead.

Here's what the cactus emoji 🌵 looks like in the debugger. The emoji occupies two characters of the String, at indexes 6 and 7, yet to the reader it is of course a single character. If you wrote code that iterated over the String and printed each char individually, you would not see a cactus emoji in the output. If you print the whole string you would see it, assuming the console or app doing the printing is capable of rendering emoji.
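The difference between the two counts can be sketched in a few lines. This is a minimal illustration, not the code from the debugger session: the class name LengthDemo and the sample text "Hello " followed by the cactus are assumptions, chosen so that the emoji lands at indexes 6 and 7.

```java
// Hypothetical demo class; the name and the sample text are illustrative only.
final class LengthDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair is counted once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The two escapes at the end are the surrogate pair for cactus, U+1F335.
        String text = "Hello \uD83C\uDF35";

        System.out.println(codeUnits(text));   // prints 8
        System.out.println(codePoints(text));  // prints 7

        // Iterate by code point rather than by char so the emoji stays whole.
        text.codePoints()
            .forEach(cp -> System.out.println(new String(Character.toChars(cp))));
    }
}
```

Iterating with codePoints() (available since Java 8) is one way to avoid printing the two halves of the emoji separately.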



In general this hasn't been much of an issue, since the vast majority of characters in everyday use fall at or below 0xFFFF, in the Basic Multilingual Plane (BMP). But with the emergence of emoji, many new characters are being placed above 0xFFFF, because the BMP has very little unassigned space left. The range above 0xFFFF is made up of the supplementary planes. The first of them, the Supplementary Multilingual Plane (SMP), runs from 10000 to 1FFFF, and there are additional planes beyond it, ending with the private-use planes at F0000–10FFFF.


Characters in the supplementary planes require four bytes in UTF-16, that is, two Java chars, known as a surrogate pair. The first char falls in the high surrogate range and the second in the low surrogate range:


high range 0xD800..0xDBFF.
low range  0xDC00..0xDFFF.


Now, when you see a char that is >= 0xD800 and <= 0xDBFF, you know it is a high surrogate: the first half of a surrogate pair that encodes a code point too large for a single 16-bit char.
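As a rough sketch of that check, the range comparisons can be wrapped in small helpers. The class and method names here (SurrogateCheck, isHigh, isLow, describe) are made up for illustration; the JDK already provides Character.isHighSurrogate and Character.isLowSurrogate, which test the same ranges.

```java
// Hypothetical helper class; the JDK's Character.isHighSurrogate and
// Character.isLowSurrogate perform the same range tests.
final class SurrogateCheck {

    // True when c lies in the high surrogate range 0xD800..0xDBFF.
    static boolean isHigh(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    // True when c lies in the low surrogate range 0xDC00..0xDFFF.
    static boolean isLow(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    // Labels a single char for display purposes.
    static String describe(char c) {
        if (isHigh(c)) return "high surrogate";
        if (isLow(c)) return "low surrogate";
        return "ordinary char";
    }

    public static void main(String[] args) {
        // Sample text is an assumption: "Hello " followed by the cactus emoji.
        String text = "Hello \uD83C\uDF35";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("%d: U+%04X %s%n", i, (int) c, describe(c));
        }
    }
}
```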


The rules for encoding a Unicode code point as a surrogate pair are as follows (from Wikipedia); decoding the pair simply reverses the steps.


Consider the encoding of U+10437 (𐐷):
  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
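The steps above translate fairly directly into bit arithmetic. The following is a sketch only, with hypothetical names (SurrogateMath, encode, decode); in real code the JDK's Character.toChars and Character.toCodePoint do the same job.

```java
// Hypothetical class showing the surrogate arithmetic by hand.
final class SurrogateMath {

    // Encodes a supplementary code point (0x10000..0x10FFFF) as {high, low}.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Reverses encode: rebuilds the code point from a surrogate pair.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x10437);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints D801 DC37
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // prints 10437
    }
}
```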


What this means for the developer is that string.length() may report a size larger than the number of visible characters in the String. Also, when using the substring method you must be careful not to chop in the middle of a surrogate pair. For example,


if (Character.isHighSurrogate(string.charAt(splitPosition)))


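If that test is true, splitPosition holds the first half of a surrogate pair. Assuming splitPosition is the index of the last character you intend to keep, the cut must be adjusted to splitPosition + 1 to keep the whole pair, or to splitPosition - 1 to drop it.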
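Here is one possible way to package that adjustment. Note the assumption: this sketch treats the cut as the exclusive end index that String.substring expects, so it inspects the chars on both sides of the cut and moves the cut back by one when it would land between the halves of a pair. The names SafeSplit, safeCut and firstCodePoints are hypothetical.

```java
// Hypothetical helper for cutting a String without splitting a surrogate pair.
final class SafeSplit {

    // Returns index, or index - 1 when index falls between a high surrogate
    // and the low surrogate that follows it. The result is meant to be used
    // as the end index of substring(0, cut) and the start of substring(cut).
    static int safeCut(String s, int index) {
        if (index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index))) {
            return index - 1;
        }
        return index;
    }

    // Alternative: count in code points and let the JDK find the char index.
    // Throws IndexOutOfBoundsException if s has fewer than n code points.
    static String firstCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        // Sample text is an assumption: the cactus sits at indexes 6 and 7.
        String text = "Hello \uD83C\uDF35!";
        int cut = safeCut(text, 7);                    // 7 would split the pair
        System.out.println(cut);                       // prints 6
        System.out.println(text.substring(cut));       // the emoji followed by "!"
        System.out.println(firstCodePoints(text, 7));  // "Hello " plus the emoji
    }
}
```

Moving the cut backwards is a design choice; moving it forwards (index + 1) would be equally valid if the pair should stay with the first half.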


You can easily determine whether an emoji will occupy more than one character in a String by looking at its Unicode code point. As an example, hot beverage is U+2615. That fits in 16 bits (two bytes), so no issue there. Cactus, however, is U+1F335, which is above U+FFFF and needs more than 16 bits, so it takes two characters in a String.
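The same question can be asked in code. As a small sketch (the class name CharCountDemo and the method charsNeeded are made up for illustration), Character.charCount reports how many chars a given code point needs:

```java
// Hypothetical demo of checking how many chars a code point occupies.
final class CharCountDemo {

    // Returns 1 for code points up to U+FFFF and 2 for supplementary ones.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(charsNeeded(0x2615));   // hot beverage: prints 1
        System.out.println(charsNeeded(0x1F335));  // cactus: prints 2
        System.out.println(Character.isSupplementaryCodePoint(0x1F335)); // prints true
    }
}
```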


Emoji reference: http://apps.timwhitlock.info/emoji/tables/unicode

