highaltitudedev: January 2017

Saturday, January 28, 2017

When String.length() Lies

In most languages, developers are accustomed to a string object with a length or size method that returns the number of characters in the string. But this fails in Java for certain characters in the Unicode range, including some emoji. The reason is that Java represents strings internally as UTF-16, where each char is a 16-bit code unit that can hold at most 65,536 distinct values. Code points above U+FFFF do not fit in 16 bits, so they cannot be expressed in a single String character and take two chars instead.

Here's what the cactus emoji 🌵 looks like in the debugger. The emoji occupies two chars of the String, at indexes 6 and 7, yet of course it is a single character. If you wrote code that iterated over the String and printed each char separately, you would not see a cactus emoji in the output. If you print the whole string you would see it, assuming the console or app doing the printing is capable of rendering emoji.
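To make the difference concrete, here is a minimal sketch. The class name CactusLength and the helper codePointStrings are made up for illustration, and the test string "cactus" plus the emoji is an assumption chosen so that the emoji lands at indexes 6 and 7. It compares String.length() with codePointCount() and iterates by code point instead of by char.

```java
import java.util.ArrayList;
import java.util.List;

final class CactusLength {
    // Returns each code point of s as its own String, so a surrogate pair stays together.
    static List<String> codePointStrings(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    public static void main(String[] args) {
        // "\uD83C\uDF35" is the cactus emoji U+1F335 written as its surrogate pair.
        String s = "cactus\uD83C\uDF35";
        System.out.println(s.length());                      // 8 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 7 code points
        // One entry per code point; the last entry is the whole emoji.
        System.out.println(codePointStrings(s));
    }
}
```

Note that code points are still not the same as what a user perceives as one character (some emoji are sequences of several code points), so this only fixes the surrogate pair part of the problem.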

In general this hasn't been much of an issue since the vast majority of characters in everyday use fall within 0x0000–0xFFFF, the Basic Multilingual Plane (BMP). But with the emergence of emoji, many of the new characters are being placed above 0xFFFF, simply because the BMP is nearly full. Code points above 0xFFFF belong to the supplementary planes. The first of these is the Supplementary Multilingual Plane (SMP), which covers 10000–1FFFF, and there are further planes beyond it, up to 100000–10FFFF.

Characters that fall into the SMP require four bytes or two Java characters. The first character falls in the high range and the second in the low range:

high range 0xD800..0xDBFF.

low range 0xDC00..0xDFFF.

Now, when you see a char that is >= 0xD800 and <= 0xDBFF, you know that it is the first half of a surrogate pair, and that the code point it belongs to does not fit in two bytes.

The rules for encoding a Unicode code point as a surrogate pair are as follows (from Wikipedia); decoding is the same arithmetic in reverse.

Consider the encoding of U+10437 (𐐷):

Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
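The steps above can be sketched in Java as follows. The class name SurrogateMath and its encode and decode methods are hypothetical helpers written for this post; in real code the JDK's Character.toChars and Character.toCodePoint already do this.

```java
final class SurrogateMath {
    // Encodes a supplementary code point (U+10000..U+10FFFF) as {high, low} surrogates.
    static char[] encode(int codePoint) {
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Reverses encode: rebuilds the code point from a surrogate pair.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x10437);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D801 DC37
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // 10437
        // Compare against the JDK's own helper.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x10437);
    }
}
```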

What this means for the developer is that string.length may report a size that is larger than the visible character size of the String. Also, when using the substring method you must be careful not to chop in the middle of a surrogate pair. For example,

if (string.charAt(splitPosition) >= 0xD800 && string.charAt(splitPosition) <= 0xDBFF)

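If that test is true and splitPosition is the index of the last char you intend to keep, the cut would land in the middle of the pair. The substring position must then be adjusted to splitPosition + 1 (to keep the whole pair) or splitPosition - 1 (to drop it).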
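Here is one way that adjustment might look, as a sketch rather than a definitive implementation. SafeSplit, safeEnd and truncate are hypothetical names. Note one assumption that differs from the check above: here the index is the exclusive end index passed to substring, so the method looks at the chars on both sides of the cut, and it always drops the pair instead of keeping it.

```java
final class SafeSplit {
    // Returns an end index <= end that does not fall between a high and a low surrogate.
    static int safeEnd(String s, int end) {
        if (end > 0 && end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            return end - 1; // drop the whole pair rather than half of it
        }
        return end;
    }

    // Truncates s to at most maxChars chars (maxChars >= 0) without splitting a pair.
    static String truncate(String s, int maxChars) {
        int end = Math.min(maxChars, s.length());
        return s.substring(0, safeEnd(s, end));
    }

    public static void main(String[] args) {
        String s = "cactus\uD83C\uDF35"; // emoji at indexes 6 and 7
        System.out.println(truncate(s, 7).length()); // 6, the half pair is dropped
        System.out.println(truncate(s, 8).length()); // 8, the whole string fits
    }
}
```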

You can easily determine whether an emoji will occupy more than one char in a String by looking at its Unicode code point. As an example, hot beverage is U+2615 (hex). That fits in two bytes, so there is no issue. But cactus is U+1F335, which is above U+FFFF and does not fit in two bytes, so it needs two chars in a String.
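The same check can be done in code. This is a small sketch with a made-up class name, CharsNeeded; Character.charCount and Character.isSupplementaryCodePoint are standard JDK methods.

```java
final class CharsNeeded {
    // True when the code point is above U+FFFF and therefore needs a surrogate pair.
    static boolean needsSurrogatePair(int codePoint) {
        return codePoint > 0xFFFF;
    }

    public static void main(String[] args) {
        int hotBeverage = 0x2615;
        int cactus = 0x1F335;
        System.out.println(Character.charCount(hotBeverage)); // 1
        System.out.println(Character.charCount(cactus));      // 2
        // The hand-written test agrees with the JDK's own.
        System.out.println(needsSurrogatePair(cactus) == Character.isSupplementaryCodePoint(cactus));
    }
}
```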

Emoji reference: http://apps.timwhitlock.info/emoji/tables/unicode

Netty Conflict with TwitterServer and Play WS

TL;DR: when an app runs from the IDE but not when invoked by other means, such as the command line, it could be a dependency conflict where classpath ordering matters. AKA, seemingly inexplicable errors are often explicable, but only after some headachy troubleshooting.

I might be the first person to run TwitterServer with akka and Play WS. There's really no reason for this combination of frameworks. I first started working with Play WS (built on AsyncHttpClient) because it's super simple (much less convoluted than akka-http). See, super easy:

val updateIp = client
  .url(s"https://app.herokuapp.com/ip-address/${ip}")
  .withMethod("POST")
  .execute()

updateIp.onComplete {
  case Success(response) =>
    response.status match {
      case 200 =>
        log.info(s"Heroku ip address update returned ${response.body}")
      case _ =>
        log.warning(s"Heroku update ip address failed with status ${response.status}")
    }
  case Failure(e) =>
    log.warning(s"Heroku update ip address failed with exception ${e}")
}

Then I added akka and later tossed TwitterServer into the mix for the sweet stats. Here's a screenshot of the admin server UI that comes with TwitterServer

Admittedly this chart isn't too exciting but the stats provided by TwitterServer (actually from Finagle) and the included utilities (profiling etc) are exceedingly useful.

After I added TwitterServer, the app ran just fine in IntelliJ but I could not run it from the command line. Every time I saw:

java.net.BindException: Failed to bind to 0.0.0.0/0.0.0.0:9999: Unable to create Channel from class class io.netty.channel.socket.nio.NioServerSocketChanne\

When I see this, my first reaction is that I left an instance running, so I checked:

netstat -lnp | grep 9990

but nope, so this made absolutely no sense.

Then, when I removed TwitterServer from the code but kept the dependency, I got past the bind exception, only to hit a different Netty error:

WARN c.r.app.AppActor - Failed to get ip from heroku java.net.ConnectException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel

In this case there were conflicting Netty dependencies. Specifically, the Netty versions required by TwitterServer and Play WS did not agree. When I ran from IntelliJ, the Netty version that loaded first happened to work for both Play WS and TwitterServer. But the package built with SBT's JavaServerAppPackaging put the problematic, conflicting Netty classes first on the classpath, which resulted in the weird and misleading Netty errors. I was able to solve the problem by simply shuffling the order of the dependencies in build.sbt.
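One way to confirm this kind of conflict is to ask the JVM where a class was actually loaded from, once in the IDE and once from the packaged app, and compare. This is not what I did at the time, just a sketch; WhichJar and locationOf are hypothetical names. It uses only standard JDK calls and is written in Java so it can be dropped into any JVM project.

```java
import java.security.CodeSource;

final class WhichJar {
    // Returns the jar or directory a class was loaded from, or a note when
    // the JVM reports no code source (typical for classes from the JDK itself).
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(no code source, e.g. a JDK class)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, or fall back to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Running it with io.netty.channel.socket.nio.NioServerSocketChannel as the argument, using the same classpath as the app, should show which Netty jar wins in each environment.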

Works:

libraryDependencies += "com.twitter" %% "twitter-server" % "1.25.0"

libraryDependencies += "com.twitter" % "finagle-stats_2.11" % "6.40.0"

libraryDependencies += "com.typesafe.play" %% "play-ws" % "2.5.4"

libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.16"

libraryDependencies += "com.typesafe.akka" %% "akka-http-core" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-http" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-http-testkit" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-http-spray-json" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-http-jackson" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-http-xml" % "10.0.0"

libraryDependencies += "com.typesafe.akka" %% "akka-slf4j" % "2.4.16"

libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.3"

libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.1.0"

Fails: