Saturday, January 28, 2017

When String.length() Lies

In most languages, developers are accustomed to a string object with a length or size method that returns the number of characters in the string. In Java this breaks down for certain characters in the Unicode range, including some emoji. The reason is that Java strings are stored internally as UTF-16, where each char is a 16-bit code unit that can hold at most 65,536 distinct values. Code points above U+FFFF do not fit in 16 bits, so they cannot be expressed as a single String char and are stored as two.

Here's what the cactus emoji 🌵 looks like in the debugger. The emoji occupies two chars of the String, at indexes 6 and 7, yet it is of course a single character. If you wrote code that iterated over the String and printed each char, you would not see a cactus emoji in the output. If you print the whole string you would see it, assuming the console or app doing the printing is capable of rendering emoji.
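
A minimal sketch of the difference, assuming a string of "cactus" followed by the emoji (the actual string in the debugger is not shown here, and the class and method names are only illustrative). The emoji is written as its two UTF-16 escapes so the source file encoding doesn't matter:

```java
public class CactusLength {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "cactus" followed by U+1F335, which lands at indexes 6 and 7.
        String s = "cactus\uD83C\uDF35";

        System.out.println(s.length());         // 8
        System.out.println(codePointLength(s)); // 7

        // Iterate by code point instead of by char, so the pair stays together.
        s.codePoints().forEach(cp ->
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                + " " + new String(Character.toChars(cp))));
    }
}
```

String.codePoints() needs Java 8 or later. On older versions, a loop using codePointAt and Character.charCount does the same job.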



In general this hasn't been much of an issue, since the vast majority of characters in use fall at or below 0xFFFF, in the Basic Multilingual Plane (BMP). But now, with the emergence of emoji, some of those new characters are being placed above 0xFFFF, simply because we've run out of code points in the BMP. The first plane above the BMP is the Supplementary Multilingual Plane (SMP), which covers 10000 to 1FFFF. There are further supplementary planes beyond it, up to the private use planes at F0000 to 10FFFF.


Characters that fall into the SMP (or any other supplementary plane) require four bytes in UTF-16, or two Java chars. The first char falls in the high surrogate range and the second in the low surrogate range:


high range 0xD800..0xDBFF.
low range  0xDC00..0xDFFF.


Now, when you see a char that is >= 0xD800 and <= 0xDBFF, you know that it is a high surrogate: the first half of a surrogate pair that encodes a code point too large for two bytes.


The rules for encoding a Unicode code point as a surrogate pair are as follows (from Wikipedia):


Consider the encoding of U+10437 (𐐷):
  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
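
The same arithmetic can be sketched in Java. The class and method names here are illustrative, not part of any library. The JDK's own Character.toChars and Character.toCodePoint perform the conversion and are used below only as a cross-check:

```java
import java.util.Arrays;

public class SurrogateEncoder {
    // Encodes a supplementary code point (U+10000..U+10FFFF) as a surrogate pair.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    // Reverses the steps above to recover the code point from a pair.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x10437);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D801 DC37
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // 10437

        // Cross-check against the JDK's built-in conversions.
        System.out.println(Arrays.equals(pair, Character.toChars(0x10437)));    // true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x10437); // true
    }
}
```

In real code there is no reason to hand-roll this, since the Character methods already cover it. The sketch only makes the bit manipulation visible.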


What this means for the developer is that string.length() may report a size that is larger than the number of visible characters in the String. Also, when using the substring method you must be careful not to chop in the middle of a surrogate pair. For example,


if (string.charAt(splitPosition) >= 0xD800 && string.charAt(splitPosition) <= 0xDBFF)


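Then the char at splitPosition is the first half of a surrogate pair, and the substring position must be adjusted to splitPosition + 1 or splitPosition - 1 so that both halves of the pair end up on the same side of the split.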
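
Here is one way this could be wrapped in a helper, as a sketch rather than a definitive implementation. It assumes the cut index is the boundary passed to substring (so substring(0, cut) and substring(cut) are the two halves), which means it looks at the chars on both sides of the cut rather than only at charAt(splitPosition). The name safeCut is made up for illustration:

```java
public class SafeSplit {
    // Returns cut, or cut - 1 if cutting at cut would separate a high
    // surrogate from the low surrogate that follows it.
    static int safeCut(String s, int cut) {
        if (cut > 0 && cut < s.length()
                && Character.isHighSurrogate(s.charAt(cut - 1))
                && Character.isLowSurrogate(s.charAt(cut))) {
            return cut - 1; // back up so the whole pair goes to the second half
        }
        return cut;
    }

    public static void main(String[] args) {
        String s = "ab\uD83C\uDF35cd"; // "ab", cactus emoji, "cd"
        int cut = safeCut(s, 3);       // 3 falls between the two surrogates
        System.out.println(cut);                         // 2
        System.out.println(s.substring(0, cut));         // ab
        System.out.println(s.substring(cut).length());   // 4
    }
}
```

Character.isHighSurrogate and Character.isLowSurrogate are the JDK equivalents of the range checks against 0xD800..0xDBFF and 0xDC00..0xDFFF.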


You can easily determine whether an emoji will occupy more than one char in a String by looking at its Unicode code point. As an example, hot beverage is U+2615 (hex). That fits in two bytes, so no issue there. But cactus is U+1F335, which is above U+FFFF and does not fit in two bytes, so it needs two chars in a String.
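
The same check can be made in code with Character.charCount, which returns 1 for code points in the BMP and 2 for supplementary ones. A small sketch (the class and method names are only illustrative):

```java
public class CharsNeeded {
    // Number of Java chars needed to store the given code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(charsNeeded(0x2615));  // hot beverage: 1
        System.out.println(charsNeeded(0x1F335)); // cactus: 2
    }
}
```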


Emoji reference: http://apps.timwhitlock.info/emoji/tables/unicode


Netty Conflict with TwitterServer and Play WS

TL;DR: when an app runs from the IDE but not when invoked by other means, such as the command line, it could be a dependency conflict where classpath ordering matters. AKA, seemingly inexplicable errors are often explicable, but only after some headachy troubleshooting.


I might be the first person to run TwitterServer with Akka and Play WS. There's really no reason for this combination of frameworks. I first started working with Play WS (built on AsyncHttpClient) because it's super simple (much less convoluted than akka-http). See, super easy:
val updateIp = client
 .url(s"https://app.herokuapp.com/ip-address/${ip}")
 .withMethod("POST")
 .execute()

updateIp.onComplete {
 case Success(response) =>
   response.status match {
     case 200 =>
       log.info(s"Heroku ip address update returned ${response.body}")
     case _ =>
       log.warning(s"Heroku update ip address failed with status ${response.status}")
   }
 case Failure(e) =>
   log.warning(s"Heroku update ip address failed with exception ${e}")
}
Then I added Akka and later tossed TwitterServer into the mix for the sweet stats. Here's a screenshot of the admin server UI that comes with TwitterServer:




Admittedly this chart isn't too exciting, but the stats provided by TwitterServer (actually from Finagle) and the included utilities (profiling, etc.) are exceedingly useful.


After I added TwitterServer, the app ran just fine in IntelliJ but I could not run it from the command line. Every time, I saw:


java.net.BindException: Failed to bind to 0.0.0.0/0.0.0.0:9999: Unable to create Channel from class class io.netty.channel.socket.nio.NioServerSocketChanne\

When I see this, my first reaction is that I left an instance running:

netstat -lnp | grep 9999
but nope, so this made absolutely no sense.

Then, if I removed TwitterServer from the code but kept the dependency, I got past the bind exception only to hit a different error:


WARN  c.r.app.AppActor - Failed to get ip from heroku java.net.ConnectException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel


In this case there were conflicting dependencies on Netty. Specifically, the Netty versions required by TwitterServer and Play WS were not in agreement. When I was running from IntelliJ, the Netty version that loaded first happened to work for both Play WS and TwitterServer. But the start script generated by JavaServerAppPackaging (from sbt-native-packager) put the problematic, conflicting Netty classes first on the classpath, which resulted in the weird and misleading Netty errors. I was able to solve the problem by simply shuffling the order of the dependencies in build.sbt.
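
One way to confirm this kind of conflict from inside the running app is to ask the class loader for every location that provides a given class. The following is a rough sketch, not something from the original troubleshooting. The class name ClasspathCopies is made up, and the default class to look up is taken from the error message above:

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClasspathCopies {
    // Lists every classpath location that provides the named class,
    // in the order the class loader reports them.
    static List<URL> locate(String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = ClasspathCopies.class.getClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return new ArrayList<>(Collections.list(loader.getResources(resource)));
    }

    public static void main(String[] args) throws IOException {
        String name = args.length > 0
            ? args[0]
            : "io.netty.channel.socket.nio.NioSocketChannel";
        List<URL> copies = locate(name);
        System.out.println(copies.size() + " location(s) for " + name);
        // More than one entry means several jars ship the same class, and
        // classpath order decides which copy is actually loaded.
        copies.forEach(System.out::println);
    }
}
```

Running this both from the IDE and from the packaged start script, then comparing the output, should show whether the two environments resolve the class from different jars.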


Works:


libraryDependencies += "com.twitter" %% "twitter-server" % "1.25.0"
libraryDependencies += "com.twitter" % "finagle-stats_2.11" % "6.40.0"
libraryDependencies += "com.typesafe.play" %% "play-ws" % "2.5.4"
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.16"
libraryDependencies += "com.typesafe.akka" %% "akka-http-core" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-testkit" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-spray-json" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-jackson" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-xml" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-slf4j" % "2.4.16"
libraryDependencies += "ch.qos.logback"    %  "logback-classic" % "1.1.3"
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.1.0"


Fails:


libraryDependencies += "com.typesafe.play" %% "play-ws" % "2.5.4"
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.16"
libraryDependencies += "com.typesafe.akka" %% "akka-http-core" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-testkit" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-spray-json" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-jackson" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-http-xml" % "10.0.0"
libraryDependencies += "com.typesafe.akka" %% "akka-slf4j" % "2.4.16"
libraryDependencies += "ch.qos.logback"    %  "logback-classic" % "1.1.3"
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.1.0"
libraryDependencies += "com.twitter" %% "twitter-server" % "1.25.0"
libraryDependencies += "com.twitter" % "finagle-stats_2.11" % "6.40.0"

Another solution would be to ditch Play WS and use finagle-http instead. Or, in the case where you want to use multiple versions of the same library, you can load them in their own ClassLoader or use something like OSGi, which excels at solving this sort of issue.
A tool that is useful for debugging these sorts of dependency errors is the dependencyTree task from the sbt-dependency-graph plugin. Example output:

| | | | +-io.netty:netty-codec-http:4.1.6.Final
| | | | | +-io.netty:netty-codec:4.1.6.Final
| | | | |   +-io.netty:netty-transport:4.1.6.Final
| | | | |     +-io.netty:netty-buffer:4.1.6.Final
| | | | |     | +-io.netty:netty-common:4.1.6.Final
| | | | |     |
| | | | |     +-io.netty:netty-resolver:4.1.6.Final
| | | | |       +-io.netty:netty-common:4.1.6.Final
| | | | |
| | | | +-io.netty:netty-handler-proxy:4.1.6.Final
| | | | | +-io.netty:netty-codec-http:4.1.6.Final
| | | | | | +-io.netty:netty-codec:4.1.6.Final
| | | | | |   +-io.netty:netty-transport:4.1.6.Final
| | | | | |     +-io.netty:netty-buffer:4.1.6.Final
| | | | | |     | +-io.netty:netty-common:4.1.6.Final
| | | | | |     |
| | | | | |     +-io.netty:netty-resolver:4.1.6.Final
| | | | | |       +-io.netty:netty-common:4.1.6.Final
| | | | | |

And finally, a diff of the classpath produced by JavaServerAppPackaging (working version on top) shows the working script picking up Netty 3.10 from TwitterServer first, while the non-working script picks up several Play dependencies first, e.g. $lib_dir/com.typesafe.play.play-netty-utils-2.5.4.jar.






Tuesday, January 24, 2017

Git Commit Maven Plugin

I'm not sure how I survived so long without this. Almost every developer has run into a situation where a bug is reported and the question is which version it was found in. For releases there's generally a tag that identifies the release and makes it easy to find the source code to begin the investigation. But often, especially with snapshot builds, it is less clear. This simple addition to the Maven pom.xml (below) creates a git.properties file in your build artifact. There are many properties you can include, and I've excluded some that were not particularly useful for my purposes. The most important, in my opinion, are:

git.dirty=true
git.commit.time=24.01.2017 @ 07\:08\:22 MST
git.branch=master
git.commit.id=79b91f49b53eab5fe96efd2a2d9438ea3e3a0d00

git.tags=

The git.dirty property indicates there are uncommitted changes in git. A build should never be dirty, since if it is, the commit hash does not reflect what was built. I believe you can fail the build if it is dirty. With web apps it's easy to expose this information via an endpoint. It is also good practice to print the contents of the file to a log on start-up. Big tip of the hat goes to Konrad Malawski (Akka core developer) for developing this plugin.
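
As a sketch of the start-up logging idea: the configuration below writes git.properties to ${project.build.outputDirectory}, so it should end up at the root of the classpath. The class and method names here are hypothetical, and System.out stands in for whatever logger the app uses:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class GitInfo {
    // Loads git.properties from the classpath root. Returns empty
    // Properties if the file is not there (e.g. when run from an IDE).
    static Properties load() throws IOException {
        Properties props = new Properties();
        try (InputStream in = GitInfo.class.getResourceAsStream("/git.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }

    // One-line summary suitable for a start-up log message.
    static String summary(Properties p) {
        return "commit=" + p.getProperty("git.commit.id", "unknown")
            + " branch=" + p.getProperty("git.branch", "unknown")
            + " dirty=" + p.getProperty("git.dirty", "unknown");
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Build info: " + summary(load()));
    }
}
```

The same Properties object could back a small status endpoint in a web app.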


<plugin>
   <groupId>pl.project13.maven</groupId>
   <artifactId>git-commit-id-plugin</artifactId>
   <version>2.2.1</version>
   <executions>
      <execution>
         <id>get-the-git-infos</id>
         <goals>
            <goal>revision</goal>
         </goals>
      </execution>
      <execution>
         <id>validate-the-git-infos</id>
         <goals>
            <goal>revision</goal>
         </goals>
         <phase>package</phase>
      </execution>
   </executions>

   <configuration>
      <generateGitPropertiesFile>true</generateGitPropertiesFile>
      <generateGitPropertiesFilename>${project.build.outputDirectory}/git.properties</generateGitPropertiesFilename>
      <excludeProperties>
         <excludeProperty>git.build.*</excludeProperty>
         <excludeProperty>git.commit.user.email</excludeProperty>
         <excludeProperty>git.commit.message.*</excludeProperty>
         <excludeProperty>git.commit.id.describe</excludeProperty>
         <excludeProperty>git.commit.user.name</excludeProperty>
         <excludeProperty>git.remote.origin.url</excludeProperty>
         <excludeProperty>git.commit.id.abbrev</excludeProperty>
         <excludeProperty>git.closest.tag.name</excludeProperty>
         <excludeProperty>git.closest.tag.commit.count</excludeProperty>
      </excludeProperties>
   </configuration>
</plugin>