Character Encoding: The Lesson We All Learn The Hard Way

Mark Mzyk | November 14, 2008

Character encodings are a necessary evil for programmers.  They're something I wish I could forget, but to be competent I need to know about them.

I’ve read Tim Bray’s writings on the subject.  I’ve read Joel’s writings on the subject.

And it still bit me in perhaps the simplest form possible, when I wasn’t paying attention.  I didn’t even have to switch between encodings.

In the application I was working on, product prices can be converted between various currencies, such as United States dollars, euros, and pounds.

This means a price can be output in several ways, such as:

$5.50

₡7.75

₤8.35

etc.

Note that the prices above are just numbers I pulled from the air.  No attempt has been made to reflect exchange rates.  The important point here is the currency symbol preceding the price.

In the code, the string containing the currency symbol and price was passed into a method, which then needed to compare prices.  To accomplish this, the PHP substr() function was called to strip the currency symbol from the string and return just the price, like so:

$price = substr($string, 1);

(A few notes before continuing: in PHP, the $ sign denotes a variable, so it is not a currency symbol in this case.  The variable $string holds the currency symbol plus the price, e.g. $5.50.)

This line returns the string minus the first character and assigns it to the variable $price.  $price is then used for comparison purposes, now that the currency symbol has been removed.

See any problem?

I didn’t see it at first.

The code works just fine for $.  It fails for ₡ and ₤.

Why?

$ is included in the original ASCII definition, so its representation fits neatly in a single byte.  The representations for ₡ and ₤ do not fit in a single byte; in UTF-8, for example, each takes three.  PHP's substr() counts bytes, not characters, so when it was used to remove a single character it completely removed $, but for ₡ and ₤ it left behind part of the encoding, which led to unpredictable results in the code.
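To make the byte counting concrete, here is a minimal sketch of the failure.  It assumes the strings are UTF-8 and that PHP 7.0 or later is available for the "\u{...}" escapes; the variable names are mine for illustration, not the application's.

```php
<?php
// Assumption: the strings are UTF-8. The "\u{...}" escapes (PHP 7.0+)
// keep the example independent of this file's own encoding.
$dollar = '$5.50';
$colon  = "\u{20A1}7.75"; // the price string shown above as ₡7.75

// strlen() and substr() both count bytes, not characters.
echo strlen($dollar), "\n"; // 5: one byte for $, four for the digits
echo strlen($colon), "\n";  // 7: three bytes for the symbol, four for the digits

// Removing one byte strips $ cleanly...
echo substr($dollar, 1), "\n"; // 5.50

// ...but leaves two stray bytes of the symbol in front of the price.
$broken = substr($colon, 1);
echo bin2hex($broken), "\n"; // 82a1372e3735

// The leftover bytes are not numeric, so a numeric cast yields 0.
var_dump((float) $broken); // float(0)
```

The hex dump shows the tail of the symbol (82 a1) still sitting in front of the digits, which is why any later comparison on the result misbehaves.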

There is also another piece to this puzzle.  The default encoding for this application was set to ISO-8859-1, if I remember correctly, which contributed to the problem: the code assumed one byte per character, and the currency symbols broke that assumption.  Had Tim Bray’s advice been followed and the encoding defaulted to UTF-8 or UTF-16, with string handling that respects the encoding, this pain might have been avoided.
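One way the symbol could be stripped safely is sketched below.  This is not what the application did; stripCurrencySymbol() is a hypothetical helper, and it assumes the input is valid UTF-8 with a single leading currency symbol.

```php
<?php
// Hypothetical helper, not from the original application.
// Assumes valid UTF-8 input that starts with one currency symbol.
function stripCurrencySymbol(string $priceString): string
{
    // \p{Sc} matches any Unicode currency symbol. The /u modifier makes
    // PCRE read the subject as UTF-8, so one symbol is one character
    // no matter how many bytes it occupies.
    $result = preg_replace('/^\p{Sc}/u', '', $priceString);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

$prices = ['$5.50', "\u{20A1}7.75", "\u{20A4}8.35"];
foreach ($prices as $price) {
    echo stripCurrencySymbol($price), "\n";
}
```

If the mbstring extension is installed, mb_substr($string, 1, null, 'UTF-8') is a closer drop-in for the original call, since it counts characters in the named encoding instead of bytes.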

You live and learn, sometimes the hard way.

Filed in: Programming.