
Simplicity made complicated: character encoding

Posted on June 24, 2010

Back in the twentieth century, when I was young and tender at university, I had my first experience with the Internet, and the first thing that struck me as odd was that any email I wrote to my colleagues would arrive garbled and barely readable.  A quick conversation with the sysadmin revealed that “Our mail servers use 7-bit encoding, and we need 8-bit encoding for accents”.  The first thing I thought was “The Americans who wrote the software didn’t know that people were going to use accents?  Big thing not to know…”.  The second thing I thought was “The sysadmins of a university based in Barcelona didn’t realize users were going to write using accented Latin characters?  What were they thinking?”.  The fact is that as developers we tend to confuse standard with usual.  It is a mistake we all make far too often, and although we run into it almost day in, day out, we (at least I) still haven’t found a way to avoid it.

Unicode! scream a few. UTF-8! scream quite a few others.  Yeah, right, and all software written in a planned manner, with proper documentation and unit testing, is bug-free.  There is always a midget or a leprechaun lurking in the shadows, waiting to jump on the data and switch those bits around.  Maybe it’s the form data validation, maybe the request encoding; it could be that the database is using a different encoding or, more fun still, a different collation.  Maybe the way the data gets read back is the culprit.  And let’s not talk about data getting passed around over instant messengers or email and then copied and pasted in and out of documents.  Finally, when everything has been taken into account and you have made sure everything goes in and out in the appropriate manner… Bang!  Now you have a new employee who is Russian, or Chinese, or you have to store data written in the Arabic alphabet.  Dammit, when I was working on a concert at a big venue, an artist insisted his name be printed and announced in binary!
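To make the bit-switching concrete, here is a minimal sketch of what happens when one layer believes UTF-8 bytes are ISO-8859-1 and "helpfully" converts them.  It assumes the mbstring extension is available; the sample word and the variable names are mine, purely for illustration.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
// "café" written as explicit UTF-8 bytes (the é is 0xC3 0xA9).
$utf8 = "caf\xC3\xA9";

// A layer that wrongly treats those bytes as ISO-8859-1 and converts them
// to UTF-8 ends up encoding each byte of the é as a separate character.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // 636166c3a9
echo bin2hex($garbled), "\n"; // 636166c383c2a9, which displays as "cafÃ©"

// The damage can be undone only if you know exactly which wrong
// conversion was applied, and that it was applied just once.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
```

The repair at the end works here only because the mistake is known.  With real data you usually have to guess, which is why it pays to agree on one encoding for every layer up front.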

So, enough ranting for now.  Where does the solution lie?  Because there is a solution, isn’t there?  Well… yes… sort of.  There isn’t a magic recipe (that I know of), but some best practices help.  Working only in UTF-8 is advisable while we wait (still, and patiently) for PHP to fulfill its promise of native Unicode support.  Using binary-safe and multibyte functions also helps… but don’t put all your weight on those.  Experience tells us that when you have a rebellious string, it’s going to need a lot of massaging.  As with coding in teams, following conventions when manipulating data is very, very necessary, and probably your best ally, even if it seems you aren’t going to need them.  Sooner or later a character that your storage system doesn’t support is going to slap you in the face (yes, it will, don’t look at me like that), and being ready only guarantees that it’s going to be less of a surprise.  Just because something is standard doesn’t mean it’s going to be bulletproof.
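As a rough illustration of why the multibyte functions matter, the sketch below compares the byte-oriented string functions with their mbstring counterparts on a UTF-8 string.  It assumes the mbstring extension is enabled; the sample name and the variable names are hypothetical, and this is one possible convention rather than a recipe.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
// The sample name and variable names are illustrative.
mb_internal_encoding('UTF-8');

// "Núria" written with explicit UTF-8 bytes, so the example does not
// depend on how this file itself was saved.
$name = "N\xC3\xBAria";

// strlen() counts bytes, mb_strlen() counts characters.
$byteCount = strlen($name);             // 6
$charCount = mb_strlen($name, 'UTF-8'); // 5

// A byte-oriented cut can split the two-byte "ú" in half...
$brokenCut = substr($name, 0, 2);
// ...while the multibyte version cuts on character boundaries.
$safeCut = mb_substr($name, 0, 2, 'UTF-8'); // "Nú"

// Validate at the boundary: check that input really is UTF-8
// before it reaches storage.
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8'); // false
$safeIsValid = mb_check_encoding($safeCut, 'UTF-8');     // true
```

Passing the encoding explicitly on every call is more verbose than relying on `mb_internal_encoding()`, but it keeps the convention visible in the code, which helps when someone else on the team touches it later.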

So remember, be it an address, somebody’s name, a product model or whatever, something is bound to break your rules, and you’d better be ready to change them and adapt.  Communication is vital in today’s e-commerce world, and badly encoded text sends a very bad message emotionally.



Responses and Pingbacks

I feel your frustration 😉

Great article. I like your writing style.

This nudged me to write about my UTF-8 string class I wrote a few years back.

Your writing style made my eyes blur over.

Interesting opinion but nonetheless pointless as what you’ve just gone and described has been done so many times before with little technical background to highlight solutions…

…and just when is PHP going to have an off-the-shelf solution to this issue anyway? We have namespaces, thankfully, but we lack one more essential ingredient to really make PHP kick ass.
