In my daily work I often hear discussions about how to handle charsets properly: what a server must provide to handle charsets correctly, which Apache configuration is needed, which options must be set in php.ini to make PHP work correctly, which PHP functions should be avoided, which locales must be used, and so on. So I want to give a short overview of how to steer clear of common problems in a LAMP setup.
Kernel
Just make sure the option CONFIG_NLS_UTF8 is set to y.
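A quick way to check this on a running system is to grep the kernel configuration. The sketch below assumes the config is exposed either as /proc/config.gz (which needs CONFIG_IKCONFIG_PROC) or as /boot/config-$(uname -r); neither is guaranteed to exist on your distribution, and the function name nls_utf8_state is made up for this example.

```shell
#!/bin/sh
# Read a kernel config on stdin and print the state of CONFIG_NLS_UTF8:
# "y" (built in), "m" (module) or "unset".
nls_utf8_state() {
    nls_state=$(sed -n 's/^CONFIG_NLS_UTF8=\([ym]\)$/\1/p' | head -n 1)
    printf '%s\n' "${nls_state:-unset}"
}

# Try the two usual locations; both are assumptions about your system.
if [ -r /proc/config.gz ]; then
    gzip -dc /proc/config.gz | nls_utf8_state
elif [ -r "/boot/config-$(uname -r)" ]; then
    nls_utf8_state < "/boot/config-$(uname -r)"
else
    echo "no kernel config found" >&2
fi
```

If this prints m, the option is built as a module rather than compiled in.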
Environment
To make sure newly created filenames are in UTF-8 and VT input in general is handled correctly, you have to choose a locale that comes with a .UTF-8 suffix. For German, feel free to choose de_DE.UTF-8. Make sure your glibc provides this locale. To convert the names of existing files you can just use convmv. On a desktop you must also adjust the font and set a correct TTY font, but this can be ignored for a server that is administered only via remote shell.
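To see where a machine currently stands, a small check like the following may help. It is only a sketch: the helper names is_utf8_locale and effective_ctype are invented for this example, and the pattern recognises just the two common spellings of the suffix (.UTF-8 and .utf8).

```shell
#!/bin/sh
# Succeed if a locale name carries a UTF-8 codeset, e.g. de_DE.UTF-8 or de_DE.utf8.
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# The locale that decides character handling: LC_ALL beats LC_CTYPE beats LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "current locale $(effective_ctype) is UTF-8"
else
    echo "current locale $(effective_ctype) is not UTF-8"
fi

# List the UTF-8 locales glibc has generated (prints nothing if there are none).
locale -a 2>/dev/null | grep -iE 'utf-?8' || true
```

If de_DE.UTF-8 (listed by glibc as de_DE.utf8) is missing from the last command's output, the locale still has to be generated before you can select it.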
Webservers in general – focus on Apache
To make sure the user's input is UTF-8, the server has to deliver the correct Content-Type header. Take a look at the output of wget -S http://usrportage.de for my weblog, which is hosted on Schokokeks.org, a properly configured server (sure!):
wget -S usrportage.de
--21:00:33-- http://usrportage.de/
=> `index.html'
Resolving usrportage.de... 87.106.4.7
Connecting to usrportage.de|87.106.4.7|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Thu, 13 Jul 2006 19:00:27 GMT
Server: Apache
X-Powered-By: PHP/5.1.4-pl0-gentoo with Hardening-Patch
X-Blog: Serendipity
Set-Cookie: PHPSESSID=9da2ded6522851ef8ddc3ebe7590b354; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Serendipity-InterfaceLang: de
X-FreeTag-Count: Array
Connection: close
Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]
[ <=> ] 65,557 250.71K/s
21:00:33 (250.19 KB/s) - `index.html' saved [65557]
You see the header Content-Type: text/html; charset=UTF-8. (You can also see a bug in S9Y, which carelessly casts an array to a string in the X-FreeTag-Count header, but anyway.) So your browser is told that the page is UTF-8, and it will send form data back in the same encoding. That's the whole secret. To configure Apache properly, make sure the directive AddDefaultCharset is set to UTF-8.
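To verify that the charset actually reaches the client, you can pull it out of the response headers instead of reading them by eye. The function below is a minimal sketch; header_charset is a name I made up, and the pattern only recognises the header name spelled Content-Type or content-type.

```shell
#!/bin/sh
# Read HTTP response headers on stdin and print the charset announced in the
# Content-Type header. If several responses were seen (redirects), the last wins.
header_charset() {
    tr -d '\r"' |
        sed -n 's/^[[:space:]]*[Cc]ontent-[Tt]ype:.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([^;[:space:]]*\).*$/\1/p' |
        tail -n 1
}

# Example with canned headers, indented the way wget -S prints them:
printf '  HTTP/1.1 200 OK\r\n  Content-Type: text/html; charset=UTF-8\r\n' | header_charset

# Against a live server (wget writes the headers to stderr):
#   wget -S -O /dev/null http://usrportage.de 2>&1 | header_charset
```

An empty result means the server sent no charset at all, which is exactly the case AddDefaultCharset is meant to fix.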
One last thing: if you're using the AJAX functions from Prototype, you may have to re-encode the string delivered by the AJAX call. utf8_encode() assumes its input is ISO-8859-1, so apply it only if the value really arrives in that charset. In PHP the following would work:
$string = utf8_encode( $_POST['key'] );
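What utf8_encode() does is convert ISO-8859-1 to UTF-8, and the same conversion can be watched at byte level with iconv. This is only an illustration with iconv as a stand-in, not PHP itself; the helper to_hex is made up for the example. It also shows what happens when the input was already UTF-8.

```shell
#!/bin/sh
# Print stdin as a continuous lowercase hex string.
to_hex() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# "ä" as a single ISO-8859-1 byte (0xE4) becomes the UTF-8 pair c3 a4.
printf '\344' | iconv -f ISO-8859-1 -t UTF-8 | to_hex

# "ä" that is already UTF-8 (c3 a4) gets encoded a second time: c3 83 c2 a4,
# which a browser renders as "Ã¤".
printf '\303\244' | iconv -f ISO-8859-1 -t UTF-8 | to_hex
```

So if umlauts show up as two odd characters after the call, the data was most likely UTF-8 to begin with.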
MySQL
Before exchanging any data with the database, make sure your connection charset is set to UTF-8:
SET NAMES utf8;
By the way: have I ever mentioned that you should always use mysql_real_escape_string() instead of mysql_escape_string()? Unlike the latter, it takes the charset of the connection into account.
PHP
Just three rules: use the mbstring functions wherever possible, set the INI setting default_charset to UTF-8 and, in any case, don't use the functions from the ereg family, even though some of them come with an mb_ prefix. They aren't binary-safe; that's all you need to know.
Also make sure your source files are UTF-8 encoded. Use iconv to convert those that are not.
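A conversion run could look roughly like this. It is a sketch under assumptions: the function names is_utf8 and to_utf8 are mine, and the source charset is something you have to know about your own files (ISO-8859-1 is only the default guess here). Files that already decode as valid UTF-8 are left alone so that nothing gets encoded twice.

```shell
#!/bin/sh
# Succeed if FILE decodes cleanly as UTF-8 (plain ASCII counts as valid UTF-8).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# to_utf8 FILE [SOURCE_CHARSET]
# Convert FILE to UTF-8 in place unless it is already valid UTF-8.
# SOURCE_CHARSET defaults to ISO-8859-1, which is only a guess about your files.
to_utf8() {
    src_file=$1
    src_charset=${2:-ISO-8859-1}
    if is_utf8 "$src_file"; then
        return 0
    fi
    tmp_file=$(mktemp) || return 1
    if iconv -f "$src_charset" -t UTF-8 "$src_file" >"$tmp_file"; then
        # Write back through cat so the original file keeps its owner and mode.
        cat "$tmp_file" >"$src_file"
        rm -f "$tmp_file"
    else
        rm -f "$tmp_file"
        return 1
    fi
}

# Usage: convert every PHP file below the current directory.
#   find . -name '*.php' | while IFS= read -r f; do to_utf8 "$f" ISO-8859-15; done
```

Keep a backup or run this inside version control: a legacy file whose bytes happen to form valid UTF-8 would be skipped, and a wrong source charset produces wrong characters without any error.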
Update
I forgot to mention that the functions htmlentities(), html_entity_decode() and htmlspecialchars() do not respect PHP's default_charset directive but assume ISO-8859-1 as the default charset, which is pretty annoying and, from my point of view, should be considered a bug. So you need to pass UTF-8 as the third parameter to these functions to make sure they work properly with Unicode.