Make your website completely UTF-8 friendly
LAST UPDATED JUNE 15, 2009
Running an Internationalization / Localization [or i18n / L10n] friendly website can be tricky, and sometimes downright maddening for those who haven’t yet delved into the world of Unicode. Letting your users post to your site in whatever language and characters they choose is crucial for any modern website.
Here are a few things I have very painfully learned over the last 5 or so years on this topic … specifically with PHP and MySQL.
There are hundreds of character sets representing most of the languages on Earth, usually one per script or region [Latin, Cyrillic, Greek, Arabic, Korean, Chinese etc...]. One encoding that covers all of these is UTF-8, which can represent every character in Unicode. So how can you put ‘UTF-8’ to practical use? Easy … here’s how I’ve done it:
Headers! Get your headers!
The most important place to declare UTF-8 is the charset parameter of the Content-Type header in your outgoing HTTP response. This tells the browser that you have multi-byte characters in your HTML and you’d like it to display them as such [and not fall back to its default encoding].
To do this, put this at the very top of your PHP scripts [with the headers and before any HTML is echoed]:
<?php header("Content-Type: text/html; charset=utf-8"); ?>
And this in your HTML <head> section:
<?php echo "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n"; ?>
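If you’d rather keep both declarations in one place, here is a minimal sketch of how I might wrap them. The function names send_utf8_header() and utf8_meta_tag() are made up for this example [they are not part of PHP]; the only built-ins used are header() and headers_sent().

```php
<?php
// Hypothetical helpers: the names are illustrative only.

// Send the UTF-8 Content-Type header, but only if no output has gone out yet.
function send_utf8_header(): bool
{
    if (headers_sent()) {
        return false; // too late, the response body has already started
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

// Build the matching <meta> tag for the HTML <head> section.
function utf8_meta_tag(): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . "\n";
}
```

Call send_utf8_header() before anything is echoed, then echo utf8_meta_tag() inside your <head>. The headers_sent() check is there because header() only works before the first byte of output.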
MySQL / UTF-8 love
The second most important thing is to make sure your database is also UTF-8 friendly. Be sure to set the collation of all your tables and [char / text] columns to utf8_unicode_ci. That collation belongs to the utf8 character set, so it tells MySQL to store and compare this data as UTF-8.
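For tables that already exist, MySQL’s ALTER TABLE … CONVERT TO CHARACTER SET statement can switch a table and its text columns over in one go. Below is a rough sketch that only builds the SQL string; utf8_convert_sql() and the table name comments are hypothetical, and you should back up your data before running the generated statement on anything real.

```php
<?php
// Hypothetical helper: builds the SQL that converts one table to utf8 / utf8_unicode_ci.
function utf8_convert_sql(string $table): string
{
    // Backtick-quote the identifier and double any backtick inside it.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci";
}

// 'comments' is a placeholder table name.
echo utf8_convert_sql('comments'), ";\n";
```

This assumes the existing rows are correctly labelled with their current character set. If latin1 columns already hold raw UTF-8 bytes, a straight conversion can double-encode them.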
Once you’ve done that, you’ll need to tell PHP to talk to the MySQL daemon over a UTF-8 connection [otherwise the default is latin1 … and your data will be stored in MySQL as such, no good!]. Run this right after you connect to MySQL:
<?php mysql_query("SET NAMES 'utf8'"); /* sets the client, connection and result character sets in one statement */ ?>
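The mysql_* functions used above were removed in PHP 7, so on a current install the same idea would go through the mysqli extension instead. This is a sketch under that assumption, not what the original setup used: connect_utf8() is a hypothetical name and the credentials are whatever you pass in. It keeps utf8 to match the rest of this post; on MySQL 5.5.3 or later, utf8mb4 is the character set that also covers four-byte characters.

```php
<?php
// Hypothetical helper: open a mysqli connection and switch it to utf8.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }
    // set_charset() changes the connection character set and also tells the
    // client library about it, which real_escape_string() relies on.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not switch to utf8: ' . $link->error);
    }
    return $link;
}
```

I’d call it once at startup, for example $db = connect_utf8('localhost', 'user', 'secret', 'mysite'); where all four values are placeholders.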
Last, take advantage of PHP’s Multibyte String Functions! Oftentimes this is as easy as prefixing your string functions with mb_ [strlen becomes mb_strlen, for example]. But before you start using these functions you’ll need to tell PHP which character set to use [once again!], because the default depends on your PHP version and configuration and may not be UTF-8:
<?php mb_internal_encoding("UTF-8"); ?>
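To see why the mb_ prefix matters, here is a small sketch comparing the byte-oriented functions with their multibyte cousins. It assumes the mbstring extension is loaded, and the sample word is just something I picked for illustration.

```php
<?php
mb_internal_encoding('UTF-8');

// "héllo", with é written as its two UTF-8 bytes so this file's own encoding doesn't matter.
$word = "h\xC3\xA9llo";

$bytes = strlen($word);     // counts bytes: 6
$chars = mb_strlen($word);  // counts characters: 5

$cut    = substr($word, 0, 2);     // slices through the middle of é, leaving broken UTF-8
$mb_cut = mb_substr($word, 0, 2);  // keeps é whole: "hé"

$upper = mb_strtoupper($word);     // uppercases é as well: "HÉLLO"

printf("%d bytes, %d characters\n", $bytes, $chars);
```

The substr() result is the kind of thing that shows up as a stray question mark or box at the end of a truncated title.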
One often neglected step is ensuring that the data the server receives is UTF-8 encoded. One way to try and do this with HTML forms is to include the accept-charset attribute in your form tag. I say “try” because it’s just a suggestion to the client which submits the form. Be aware that some clients may not pay much attention to the attribute, especially older browsers. [Thanks to Alejandro for the heads up :-)]
<form action="/action" method="post" accept-charset="utf-8">
Also see here: www.w3schools.com/TAGS/att_form_accept_charset.asp.
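Since accept-charset is only a suggestion, it can help to check what actually arrives. Here is a minimal sketch using mb_check_encoding(); the helper name all_valid_utf8() is hypothetical, and rejecting the request outright is just one possible policy.

```php
<?php
// Hypothetical helper: true only if every submitted value is valid UTF-8.
function all_valid_utf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_array($value)) {
            // Fields like name="tags[]" arrive as nested arrays.
            if (!all_valid_utf8($value)) {
                return false;
            }
        } elseif (!mb_check_encoding((string) $value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Typical use at the top of a form handler:
// if (!all_valid_utf8($_POST)) { http_response_code(400); exit; }
```

This only tells you the bytes are well-formed UTF-8. It can’t tell you which encoding a bad submission was really in.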
If you’ve gotten this far you should see some dramatic improvements to your web site’s accessibility and usability, drawing in users from around the world.
NOTE: This is a work in progress and I fully welcome any new ideas to this cocktail of methods. If you have anything to add, PLEASE DO SO!