Make your website completely UTF-8 friendly

Posted on March 23, 2008. Filed under: mysql+, PHP, UTF-8


Running an Internationalization / Localization [or i18n / L10n] friendly website can be tricky, and sometimes downright maddening for those who haven’t yet delved into the world of Unicode. Allowing your users to post to your site in whichever language and characters they choose is crucial for any modern website.

Here are a few things I have very painfully learned over the last 5 or so years on this topic … specifically with PHP and MySQL.

There are hundreds of character sets representing most of the languages on Earth, usually one per script or region [Latin, Cyrillic, Greek, Arabic, Korean, Chinese etc…]. Unicode is the one character set that covers all of these, and UTF-8 is its most common encoding on the web. So how can you put ‘UTF-8’ to practical use? Easy … here’s how I’ve done it:


Headers! Get your headers!
The most important place to declare UTF-8 is the charset in your outgoing HTTP headers. This tells the browser that your HTML contains multi-byte characters and that you’d like it to display them as such [and not as the default ISO-8859-1].
To do this, put this at the very top of your PHP scripts [with the headers and before any HTML is echoed]:

    header("Content-Type: text/html; charset=utf-8");

And this in your HTML <head> section:

    echo "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n";
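Since header() only works before any output has been sent, it can help to wrap the call in a small guard. Here is a minimal sketch of that idea; the function name send_utf8_header() is my own invention and not part of PHP, and the return type declaration assumes PHP 7 or later:

```php
<?php
// Hypothetical helper: sends the charset header only if PHP can still do so.
function send_utf8_header(): bool
{
    if (headers_sent()) {
        // Output has already started, so the header can no longer be changed.
        return false;
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

// Call this before echoing any HTML.
send_utf8_header();
```

If it returns false, something [often stray whitespace before the opening PHP tag] has already been echoed.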


MySQL / UTF-8 love
The second most important thing is to make sure your database is also UTF-8 friendly. Be sure to set the collation of all your tables and [char / text] columns to utf8_unicode_ci. This tells MySQL to store and compare this data as UTF-8.
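For an existing table, one way to do this is MySQL's ALTER TABLE … CONVERT TO CHARACTER SET statement. The sketch below only builds that statement as a string; the helper name build_utf8_convert_sql() and the table name 'posts' are made up for illustration. Back up your data first, because the conversion rewrites the stored text. Also worth knowing: MySQL's utf8 stores at most three bytes per character, and newer MySQL versions offer utf8mb4 if you need the rest of Unicode.

```php
<?php
// Hypothetical helper: builds the statement that converts one table to UTF-8.
function build_utf8_convert_sql(string $table): string
{
    // Allow only simple identifiers so the name can be quoted safely.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
        throw new InvalidArgumentException("Unexpected table name: $table");
    }
    return "ALTER TABLE `$table` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci";
}

// 'posts' is a made-up table name.
echo build_utf8_convert_sql('posts'), "\n";
```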

Once you’ve done that, you’ll need to tell PHP to talk to the MySQL daemon over a UTF-8 connection [otherwise the default is latin1 … and your data will be stored in MySQL as such, which is no good!]. Run this right after you connect to MySQL:

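    // Sets the client, connection and result character sets in one go.
    mysql_query("SET NAMES 'utf8'");
    // Resets the connection charset to the database default, so safe only if the database itself is utf8.
    mysql_query("SET CHARACTER SET utf8");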
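The mysql_* functions were removed in PHP 7, so on a current PHP you would do the same thing through mysqli. A rough sketch, assuming the mysqli extension is loaded; connect_utf8() is a name I made up, and the host, credentials and database name are placeholders:

```php
<?php
// Sketch only: the host, credentials and database name are placeholders.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);
    if ($link->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }
    // set_charset() also tells the client library which encoding is in use,
    // which a plain SET NAMES query does not.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not switch to utf8: ' . $link->error);
    }
    return $link;
}

// Usage (placeholder values):
// $link = connect_utf8('localhost', 'user', 'secret', 'mydb');
```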


Multibyte fun
Last, take advantage of PHP’s Multibyte String Functions! Oftentimes this is as easy as prefixing your string functions with mb_ [strlen() becomes mb_strlen(), and so on]. But before you start using these functions you’ll need to tell PHP which character set to use [once again!] by calling mb_internal_encoding("UTF-8"), because older PHP versions default to ISO-8859-1.
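Here is a small sketch of what that looks like in practice, assuming the mbstring extension is installed. It sets the internal encoding and then compares a byte-oriented function with its mb_ counterpart on an accented word:

```php
<?php
// Tell the mbstring functions to treat strings as UTF-8 by default.
mb_internal_encoding('UTF-8');

// "héllo", with the é written as its two UTF-8 bytes so the example
// does not depend on how this file itself is saved.
$word = "h\xC3\xA9llo";

echo strlen($word), "\n";          // counts bytes
echo mb_strlen($word), "\n";       // counts characters
echo mb_strtoupper($word), "\n";   // uppercases the accented letter too
echo mb_substr($word, 0, 2), "\n"; // cuts on character boundaries, not bytes
```

The byte count and the character count differ by one here, which is exactly the kind of off-by-a-few bug that sneaks in when plain string functions meet multi-byte text.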



One often neglected step is ensuring that the data the server receives is UTF-8 encoded. One way to try to do this with HTML forms is to include the accept-charset attribute in your form tag. I say “try” because it’s just a suggestion to the client that submits the form. Be aware that some clients may not pay much attention to the attribute, especially older browsers. [Thanks to Alejandro for the heads up :-)]

    <form action="/action" method="post" accept-charset="utf-8">
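Because the attribute is only a suggestion, it may be worth checking on the server that what arrived really is valid UTF-8. A minimal sketch using mb_check_encoding(); the helper name all_fields_are_utf8() is hypothetical, and it only looks at top-level string values, not nested arrays:

```php
<?php
// Hypothetical helper: true only if every submitted string is valid UTF-8.
function all_fields_are_utf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Usage in a form handler might look like:
// if (!all_fields_are_utf8($_POST)) { http_response_code(400); exit; }
```

What you do with a failed check [reject the request, or try to convert the input] depends on your site; rejecting is just the simplest option to show.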


If you’ve gotten this far you should see some dramatic improvements to your web site’s accessibility and usability, drawing in users from around the world.

NOTE: This is a work in progress and I fully welcome any new ideas to this cocktail of methods. If you have anything to add, PLEASE DO SO!
