PHP’s Unicode story is, well, not great.
PHP’s strings don’t know anything about text encoding. They are, under the hood, just a sequence of individual bytes. Lore has it that the abandoned PHP 6 project included attempts to make PHP strings Unicode-aware (similar to how Python does it), but that this proved hard to do and was scrapped. Lore also has it that this was a major reason PHP 6 itself was scrapped, and that PHP 7 simply skipped trying to bring Unicode strings to PHP.
When I’m in a generous mood I can see one bright side to this state of affairs: there’s a certain simplicity to passing around an opaque array of bytes. If there’s no text encoding on the string, then there are no opportunities for the programmer to convert things in a wrong or unexpected way.
When I consider the ramifications, this spirit of generosity quickly dissolves.
Counting Characters
If you create a PHP source file in a modern text editor that looks like this
# File: test.php
<?php
function main() {
    $string = 'Hyvä';
    echo 'This string is ' . strlen($string) . ' characters long', "\n";
}
main();
and run your program, PHP will probably say the string is five characters long. That’s because strlen counts bytes, and in UTF-8 the ä is encoded as two bytes, each of which PHP treats as an individual character.
I say probably because it will depend on how you’ve saved your source file. Since 'Hyvä' is a string literal, PHP stores whatever bytes your source file contains for it.
Save the same program with a text encoding of ISO 8859-1 and PHP will “correctly” see it as a four character string. This is because in ISO 8859-1 every character is a single byte.
Save the same program as UTF-16 and PHP (at least version 7.4 on my Mac) won’t run the program at all, likely because the <?php opening tag doesn’t look right to the PHP engine when encoded as UTF-16.
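To take the source file’s encoding out of the picture, the same string can be written with explicit byte escapes. The following is a minimal sketch (the variable names are illustrative, not from the program above) that builds Hyvä once as UTF-8 bytes and once as ISO 8859-1 bytes, and shows that strlen is simply counting bytes:

```php
<?php
// "Hyvä" spelled out byte by byte, so the result does not depend on
// how this source file itself is saved.
$utf8   = "Hyv\xC3\xA4"; // ä as the two UTF-8 bytes C3 A4
$latin1 = "Hyv\xE4";     // ä as the single ISO 8859-1 byte E4

echo strlen($utf8), ' bytes: ', bin2hex($utf8), "\n";     // 5 bytes: 487976c3a4
echo strlen($latin1), ' bytes: ', bin2hex($latin1), "\n"; // 4 bytes: 487976e4
```

Both strings would render as Hyvä in a terminal or editor set to the matching encoding, which is exactly why the byte count can be surprising.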
Regular Expressions
OK, so let’s skip string literals and load our Hyvä string from a file instead:
# File: test.php
<?php
function main() {
    // the /tmp/source.txt file contains Hyvä encoded as UTF-8
    $string = trim(file_get_contents('/tmp/source.txt'));
    echo 'This string is ' . strlen($string) . ' characters long', "\n";
}
main();
PHP still thinks the string is 5 characters long, but at least we’re immune from problems due to the source file’s encoding now.
The effects of having no text encoding go far beyond string length. Consider a regular expression:
# File: test.php
<?php
function main() {
    // the /tmp/source.txt file contains Hyvä encoded as UTF-8
    $string = trim(file_get_contents('/tmp/source.txt'));
    var_dump(
        preg_match('/Hyv[a-z]/', $string)
    );
}
main();
A reasonable person might expect the string to match the regular expression /Hyv[a-z]/, but it won’t. Again, PHP sees the fourth “character” of the string Hyvä as the first byte of its two-byte UTF-8 encoding, and that byte is not in the range a-z. All your carefully crafted PHP regular expressions can fall apart when they encounter UTF-8 text containing characters encoded as more than one byte.
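One way to see this for yourself is to capture whatever follows Hyv and dump it as hex. This is only a sketch, and it assumes the UTF-8 bytes are written inline with escapes rather than read from /tmp/source.txt:

```php
<?php
$string = "Hyv\xC3\xA4"; // "Hyvä" as UTF-8 bytes

// Without the u modifier, PCRE treats the subject as a sequence of bytes.
$matched = preg_match('/Hyv[a-z]/', $string);
var_dump($matched); // int(0)

// "." matches exactly one byte here, so it captures only half of the ä.
preg_match('/Hyv(.)/', $string, $captures);
echo bin2hex($captures[1]), "\n"; // c3
```

The captured byte c3 on its own is not valid UTF-8, so anything that later treats the capture as text is working with a broken character.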
PHP Multibyte String Handling
There is a solution for anyone who wants to program for users of text outside of the US hegemony, and that’s the multibyte string extension. Somewhat frustratingly, this is not a default extension, so depending on where you get your PHP from these functions may or may not be available.
This small program will produce results more in line with what we might expect, thanks to the mb_strlen function.
# File: test.php
<?php
function main() {
    $string = trim(file_get_contents('/tmp/source.txt'));
    echo 'This string is ' . mb_strlen($string) . ' characters long', "\n";
}
main();
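But even here, what the multibyte string functions count is not always what you’d expect. Emoji still give them a hard time: mb_strlen counts Unicode code points rather than what a reader perceives as characters, so it reports the bellhop bell (a bell symbol followed by an invisible variation selector) as a two-character string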
$string = '🛎️';
echo 'This string is ' . mb_strlen($string) . ' characters long',"\n";
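The count of two makes more sense once the emoji is split into its parts. The sketch below assumes the bell is the code point U+1F6CE followed by the variation selector U+FE0F (the usual emoji form), that the mbstring extension is loaded, and that the intl extension may or may not be present:

```php
<?php
// Bellhop bell (U+1F6CE) followed by variation selector-16 (U+FE0F),
// written with code point escapes (PHP 7.0+).
$bell = "\u{1F6CE}\u{FE0F}";

echo strlen($bell), "\n";             // 7 (bytes: 4 + 3 in UTF-8)
echo mb_strlen($bell, 'UTF-8'), "\n"; // 2 (code points)

// grapheme_strlen comes from the intl extension, which may not be installed.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($bell), "\n"; // 1 (user-perceived character)
}
```

So there are at least three defensible answers to “how long is this string?”, and which one you want depends on whether you are sizing a buffer, validating input, or laying out text.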
Also, the behavior of these mb_ functions is influenced by the encoding value set via the mb_internal_encoding function. This means it’s still up to you to know something about the strings you’re working with.
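One hedge against a surprising internal encoding is to pass the encoding explicitly as the second argument. A minimal sketch, assuming the mbstring extension is loaded:

```php
<?php
$string = "Hyv\xC3\xA4"; // "Hyvä" as UTF-8 bytes

// The same bytes give different lengths depending on the encoding we claim.
$asUtf8   = mb_strlen($string, 'UTF-8');
$asLatin1 = mb_strlen($string, 'ISO-8859-1');
echo $asUtf8, "\n";   // 4
echo $asLatin1, "\n"; // 5

// mb_check_encoding can reject bytes that are not valid in an encoding,
// but it cannot tell you which encoding the author intended.
var_dump(mb_check_encoding($string, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding("Hyv\xE4", 'UTF-8'));  // bool(false)
```

Note that the ISO 8859-1 call does not fail. Almost any byte sequence is valid in a single-byte encoding, so a wrong guess tends to produce wrong answers rather than errors.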
We’ll leave how all these functions work as an exercise for our more intrepid readers.
PHP Unicode Regular Expressions
The multibyte string functions include a series of regular expression functions, although their mb_ereg names imply they use the old ereg_ regular expression syntax that was removed from PHP for non-multibyte strings.
It’s also possible to use PCRE (preg_) regular expressions with Unicode strings via the u pattern modifier. This code will return 0, indicating the regular expression didn’t match
var_dump(
preg_match('%Hyv\w%','Hyvä')
);
However, if we add the u modifier to the regular expression
var_dump(
preg_match('%Hyv\w%u','Hyvä')
);
Suddenly \w is able to match the ä. Be careful here though: some things may not work like you expect. For example, an a-z character range
var_dump(
preg_match('%Hyv[a-z]%u','Hyvä')
);
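still won’t match the ä in Hyvä. This is expected behavior rather than a quirk of my computer: even in Unicode mode, a-z covers only the 26 code points from a through z, and ä (U+00E4) sits outside that range.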
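If the intent behind a-z was “any letter”, PCRE’s Unicode property classes are one way to say that. A sketch, again using an escape so the example does not depend on the source file’s encoding:

```php
<?php
$string = "Hyv\u{E4}"; // "Hyvä" as UTF-8, via a code point escape (PHP 7.0+)

$range     = preg_match('/Hyv[a-z]/u', $string);  // ä is not between a and z
$anyLetter = preg_match('/Hyv\p{L}/u', $string);  // \p{L}: any letter
$lowercase = preg_match('/Hyv\p{Ll}/u', $string); // \p{Ll}: any lowercase letter

var_dump($range);     // int(0)
var_dump($anyLetter); // int(1)
var_dump($lowercase); // int(1)
```

Whether \p{L} is the right replacement depends on the pattern’s purpose; for something like a username rule it may admit far more than the original author intended.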
Iconv and intl
Two other PHP features to be aware of are the iconv and intl extensions. Both of these libraries contain functions and classes that allow you to convert a string that contains bytes encoded in one text format into a string that contains bytes encoded in a different text format. This is useful functionality, but it’s still up to you to correctly identify the current encoding of any string you want to convert.
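The “still up to you” part is worth seeing in practice. This sketch assumes the iconv extension is loaded; it converts a correctly labelled string, then feeds the result back in with the wrong label:

```php
<?php
$latin1 = "Hyv\xE4"; // "Hyvä" as ISO 8859-1 bytes

// Correctly labelled input converts cleanly.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // 487976c3a4

// Mislabelled input: these bytes are already UTF-8, but we claim ISO 8859-1.
// No error is raised; each byte is converted on its own, giving "HyvÃ¤".
$mojibake = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($mojibake), "\n"; // 487976c383c2a4
```

The second call succeeds silently, which is how double-encoded text ends up stored in databases.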
Takeaways
I’ve been vaguely aware of all this for a long time, but seeing it laid out so plainly is sobering. Most of the popular PHP frameworks in use by professionals the world over don’t use the mb_ string functions, and don’t use u-modified preg_ regular expressions. This means there’s a sea of subtle bugs just waiting to be stumbled upon or exploited with a bit of Unicode.