- Inspecting Bytes with Node.js Buffer Objects
- Unicode vs. UTF-8
- When Good Unicode Encoding Goes Bad
- PHP and Unicode
So why am I posting so much about Unicode and the word Hyvä? That’s a long story.
Whenever I work on a longer piece of writing, part of my copy editing process is to read the contents out loud to myself. I also have my computer read the contents back to me.
I’ve been doing the latter for a long time via the macOS command line program say. The say program will read back any text passed to it, either directly or via a file. It can also save that audio as an aiff file.
% say "hello world"
% say -f /path/to/input.txt -o /tmp/out.aiff
One tricky part of this is that I write in markdown, and if I pass a markdown file to say directly, the computer ends up speaking URLs, code samples, etc. out loud. This is not ideal.
So, lo those many years ago when I first had this idea, I wrote a small bit of code to strip this information from my markdown file before passing it on to say. This code is written in pidgin PHP and is not good, but it mostly does the job.
// the Markdown parser from the michelf/php-markdown composer package
use Michelf\Markdown;
$contents = file_get_contents('/path/to/markdown/file.md');
// convert the markdown to HTML
$html = Markdown::defaultTransform($contents);
// with a limited, well structured set of HTML out of the markdown library,
// do some string replacement to pull out the parts we don't want spoken
// out loud
$html = preg_replace(
'%<pre><code>.+?</code></pre>%six',
'<p>[CODE SNIPPED].</p>',
$html
);
// add a <br> after each closing paragraph tag
$html = str_replace('</p>', '</p><br>', $html);
// save the HTML file
$tmp = tempnam('/tmp', 'md_to_say') . '.html';
file_put_contents($tmp, $html);
// use the macOS textutil program to convert the HTML file into a text
// file, effectively removing any link or image URLs we don't want spoken
// out loud
$cmd = 'textutil -convert txt ' . $tmp;
`$cmd`; // the backticks shell out and run the command
// generate the invocation of the say command
$tmp_txt = swapExtension($tmp, 'html','txt');
$tmp_aiff = swapExtension($tmp, 'html','aiff');
$cmd = "say -f $tmp_txt -o $tmp_aiff";
Sins of my youth include parsing HTML with regular expressions and shelling out to another program (textutil) to finish my cleanup. It’s ugly code, but personal workflows are built on the ugly hacks of a programmer who has better things to do.
I said this mostly works. One problem it had is that some special characters wouldn’t survive all the round trips, and my computer would end up saying something silly. This was especially true when I started copy editing my two posts on the new Magento theme named Hyvä. Whenever I wrote Hyvä, my computer would say
Hyve tilde currency sign
Something was corrupting the Unicode encoding of the ä in Hyvä.
The Bytes
Here’s what I saw when I took a look at the bytes in the final text file I was feeding into the say command:
01001000 72
01111001 121
01110110 118
11000011 195
10000011 131
11000010 194
10100100 164
00100000 32
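For the record, here’s one way to produce a byte dump like this in PHP. This is a sketch with a made up file path, not necessarily how I generated the listing above.
// unpack the file into an array of unsigned bytes, then print each
// byte as zero-padded binary alongside its decimal value
$bytes = unpack('C*', file_get_contents('/tmp/out.txt'));
foreach ($bytes as $byte) {
    printf("%08b %d\n", $byte, $byte);
}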
The first three characters are the ASCII H, y, and v. Nothing fishy there. The last byte, 32, is an ASCII encoded space, so nothing too fishy there either. It may be best practice to end a file with a newline, but not every program does this.
What was fishy was that there were four additional bytes representing two additional Unicode characters. Keeping my UTF-8 encoding rules in mind, the first byte sequence looked like this
Bytes:              1 1 0 0 0 0 1 1   1 0 0 0 0 0 1 1
Codepoint Bits:     _ _ _ 0 0 0 1 1   _ _ 0 0 0 0 1 1
Binary Codepoint:   11000011
Decimal Codepoint:  195
Hex Codepoint:      0x00C3
In other words, the Unicode codepoint U+00C3, LATIN CAPITAL LETTER A WITH TILDE: Ã.
The second byte sequence looked like this
Bytes:              1 1 0 0 0 0 1 0   1 0 1 0 0 1 0 0
Codepoint Bits:     _ _ _ 0 0 0 1 0   _ _ 1 0 0 1 0 0
Binary Codepoint:   10100100
Decimal Codepoint:  164
Hex Codepoint:      0x00A4
In other words, the Unicode codepoint U+00A4, CURRENCY SIGN: ¤.
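If you’d rather let the computer do this bit twiddling, decoding a two byte UTF-8 sequence is only a line or two. Here’s a sketch in PHP: keep the low five bits of the first byte, the low six bits of the second, and glue them together.
// decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand
$first  = 0b11000011; // 195
$second = 0b10000011; // 131
$codepoint = (($first & 0b00011111) << 6) | ($second & 0b00111111);
printf("U+%04X\n", $codepoint); // prints U+00C3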
So my computer wasn’t saying
Hyve tilde currency sign
It was saying
Hyv “A tilde”, currency sign
But still, pronunciation aside, what had happened to my ä?
The Culprit
The culprit turned out to be the textutil command. When given an input like this
<!-- File: input.html -->
<b>Hyvä</b>
and invoked like this
% textutil -convert txt input.html
It produces an output file like this
HyvÃ¤
The textutil command wasn’t parsing input.html as a UTF-8 file. Instead, it was parsing it as an ISO/IEC 8859-1 encoded file. In most text encodings that predate Unicode, individual characters are encoded in a single byte. This means each of these character sets can only represent 256 different characters. The mapping of these byte values to characters is often called a codepage.
So, when we encode our Unicode ä as UTF-8, it looks like this
11000011 10100100
If a program is reading through our file and parsing it as UTF-8, it knows to treat these two bytes as a single character. However, if a program is parsing it as ISO/IEC 8859-1, it will see these as two separate characters.
If we refer to the codepage chart on Wikipedia, the binary number 11000011 (decimal 195, hex 0xC3) maps to the character Ã. Similarly, the binary number 10100100 (decimal 164, hex 0xA4) maps to the character ¤.
So textutil read these characters in as ISO/IEC 8859-1 encoded text. Then, when writing the text file back out, it encoded the characters as UTF-8.
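You can reproduce this sort of mangling in a few lines of PHP, assuming you have the mbstring extension available. The sketch below takes a UTF-8 string, pretends its bytes are ISO/IEC 8859-1, and re-encodes the result as UTF-8.
// the ä in "Hyvä" is the two UTF-8 bytes C3 A4 -- treat each byte as
// an ISO-8859-1 character and re-encode those characters as UTF-8
$mangled = mb_convert_encoding('Hyvä', 'UTF-8', 'ISO-8859-1');
echo $mangled; // HyvÃ¤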
My initial thought was a knee-jerk reaction of “what the heck, textutil, you clearly know about UTF-8, why u change my bytes?” Then I thought about it, sighed a computer sigh, and moved on to finding a solution.
No Such Thing as Plain Text Encoding
If you’re looking for a “who did the wrong thing” moral to this tale, there’s not a great one. So-called “plain text” files have a major design flaw/feature that makes this sort of slip-up inevitable. Namely, there’s nothing in a text file that tells a programmer what that text file’s encoding is.
The best any programmer can do is take an educated guess based on the sorts of bytes they see in the file, or force users to be explicit about their encoding. This is why a lot of text editors have the ability to open a file using any of several encodings.
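PHP’s mbstring extension offers exactly this sort of educated guess. A sketch (the file path is a made up example, and detection like this is a heuristic that can guess wrong):
// ask mbstring which of the candidate encodings the bytes could be;
// strict mode (the third argument) returns false instead of a wild guess
$contents = file_get_contents('/tmp/out.txt');
var_dump(mb_detect_encoding($contents, ['UTF-8', 'ISO-8859-1'], true));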
Interestingly, HTML files have an optional tag where you can specify an encoding. My best guess as to what happened is that when textutil is fed an HTML document without a charset set, it defaults to ISO/IEC 8859-1 (or its close cousin, Windows-1252).
This may seem like a poor choice now, but when textutil was written, ISO/IEC 8859-1 was considered a reasonable default assumption if no character encoding was set. That this is still the default assumption points more towards a conservative philosophy from Apple in updating these old command line utilities than any lapse in judgment.
As for me, I had a choice. Had the time finally come for me to clean up this old utility script, or would I slap on another layer of spackling paste and move on?
The quick hack won out. I made sure to generate a <meta charset="UTF-8" /> tag in my saved HTML
// wrap the generated markup in a full HTML document that declares
// its encoding, so textutil doesn't have to guess
$html = '<html><head><meta charset="UTF-8" /></head><body>' .
    $html . '</body></html>';
file_put_contents($tmp, $html);
and textutil started saving my files with the expected encoding. Another victory for “bad but it works and is just for me” code in the wild.