In my last quick tips post I mentioned examining the bytes of a text file that contained the text Hyvä, and getting back the following six bytes.
01001000 72
01111001 121
01110110 118
11000011 195
10100100 164
00001010 10
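If you want to reproduce that dump yourself, a few lines of Python will do it. This is a minimal sketch; hyva.txt is just a placeholder for whatever the file is named.

    # Read the raw bytes of the file without decoding them as text.
    # "hyva.txt" is a placeholder name for the UTF-8 file containing "Hyvä".
    with open("hyva.txt", "rb") as f:
        data = f.read()

    for byte in data:
        # Print each byte as eight binary digits followed by its decimal value.
        print(f"{byte:08b} {byte}")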
The first three bytes (72, 121, and 118) are ASCII-encoded H, y, and v.
The last byte, 10, is an ASCII-encoded newline character that ends the file (my text editor is configured to always add a newline character to the end of files).
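You can sanity check those single-byte values with Python’s chr, which maps a number straight back to its character.

    # Bytes in the ASCII range map directly to characters.
    for value in (72, 121, 118, 10):
        print(value, repr(chr(value)))

    # 72 'H'
    # 121 'y'
    # 118 'v'
    # 10 '\n'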
What’s a little more mysterious is the following pair of bytes
11000011 195
10100100 164
These represent the character ä. That’s because this file is UTF-8 encoded. In UTF-8 encoding, characters in the original US-centric ASCII encoding (i.e. in the range 0 – 127) are encoded as single bytes. Characters outside that original ASCII encoding range need to be encoded with multiple bytes.
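You can see where those two bytes come from by asking Python to UTF-8 encode the character directly.

    # Encoding ä as UTF-8 produces the same two bytes we saw in the file.
    encoded = "ä".encode("utf-8")
    print(list(encoded))                   # [195, 164]
    print([f"{b:08b}" for b in encoded])   # ['11000011', '10100100']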
How a program encodes those characters, and how many bytes it should use, gets tricky. To understand that we need to understand the difference between unicode and unicode encoding.
What is Unicode?
Unicode is an attempt to define every possible human character and assign it a number. This number is called a codepoint. So that ä character? Its unicode codepoint is U+00E4. The 00E4 portion of that is a hexadecimal number. In decimal that number is 228. In binary that number is
11100100
You’ll notice that it’s possible to store the number 228 as a single byte (11100100). However, we don’t see this byte in our file. That’s because different unicode encoding standards will use different algorithms to encode characters.
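A quick way to see the difference between the codepoint and the bytes on disk is to compare ord (which returns the codepoint) with the UTF-8 encoded bytes.

    # The codepoint for ä is one number; the UTF-8 bytes are something else.
    print(ord("ä"))                    # 228
    print(hex(ord("ä")))               # 0xe4
    print(f"{ord('ä'):08b}")           # 11100100
    print(list("ä".encode("utf-8")))   # [195, 164] -- not [228]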
Some examples of unicode encodings include
UTF-8
UTF-16
UTF-32
This is a weird but important distinction to make when discussing unicode. Unicode is the standard that defines the codepoint, but a unicode encoding defines the rules that determine how those codepoints are represented as bytes in your file.
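Here’s a small illustration of that distinction: one codepoint, three different byte sequences depending on the encoding (the -be variants just skip the byte-order mark Python would otherwise prepend).

    # The same codepoint, U+00E4, comes out as different bytes under each encoding.
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(encoding, list("ä".encode(encoding)))

    # utf-8 [195, 164]
    # utf-16-be [0, 228]
    # utf-32-be [0, 0, 0, 228]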
The unicode encoding that seems to be the de facto standard these days is UTF-8. In UTF-8 a character might be encoded with one byte, two bytes, three bytes, or four bytes.
UTF-8 Encoding
Per the Wikipedia page, codepoints in the following ranges are encoded with the following number of bytes (there’s a quick check after the list)
U+0000 - U+007F: one byte (our blessed ASCII text)
U+0080 - U+07FF: two bytes
U+0800 - U+FFFF: three bytes
U+10000 - U+10FFFF: four bytes (the range that makes all emoji possible)
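Here’s the quick check mentioned above, encoding one character from each range and counting the bytes (the sample characters are my own picks, not from the original post).

    # One sample character from each codepoint range.
    for char in ("H", "ä", "€", "💩"):
        encoded = char.encode("utf-8")
        print(f"U+{ord(char):04X} -> {len(encoded)} byte(s): {list(encoded)}")

    # U+0048 -> 1 byte(s): [72]
    # U+00E4 -> 2 byte(s): [195, 164]
    # U+20AC -> 3 byte(s): [226, 130, 172]
    # U+1F4A9 -> 4 byte(s): [240, 159, 146, 169]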
When compared with other unicode encodings, UTF-8 has two things going for it. First, unlike other encodings, it doesn’t force the same multi-byte encoding on characters that might not need all those bytes. Second, and likely more importantly, it’s a flexible encoding that was built to allow a different number of bytes depending on the needs of the character.
In UTF-32 every character is encoded with four bytes, and in UTF-16 every character takes at least two, regardless of whether the character needs that space or not. UTF-8 allows a character like ä to be encoded with only two bytes, but is also flexible enough that we can encode a character like U+1F4A9 using four bytes.
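To make that concrete, here are the encoded sizes of ä and U+1F4A9 under each encoding (again using the -be variants to leave out the byte-order mark).

    # UTF-32 always spends four bytes; UTF-8 spends only what the codepoint needs.
    for char in ("ä", "\U0001F4A9"):
        sizes = {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
        print(f"U+{ord(char):04X}", sizes)

    # U+00E4 {'utf-8': 2, 'utf-16-be': 2, 'utf-32-be': 4}
    # U+1F4A9 {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}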
When a programmer is writing a program to read through a UTF-8 file, if a byte starts with a 0
01001000
01111001
01110110
the programmer knows this is a single byte whose value represents a codepoint.
If, however, the byte begins with a 1, they know they’re at the start of a multi-byte sequence. A byte starting with 110 indicates this byte and the next make up a character. A byte starting with 1110 indicates this byte and the next two bytes make up a character. A byte starting with 11110 indicates this byte and the next three bytes make up a character.
In addition to these prefixes for identifying the number of bytes being used, the second, third, or fourth byte in the sequence will also be prefixed with 10 to indicate they’re part of a multi-byte sequence.
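Here’s a rough sketch of that logic, classifying each byte from our file by its leading bits.

    # Classify each byte from our dump by inspecting its leading bits.
    for byte in (0b01001000, 0b01111001, 0b01110110, 0b11000011, 0b10100100, 0b00001010):
        if byte >> 7 == 0b0:
            kind = "single-byte character"
        elif byte >> 5 == 0b110:
            kind = "start of a two-byte sequence"
        elif byte >> 4 == 0b1110:
            kind = "start of a three-byte sequence"
        elif byte >> 3 == 0b11110:
            kind = "start of a four-byte sequence"
        elif byte >> 6 == 0b10:
            kind = "continuation byte"
        else:
            kind = "not valid in UTF-8"
        print(f"{byte:08b} {kind}")

    # 01001000 single-byte character
    # 01111001 single-byte character
    # 01110110 single-byte character
    # 11000011 start of a two-byte sequence
    # 10100100 continuation byte
    # 00001010 single-byte character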
So, if we consider our rules so far
two bytes:    1 1 0 _ _ _ _ _   1 0 _ _ _ _ _ _
three bytes:  1 1 1 0 _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _
four bytes:   1 1 1 1 0 _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _
These prefix bits serve as flags. The rest of the bits in the bytes will represent the codepoint’s value. So for our ä
1 1 0 0 0 0 1 1   1 0 1 0 0 1 0 0
We see it begins with 110, which means it’s a two-byte character. This leaves the following bits representing the actual codepoint value
Full Bytes:         1 1 0 0 0 0 1 1   1 0 1 0 0 1 0 0
Codepoint Portion:  _ _ _ 0 0 0 1 1   _ _ 1 0 0 1 0 0
These bits, 00011 followed by 100100, will be combined into a single binary number (leading zeros dropped)
1 1 1 0 0 1 0 0
Which is 228 decimal, or 00E4 in hex. There’s our codepoint.
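Putting it all together, here’s a small sketch that decodes the two-byte sequence by hand: mask off the prefixes, then stitch the payload bits back together.

    # Decode the two-byte sequence 11000011 10100100 by hand.
    first, second = 0b11000011, 0b10100100

    payload_high = first & 0b00011111    # drop the 110 prefix, keeping 00011
    payload_low = second & 0b00111111    # drop the 10 prefix, keeping 100100

    # Shift the first payload left six places and OR in the second payload.
    codepoint = (payload_high << 6) | payload_low
    print(codepoint, hex(codepoint), chr(codepoint))   # 228 0xe4 ä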