- Inspecting Bytes with Node.js Buffer Objects
- Unicode vs. UTF-8
- When Good Unicode Encoding Goes Bad
- PHP and Unicode
I’ve started to really enjoy Node.js’s Buffer object for byte-level examination of files.
For example: if you create a text file with a bit of Unicode in it
# File: some-file.txt
Hyvä
and then write a small program that looks like this
// File: read-bytes.js
const fs = require('fs')

function main() {
  // readFileSync returns a Buffer object -- bytes.toString()
  // would transform the buffer into a string
  const bytes = fs.readFileSync('/path/to/some-file.txt')
  for (const byte of bytes) {
    console.log(
      // byte will be a Number -- here we format that
      // number as binary (toString(2)) and then pad
      // out our zeros
      byte.toString(2).padStart(8, '0'),
      ' ',
      byte
    )
  }
}
main()
You’ll get each byte printed out, formatted as both a binary and a base-10 number.
% node read-bytes.js
01001000 72
01111001 121
01110110 118
11000011 195
10100100 164
00001010 10
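You don’t even need a file on disk for this kind of poking around. Buffer.from encodes a string as UTF-8 by default, so a quick in-memory sanity check (a sketch, not part of the original program) reproduces the same bytes, minus the trailing newline:

```javascript
// Buffer.from encodes strings as UTF-8 by default, so this
// produces the same bytes the file read did (minus the newline)
const bytes = Buffer.from('Hyvä')

for (const byte of bytes) {
  console.log(byte.toString(2).padStart(8, '0'), ' ', byte)
}
// The 'ä' (U+00E4) is where it gets interesting: it comes out
// as two bytes, 0xC3 (195) and 0xA4 (164)
```

That two-byte 'ä' is the UTF-8 multi-byte encoding at work, which is exactly what the 195/164 pair in the output above was.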
This isn’t exactly new tech; the Unix command-line program hexdump can do similar things.
% hexdump -C some-file.txt
00000000 48 79 76 c3 a4 0a |Hyv...|
00000006
But I’ve never found its default formats (hexadecimal output, offsets in the first column, etc.) well suited to how my brain thinks about byte streams.
It’s also possible to do this sort of thing in other programming languages, but the mechanics are a bit weird. The C/C++ primitives (or at least the 90s-era primitives I used) are too fiddly, and even something modern like Go or Rust makes you jump through hoops that might make sense for production code but are a burden when all I want to do is write a small program to see a file’s actual bytes.