This website is hosted as simple static files, but it’s managed via a private WordPress instance that runs from my personal computer. I’m in the middle of retiring my circa 2014 laptop for a new M1 Mac mini, and migrating this WordPress system was about as rough as I expected. What follows are some random notes in case I ever need to do this again. Also, if you’re seeing this, that means everything worked like it should.
Part of why I keep the WordPress install running on my local machine is so I don’t need to upgrade it (and more importantly, test/fix my customizations) regularly. Not how I’d approach client work or a system that was live on the internet, but for a hobby that relates to my profession it’s the right lazy trade-off.
So all that means I mostly didn’t need to worry about getting from WordPress 4.7.19 to WordPress 5.6. I say mostly because there were a few problems getting this version of WordPress to run under PHP 7.4 that I solved by replacing one or two core files (don’t worry — it was via an include path, not via a core hack — I’m not that big a monster). Laziness aside, I should probably try to update this system, but whenever I contemplate it I think about changing blog systems entirely, and then think about all the work involved in either, and inertia wins out.
My upgrade testing strategy was basically: do a full static publish with the new system, then diff the entire contents of the site against the old system’s output. If there are differences, figure out why and fix them.
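That diff step amounts to a couple of shell commands. A minimal sketch, where `old_site` and `new_site` are hypothetical directory names for the two static publishes:

```shell
# Two hypothetical publish directories: old system's output and new system's.
mkdir -p old_site new_site
printf '<p>hello</p>\n' > old_site/index.html
printf '<p>hello</p>\n' > new_site/index.html        # unchanged page
printf '<p>posted 2014</p>\n' > old_site/about.html
printf '<p>posted 2021</p>\n' > new_site/about.html  # a real difference

# -r: recurse into subdirectories; -q: report only WHICH files differ.
diff -r -q old_site new_site || true
# prints: Files old_site/about.html and new_site/about.html differ
```

Once a file is flagged, a plain `diff old_site/about.html new_site/about.html` shows the actual changed lines.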
My biggest fear was getting the MySQL database moved over and upgraded at the same time. Staying on MySQL 5.6 wasn’t an option, because many of the PHP frameworks and apps I use no longer support that version — and they do a hard version check. MySQL 8 seemed like the safest bet compatibility- and availability-wise.
Doing this big a version jump meant I couldn’t rely on the replication features to move my data, because they don’t work more than one minor version apart.
That meant a good old fashioned `mysqldump` file, which meant a whole lot of character encoding angst. Without turning it into a whole blog series: MySQL has a bunch of different places where you can configure different character sets, and if these character sets mismatch, weird things can happen. Making things worse, the defaults (at least for MySQL 5.6) aren’t UTF8, which means ending up with databases that have the wrong encoding is easier than you think. On top of THAT — the default encoding (`latin1`) mostly plays nice with the basic characters an American will enter into their system, and even accepts many UTF8 characters gracefully.
A lot of the MySQL migration character set nightmare stories I read centered on tables and columns set up with `latin1` encoding but with UTF8 characters inserted into them. Everything looked OK because MySQL would transmit the raw bytes and web browsers would display them fine — but when exporting data, MySQL would look at the two-, three-, or four-byte UTF8 characters and treat each byte as an independent `latin1` character that needed to be converted.
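You can reproduce that byte-level mangling with `iconv`: take a UTF8 string, tell the tool its bytes are `latin1`, and convert them "back" to UTF8, which is effectively what a mismatched export does.

```shell
# 'é' is two bytes in UTF8: 0xC3 0xA9.
# Claiming those bytes are latin1 makes iconv treat each byte as an
# independent latin1 character (0xC3 is Ã, 0xA9 is ©) and re-encode
# both into UTF8 -- the classic mojibake from a mismatched dump/import.
printf 'é' | iconv -f ISO-8859-1 -t UTF-8
# prints: Ã©
```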
There’s a lot of hand-wavy “don’t do this” style advice out there about exporting and importing that’s presented as ritual without much reason. One example: one website indicated that using unix redirects could hork your encoding, but didn’t really explain why. Better, they say, to use `mysqldump`’s `-r` flag to export and then import via a `SOURCE` command from the console.
WordPress made this easier by ensuring that every database table was encoded with `utf8mb4` and every text column was collated with `utf8mb4_unicode_520_ci`. There was one table added by an extension that was `latin1`, but it was empty. I was able to preserve my encoding by dumping with either of the following:
mysqldump -u root -p db_name --default-character-set=utf8mb4 > file.sql
mysqldump -u root -p db_name --default-character-set=utf8mb4 -r file.sql
Both produced identical dump files (excepting the “Dump completed on” line).
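That comparison is easy to script: strip `mysqldump`’s trailing timestamp comment from each file and diff what’s left. A sketch using stand-in file names:

```shell
# Two stand-in dump files that differ only in mysqldump's trailing comment:
printf 'CREATE TABLE wp_posts (ID bigint);\n-- Dump completed on 2021-01-01\n' > dump_a.sql
printf 'CREATE TABLE wp_posts (ID bigint);\n-- Dump completed on 2021-01-02\n' > dump_b.sql

# Strip the timestamp comment from each, then compare what's left.
grep -v '^-- Dump completed' dump_a.sql > a.stripped
grep -v '^-- Dump completed' dump_b.sql > b.stripped
diff a.stripped b.stripped && echo 'dumps match'
# prints: dumps match
```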
Importing I did directly from the console via:
$ mysql --verbose ...
mysql> SET NAMES 'utf8mb4';
mysql> SOURCE path/to/file.sql;
Starting the console with the `--verbose` option ensured that MySQL would yell loudly about any and all warnings that came up. And a lot of warnings came up.
First — because I was dumping from MySQL 5.6 — the `CREATE TABLE` statements all included integer display widths, which are deprecated and might not do what you think.
Second — MySQL didn’t like these lines in my dump file, or my original `SET NAMES 'utf8'`:

/*!40101 SET character_set_client = utf8 */;

Apparently `utf8` is an alias that points at `utf8mb3` (a three-byte encoding for UTF8 characters) and not `utf8mb4`.
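The difference matters for any character that needs all four bytes, emoji being the obvious case. You can count the bytes yourself:

```shell
# 'é' fits in 2 UTF8 bytes, so the three-byte utf8mb3 can store it...
printf 'é' | wc -c
# prints: 2

# ...but an emoji takes 4 bytes, which the old "utf8" (utf8mb3) can't
# hold; only utf8mb4 can.
printf '😀' | wc -c
# prints: 4
```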
I ended up needing to hand edit the dump file to fix these. Hand editing introduces the possibility of munging characters, but BBEdit did me right.
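For the record, the charset fix can also be scripted, which keeps an editor away from the bytes entirely. A hedged sketch with `sed`, writing to a new file rather than editing in place (the file names are stand-ins):

```shell
# A stand-in dump line like the ones MySQL complained about:
printf '/*!40101 SET character_set_client = utf8 */;\n' > file.sql

# Point the three-byte 'utf8' alias at the real four-byte encoding,
# writing a new file instead of editing in place.
sed 's/SET character_set_client = utf8 /SET character_set_client = utf8mb4 /' \
    file.sql > file.fixed.sql

cat file.fixed.sql
# prints: /*!40101 SET character_set_client = utf8mb4 */;
```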
Another incredibly dumb thing I noticed was that the estimated reading times changed for a few articles. This came down to some theme code.
// probably ©EckoThemes -- used here under fair use
function ecko_estimated_reading_time() {
    $post = get_post();
    $words = str_word_count(strip_tags($post->post_content));
    $minutes = floor($words / 120);
    if ($minutes == 0) $minutes = 1;
    return $minutes;
}
Without getting too deep into it — PHP’s `str_word_count` function behaves differently when a different locale is set via `setlocale`, and the user shell in macOS will set the `LC_CTYPE` locale to `en_US.UTF-8` instead of the stock value of `C` that’s normally set.
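The same locale sensitivity shows up in other tools that classify characters. A rough shell analogue, since `grep`’s `[[:alpha:]]` class also consults `LC_CTYPE` (this is an illustration of the mechanism, not the author’s PHP code path):

```shell
# Under the C locale only ASCII letters count as alphabetic, so the
# accented é (two UTF8 bytes) terminates the "word":
printf 'café\n' | LC_ALL=C grep -o '[[:alpha:]]\{1,\}'
# prints: caf

# With a UTF-8 locale installed, é is alphabetic and the whole word
# matches (output: café):
printf 'café\n' | LC_ALL=en_US.UTF-8 grep -o '[[:alpha:]]\{1,\}'
```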
Reasons this was dumb
- I don’t even use the `Min Read:` labels — they’re commented out in my themes, but only with HTML comments, which meant I saw the problem and needed to understand it to make sure there wasn’t a bigger problem.
- I discovered that when you check the value of the `LC_ALL` category with `setlocale` and there are different values for each category, you get back a weird non-locale string that looks like `C/en_US.UTF-8/C/C/C/C` — this is the value for each category mashed together with a `/` character.
- The C function that this locale influences is the `isalpha` function used in `str_word_count` — but the way PHP handles unicode and the way `isalpha` is intended to be used don’t even match up, so the results you get from `str_word_count` with a locale of `en_US.UTF-8` are probably just dumb.
- Speaking of `isalpha`, there are a lot of hand-wavy historical comments indicating that some unixes had some sort of “ints higher than an unsigned char mean something” use for this function in the pre-unicode days, but details on how this all worked are scant.
- The locale was a problem for me because I publish via a command line script — but it was only a testing problem, because I publish via a cronjob and the cron shell runs with a `$LANG=C` instead of the `$LANG=en_US.UTF-8` that’s set in my interactive shell. In other words — this ultimately wasn’t a problem.
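That cron-versus-interactive difference is easy to see by launching a shell with a scrubbed environment, which is roughly what cron does:

```shell
# env -i starts the child with an empty environment, so LANG is unset
# and most programs fall back to the C locale -- similar to a cron job.
env -i /bin/sh -c 'echo "LANG=${LANG:-unset}"'
# prints: LANG=unset

# An interactive macOS shell, by contrast, typically exports
# LANG=en_US.UTF-8 (check yours with: echo "$LANG").
```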
All that said — the rubber bands seem to be holding and I’m chuckling at my younger self for believing that everyone would build and manage their own content management systems.