Recently, a friend (Janie Larson, A.K.A. the Red Queen Coder) asked me to take over hosting of her website.
Unfortunately, for boring technical reasons, she wasn’t able to back up the WordPress database, so I used a tool called SiteSucker Pro to download the site as-is as static HTML. We would come back to resurrecting it as a WordPress site “later”.
A couple of weeks ago, Janie asked if I could perhaps get it up and running for her in WordPress. A lifetime ago, I created a conversion utility to transform my blog from SimplePHPBlog to WordPress, and I remembered some of the fun and games I had with the C# code I wrote.
After reviewing the static HTML from Janie’s site that was formerly a WordPress site, I discovered some useful information. For starters, the title, slug, and contents were well-defined in <div> tags, so I was able to write a parser to pull all of that out.
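To give a flavour of that parsing step, here is a minimal sketch in Python rather than the C# I actually used. The class names (`entry-title`, `entry-slug`, `entry-content`) are hypothetical stand-ins, since the real markup depends on the theme, and this version keeps only the text of each field, not its inner HTML.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collects the text inside <div> elements that carry a known class."""

    # Hypothetical class names; the real theme's markup may differ.
    FIELD_CLASSES = {
        "entry-title": "title",
        "entry-slug": "slug",
        "entry-content": "content",
    }

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {name: [] for name in self.FIELD_CLASSES.values()}
        self._current = None  # field currently being captured, if any
        self._depth = 0       # <div> nesting depth inside the captured field

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1  # nested <div> inside a field we are capturing
            return
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in self.FIELD_CLASSES.items():
            if css_class in classes:
                self._current, self._depth = field, 1
                break

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None  # closed the <div> that opened the field

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current].append(data)


def parse_post(html):
    """Return a dict with the title, slug, and text content of one page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return {name: "".join(parts).strip() for name, parts in parser.fields.items()}
```

Calling `parse_post()` on the contents of each downloaded `index.html` would give one dictionary per page. A real conversion would also need to keep the inner HTML of the content `<div>`, which this sketch deliberately skips.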
The next trick was figuring out the tags and categories. Rather naively, as it turns out, I started with the categories folder and ran the parser over the HTML files in each category, pulling in 623 blog posts. There was some magic with foreach loops and tags and so on, and I was happy. The one final challenge, or so I thought, would be to get the featured images working.
I prepared a MySQL script to insert all the requisite rows into a fresh WordPress install, and while I was reviewing the script, I noticed a LOT of duplication. It turns out that Janie, like any normal person, might use multiple categories per post, so every post appeared once for each category it belonged to, and I needed to deduplicate the 623 posts. After a very short amount of time, this was down to 269 blog posts (nice).
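The deduplication itself boils down to keying the posts on something unique and merging the categories. A rough Python sketch, assuming each parsed row is a dictionary with `slug`, `title`, and `category` keys (those names are mine, for illustration only):

```python
def dedupe_posts(rows):
    """Collapse one-row-per-category into one post with a list of categories."""
    merged = {}
    for row in rows:
        # The slug is assumed to be unique per post.
        post = merged.setdefault(
            row["slug"],
            {"slug": row["slug"], "title": row["title"], "categories": []},
        )
        if row["category"] not in post["categories"]:
            post["categories"].append(row["category"])
    return list(merged.values())
```

Because dictionaries keep insertion order, the posts come back in the order they were first seen, each with every category it was filed under.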
The featured image thing I tried was a failure, and I still don’t exactly know how WordPress does it. What I did discover, though, is that the images you see in the Media section all need entries in the posts table, as well as the postmeta table, to be visible again. While the dates for these files weren’t accurate because I based them off the folder name, at least I didn’t have to upload them again.
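For the curious, here is roughly what those media rows look like, sketched in Python rather than my original C#. It assumes the default `wp_` table prefix, uploads stored under `wp-content/uploads/<year>/<month>/`, and an author ID of 1; the function names and the example URL are hypothetical. The quoting helper is only good enough for a one-off script you read before running.

```python
import mimetypes
import posixpath


def sql_quote(value):
    """Very small escaping helper for a hand-reviewed, one-off script."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def attachment_sql(relative_path, base_url, author_id=1):
    """Build INSERTs for one file, e.g. relative_path='2019/05/photo.jpg'."""
    year, month, filename = relative_path.split("/")
    # The real upload date is lost, so fall back to the first of the folder's month.
    date = f"{year}-{month}-01 00:00:00"
    title = posixpath.splitext(filename)[0]
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    guid = f"{base_url}/wp-content/uploads/{relative_path}"

    post_sql = (
        "INSERT INTO wp_posts (post_author, post_date, post_date_gmt, "
        "post_content, post_title, post_excerpt, post_status, post_name, "
        "to_ping, pinged, post_modified, post_modified_gmt, "
        "post_content_filtered, guid, post_type, post_mime_type) VALUES ("
        f"{int(author_id)}, {sql_quote(date)}, {sql_quote(date)}, '', "
        f"{sql_quote(title)}, '', 'inherit', {sql_quote(title)}, '', '', "
        f"{sql_quote(date)}, {sql_quote(date)}, '', {sql_quote(guid)}, "
        f"'attachment', {sql_quote(mime)});"
    )
    # The postmeta row tells WordPress where the file lives under uploads/.
    meta_sql = (
        "INSERT INTO wp_postmeta (post_id, meta_key, meta_value) VALUES ("
        f"LAST_INSERT_ID(), '_wp_attached_file', {sql_quote(relative_path)});"
    )
    return post_sql + "\n" + meta_sql
```

Something like `attachment_sql("2019/05/photo.jpg", "https://example.com")` would produce the two statements for a single image. For what it’s worth, a featured image is stored as a `_thumbnail_id` row in postmeta on the post itself, with the attachment’s ID as its value, so in principle the same script could be extended to cover those too. I haven’t tried that here.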
Things I wasn’t able to do:
- Featured images. It was easier to do these manually.
- Internal links to other blog posts. I might end up doing a replace of any occurrences of “index.html” in the content, but that’s prone to fail. It might just be easier to do it manually. Even on 269 blog posts, it’ll take less time than writing a code solution.
Edited to add: I manually searched for any posts containing the string “../”, which implied an internal link, and edited each one to remove that prefix and the trailing “index.html”. Fortunately this only took about an hour. I could have written code to do it, but that would also have taken an hour.
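Had I gone the code route, it might have looked something like this Python sketch. It assumes every internal link is an `href` that starts with one or more “../” segments, and that the remainder of the path is valid from the site root, which I have not verified for every post. The names are made up for illustration.

```python
import re

# Matches href="../some-post/index.html" and captures "some-post/".
INTERNAL_LINK = re.compile(r'href="(?:\.\./)+([^"]*?)(?:index\.html)?"')


def fix_internal_links(html):
    """Rewrite SiteSucker-style relative links as root-relative URLs."""
    return INTERNAL_LINK.sub(lambda m: f'href="/{m.group(1)}"', html)
```

Links with a fragment after “index.html” and links that don’t start with “../” are left alone by this pattern, so a manual pass over the results would still be wise.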