Migrations and hotlink protection
Migrating to Ghost from TypePad - oh, the hotlinks!
We've just wrapped up a migration for author Claudia Hall Christian's blog, On a Limb, which was coming to Ghost from TypePad. We had a plain text export that looked pretty usable, and the original site was still online, so I planned to grab the images from it, load them into Ghost and update all the URLs.
(Migrating blog content without fixing the images is almost always the wrong choice, since when the original blog goes offline, the images will be gone, too.)
Prologue: Meet the intern.
It looked straightforward. So straightforward that I thought my newly drafted intern, hereafter referred to as Bucket Hat Ghost, could handle it.
I handed off the file and asked him to bash (er... Python) it into a JSON object, so that I could run my existing migration flow on it. That part was fine, at least once he figured out that comments and posts were intermingled. So don't blame the intern here!
Where we got hung up was my part, which was grabbing the images from the server and using the Ghost Admin API to create each post. Two things went wrong.
Problem #1: Hotlink protection is a pain.
I ran my usual import process, which includes parsing the HTML object for images, downloading those images and saving them locally, uploading them into Ghost and grabbing their URLs, then substituting those URLs into the original HTML object. (Cheerio makes this pretty easy.) I knew from console logging that I had image URLs that worked in my browser, but things kept going weird when I tried to fetch the images with my script.
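In case it helps, here's a stripped-down sketch of that happy-path flow. It's not my production script: the Ghost URL and API key are placeholders, it assumes Node 18+ for the built-in fetch, and the plain fetch download is exactly the step that fell apart in practice.

    // Simplified sketch of the intended flow (not the production script):
    // find images in the post HTML with Cheerio, download each one, upload it
    // to Ghost, and swap the Ghost URL back into the HTML.
    const cheerio = require('cheerio');
    const fs = require('fs');
    const path = require('path');
    const GhostAdminAPI = require('@tryghost/admin-api');

    const api = new GhostAdminAPI({
        url: 'https://example.ghost.io',      // placeholder
        key: process.env.GHOST_ADMIN_API_KEY, // placeholder
        version: 'v5.0'
    });

    async function rewriteImages(html) {
        const $ = cheerio.load(html);
        for (const el of $('img').toArray()) {
            const src = $(el).attr('src');
            if (!src) continue;

            // download the original image -- this is the step hotlink protection broke
            const res = await fetch(src); // built-in fetch, Node 18+
            const fileName = src.split('/').pop() || 'image';
            const filePath = path.resolve('./images', fileName);
            fs.writeFileSync(filePath, Buffer.from(await res.arrayBuffer()));

            // upload it to Ghost and grab the new URL
            const uploaded = await api.images.upload({ file: filePath });

            // substitute the Ghost URL into the original HTML
            $(el).attr('src', uploaded.url);
        }
        // Cheerio wraps fragments in a full document, so pull the body back out
        return $('body').html();
    }

    // then create the post from the rewritten HTML, e.g.:
    // await api.posts.add({ title, html: await rewriteImages(html) }, { source: 'html' });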
Eventually, the penny dropped, and I realized the images were protected by some sort of hotlink protection, and so my fetch requests were getting redirected to error pages that declared my script was a robot. OK, and fair enough, that was true, although I was only trying to reclaim my client's own assets from her own blog.
(If this happens to you, and you can turn off your own hotlink protection, you should do that. It'd avoid the rest of the mess required to do what was supposed to be an easy job. My client hadn't set up hotlink protection, and we didn't know who to even ask to turn it off.)
I tried saving cookies and following redirects and lying about my referrer and my browser and all the other easy stuff. No luck. I ended up running puppeteer in stealth mode, and not headless, either. Basically, for every single blog post (and Claudia had over a thousand!), my script opened up a browser window on my laptop, navigated to a page on Claudia's old blog, waited for it to render, then saved the images. Then it did all the image uploading and HTML rewriting and posting like I'd planned.
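For the record, the easy stuff that didn't work amounted to variations on a plain request with faked headers, roughly like the snippet below. The header values are illustrative, it assumes Node 18+ for the built-in fetch, and every variation got the robot page back instead of the image.

    // Roughly what the failed "easy" attempts looked like: a plain fetch with a
    // spoofed referrer and user agent. Header values are illustrative.
    const fs = require('fs');
    const path = require('path');

    async function tryPlainFetch(imageUrl) {
        const res = await fetch(imageUrl, {
            redirect: 'follow', // follow the hotlink-protection redirects
            headers: {
                'Referer': 'https://example.typepad.com/',  // pretend the blog itself is asking
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
            }
        });
        // what came back was the "you're a robot" error page, not the image
        const fileName = imageUrl.split('/').pop() || 'image';
        fs.writeFileSync(path.resolve('./images', fileName), Buffer.from(await res.arrayBuffer()));
    }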
Normally I set up migration jobs to run in the background. They make my fan spin a little louder, but I can do other work while an hour-long migration job runs, no problem.
Not this one. Every single time my script opened a window, it grabbed focus. And because I was opening and closing windows left, right, and center, it ran for about a minute per post, not the usual second or two. So yeah, my laptop was pretty much useless for about a day. I tried running it overnight, but as soon as I turned my back on it, it'd error out, or my computer would go to sleep (why?!), or something. It was not the best two days.
My script for grabbing images is below, should you wish to experience this level of migration pain for yourself. I don't really recommend it.
// This module exports the getImages function. It takes the URL of a page,
// opens it in a real (non-headless) browser via puppeteer-extra's stealth
// plugin, and saves every image the page loads into ./images.
module.exports = async function getImages(url) {
    const puppeteer = require('puppeteer-extra');
    const fs = require('fs');
    const path = require('path');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    puppeteer.use(StealthPlugin());

    const browser = await puppeteer.launch({
        headless: false,
        args: ['--incognito']
    });
    const page = await browser.newPage();

    page.on('response', async response => {
        const imageUrl = response.url();
        const status = response.status();
        // only handle real image responses, skipping data/blob URLs and redirects
        if (response.request().resourceType() === 'image'
            && !imageUrl.startsWith('data:')
            && !imageUrl.startsWith('blob:')
            && !(status > 299 && status < 400)
        ) {
            // guess the file extension from the content type
            const contentType = response.headers()['content-type'];
            if (contentType) {
                const extension = contentType.split('/').pop();
                response.buffer().then(file => {
                    let fileName = (imageUrl.split('/').pop() || 'temporaryname') + '.' + extension;
                    if (fileName.length > 200) { fileName = fileName.slice(0, 200); }
                    const filePath = path.resolve('./images', fileName);
                    const writeStream = fs.createWriteStream(filePath);
                    writeStream.write(file);
                    writeStream.end();
                });
            }
        }
    });

    await page.goto(url);
    await new Promise(resolve => setTimeout(resolve, 1000));
    await page.close();
    await browser.close();
    await new Promise(resolve => setTimeout(resolve, 1000));
};
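Calling it is nothing fancy: loop over the post URLs one at a time and wait for each browser window to open, render, and close. The URLs below are placeholders.

    // Driving the module one post at a time (URLs are placeholders). Each call
    // opens and closes a visible browser window, which is why my laptop was
    // useless while this ran.
    const getImages = require('./getImages');

    const postUrls = [
        'https://example.typepad.com/blog/2010/01/some-post.html',
        'https://example.typepad.com/blog/2010/02/another-post.html'
    ];

    (async () => {
        for (const url of postUrls) {
            await getImages(url);
        }
    })();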
Problem #2: Vanishing images after import
With the switch over to Lexical, there's apparently a bug in how HTML that contains images is parsed. It's pretty weird: the post renders on the front end without the image, but the image is there in the Ghost admin panel, and re-saving the post fixes it. It's behavior I ducked for this job by switching the Ghost install's 'beta editor' setting off. Unfortunately, that setting is gone in the latest Ghost, so I guess I'm going to do any new migrations into an older version of Ghost, then export and re-import? Not sure. I'm hoping this bug gets squashed before I have another migration client.
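One workaround I haven't actually tried: since a manual re-save fixes the rendering, you could probably re-save every post programmatically through the Admin API after import. A rough, untested sketch, with the URL and key as placeholders:

    // Untested sketch: re-save each post via the Admin API, since a manual
    // re-save fixes the rendering. URL and key are placeholders.
    const GhostAdminAPI = require('@tryghost/admin-api');

    const api = new GhostAdminAPI({
        url: 'https://example.ghost.io',      // placeholder
        key: process.env.GHOST_ADMIN_API_KEY, // placeholder
        version: 'v5.0'
    });

    (async () => {
        const posts = await api.posts.browse({ limit: 'all', formats: 'html' });
        for (const post of posts) {
            // edits need the current updated_at for Ghost's collision detection
            await api.posts.edit(
                { id: post.id, updated_at: post.updated_at, html: post.html },
                { source: 'html' }
            );
        }
    })();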
Here's the bug for anyone curious:
Epilogue: Well, we learned a lot.
I still maintain that I can migrate anything that's currently online. But the next time we hit something that requires Puppeteer in non-headless mode, we are definitely using Bucket Hat Ghost's computer instead of mine!