WordPress-to-static using wget mirror functionality and AWS S3

December 23, 2020 note-to-self full-stack tech

I like using WordPress. I’ve used it, AEM, Jekyll, a version of Laravel-as-CMS (currently Jigsaw), and a lot of custom-built code (in the early days). What I don’t like is how WordPress seems to be a prime target for hackers. Now, a lot of that is because:

  1. WordPress is so popular – if you are a hacker, you might as well get your time’s worth and go after the popular software, right?
  2. WordPress has a reputation of being easy to get up and running, so your grandma probably has installed WordPress. Problem is, Granny doesn’t know much about securing it.

Regardless, I don’t need any of the dynamic features of WordPress, but it’s pretty easy to build nice sites with it. So I’ve installed it on my laptop (using Valet), and I just export it to static files and upload them to Amazon Web Services S3.

Disclaimer: There are multiple ways to do this… it’s web development; you can do any number of things any number of ways. I had been using Simply Static, but it fell out of maintenance [note: around publish time, I noticed maintenance has restarted on this useful plugin!]. I considered asking to take it over and resuscitate the plugin, but this script only took me about fifteen minutes to get started (I already had the S3 bucket and was using Jekyll to publish to it).

The Installation, etc

There isn’t a lot to install, actually.

**WordPress, for sure.** This is what we are mirroring.

I host it locally and upload it using the scripts I’m including. If you already have WordPress up on a publicly available web server and intend to keep it that way, then I don’t think this will work that well. I’m doing it for:

  • Speed – just HTML, CSS, JS, and JPG. It’s pretty fast.
  • $avings – hosted by S3, costs me about $2 a month or something.
  • Security – since there isn’t any PHP or MySQL involved, as long as you don’t have a lame AWS console password, there really isn’t much for someone to hack. But use two-factor authentication on that AWS account, okay?

Wget, famous for its “get a URL” functionality, actually has gobs of features, including the ability to mirror an entire site based on the links it finds. I linked the GNU page, but usually you just “brew install”, “apt-get install”, etc. Your system probably already has it.
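
For a rough idea of what that mirroring looks like on its own, here’s a stripped-down version of the command (the full set of flags I actually use is in the script further down; the ./mirror output directory is just an example):

# spider the whole site, grab the CSS/JS/images each page needs,
# rewrite links to work locally, and add .html extensions
wget --mirror --page-requisites --convert-links --adjust-extension \
     --no-host-directories -P ./mirror http://dist1nc7ive.test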

AWS CLI. This is how we upload the site to AWS. This is always the bit that gets me because I only deal with it once in a blue moon, so I wrote up a little memory helper for it:

#: cat ~/.aws/credentials

Format is like this:

[s3-static-upload]
aws_access_key_id = [KEY STRING]
aws_secret_access_key = [SECRET STRING]
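
If you’d rather not edit that file by hand, the AWS CLI will write the profile for you; this is the same profile name the script below passes to aws s3 sync:

aws configure --profile s3-static-upload
# prompts for the access key ID, secret key, default region and output format,
# then stores them under that profile in ~/.aws/credentials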

PHP. Since you are using WordPress on your localhost, you should have PHP.

Rsync. I think most systems will have this, but you may have to go the “brew/apt-get install” route. Rsync gives us the ability to synchronize two directories. Usually one of those directories is remote, but it doesn’t have to be.
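
Purely as an illustration (the script below uses a slightly different set of flags), a local-to-local sync looks something like this:

# make ./main an exact mirror of ./test, deleting anything that no longer exists
rsync -av --delete ./test/ ./main/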

The How

Okay, do the magic to get your web site up using WordPress. Get rsync, wget, and the AWS CLI installed. There isn’t much I can say about those that you can’t find at those links, or by googling it. I use Laravel’s Valet to host my WordPress and I recommend it, but there are plenty of ways to accomplish this. Whatever method you use to get WordPress running reliably on your localhost is how you should do it.
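
For what it’s worth, with Valet it’s usually just a matter of linking the WordPress directory; the path here is made up, but it’s roughly what that looks like:

cd ~/Sites/dist1nc7ive
valet link dist1nc7ive    # now it answers at http://dist1nc7ive.test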

Here’s the script that glues it all together. I initially wrote this in bash, but the search and replace in sed started acting up and I couldn’t overcome it, so I spent ten minutes and rewrote the script in PHP. I started using PHP when it was PHP/FI 1.5 and it’s a good fallback for the days when you aren’t feeling the sed magic. PHP isn’t the go-to for most people on the command line, but it’s definitely capable.

The Script


<?php
#
# Mirrors my local install of WP into a static web site
#

$replace_url = ['local' => 'http://dist1nc7ive.test', 'remote' => 'https://dist1nc7ive.com'];

$pages = [
    // main site page
    "http://dist1nc7ive.test",
    // pages that aren't linked that I want to include
    "http://dist1nc7ive.test/feed/",
    "http://dist1nc7ive.test/error/",
    "http://dist1nc7ive.test/wp-content/themes/d17-theme/style.css.map"
];

$bucket_name = 'dist1nc7ive.com';

function escape_backslashes($data)
{
    return str_replace('/', '\/', $data);
}

##############

$static_dir = sys_get_temp_dir() . '/' . escapeshellcmd($bucket_name) . '.static';
$test_dir = "{$static_dir}/test";
$main_dir = "{$static_dir}/main";

// recursive mkdir, since the parent .static directory may not exist yet
if (!file_exists($test_dir)) {
    mkdir($test_dir, 0777, true);
}
if (!file_exists($main_dir)) {
    mkdir($main_dir, 0777, true);
}

echo "Mirroring to {$test_dir}" . PHP_EOL;

// These next couple of wget commands will spider and download your entire site.
// It only takes a few seconds, and shouldn't bother your computer too much,
// but if so, add a "sleep(3)" and then go make some coffee while it runs.

// Main site that will be the starting point, much like a user who might come along
$pages = implode(' ', $pages);
passthru("wget -P {$test_dir} --mirror --no-host-directories --page-requisites --continue --convert-links --timestamping --user-agent='' -e robots=off --wait 0 --adjust-extension {$pages}");

// Extra pages that aren't linked and thus wouldn't be included unless I did this
// passthru("wget -P {$static_dir} -mpckN --user-agent='' -e robots=off --wait 0 -E ");
// passthru("wget -P {$static_dir} -mpckN --user-agent='' -e robots=off --wait 0 -E ");

// just a little cleanup
unlink("{$test_dir}/xmlrpc.php?rsd");

// get rid of the big red flag directory names; this doesn't
// really change much other than making google dorking
// a little less convenient. And, I'm a nerd,
// I think foo and bar sound better.
rename("{$test_dir}/wp-content/", "{$test_dir}/foo/");
rename("{$test_dir}/wp-includes/", "{$test_dir}/bar/");

// wget does a lot of link cleanup, but our site is called 'domain.test' so we need to alter all of that...
echo "Cleaning up links..." . PHP_EOL;

$files = explode(PHP_EOL, `find {$test_dir} -name '*.html'`);
foreach ($files as $file) {
    $file = trim($file);
    if (!$file) {
        continue;
    }

    $content = file_get_contents($file);
    $content = str_replace("index.html", '', $content);
    $content = str_replace($replace_url['local'], $replace_url['remote'], $content);
    $content = str_replace(escape_backslashes($replace_url['local']), escape_backslashes($replace_url['remote']), $content);
    $content = str_replace(rawurlencode($replace_url['local']), rawurlencode($replace_url['remote']), $content);

    // since we renamed the directories, we need to rename the references
    // flaw: if you blog about a WP install this might search and replace those, too!
    $content = preg_replace("#wp-content/#i", "foo/", $content);
    $content = preg_replace("#wp-includes/#i", "bar/", $content);
    $content = preg_replace("#wp-content\\\/#i", "foo/", $content);
    $content = preg_replace("#wp-includes\\\/#i", "bar/", $content);

    file_put_contents($file, $content);
}

// pass through to rsync, which isn't doing anything remote, but does a good job of syncing things up
// aws will sync things based on the file stats, so if we don't need to update anything, less will be sent to S3
// --delete keeps things clean so you don't still have some page or post you think you removed still active.
echo "Syncing with the local 'main' site..." . PHP_EOL;
passthru("/usr/bin/rsync -icrvh {$test_dir}/ {$main_dir}/ --delete");

// now the aws s3 client will sync things. Only the files that have changed (added, updated
// or deleted) since the last update should be sent.
// Usually when I run this, only a few things are updated or removed. It's a tiny footprint.
echo "SYNCING with remote site..." . PHP_EOL;
passthru("/usr/local/bin/aws s3 sync {$main_dir} s3://{$bucket_name} --delete --profile s3-static-upload");

if (file_exists($test_dir)) { // temp dir
    passthru("rm -rf {$test_dir}");
}

echo "The End" . PHP_EOL;
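
I keep that in a plain PHP file and run it from the terminal whenever I’ve changed something; the filename is whatever you like, mine isn’t anything special:

php publish-static.php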

The Conclusion

I like this system because it fits my brain easily. If it fits your brain easily, I hope it works out for you. I’m on a Mac, but this should work fine on Linux, and maybe Windows if you have that Bash command-line thing that I’ve never used. If you want some help, or want to correct me, hit the links to my social channels and DM me.