New Backup Strategy – Amazon Glacier

I recently learned about Amazon Glacier – a storage service optimized for reliable, very long-term storage and infrequent data retrieval.

Storage is very cheap at $0.01/GB/month, but it is unlike conventional online storage services in a few ways –

  • Uploads are free
  • Downloads are free for 5% of your data per month (e.g. if you store 100GB, you can download 5GB per month). Additional downloading is pretty expensive ($0.12/GB).
  • Downloads have a latency of about 4 hours.
  • Designed “to provide average annual durability of 99.999999999% for an archive”
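
To get a feel for the numbers, here is a rough sketch that applies only the flat rates quoted in the list above. The function name and output format are made up for illustration, and actual billing may well be more complicated than this simple model.

```shell
#!/bin/sh
# Flat rates as quoted in the list above (assumed to apply as-is).
STORAGE_RATE=0.01    # dollars per GB per month
FREE_FRACTION=0.05   # share of stored data that can be downloaded free each month
DOWNLOAD_RATE=0.12   # dollars per GB beyond the free allowance

# estimate_cost STORED_GB DOWNLOAD_GB  (hypothetical helper)
# Prints the monthly storage cost, the free download allowance in GB,
# and the cost of downloading DOWNLOAD_GB in one month.
estimate_cost() {
    awk -v stored="$1" -v dl="$2" \
        -v sr="$STORAGE_RATE" -v ff="$FREE_FRACTION" -v dr="$DOWNLOAD_RATE" '
        BEGIN {
            free = stored * ff
            billable = dl - free
            if (billable < 0) billable = 0
            printf "storage=$%.2f free_gb=%.1f download=$%.2f\n", stored * sr, free, billable * dr
        }'
}

estimate_cost 100 20
```

With 100GB stored and 20GB downloaded in a month, 5GB is free and the remaining 15GB is charged at the per-GB rate.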

These properties make Amazon Glacier an attractive option for offsite backup!

I’ve always wanted to do offsite backup of my data, but never really found a good way/place.

I have plenty of backup locally, but, for example, if my house burns down or Richmond sinks, I would still lose my data.

So I wrote a simple shell script to do automated incremental backup from my Linux file server to Amazon Glacier.

It’s one of my first shell scripts, so please do send advice my way.

To use the script, create a data directory somewhere (everything inside it will be backed up) and an empty backup directory somewhere else. The backup directory is where archives are stored locally, in addition to being uploaded to Glacier. It’s good to keep a local copy of everything: if you need to restore because of something like accidentally deleted files (rather than a hard drive failure), you can restore from the local archives instead of paying to retrieve them from Glacier.

The script, on every run, will either make a full backup (tar.gz file) of the data directory if no previous backups are found, or build an archive of only changed files since the last backup using tar’s incremental archive feature. In either case, it will upload the resulting archive to Amazon Glacier. It creates a log of files changed in each run, size of the incremental archive, and any errors, etc. Every few runs (7 by default), it emails the log file to the email address specified.
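
The core of that logic can be sketched roughly as follows. This is not the actual script, just a minimal illustration assuming GNU tar; the function name and the numbered archive naming scheme are hypothetical, while archive.snar is the snapshot file kept in the backup directory. Logging, emailing and the Glacier upload are left out.

```shell
#!/bin/sh
# run_backup DATA_DIR BACKUP_DIR  (hypothetical helper)
# Writes a full archive if no snapshot file exists yet, otherwise an
# archive containing only what changed since the previous run.
run_backup() {
    local data_dir="$1"
    local backup_dir="$2"
    local snapshot="$backup_dir/archive.snar"
    local kind=incr
    local n=0
    local f

    # No snapshot file means no previous backup: make a full one.
    [ -f "$snapshot" ] || kind=full

    # Number archives by counting existing ones, so names sort in creation order.
    for f in "$backup_dir"/*.tar.gz; do
        [ -e "$f" ] && n=$((n + 1))
    done
    local archive="$backup_dir/backup-$(printf '%04d' "$n")-$kind.tar.gz"

    # tar reads the snapshot to decide what changed, then updates it.
    tar --create --gzip --listed-incremental="$snapshot" \
        --file="$archive" -C "$data_dir" .

    echo "$archive"
}
```

Calling run_backup /path/to/data /path/to/backup prints the path of the archive it just wrote, which is what would then be uploaded.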

Most of the configuration options in the script should be self-explanatory.

This script requires glacier-cli to upload to Glacier. glacier-cli requires the boto library (Python library for accessing Amazon services), as well as python-iso8601 and python-sqlalchemy.
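
The upload step itself is a single glacier-cli call. The sketch below assumes the "glacier archive upload VAULT FILE" form from the glacier-cli README and that the vault already exists; the function name and the GLACIER_CMD override (handy for a dry run) are hypothetical.

```shell
#!/bin/sh
# upload_archive VAULT ARCHIVE_FILE  (hypothetical helper)
# Uploads one archive to the given Glacier vault via glacier-cli.
# Set GLACIER_CMD=echo to print the command instead of running it.
upload_archive() {
    local vault="$1"
    local archive="$2"
    "${GLACIER_CMD:-glacier}" archive upload "$vault" "$archive"
}
```

For example, upload_archive my-backups /path/to/backup/backup-0000-full.tar.gz, where my-backups is a made-up vault name.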

The script can be run by cron if you want scheduled backups. Frequency is up to you of course. I run mine once a day. It sounds like a lot, but it’s actually not much since most data remains unchanged most of the time, and tar uses file modification times to look for modified files, which is very fast.
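
A once-a-day schedule could look something like the crontab entry below (added with crontab -e). The script path, log path and the 3 a.m. run time are placeholders, not the real setup.

```
# min hour day-of-month month day-of-week  command
0 3 * * * /home/user/bin/glacier-backup.sh >> /home/user/glacier-backup-cron.log 2>&1
```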

To force a complete backup, simply delete the “archive.snar” file in the backup directory.

That’s it! If s*** happens, just get all the archives since the last complete backup from either Glacier or the local copy, and extract them in order, starting with the complete backup and ending with the latest incremental archive. It’s the same as extracting regular tarballs, but with an added “--incremental” switch. I didn’t write a script to automate restoring from backup because that shouldn’t happen too often. I have verified by hand that all the incremental stuff works (to Glacier and back).
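
For reference, a restore could be sketched like this. It assumes GNU tar, that all the needed archives sit in one directory, and that their file names sort from oldest to newest; the function name is hypothetical. It uses the --listed-incremental=/dev/null form that the GNU tar manual shows for extracting incremental archives.

```shell
#!/bin/sh
# restore_all BACKUP_DIR RESTORE_DIR  (hypothetical helper)
# Extracts every archive in BACKUP_DIR into RESTORE_DIR, oldest first.
restore_all() {
    local backup_dir="$1"
    local restore_dir="$2"
    local archive

    mkdir -p "$restore_dir"

    # Glob results come back sorted, so names that sort chronologically
    # are replayed from the full backup to the latest incremental one.
    for archive in "$backup_dir"/*.tar.gz; do
        [ -e "$archive" ] || continue
        tar --extract --gzip --listed-incremental=/dev/null \
            --file="$archive" -C "$restore_dir"
    done
}
```

Restoring into a fresh, empty directory first and then copying files back is a cautious way to use it, since incremental extraction can also remove files that were deleted between backups.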

It’s very easy to add encryption, but I chose not to because it introduces an additional point of failure (losing the decryption key or forgetting the password), and a pretty significant one at that. I believe it would be the most likely failure mode of a backup scheme like this. I value data durability over privacy, and I’m not convinced Amazon will ever be interested in my data, so I store my data unencrypted.

In my setup, I actually have both the data directory and backup directory on a RAID-5 array on my file server. The backup directory is not shared, so backups cannot be messed up by the user from a client PC. That means the only time I’d need to restore from Glacier is if I have simultaneous drive failures, if my house burns down, or if my server is physically stolen. Most user-error-type scenarios can be covered by on-machine backups.

Why RAID when I already have backup? Because single hard drive failures are extremely common. I’ve had about 3 already, in my ~10 years of using computers. I have ~100GB of data, which would cost about $120 to retrieve from Glacier, and I would also potentially lose up to 1 day of work, which would be annoying. Also, hard drives are cheap.

PS. I have also written an article on setting up a DIY Linux Network-Attached Storage (aka. file server).