Disaster Recovery with Tarsnap
Real Backups Get Tested
This is yet another peek into my Terraform+Ansible Infrastructure as Code setup. I mentioned onsite+offsite backups of my (proudly minimally scoped) production server deployments – this article is all about how I restore from offsite backups in case of a genuine disaster that knocks out any onsite block storage.
Peace of mind, for me, is knowing my critical data is safe. I’ve made quite a few mistakes over the years hosting and managing my own services – the toughest ones were all data loss. With my new infrastructure I’ve finally learned to dedicate a couple days to a robust backup solution where policy and strategy are defined in one place, as code, and deployed to anything I manage. When I create a server in my Ansible inventory (generated by Terraform) I give it a tarsnap_key_file. This refers to a file generated by hand (for now – it’s just a CLI command) and links up with other tarsnap_* variables to both install tarsnap and configure some automation for using it to create backups. Each machine gets a keyfile that it uses to upload backups of critical data “offsite” with a sensible, simple retention policy.
But backups that aren’t tested aren’t backups. There are zero “organic” scenarios where you can afford to be fumbling with a disaster recovery procedure you’ve never drilled before. Because of this, backups have to be easy to test or you won’t test them adequately. Later on I’ll be presenting a feature where a snapshot is taken before a backup is restored. This is perfect for testing as it allows you to test restoring backups and then restore the snapshot you just created to resume normal functionality.
It’s also worth it to plan your deployment around being easy to restore. In my case I am using volumes and containers to isolate any critical data for backup. Additionally, any configuration not managed by Ansible (and therefore not backed up with the rest of the IaC) should be colocated with other critical data or at least backed up in the same way. Containerized applications are great for this! The official Gitea container puts the app configuration in the same root data directory as your user data. Any changes you make in the app are reflected in your backups. This is part of the reason I actually opted to write my own minimal Gitea role with no template for app.ini. I’m avoiding configuration drift at the cost of not being able to programmatically create new Gitea instances with tweaked or preset configuration. For me that is an S-tier tradeoff to make.
But what is Tarsnap?
I got a little ahead of myself introducing this setup. My bad! Tarsnap is a cloud backup solution written by Colin Percival. It is, in my opinion, the best offsite backup option for small-medium organizations that can afford to store their backups in the cloud (workflows that store big, binary data or lots of frequently changing data often find more value in building physical servers). It’s cheap, safe and minimal. A solid core you can build just about anything around. I’ve been using it for years now and I can’t recommend it enough.
It’s like tar but instead of creating files on your hard drive it creates objects in AWS S3. Everything is encrypted on your machine and you always control your keys. Object Storage + tar is already the small/medium backup solution. Percival made it easy without an upsell. If you need to know more…
An infodump about Tarsnap
The server portion of Tarsnap runs in the cloud and mediates access to the AWS S3 object storage where your data is warehoused. Percival maintains the backend, including its closed source code, but some details about the implementation are out there. The most notable is probably Percival’s work on FreeBSD AMIs for AWS EC2. He worked hard to get boot times down, presumably to allow aggressive scaling of the Tarsnap backend.
Closed source sounds bad to some people, but in this case it makes a lot of sense. For a service like this with an intentionally limited scope and a goal of stability, there is benefit to allowing a seasoned developer the space to work in private. Distractions are minimal and focus can be diverted where it matters: the client. The server implementation is relatively low trust, especially once you know that the data lives on S3. There’s nothing special about the backend at all, except that it works incredibly reliably. When things do go down, Percival diligently provides a post-mortem. Forget the license wars for a minute and really appreciate how aspirational that is for a solo dev.
The client is open source, and for good reasons beyond being able to trust the code that runs on your machine. Percival is a cryptomancer and has packed a lot of understatedly kickass functionality into a simple CLI. On the surface it mimics tar – when you use the -f flag you are specifying a “filename” in the cloud. Many things are the same, but with a few useful additions like --list-archives. When you actually hand it some targets to archive it does more than pack the contents into a single file with some metadata. Rather, the data is split into blocks, deduplicated, compressed and encrypted using a key that lives on your machine. Only blocks that have changed are actually sent up and stored. The client doesn’t send a single byte of unencrypted user data – ever!
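You can watch the deduplication happen with --print-stats. A quick sketch – the paths and archive names here are made up:

# day one: nearly everything counts as new data
tarsnap -c --print-stats -f docs-monday "$HOME/documents"
# day two: only changed blocks show up as new data in the stats
tarsnap -c --print-stats -f docs-tuesday "$HOME/documents"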
The cherry on top is the billing. Percival bills in picodollars for storage and bandwidth. It’s a bit of an adjustment at first but it really shows off how frugal it is as a managed backup service. Couple the fine-grained pricing with knowledge of how the client deduplicates your data and you have a dirt cheap solution for backing up text files without having to make any compression choices at all. And when it comes to blobbier, volatile data the raw prices are still highly competitive.
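To make that concrete: at the posted storage rate of 250 picodollars per byte-month (as I write this), a full gigabyte parked for a month is 10⁹ bytes × 250 × 10⁻¹² dollars ≈ $0.25 – and a deduplicated archive of text rarely approaches a full gigabyte.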
Just don’t lose your key. Your data is worthless (to you, and to potential thieves) without the key!
Oh, there is one more thing! The client is very simple so most users choose (or develop) their own helpers. The official GUI helper never appealed to me but there is a nice list on the Tarsnap website of projects in various states of maintenance. I really like ACTS or tarsnapper for personal use on a laptop or similar host. Here’s the DIY version I’m using to back up my core infrastructure (including Terraform state and various keys).
Single Target, Rolling Backup Script
#!/bin/bash
# NOTE: .git is excluded in my ~/.tarsnaprc -- if I want to back up repos I'll use git
# For Terraform it happens to be useful to have the exact source that corresponds
# to a given state file. So I back up everything **but** the repo after I apply
SOURCE="$HOME/core-infra"
BACKUP_SLUG="core-infra"
BACKUP_NAME="${BACKUP_SLUG}--$(date +%Y-%m-%d_%H-%M-%S)"
LOGFILE="$HOME/.local/log/tarsnap.${BACKUP_SLUG}.log"
mkdir -p "$(dirname "$LOGFILE")"
BANNER='==========================='"$BACKUP_NAME"'==========================='
echo "$BANNER" >> "$LOGFILE"
# create new backup
tarsnap -c -f "$BACKUP_NAME" "$SOURCE" >> "$LOGFILE" 2>&1
# delete old backups (list newest first, keep the first $BACKUPS_KEPT)
BACKUPS_KEPT=3
tarsnap --list-archives | grep "^${BACKUP_SLUG}--" | sort -r | awk -v keep=$BACKUPS_KEPT '
NR > keep { print "Deleting: " $0; system("tarsnap -d -f \"" $0 "\"") }
NR <= keep { print "Keeping: " $0; }
' >> "$LOGFILE" 2>&1
Most helpers are just this plus config files and an algorithm to figure out which backups to delete. You’ll want all that if you’re setting and forgetting tarsnap.
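Mine runs out of cron. Something like this – the path is illustrative:

# crontab -e
# m h  dom mon dow   command
0 3 * * * /home/me/bin/backup-core-infra.sh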
A note on scope before I dive in
Right now I am using this setup on my personal, private Gitea server. I plan to share it with some people and at that time I’ll need to add a bit more to this. Primarily, I’ll want to poke a hole through to Gitea directly while my reverse proxy is configured to display the maintenance page so that I can test what’s going on without users freaking out that the commits they pushed earlier are suddenly gone. Also there is a daily maintenance window where that message is displayed during the backup process – but I don’t mind! I don’t have an SLA with myself.
A simple backup strategy
My original sketches were for a role that installs the tarsnap helper I recommended above, tarsnapper. Unfortunately there is a big issue that makes monitoring the backup process impossible. Hopefully the PR solving the issue is merged soon, but I have little interest in the overhead of running a fork at this time.
In order to address this issue I added a bit of pluggability to the role in the form of the strategy pattern. To begin with, I designed my own simple strategy (called simple) that keeps multiple rolling backups but only creates/prunes them once per day. Here’s the relevant portion of the tarsnap configuration provided as a dependency of my Gitea server role.
- role: enbeec.tarsnap
  vars:
    tarsnap_simple:
      - name: gitea
        targets:
          - "{{ gitea_volmount_data }}/gitea"
        keep: "7"
    tarsnap_config:
      humanize_numbers: true
      exclude:
        - "{{ gitea_volmount_data }}/gitea/gitea/indexers"
        - "{{ gitea_volmount_data }}/gitea/gitea/sessions"
        - "{{ gitea_volmount_data }}/gitea/gitea/tmp"
        - "{{ gitea_volmount_data }}/gitea/gitea/queues"
        - "{{ gitea_volmount_data }}/gitea/gitea/log"
# ...
Note that tarsnap_key_file isn’t present – that is always passed as a host_var. The client uses the key to list archives and allow deleting archives. One key per machine is a must.
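If a full read/write/delete key on every box makes you nervous, the client also ships tarsnap-keymgmt for deriving restricted keys. A sketch, with placeholder paths:

# derive a write-only key from the master key;
# a machine holding only this key can create archives
# but cannot list, read or delete them
tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap.key

My rolling retention needs the list and delete permissions, though, so my machines carry full keys.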
The other part of the config is a feature to be shared between all strategies: hooks. These are behaviors that run before and after backups. Here are the hooks I use to alert my reverse proxy that this is a planned outage before I spin the container down (and then undo both of those actions, in order).
- role: enbeec.tarsnap
  vars:
    # ...
    tarsnap_hook_before:
      - "sudo touch /var/www/maintenance"
      - "printf 'Stopping '"
      - "docker stop --time=30 gitea"
    tarsnap_hook_after:
      - "sudo rm /var/www/maintenance"
      - "printf 'Starting '"
      - "docker start gitea"
Each backup strategy must implement the hook script, while all strategies share the same restore script. This is because restoring backups is always a manual process. Any kind of machine cloning should be performed using onsite storage replication to avoid bandwidth costs and key sharing.
Speaking of restoring backups…
Disaster!
In order to streamline my recovery process I have built a playbook with a bash frontend, named disaster.yaml and disaster.sh respectively.
All backup strategies use some kind of identifier (usually a timestamp) to link together the different archives created at the same time. Part of the role configuration is a restore script that templates out any configured archives so that only an identifier need be provided to restore a batch of cotemporaneous archives. I really like this approach as it cuts down on the complexity of the script. Rather than maintaining some configuration structure and looping over it at runtime you get a bespoke script on each machine that is highly readable and kept in sync with the latest configuration updates by Ansible.
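For the Gitea host, that bespoke script boils down to something like this – illustrative only, since the real one is rendered by Ansible:

#!/bin/bash
# /opt/tarsnap.sh restore <identifier> -- a hypothetical reconstruction
ID="$1"
# one unrolled block per configured group of targets; Gitea has exactly one
# {BEFORE HOOKS}
tarsnap -xvPf "gitea01-gitea--${ID}"
# {AFTER HOOKS}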
The frontend, disaster.sh, cannot provide a unique list of identifiers to choose from without targeting a host (and passing its keyfile to tarsnap with --keyfile). So, the first thing the script does is use ansible-inventory and jq to find any hosts with a tarsnap_key_file host_var and pass them to dialog. Once a selection is made, tarsnap --list-archives and some processing allows another dialog to select the desired identifier. Finally, read is used to get an optional snapshot name (so there is a backup of the machine just before restoration) and type-to-confirm confirmation.
disaster.sh
#!/usr/bin/env bash
# MUST BE COMPATIBLE WITH BASH 3.2 (i.e. no associative arrays, etc.)
# (that's the version of bash, from 2007,
# that macOS still ships for license reasons)
set -e
## SETUP
__dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${__dir}"
MISSING_DEPENDENCIES="false"
if ! command -v tarsnap &> /dev/null; then
echo "tarsnap is not installed. Please install tarsnap and try again."
MISSING_DEPENDENCIES="true"
fi
if ! command -v jq &> /dev/null; then
echo "jq is not installed. Please install jq and try again."
MISSING_DEPENDENCIES="true"
fi
if ! command -v dialog &> /dev/null; then
echo "dialog is not installed. Please install dialog and try again."
MISSING_DEPENDENCIES="true"
fi
if [ "$MISSING_DEPENDENCIES" = "true" ]; then exit 1; fi
## GLOBALS
# depends on `cd ${__dir}`
ANSIBLE_FLAGS="-i inventory --vault-password-file ../.keys/vault"
# prompt strings for read are weeeeird
NEWLINE=`echo $'\n.'`; NEWLINE=${NEWLINE%.}
# set by: select_host; read by: select_archive, restore
HOST=""; KEYFILE=""
# set by: select_archive; read by: restore
ARCHIVE=""
## FUNCTIONS
prompt() {
    local var_name="$1"
    local prompt_text="$2"
    read -r -p "$prompt_text" "$var_name"
}
select_host() {
# retrieve all hosts with hostvars
# (any host with a tarsnap key will be here)
local hosts="$(
ansible-inventory $ANSIBLE_FLAGS --list \
| jq -r '._meta.hostvars | keys[]')"
local host_array=()
# create a poor man's associative array of only hosts with keys
for host in $hosts; do
local keyfile="$(
ansible-inventory $ANSIBLE_FLAGS --host "$host" \
| jq -r '.tarsnap_key_file // empty')"
if [ -n "$keyfile" ]; then
host_array+=("$host::$keyfile")
fi
done
# Create the array for dialog.
# Consists of hostnames each with the placeholder description, "_".
# BE CAREFUL -- you have to have something there
# or the quoting gets harder.
local host_options=()
for host in "${host_array[@]}"; do
    host_options+=("${host%%::*}" "_")
done
# bail early if we can't do anything
if [ ${#host_options[@]} -eq 0 ]; then
>&2 echo "No hosts with a tarsnap_key_file exist."
exit 1
fi
prompt="Select a host to restore backups on:"
local selected_host
# assign separately from `local` so $? reflects dialog's exit status
selected_host="$(
    dialog --stdout --menu "$prompt" 0 0 0 "${host_options[@]}")"
if [ $? -ne 0 ]; then
    >&2 echo "Host selection cancelled."
    exit 1
fi
clear
# search back for the selected host so we can set global vars
for item in "${host_array[@]}"; do
host="${item%%::*}"
keyfile="${item##*::}"
if [ "$host" = "$selected_host" ]; then
HOST="$host"
KEYFILE="$keyfile"
break
fi
done
}
select_archive() {
local host="$1"
local keyfile="$2"
# NOTE sed is used to separate the identifier suffix from the archive name
# The full names look like: host-archive--YYYY-MM-DD-HH-MM-SS
# When I first did this I used: archive-YYYYMMDD
# I have reasons for each and every difference between the two
local archives=(
$(tarsnap --keyfile "$keyfile" --list-archives \
| sed 's/.*--//' | sort -r | uniq | awk '{print $0, "_"}')
)
prompt="Select an archive to restore:"
local selected_archive
# as above: keep the assignment separate so $? comes from dialog
selected_archive="$(
    dialog --stdout --menu "$prompt" 0 0 0 "${archives[@]}")"
if [ $? -ne 0 ]; then
    >&2 echo "Archive selection cancelled."
    exit 1
fi
clear
ARCHIVE="$selected_archive"
}
recover() {
local host="$1"
local archive="$2"
echo '==> Begin Recovery'
prompt snapshot 'Provide a name for a pre-restore snapshot archive.'"$NEWLINE"'(Leave empty to skip creating a snapshot)'"$NEWLINE"'> '
prompt confirmation 'Even with a pre-restore snapshot there are risks to every restore.'"$NEWLINE"'Please type "confirm" to continue.'"$NEWLINE"'> '
echo ""
# this playbook is the plumbing that restores backups
# -- the script itself is just the porcelain
ansible-playbook disaster.yaml $ANSIBLE_FLAGS \
--extra-vars "restore_host=$host" \
--extra-vars "tarsnap_archive=$archive" \
--extra-vars "tarsnap_snapshot=$snapshot" \
--extra-vars "confirmation=$confirmation"
}
## EXECUTE
select_host
select_archive "${HOST?}" "${KEYFILE?}"
recover "${HOST?}" "${ARCHIVE?}"
Then, control is passed to Ansible by calling ansible-playbook with --extra-vars flags for restore_host, tarsnap_archive, tarsnap_snapshot and confirmation. After a little verification, the backup (if a snapshot was named) and restore scripts are called and the output shown to the user. I like using Ansible for this because the ansible.builtin.assert module is a concise way to perform readable validation. Also, SSH is pre-configured and there might eventually be a module for tarsnap.
disaster.yaml
# Playbook for Disaster Recovery
# Do not use directly.
- name: Restore From Tarsnap
  hosts: "{{ restore_host }}"
  tasks:
    - name: Strip Variables
      ansible.builtin.set_fact:
        tarsnap_archive: "{{ tarsnap_archive | trim }}"
        tarsnap_snapshot: "{{ tarsnap_snapshot | trim }}"
        confirmation: "{{ confirmation | trim }}"
    - name: Validate Variables
      ansible.builtin.assert:
        that:
          - tarsnap_archive is defined and tarsnap_archive != ""
          - tarsnap_snapshot is match('[a-zA-Z_-].*') or tarsnap_snapshot == ""
          - confirmation == "confirm"
        fail_msg: "Aborting Gitea restore: invalid input"
    - name: "Snapshot: command"
      ansible.builtin.command: "sudo /opt/tarsnap.sh backup snapshot {{ tarsnap_snapshot }}"
      register: snapshot_result
      when: tarsnap_snapshot != ""
    - name: "Snapshot: stdout"
      ansible.builtin.debug:
        msg: "{{ snapshot_result.stdout_lines }}"
      when: tarsnap_snapshot != ""
    - name: "Snapshot: stderr"
      ansible.builtin.debug:
        msg: "{{ snapshot_result.stderr_lines }}"
      when: tarsnap_snapshot != ""
    - name: "Restore: command"
      ansible.builtin.command: "sudo /opt/tarsnap.sh restore {{ tarsnap_archive }}"
      register: restore_result
    - name: "Restore: stdout"
      ansible.builtin.debug:
        msg: "{{ restore_result.stdout_lines }}"
    - name: "Restore: stderr"
      ansible.builtin.debug:
        msg: "{{ restore_result.stderr_lines }}"
Even if my DigitalOcean account was nuked I could set up a new node with this keyfile and have the backups from Tarsnap restored in a snap – no pun intended. For completeness, here is the tarsnap command that gets called once I finish reconstructing the full archive names from just the identifier. Remember, this runs on the remote host where the keyfile is automatically picked up from the configuration.
# ... {BEFORE HOOKS}
# -x ~> extract
# -v ~> verbose
# -P ~> preserve root ("etc/foo/bar" restores to "/etc/foo/bar")
# -f ~> file (archive) name
tarsnap -xvPf "$ARCHIVE" 2>&1 | tee -a "$LOGFILE"
# this is repeated in the aforementioned
# unrolled loop we template out at config time
# once per group of targets (Gitea only has one -- "gitea")
# {AFTER HOOKS} ...
At the risk of repeating myself too many times: more details on this to come. I want to test drive the setup for a couple weeks before I go too deep into detail on it.
You said it’s on ~sight~ site?
Tarsnap is just my offsite backup – onsite is even easier. I place all critical data on distinctly provisioned block storage (in this case a DigitalOcean volume). This resource is marked with prevent_destroy = true in Terraform and can be set up with rolling snapshots using DigitalOcean Functions. Technically, prevent_destroy really isn’t that bulletproof. Extra tools like terrasafe are useful when automating your applies so that nothing can remove your critical resources without manual intervention.
There is more I could do. Currently I don’t have automated volume backups. I take the occasional snapshot but nothing automated as of yet. Maybe when I have other users to support I will be able to justify it but for now I’m content with my ability to tear down and rebuild the compute instance at will and rely on the same volume to just work. Should something go sideways, I can always perform an offsite restore using Tarsnap!
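When I do take one, it’s a single doctl command – the volume ID here is a placeholder:

# onsite: snapshot the block storage volume itself
doctl compute volume snapshot <volume-id> \
    --snapshot-name "gitea-data-$(date +%Y-%m-%d)"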
There are things that keep me up at night sometimes – disaster recovery for my self hosted infrastructure is not one of them.