Backups with IPFS

Posted on 2021-02-24

With that out of the way, let’s talk about what value IPFS brings to backups:

Cryptographic integrity of the backed up files.
Deduplication between backups, even between hosts.
Ease of moving backups between hosts and providers.
Easy to pin certain backups forever while retaining the deduplication.
Easily view backup contents via IPFS gateways.

Other than the massive caveat of making all of your data public, IPFS seems like a fantastic option for backups. There is a feature request for supporting encryption in IPFS, but unfortunately it doesn’t appear to be making much progress.

Basics

What does it take to make a backup system out of IPFS? Here is an annotated version of my script. It uses the ipfs files (MFS) API to store the backups. For our purposes it basically serves as a better form of pinning.

# The MFS directory where all backups are stored
root="/backups/valheim"
ipfs files mkdir -p "$root"

# Location of the new backup.
new="$root/$(date -u '+%FT%TZ')"

# Add the backup to IPFS and get the CID.
cid=$(ipfs add -QrH -- /var/lib/valheim/worlds)

# Add the backup to the backup directory in MFS
ipfs files cp "/ipfs/$cid" "$new"

# Unpin. We already have a pin from MFS.
ipfs pin rm "$cid"

There we go. With just a couple of commands we are backing up our directory via IPFS!
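Since the backups are just directories in MFS, they can be listed with ipfs files ls and turned into gateway links. Here is a minimal sketch of that, not part of my script: the helper name backup_url and the choice of the public ipfs.io gateway are assumptions, and any gateway that can reach your node should do.

```shell
# Hypothetical helper (not part of the backup script itself).
# Prints a gateway URL for one backup stored under the MFS root.
# Assumes the ipfs CLI can talk to a running node.
root="/backups/valheim"
gateway="https://ipfs.io"   # assumption: any public or private gateway works

backup_url() {
    # $1 is a backup name as printed by: ipfs files ls "$root"
    cid=$(ipfs files stat --hash "$root/$1") || return 1
    printf '%s/ipfs/%s\n' "$gateway" "$cid"
}
```

Calling backup_url with one of the timestamp names prints a link that can be opened in a browser, which is also a reminder that anyone else with the CID can do the same.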

Replication

Right now the files are only stored on the host that backed them up (unless you are running a remote IPFS daemon). So while we do have some history, we will be unable to access the files if the host dies. To replicate the data, we can pin our backup to a remote host.

curl -fXPOST --retry 5 "https://$REMOTE_HOST/api/v0/pin/add?arg=/ipfs/$cid"

This will cause the remote host to download the data (only the blocks it doesn’t already have) and keep a copy locally.

Deduplication

Right now we are getting basic deduplication, but IPFS’s default add options don’t deduplicate well in many cases. Unchanged files and directories are deduplicated, but modified files will be stored mostly independently (this is highly dependent on how the files are modified). This is because IPFS’s default chunker splits files into fixed-size blocks, which only deduplicates well in specific cases: an in-place overwrite or an append leaves the earlier blocks untouched, but an insertion near the start of a file shifts every block boundary after it.

In my case (backing up Valheim world data) I found that I got very good deduplication using -s rabin-2048-65536-131072. This uses a rolling checksum to decide where to break chunks and does a very good job in a wide variety of use cases, at the cost of CPU and time spent identifying chunk boundaries.
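For reference, the chunker is selected with the -s (--chunker) flag of ipfs add. Below is a minimal sketch of the add step with the rabin chunker enabled. The wrapper name add_backup is only illustrative.

```shell
# Illustrative wrapper: only the -s flag differs from a plain recursive add.
# rabin-<min>-<avg>-<max> sets the minimum, average and maximum chunk
# sizes in bytes.
chunker="rabin-2048-65536-131072"

add_backup() {
    # Prints the root CID of the directory given as $1.
    ipfs add -QrH -s "$chunker" -- "$1"
}
```

Every backup needs to use the same chunker settings. Data chunked with different parameters produces different blocks, so it will not deduplicate against earlier backups.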

For my use case I have about 6 days of backups that are about 30MiB each. In that time I have taken 49 backups and use a total of 159MiB of storage. Without deduplication those backups would take roughly 49 × 30MiB = 1470MiB, so this is a savings of around 9x.
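As a rough check, the savings can be estimated from the rounded figures above. This is only back-of-the-envelope arithmetic, and the helper name dedup_ratio is made up for the example. The exact ratio depends on the real size of each backup.

```shell
# Estimate the deduplication ratio from rounded figures.
# $1 = number of backups, $2 = size of each backup in MiB,
# $3 = MiB actually stored.
dedup_ratio() {
    awk -v n="$1" -v each="$2" -v stored="$3" \
        'BEGIN { printf "%.1f\n", (n * each) / stored }'
}

dedup_ratio 49 30 159   # 49 * 30 = 1470 MiB logical, so this prints 9.2
```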

Future Work

Cleanup

I don’t currently have a cleanup script. However, the file names are the date of the backup, so writing a script to implement a simple policy should be trivial. (In fact I’m sure I have a script for time and count retention lying around in an old project repo.) Since I am backing up a small volume of data and the deduplication is so effective, I probably won’t need to delete any backups for quite a while.
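A count-based policy could look something like the sketch below. It is not a script I actually run: the names select_expired and prune_backups and the default of keeping 100 backups are made up for illustration. It relies on the fact that the UTC timestamp names sort chronologically.

```shell
root="/backups/valheim"

# Read backup names on stdin and print all but the newest $1 of them.
# ISO 8601 UTC names sort chronologically, so a plain sort is enough.
select_expired() {
    sort -r | tail -n "+$(( $1 + 1 ))"
}

# Remove expired backups from MFS, keeping the newest $1 (default 100).
prune_backups() {
    ipfs files ls "$root" | select_expired "${1:-100}" | while read -r name; do
        ipfs files rm -r "$root/$name"
    done
}
```

Removing an entry from MFS only drops the reference. The disk space comes back the next time the garbage collector runs (ipfs repo gc), and only for blocks that no remaining backup still uses.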

Encryption

It would be nice if I could encrypt the backups. Until native encryption is supported, the state of the art is to encrypt pre-backup. However, if done naively this completely eliminates the benefit of deduplication. You can encrypt based on a rolling hash, but stacking rolling hashes is not optimal. To do a good job you would need to do your own chunking so that the IPFS chunks match the encryption chunks. You also need to be careful how you “reset” the encryption state on chunk boundaries. I’m not a cryptographer, but being able to compare two encrypted backups is very interesting from a cryptanalysis point of view.

Either way, if you have sensitive data you probably shouldn’t be putting it on the internet, and especially not multiple versions of it. That being said, I suspect that encrypted backups on IPFS would be a good enough solution for a large number of people.

Compression

Compression and deduplication go together like oil and water. You could compress each block but the gains would be limited by the block size. I think at the end of the day chunk-then-compress would still provide some benefits. This would require writing your own add function but shouldn’t be too difficult.

Another downside is that, without native compression support, the files would no longer be viewable via IPFS gateways.

Filesystem Snapshots

Right now I just hope that the files don’t change while backing them up. It would be better to use a filesystem that supports snapshots, take a snapshot, and back that up. This way I should get a good backup every time (as long as the application is crash safe).
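As a sketch of what that could look like, assume the data lives on a Btrfs subvolume (ZFS or LVM snapshots would follow the same pattern with different commands). The function name snapshot_backup and the snapshot location are placeholders, and I have not run this against my own setup.

```shell
# Sketch only: assumes $1 is a Btrfs subvolume and that the script has
# enough privileges to create and delete snapshots.
snapshot_backup() {
    src=$1
    snap="$src.backup-snapshot"   # placeholder snapshot location

    # Take a read-only snapshot so the files cannot change during the add.
    btrfs subvolume snapshot -r "$src" "$snap" >/dev/null || return 1

    # Add the frozen copy; keep the status so the snapshot is always removed.
    cid=$(ipfs add -QrH -- "$snap")
    status=$?

    btrfs subvolume delete "$snap" >/dev/null
    [ "$status" -eq 0 ] && printf '%s\n' "$cid"
}
```

The name of the snapshot directory does not affect the resulting CID, so backups taken this way still deduplicate against ones taken directly from the live directory.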

Valheim only writes to disk once an hour so this isn’t a major concern at this point. There is an incredibly slim chance that I get a bad backup and my backups are frequent enough that a single bad backup is a very minor concern.

Concerns

Pinning

If the backup script dies between the ipfs add and the ipfs pin rm, the item will be pinned on my node forever. This isn’t really an issue right now because I have no cleanup anyway, and in practice I manually remove these pins from time to time, but it would be nice to have an atomic “add to MFS” option that doesn’t create a pin.

I could alternatively run ipfs add --pin=false, however then I risk the garbage collector removing the data before I add it to MFS. If that occurs I would think I have taken a backup but the blocks would not actually be available. This option seemed worse than keeping a backup around if I fail to unpin. This could be mitigated by verifying that all of the blocks are still available after adding to MFS (e.g. ipfs get -a >/dev/null) and aborting the backup if they aren’t.
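A sketch of that add-then-verify idea is below. It uses --pin=false, which is how ipfs add spells the no-pin option, and it swaps the ipfs get check for an offline ipfs refs -r walk, on the assumption that the walk fails when a block is missing locally. The function name add_unpinned is made up, and I have not tested this against a real garbage collection race.

```shell
# Sketch: add without pinning, attach to MFS, then verify nothing was
# collected in between.
# $1 = directory to back up, $2 = destination path in MFS.
add_unpinned() {
    cid=$(ipfs add -QrH --pin=false -- "$1") || return 1
    ipfs files cp "/ipfs/$cid" "$2" || return 1

    # Walk the whole DAG without touching the network. This is expected
    # to fail if any block is missing from the local repo.
    if ! ipfs --offline refs -r "$cid" >/dev/null; then
        ipfs files rm -r "$2"
        return 1
    fi
    printf '%s\n' "$cid"
}
```

Once the entry is in MFS the blocks are protected from the garbage collector, so the check only has to cover the window between the add and the copy.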

Full Scan

This approach reads everything off of the filesystem and chunks it for every backup. If nothing changed this is a lot of unnecessary work. With filesystem assistance we could avoid inspecting directories and files that haven’t changed since the last backup and use the CIDs from last time.
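Short of real filesystem assistance, a coarse version of this idea is to skip the add entirely when nothing under the directory has been modified since the previous backup. Here is a minimal sketch, assuming a stamp file (hypothetical, not something my script maintains) that is touched each time a backup is taken. It still walks the tree to stat every entry, but it does not read or chunk any file contents.

```shell
# Succeeds if anything under directory $2 was modified after stamp file $1.
# A missing stamp file counts as "changed" so the first backup always runs.
changed_since() {
    [ -e "$1" ] || return 0
    [ -n "$(find "$2" -newer "$1" -print | head -n 1)" ]
}
```

The backup script would run the add only when changed_since succeeds, and otherwise reuse the CID from last time. Deletions and renames are caught because they update the modification time of the containing directory. The stamp should be touched before the add starts rather than after, so that changes made during a backup are picked up by the next one.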

Conclusion

Overall, this is a very effective and easy-to-set-up backup system for public data. When encryption is supported it will be a great system for all but the most sensitive data. The fact that it is easy to get started, easy to view online, has effective deduplication, and is easy to replicate addresses the biggest concerns with backups.

I would love to see a project that wraps this up into an easy-to-use policy framework. (For example: back up locally every hour and offsite daily, keep daily backups for a month and monthly backups for a year, and configure directories and exclusions.) But for now, just running the script on cron has proven to be satisfactory.