Manually storing files on amazon glacier

Amazon Glacier is an amazingly cost-effective way to store large amounts of data in the cloud. Storage costs less than half a cent per gigabyte per month. Backing up all ~200 GB of my documents costs me only 80 cents a month, which is incredible.

There are a few tradeoffs. Most importantly, there are small charges for downloading data from Glacier, and the more quickly the data is needed, the greater the charge. Also, Glacier is raw key-value storage: the key for each row is automatically generated and can’t be customized. Even getting information about an archive takes time: retrieving the list of rows takes more than 3 hours. These limitations are fine for backups, though, since it’s more common to write to a backup than to read from it.

Because it took me a little while to understand how to use Glacier to store files, I thought I’d share my thoughts for others.

Each Glacier database resides in a region such as us-west-1. Amazon calls a Glacier database a “vault”. Each row is called an “archive”.

The only data for an archive is this:

  • an archiveId (a long auto-generated binary string)
  • the creation date
  • the size in bytes
  • the content
  • the ArchiveDescription, a string you can set.

None of this information, not even the description, can be altered; you have to delete and re-add the archive.

What to put in an ArchiveDescription

I recommend using a tool like FastGlacier or my own open source Glacial Backup for convenience. I’ve written this guide, though, to show how to store files manually.

Glacier has no built-in concept of a “filepath”, and we cannot choose an ID for a row, so we have to store all metadata in the ArchiveDescription. This means that applications storing files to Glacier need to keep their own record of what has been uploaded, because the only way to retrieve the ArchiveDescriptions is to retrieve the inventory, which takes more than 3 hours to complete. It also means that it’s up to applications to prevent two files from being uploaded with the exact same path, because Glacier doesn’t care if five different archives have the same description.
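Keeping that record can be as simple as a local JSON file mapping each path to its archiveId. A minimal sketch, assuming nothing beyond the standard library (the manifest.json file name and the helper names here are my own, not part of Glacier or any tool):

```python
import json
import os

MANIFEST = "manifest.json"  # hypothetical local record of what we've uploaded


def load_manifest(path=MANIFEST):
    """Return the path -> archiveId mapping, or an empty one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def record_upload(relpath, archive_id, path=MANIFEST):
    """Remember an upload, refusing duplicate paths since Glacier won't."""
    manifest = load_manifest(path)
    if relpath in manifest:
        raise ValueError(f"already uploaded: {relpath}")
    manifest[relpath] = archive_id
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```

An application would call record_upload after each successful upload-archive call, so a later restore can find archiveIds without waiting hours for an inventory job.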

FastGlacier’s use of ArchiveDescription makes sense, and is flexible enough to allow additional metadata to be added over time. The file backups/file001.tar is given the description <m><v>4</v><p>YmFja3Vwcy9maWxlMDAxLnRhcg==</p><lm>20200830T171803Z</lm></m>.

What does that mean? <v>4</v> is just the format version. YmFja3Vwcy9maWxlMDAxLnRhcg== is backups/file001.tar encoded as UTF-8, then as base64. 20200830T171803Z is the last-modified time in UTC.
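Building such a description takes only a few lines of Python. A sketch (the helper name is mine; the format itself is FastGlacier’s):

```python
import base64


def make_description(path, last_modified):
    """Build a FastGlacier v4 ArchiveDescription for the given path.

    path is e.g. "backups/file001.tar"; last_modified is a UTC
    timestamp string like "20200830T171803Z".
    """
    # Encode the path as UTF-8, then base64, as FastGlacier does.
    encoded = base64.b64encode(path.encode("utf-8")).decode("ascii")
    return f"<m><v>4</v><p>{encoded}</p><lm>{last_modified}</lm></m>"
```

Calling make_description("backups/file001.tar", "20200830T171803Z") reproduces the example description above.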

I can also upload archives that represent directories. FastGlacier’s website doesn’t document this, but to represent a “directory”, it uploads a 1-byte file. To upload a representation of the directory backups/, I upload an archive whose content is a single space character and give it a description like <m><v>4</v><p>YmFja3Vwcy8=</p><lm>20200830T171803Z</lm></m>, where YmFja3Vwcy8= is backups/ encoded as UTF-8, then as base64.

Setting up AWSCLI

If we want to manually upload files to Glacier, it’s easiest to use the AWS CLI (command-line interface). Download the CLI from Amazon’s site https://aws.amazon.com/cli/.

  • Then, create an Amazon Web Services account, if you don’t already have one.
  • Then, create an IAM account within the Amazon Web Services account. The IAM account will be a ‘subaccount’ with restricted access, for safety.
  • Give the IAM account permissions at least for Glacier-related actions. For example, you can use this policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": ["glacier:*"]
    }
  ]
}
  • Log out of AWS using the menu in the top right corner of the console.
  • Go to the Amazon Web Services console and log in with your new IAM account.
  • Go to https://console.aws.amazon.com/iam/home?#security_credential
  • Write down the ‘Access Key ID’ and ‘Secret Access Key’ in a trusted place.
  • Next, go to https://console.aws.amazon.com/glacier
  • Notice the ‘Region’ in the top right (e.g. us-west-1) and write it down, or optionally change it.
  • Click ‘Create Vault’.
  • Enter a vault name; sticking to alphanumeric characters is recommended.
  • Click Next through each page (you don’t need to set up notifications) and click Create Vault.
  • Then open a command-line console, and type:
  • export AWS_DEFAULT_REGION=us-west-1 (or the region where the vault was created)
  • export AWS_ACCESS_KEY_ID=<access key id>
  • export AWS_SECRET_ACCESS_KEY=<secret access key>
  • (On Windows, use set instead of export for each of these.)
  • Run
aws glacier describe-vault --vault-name <vaultname> --account-id -
  • If everything’s set up right, you’ll see a description of the vault printed.

Storing files onto Glacier

Now that we know what ArchiveDescription to assign and have the AWS CLI set up, we can upload files.

Let’s say we are uploading /path/to/backups/file001.tar and we want the path on Glacier to be backups/file001.tar. We’ll use the description format mentioned earlier; encoding the path as base64 gives YmFja3Vwcy9maWxlMDAxLnRhcg==.

  • Open a command line console
  • Set AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY as described earlier
  • Run
aws glacier upload-archive --vault-name <vaultname> --account-id - --archive-description "<m><v>4</v><p>YmFja3Vwcy9maWxlMDAxLnRhcg==</p><lm>20200830T171803Z</lm></m>" --body "/path/to/backups/file001.tar"
  • If the upload succeeded, you’ll be given an archiveId. Uploading large files can hit time-outs, so for those it’s best to use multi-part uploads; see Amazon’s documentation for details.
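Multi-part uploads require computing a SHA-256 “tree hash” of the payload: the data is hashed in 1 MiB chunks, and the chunk digests are combined pairwise until a single root digest remains. A sketch of that computation (for payloads under 1 MiB it reduces to a plain SHA-256):

```python
import hashlib

CHUNK = 1024 * 1024  # Glacier tree hashes are built from 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Compute the SHA-256 tree hash Glacier expects for a payload."""
    # Hash each 1 MiB chunk; an empty payload hashes as one empty chunk.
    hashes = [hashlib.sha256(data[i:i + CHUNK]).digest()
              for i in range(0, len(data), CHUNK)] or [hashlib.sha256(b"").digest()]
    # Combine digests pairwise until a single root digest remains;
    # an odd digest at the end of a level carries up unchanged.
    while len(hashes) > 1:
        paired = [hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
                  for i in range(0, len(hashes) - 1, 2)]
        if len(hashes) % 2:
            paired.append(hashes[-1])
        hashes = paired
    return hashes[0].hex()
```

The resulting hex string is what the CLI’s multi-part commands expect as the checksum of each part and of the whole archive.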

Downloading files from Glacier

  • Open a command line console
  • Set AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY as described earlier
  • Look up the archiveId of the archive you want to retrieve.
  • Run
aws glacier initiate-job --vault-name <vaultname> --account-id - --job-parameters '{"Type": "archive-retrieval","ArchiveId": "<thearchiveid>","Tier" : "Bulk"}'
  • The tier types are Expedited, Standard, or Bulk. Expedited jobs are faster but cost more.
  • You’ll be given a job-id.
  • You can check the job’s status by running
aws glacier describe-job --vault-name <vaultname> --account-id - --job-id <jobid>
  • Once the job is complete, run
aws glacier get-job-output --vault-name <vaultname> --account-id - --job-id <jobid> output.bin
  • The retrieved archive’s content is written to output.bin. (It might take some time for the job to be ready.)

Retrieving the list of files from Glacier

Getting a list of all of the files in a vault is called “inventory retrieval”. Helpfully, the inventory contains all of the descriptions and archive sizes.

  • Open a command line console
  • Set AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY as described earlier
  • Run
aws glacier initiate-job --vault-name <vaultname> --account-id - --job-parameters '{"Type": "inventory-retrieval"}'
  • You’ll be given a job-id.
  • You can check the job’s status by running
aws glacier describe-job --vault-name <vaultname> --account-id - --job-id <jobid>
  • Once the job is complete, run
aws glacier get-job-output --vault-name <vaultname> --account-id - --job-id <jobid> output.json
  • Then look at the results in output.json. (It will take some time, at least 3 hours, for the job to be ready.)
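The inventory in output.json contains an ArchiveList whose entries include, among other fields, each archive’s ArchiveId, Size, and ArchiveDescription. Decoding the descriptions back into file paths is straightforward; a sketch, assuming the FastGlacier description format discussed earlier:

```python
import base64
import json


def paths_from_inventory(inventory_json: str) -> dict:
    """Map decoded file paths to archiveIds from an inventory document."""
    inventory = json.loads(inventory_json)
    result = {}
    for archive in inventory.get("ArchiveList", []):
        desc = archive.get("ArchiveDescription", "")
        # Pull the base64 path out of <p>...</p> in a FastGlacier description.
        if "<p>" in desc and "</p>" in desc:
            encoded = desc.split("<p>", 1)[1].split("</p>", 1)[0]
            path = base64.b64decode(encoded).decode("utf-8")
            result[path] = archive["ArchiveId"]
    return result
```

With that mapping in hand, you can look up the archiveId for a given path and kick off an archive-retrieval job as shown earlier.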

In general, FastGlacier does everything I need, but I thought it was worthwhile to know how to upload and download data myself if I ever need to.