Remote state isolation with terraform workspaces for multi-account deployments

Today we’re going to look at how to use terraform workspaces (the CLI ones, not the enterprise ones) to manage multi-environment and multi-account deployments in a secure way. The main issue with tf workspaces is that a single bucket holds the state for all environments. We will look at how to improve that by segregating access per environment, so that when we are authenticated to the dev env we can’t read or edit the state file of the prod env, and so on. That makes terraform workspaces more secure and prod ready while keeping the terraform commands neat:

terraform workspace new prod
..
terraform workspace new dev
..
terraform workspace select prod
terraform plan
..
terraform workspace select dev
terraform plan

First of all, before getting into the nitty-gritty of tf, let’s have a quick look at the alternative ways to structure a tf project that deploys infra to multiple accounts or environments:

1. using directory structure
2. providing separate var files
3. having some sort of wrapper script or 3rd party tool like terragrunt
4. finally, the one we are going to check out today, using native terraform workspaces

1. Using directory structure

nginx-app
├── dublin
│ ├── backend.tf
│ ├── main.tf -> ../main.tf
│ ├── outputs.tf -> ../outputs.tf
│ ├── terraform.pout
│ ├── terraform.tfvars
│ ├── userdata.tpl -> ../userdata.tpl
│ └── variables.tf -> ../variables.tf
├── main.tf
├── outputs.tf
├── userdata.tpl
├── variables.tf
└── virginia
    ├── backend.tf
    ├── main.tf -> ../main.tf
    ├── outputs.tf -> ../outputs.tf
    ├── terraform.pout
    ├── terraform.tfvars
    ├── userdata.tpl -> ../userdata.tpl
    └── variables.tf -> ../variables.tf

So in the parent dir we have the tf files (main, outputs, etc.), and in each env folder we have symbolic links to those files plus the env-specific details kept in terraform.tfvars, backend.tf, etc.

The advantages are that everything is visually separated, the code looks cleaner, and you do not need to worry about which var files to provide when running tf. That matters most when you are authed to the prod account and run tf with dev vars, where things can get messy. Another advantage is that you can override certain behaviour by having an actual file instead of a symbolic link, similar to overriding a method in Java. It is especially helpful in bigger and more complex components where you have a separate file per infra component, like DNS, data providers, security groups, etc. So instead of providing values in tfvars, you simply override the behaviour by putting a different file in the env directory in place of the symbolic link. This is very useful for snowflake envs (an antipattern, but unfortunately hard to avoid) and is much better than having mind-blowing conditionals.
With separate directories you can also run terraform in parallel: the tf plan output and the other meta information terraform creates during a run in the .terraform folder don’t get mixed up, as they stay in different dirs.

The only, and biggest, disadvantage is that you have more files to manage because of the file duplication.
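A small helper can take some of the pain out of keeping the links in place. Below is a minimal sketch in Python, assuming the layout shown above (shared .tf and .tpl files in the component dir, one sub-directory per env). The function name link_shared_files and the set of suffixes are made up for illustration. Files that already exist in the env dir are left alone, so a real file dropped in as an override is not replaced by a link.

```python
from pathlib import Path
from typing import List

# Assumption: only these file types are shared between envs.
SHARED_SUFFIXES = {".tf", ".tpl"}


def link_shared_files(component_dir: Path, env: str) -> List[str]:
    """Symlink shared files from component_dir into component_dir/env.

    Anything already present in the env dir (a real override file or an
    existing link) is skipped. Returns the names of the links created.
    """
    env_dir = component_dir / env
    env_dir.mkdir(exist_ok=True)
    created = []
    for src in sorted(component_dir.iterdir()):
        if not src.is_file() or src.suffix not in SHARED_SUFFIXES:
            continue
        link = env_dir / src.name
        if link.exists() or link.is_symlink():
            continue  # keep overrides and existing links untouched
        # Relative target, so the tree can be moved or cloned elsewhere.
        link.symlink_to(Path("..") / src.name)
        created.append(src.name)
    return created
```

Calling link_shared_files(Path("nginx-app"), "dublin") would create dublin/main.tf -> ../main.tf and so on, while backend.tf and terraform.tfvars stay real, env-specific files that you still write by hand.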

2. Providing separate var files

This is probably the most popular way, so I won’t spend too much time explaining it. In essence, everything is the same for all envs except the stuff defined in the env-specific vars:

├── README.md
├── authz.tf
├── conf
│   ├── prod
│   │   ├── conf.tf
│   │   └── var.tf
│   └── test
│       ├── conf.tf
│       └── var.tf
├── route.tf
└── var.tf

so you would simply run it as:

terraform plan -var-file conf/$env/var.tf ... -var-file bla..

3. Having some sort of wrapper script or 3rd party tool like terragrunt

So if the above project was more complex and you didn’t want to provide all the var files by hand, and external scripts had to be executed to prepare data before the tf run, you could just run a simple wrapper script like:

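#!/usr/bin/env bash
# usage: ./tf.sh <env> <terraform subcommand and args>
env=$1
shift

...
....
# 1000 more lines of code doing god knows what..


# pass the remaining args through untouched, then add the env-specific var file
terraform "$@" -var-file "conf/$env/var.tf"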

and run it as:

./tf.sh development plan
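
If the wrapper ever outgrows bash, the same idea is easy to sketch in Python. This is only an illustration, assuming the conf/$env/var.tf layout from the var-file approach. build_terraform_argv is a made-up helper name, not part of any tool.

```python
from typing import List


def build_terraform_argv(env: str, args: List[str]) -> List[str]:
    """Build the terraform command line the wrapper would run.

    Mirrors the shell wrapper: the user's args are passed through untouched
    and the env-specific var file is appended (assumed layout:
    conf/<env>/var.tf).
    """
    if not args:
        raise ValueError("expected a terraform subcommand, e.g. plan")
    return ["terraform", *args, "-var-file", f"conf/{env}/var.tf"]
```

Handing the resulting list to subprocess.call runs terraform without going through a shell, so arguments containing spaces survive as they are.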

4. Finally, let’s check terraform workspaces.

Firstly, just a few words on how to use workspaces (there is lots of info on the internet, so I will be super brief).

With tf workspaces we can define the flow as below:

1. auth to the aws prod account 
2. terraform workspace new production_client1_prd
3. (if already created then) terraform workspace select production_client1_prd
4. tf plan/apply
now switch to another env:
5. auth to aws dev account
6. terraform workspace new development_client1_dev1
7. (if already created then) terraform workspace select development_client1_dev1
8. tf plan/apply

As you can see, we first create a workspace, and then we can switch between different workspaces.
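
The "new if it doesn’t exist, select otherwise" decision can be scripted. Here is a rough sketch in Python, assuming the usual text output of terraform workspace list (one name per line, the current one marked with an asterisk). Both function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_workspace_list(output: str) -> Tuple[Optional[str], List[str]]:
    """Parse the text printed by `terraform workspace list`.

    Returns (current_workspace, all_workspace_names).
    """
    names: List[str] = []
    current = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("*"):
            # The asterisk marks the currently selected workspace.
            line = line[1:].strip()
            current = line
        names.append(line)
    return current, names


def workspace_command(name: str, existing: List[str]) -> List[str]:
    """Pick `workspace select` for a known workspace, `workspace new` otherwise."""
    action = "select" if name in existing else "new"
    return ["terraform", "workspace", action, name]
```

The returned list can be passed to subprocess.run as is, right after authenticating to the matching account.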

When we use tf workspaces, two key things happen which we can benefit from immediately:

1. tf automatically puts each environment under a different key
2. tf creates a variable to refer to each environment

So let’s say that without workspaces our state was stored at:
s3://bucket/key
now it automatically becomes:
s3://bucket/env:/$workspace/key

Now we don’t need to worry about different state files for different envs, as tf has automatically done it for us:

Say our tf backend config looks like:

terraform {
  backend "s3" {
    bucket = "tf-state"
    key    = "tf-network"
    region = "eu-west-1"
  }
}

then the tf state structure in s3 with tf workspaces will look like below:

s3://tf-state/env:
    /development_client_env1/tf-network
    /development_client_env2/tf-network
    /production_client_env1/tf-network
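
To make the key layout concrete, here is a tiny sketch in Python of how the object key is put together. It assumes the s3 backend’s default prefix of env: and that the default workspace keeps the plain key. state_object_key is just an illustrative name, not something terraform exposes.

```python
def state_object_key(key: str, workspace: str, prefix: str = "env:") -> str:
    """Return the S3 object key that holds the state of a workspace.

    Assumption: the default workspace uses the configured key as is, every
    other workspace gets <prefix>/<workspace>/<key>.
    """
    if workspace == "default":
        return key
    return f"{prefix}/{workspace}/{key}"
```

With the backend config above, state_object_key("tf-network", "development_client1_dev1") gives env:/development_client1_dev1/tf-network, which is exactly the kind of path the bucket policy later in the post matches on.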

The second feature is the implicitly defined workspace variable, terraform.workspace, which we can reference in the code.

Now let’s say we have a multi-account structure and each account can have multiple environments segregated by vpc. Then we can have the env-specific vpc data configured as in the var file below:

vpc_cidr = {
  "development_client1_dev1" : "10.0.0.0/16",
  "development_client1_poc1" : "10.1.0.0/16",
  "production_client2_prd1" : "10.2.0.0/16",
}

and then refer to env specific data:

module "network" {
  source               = "./modules/network"
  ...
  .....
  vpc_cidr             = lookup(var.vpc_cidr, terraform.workspace)
}

We simply get the vpc cidr from the map using the workspace name as a key. This way we don’t need multiple var files and can simply run terraform plan/apply without providing any extra var files per env.
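
One catch with a single map is that a workspace without an entry only blows up at plan time. A quick pre-flight check can catch that earlier. The sketch below is in Python and assumes you have the vpc_cidr map and the workspace names available as plain Python values (how you load them is up to you). check_vpc_cidrs is a hypothetical helper.

```python
import ipaddress
from typing import Dict, List


def check_vpc_cidrs(vpc_cidr: Dict[str, str], workspaces: List[str]) -> List[str]:
    """Return a list of human-readable problems, empty if all is fine.

    Checks that every workspace has an entry in the map and that every
    value is a valid network CIDR (no host bits set).
    """
    problems = []
    for ws in workspaces:
        if ws not in vpc_cidr:
            problems.append(f"{ws}: no vpc_cidr entry")
    for ws, cidr in sorted(vpc_cidr.items()):
        try:
            ipaddress.ip_network(cidr)
        except ValueError:
            problems.append(f"{ws}: invalid CIDR {cidr!r}")
    return problems
```

Running it in CI against the list of workspaces is a cheap way to notice a forgotten map entry before somebody hits it in prod.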

Security of the state file.

Now let’s get to the main problem: which account exactly should hold the state file, and how can we segregate the access?

Here comes the management account pattern: the stuff not related to the business environments always goes to a central management account.

Once we have the bucket defined in that extra account, which is not related to the environments mapped to our workspaces, all we need to do is set a proper s3 bucket policy that separates access to the state keys by environment, as below:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessToTerraformInStaging1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111111111111:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_AdministratorAccess_111"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::tf-state"
        },
        {
            "Sid": "AccessToTerraformInStaging2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111111111111:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_AdministratorAccess_111"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::tf-state/env:/staging_*"
        },
        
        {
            "Sid": "AccessToTerraformInPrd1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::222222222222:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_AdministratorAccess_222"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::tf-state"
        },
        {
            "Sid": "AccessToTerraformInPrd2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::222222222222:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_AdministratorAccess_222"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::tf-state/env:/production_client1_*"
        },
        {
            "Sid": "AccessToTerraformInPrd3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::222222222222:role/aws-reserved/sso.amazonaws.com/eu-west-1/AWSReservedSSO_AdministratorAccess_3333"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::tf-state/env:/production_client2_*"
        },

and so on..

As you can see, each account can list everything but can read/write only its own environment keys, which prevents us from fetching data from a different env. We can segregate on the account level as well as on the client level, should we have multiple client envs per account (although that is not the best approach).

You can list all of your envs thanks to the s3:ListBucket permission we set for each account, but you won’t be able to do anything else with the state files of other envs.
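
Since the statements are so repetitive, the policy can be generated instead of hand-edited. Here is a minimal sketch in Python that builds the same shape of policy from a mapping of role ARN to workspace name patterns. The function name, the Sid naming and the example ARN are made up. It also assumes the default env: key prefix.

```python
import json
from typing import Dict, List


def state_bucket_policy(bucket: str, access: Dict[str, List[str]]) -> dict:
    """Build a bucket policy: each role may list the bucket, but may only
    read/write/delete state objects of its own workspaces.

    access maps a role ARN to workspace name patterns, e.g. ["staging_*"].
    """
    statements = []
    for i, (role_arn, patterns) in enumerate(sorted(access.items()), start=1):
        statements.append({
            "Sid": f"List{i}",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
        })
        statements.append({
            "Sid": f"ReadWrite{i}",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/env:/{p}" for p in patterns],
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical role ARN, for illustration only.
    example = {"arn:aws:iam::111111111111:role/example-staging-admin": ["staging_*"]}
    print(json.dumps(state_bucket_policy("tf-state", example), indent=4))
```

Keeping the mapping in one place makes it harder to end up with a role that can list the bucket but has no matching read/write statement, or the other way around.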

terraform workspace list
  default
* development_client1_dev1
  production_client1_prd

If you auth to dev but switch to the prod workspace, tf will fail with an s3 bucket access error, thus preventing us from accidentally messing things up.
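
The s3 error is a good safety net, but it is not the friendliest message. If you want a clearer failure before terraform even starts, a small guard can compare the workspace name with the account you are logged in to. The Python sketch below assumes workspace names start with a prefix that identifies the account. The prefix-to-account map, the account ids and the function name are all hypothetical.

```python
from typing import Dict


def workspace_matches_account(workspace: str, account_id: str,
                              prefixes: Dict[str, str]) -> bool:
    """Return True if the workspace belongs to the given AWS account.

    prefixes maps a workspace name prefix to the account id that owns it.
    Unknown prefixes are treated as a mismatch, to stay on the safe side.
    """
    for prefix, expected in prefixes.items():
        if workspace.startswith(prefix):
            return account_id == expected
    return False
```

The two inputs can come from terraform workspace show and aws sts get-caller-identity --query Account --output text, so the check fits naturally at the top of a wrapper or a CI job.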