Creating a true private network in EC2 with network address translation for ingress and egress.

In brief

The EC2 Nat Gateway doesn’t support ingress traffic and the network load balancer doesn’t support egress traffic. If you want both directions you have to use both (or a public IP).

Schematic

The plan is to create a private subnet 10.0.2.0/24 with our machine(s) under test and a public subnet 10.0.1.0/24 with other test components. All machines need to access the internet but the machines in 10.0.2.0/24 are really restricted in what they can do.

schematic

The load balancer configuration is the only really unusual (i.e. not what you’d normally see in the cloud) deployment:

Each machine under test is its own target group
A mapping exists for various C&C ports to each group

Why?!

We needed a test network that matched, very very closely, and existing network so that the legacy application could be tested on EC2 exactly as if it were running on-prem. Messy… but its a stepping stone to breaking this particular monolith. Our steps are:

Setup an Ansible stack so that everything is reproducible
Construct an environment in EC2 that matches prod as closely as possible
Load the existing application (with latest data) into the environment
Test the application in the environment automatically

Now, any changes to the application during its decomposition can be made after step 4:

Make the change
Hit the big red test button

(Should that button be green? Its not risky anymore!)

Steps

Creating this is pretty easy, the only real gotcha is that you must have a public subnet and internet gateway - otherwise the NAT Gateway has no egress route.

Create a VPC (10.0.0.0/16) with public (10.0.1.0/24) and private (10.0.2.0/24) subnets
Create an internet and a NAT gateway
Create the routing tables
Create a target group for ingress targets and a load balancer
Create the security groups and machines

1. Constructing the VPC and subnets

Constructing the VPC and subnets is pretty easy. The following in the playbook will do it:

- name: Create VPC
  ec2_vpc_net:
      state:          'present'
      name:           'Test VPC'
      cidr_block:     '10.0.0.0/16'
      region:         'eu-west-1'
  register: result_vpc

- name: Create public subnet
  ec2_vpc_subnet:
    state:            'present'
    vpc_id:           "{{ result_vpc.vpc.id }}"
    cidr:             '10.0.1.0/16'
    az:               'eu-west-1a'
    region:           'eu-west-1'
    map_public:       true
  register: result_public_subnet

- name: Create private subnet
  ec2_vpc_subnet:
    state:            'present'
    vpc_id:           "{{ result_vpc.vpc.id }}"
    cidr:             '10.0.2.0/16'
    az:               'eu-west-1a'
    region:           'eu-west-1'
    map_public:       false
  register: result_private_subnet

2. Construct the gateways

Constructing the gateways is pretty simple too. …but… we must have a public subnet for the NAT gateway to work.

- name: Create Internet Gateway for VPC
  ec2_vpc_igw:
     state:           'present'
     vpc_id:          "{{ result_vpc.vpc.id }}"
     region:          'eu-west-1'
  register: result_igw

- name: Create NAT Gateway
  ec2_vpc_nat_gateway:
    state:                  'present'
    subnet_id:              "{{ result_public_subnet.subnet.id }}"
    wait:                   yes
    if_exist_do_not_create: true
    release_eip:            true
  register: result_nat_gateway

Adding in release_eip and if_exist_do_not_create keeps things nice and neat as the gateway creation is idempotent and when we destroy it we get rid of the EIP too.

3. Create the routing tables

This is pretty easy. As we control all the machines we don’t need to worry about particular routes, we just tell the machines how to access the outside world:

- name: Set up the public subnet route table
  ec2_vpc_route_table:
    vpc_id:           "{{ result_vpc.vpc.id }}"
    region:           'eu-west-1'
    subnets:          "{{ result_public_subnet.subnet.id }}"
    routes:
      - dest:         '0.0.0.0/0'
        gateway_id:   "{{ result_igw.gateway_id }}"
  register: result_public_route

- name: Set up private subnet route table
  ec2_vpc_route_table:
    vpc_id:           "{{ result_vpc.vpc.id }}"
    region:           'eu-west-1'
    subnets:          "{{ result_private_subnet.subnet.id }}"
    routes:
      - dest:         '0.0.0.0/0'
        gateway_id:   "{{ result_nat_gateway.nat_gateway_id }}"
  register: result_private_route

4. Construct the target groups and load balancer

This is where things get hairy! We defined the machines as a list in Ansible so we can loop over them to make the target groups and fit them into the balancer:

machines:
  - name: machine1
    address: 10.0.2.101
    routes:
        - from_port: 80
          to_port: 80
        - from_port: 43000
          to_port: 22
  - name: machine2
    address: 10.0.2.102
    routes:
        - from_port: 443
          to_port: 443
        - from_port: 43001
          to_port: 22

Now we “just” have to make a bunch of target groups:

- name: Create website target group
  elb_target_group:
    name:              "{{ item.0.name + '-' + item.1.from_port|string }}"
    protocol:          'tcp'
    port:              "{{ item.1.from_port }}"
    vpc_id:            "{{ result_vpc.vpc.id }}"
    target_type:       'ip'
    targets:
      - Id:            "{{ item.0.address }}"
        Port:          "{{ item.1.to_port }}"
    state:             present
  loop: "{{ machines | subelements('routes') | list }}"

…and we “just” have to supply these as listeners to the load balancer:

- name: Set listeners fact
  set_fact:
    listeners: >-
      {{ (listeners | default([])) + [{
        'Protocol':          'tcp',
        'Port':              item.1.from_port,
        'DefaultActions': {
          'Type':            'forward',
          'TargetGroupName': item.0.name + '-' + item.1.from_port|string
        }
      }] }}
  loop: "{{ machines | subelements('routes') | list }}"

- name: Create network load balancer
  elb_network_lb:
    state:                   'present'
    name:                    'locallb'
    subnets:                 "{{ result_public_subnet.subnet.id }}"
    listeners:               "{{ listeners }}"
  register: result_network_lb

Jinja2 templating

This looks a bit crazy at first but the process isn’t too bad once its broken down:

machines:
  - name: machine1
    routes:
      - port: 1
      - port: 2
  - name: machine2
    routes:
      - port: 3
      - port: 4

to:

- 
  - name: machine1
    routes:
      - port: 1
      - port: 2
  - port: 1
-
  - name: machine1
    ...
  - port: 2
-
  - name: machine2
    ...
  - port: 3
...

So item.0.name = machine1..machine2 and item.1.port = 1..4. Creating the listeners fact uses a handy Ansible pattern where we can default an undefined value to an empty list (or dictionary):

- set_fact:
    variable: >-
      {{ (variable | default([])) + [...] }}
  loop: " {{ ... }}"

so we can build up the list of listeners by iterating over the machines and ports.

Does it work?

It does! We can construct a Virtual Private Cloud with a completely controlled private network in just over 30 seconds. If the machines are available as AMIs then the whole test process can be completed in a matter of minutes and torn down afterwards.