Self-Hosted NAT Gateway

CloudTune's self-hosted NAT gateway replaces AWS Managed NAT Gateway with a cost-optimized EC2 instance running on ARM64 (Graviton). This guide covers deployment via CloudFormation, instance sizing, observability, and network considerations.

Overview

The CloudTune NAT Gateway provides:

Up to 80% cost reduction compared to AWS Managed NAT Gateway
Pre-tuned for 200,000+ concurrent connections with optimized kernel settings
Automatic route table management via Lambda and instance-level scripts
Built-in CloudWatch metrics for network, CPU, memory, and connection tracking
ARM64/Graviton instances for best price-performance ratio

Deployment

CloudFormation Parameters

Deploy using the cloudtune-nat-gateway.yaml template. You'll need to gather the following values from your AWS environment:

Required Parameters

Parameter	How to Find	Description
`VpcId`	AWS Console → VPC → Your VPCs	The VPC where the NAT gateway will be deployed
`PublicSubnetId`	AWS Console → VPC → Subnets → Filter by "public"	A public subnet with an Internet Gateway route
`KeyPairName`	AWS Console → EC2 → Key Pairs	SSH key pair for instance access
`RouteTableIds`	AWS Console → VPC → Route Tables	Comma-separated list of private subnet route tables to update

Finding Route Table IDs:

# List all route tables in your VPC
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
  --query 'RouteTables[*].[RouteTableId,Tags[?Key==`Name`].Value|[0]]' \
  --output table

Look for route tables associated with your private subnets (those without a direct 0.0.0.0/0 route to an Internet Gateway).
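The "private" test described above can also be applied programmatically to the JSON returned by describe-route-tables. The following is a minimal sketch, not part of the CloudTune tooling: the function name and the sample data are hypothetical, and it only inspects explicit routes (it does not check subnet associations).

```python
def is_private_route_table(route_table: dict) -> bool:
    """Return True if the table has no 0.0.0.0/0 route to an Internet Gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        via_igw = route.get("GatewayId", "").startswith("igw-")
        if is_default and via_igw:
            return False
    return True


# Hypothetical sample shaped like `aws ec2 describe-route-tables --output json`
sample_tables = [
    {"RouteTableId": "rtb-public", "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
    ]},
    {"RouteTableId": "rtb-private", "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
    ]},
]

private_ids = [t["RouteTableId"] for t in sample_tables if is_private_route_table(t)]
print(private_ids)
```

The resulting IDs are candidates for the RouteTableIds parameter; confirm each one against its subnet associations before using it.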

Optional Parameters

Parameter	Default	Description
`InstanceType`	`t4g.small`	EC2 instance type (see sizing guide below)
`AutomaticallyUpdateRouteTables`	`true`	Automatically update route tables on instance launch
`EnableCloudWatchMonitoring`	`true`	Enable custom CloudWatch metrics
`CloudWatchInterval`	`60`	Metric collection interval in seconds (30-300)
`VolumeSize`	`30`	EBS volume size in GB (20-100)

Instance Sizing Guide

Before selecting an instance type, assess your current NAT gateway usage to right-size the replacement.

Measuring Current Throughput

Query your existing AWS Managed NAT Gateway metrics to understand your throughput requirements:

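# Find the busiest 15-minute window (total bytes) over the last 7 days.
# A 900-second period keeps the request at 672 datapoints, under the
# 1,440-datapoint limit of get-metric-statistics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-xxxxxxxxxxxxxxxxx \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 900 \
  --statistics Sum \
  --query 'sort_by(Datapoints, &Sum)[-1]'

The date -v-7d syntax is for macOS/BSD. On Linux (GNU date), use date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ instead. To get average bytes per second for the window, divide Sum by the period (900).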

Key Metrics to Check

Metric	What It Tells You	Sizing Impact
`BytesOutToDestination`	Outbound throughput to internet	Primary sizing factor
`BytesInFromDestination`	Inbound response traffic	Can exceed outbound for download-heavy workloads; check both directions
`ConnectionEstablishedCount`	New connections established per minute	Affects memory requirements
`PacketsDropCount`	Dropped packets (capacity issues)	If non-zero, size up

Quick Throughput Assessment

Run this to get a summary of your NAT gateway usage:

# Replace nat-xxxxxxxxxxxxxxxxx with your NAT Gateway ID
NAT_ID="nat-xxxxxxxxxxxxxxxxx"

# Peak throughput: bytes in the busiest 15-minute window, converted to
# average bits per second (short bursts inside the window are smoothed out)
echo "=== Peak Throughput (last 7 days) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=$NAT_ID \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 900 \
  --statistics Sum \
  --query 'max(Datapoints[].Sum)' \
  --output text | awk '{printf "%.2f Gbps\n", ($1 * 8) / 1000000000 / 900}'

# Peak new connections: highest value reported in any sample, using hourly
# buckets to stay under the 1,440-datapoint limit per request
echo "=== Peak Connections (last 7 days) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name ConnectionEstablishedCount \
  --dimensions Name=NatGatewayId,Value=$NAT_ID \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Maximum \
  --query 'max(Datapoints[].Maximum)' \
  --output text
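Two pieces of arithmetic sit behind these queries: how many datapoints a time range and period produce, and how a byte total becomes a line rate. The sketch below works through both. The function names are illustrative, and the conversion assumes the datapoint is a Sum of bytes over the whole period. CloudWatch's GetMetricStatistics returns at most 1,440 datapoints per call, so a 7-day range needs a period above 300 seconds.

```python
MAX_DATAPOINTS = 1440  # per-call limit of CloudWatch GetMetricStatistics


def datapoint_count(days: int, period_seconds: int) -> int:
    """Number of datapoints a query spanning `days` would request."""
    return days * 86400 // period_seconds


def sum_bytes_to_gbps(total_bytes: float, period_seconds: int) -> float:
    """Average rate in Gbps for `total_bytes` transferred during one period."""
    return total_bytes * 8 / period_seconds / 1e9


for period in (60, 300, 900, 3600):
    count = datapoint_count(7, period)
    verdict = "ok" if count <= MAX_DATAPOINTS else "too many"
    print(f"period={period}s -> {count} datapoints ({verdict})")

# 56.25 GB moved in a 15-minute window averages out to 0.5 Gbps
print(sum_bytes_to_gbps(56_250_000_000, 900))
```

Because the result is an average over the period, a shorter time range with a smaller period gives a closer view of short bursts.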

Sizing Rule of Thumb

Your Peak Throughput	Your Peak Connections/min	Recommended Starting Point
Under 500 Mbps	Under 10,000	`t4g.small`
500 Mbps - 2 Gbps	10,000 - 50,000	`t4g.medium` or `t4g.large`
2 - 5 Gbps	50,000 - 100,000	`c7gn.medium` or `c7gn.large`
5 - 20 Gbps	Over 100,000	`c7gn.xlarge` or larger
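The table can be read as a lookup: find the tier for throughput, find the tier for connections, and take the higher of the two. A minimal sketch of that reading follows. The function name is hypothetical, and the choice to place boundary values (for example exactly 500 Mbps) in the higher tier is an assumption, since the table does not say.

```python
from bisect import bisect_right

# Upper bounds of the first three tiers, taken from the table above
THROUGHPUT_BOUNDS_MBPS = [500, 2000, 5000]
CONNECTION_BOUNDS_PER_MIN = [10_000, 50_000, 100_000]
RECOMMENDATIONS = [
    "t4g.small",
    "t4g.medium or t4g.large",
    "c7gn.medium or c7gn.large",
    "c7gn.xlarge or larger",
]


def recommend_instance(peak_mbps: float, peak_connections_per_min: int) -> str:
    """Return the starting point for the more demanding of the two measurements."""
    throughput_tier = bisect_right(THROUGHPUT_BOUNDS_MBPS, peak_mbps)
    connection_tier = bisect_right(CONNECTION_BOUNDS_PER_MIN, peak_connections_per_min)
    return RECOMMENDATIONS[max(throughput_tier, connection_tier)]


print(recommend_instance(300, 5_000))
print(recommend_instance(300, 60_000))
print(recommend_instance(8_000, 1_000))
```

Treat the output as a starting point only; the monitoring metrics described later in this guide are the real test of whether the instance is large enough.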

The NAT gateway supports two instance families optimized for different workloads:

T4G Family (Burstable, General Purpose)

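Best for variable traffic patterns with occasional bursts. "Up to 5 Gbps" is burst bandwidth; the sustained baseline is lower, especially on the smaller sizes: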

Instance	vCPU	RAM	Network	Recommended For
`t4g.nano`	2	0.5 GB	Up to 5 Gbps	Development/testing
`t4g.micro`	2	1 GB	Up to 5 Gbps	Low traffic
`t4g.small`	2	2 GB	Up to 5 Gbps	Light production (under 1 Gbps)
`t4g.medium`	2	4 GB	Up to 5 Gbps	Medium traffic (1-2 Gbps)
`t4g.large`	2	8 GB	Up to 5 Gbps	Higher traffic (2-5 Gbps)
`t4g.xlarge`	4	16 GB	Up to 5 Gbps	High connection count
`t4g.2xlarge`	8	32 GB	Up to 5 Gbps	Very high connection count

C7gn Family (Network-Optimized)

Best for sustained high-throughput workloads:

Instance	vCPU	RAM	Network	Recommended For
`c7gn.medium`	1	2 GB	Up to 25 Gbps	Sustained medium traffic
`c7gn.large`	2	4 GB	Up to 30 Gbps	High throughput (5-10 Gbps)
`c7gn.xlarge`	4	8 GB	Up to 40 Gbps	Very high throughput (10-20 Gbps)
`c7gn.2xlarge`	8	16 GB	Up to 50 Gbps	Extreme throughput (20-40 Gbps)
`c7gn.4xlarge`	16	32 GB	50 Gbps	Maximum throughput

Sizing Recommendations:

Start with t4g.small for most workloads
Monitor TCPEstablished and NetworkTxBytes metrics
Upgrade to C7gn if you see sustained high network utilization
Consider t4g.medium or larger if connection count exceeds 50,000

Deploy via AWS CLI

# The comma inside RouteTableIds is escaped (\\,) so the CLI passes the
# whole list as one string instead of splitting it into separate values
aws cloudformation create-stack \
  --stack-name cloudtune-nat-gateway \
  --template-body file://cloudtune-nat-gateway.yaml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=vpc-0abc123def456789 \
    ParameterKey=PublicSubnetId,ParameterValue=subnet-0abc123def456789 \
    ParameterKey=RouteTableIds,ParameterValue=rtb-111111111\\,rtb-222222222 \
    ParameterKey=InstanceType,ParameterValue=t4g.small \
    ParameterKey=KeyPairName,ParameterValue=my-key-pair \
  --capabilities CAPABILITY_IAM

Deploy via AWS Console

Navigate to CloudFormation → Create Stack
Upload cloudtune-nat-gateway.yaml
Fill in the parameters using the guidance above
Check "I acknowledge that AWS CloudFormation might create IAM resources"
Create stack

Observability

CloudWatch Metrics

When EnableCloudWatchMonitoring is enabled (default), the NAT gateway publishes metrics to the NATGateway namespace with InstanceId as a dimension.

Network Metrics

Metric	Unit	Description
`NetworkRxBytes`	Bytes	Received bytes on primary interface
`NetworkTxBytes`	Bytes	Transmitted bytes on primary interface
`NetworkRxPackets`	Count	Received packet count
`NetworkTxPackets`	Count	Transmitted packet count

System Metrics

Metric	Unit	Description
`CPUUtilization`	Percent	CPU usage (0-100)
`MemoryUtilization`	Percent	RAM usage (0-100)
`DiskUtilization`	Percent	Root filesystem usage (0-100)
`InstanceOnline`	Count	Always 1 when running (heartbeat)

Connection Metrics

Metric	Unit	Description
`TCPEstablished`	Count	Active TCP connections
`TCPClose`	Count	Closed TCP connections
`TCPTimeWait`	Count	Connections in TIME-WAIT state

Creating CloudWatch Alarms

High CPU Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-CPU" \
  --metric-name CPUUtilization \
  --namespace NATGateway \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

High Memory Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-Memory" \
  --metric-name MemoryUtilization \
  --namespace NATGateway \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

Connection Count Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-Connections" \
  --metric-name TCPEstablished \
  --namespace NATGateway \
  --statistic Maximum \
  --period 60 \
  --threshold 100000 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

TIME-WAIT Buildup Alarm

A sustained high TIME-WAIT count usually reflects many short-lived connections and can precede port or connection-table exhaustion:

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-TimeWait-Buildup" \
  --metric-name TCPTimeWait \
  --namespace NATGateway \
  --statistic Maximum \
  --period 300 \
  --threshold 50000 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

Recommended Alert Thresholds

Metric	Warning	Critical	Action
`CPUUtilization`	70%	85%	Consider larger instance
`MemoryUtilization`	75%	90%	Upgrade instance type
`TCPEstablished`	75,000	150,000	Scale or add instances
`TCPTimeWait`	30,000	50,000	Check for connection leaks
`DiskUtilization`	70%	85%	Check log rotation
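The thresholds above map directly onto a small classifier, which can be handy in a dashboard or a custom health check. This is a sketch only: the function name is hypothetical, and treating a value equal to the threshold as having crossed it is an assumption.

```python
# (warning, critical) pairs copied from the table above
THRESHOLDS = {
    "CPUUtilization": (70, 85),
    "MemoryUtilization": (75, 90),
    "TCPEstablished": (75_000, 150_000),
    "TCPTimeWait": (30_000, 50_000),
    "DiskUtilization": (70, 85),
}


def severity(metric: str, value: float) -> str:
    """Classify a metric reading as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


print(severity("CPUUtilization", 72))
print(severity("TCPEstablished", 160_000))
print(severity("DiskUtilization", 40))
```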

Adjusting Collection Interval

The metric collection interval can be adjusted via SSM Parameter Store without redeploying:

# Set to 30 seconds for higher granularity (more CloudWatch cost)
aws ssm put-parameter \
  --name "/natgw/cloudwatch/interval" \
  --value "30" \
  --type String \
  --overwrite

Valid values: 30-300 seconds. Lower intervals provide more granular data but increase CloudWatch costs.
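To see how the interval drives CloudWatch volume, the sketch below counts datapoints published per day. It assumes each of the 11 custom metrics listed in the tables above is published once per interval, which is an assumption about the collection script rather than something this guide states, and it applies no prices; multiply by your region's custom metric and API rates.

```python
METRIC_COUNT = 11  # 4 network + 4 system + 3 connection metrics listed above
SECONDS_PER_DAY = 86_400


def datapoints_per_day(interval_seconds: int, metric_count: int = METRIC_COUNT) -> int:
    """Datapoints published per instance per day at the given interval."""
    if not 30 <= interval_seconds <= 300:
        raise ValueError("interval must be between 30 and 300 seconds")
    return (SECONDS_PER_DAY // interval_seconds) * metric_count


for interval in (30, 60, 300):
    print(interval, datapoints_per_day(interval))
```

Halving the interval from 60 to 30 seconds doubles the number of datapoints published.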

Network Disruption During Deployment

Will there be a network blip?

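Short answer: Yes. In the default single-instance configuration, expect roughly 60 seconds without NAT connectivity during instance replacement. Most of that time is the new instance booting; the route table update itself takes only 2-3 seconds.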

Route Table Update Mechanism

The system uses two mechanisms to update route tables:

Instance-level updates (pull model): When a new instance launches, systemd services automatically update route tables within 2-3 seconds of boot
Lambda-triggered updates (push model): Auto Scaling Group SNS notifications trigger a Lambda function that updates all route tables tagged with UseCustomNAT=true

Timeline During Instance Replacement

Time	Event	Network Impact
T+0s	Old instance terminates	Routes point to terminated instance
T+30-60s	New instance launching	No NAT available
T+60-63s	New instance runs route update	Routes updated atomically
T+63s+	Normal operation	Full connectivity

During the ~60 second window:

New outbound TCP connections will fail
Existing connections may time out or receive RST
UDP traffic will be dropped
After route update, new connections succeed immediately

Minimizing Disruption

Option 1: Multi-Instance Deployment (Recommended for Production)

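Deploy multiple NAT instances and divide your private route tables among them (for example, one instance per Availability Zone). A route table's default route can target only one instance at a time: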

# Modify the ASG to run 2 instances
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name cloudtune-nat-gateway-asg \
  --min-size 2 \
  --desired-capacity 2

With 2+ instances using rolling updates, the old instance continues serving traffic while the new instance starts, achieving near-zero downtime.

Option 2: Scheduled Maintenance Windows

For single-instance deployments, schedule updates during low-traffic periods:

# Add a recurring scheduled action (the cron expression is evaluated in UTC by default)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cloudtune-nat-gateway-asg \
  --scheduled-action-name "MaintenanceWindow" \
  --recurrence "0 4 * * SUN" \
  --desired-capacity 1

Option 3: Application-Level Retry Logic

Ensure applications connecting through the NAT gateway have retry logic for transient network failures:

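# Example: requests with retry
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry failed requests up to 3 times with exponential backoff
retries = Retry(total=3, backoff_factor=0.5)
# Apply the retry policy to both HTTPS and HTTP requests
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))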
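It is worth checking whether a retry policy actually spans the ~60 second replacement window shown in the timeline. The sketch below uses a simplified exponential backoff model (delay n = backoff_factor × 2^n, capped); urllib3's exact schedule differs slightly between versions, so treat the numbers as approximate. The function names are illustrative.

```python
def backoff_delays(retries: int, backoff_factor: float, cap: float = 120.0) -> list:
    """Sleep time before each retry under a simple exponential model."""
    return [min(cap, backoff_factor * 2 ** n) for n in range(retries)]


def min_retries_to_cover(window_seconds: float, backoff_factor: float,
                         cap: float = 120.0) -> int:
    """Smallest retry count whose cumulative sleep reaches the window."""
    if backoff_factor <= 0:
        raise ValueError("backoff_factor must be positive")
    total, retries = 0.0, 0
    while total < window_seconds:
        total += min(cap, backoff_factor * 2 ** retries)
        retries += 1
    return retries


print(sum(backoff_delays(3, 0.5)))       # cumulative sleep for total=3
print(min_retries_to_cover(60, 0.5))     # retries needed to span 60 seconds
```

Under this model, three retries with a 0.5 backoff factor sleep for only a few seconds in total, so requests that start early in the window would still fail. A higher retry count or larger backoff factor is needed to ride out a full replacement; connection timeouts add some extra margin that the model ignores.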

Route Table Tagging

For Lambda-based route updates to work, tag your private subnet route tables:

aws ec2 create-tags \
  --resources rtb-xxxxxxxxx \
  --tags Key=UseCustomNAT,Value=true

Only route tables with this tag will be automatically updated by the Lambda function.
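To preview which route tables the Lambda function would pick up, the same tag filter can be applied to describe-route-tables output. This sketch is illustrative and not the Lambda's actual code: the function name and sample data are hypothetical, and the exact, case-sensitive match on the tag value is an assumption.

```python
def tagged_for_custom_nat(route_tables: list) -> list:
    """Return IDs of route tables carrying the tag UseCustomNAT=true."""
    selected = []
    for table in route_tables:
        tags = {tag["Key"]: tag["Value"] for tag in table.get("Tags", [])}
        if tags.get("UseCustomNAT") == "true":
            selected.append(table["RouteTableId"])
    return selected


# Hypothetical sample shaped like `aws ec2 describe-route-tables --output json`
sample = [
    {"RouteTableId": "rtb-aaa", "Tags": [{"Key": "UseCustomNAT", "Value": "true"}]},
    {"RouteTableId": "rtb-bbb", "Tags": [{"Key": "Name", "Value": "public"}]},
    {"RouteTableId": "rtb-ccc"},
]

print(tagged_for_custom_nat(sample))
```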

Troubleshooting

Verify NAT Instance Health

# Check instance is running
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*nat*" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]'

# Verify source/dest check is disabled
aws ec2 describe-instance-attribute \
  --instance-id i-xxxxxxxxx \
  --attribute sourceDestCheck

Verify Route Tables

# Check route table has correct NAT route
aws ec2 describe-route-tables \
  --route-table-ids rtb-xxxxxxxxx \
  --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
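When reading the result, two fields matter: the route's target and its State. EC2 marks a route as blackhole when its target no longer exists, which is what you would see while a terminated NAT instance is still referenced. A minimal sketch of that check follows, with a hypothetical function name and sample data, assuming the route targets the instance directly so that InstanceId is populated.

```python
def default_route_ok(routes: list, expected_instance_id: str) -> bool:
    """True if the 0.0.0.0/0 route is active and targets the expected instance."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        return (route.get("State") == "active"
                and route.get("InstanceId") == expected_instance_id)
    return False  # no default route at all


# Hypothetical routes shaped like the describe-route-tables "Routes" list
healthy = [{"DestinationCidrBlock": "0.0.0.0/0", "InstanceId": "i-0new", "State": "active"}]
stale = [{"DestinationCidrBlock": "0.0.0.0/0", "InstanceId": "i-0old", "State": "blackhole"}]

print(default_route_ok(healthy, "i-0new"))
print(default_route_ok(stale, "i-0new"))
```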

Check Instance Logs

Connect via Session Manager or SSH:

# View NAT service status
sudo systemctl status pat-routing

# Check route update logs
sudo journalctl -u update-aws-route-table@rtb-xxxxxxxxx

# View CloudWatch monitoring logs
sudo journalctl -u cloudwatch-monitoring

# Check current connections
sudo conntrack -L | wc -l

Common Issues

Symptom	Cause	Solution
No outbound connectivity	Route table not updated	Check Lambda logs, verify tagging
Intermittent failures	Instance undersized	Upgrade instance type
High latency	CPU saturation	Monitor CPUUtilization, upgrade
Connection timeouts	Connection limit reached	Check TCPEstablished metric

Security Considerations

The NAT gateway instance is deployed with:

Security Group: Allows all ingress from 10.0.0.0/8, all egress to 0.0.0.0/0
IAM Role: Minimal permissions for EC2, Route Tables, CloudWatch, and SSM
EBS Encryption: Enabled by default
SSM Session Manager: Enabled for secure shell access without SSH keys

To restrict the security group further, modify the CloudFormation template's SecurityGroup resource.