Self-Hosted NAT Gateway
by Caleb Geene, Platform Engineer
CloudTune's self-hosted NAT gateway replaces AWS Managed NAT Gateway with a cost-optimized EC2 instance running on ARM64 (Graviton). This guide covers deployment via CloudFormation, instance sizing, observability, and network considerations.
Overview
The CloudTune NAT Gateway provides:
- Up to 80% cost reduction compared to AWS Managed NAT Gateway
- Pre-tuned for 200,000+ concurrent connections with optimized kernel settings
- Automatic route table management via Lambda and instance-level scripts
- Built-in CloudWatch metrics for network, CPU, memory, and connection tracking
- ARM64/Graviton instances for best price-performance ratio
Deployment
CloudFormation Parameters
Deploy using the cloudtune-nat-gateway.yaml template. You'll need to gather the following values from your AWS environment:
Required Parameters
| Parameter | How to Find | Description |
|---|---|---|
| VpcId | AWS Console → VPC → Your VPCs | The VPC where the NAT gateway will be deployed |
| PublicSubnetId | AWS Console → VPC → Subnets → Filter by "public" | A public subnet with an Internet Gateway route |
| KeyPairName | AWS Console → EC2 → Key Pairs | SSH key pair for instance access |
| RouteTableIds | AWS Console → VPC → Route Tables | Comma-separated list of private subnet route tables to update |
Finding Route Table IDs:
# List all route tables in your VPC
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
--query 'RouteTables[*].[RouteTableId,Tags[?Key==`Name`].Value|[0]]' \
--output table
Look for route tables associated with your private subnets (those without a direct 0.0.0.0/0 route to an Internet Gateway).
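The same filter can be applied programmatically. The sketch below assumes you saved the command's output with --output json and loaded it into a dict; the function name and the sample data are hypothetical, while the field names (RouteTables, Routes, DestinationCidrBlock, GatewayId) follow the describe-route-tables response shape.

```python
def private_route_tables(describe_output):
    """Return IDs of route tables that have no 0.0.0.0/0 route to an Internet Gateway."""
    private_ids = []
    for table in describe_output.get("RouteTables", []):
        has_igw_default = any(
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
            for route in table.get("Routes", [])
        )
        if not has_igw_default:
            private_ids.append(table["RouteTableId"])
    return private_ids


# Hypothetical sample shaped like the describe-route-tables JSON response.
sample = {
    "RouteTables": [
        {"RouteTableId": "rtb-public", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123"},
        ]},
        {"RouteTableId": "rtb-private-a", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0456"},
        ]},
        {"RouteTableId": "rtb-private-b", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        ]},
    ]
}

# Comma-separated, ready to paste into the RouteTableIds parameter.
print(",".join(private_route_tables(sample)))
```

Treat the result as a candidate list: unused or main route tables also pass this filter, so confirm each ID is associated with a private subnet before using it.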
Optional Parameters
| Parameter | Default | Description |
|---|---|---|
| InstanceType | t4g.small | EC2 instance type (see sizing guide below) |
| AutomaticallyUpdateRouteTables | true | Automatically update route tables on instance launch |
| EnableCloudWatchMonitoring | true | Enable custom CloudWatch metrics |
| CloudWatchInterval | 60 | Metric collection interval in seconds (30-300) |
| VolumeSize | 30 | EBS volume size in GB (20-100) |
Instance Sizing Guide
Before selecting an instance type, assess your current NAT gateway usage to right-size the replacement.
Measuring Current Throughput
Query your existing AWS Managed NAT Gateway metrics to understand your throughput requirements:
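# Find the busiest 10-minute window (total bytes) over the last 7 days.
# Divide the returned Sum by 600 to get bytes per second.
# A single call returns at most 1,440 datapoints, so a 7-day window
# needs a period of at least 420 seconds; 600 yields 1,008 datapoints.
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-xxxxxxxxxxxxxxxxx \
--start-time "$(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 600 \
--statistics Sum \
--query 'sort_by(Datapoints, &Sum)[-1]'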
The date -v-7d syntax is specific to macOS/BSD. On Linux (GNU date), use date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ for the start time instead.
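If you would rather not depend on either date dialect, the two timestamps can be generated portably. This is a minimal sketch using only the Python standard library; the function name metric_window is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def metric_window(days=7, now=None):
    """Return (start, end) UTC timestamps in the format the AWS CLI examples use."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


start_time, end_time = metric_window()
print(f"--start-time {start_time} --end-time {end_time}")
```

The printed flags can be pasted into the get-metric-statistics command in place of the two date substitutions.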
Key Metrics to Check
| Metric | What It Tells You | Sizing Impact |
|---|---|---|
| BytesOutToDestination | Outbound throughput to internet | Primary sizing factor |
| BytesInFromDestination | Inbound response traffic | Usually lower than outbound |
| ConnectionEstablishedCount | New connections established per period | Affects memory requirements |
| PacketsDropCount | Dropped packets (capacity issues) | If non-zero, size up |
Quick Throughput Assessment
Run this to get a summary of your NAT gateway usage:
# Replace nat-xxxxxxxxxxxxxxxxx with your NAT Gateway ID
NAT_ID="nat-xxxxxxxxxxxxxxxxx"
# A single call returns at most 1,440 datapoints, so a 7-day window
# uses 600-second periods (1,008 datapoints) and the Sum statistic.
# Peak throughput: bytes in the busiest 10 minutes, converted to Gbps
echo "=== Peak Throughput (last 7 days) ==="
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value="$NAT_ID" \
--start-time "$(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 600 \
--statistics Sum \
--query 'max(Datapoints[].Sum)' \
--output text | awk '{printf "%.2f Gbps\n", ($1 * 8) / 1000000000 / 600}'
# Peak connections: busiest 10 minutes, averaged to connections per minute
echo "=== Peak Connections (last 7 days) ==="
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name ConnectionEstablishedCount \
--dimensions Name=NatGatewayId,Value="$NAT_ID" \
--start-time "$(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 600 \
--statistics Sum \
--query 'max(Datapoints[].Sum)' \
--output text | awk '{printf "%.0f connections/min\n", $1 / 10}'
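The unit conversion in the awk step is easy to get wrong, so here it is as a worked example. It assumes the datapoint is the total number of bytes transferred during one period; the function name is hypothetical.

```python
def bytes_to_gbps(total_bytes, period_seconds):
    """Average throughput in Gbps for total_bytes transferred over period_seconds."""
    bits = total_bytes * 8
    return bits / period_seconds / 1_000_000_000


# 37.5 GB moved in a 5-minute period averages out to 1 Gbps.
print(f"{bytes_to_gbps(37_500_000_000, 300):.2f} Gbps")
# The same rate over a 10-minute period means twice the bytes.
print(f"{bytes_to_gbps(75_000_000_000, 600):.2f} Gbps")
```

Because this is an average over the whole period, short bursts inside the period can be noticeably higher than the figure reported.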
Sizing Rule of Thumb
| Your Peak Throughput | Your Peak Connections/min | Recommended Starting Point |
|---|---|---|
| Under 500 Mbps | Under 10,000 | t4g.small |
| 500 Mbps - 2 Gbps | 10,000 - 50,000 | t4g.medium or t4g.large |
| 2 - 5 Gbps | 50,000 - 100,000 | c7gn.medium or c7gn.large |
| 5 - 20 Gbps | Over 100,000 | c7gn.xlarge or larger |
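The table can be read as a lookup. The sketch below encodes it under two assumptions that the table itself does not state: when throughput and connection count fall in different rows, the larger row wins, and a value exactly on a boundary moves up to the next row. The function name is hypothetical.

```python
def recommend_instance(peak_gbps, peak_connections_per_min):
    """Map measured peaks to the starting point from the sizing table."""
    # (throughput ceiling in Gbps, connections/min ceiling, recommendation)
    tiers = [
        (0.5, 10_000, "t4g.small"),
        (2.0, 50_000, "t4g.medium or t4g.large"),
        (5.0, 100_000, "c7gn.medium or c7gn.large"),
    ]
    for max_gbps, max_connections, recommendation in tiers:
        # Both measurements must fit under the row's ceilings.
        if peak_gbps < max_gbps and peak_connections_per_min < max_connections:
            return recommendation
    return "c7gn.xlarge or larger"


print(recommend_instance(0.3, 5_000))
print(recommend_instance(0.3, 60_000))
print(recommend_instance(8.0, 1_000))
```

The second call shows why both measurements matter: modest throughput with a high connection rate still lands in a larger tier.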
The NAT gateway supports two instance families optimized for different workloads:
T4G Family (Burstable, General Purpose)
Best for variable traffic patterns with occasional bursts:
| Instance | vCPU | RAM | Network | Recommended For |
|---|---|---|---|---|
| t4g.nano | 2 | 0.5 GB | Up to 5 Gbps | Development/testing |
| t4g.micro | 2 | 1 GB | Up to 5 Gbps | Low traffic |
| t4g.small | 2 | 2 GB | Up to 5 Gbps | Light production (under 1 Gbps) |
| t4g.medium | 2 | 4 GB | Up to 5 Gbps | Medium traffic (1-2 Gbps) |
| t4g.large | 2 | 8 GB | Up to 5 Gbps | Higher traffic (2-5 Gbps) |
| t4g.xlarge | 4 | 16 GB | Up to 5 Gbps | High connection count |
| t4g.2xlarge | 8 | 32 GB | Up to 5 Gbps | Very high connection count |
C7gN Family (Network-Optimized)
Best for sustained high-throughput workloads:
| Instance | vCPU | RAM | Network | Recommended For |
|---|---|---|---|---|
| c7gn.medium | 1 | 2 GB | Up to 25 Gbps | Sustained medium traffic |
| c7gn.large | 2 | 4 GB | Up to 30 Gbps | High throughput (5-10 Gbps) |
| c7gn.xlarge | 4 | 8 GB | Up to 40 Gbps | Very high throughput (10-20 Gbps) |
| c7gn.2xlarge | 8 | 16 GB | Up to 50 Gbps | Extreme throughput (20-40 Gbps) |
| c7gn.4xlarge | 16 | 32 GB | 50 Gbps | Maximum throughput |
Sizing Recommendations:
- Start with t4g.small for most workloads
- Monitor TCPEstablished and NetworkTxBytes metrics
- Upgrade to C7gN if you see sustained high network utilization
- Consider t4g.medium or larger if connection count exceeds 50,000
Deploy via AWS CLI
aws cloudformation create-stack \
--stack-name cloudtune-nat-gateway \
--template-body file://cloudtune-nat-gateway.yaml \
--parameters \
ParameterKey=VpcId,ParameterValue=vpc-0abc123def456789 \
ParameterKey=PublicSubnetId,ParameterValue=subnet-0abc123def456789 \
ParameterKey=RouteTableIds,ParameterValue=rtb-111111111\\,rtb-222222222 \
ParameterKey=InstanceType,ParameterValue=t4g.small \
ParameterKey=KeyPairName,ParameterValue=my-key-pair \
--capabilities CAPABILITY_IAM
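Comma-separated values such as RouteTableIds are awkward to quote in the CLI shorthand syntax. One way around that is to generate a JSON parameters file instead. The sketch below is an assumption about workflow rather than part of the template; the helper name and the example IDs are placeholders.

```python
import json


def build_parameters(values):
    """Convert a dict of template parameters into CloudFormation's list-of-pairs shape."""
    parameters = []
    for key, value in values.items():
        # Lists (for example route table IDs) become one comma-separated string.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        parameters.append({"ParameterKey": key, "ParameterValue": str(value)})
    return parameters


# Placeholder values; substitute the IDs from your own environment.
values = {
    "VpcId": "vpc-0abc123def456789",
    "PublicSubnetId": "subnet-0abc123def456789",
    "RouteTableIds": ["rtb-111111111", "rtb-222222222"],
    "InstanceType": "t4g.small",
    "KeyPairName": "my-key-pair",
}

print(json.dumps(build_parameters(values), indent=2))
```

Redirect the output to a file such as params.json and pass it with --parameters file://params.json in place of the inline ParameterKey/ParameterValue pairs.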
Deploy via AWS Console
- Navigate to CloudFormation → Create Stack
- Upload cloudtune-nat-gateway.yaml
- Fill in the parameters using the guidance above
- Check "I acknowledge that AWS CloudFormation might create IAM resources"
- Create stack
Observability
CloudWatch Metrics
When EnableCloudWatchMonitoring is enabled (default), the NAT gateway publishes metrics to the NATGateway namespace with InstanceId as a dimension.
Network Metrics
| Metric | Unit | Description |
|---|---|---|
| NetworkRxBytes | Bytes | Received bytes on primary interface |
| NetworkTxBytes | Bytes | Transmitted bytes on primary interface |
| NetworkRxPackets | Count | Received packet count |
| NetworkTxPackets | Count | Transmitted packet count |
System Metrics
| Metric | Unit | Description |
|---|---|---|
| CPUUtilization | Percent | CPU usage (0-100) |
| MemoryUtilization | Percent | RAM usage (0-100) |
| DiskUtilization | Percent | Root filesystem usage (0-100) |
| InstanceOnline | Count | Always 1 when running (heartbeat) |
Connection Metrics
| Metric | Unit | Description |
|---|---|---|
| TCPEstablished | Count | Active TCP connections |
| TCPClose | Count | Closed TCP connections |
| TCPTimeWait | Count | Connections in TIME-WAIT state |
Creating CloudWatch Alarms
High CPU Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "NAT-Gateway-High-CPU" \
--metric-name CPUUtilization \
--namespace NATGateway \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-xxxxxxxxx \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:alerts
High Memory Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "NAT-Gateway-High-Memory" \
--metric-name MemoryUtilization \
--namespace NATGateway \
--statistic Average \
--period 300 \
--threshold 85 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-xxxxxxxxx \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:alerts
Connection Count Alarm
aws cloudwatch put-metric-alarm \
--alarm-name "NAT-Gateway-High-Connections" \
--metric-name TCPEstablished \
--namespace NATGateway \
--statistic Maximum \
--period 60 \
--threshold 100000 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-xxxxxxxxx \
--evaluation-periods 3 \
--alarm-actions arn:aws:sns:us-east-1:123456789:alerts
TIME-WAIT Buildup Alarm
High TIME-WAIT counts can indicate connection exhaustion:
aws cloudwatch put-metric-alarm \
--alarm-name "NAT-Gateway-TimeWait-Buildup" \
--metric-name TCPTimeWait \
--namespace NATGateway \
--statistic Maximum \
--period 300 \
--threshold 50000 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-xxxxxxxxx \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:alerts
Recommended Alert Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| CPUUtilization | 70% | 85% | Consider larger instance |
| MemoryUtilization | 75% | 90% | Upgrade instance type |
| TCPEstablished | 75,000 | 150,000 | Scale or add instances |
| TCPTimeWait | 30,000 | 50,000 | Check for connection leaks |
| DiskUtilization | 70% | 85% | Check log rotation |
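For scripted health checks, the thresholds in this table can be expressed as a small lookup. This is a sketch, not part of the product; it assumes a value must strictly exceed a threshold to trip it, mirroring the GreaterThanThreshold operator used in the alarm commands.

```python
# (warning, critical) pairs copied from the table above.
THRESHOLDS = {
    "CPUUtilization": (70, 85),
    "MemoryUtilization": (75, 90),
    "TCPEstablished": (75_000, 150_000),
    "TCPTimeWait": (30_000, 50_000),
    "DiskUtilization": (70, 85),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(classify("CPUUtilization", 72))
print(classify("MemoryUtilization", 95))
print(classify("TCPEstablished", 40_000))
```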
Adjusting Collection Interval
The metric collection interval can be adjusted via SSM Parameter Store without redeploying:
# Set to 30 seconds for higher granularity (more CloudWatch cost)
aws ssm put-parameter \
--name "/natgw/cloudwatch/interval" \
--value "30" \
--type String \
--overwrite
Valid values: 30-300 seconds. Lower intervals provide more granular data but increase CloudWatch costs.
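To see how the interval changes data volume, the arithmetic below counts datapoints per instance per day. It assumes each of the 11 metrics listed in the tables above is published once per interval, which the source does not state explicitly; the function name is hypothetical.

```python
METRIC_COUNT = 11  # 4 network + 4 system + 3 connection metrics


def datapoints_per_day(interval_seconds, metric_count=METRIC_COUNT):
    """Datapoints published per instance per day at the given interval."""
    if not 30 <= interval_seconds <= 300:
        raise ValueError("interval must be between 30 and 300 seconds")
    return (86_400 // interval_seconds) * metric_count


for interval in (30, 60, 300):
    print(f"{interval:>3}s interval: {datapoints_per_day(interval):,} datapoints/day")
```

Under that assumption, moving from the 60-second default to 30 seconds doubles the volume, and moving to 300 seconds cuts it to a fifth.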
Network Disruption During Deployment
Will there be a network blip?
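Short answer: yes, in the default single-instance configuration. The route table update itself takes only 2-3 seconds, but outbound connectivity is unavailable for roughly 60 seconds while the replacement instance boots (see the timeline below).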
Route Table Update Mechanism
The system uses two mechanisms to update route tables:
- Instance-level updates (pull model): When a new instance launches, systemd services automatically update route tables within 2-3 seconds of boot
- Lambda-triggered updates (push model): Auto Scaling Group SNS notifications trigger a Lambda function that updates all route tables tagged with UseCustomNAT=true
Timeline During Instance Replacement
| Time | Event | Network Impact |
|---|---|---|
| T+0s | Old instance terminates | Routes point to terminated instance |
| T+30-60s | New instance launching | No NAT available |
| T+60-63s | New instance runs route update | Routes updated atomically |
| T+63s+ | Normal operation | Full connectivity |
During the ~60 second window:
- New outbound TCP connections will fail
- Existing connections may time out or receive a RST
- UDP traffic will be dropped
- After route update, new connections succeed immediately
Minimizing Disruption
Option 1: Multi-Instance Deployment (Recommended for Production)
Deploy multiple NAT instances and spread your private route tables across them (for example, one instance per Availability Zone):
# Modify the ASG to run 2 instances
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name cloudtune-nat-gateway-asg \
--min-size 2 \
--desired-capacity 2
With 2+ instances using rolling updates, the old instance continues serving traffic while the new instance starts, achieving near-zero downtime.
Option 2: Scheduled Maintenance Windows
For single-instance deployments, schedule updates during low-traffic periods:
# Update the ASG with a scheduled action
aws autoscaling put-scheduled-update-group-action \
--auto-scaling-group-name cloudtune-nat-gateway-asg \
--scheduled-action-name "MaintenanceWindow" \
--recurrence "0 4 * * SUN" \
--desired-capacity 1
Option 3: Application-Level Retry Logic
Ensure applications connecting through the NAT gateway have retry logic for transient network failures:
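# Example: requests with retry
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
# Retry up to 3 times, backing off exponentially between attempts
retries = Retry(total=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retries)
# Mount for both schemes so plain-HTTP calls are retried as well
session.mount('https://', adapter)
session.mount('http://', adapter)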
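It is worth checking whether a retry policy actually spans the roughly 60-second window from the timeline above. The sketch below uses a simplified doubling model (delay = backoff_factor × 2^attempt); it is not urllib3's exact schedule, and it ignores the time each failed attempt spends waiting on its own connect timeout. Both function names are hypothetical.

```python
def cumulative_backoff(retries, backoff_factor):
    """Total seconds slept across `retries` retries when each delay doubles."""
    return sum(backoff_factor * (2 ** attempt) for attempt in range(retries))


def retries_needed(window_seconds, backoff_factor):
    """Smallest retry count whose cumulative backoff covers window_seconds."""
    retries = 0
    while cumulative_backoff(retries, backoff_factor) < window_seconds:
        retries += 1
    return retries


print(cumulative_backoff(3, 0.5))
print(retries_needed(60, 0.5))
```

Under this model, three retries with a 0.5 backoff factor cover only a few seconds of sleep, so a request that starts early in the outage can still exhaust its retries. Raising the retry count, or pairing retries with longer timeouts, is needed to ride out the full window.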
Route Table Tagging
For Lambda-based route updates to work, tag your private subnet route tables:
aws ec2 create-tags \
--resources rtb-xxxxxxxxx \
--tags Key=UseCustomNAT,Value=true
Only route tables with this tag will be automatically updated by the Lambda function.
Troubleshooting
Verify NAT Instance Health
# Check instance is running
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=*nat*" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]'
# Verify source/dest check is disabled
aws ec2 describe-instance-attribute \
--instance-id i-xxxxxxxxx \
--attribute sourceDestCheck
Verify Route Tables
# Check route table has correct NAT route
aws ec2 describe-route-tables \
--route-table-id rtb-xxxxxxxxx \
--query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
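The query returns a list of route objects, and interpreting it can be scripted. This sketch assumes the route was created against the NAT instance, so the InstanceId field is populated; a State of blackhole means the route's target no longer exists, which is what you see after the old instance terminates and before routes are updated. The function name and sample IDs are hypothetical.

```python
def check_default_route(routes, nat_instance_id):
    """Classify the 0.0.0.0/0 route from a describe-route-tables Routes list."""
    defaults = [r for r in routes if r.get("DestinationCidrBlock") == "0.0.0.0/0"]
    if not defaults:
        return "missing"        # no default route at all
    route = defaults[0]
    if route.get("State") == "blackhole":
        return "blackhole"      # target was deleted or terminated
    if route.get("InstanceId") != nat_instance_id:
        return "wrong-target"   # points somewhere other than the NAT instance
    return "ok"


# Hypothetical route lists shaped like the query output.
healthy = [{"DestinationCidrBlock": "0.0.0.0/0", "InstanceId": "i-0new", "State": "active"}]
stale = [{"DestinationCidrBlock": "0.0.0.0/0", "InstanceId": "i-0old", "State": "blackhole"}]

print(check_default_route(healthy, "i-0new"))
print(check_default_route(stale, "i-0new"))
```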
Check Instance Logs
Connect via Session Manager or SSH:
# View NAT service status
sudo systemctl status pat-routing
# Check route update logs
sudo journalctl -u update-aws-route-table@rtb-xxxxxxxxx
# View CloudWatch monitoring logs
sudo journalctl -u cloudwatch-monitoring
# Check current connections
sudo conntrack -L | wc -l
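A raw connection count is easier to interpret as a fraction of the table's capacity. The sketch below assumes the instance uses the standard Linux netfilter connection tracker, which exposes its current and maximum entry counts under /proc/sys/net/netfilter; the function names are hypothetical.

```python
from pathlib import Path

COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count, maximum):
    """Percentage of the connection tracking table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 100.0 * count / maximum


def read_conntrack_utilization():
    """Read live values from /proc (only works on a host with conntrack loaded)."""
    count = int(COUNT_PATH.read_text())
    maximum = int(MAX_PATH.read_text())
    return conntrack_utilization(count, maximum)


if COUNT_PATH.exists() and MAX_PATH.exists():
    print(f"conntrack table {read_conntrack_utilization():.1f}% full")
else:
    print("conntrack files not found; run this on the NAT instance")
```

When the table fills completely, the kernel drops new connections, so a value that stays close to 100% is a sign to size up or investigate connection leaks.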
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| No outbound connectivity | Route table not updated | Check Lambda logs, verify tagging |
| Intermittent failures | Instance undersized | Upgrade instance type |
| High latency | CPU saturation | Monitor CPUUtilization, upgrade |
| Connection timeouts | Connection limit reached | Check TCPEstablished metric |
Security Considerations
The NAT gateway instance is deployed with:
- Security Group: Allows all ingress from 10.0.0.0/8, all egress to 0.0.0.0/0
- IAM Role: Minimal permissions for EC2, Route Tables, CloudWatch, and SSM
- EBS Encryption: Enabled by default
- SSM Session Manager: Enabled for secure shell access without SSH keys
To restrict the security group further, modify the CloudFormation template's SecurityGroup resource.
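Before deploying, it is worth confirming that your VPC's address range falls inside the 10.0.0.0/8 ingress rule, since VPCs in other private ranges would not be covered by it. A minimal check using Python's standard ipaddress module follows; the function name and example CIDRs are illustrative.

```python
import ipaddress

ALLOWED_INGRESS = ipaddress.ip_network("10.0.0.0/8")


def vpc_covered(vpc_cidr):
    """True if every address in vpc_cidr is inside the allowed ingress range."""
    return ipaddress.ip_network(vpc_cidr).subnet_of(ALLOWED_INGRESS)


for cidr in ("10.20.0.0/16", "172.31.0.0/16", "192.168.0.0/24"):
    print(cidr, vpc_covered(cidr))
```

If your VPC CIDR is reported as not covered, adjust the ingress rule in the template's SecurityGroup resource to match your VPC range.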