Self-Hosted NAT Gateway

by Caleb Geene, Platform Engineer

CloudTune's self-hosted NAT gateway replaces AWS Managed NAT Gateway with a cost-optimized EC2 instance running on ARM64 (Graviton). This guide covers deployment via CloudFormation, instance sizing, observability, and network considerations.

Overview

The CloudTune NAT Gateway provides:

  • Up to 80% cost reduction compared to AWS Managed NAT Gateway
  • Pre-tuned for 200,000+ concurrent connections with optimized kernel settings
  • Automatic route table management via Lambda and instance-level scripts
  • Built-in CloudWatch metrics for network, CPU, memory, and connection tracking
  • ARM64/Graviton instances for best price-performance ratio

Deployment

CloudFormation Parameters

Deploy using the cloudtune-nat-gateway.yaml template. You'll need to gather the following values from your AWS environment:

Required Parameters

Parameter | How to Find | Description
VpcId | AWS Console → VPC → Your VPCs | The VPC where the NAT gateway will be deployed
PublicSubnetId | AWS Console → VPC → Subnets → Filter by "public" | A public subnet with an Internet Gateway route
KeyPairName | AWS Console → EC2 → Key Pairs | SSH key pair for instance access
RouteTableIds | AWS Console → VPC → Route Tables | Comma-separated list of private subnet route tables to update

Finding Route Table IDs:

# List all route tables in your VPC
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
  --query 'RouteTables[*].[RouteTableId,Tags[?Key==`Name`].Value|[0]]' \
  --output table

Look for route tables associated with your private subnets (those without a direct 0.0.0.0/0 route to an Internet Gateway).
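
The same check can be scripted. Below is a minimal sketch, assuming you have saved the JSON output of `aws ec2 describe-route-tables` and loaded it into a dict. The function name and the sample IDs are illustrative, not part of the CloudTune tooling.

```python
from typing import Any


def private_route_table_ids(response: dict[str, Any]) -> list[str]:
    """Return IDs of route tables with no 0.0.0.0/0 route to an Internet Gateway."""
    private_ids = []
    for table in response.get("RouteTables", []):
        has_igw_default = any(
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
            for route in table.get("Routes", [])
        )
        if not has_igw_default:
            private_ids.append(table["RouteTableId"])
    return private_ids


# Hypothetical sample shaped like the describe-route-tables JSON output
sample = {
    "RouteTables": [
        {"RouteTableId": "rtb-public", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
        ]},
        {"RouteTableId": "rtb-private", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
        ]},
    ]
}

# Joined with commas, the result matches the RouteTableIds parameter format
print(",".join(private_route_table_ids(sample)))
```

Review the result before using it: a route table with no default route at all also counts as "private" here, which may or may not be what you want.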

Optional Parameters

Parameter | Default | Description
InstanceType | t4g.small | EC2 instance type (see sizing guide below)
AutomaticallyUpdateRouteTables | true | Automatically update route tables on instance launch
EnableCloudWatchMonitoring | true | Enable custom CloudWatch metrics
CloudWatchInterval | 60 | Metric collection interval in seconds (30-300)
VolumeSize | 30 | EBS volume size in GB (20-100)

Instance Sizing Guide

Before selecting an instance type, assess your current NAT gateway usage to right-size the replacement.

Measuring Current Throughput

Query your existing AWS Managed NAT Gateway metrics to understand your throughput requirements:

# Find the busiest 10-minute window over the last 7 days.
# Sum is the byte count for that window; divide by 600 for average bytes/sec.
# A 600-second period keeps the request under CloudWatch's 1,440-datapoint limit.
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-xxxxxxxxxxxxxxxxx \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 600 \
  --statistics Sum \
  --query 'sort_by(Datapoints, &Sum)[-1]'

The date -v-7d syntax is specific to macOS/BSD. On Linux (GNU date), use date -u -d "7 days ago" +%Y-%m-%dT%H:%M:%SZ for the start time instead.

Key Metrics to Check

Metric | What It Tells You | Sizing Impact
BytesOutToDestination | Outbound throughput to the internet | Primary sizing factor
BytesInFromDestination | Inbound response traffic | Can exceed outbound for download-heavy workloads; check both
ConnectionEstablishedCount | New connections established per period | Affects memory requirements
PacketsDropCount | Dropped packets (capacity issues) | If non-zero, size up

Quick Throughput Assessment

Run this to get a summary of your NAT gateway usage:

# Replace nat-xxxxxxxxxxxxxxxxx with your NAT Gateway ID
NAT_ID="nat-xxxxxxxxxxxxxxxxx"

# Peak throughput: bytes in the busiest 10-minute window, converted to Gbps.
# A 600-second period keeps 7 days of data under the 1,440-datapoint limit.
echo "=== Peak Throughput (last 7 days) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value="$NAT_ID" \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 600 \
  --statistics Sum \
  --query 'max(Datapoints[].Sum)' \
  --output text | awk '{printf "%.2f Gbps\n", ($1 * 8) / 1000000000 / 600}'

# Peak connection rate: new connections per minute, averaged over the busiest 10-minute window
echo "=== Peak Connections/min (last 7 days) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name ConnectionEstablishedCount \
  --dimensions Name=NatGatewayId,Value="$NAT_ID" \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 600 \
  --statistics Sum \
  --query 'max(Datapoints[].Sum)' \
  --output text | awk '{printf "%.0f connections/min\n", $1 / 10}'
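
The unit conversion in the throughput step is easy to get wrong, so it may help to check it offline. The sketch below assumes the input is the total byte count for one CloudWatch period (the Sum statistic). The function name and the sample figure are illustrative.

```python
def bytes_to_gbps(byte_count: float, period_seconds: int) -> float:
    """Average throughput in Gbps for byte_count bytes moved in period_seconds."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    bits_per_second = byte_count * 8 / period_seconds
    return bits_per_second / 1_000_000_000


# Hypothetical datapoint: 37.5 GB transferred in a 300-second window
print(f"{bytes_to_gbps(37_500_000_000, 300):.2f} Gbps")
```

Because the result is an average over the whole period, short bursts inside the window can be considerably higher than the figure reported.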

Sizing Rule of Thumb

Your Peak Throughput | Your Peak Connections/min | Recommended Starting Point
Under 500 Mbps | Under 10,000 | t4g.small
500 Mbps - 2 Gbps | 10,000 - 50,000 | t4g.medium or t4g.large
2 - 5 Gbps | 50,000 - 100,000 | c7gn.medium or c7gn.large
5 - 20 Gbps | Over 100,000 | c7gn.xlarge or larger
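
The rule of thumb above can be expressed as a small lookup. This is a sketch under one assumption that the table does not state explicitly: when throughput and connection rate fall in different rows, the higher row wins. The function name is illustrative.

```python
# Tiers copied from the table: (upper Mbps bound, upper connections/min bound, suggestion)
TIERS = [
    (500, 10_000, "t4g.small"),
    (2_000, 50_000, "t4g.medium or t4g.large"),
    (5_000, 100_000, "c7gn.medium or c7gn.large"),
]
LARGEST = "c7gn.xlarge or larger"


def recommend_instance(peak_mbps: float, peak_conn_per_min: int) -> str:
    """Pick the first tier whose limits cover both throughput and connection rate."""
    for max_mbps, max_conn, suggestion in TIERS:
        if peak_mbps < max_mbps and peak_conn_per_min < max_conn:
            return suggestion
    return LARGEST


print(recommend_instance(300, 4_000))    # both values in the first row
print(recommend_instance(300, 60_000))   # connection rate pushes the choice up
```

Treat the result as a starting point only, then confirm with the metrics described in the Observability section.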

The NAT gateway supports two instance families optimized for different workloads:

T4G Family (Burstable, General Purpose)

Best for variable traffic patterns with occasional bursts:

Instance | vCPU | RAM | Network | Recommended For
t4g.nano | 2 | 0.5 GB | Up to 5 Gbps | Development/testing
t4g.micro | 2 | 1 GB | Up to 5 Gbps | Low traffic
t4g.small | 2 | 2 GB | Up to 5 Gbps | Light production (under 1 Gbps)
t4g.medium | 2 | 4 GB | Up to 5 Gbps | Medium traffic (1-2 Gbps)
t4g.large | 2 | 8 GB | Up to 5 Gbps | Higher traffic (2-5 Gbps)
t4g.xlarge | 4 | 16 GB | Up to 5 Gbps | High connection count
t4g.2xlarge | 8 | 32 GB | Up to 5 Gbps | Very high connection count

C7gN Family (Network-Optimized)

Best for sustained high-throughput workloads:

Instance | vCPU | RAM | Network | Recommended For
c7gn.medium | 1 | 2 GB | Up to 25 Gbps | Sustained medium traffic
c7gn.large | 2 | 4 GB | Up to 30 Gbps | High throughput (5-10 Gbps)
c7gn.xlarge | 4 | 8 GB | Up to 40 Gbps | Very high throughput (10-20 Gbps)
c7gn.2xlarge | 8 | 16 GB | Up to 50 Gbps | Extreme throughput (20-40 Gbps)
c7gn.4xlarge | 16 | 32 GB | 50 Gbps | Maximum throughput

Sizing Recommendations:

  • Start with t4g.small for most workloads
  • Monitor TCPEstablished and NetworkTxBytes metrics
  • Upgrade to C7gN if you see sustained high network utilization
  • Consider t4g.medium or larger if connection count exceeds 50,000

Deploy via AWS CLI

# Commas inside a parameter value must be escaped (\,) in the CLI shorthand syntax
aws cloudformation create-stack \
  --stack-name cloudtune-nat-gateway \
  --template-body file://cloudtune-nat-gateway.yaml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=vpc-0abc123def456789 \
    ParameterKey=PublicSubnetId,ParameterValue=subnet-0abc123def456789 \
    ParameterKey=RouteTableIds,ParameterValue="rtb-111111111\,rtb-222222222" \
    ParameterKey=InstanceType,ParameterValue=t4g.small \
    ParameterKey=KeyPairName,ParameterValue=my-key-pair \
  --capabilities CAPABILITY_IAM
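
If you prefer to avoid shell quoting around the comma-separated RouteTableIds value, the parameters can be supplied as a JSON file instead. The sketch below generates that JSON. The helper name and the params.json file name are illustrative, and the IDs are the same placeholders used in the command above.

```python
import json


def build_parameters(values: dict[str, str]) -> list[dict[str, str]]:
    """Convert a plain mapping into the CloudFormation parameter list structure."""
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


params = build_parameters({
    "VpcId": "vpc-0abc123def456789",
    "PublicSubnetId": "subnet-0abc123def456789",
    "RouteTableIds": "rtb-111111111,rtb-222222222",  # no escaping needed in JSON
    "InstanceType": "t4g.small",
    "KeyPairName": "my-key-pair",
})

print(json.dumps(params, indent=2))
```

Redirect the output to params.json and pass --parameters file://params.json to aws cloudformation create-stack in place of the inline ParameterKey=... pairs.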

Deploy via AWS Console

  1. Navigate to CloudFormation → Create Stack
  2. Upload cloudtune-nat-gateway.yaml
  3. Fill in the parameters using the guidance above
  4. Check "I acknowledge that AWS CloudFormation might create IAM resources"
  5. Create stack

Observability

CloudWatch Metrics

When EnableCloudWatchMonitoring is enabled (default), the NAT gateway publishes metrics to the NATGateway namespace with InstanceId as a dimension.

Network Metrics

Metric | Unit | Description
NetworkRxBytes | Bytes | Received bytes on primary interface
NetworkTxBytes | Bytes | Transmitted bytes on primary interface
NetworkRxPackets | Count | Received packet count
NetworkTxPackets | Count | Transmitted packet count

System Metrics

Metric | Unit | Description
CPUUtilization | Percent | CPU usage (0-100)
MemoryUtilization | Percent | RAM usage (0-100)
DiskUtilization | Percent | Root filesystem usage (0-100)
InstanceOnline | Count | Always 1 when running (heartbeat)

Connection Metrics

Metric | Unit | Description
TCPEstablished | Count | Active TCP connections
TCPClose | Count | Closed TCP connections
TCPTimeWait | Count | Connections in TIME-WAIT state
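
This guide does not describe how the connection metrics are collected. As an illustration only, the sketch below shows one way to derive similar counts from the text output of `conntrack -L`, assuming the usual layout where the TCP state is the fourth whitespace-separated field. The function name and sample lines are hypothetical.

```python
from collections import Counter


def count_tcp_states(conntrack_output: str) -> Counter:
    """Count tracked TCP connections per state from `conntrack -L` style text."""
    states = Counter()
    for line in conntrack_output.splitlines():
        fields = line.split()
        # TCP entries look like: tcp 6 <ttl> <STATE> src=... dst=... ...
        if len(fields) >= 4 and fields[0] == "tcp":
            states[fields[3]] += 1
    return states


# Hypothetical sample lines (shortened)
sample_output = """\
tcp      6 431999 ESTABLISHED src=10.0.1.5 dst=93.184.216.34 sport=44321 dport=443
tcp      6 431990 ESTABLISHED src=10.0.1.6 dst=93.184.216.34 sport=44322 dport=443
tcp      6 110 TIME_WAIT src=10.0.1.5 dst=151.101.1.69 sport=50112 dport=443
udp      17 29 src=10.0.1.5 dst=10.0.0.2 sport=53211 dport=53
"""

states = count_tcp_states(sample_output)
print(states["ESTABLISHED"], states["TIME_WAIT"], states["CLOSE"])
```

The three printed values would correspond to TCPEstablished, TCPTimeWait, and TCPClose respectively, which is useful for cross-checking the published metrics by hand on the instance.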

Creating CloudWatch Alarms

High CPU Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-CPU" \
  --metric-name CPUUtilization \
  --namespace NATGateway \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

High Memory Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-Memory" \
  --metric-name MemoryUtilization \
  --namespace NATGateway \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

Connection Count Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-High-Connections" \
  --metric-name TCPEstablished \
  --namespace NATGateway \
  --statistic Maximum \
  --period 60 \
  --threshold 100000 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

TIME-WAIT Buildup Alarm

A sustained high TIME-WAIT count usually means heavy churn of short-lived connections and can precede source-port exhaustion:

aws cloudwatch put-metric-alarm \
  --alarm-name "NAT-Gateway-TimeWait-Buildup" \
  --metric-name TCPTimeWait \
  --namespace NATGateway \
  --statistic Maximum \
  --period 300 \
  --threshold 50000 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

Recommended Alert Thresholds

Metric | Warning | Critical | Action
CPUUtilization | 70% | 85% | Consider larger instance
MemoryUtilization | 75% | 90% | Upgrade instance type
TCPEstablished | 75,000 | 150,000 | Scale or add instances
TCPTimeWait | 30,000 | 50,000 | Check for connection leaks
DiskUtilization | 70% | 85% | Check log rotation
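
For dashboards or scripts that need the same thresholds, the table can be encoded as data. This is a minimal sketch with illustrative names. It assumes a value must strictly exceed a threshold to trigger, matching the GreaterThanThreshold operator used in the alarm examples.

```python
# (warning, critical) thresholds copied from the table above
THRESHOLDS = {
    "CPUUtilization": (70, 85),
    "MemoryUtilization": (75, 90),
    "TCPEstablished": (75_000, 150_000),
    "TCPTimeWait": (30_000, 50_000),
    "DiskUtilization": (70, 85),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(classify("CPUUtilization", 72))
print(classify("TCPEstablished", 160_000))
```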

Adjusting Collection Interval

The metric collection interval can be adjusted via SSM Parameter Store without redeploying:

# Set to 30 seconds for higher granularity (more CloudWatch cost)
aws ssm put-parameter \
  --name "/natgw/cloudwatch/interval" \
  --value "30" \
  --type String \
  --overwrite

Valid values: 30-300 seconds. Lower intervals provide more granular data but increase CloudWatch costs.
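
To see how the interval affects data volume, the sketch below validates a value against the documented 30-300 second range and computes how many datapoints each metric publishes per day. The function names are illustrative, and the sketch deliberately makes no pricing claims.

```python
SECONDS_PER_DAY = 86_400


def validate_interval(seconds: int) -> int:
    """Return the interval unchanged if it is within the documented 30-300 range."""
    if not 30 <= seconds <= 300:
        raise ValueError("interval must be between 30 and 300 seconds")
    return seconds


def datapoints_per_day(seconds: int) -> int:
    """Number of datapoints one metric publishes per day at this interval."""
    return SECONDS_PER_DAY // validate_interval(seconds)


for interval in (30, 60, 300):
    print(interval, datapoints_per_day(interval))
```

Halving the interval from the default 60 seconds to 30 doubles the number of datapoints published for every metric.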

Network Disruption During Deployment

Will there be a network blip?

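Short answer: Yes, briefly. The route table switch itself takes only 1-2 seconds, but in the default single-instance configuration outbound traffic is interrupted for roughly 60 seconds while the replacement instance boots (see the timeline below).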

Route Table Update Mechanism

The system uses two mechanisms to update route tables:

  1. Instance-level updates (pull model): When a new instance launches, systemd services automatically update route tables within 2-3 seconds of boot
  2. Lambda-triggered updates (push model): Auto Scaling Group SNS notifications trigger a Lambda function that updates all route tables tagged with UseCustomNAT=true

Timeline During Instance Replacement

Time | Event | Network Impact
T+0s | Old instance terminates | Routes point to terminated instance
T+30-60s | New instance launching | No NAT available
T+60-63s | New instance runs route update | Routes updated atomically
T+63s+ | Normal operation | Full connectivity

During the ~60 second window:

  • New outbound TCP connections will fail
  • Existing connections may time out or be reset (RST)
  • UDP traffic will be dropped
  • After route update, new connections succeed immediately

Minimizing Disruption

Option 1: Multi-Instance Deployment (Recommended for Production)

Deploy multiple NAT instances, for example one per Availability Zone, each serving its own set of private route tables:

# Modify the ASG to run 2 instances
# (--max-size must be at least as large as the desired capacity)
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name cloudtune-nat-gateway-asg \
  --min-size 2 \
  --max-size 2 \
  --desired-capacity 2

With 2+ instances using rolling updates, the old instance continues serving traffic while the new instance starts, achieving near-zero downtime.

Option 2: Scheduled Maintenance Windows

For single-instance deployments, schedule updates during low-traffic periods:

# Update the ASG with a scheduled action
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name cloudtune-nat-gateway-asg \
  --scheduled-action-name "MaintenanceWindow" \
  --recurrence "0 4 * * SUN" \
  --desired-capacity 1

Option 3: Application-Level Retry Logic

Ensure applications connecting through the NAT gateway have retry logic for transient network failures:

# Example: requests with retry
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times with exponential backoff between attempts
retries = Retry(total=3, backoff_factor=0.5)
session.mount('https://', HTTPAdapter(max_retries=retries))
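
It is worth checking whether a retry budget actually spans the outage window. The sketch below uses a simplified exponential model (delay = backoff_factor × 2^n). This is an assumption for illustration: urllib3's exact schedule differs slightly between versions, and time spent waiting on connection timeouts is ignored here.

```python
def backoff_delays(retries: int, backoff_factor: float) -> list[float]:
    """Delays for a plain exponential schedule: factor * 2**n for n = 0..retries-1."""
    return [backoff_factor * 2 ** n for n in range(retries)]


def retries_to_cover(outage_seconds: float, backoff_factor: float) -> int:
    """Smallest retry count whose cumulative delay reaches outage_seconds."""
    if backoff_factor <= 0:
        raise ValueError("backoff_factor must be positive")
    retries, waited = 0, 0.0
    while waited < outage_seconds:
        waited += backoff_factor * 2 ** retries
        retries += 1
    return retries


print(sum(backoff_delays(3, 0.5)))   # total seconds waited across 3 retries
print(retries_to_cover(60, 0.5))     # retries needed to span a ~60 s gap
```

Under this model, three retries with a 0.5 backoff factor wait only a few seconds in total, so bridging a full single-instance replacement would need a larger retry count or backoff factor.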

Route Table Tagging

For Lambda-based route updates to work, tag your private subnet route tables:

aws ec2 create-tags \
  --resources rtb-xxxxxxxxx \
  --tags Key=UseCustomNAT,Value=true

Only route tables with this tag will be automatically updated by the Lambda function.
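
To preview which route tables carry the tag, you can filter the JSON output of `aws ec2 describe-route-tables` offline. The sketch below is illustrative and is not the Lambda function's actual code, which this guide does not show. It assumes an exact, case-sensitive match on the tag key and value.

```python
from typing import Any


def tagged_route_table_ids(
    response: dict[str, Any], key: str = "UseCustomNAT", value: str = "true"
) -> list[str]:
    """Return IDs of route tables carrying the given tag key and value."""
    ids = []
    for table in response.get("RouteTables", []):
        tags = {tag["Key"]: tag["Value"] for tag in table.get("Tags", [])}
        if tags.get(key) == value:
            ids.append(table["RouteTableId"])
    return ids


# Hypothetical sample shaped like the describe-route-tables JSON output
tag_sample = {
    "RouteTables": [
        {"RouteTableId": "rtb-aaa", "Tags": [{"Key": "UseCustomNAT", "Value": "true"}]},
        {"RouteTableId": "rtb-bbb", "Tags": [{"Key": "UseCustomNAT", "Value": "True"}]},
        {"RouteTableId": "rtb-ccc", "Tags": [{"Key": "Name", "Value": "private-a"}]},
        {"RouteTableId": "rtb-ddd"},
    ]
}

print(tagged_route_table_ids(tag_sample))
```

Note that rtb-bbb is excluded in this sketch because its value is capitalized differently; AWS tag values are case-sensitive, so it is safest to tag with exactly true.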

Troubleshooting

Verify NAT Instance Health

# Check instance is running
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*nat*" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PrivateIpAddress]'

# Verify source/dest check is disabled
aws ec2 describe-instance-attribute \
  --instance-id i-xxxxxxxxx \
  --attribute sourceDestCheck

Verify Route Tables

# Check route table has correct NAT route
aws ec2 describe-route-tables \
  --route-table-ids rtb-xxxxxxxxx \
  --query 'RouteTables[0].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
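
The JSON returned by that query can be checked automatically. The following is a sketch with an illustrative function name. It assumes the input is the list of route objects from the command above, and that a State of blackhole means the route's target no longer exists (for example, a terminated instance).

```python
from typing import Any


def check_default_route(routes: list[dict[str, Any]], expected_instance_id: str) -> str:
    """Return 'ok' or a short description of what is wrong with the 0.0.0.0/0 route."""
    defaults = [r for r in routes if r.get("DestinationCidrBlock") == "0.0.0.0/0"]
    if not defaults:
        return "missing default route"
    route = defaults[0]
    if route.get("State") == "blackhole":
        return "blackhole: route target no longer exists"
    if route.get("InstanceId") != expected_instance_id:
        target = route.get("InstanceId") or route.get("NatGatewayId") or route.get("GatewayId")
        return f"default route points at {target}"
    return "ok"


# Hypothetical route entry as returned by the query above
healthy = [{"DestinationCidrBlock": "0.0.0.0/0", "InstanceId": "i-0123", "State": "active"}]
print(check_default_route(healthy, "i-0123"))
```

A blackhole result shortly after an instance replacement suggests the route update has not run yet; if it persists, check the route update logs described below.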

Check Instance Logs

Connect via Session Manager or SSH:

# View NAT service status
sudo systemctl status pat-routing

# Check route update logs
sudo journalctl -u update-aws-route-table@rtb-xxxxxxxxx

# View CloudWatch monitoring logs
sudo journalctl -u cloudwatch-monitoring

# Check current connections
sudo conntrack -L | wc -l

Common Issues

Symptom | Cause | Solution
No outbound connectivity | Route table not updated | Check Lambda logs, verify tagging
Intermittent failures | Instance undersized | Upgrade instance type
High latency | CPU saturation | Monitor CPUUtilization, upgrade
Connection timeouts | Connection limit reached | Check TCPEstablished metric

Security Considerations

The NAT gateway instance is deployed with:

  • Security Group: Allows all ingress from 10.0.0.0/8, all egress to 0.0.0.0/0
  • IAM Role: Minimal permissions for EC2, Route Tables, CloudWatch, and SSM
  • EBS Encryption: Enabled by default
  • SSM Session Manager: Enabled for secure shell access without SSH keys

To restrict the security group further, modify the CloudFormation template's SecurityGroup resource.
