Common reasons a Site-to-Site VPN tunnel keeps dropping:
IKE/IPSec lifetime mismatch between AWS and firewall
DPD (Dead Peer Detection) not configured or too aggressive
No interesting traffic → tunnel goes idle and tears down
NAT-T / MTU / fragmentation issues
Single tunnel used (AWS provides 2 tunnels per VPN for redundancy)
PFS / DH group mismatch after rekey
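Before digging into firewall logs, it helps to confirm what AWS itself sees. A quick sketch using the AWS CLI (the VPN connection ID is a placeholder — substitute your own):

```shell
# Show per-tunnel status, last state change, and the reason AWS reports.
# vpn-0123456789abcdef0 is a hypothetical connection ID.
aws ec2 describe-vpn-connections \
  --vpn-connection-ids vpn-0123456789abcdef0 \
  --query 'VpnConnections[].VgwTelemetry[].{Tunnel:OutsideIpAddress,Status:Status,Reason:StatusMessage,LastChange:LastStatusChange}' \
  --output table
```

The `StatusMessage` field often points directly at the cause (e.g., a Phase 2 proposal mismatch).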
AWS Site-to-Site VPN gives you 2 tunnels per connection. Use both.
Parameter | Recommended Value
IKE Version | IKEv2
Phase 1 Lifetime | 28800 sec (8 hr)
Phase 2 Lifetime | 3600 sec (1 hr)
DH Group (P1/P2) | 14 or higher (14, 19, 20, 21, 24)
Encryption | AES256
Integrity | SHA256
PFS | Enabled (Group 14+)
DPD Timeout Action | Restart (not Clear)
Startup Action | Start (AWS initiates)
Rekey Margin Time | 540 sec
Rekey Fuzz | 100%
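These values can be applied on the AWS side per tunnel with `modify-vpn-tunnel-options`. A sketch (connection ID and outside IP are placeholders):

```shell
# Apply the recommended tunnel options to one tunnel; repeat for the second.
aws ec2 modify-vpn-tunnel-options \
  --vpn-connection-id vpn-0123456789abcdef0 \
  --vpn-tunnel-outside-ip-address 203.0.113.10 \
  --tunnel-options '{
    "IKEVersions": [{"Value": "ikev2"}],
    "Phase1LifetimeSeconds": 28800,
    "Phase2LifetimeSeconds": 3600,
    "Phase1DHGroupNumbers": [{"Value": 14}],
    "Phase2DHGroupNumbers": [{"Value": 14}],
    "Phase1EncryptionAlgorithms": [{"Value": "AES256"}],
    "Phase2EncryptionAlgorithms": [{"Value": "AES256"}],
    "Phase1IntegrityAlgorithms": [{"Value": "SHA2-256"}],
    "Phase2IntegrityAlgorithms": [{"Value": "SHA2-256"}],
    "DPDTimeoutSeconds": 30,
    "DPDTimeoutAction": "restart",
    "StartupAction": "start",
    "RekeyMarginTimeSeconds": 540,
    "RekeyFuzzPercentage": 100
  }'
```

Note that modifying tunnel options briefly interrupts that tunnel, so change one tunnel at a time.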
Set "Startup Action = Start" so AWS initiates the tunnel instead of waiting for the on-prem side.
Match AWS parameters exactly (lifetimes, DH group, encryption, PFS).
Enable DPD (interval 10s, retries 3).
Configure both AWS tunnels as active/active or active/standby (BGP preferred).
Use BGP dynamic routing instead of static routes — auto failover between tunnels.
Set MSS clamping to 1379 (or MTU 1436) to avoid fragmentation.
Enable NAT Traversal (NAT-T) on UDP 4500.
Whitelist AWS tunnel public IPs on firewall WAN ACLs.
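On a Linux-based VPN gateway, the MSS clamp above can be applied with iptables; vendor firewalls have their own equivalent setting. A minimal sketch:

```shell
# Clamp TCP MSS on SYN packets crossing the gateway to avoid fragmentation
# inside the IPSec tunnel. 1379 matches the recommendation above; adjust it
# if your cipher suite's overhead differs.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 1379
```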
If traffic is sporadic, tunnels may idle out:
Configure an SLA monitor / ping probe from on-prem to an AWS-side IP (e.g., EC2 ENI or VGW tunnel inside IP) every 10 sec.
On Palo Alto: Use Tunnel Monitor
On Fortinet: Use dead-peer-detection on-demand + ping keepalive
On Cisco ASA/FTD: Use SLA monitor with track
On Check Point: Enable Permanent Tunnels
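If the gateway is a Linux host (e.g., running strongSwan) rather than one of the appliances above, a simple keepalive script achieves the same effect. A hypothetical sketch — the target IP and interval are placeholders you would adapt:

```shell
#!/bin/sh
# keepalive.sh -- ping the AWS tunnel inside IP every 10 s so the SA
# always carries traffic and never idles out. Run under systemd or init.
TARGET="${1:-169.254.x.x}"   # replace with your tunnel inside IP

while true; do
  ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1 \
    || logger "VPN keepalive: $TARGET unreachable"
  sleep 10
done
```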
Use BGP over both AWS tunnels (ASN from AWS: 64512 default, on-prem: your private ASN).
For a higher SLA: use AWS Transit Gateway + VPN instead of a VGW; TGW supports ECMP across multiple tunnels (e.g., 4 tunnels with two VPN connections).
Consider AWS Direct Connect + VPN backup for production manufacturing workloads (low latency + encryption).
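ECMP must be enabled when the Transit Gateway is created. A sketch (description and defaults are illustrative):

```shell
# Create a Transit Gateway with ECMP enabled for VPN attachments.
aws ec2 create-transit-gateway \
  --description "prod-vpn-tgw" \
  --options AmazonSideAsn=64512,VpnEcmpSupport=enable
```

With `VpnEcmpSupport=enable`, BGP-learned routes with identical attributes across multiple VPN attachments are load-balanced rather than one tunnel sitting idle.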
CloudWatch metrics: TunnelState, TunnelDataIn/Out
Set SNS alarms for tunnel down events
On-prem: SNMP traps for IPSec SA deletion
Enable VPN logs on firewall (Phase 1 / Phase 2 negotiation logs) to catch rekey failures
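The CloudWatch side of this can be wired up with a metric alarm on `TunnelState` (0 = down, 1 = up). A sketch — the VPN ID and SNS topic ARN are placeholders:

```shell
# Alarm when any tunnel of the connection is down: with Minimum as the
# statistic across both tunnels, a value below 1 means at least one is down.
aws cloudwatch put-metric-alarm \
  --alarm-name vpn-tunnel-down \
  --namespace AWS/VPN \
  --metric-name TunnelState \
  --dimensions Name=VpnId,Value=vpn-0123456789abcdef0 \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:vpn-alerts
```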
You should never need to rekey manually. If you find yourself doing so, the usual causes are:
Lifetimes don't match → AWS drops SA before firewall renegotiates
PFS mismatch → rekey fails silently
Fix by aligning Phase 2 lifetime and enabling DPD Restart
Both AWS tunnels configured on firewall
IKEv2 with matching P1/P2 lifetimes
DPD enabled (Restart action)
BGP dynamic routing enabled
Tunnel monitor / keepalive ping configured
MSS/MTU clamping applied
CloudWatch alarms on tunnel state
Startup Action = Start on AWS side
Tells AWS to initiate the IKE negotiation instead of waiting passively for the on-prem firewall.
Useful when the tunnel is first established or after a rekey/failure.
Does not keep the tunnel alive during idle periods
Does not detect if the peer becomes unreachable mid-session
Does not generate traffic across the tunnel
Think of it as: "Who makes the first phone call?" — not "Is the call still connected?"
Periodically sends R-U-THERE messages to check if the peer is alive.
If no reply → tears down SA and (with "Restart") re-initiates.
DPD only triggers when there's suspicion of a dead peer (usually after missed traffic).
It is reactive, not proactive.
Some firewalls only send DPD when idle → if peer silently drops, it can take minutes to detect.
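As a concrete example of these knobs, on a strongSwan gateway the behaviour above is controlled by a few `ipsec.conf` settings; the values here mirror this document's recommendations, and the connection name is hypothetical:

```
conn aws-tunnel-1
    keyexchange=ikev2
    dpdaction=restart   # on timeout, rebuild the SA instead of just clearing it
    dpddelay=10s        # send a liveness check after 10 s of inactivity
    # Note: with IKEv2, strongSwan declares the peer dead via IKE
    # retransmission timeouts; the dpdtimeout= setting only applies to IKEv1.
```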
Even with Startup=Start and DPD enabled, you need continuous interesting traffic through the tunnel because:
IPSec SAs can become "half-open" (one side thinks it's up, other doesn't)
NAT/stateful devices in between (ISP, firewall) may time out idle UDP 4500 sessions after 30–300 seconds
Manufacturing traffic may be bursty or idle overnight, causing the tunnel to appear dead
Forces faster failover to the second AWS tunnel
Firewall | Keepalive Mechanism
Palo Alto | Tunnel Monitor → ping AWS tunnel inside IP every 3–10s
Fortinet | set keepalive enable + ping monitor, or dpd on-idle/on-demand
Cisco ASA/FTD | SLA Monitor + Track object pinging AWS side
Check Point | Enable Permanent Tunnels (Tunnel Management)
SonicWall | Enable "Keep Alive" checkbox on VPN policy
Juniper SRX | VPN Monitor with ping + optimized option
Target to ping: AWS tunnel inside IP (e.g., 169.254.x.x) — always reachable when tunnel is up.
┌─────────────────────────────────────────────────────────┐
│ Startup Action (Start) → Brings tunnel UP initially │
│ DPD (Restart) → Detects dead peer, rebuilds │
│ Tunnel Monitor/Keepalive→ Keeps tunnel ACTIVE always │
│ BGP Hellos (if used) → Extra liveness + failover │
└─────────────────────────────────────────────────────────┘
Each layer covers a gap the others don't.
AWS side:
Startup Action = Start
DPD Timeout Action = Restart
DPD Timeout = 30 sec
On-prem firewall:
DPD enabled (interval 10s, retries 3)
Tunnel Monitor / Keepalive ping to AWS inside IP every 10s
BGP over the tunnel (BGP keepalives add another liveness layer — every 10s by default, hold time 30s)
Routing:
BGP preferred over static — it inherently generates keepalive traffic and handles failover.
If BGP is configured over the VPN:
BGP sends KEEPALIVE packets every 10 seconds
This traffic itself keeps the tunnel alive
BGP session drop = immediate failover to tunnel 2
You may not need an explicit ping monitor, but it's still recommended as a belt-and-braces measure.
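For example, on an FRR-based router the AWS-compatible timers can be set explicitly. A sketch — the local ASN and neighbor address are placeholders (169.254.x.x stands for your tunnel inside IP, 64512 is the AWS default ASN):

```shell
# FRR vtysh: 10 s keepalive / 30 s hold time to match the AWS side.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'neighbor 169.254.x.x remote-as 64512' \
  -c 'neighbor 169.254.x.x timers 10 30'
```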
"Startup Action = Start" alone is not enough. We still need DPD + a keepalive mechanism (tunnel monitor or BGP) on the on-prem firewall to keep the tunnel continuously alive and detect failures quickly.