Common reasons a Site-to-Site VPN tunnel keeps dropping:
IKE/IPSec lifetime mismatch between AWS and firewall
DPD (Dead Peer Detection) not configured or too aggressive
No interesting traffic → tunnel goes idle and tears down
NAT-T / MTU / fragmentation issues
Single tunnel used (AWS provides 2 tunnels per VPN for redundancy)
PFS / DH group mismatch after rekey
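Before digging into firewall logs, it helps to confirm what AWS itself sees. A quick sketch using the AWS CLI (the VPN connection ID is a placeholder — substitute your own):

```shell
# Show per-tunnel status, last state change, and the reason AWS reports.
# vpn-0123456789abcdef0 is a hypothetical connection ID.
aws ec2 describe-vpn-connections \
  --vpn-connection-ids vpn-0123456789abcdef0 \
  --query 'VpnConnections[].VgwTelemetry[].{Tunnel:OutsideIpAddress,Status:Status,Reason:StatusMessage,LastChange:LastStatusChange}' \
  --output table
```

The `StatusMessage` field often points directly at the cause (e.g., a Phase 2 proposal mismatch).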
AWS Site-to-Site VPN gives you 2 tunnels per connection. Use both.
Parameter | Recommended Value
IKE Version | IKEv2
Phase 1 Lifetime | 28800 sec (8 hr)
Phase 2 Lifetime | 3600 sec (1 hr)
DH Group (P1/P2) | 14 or higher (14, 19, 20, 21, 24)
Encryption | AES256
Integrity | SHA256
PFS | Enabled (Group 14+)
DPD Timeout Action | Restart (not Clear)
Startup Action | Start (AWS initiates)
Rekey Margin Time | 540 sec
Rekey Fuzz | 100%
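These values can be applied on the AWS side per tunnel with `modify-vpn-tunnel-options`. A sketch (connection ID and outside IP are placeholders):

```shell
# Apply the recommended tunnel options to one tunnel; repeat for the second.
aws ec2 modify-vpn-tunnel-options \
  --vpn-connection-id vpn-0123456789abcdef0 \
  --vpn-tunnel-outside-ip-address 203.0.113.10 \
  --tunnel-options '{
    "IKEVersions": [{"Value": "ikev2"}],
    "Phase1LifetimeSeconds": 28800,
    "Phase2LifetimeSeconds": 3600,
    "Phase1DHGroupNumbers": [{"Value": 14}],
    "Phase2DHGroupNumbers": [{"Value": 14}],
    "Phase1EncryptionAlgorithms": [{"Value": "AES256"}],
    "Phase2EncryptionAlgorithms": [{"Value": "AES256"}],
    "Phase1IntegrityAlgorithms": [{"Value": "SHA2-256"}],
    "Phase2IntegrityAlgorithms": [{"Value": "SHA2-256"}],
    "DPDTimeoutSeconds": 30,
    "DPDTimeoutAction": "restart",
    "StartupAction": "start",
    "RekeyMarginTimeSeconds": 540,
    "RekeyFuzzPercentage": 100
  }'
```

Note that modifying tunnel options briefly interrupts that tunnel, so change one tunnel at a time.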
Set "Startup Action = Start" so AWS initiates the tunnel instead of waiting for the on-prem side.
Match AWS parameters exactly (lifetimes, DH group, encryption, PFS).
Enable DPD (interval 10s, retries 3).
Configure both AWS tunnels as active/active or active/standby (BGP preferred).
Use BGP dynamic routing instead of static routes — auto failover between tunnels.
Set MSS clamping to 1379 (or MTU 1436) to avoid fragmentation.
Enable NAT Traversal (NAT-T) on UDP 4500.
Whitelist AWS tunnel public IPs on firewall WAN ACLs.
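On a Linux-based VPN gateway, the MSS clamp above can be applied with iptables; vendor firewalls have their own equivalent setting. A minimal sketch:

```shell
# Clamp TCP MSS on SYN packets crossing the gateway to avoid fragmentation
# inside the IPSec tunnel. 1379 matches the recommendation above; adjust it
# if your cipher suite's overhead differs.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --set-mss 1379
```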
If traffic is sporadic, tunnels may idle out:
Configure an SLA monitor / ping probe from on-prem to an AWS-side IP (e.g., EC2 ENI or VGW tunnel inside IP) every 10 sec.
On Palo Alto: Use Tunnel Monitor
On Fortinet: Use dead-peer-detection on-demand + ping keepalive
On Cisco ASA/FTD: Use SLA monitor with track
On Check Point: Enable Permanent Tunnels
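If the gateway is a Linux host (e.g., running strongSwan) rather than one of the appliances above, a simple keepalive script achieves the same effect. A hypothetical sketch — the target IP and interval are placeholders you would adapt:

```shell
#!/bin/sh
# keepalive.sh -- ping the AWS tunnel inside IP every 10 s so the SA
# always carries traffic and never idles out. Run under systemd or init.
TARGET="${1:-169.254.x.x}"   # replace with your tunnel inside IP

while true; do
  ping -c 1 -W 2 "$TARGET" >/dev/null 2>&1 \
    || logger "VPN keepalive: $TARGET unreachable"
  sleep 10
done
```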
Use BGP over both AWS tunnels (ASN from AWS: 64512 default, on-prem: your private ASN).
For a higher SLA: use AWS Transit Gateway + VPN instead of a VGW; TGW supports ECMP across multiple tunnels (e.g., 4 tunnels with two VPN connections).
Consider AWS Direct Connect + VPN backup for production manufacturing workloads (low latency + encryption).
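ECMP must be enabled when the Transit Gateway is created. A sketch (description and defaults are illustrative):

```shell
# Create a Transit Gateway with ECMP enabled for VPN attachments.
aws ec2 create-transit-gateway \
  --description "prod-vpn-tgw" \
  --options AmazonSideAsn=64512,VpnEcmpSupport=enable
```

With `VpnEcmpSupport=enable`, BGP-learned routes with identical attributes across multiple VPN attachments are load-balanced rather than one tunnel sitting idle.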
CloudWatch metrics: TunnelState, TunnelDataIn/Out
Set SNS alarms for tunnel down events
On-prem: SNMP traps for IPSec SA deletion
Enable VPN logs on firewall (Phase 1 / Phase 2 negotiation logs) to catch rekey failures
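The CloudWatch side of this can be wired up with a metric alarm on `TunnelState` (0 = down, 1 = up). A sketch — the VPN ID and SNS topic ARN are placeholders:

```shell
# Alarm when any tunnel of the connection is down: with Minimum as the
# statistic across both tunnels, a value below 1 means at least one is down.
aws cloudwatch put-metric-alarm \
  --alarm-name vpn-tunnel-down \
  --namespace AWS/VPN \
  --metric-name TunnelState \
  --dimensions Name=VpnId,Value=vpn-0123456789abcdef0 \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:vpn-alerts
```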
You should never need to rekey manually. If you find yourself doing so, the usual causes are:
Lifetimes don't match → AWS drops SA before firewall renegotiates
PFS mismatch → rekey fails silently
Fix by aligning Phase 2 lifetime and enabling DPD Restart
Both AWS tunnels configured on firewall
IKEv2 with matching P1/P2 lifetimes
DPD enabled (Restart action)
BGP dynamic routing enabled
Tunnel monitor / keepalive ping configured
MSS/MTU clamping applied
CloudWatch alarms on tunnel state
Startup Action = Start on AWS side
Tells AWS to initiate the IKE negotiation instead of waiting passively for the on-prem firewall.
Useful when the tunnel is first established or after a rekey/failure.
Does not keep the tunnel alive during idle periods
Does not detect if the peer becomes unreachable mid-session
Does not generate traffic across the tunnel
Think of it as: "Who makes the first phone call?" — not "Is the call still connected?"
Periodically sends R-U-THERE messages to check if the peer is alive.
If no reply → tears down SA and (with "Restart") re-initiates.
DPD only triggers when there's suspicion of a dead peer (usually after missed traffic).
It is reactive, not proactive.
Some firewalls only send DPD when idle → if peer silently drops, it can take minutes to detect.
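As a concrete example of these knobs, on a strongSwan gateway the behaviour above is controlled by a few `ipsec.conf` settings; the values here mirror this document's recommendations, and the connection name is hypothetical:

```
conn aws-tunnel-1
    keyexchange=ikev2
    dpdaction=restart   # on timeout, rebuild the SA instead of just clearing it
    dpddelay=10s        # send a liveness check after 10 s of inactivity
    # Note: with IKEv2, strongSwan declares the peer dead via IKE
    # retransmission timeouts; the dpdtimeout= setting only applies to IKEv1.
```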
Even with Startup=Start and DPD enabled, you need continuous interesting traffic through the tunnel because:
IPSec SAs can become "half-open" (one side thinks it's up, other doesn't)
NAT/stateful devices in between (ISP, firewall) may time out idle UDP 4500 sessions after 30–300 seconds
Manufacturing traffic may be bursty or idle overnight, causing the tunnel to appear dead
Forces faster failover to the second AWS tunnel
Firewall | Keepalive Mechanism
Palo Alto | Tunnel Monitor → ping AWS tunnel inside IP every 3–10s
Fortinet | set keepalive enable + ping monitor, or dpd on-idle/on-demand
Cisco ASA/FTD | SLA Monitor + Track object pinging AWS side
Check Point | Enable Permanent Tunnels (Tunnel Management)
SonicWall | Enable "Keep Alive" checkbox on VPN policy
Juniper SRX | VPN Monitor with ping + optimized option
Target to ping: AWS tunnel inside IP (e.g., 169.254.x.x) — always reachable when tunnel is up.
┌─────────────────────────────────────────────────────────┐
│ Startup Action (Start) → Brings tunnel UP initially │
│ DPD (Restart) → Detects dead peer, rebuilds │
│ Tunnel Monitor/Keepalive→ Keeps tunnel ACTIVE always │
│ BGP Hellos (if used) → Extra liveness + failover │
└─────────────────────────────────────────────────────────┘
Each layer covers a gap the others don't.
AWS side:
Startup Action = Start
DPD Timeout Action = Restart
DPD Timeout = 30 sec
On-prem firewall:
DPD enabled (interval 10s, retries 3)
Tunnel Monitor / Keepalive ping to AWS inside IP every 10s
BGP over the tunnel (BGP keepalives add another liveness layer — every 10s by default, hold time 30s)
Routing:
BGP preferred over static — it inherently generates keepalive traffic and handles failover.
If BGP is configured over the VPN:
BGP sends KEEPALIVE packets every 10 seconds
This traffic itself keeps the tunnel alive
BGP session drop = immediate failover to tunnel 2
You may not need an explicit ping monitor, but it's still recommended as a belt-and-braces measure.
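For example, on an FRR-based router the AWS-compatible timers can be set explicitly. A sketch — the local ASN and neighbor address are placeholders (169.254.x.x stands for your tunnel inside IP, 64512 is the AWS default ASN):

```shell
# FRR vtysh: 10 s keepalive / 30 s hold time to match the AWS side.
vtysh \
  -c 'configure terminal' \
  -c 'router bgp 65000' \
  -c 'neighbor 169.254.x.x remote-as 64512' \
  -c 'neighbor 169.254.x.x timers 10 30'
```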
"Startup Action = Start" alone is not enough. We still need DPD + a keepalive mechanism (tunnel monitor or BGP) on the on-prem firewall to keep the tunnel continuously alive and detect failures quickly.