How Do I Fix AWS Lambda Timeouts and Cold Start Issues?
TL;DR
- Bottom line: Lambda timeouts stem from 4 root causes: insufficient memory/CPU (Lambda allocates CPU proportionally to memory), downstream service latency, VPC networking misconfiguration, and deployment package bloat causing slow INIT. Cold starts affect <1% of invocations but add 100ms–5s latency. Fix timeouts by increasing memory to 1769 MB (1 full vCPU), adding connection timeouts to downstream calls, and configuring NAT gateway for VPC functions. Fix cold starts with SnapStart (Java/Python/.NET), provisioned concurrency, or smaller packages.
- Key tool/command:
aws lambda update-function-configuration --function-name my-func --memory-size 1769 --timeout 30
- Watch out for: Setting timeout close to average duration causes intermittent failures. The default 3-second timeout is almost always too low for production functions.
- Works with: All Lambda runtimes (Node.js, Python, Java, Go, .NET, Rust). SnapStart requires Java 11+, Python 3.12+, or .NET 8+. Provisioned concurrency works with all runtimes.
Constraints
- Standard Lambda timeout is 900 seconds (15 minutes). For longer workloads, use Lambda Durable Functions (released late 2025; up to 1 year via checkpoint/replay across containers) or fall back to Step Functions, ECS, or Fargate. [src1, src9]
- INIT phase has a hard 10-second timeout (extended to 15 min with provisioned concurrency or SnapStart). If initialization takes >10s, the function fails before the handler runs. [src3]
- SnapStart only works on published function versions and aliases — not $LATEST. Cannot be combined with provisioned concurrency, Amazon EFS, or ephemeral storage >512 MB. [src4]
- Lambda functions in a VPC have no internet access by default. They require a NAT gateway in a public subnet for outbound traffic. Without it, external HTTP calls timeout silently. [src5]
- Memory ranges from 128 MB to 10,240 MB. CPU is allocated proportionally: 1769 MB = 1 vCPU, 3538 MB = 2 vCPUs. Below 1769 MB, your function runs on fractional CPU. [src1, src3]
- Lambda supports at most 20 concurrent TCP connections for DNS resolution. Exceeding this limit causes UNKNOWNHOSTEXCEPTION errors regardless of the timeout setting. [src5]
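The memory-to-CPU relationship above can be sketched as a small helper. This is a rough estimate that assumes CPU scales linearly with memory across the whole range, using only the 1769 MB = 1 vCPU figure quoted above; the function name is illustrative.

```python
MB_PER_VCPU = 1769            # figure quoted above: 1769 MB = 1 vCPU
MIN_MB, MAX_MB = 128, 10_240  # configurable memory range


def estimated_vcpus(memory_mb: int) -> float:
    """Rough vCPU share for a memory setting, assuming linear scaling."""
    if not MIN_MB <= memory_mb <= MAX_MB:
        raise ValueError(f"memory must be between {MIN_MB} and {MAX_MB} MB")
    return round(memory_mb / MB_PER_VCPU, 2)


for mb in (128, 512, 1769, 3538):
    print(f"{mb:>5} MB -> ~{estimated_vcpus(mb)} vCPU")
```

At 128 MB the estimate is well under a tenth of a vCPU, which is why CPU-bound handlers often time out at low memory settings.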
Quick Reference
| # | Cause | Likelihood | Signature | Fix |
|---|---|---|---|---|
| 1 | Insufficient memory/CPU | ~30% | Duration near timeout; Max Memory Used near Memory Size | Increase memory to 1769 MB (1 vCPU) [src1, src3] |
| 2 | Downstream service timeout | ~25% | Task timed out after X.XX seconds; no error log before timeout | Add explicit connection/read timeouts (5s/10s) to HTTP clients [src2] |
| 3 | VPC without NAT gateway | ~20% | ETIMEDOUT or Task timed out on external HTTP calls | Add NAT gateway to public subnet; route private subnet through it [src5] |
| 4 | Cold start during INIT | ~10% | Init Duration: NNNNms in REPORT log; first invocation slow | SnapStart, provisioned concurrency, or reduce package size [src3, src4] |
| 5 | Large deployment package | ~5% | High Init Duration (>1s); large ZIP artifact | Tree-shake dependencies; use Lambda Layers [src6, src7] |
| 6 | Recursive/infinite loop | ~5% | Function always times out at exact timeout limit | Check S3 trigger writing to same bucket, SQS re-queue loops [src2] |
| 7 | Payload too large | ~3% | Timeout with large event payloads | Batch smaller; use S3 pre-signed URLs for large data [src2] |
| 8 | Default 3s timeout | ~2% | Task timed out after 3.00 seconds | Set 30–60s for APIs, 300s for data processing [src1] |
| 9 | Network ACL blocking ephemeral ports | Rare | Intermittent ETIMEDOUT in VPC functions | Allow TCP/UDP ports 1024–65535 in subnet Network ACLs [src5] |
| 10 | DNS resolution limit exceeded | Rare | UNKNOWNHOSTEXCEPTION under high concurrency | Reduce concurrent DNS lookups; max 20 TCP DNS connections [src5] |
Decision Tree
START — Lambda function timing out or slow
├── Check REPORT log in CloudWatch
│ ├── "Init Duration" present and > 1000ms?
│ │ ├── YES → COLD START ISSUE
│ │ │ ├── Runtime is Java/Python/.NET? → Enable SnapStart [src4]
│ │ │ ├── Package > 50 MB? → Tree-shake, use Layers [src6, src7]
│ │ │ ├── Need guaranteed <100ms start? → Provisioned Concurrency [src3]
│ │ │ └── Heavy SDK imports? → Lazy-load, import only needed clients [src8]
│ │ └── NO → RUNTIME TIMEOUT ISSUE ↓
│ │
│ ├── "Max Memory Used" close to "Memory Size"?
│ │ ├── YES → Increase memory (CPU scales up with it) [src1, src3]
│ │ └── NO ↓
│ │
│ ├── Duration close to timeout every time?
│ │ ├── YES → Likely infinite loop or recursive trigger [src2]
│ │ └── NO ↓
│ │
│ ├── Function in VPC?
│ │ ├── YES → Check NAT gateway exists for outbound internet [src5]
│ │ │ ├── No NAT → Add NAT gateway + route table
│ │ │ ├── AWS services only? → Use VPC endpoints [src5]
│ │ │ └── Intermittent? → Check Network ACL ephemeral ports [src5]
│ │ └── NO ↓
│ │
│ └── Timeout only on some invocations?
│ ├── YES → Downstream service latency → add client-side timeouts [src2]
│ └── NO → Increase function timeout; check payload size [src1]
│
└── No REPORT log at all?
└── Check execution role has AWSLambdaBasicExecutionRole [src2]
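The tree above can be approximated in code for triaging metrics pulled from REPORT logs. The sketch below is one possible reading of it: the 95% threshold for "close to" and the function and parameter names are assumptions, not AWS-defined values.

```python
from typing import Optional, Sequence

NEAR = 0.95  # "close to" is read as >= 95% (an assumption, not an AWS rule)


def diagnose(durations_ms: Sequence[float], timeout_ms: float,
             memory_mb: float, max_memory_used_mb: float,
             init_duration_ms: Optional[float] = None,
             in_vpc: bool = False) -> str:
    """Return the first matching branch of the decision tree as a label."""
    if not durations_ms:
        raise ValueError("at least one duration is required")
    if init_duration_ms is not None and init_duration_ms > 1000:
        return "cold start: SnapStart, provisioned concurrency, or smaller package"
    if max_memory_used_mb >= NEAR * memory_mb:
        return "memory/CPU: increase memory"
    timed_out = [d >= NEAR * timeout_ms for d in durations_ms]
    if all(timed_out):
        return "possible infinite loop or recursive trigger"
    if in_vpc:
        return "VPC: check NAT gateway, VPC endpoints, and network ACLs"
    if any(timed_out):
        return "downstream latency: add client-side timeouts"
    return "increase function timeout; check payload size"


print(diagnose([210.0, 3000.0, 180.0], timeout_ms=3000,
               memory_mb=512, max_memory_used_mb=100))
```

The branches are checked in the same order as the tree, so a function that is both memory-starved and VPC-attached is reported as a memory problem first.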
Step-by-Step Guide
1. Check CloudWatch REPORT logs for diagnosis
Every Lambda invocation produces a REPORT line with key metrics. This is your starting point. [src2, src3]
# CloudWatch Logs Insights — find recent timeouts
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 20
# Analyze cold start frequency and duration
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart, max(@initDuration) as maxColdStart,
count(*) as coldStartCount
| sort coldStartCount desc
Verify: REPORT line shows Duration, Billed Duration, Memory Size, Max Memory Used, and Init Duration (if cold start).
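If REPORT lines are exported from CloudWatch for offline analysis, the fields listed above can be extracted with a short parser. This is a sketch that assumes the usual tab-separated `Name: value unit` layout; the sample line and its request ID are made up for illustration.

```python
import re

# Longer names come first so "Duration" never shadows "Billed Duration" etc.
_FIELD = re.compile(
    r"(Billed Duration|Init Duration|Restore Duration|Max Memory Used|"
    r"Memory Size|Duration): ([\d.]+) (?:ms|MB)"
)


def parse_report_line(line: str) -> dict:
    """Turn one REPORT log line into {'duration': 3000.0, ...}."""
    if not line.startswith("REPORT"):
        raise ValueError("not a REPORT line")
    return {name.lower().replace(" ", "_"): float(value)
            for name, value in _FIELD.findall(line)}


sample = ("REPORT RequestId: 8f5a\tDuration: 3000.00 ms\tBilled Duration: 3000 ms"
          "\tMemory Size: 128 MB\tMax Memory Used: 127 MB\tInit Duration: 1432.11 ms")
metrics = parse_report_line(sample)
print(metrics)
```

A missing `init_duration` key means the invocation was warm.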
2. Increase memory to get more CPU
Lambda allocates CPU proportionally to memory. Below 1769 MB, you get fractional CPU. This is the single most impactful tuning knob. [src1, src3]
# Set memory to 1769 MB (1 full vCPU) and timeout to 30 seconds
aws lambda update-function-configuration \
--function-name my-function \
--memory-size 1769 \
--timeout 30
Verify:
aws lambda get-function-configuration --function-name my-function --query '{Memory: MemorySize, Timeout: Timeout}'
3. Add explicit timeouts to downstream calls
Never rely on the Lambda timeout as your only safety net. Set client-side timeouts on every external call. [src2]
# Python: explicit timeouts on AWS SDK calls
import boto3
from botocore.config import Config

boto_config = Config(
    connect_timeout=5,   # 5s to establish connection
    read_timeout=10,     # 10s to read response
    retries={'max_attempts': 2}
)
dynamodb = boto3.client('dynamodb', config=boto_config)
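Fixed timeouts can still exceed the time the invocation has left. One way to handle that, sketched below, is to derive each call's timeout from the remaining invocation time. The helper name, the 1-second reserve, and the 10-second cap are illustrative assumptions.

```python
def call_timeout_seconds(remaining_ms: int, reserve_ms: int = 1000,
                         cap_seconds: float = 10.0) -> float:
    """Largest safe timeout for the next downstream call, in seconds.

    Keeps `reserve_ms` back so the handler can still log and return an
    error response before the Lambda timeout fires.
    """
    budget = (remaining_ms - reserve_ms) / 1000
    if budget <= 0:
        raise TimeoutError("not enough time left to start another call")
    return min(budget, cap_seconds)


# Inside a handler this would be fed from the context object, for example:
#   timeout = call_timeout_seconds(context.get_remaining_time_in_millis())
print(call_timeout_seconds(30_000))  # plenty of time left: capped
print(call_timeout_seconds(4_000))   # tight budget: remaining minus reserve
```

The returned value can then be passed as the read timeout of whichever client makes the call.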
4. Fix VPC internet access (if VPC-attached)
VPC-attached functions have no internet access by default. All outbound traffic goes through the VPC. [src5]
# SAM template: Lambda attached to a private subnet (NAT gateway and route are defined separately)
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769
      Timeout: 30
      VpcConfig:
        SecurityGroupIds:
          - !Ref LambdaSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet1
5. Enable SnapStart for Java, Python, or .NET
SnapStart takes a microVM snapshot after INIT, reducing cold starts from seconds to sub-second. [src4]
# Enable SnapStart and publish a version
aws lambda update-function-configuration \
--function-name my-function \
--snap-start ApplyOn=PublishedVersions
aws lambda publish-version --function-name my-function
Verify: Check CloudWatch for Restore Duration instead of Init Duration; restore should be <200ms.
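Because the snapshot is taken after INIT, anything generated at module scope is frozen into it. The sketch below illustrates the difference using a plain Python handler; the names are hypothetical, and it only simulates the effect by calling the handler twice in one process.

```python
import uuid

# Module scope runs during INIT. Under SnapStart this value is captured in
# the snapshot, so every restored environment would start with the same ID.
INIT_TIME_ID = uuid.uuid4().hex  # avoid for anything that must be unique


def handler(event, context):
    # Generated per invocation, after restore, so it is unique per request.
    request_token = uuid.uuid4().hex
    return {"request_token": request_token, "init_id": INIT_TIME_ID}


first = handler({}, None)
second = handler({}, None)
print(first["init_id"] == second["init_id"])              # same INIT-time value
print(first["request_token"] == second["request_token"])  # per-request values differ
```

The same reasoning applies to random seeds and keys: create them inside the handler, or in a hook that runs after restore.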
6. Configure provisioned concurrency for zero cold starts
Provisioned concurrency keeps environments pre-warmed. Most reliable but most expensive approach. [src3, src6]
# Set 10 provisioned concurrent executions
aws lambda put-provisioned-concurrency-config \
--function-name my-function \
--qualifier live \
--provisioned-concurrent-executions 10
Verify: Status should be READY. REPORT log should show no Init Duration.
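Since this is the most expensive option, it helps to estimate the idle charge before enabling it. The sketch below uses the approximate $0.015/GB-hour rate quoted under Important Caveats and assumes the environments stay provisioned for a 730-hour month; actual pricing varies by Region and architecture, and invocation charges are not included.

```python
def provisioned_monthly_cost(environments: int, memory_mb: int,
                             rate_per_gb_hour: float = 0.015,
                             hours: float = 730) -> float:
    """Idle cost in USD of keeping `environments` warm for `hours`."""
    provisioned_gb = environments * memory_mb / 1024
    return round(provisioned_gb * rate_per_gb_hour * hours, 2)


# Example: the 10 environments configured above, at two memory sizes.
for memory_mb in (512, 1769):
    cost = provisioned_monthly_cost(10, memory_mb)
    print(f"10 environments at {memory_mb} MB: ~${cost}/month")
```

Scaling provisioned concurrency down outside peak hours reduces the `hours` term directly.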
Code Examples
Python: Lambda handler with cold start optimization
# Input: API Gateway event
# Output: JSON response with downstream data
import json, os, boto3
from botocore.config import Config

# INIT phase: runs once per cold start, persists across warm invocations
boto_config = Config(connect_timeout=5, read_timeout=10, retries={'max_attempts': 2})
dynamodb = boto3.resource('dynamodb', config=boto_config)
table = dynamodb.Table(os.environ['TABLE_NAME'])

def handler(event, context):
    # Bail out early if too little time remains for the downstream call
    remaining_ms = context.get_remaining_time_in_millis()
    if remaining_ms < 5000:
        return {'statusCode': 503, 'body': json.dumps({'error': 'Insufficient time'})}
    try:
        # pathParameters is null (None) when the request has no path parameters
        item_id = (event.get('pathParameters') or {}).get('id')
        if not item_id:
            return {'statusCode': 400, 'body': json.dumps({'error': 'Missing id'})}
        response = table.get_item(Key={'id': item_id})
        item = response.get('Item')
        if not item:
            return {'statusCode': 404, 'body': json.dumps({'error': 'Not found'})}
        return {'statusCode': 200, 'body': json.dumps(item, default=str)}
    except Exception as e:
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}
Node.js: Optimized handler with modular imports
// Input: API Gateway event
// Output: JSON response
const { DynamoDBClient, GetItemCommand } = require('@aws-sdk/client-dynamodb');
const { unmarshall } = require('@aws-sdk/util-dynamodb');

// Created once during INIT and reused across warm invocations
const client = new DynamoDBClient({
  requestHandler: { connectionTimeout: 5000, socketTimeout: 10000 },
  maxAttempts: 2
});

exports.handler = async (event, context) => {
  if (context.getRemainingTimeInMillis() < 5000) {
    return { statusCode: 503, body: JSON.stringify({ error: 'Insufficient time' }) };
  }
  const id = event.pathParameters?.id;
  const { Item } = await client.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { id: { S: id } }
  }));
  if (!Item) return { statusCode: 404, body: JSON.stringify({ error: 'Not found' }) };
  return { statusCode: 200, body: JSON.stringify(unmarshall(Item)) };
};
Java: SnapStart-optimized handler with CRaC hooks
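// Input: API Gateway proxy request
// Output: API Gateway proxy response
// Requires: Java 11+ runtime with SnapStart enabled
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import org.crac.Core;
import org.crac.Resource;

public class Handler implements RequestHandler<APIGatewayProxyRequestEvent,
        APIGatewayProxyResponseEvent>, Resource {
    private final DynamoDbClient dynamodb = DynamoDbClient.create();

    public Handler() {
        Core.getGlobalContext().register(this); // Register CRaC hooks
    }

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Placeholder: business logic goes here and reuses the pre-warmed client
        return new APIGatewayProxyResponseEvent().withStatusCode(200);
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> ctx) {
        dynamodb.describeEndpoints(); // Pre-warm connection before snapshot
    }

    @Override
    public void afterRestore(org.crac.Context<? extends Resource> ctx) {
        // Re-validate connections after restore
    }
}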
Anti-Patterns
Wrong: Using default 3-second timeout in production
# ❌ BAD: 3-second default is almost never enough [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      # No Timeout: defaults to 3 seconds; cold start + any call > 3s = failure
Correct: Set appropriate timeout with margin
# ✅ GOOD: explicit timeout with memory tuning [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769  # 1 full vCPU
      Timeout: 30       # 30s for API backends
Wrong: Importing entire AWS SDK
// ❌ BAD — imports entire SDK (~60MB), dramatically increases cold start [src7, src8]
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();
// Init Duration: 800-1500ms due to massive import
Correct: Import only needed clients
// ✅ GOOD — modular imports, minimal cold start impact [src7, src8]
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
// Init Duration: 150-300ms — only loads what's needed
Wrong: Creating SDK clients inside the handler
# ❌ BAD: new client every invocation, wasting warm-start reuse [src3, src8]
import boto3

def handler(event, context):
    dynamodb = boto3.resource('dynamodb')  # NEW client every time
    table = dynamodb.Table('my-table')
    return table.get_item(Key={'id': event['id']})
Correct: Initialize clients outside the handler
# ✅ GOOD: client created once in INIT, reused across warm invocations [src3, src8]
import boto3

dynamodb = boto3.resource('dynamodb')  # Created once during INIT
table = dynamodb.Table('my-table')

def handler(event, context):
    return table.get_item(Key={'id': event['id']})  # Reuses warm connection
Wrong: VPC Lambda without NAT for external calls
# ❌ BAD: VPC function with no internet path silently times out [src5]
VpcConfig:
  SecurityGroupIds: [!Ref SG]
  SubnetIds: [!Ref PrivateSubnet]
# No NAT gateway: ALL external HTTP calls will ETIMEDOUT
Correct: Route VPC traffic through NAT or VPC endpoints
# ✅ GOOD: private subnet routes to NAT for internet; VPC endpoints for AWS [src5]
NATGateway:
  Type: AWS::EC2::NatGateway
  Properties:
    SubnetId: !Ref PublicSubnet
    AllocationId: !GetAtt EIP.AllocationId
PrivateRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable  # required; assumed name of the private subnet's route table
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NATGateway
Common Pitfalls
- Default 3-second timeout: The Lambda console defaults to 3 seconds. Set to at least 10–30s for API backends and 120–900s for data processing. [src1]
- Memory = CPU misconception: Developers increase timeout when the real problem is CPU starvation. At 128 MB, you get fractional CPU. Increasing to 1769 MB gives 1 full vCPU and often makes functions 10x faster at the same cost. [src1, src3]
- VPC internet access: VPC-attached functions lose internet unless you configure a NAT gateway. The error is a silent ETIMEDOUT, not a clear networking error. [src5]
- SnapStart uniqueness trap: SnapStart restores from snapshot. UUIDs, random seeds, or encryption keys generated during INIT will be identical across all restored instances. Generate these in the handler. [src4]
- Recursive invocation loops: An S3 trigger writing to the same bucket creates an infinite loop. Always write to a different resource or use key prefixes to filter. [src2]
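- Provisioned concurrency + SnapStart conflict: These cannot be combined. SnapStart is free for Java; provisioned concurrency is billed even when idle, which at ~$0.015/GB-hour works out to roughly $55/month for 10 environments at 512 MB kept warm around the clock (5 GB × $0.015 × 730 h). [src4]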
- API Gateway 29-second limit: Even with 900s Lambda timeout, API Gateway has a hard 29-second integration timeout. Use async invocation or Step Functions for long tasks. [src1]
- Cold start burst after deployment: Every code deploy invalidates warm environments. Deploy during low traffic or use traffic shifting with aliases. [src3, src6]
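The key-prefix approach to avoiding recursive S3 triggers can be sketched as follows. The prefixes and function names are hypothetical, and the sketch assumes the standard S3 event notification shape, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus

INPUT_PREFIX = "incoming/"    # hypothetical prefix the trigger listens on
OUTPUT_PREFIX = "processed/"  # hypothetical prefix for results


def keys_to_process(event: dict) -> list:
    """Return decoded object keys that sit under the input prefix only."""
    keys = []
    for record in event.get("Records", []):
        key = unquote_plus(record["s3"]["object"]["key"])  # keys are URL-encoded
        if key.startswith(INPUT_PREFIX):
            keys.append(key)
    return keys


def output_key(input_key: str) -> str:
    """Map an input key to the output prefix so writes never re-trigger."""
    return OUTPUT_PREFIX + input_key[len(INPUT_PREFIX):]


event = {"Records": [
    {"s3": {"object": {"key": "incoming/report+2024.csv"}}},
    {"s3": {"object": {"key": "processed/report+2024.csv"}}},
]}
for key in keys_to_process(event):
    print(key, "->", output_key(key))
```

Filtering in code is a second line of defence; the trigger itself should also be configured with the input prefix so output objects never invoke the function.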
Diagnostic Commands
# Check current function configuration
aws lambda get-function-configuration --function-name my-function \
--query '{Memory: MemorySize, Timeout: Timeout, Runtime: Runtime, VPC: VpcConfig.SubnetIds, SnapStart: SnapStart}'
# Find recent timeouts (CloudWatch Logs Insights)
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc | limit 50
# Analyze cold start frequency
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats count(*) as coldStarts, avg(@initDuration) as avgInitMs,
max(@initDuration) as maxInitMs, pct(@initDuration, 99) as p99InitMs
by bin(1h)
# Check memory utilization
fields @maxMemoryUsed, @memorySize, @duration
| stats avg(@maxMemoryUsed) as avgMemUsed, max(@maxMemoryUsed) as maxMemUsed
# Check provisioned concurrency status
aws lambda get-provisioned-concurrency-config \
--function-name my-function --qualifier live
# Check VPC route table for NAT gateway
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-xxxxx" \
--query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
# Test invocation
aws lambda invoke --function-name my-function \
--payload '{"test": true}' --cli-read-timeout 60 response.json
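The route table check can also be scripted. The sketch below inspects the JSON returned by `aws ec2 describe-route-tables` (without the `--query` filter) and reports whether an active default route points at a NAT gateway. The function name and the sample IDs are illustrative.

```python
def has_nat_default_route(response: dict) -> bool:
    """True if any route table has an active 0.0.0.0/0 route via a NAT gateway."""
    for table in response.get("RouteTables", []):
        for route in table.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("NatGatewayId")
                    and route.get("State") == "active"):
                return True
    return False


# Made-up responses for illustration
private_ok = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example", "State": "active"},
]}]}
private_missing = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]}]}
print(has_nat_default_route(private_ok), has_nat_default_route(private_missing))
```

A route in the `blackhole` state is treated as missing, since it usually means the NAT gateway it pointed to was deleted.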
Version History & Compatibility
| Feature | Available Since | Notes |
|---|---|---|
| Lambda timeout (max 15 min) | 2018 | Increased from 5 min; all runtimes [src1] |
| VPC Hyperplane ENI | 2019 | Eliminated ~10s VPC cold start penalty [src5] |
| Provisioned concurrency | Dec 2019 | All runtimes; eliminates cold starts [src3] |
| ARM64/Graviton support | 2021 | 20% faster cold starts vs x86; lower cost [src6] |
| SnapStart for Java | Nov 2022 | Java 11+ Corretto; free [src4] |
| INIT phase logging (INIT_REPORT) | Nov 2023 | Explicit Init/Restore phase error reporting [src3] |
| SnapStart for Python | Dec 2024 | Python 3.12+; caching/restore charges apply [src4] |
| SnapStart for .NET | Dec 2024 | .NET 8+; requires Annotations v1.6.0+ [src4] |
| Lambda Managed Instances | 2025 preview | Multi-concurrent execution on EC2-class instances |
| Lambda Durable Functions | Late 2025 (GA) | Stateful workflows up to 1 year via checkpoint/replay; hibernation pauses billing while awaiting external signals (~20-30% cost reduction for human-in-the-loop workflows) [src9] |
| Amazon Linux 2 runtime EOL | June 30, 2026 | Migrate to Amazon Linux 2023-based runtimes (Node.js 20+, Python 3.12+, Java 21); AL2-based Java 8/11/17 deprecate without migration |
Decision Logic
If Init Duration is present and >1000ms in REPORT log
--> Cold start is the bottleneck. If runtime is Java/Python 3.12+/.NET 8+, enable SnapStart (free for Java, charged for Python/.NET). [src3, src4]
If runtime is Node.js/Go/Ruby/Rust and cold starts are unacceptable
--> SnapStart is not available as of May 2026. Use provisioned concurrency, switch to ARM64 Graviton (15-20% faster cold start), or trim the deployment package. [src4, src6, src7]
If Max Memory Used is close to Memory Size in REPORT log
--> Increase memory to at least 1769 MB (1 full vCPU). CPU scales with memory; this is the single highest-leverage tuning knob. [src1, src3]
If function is VPC-attached and external HTTPS calls hang
--> Add a NAT gateway in a public subnet and route the private subnet's 0.0.0.0/0 through it. For AWS-only calls (DynamoDB, S3), use VPC endpoints instead to avoid NAT cost. [src5]
If workload exceeds 15 minutes
--> Use Lambda Durable Functions (late 2025+) for stateful workflows up to 1 year, OR Step Functions for orchestration, OR ECS/Fargate for long-running compute. Do not try to chain regular Lambda invocations manually. [src1, src9]
If Task timed out at 3.00 seconds appears
--> Default timeout is too low. Set 30s for API backends, 300-900s for data processing. Always set client-side timeouts on downstream calls before the Lambda timeout fires. [src1, src2]
If timeouts are intermittent and only under load
--> Check DNS resolution limits (max 20 concurrent TCP connections for DNS) and connection pool exhaustion on downstream databases. Lower per-invocation parallelism or pool connections via RDS Proxy. [src5]
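Lowering per-invocation parallelism can be as simple as capping a worker pool. The sketch below bounds the number of in-flight calls; the cap of 10 is an assumption chosen to stay under the 20-connection DNS limit, and `fetch` stands in for whatever function performs the real request.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 10  # assumption: comfortably below the 20-connection DNS limit


def fetch_all(urls, fetch, max_parallel=MAX_PARALLEL):
    """Run fetch(url) for every URL with at most max_parallel in flight.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(fetch, urls))


# Stand-in for a real HTTP or database call
def fake_fetch(url):
    return f"fetched {url}"


print(fetch_all([f"https://example.com/{i}" for i in range(3)], fake_fetch))
```

The same cap applies to database work: a bounded pool keeps one invocation from opening more connections than the downstream service can serve.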
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Timeout <15 min and stateless workload | Processing >15 min (single invocation) | Lambda Durable Functions (stateful, ≤1 year), Step Functions, or ECS/Fargate |
| Cold start <2s is acceptable | Hard real-time <10ms requirement | EC2, ECS, or always-on containers |
| Traffic is spiky or unpredictable | Steady >1000 req/s sustained | ECS/Fargate with ALB (cheaper at scale) |
| SnapStart available for your runtime | Need zero cold starts guaranteed | Provisioned concurrency or containers |
| API Gateway integration <29s | Long-running API response >29s | Async invocation + polling, or WebSocket API |
Important Caveats
- Cold starts affect <1% of invocations in steady-state but can be 100% after a deployment or traffic spike from zero. Monitor Init Duration in CloudWatch, not just averages. [src3, src7]
- Memory cost trade-off: 14x memory does not always mean 14x cost. Faster execution = shorter billed duration. Use AWS Lambda Power Tuning to find cost-optimal memory. [src1]
- SnapStart snapshot uniqueness: Initialization code runs once, then the snapshot is reused. Random values or credentials generated during INIT are identical across restored instances. Generate in the handler. [src4]
- Provisioned concurrency costs even when idle: ~$0.015/GB-hour. Use Application Auto Scaling to adjust based on traffic patterns. [src3]
- ARM64 (Graviton) gives free performance: Up to 20% better price-performance with faster cold starts. No code changes needed for interpreted runtimes. [src6]
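The memory cost trade-off noted above is easy to check with arithmetic, since on-demand cost is memory (GB) × billed duration (s) × price per GB-second. The sketch below assumes an x86 rate of $0.0000166667 per GB-second and two hypothetical durations; both are illustrations rather than measurements, so substitute your own REPORT figures and current regional pricing.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 on-demand rate; verify for your Region


def invocation_cost(memory_mb: int, billed_ms: float) -> float:
    """Compute charge in USD for one invocation (request fee not included)."""
    return (memory_mb / 1024) * (billed_ms / 1000) * PRICE_PER_GB_SECOND


slow = invocation_cost(128, 10_000)  # fractional CPU, hypothetical 10 s run
fast = invocation_cost(1769, 800)    # 1 vCPU, hypothetical 0.8 s run
print(f"128 MB for 10 s   : ${slow:.8f}")
print(f"1769 MB for 0.8 s : ${fast:.8f}")
print(f"cost ratio        : {fast / slow:.2f}x for ~14x the memory")
```

In this made-up case the larger configuration costs about the same per invocation while finishing far sooner, which is the effect AWS Lambda Power Tuning measures empirically.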