How Do I Fix AWS Lambda Timeouts and Cold Start Issues?

TL;DR

Bottom line: Lambda timeouts stem from 4 root causes: insufficient memory/CPU (Lambda allocates CPU proportionally to memory), downstream service latency, VPC networking misconfiguration, and deployment package bloat causing slow INIT. Cold starts affect <1% of invocations but add 100ms–5s latency. Fix timeouts by increasing memory to 1769 MB (1 full vCPU), adding connection timeouts to downstream calls, and configuring NAT gateway for VPC functions. Fix cold starts with SnapStart (Java/Python/.NET), provisioned concurrency, or smaller packages.
Key tool/command: aws lambda update-function-configuration --function-name my-func --memory-size 1769 --timeout 30
Watch out for: Setting timeout close to average duration causes intermittent failures. The default 3-second timeout is almost always too low for production functions.
Works with: All Lambda runtimes (Node.js, Python, Java, Go, .NET, Rust). SnapStart requires Java 11+, Python 3.12+, or .NET 8+. Provisioned concurrency works with all runtimes.
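The warning about timeouts set close to the average duration can be made concrete by sizing the timeout from a high percentile instead. The following is a minimal sketch, not an official formula: the helper name, the nearest-rank p99, and the 2x headroom factor are all assumptions chosen for illustration, and the sample durations are hypothetical.

```python
def suggest_timeout_seconds(durations_ms, headroom=2.0, floor_s=3, cap_s=900):
    """Suggest a function timeout from observed durations (hypothetical helper).

    Uses the nearest-rank 99th percentile rather than the average, multiplies
    it by a headroom factor, and clamps the result between Lambda's 3-second
    default and the 900-second maximum.
    """
    if not durations_ms:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_ms)
    # Nearest-rank p99 using integer arithmetic: ceil(0.99 * n).
    rank = (99 * len(ordered) + 99) // 100
    p99_ms = ordered[rank - 1]
    # Round the padded value up to whole seconds.
    suggested = -(-int(p99_ms * headroom) // 1000)
    return max(floor_s, min(cap_s, suggested))


# Hypothetical sample: mostly ~800 ms with a few slow outliers.
samples = [800] * 97 + [1500, 2500, 4200]
print(suggest_timeout_seconds(samples))
```

An average-based timeout for this sample would sit near 1 second and fail on every outlier; the percentile-based value leaves room for them.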

Constraints

Standard Lambda timeout is 900 seconds (15 minutes). For longer workloads, use Lambda Durable Functions (released late 2025; up to 1 year via checkpoint/replay across containers) or fall back to Step Functions, ECS, or Fargate. [src1, src9]
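INIT phase is limited to 10 seconds for on-demand functions (up to 15 minutes with provisioned concurrency or SnapStart). If initialization takes longer than 10 seconds, the INIT attempt is cut off before the handler runs; Lambda retries it at the first invocation, where it counts against the configured function timeout. [src3]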
SnapStart only works on published function versions and aliases — not $LATEST. Cannot be combined with provisioned concurrency, Amazon EFS, or ephemeral storage >512 MB. [src4]
Lambda functions in a VPC have no internet access by default. They require a NAT gateway in a public subnet for outbound traffic. Without it, external HTTP calls time out silently. [src5]
Memory ranges from 128 MB to 10,240 MB. CPU is allocated proportionally: 1769 MB = 1 vCPU, 3538 MB = 2 vCPUs. Below 1769 MB, your function runs on fractional CPU. [src1, src3]
Lambda supports max 20 concurrent TCP connections for DNS resolution. Exceeding this causes UNKNOWNHOSTEXCEPTION regardless of timeout. [src5]
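To get a feel for the memory-to-CPU relationship described above, the sketch below estimates the vCPU share for a given memory setting. It assumes strictly linear scaling anchored at the 1769 MB = 1 vCPU figure from this section; the function name is hypothetical and the real allocation may not be exactly linear across the whole range.

```python
MB_PER_VCPU = 1769            # from the constraint above: 1769 MB = 1 vCPU
MIN_MB, MAX_MB = 128, 10240   # configurable memory range


def estimated_vcpus(memory_mb):
    """Approximate vCPU share for a memory setting, assuming linear scaling."""
    if not MIN_MB <= memory_mb <= MAX_MB:
        raise ValueError(f"memory must be between {MIN_MB} and {MAX_MB} MB")
    return memory_mb / MB_PER_VCPU


for mb in (128, 512, 1769, 3538):
    print(f"{mb:>5} MB -> ~{estimated_vcpus(mb):.2f} vCPU")
```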

Quick Reference

#	Cause	Likelihood	Signature	Fix
1	Insufficient memory/CPU	~30%	`Duration` near timeout; `Max Memory Used` near `Memory Size`	Increase memory to 1769 MB (1 vCPU) [src1, src3]
2	Downstream service timeout	~25%	`Task timed out after X.XX seconds`; no error log before timeout	Add explicit connection/read timeouts (5s/10s) to HTTP clients [src2]
3	VPC without NAT gateway	~20%	`ETIMEDOUT` or `Task timed out` on external HTTP calls	Add NAT gateway to public subnet; route private subnet through it [src5]
4	Cold start during INIT	~10%	`Init Duration: NNNNms` in REPORT log; first invocation slow	SnapStart, provisioned concurrency, or reduce package size [src3, src4]
5	Large deployment package	~5%	High `Init Duration` (>1s); large ZIP artifact	Tree-shake dependencies; use Lambda Layers [src6, src7]
6	Recursive/infinite loop	~5%	Function always times out at exact timeout limit	Check S3 trigger writing to same bucket, SQS re-queue loops [src2]
7	Payload too large	~3%	Timeout with large event payloads	Batch smaller; use S3 pre-signed URLs for large data [src2]
8	Default 3s timeout	~2%	`Task timed out after 3.00 seconds`	Set 30–60s for APIs, 300s for data processing [src1]
9	Network ACL blocking ephemeral ports	Rare	Intermittent `ETIMEDOUT` in VPC functions	Allow TCP/UDP ports 1024–65535 in subnet Network ACLs [src5]
10	DNS resolution limit exceeded	Rare	`UNKNOWNHOSTEXCEPTION` under high concurrency	Reduce concurrent DNS lookups; max 20 TCP DNS connections [src5]

Decision Tree

START — Lambda function timing out or slow
├── Check REPORT log in CloudWatch
│   ├── "Init Duration" present and > 1000ms?
│   │   ├── YES → COLD START ISSUE
│   │   │   ├── Runtime is Java/Python/.NET? → Enable SnapStart [src4]
│   │   │   ├── Package > 50 MB? → Tree-shake, use Layers [src6, src7]
│   │   │   ├── Need guaranteed <100ms start? → Provisioned Concurrency [src3]
│   │   │   └── Heavy SDK imports? → Lazy-load, import only needed clients [src8]
│   │   └── NO → RUNTIME TIMEOUT ISSUE ↓
│   │
│   ├── "Max Memory Used" close to "Memory Size"?
│   │   ├── YES → Increase memory (CPU scales proportionally with it) [src1, src3]
│   │   └── NO ↓
│   │
│   ├── Duration close to timeout every time?
│   │   ├── YES → Likely infinite loop or recursive trigger [src2]
│   │   └── NO ↓
│   │
│   ├── Function in VPC?
│   │   ├── YES → Check NAT gateway exists for outbound internet [src5]
│   │   │   ├── No NAT → Add NAT gateway + route table
│   │   │   ├── AWS services only? → Use VPC endpoints [src5]
│   │   │   └── Intermittent? → Check Network ACL ephemeral ports [src5]
│   │   └── NO ↓
│   │
│   └── Timeout only on some invocations?
│       ├── YES → Downstream service latency → add client-side timeouts [src2]
│       └── NO → Increase function timeout; check payload size [src1]
│
└── No REPORT log at all?
    └── Check execution role has AWSLambdaBasicExecutionRole [src2]
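The tree above can be expressed as a small function for use in scripts or runbooks. This is a sketch under stated assumptions: the function name, the dictionary keys, and the 90% / 95% thresholds for "close to" are hypothetical choices, not values from AWS documentation.

```python
def diagnose(report, timeout_ms, in_vpc=False, every_invocation=False):
    """Walk the decision tree for one parsed REPORT line (hypothetical helper).

    `report` holds duration_ms, memory_size_mb, max_memory_used_mb and,
    for cold starts, init_duration_ms. Thresholds are illustrative only.
    """
    if report.get("init_duration_ms", 0) > 1000:
        return "cold start: SnapStart, provisioned concurrency, or smaller package"
    if report["max_memory_used_mb"] >= 0.9 * report["memory_size_mb"]:
        return "memory pressure: increase memory (CPU scales with it)"
    near_timeout = report["duration_ms"] >= 0.95 * timeout_ms
    if near_timeout and every_invocation:
        return "possible infinite loop or recursive trigger"
    if in_vpc:
        return "check NAT gateway, VPC endpoints, and Network ACLs"
    if not every_invocation:
        return "downstream latency: add client-side timeouts"
    return "increase function timeout; check payload size"


# Hypothetical metrics from a slow first invocation.
print(diagnose(
    {"duration_ms": 2900, "memory_size_mb": 512,
     "max_memory_used_mb": 140, "init_duration_ms": 2400},
    timeout_ms=3000,
))
```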

Step-by-Step Guide

1. Check CloudWatch REPORT logs for diagnosis

Every Lambda invocation produces a REPORT line with key metrics. This is your starting point. [src2, src3]

# CloudWatch Logs Insights — find recent timeouts
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 20

# Analyze cold start frequency and duration
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart, max(@initDuration) as maxColdStart,
        count(*) as coldStartCount
| sort coldStartCount desc

Verify: REPORT line shows Duration, Billed Duration, Memory Size, Max Memory Used, and Init Duration (if cold start).
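When REPORT lines are exported or tailed outside Logs Insights, the fields listed above can be extracted with a regular expression. The parser below is a minimal sketch; the function name and the snake_case keys are hypothetical, and the sample line uses made-up values in the usual tab-separated REPORT layout.

```python
import re

FIELD_PATTERN = re.compile(
    r"(Duration|Billed Duration|Memory Size|Max Memory Used|"
    r"Init Duration|Restore Duration): ([\d.]+) (ms|MB)"
)


def parse_report(line):
    """Return REPORT metrics keyed by snake_case field name, values as floats."""
    if not line.startswith("REPORT"):
        raise ValueError("not a REPORT line")
    metrics = {}
    for name, value, unit in FIELD_PATTERN.findall(line):
        key = name.lower().replace(" ", "_") + ("_ms" if unit == "ms" else "_mb")
        metrics[key] = float(value)
    return metrics


# Hypothetical REPORT line for a cold start that is close to its memory limit.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 2874.32 ms\tBilled Duration: 2875 ms\t"
    "Memory Size: 512 MB\tMax Memory Used: 498 MB\tInit Duration: 1312.45 ms"
)
report = parse_report(sample)
print(report)
```

A missing `init_duration_ms` key means the invocation was warm.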

2. Increase memory to get more CPU

Lambda allocates CPU proportionally to memory. Below 1769 MB, you get fractional CPU. This is the single most impactful tuning knob. [src1, src3]

# Set memory to 1769 MB (1 full vCPU) and timeout to 30 seconds
aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 1769 \
  --timeout 30

Verify: aws lambda get-function-configuration --function-name my-function --query '{Memory: MemorySize, Timeout: Timeout}'

3. Add explicit timeouts to downstream calls

Never rely on the Lambda timeout as your only safety net. Set client-side timeouts on every external call. [src2]

# Python — explicit timeouts on AWS SDK and HTTP calls
import boto3
from botocore.config import Config

boto_config = Config(
    connect_timeout=5,    # 5s to establish connection
    read_timeout=10,      # 10s to read response
    retries={'max_attempts': 2}
)
dynamodb = boto3.client('dynamodb', config=boto_config)
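As a rough sanity check on settings like these, it can help to compare the worst-case duration of one client call with the function timeout. The sketch below assumes that botocore's `max_attempts` counts retries after the initial request; it ignores backoff sleeps between attempts and the fact that `read_timeout` applies per socket read, so treat the result as an estimate. Both helper names are hypothetical.

```python
def worst_case_call_seconds(connect_timeout, read_timeout, max_retries):
    """Rough upper bound for one client call, ignoring backoff between attempts."""
    total_attempts = max_retries + 1  # assumption: initial request plus retries
    return total_attempts * (connect_timeout + read_timeout)


def fits_function_timeout(function_timeout_s, call_bounds_s, reserve_s=1.0):
    """True if sequential calls still leave reserve_s for the handler itself."""
    return sum(call_bounds_s) + reserve_s <= function_timeout_s


bound = worst_case_call_seconds(connect_timeout=5, read_timeout=10, max_retries=2)
print(bound)
print(fits_function_timeout(30, [bound]))
```

Under these assumptions a single call can outlast a 30-second function timeout, in which case either the client timeouts or the retry count would need to come down.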

4. Fix VPC internet access (if VPC-attached)

VPC-attached functions have no internet access by default. All outbound traffic goes through the VPC. [src5]

# SAM template: Lambda attached to a private subnet (that subnet's route table must send 0.0.0.0/0 to a NAT gateway)
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769
      Timeout: 30
      VpcConfig:
        SecurityGroupIds:
          - !Ref LambdaSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet1

5. Enable SnapStart for Java, Python, or .NET

SnapStart takes a microVM snapshot after INIT, reducing cold starts from seconds to sub-second. [src4]

# Enable SnapStart and publish a version
aws lambda update-function-configuration \
  --function-name my-function \
  --snap-start ApplyOn=PublishedVersions

aws lambda publish-version --function-name my-function

Verify: Check CloudWatch for Restore Duration instead of Init Duration — restore should be <200ms.
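One consequence of snapshotting after INIT is the uniqueness trap described under Common Pitfalls: module-level values are captured once and shared by every restored environment. The sketch below illustrates the difference locally, without SnapStart itself; the handler and variable names are hypothetical.

```python
import uuid

# Anti-pattern under SnapStart: this value would be captured in the snapshot
# and shared by every execution environment restored from it.
INIT_TIME_ID = uuid.uuid4().hex


def handler(event, context):
    # Generated per invocation, so it stays unique after a snapshot restore.
    request_id = uuid.uuid4().hex
    return {"init_id": INIT_TIME_ID, "request_id": request_id}


first = handler({}, None)
second = handler({}, None)
print(first["init_id"] == second["init_id"])        # module-level value is reused
print(first["request_id"] == second["request_id"])  # per-invocation values differ
```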

6. Configure provisioned concurrency for zero cold starts

Provisioned concurrency keeps environments pre-warmed. Most reliable but most expensive approach. [src3, src6]

# Set 10 provisioned concurrent executions
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier live \
  --provisioned-concurrent-executions 10

Verify: Status should be READY. REPORT log should show no Init Duration.
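Because provisioned environments are billed while idle, a quick estimate is worth doing before choosing a value. The sketch below uses the approximate $0.015 per GB-hour rate quoted under Important Caveats and assumes a 730-hour month; actual prices vary by Region and architecture, and request and duration charges are not included. The function name is hypothetical.

```python
HOURS_PER_MONTH = 730
PRICE_PER_GB_HOUR = 0.015  # approximate rate quoted in this article; varies by Region


def monthly_provisioned_cost(environments, memory_mb, price=PRICE_PER_GB_HOUR):
    """Idle cost per month of keeping `environments` warm at `memory_mb` each."""
    provisioned_gb = environments * memory_mb / 1024
    return provisioned_gb * price * HOURS_PER_MONTH


cost = monthly_provisioned_cost(environments=10, memory_mb=512)
print(f"${cost:.2f} per month")
```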

Code Examples

Python: Lambda handler with cold start optimization

# Input:  API Gateway event
# Output: JSON response with downstream data

import json, os, boto3
from botocore.config import Config

# INIT phase: runs once per cold start, persists across warm invocations
boto_config = Config(connect_timeout=5, read_timeout=10, retries={'max_attempts': 2})
dynamodb = boto3.resource('dynamodb', config=boto_config)
table = dynamodb.Table(os.environ['TABLE_NAME'])

def handler(event, context):
    remaining_ms = context.get_remaining_time_in_millis()
    if remaining_ms < 5000:
        return {'statusCode': 503, 'body': json.dumps({'error': 'Insufficient time'})}
    try:
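        # pathParameters is null (None) when the route has no path variables
        item_id = (event.get('pathParameters') or {}).get('id')
        if not item_id:
            return {'statusCode': 400, 'body': json.dumps({'error': 'Missing id'})}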
        response = table.get_item(Key={'id': item_id})
        item = response.get('Item')
        if not item:
            return {'statusCode': 404, 'body': json.dumps({'error': 'Not found'})}
        return {'statusCode': 200, 'body': json.dumps(item, default=str)}
    except Exception as e:
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}

Node.js: Optimized handler with modular imports

// Input:  API Gateway event
// Output: JSON response

const { DynamoDBClient, GetItemCommand } = require('@aws-sdk/client-dynamodb');
const { unmarshall } = require('@aws-sdk/util-dynamodb');

const client = new DynamoDBClient({
  requestHandler: { connectionTimeout: 5000, socketTimeout: 10000 },
  maxAttempts: 2
});

exports.handler = async (event, context) => {
  if (context.getRemainingTimeInMillis() < 5000) {
    return { statusCode: 503, body: JSON.stringify({ error: 'Insufficient time' }) };
  }
  const id = event.pathParameters?.id;
  if (!id) return { statusCode: 400, body: JSON.stringify({ error: 'Missing id' }) };
  const { Item } = await client.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { id: { S: id } }
  }));
  if (!Item) return { statusCode: 404, body: JSON.stringify({ error: 'Not found' }) };
  return { statusCode: 200, body: JSON.stringify(unmarshall(Item)) };
};

Java: SnapStart-optimized handler with CRaC hooks

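// Input:  API Gateway proxy request
// Output: API Gateway proxy response
// Requires: Java 11+ runtime with SnapStart enabled

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import org.crac.Core;
import org.crac.Resource;

public class Handler implements RequestHandler<APIGatewayProxyRequestEvent,
        APIGatewayProxyResponseEvent>, Resource {

    private final DynamoDbClient dynamodb = DynamoDbClient.create();

    public Handler() {
        Core.getGlobalContext().register(this);  // Register CRaC hooks
    }

    @Override
    public APIGatewayProxyResponseEvent handleRequest(
            APIGatewayProxyRequestEvent event, Context context) {
        // Placeholder: business logic goes here
        return new APIGatewayProxyResponseEvent().withStatusCode(200);
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> ctx) {
        dynamodb.describeEndpoints();  // Pre-warm connection before snapshot
    }

    @Override
    public void afterRestore(org.crac.Context<? extends Resource> ctx) {
        // Re-validate connections after restore
    }
}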

Anti-Patterns

Wrong: Using default 3-second timeout in production

# ❌ BAD — 3-second default is almost never enough [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      # No Timeout — defaults to 3 seconds; cold start + any call > 3s = failure

Correct: Set appropriate timeout with margin

# ✅ GOOD — explicit timeout with memory tuning [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769    # 1 full vCPU
      Timeout: 30         # 30s for API backends

Wrong: Importing entire AWS SDK

// ❌ BAD — imports entire SDK (~60MB), dramatically increases cold start [src7, src8]
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();
// Init Duration: 800-1500ms due to massive import

Correct: Import only needed clients

// ✅ GOOD — modular imports, minimal cold start impact [src7, src8]
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
// Init Duration: 150-300ms — only loads what's needed

Wrong: Creating SDK clients inside the handler

# ❌ BAD — new client every invocation, wasting warm-start reuse [src3, src8]
def handler(event, context):
    dynamodb = boto3.resource('dynamodb')  # NEW client every time
    table = dynamodb.Table('my-table')
    return table.get_item(Key={'id': event['id']})

Correct: Initialize clients outside the handler

# ✅ GOOD — client created once in INIT, reused across warm invocations [src3, src8]
import boto3
dynamodb = boto3.resource('dynamodb')      # Created once during INIT
table = dynamodb.Table('my-table')

def handler(event, context):
    return table.get_item(Key={'id': event['id']})  # Reuses warm connection

Wrong: VPC Lambda without NAT for external calls

# ❌ BAD — VPC function with no internet path silently times out [src5]
VpcConfig:
  SecurityGroupIds: [!Ref SG]
  SubnetIds: [!Ref PrivateSubnet]
# No NAT gateway — ALL external HTTP calls will ETIMEDOUT

Correct: Route VPC traffic through NAT or VPC endpoints

# ✅ GOOD — private subnet routes to NAT for internet; VPC endpoints for AWS [src5]
NATGateway:
  Type: AWS::EC2::NatGateway
  Properties:
    SubnetId: !Ref PublicSubnet
    AllocationId: !GetAtt EIP.AllocationId
PrivateRoute:
  Type: AWS::EC2::Route
  Properties:
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NATGateway

Common Pitfalls

Default 3-second timeout: The Lambda console defaults to 3 seconds. Set to at least 10–30s for API backends and 120–900s for data processing. [src1]
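Memory = CPU misconception: Developers often raise the timeout when the real problem is CPU starvation. At 128 MB the function gets only a small fraction of a vCPU. Raising memory to 1769 MB gives 1 full vCPU and can make CPU-bound functions many times faster; because billed duration shrinks as well, the cost increase is often much smaller than the memory increase. [src1, src3]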
VPC internet access: VPC-attached functions lose internet unless you configure a NAT gateway. The error is a silent ETIMEDOUT, not a clear networking error. [src5]
SnapStart uniqueness trap: SnapStart restores from snapshot. UUIDs, random seeds, or encryption keys generated during INIT will be identical across all restored instances. Generate these in the handler. [src4]
Recursive invocation loops: An S3 trigger writing to the same bucket creates an infinite loop. Always write to a different resource or use key prefixes to filter. [src2]
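Provisioned concurrency + SnapStart conflict: These cannot be combined. SnapStart is free for Java (Python and .NET incur caching and restore charges); provisioned concurrency costs roughly $55/month per 10 environments at 512 MB (5 GB at ~$0.015/GB-hour for about 730 hours), before request and duration charges. [src4]
API Gateway 29-second limit: Even with a 900s Lambda timeout, API Gateway's default integration timeout is 29 seconds. It can be raised only for Regional and private REST APIs through a quota increase; HTTP APIs are capped at 30 seconds. Use async invocation or Step Functions for long tasks. [src1]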
Cold start burst after deployment: Every code deploy invalidates warm environments. Deploy during low traffic or use traffic shifting with aliases. [src3, src6]
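The key-prefix guard mentioned for recursive invocation loops can be sketched as follows. The `processed/` prefix and both helper names are hypothetical; the event shape follows the standard S3 notification layout (`Records[].s3.object.key`). Filtering the trigger itself by prefix in the bucket notification configuration is still the first line of defence, and this check is only a backstop.

```python
OUTPUT_PREFIX = "processed/"  # hypothetical prefix reserved for this function's output


def keys_to_process(event):
    """Return object keys from an S3 event, skipping this function's own output."""
    keys = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if key.startswith(OUTPUT_PREFIX):
            continue  # written by this function; processing it again would loop
        keys.append(key)
    return keys


def output_key(input_key):
    """Destination key for a processed object, always under the reserved prefix."""
    return OUTPUT_PREFIX + input_key


event = {"Records": [
    {"s3": {"object": {"key": "uploads/a.csv"}}},
    {"s3": {"object": {"key": output_key("uploads/a.csv")}}},
]}
print(keys_to_process(event))
```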

Diagnostic Commands

# Check current function configuration
aws lambda get-function-configuration --function-name my-function \
  --query '{Memory: MemorySize, Timeout: Timeout, Runtime: Runtime, VPC: VpcConfig.SubnetIds, SnapStart: SnapStart}'

# Find recent timeouts (CloudWatch Logs Insights)
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc | limit 50

# Analyze cold start frequency
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats count(*) as coldStarts, avg(@initDuration) as avgInitMs,
        max(@initDuration) as maxInitMs, pct(@initDuration, 99) as p99InitMs
  by bin(1h)

# Check memory utilization
fields @maxMemoryUsed, @memorySize, @duration
| stats avg(@maxMemoryUsed) as avgMemUsed, max(@maxMemoryUsed) as maxMemUsed

# Check provisioned concurrency status
aws lambda get-provisioned-concurrency-config \
  --function-name my-function --qualifier live

# Check VPC route table for NAT gateway
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-xxxxx" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'

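# Test invocation (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke --function-name my-function \
  --cli-binary-format raw-in-base64-out \
  --payload '{"test": true}' --cli-read-timeout 60 response.json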
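The route-table check can also be scripted. The sketch below inspects the JSON that `aws ec2 describe-route-tables` returns when run without `--query` and reports whether any default route points at a NAT gateway. The function name and the sample IDs are hypothetical; the field names (`RouteTables`, `Routes`, `DestinationCidrBlock`, `NatGatewayId`) follow the EC2 API response shape.

```python
def has_nat_default_route(describe_route_tables_output):
    """True if any route table sends 0.0.0.0/0 to a NAT gateway."""
    for table in describe_route_tables_output.get("RouteTables", []):
        for route in table.get("Routes", []):
            is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
            nat_id = route.get("NatGatewayId") or ""
            if is_default and nat_id.startswith("nat-"):
                return True
    return False


# Hypothetical output for a private subnet that is routed correctly.
private_ok = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
]}]}

# Hypothetical output for a subnet with only the local route.
no_egress = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
]}]}

print(has_nat_default_route(private_ok), has_nat_default_route(no_egress))
```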

Version History & Compatibility

Feature	Available Since	Notes
Lambda timeout (max 15 min)	2018	Increased from 5 min; all runtimes [src1]
VPC Hyperplane ENI	2019	Eliminated ~10s VPC cold start penalty [src5]
Provisioned concurrency	Dec 2019	All runtimes; eliminates cold starts [src3]
ARM64/Graviton support	2021	20% faster cold starts vs x86; lower cost [src6]
SnapStart for Java	Nov 2022	Java 11+ Corretto; free [src4]
INIT phase logging (INIT_REPORT)	Nov 2023	Explicit Init/Restore phase error reporting [src3]
SnapStart for Python	Nov 2024	Python 3.12+; caching/restore charges apply [src4]
SnapStart for .NET	Nov 2024	.NET 8+; requires Annotations v1.6.0+ [src4]
Lambda Managed Instances	2025 preview	Multi-concurrent execution on EC2-class instances
Lambda Durable Functions	Late 2025 (GA)	Stateful workflows up to 1 year via checkpoint/replay; hibernation pauses billing while awaiting external signals (~20-30% cost reduction for human-in-the-loop workflows) [src9]
Amazon Linux 2 runtime EOL	June 30, 2026	Migrate to Amazon Linux 2023-based runtimes (Node.js 20+, Python 3.12+, Java 21); AL2-based Java 8/11/17 deprecate without migration

Decision Logic

If `Init Duration` is present and >1000ms in REPORT log

--> Cold start is the bottleneck. If runtime is Java/Python 3.12+/.NET 8+, enable SnapStart (free for Java, charged for Python/.NET). [src3, src4]

If runtime is Node.js/Go/Ruby/Rust and cold starts are unacceptable

--> SnapStart is not available as of May 2026. Use provisioned concurrency, move to ARM64 Graviton (15-20% faster cold start), or trim the deployment package. [src4, src6, src7]

If `Max Memory Used` is close to `Memory Size` in REPORT log

--> Increase memory to at least 1769 MB (1 full vCPU). CPU scales with memory; this is the single highest-leverage tuning knob. [src1, src3]

If function is VPC-attached and external HTTPS calls hang

--> Add a NAT gateway in a public subnet and route the private subnet's 0.0.0.0/0 through it. For AWS-only calls (DynamoDB, S3), use VPC endpoints instead to avoid NAT cost. [src5]

If workload exceeds 15 minutes

--> Use Lambda Durable Functions (late 2025+) for stateful workflows up to 1 year, OR Step Functions for orchestration, OR ECS/Fargate for long-running compute. Do not try to chain regular Lambda invocations manually. [src1, src9]

If `Task timed out after 3.00 seconds` appears

--> Default timeout is too low. Set 30s for API backends, 300-900s for data processing. Always give downstream calls client-side timeouts that are shorter than the function timeout, so a slow call fails first and can be handled before the Lambda timeout fires. [src1, src2]

If timeouts are intermittent and only under load

--> Check DNS resolution limits (max 20 concurrent TCP connections for DNS) and connection pool exhaustion on downstream databases. Lower per-invocation parallelism or pool connections via RDS Proxy. [src5]

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Timeout <15 min and stateless workload	Processing >15 min (single invocation)	Lambda Durable Functions (stateful, ≤1 year), Step Functions, or ECS/Fargate
Cold start <2s is acceptable	Hard real-time <10ms requirement	EC2, ECS, or always-on containers
Traffic is spiky or unpredictable	Steady >1000 req/s sustained	ECS/Fargate with ALB (cheaper at scale)
SnapStart available for your runtime	Need zero cold starts guaranteed	Provisioned concurrency or containers
API Gateway integration <29s	Long-running API response >29s	Async invocation + polling, or WebSocket API

Important Caveats

Cold starts affect <1% of invocations in steady-state but can be 100% after a deployment or traffic spike from zero. Monitor Init Duration in CloudWatch, not just averages. [src3, src7]
Memory cost trade-off: Going from 128 MB to 1769 MB is roughly 14x the memory, but that does not always mean 14x the cost, because faster execution shortens billed duration. Use AWS Lambda Power Tuning to find the cost-optimal memory setting. [src1]
SnapStart snapshot uniqueness: Initialization code runs once, then the snapshot is reused. Random values or credentials generated during INIT are identical across restored instances. Generate in the handler. [src4]
Provisioned concurrency costs even when idle: ~$0.015/GB-hour. Use Application Auto Scaling to adjust based on traffic patterns. [src3]
ARM64 (Graviton) gives free performance: Up to 20% better price-performance with faster cold starts. No code changes needed for interpreted runtimes. [src6]
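The memory cost trade-off above comes down to simple arithmetic: cost per invocation is memory (GB) times billed duration (seconds) times the per-GB-second rate. The sketch below assumes an on-demand rate of $0.0000166667 per GB-second, which should be checked against current pricing for your Region and architecture; the function names and the example durations are hypothetical.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed on-demand rate; check current pricing


def invocation_cost(memory_mb, billed_ms, price=PRICE_PER_GB_SECOND):
    """Duration cost of one invocation (request charge not included)."""
    return (memory_mb / 1024) * (billed_ms / 1000) * price


def break_even_speedup(old_mb, new_mb):
    """How many times faster the function must run for duration cost to stay flat."""
    return new_mb / old_mb


print(break_even_speedup(128, 1769))

# Hypothetical durations: 2000 ms at 128 MB versus 200 ms at 1769 MB.
small = invocation_cost(128, 2000)
large = invocation_cost(1769, 200)
print(f"{large / small:.2f}x the cost for a 10x speedup")
```

With these hypothetical numbers a 10x speedup costs somewhat more per invocation, well short of 14x, which is why measuring with a tool such as AWS Lambda Power Tuning is more reliable than guessing.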