How Do I Fix AWS Lambda Timeouts and Cold Start Issues?

Type: Software Reference · Confidence: 0.93 · Sources: 8 · Verified: 2026-02-20 · Freshness: evolving

Quick Reference

| # | Cause | Likelihood | Signature | Fix |
|---|-------|------------|-----------|-----|
| 1 | Insufficient memory/CPU | ~30% | Duration near timeout; Max Memory Used near Memory Size | Increase memory to 1769 MB (1 vCPU) [src1, src3] |
| 2 | Downstream service timeout | ~25% | Task timed out after X.XX seconds; no error log before timeout | Add explicit connection/read timeouts (5s/10s) to HTTP clients [src2] |
| 3 | VPC without NAT gateway | ~20% | ETIMEDOUT or Task timed out on external HTTP calls | Add NAT gateway in a public subnet; route the private subnet through it [src5] |
| 4 | Cold start during INIT | ~10% | Init Duration: NNNNms in REPORT log; first invocation slow | SnapStart, provisioned concurrency, or reduce package size [src3, src4] |
| 5 | Large deployment package | ~5% | High Init Duration (>1s); large ZIP artifact | Tree-shake dependencies; use Lambda Layers [src6, src7] |
| 6 | Recursive/infinite loop | ~5% | Function always times out at the exact timeout limit | Check for S3 triggers writing to the same bucket, SQS re-queue loops [src2] |
| 7 | Payload too large | ~3% | Timeout with large event payloads | Batch smaller; use S3 pre-signed URLs for large data [src2] |
| 8 | Default 3s timeout | ~2% | Task timed out after 3.00 seconds | Set 30–60s for APIs, 300s for data processing [src1] |
| 9 | Network ACL blocking ephemeral ports | Rare | Intermittent ETIMEDOUT in VPC functions | Allow TCP/UDP ports 1024–65535 in subnet Network ACLs [src5] |
| 10 | DNS resolution limit exceeded | Rare | UnknownHostException under high concurrency | Reduce concurrent DNS lookups; max 20 TCP DNS connections [src5] |

Decision Tree

START — Lambda function timing out or slow
├── Check REPORT log in CloudWatch
│   ├── "Init Duration" present and > 1000ms?
│   │   ├── YES → COLD START ISSUE
│   │   │   ├── Runtime is Java/Python/.NET? → Enable SnapStart [src4]
│   │   │   ├── Package > 50 MB? → Tree-shake, use Layers [src6, src7]
│   │   │   ├── Need guaranteed <100ms start? → Provisioned Concurrency [src3]
│   │   │   └── Heavy SDK imports? → Lazy-load, import only needed clients [src8]
│   │   └── NO → RUNTIME TIMEOUT ISSUE ↓
│   │
│   ├── "Max Memory Used" close to "Memory Size"?
│   │   ├── YES → Increase memory (doubles CPU too) [src1, src3]
│   │   └── NO ↓
│   │
│   ├── Duration close to timeout every time?
│   │   ├── YES → Likely infinite loop or recursive trigger [src2]
│   │   └── NO ↓
│   │
│   ├── Function in VPC?
│   │   ├── YES → Check NAT gateway exists for outbound internet [src5]
│   │   │   ├── No NAT → Add NAT gateway + route table
│   │   │   ├── AWS services only? → Use VPC endpoints [src5]
│   │   │   └── Intermittent? → Check Network ACL ephemeral ports [src5]
│   │   └── NO ↓
│   │
│   └── Timeout only on some invocations?
│       ├── YES → Downstream service latency → add client-side timeouts [src2]
│       └── NO → Increase function timeout; check payload size [src1]
│
└── No REPORT log at all?
    └── Check execution role has AWSLambdaBasicExecutionRole [src2]
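The tree above can be sketched as a small triage helper. This is a heuristic sketch, not an official diagnostic: the 1000 ms init cutoff and the 90% memory ratio are the rough thresholds from the tree, and the `report` dict is assumed to hold metrics parsed from the REPORT log line.

```python
def diagnose(report: dict, in_vpc: bool = False) -> str:
    """Rough triage following the decision tree; `report` holds REPORT metrics."""
    if report.get('init_ms', 0) > 1000:
        return 'cold start: enable SnapStart, trim the package, or use provisioned concurrency'
    if report.get('max_memory_used_mb', 0) >= 0.9 * report.get('memory_size_mb', float('inf')):
        return 'memory-bound: increase memory (CPU scales with it)'
    if in_vpc:
        return 'check NAT gateway, VPC endpoints, and Network ACL ephemeral ports'
    return 'add client-side timeouts and inspect downstream latency'
```

Feed it a few days of REPORT metrics to see which branch dominates before tuning anything.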

Step-by-Step Guide

1. Check CloudWatch REPORT logs for diagnosis

Every Lambda invocation produces a REPORT line with key metrics. This is your starting point. [src2, src3]

# CloudWatch Logs Insights — find recent timeouts
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc
| limit 20

# Analyze cold start frequency and duration
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats avg(@initDuration) as avgColdStart, max(@initDuration) as maxColdStart,
        count(*) as coldStartCount
| sort coldStartCount desc

Verify: REPORT line shows Duration, Billed Duration, Memory Size, Max Memory Used, and Init Duration (if cold start).
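If you pull REPORT lines out of CloudWatch programmatically, a small parser makes the metrics usable. A sketch — the field names in the returned dict are my own, and the lookbehinds distinguish the plain `Duration` field from `Billed`, `Init`, and `Restore` durations:

```python
import re

def parse_report(line: str) -> dict:
    """Pull key metrics out of a Lambda REPORT log line (values as floats, ms/MB)."""
    patterns = {
        'duration_ms': r'(?<!Billed )(?<!Init )(?<!Restore )Duration: ([\d.]+) ms',
        'billed_ms': r'Billed Duration: ([\d.]+) ms',
        'memory_size_mb': r'Memory Size: (\d+) MB',
        'max_memory_used_mb': r'Max Memory Used: (\d+) MB',
        'init_ms': r'Init Duration: ([\d.]+) ms',
    }
    metrics = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, line)
        if match:
            metrics[name] = float(match.group(1))
    return metrics
```

Lines without `Init Duration` were warm invocations, so the fraction of parsed lines with `init_ms` present is your cold-start rate.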

2. Increase memory to get more CPU

Lambda allocates CPU proportionally to memory. Below 1769 MB, you get fractional CPU. This is the single most impactful tuning knob. [src1, src3]

# Set memory to 1769 MB (1 full vCPU) and timeout to 30 seconds
aws lambda update-function-configuration \
  --function-name my-function \
  --memory-size 1769 \
  --timeout 30

Verify: aws lambda get-function-configuration --function-name my-function --query '{Memory: MemorySize, Timeout: Timeout}'
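Because CPU scales linearly with memory, you can estimate what a memory setting buys before testing it. A sketch based on the 1769 MB ≈ 1 vCPU ratio above; the duration estimate assumes a perfectly CPU-bound workload, which is an idealization:

```python
def approx_vcpus(memory_mb: int) -> float:
    """CPU scales linearly with memory; 1769 MB corresponds to one full vCPU."""
    return memory_mb / 1769

def estimated_duration_s(base_duration_s: float, memory_mb: int) -> float:
    """Ideal-case duration of a CPU-bound task that takes base_duration_s at 1 vCPU."""
    return base_duration_s / approx_vcpus(memory_mb)
```

Since billing is memory × duration, doubling memory on a CPU-bound function can roughly halve duration and leave the cost nearly unchanged, which is why over-provisioning memory is often free speed.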

3. Add explicit timeouts to downstream calls

Never rely on the Lambda timeout as your only safety net. Set client-side timeouts on every external call. [src2]

# Python — explicit timeouts on AWS SDK and HTTP calls
import boto3
from botocore.config import Config

boto_config = Config(
    connect_timeout=5,    # 5s to establish connection
    read_timeout=10,      # 10s to read response
    retries={'max_attempts': 2}
)
dynamodb = boto3.client('dynamodb', config=boto_config)
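A related pattern (a sketch, not from the sources): derive each downstream call's timeout from the time the function has left, so a slow dependency can never consume the entire Lambda budget. The margin and cap values here are assumptions to tune.

```python
def call_budget_s(remaining_ms: int, safety_margin_ms: int = 1000,
                  max_ms: int = 10_000) -> float:
    """Timeout for the next downstream call, in seconds.

    Leaves safety_margin_ms to build an error response, and never exceeds
    max_ms even when plenty of time remains.
    """
    budget = min(max_ms, remaining_ms - safety_margin_ms)
    return max(0, budget) / 1000.0

# Inside a handler:
#   timeout_s = call_budget_s(context.get_remaining_time_in_millis())
#   then pass timeout_s to your HTTP client for this call
```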

4. Fix VPC internet access (if VPC-attached)

VPC-attached functions have no internet access by default: all outbound traffic is routed through the VPC's subnets, so calls to external endpoints hang until the function times out unless a NAT gateway or VPC endpoint provides a path. [src5]

# SAM template — Lambda in VPC with NAT gateway
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769
      Timeout: 30
      VpcConfig:
        SecurityGroupIds:
          - !Ref LambdaSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet1

5. Enable SnapStart for Java, Python, or .NET

SnapStart snapshots the initialized execution environment (a Firecracker microVM snapshot taken after INIT) and resumes new environments from that snapshot, reducing cold starts from seconds to sub-second. [src4]

# Enable SnapStart and publish a version
aws lambda update-function-configuration \
  --function-name my-function \
  --snap-start ApplyOn=PublishedVersions

aws lambda publish-version --function-name my-function

Verify: Check CloudWatch for Restore Duration instead of Init Duration — restore should be <200ms.

6. Configure provisioned concurrency for zero cold starts

Provisioned concurrency keeps environments pre-warmed. Most reliable but most expensive approach. [src3, src6]

# Set 10 provisioned concurrent executions
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier live \
  --provisioned-concurrent-executions 10

Verify: Status should be READY. REPORT log should show no Init Duration.
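To size the provisioned count, Little's law gives a starting point: concurrent executions ≈ request rate × average duration. A sketch — the 20% headroom default is an assumption, not an AWS recommendation:

```python
import math

def required_concurrency(requests_per_second: float, avg_duration_s: float,
                         headroom: float = 1.2) -> int:
    """Little's law estimate of steady-state concurrent executions, with headroom."""
    return math.ceil(requests_per_second * avg_duration_s * headroom)
```

For example, 100 req/s at a 500 ms average duration needs about 50 concurrent executions, so provisioning 60 leaves room for bursts; anything above that spills over to regular on-demand cold starts.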

Code Examples

Python: Lambda handler with cold start optimization

# Input:  API Gateway event
# Output: JSON response with downstream data

import json, os, boto3
from botocore.config import Config

# INIT phase: runs once per cold start, persists across warm invocations
boto_config = Config(connect_timeout=5, read_timeout=10, retries={'max_attempts': 2})
dynamodb = boto3.resource('dynamodb', config=boto_config)
table = dynamodb.Table(os.environ['TABLE_NAME'])

def handler(event, context):
    remaining_ms = context.get_remaining_time_in_millis()
    if remaining_ms < 5000:
        return {'statusCode': 503, 'body': json.dumps({'error': 'Insufficient time'})}
    try:
        item_id = (event.get('pathParameters') or {}).get('id', '')  # API Gateway may send pathParameters as null
        response = table.get_item(Key={'id': item_id})
        item = response.get('Item')
        if not item:
            return {'statusCode': 404, 'body': json.dumps({'error': 'Not found'})}
        return {'statusCode': 200, 'body': json.dumps(item, default=str)}
    except Exception as e:
        return {'statusCode': 500, 'body': json.dumps({'error': str(e)})}

Node.js: Optimized handler with modular imports

// Input:  API Gateway event
// Output: JSON response

const { DynamoDBClient, GetItemCommand } = require('@aws-sdk/client-dynamodb');
const { unmarshall } = require('@aws-sdk/util-dynamodb');
const { NodeHttpHandler } = require('@smithy/node-http-handler');

const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({ connectionTimeout: 5000, requestTimeout: 10000 }),
  maxAttempts: 2
});

exports.handler = async (event, context) => {
  if (context.getRemainingTimeInMillis() < 5000) {
    return { statusCode: 503, body: JSON.stringify({ error: 'Insufficient time' }) };
  }
  const id = event.pathParameters?.id;
  if (!id) return { statusCode: 400, body: JSON.stringify({ error: 'Missing id' }) };
  const { Item } = await client.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { id: { S: id } }
  }));
  if (!Item) return { statusCode: 404, body: JSON.stringify({ error: 'Not found' }) };
  return { statusCode: 200, body: JSON.stringify(unmarshall(Item)) };
};

Java: SnapStart-optimized handler with CRaC hooks

// Input:  API Gateway proxy request
// Output: API Gateway proxy response
// Requires: Java 11+ runtime with SnapStart enabled

import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import org.crac.Core;
import org.crac.Resource;

public class Handler implements RequestHandler<APIGatewayProxyRequestEvent,
        APIGatewayProxyResponseEvent>, Resource {

    private final DynamoDbClient dynamodb = DynamoDbClient.create();

    public Handler() {
        Core.getGlobalContext().register(this);  // Register CRaC hooks
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> ctx) {
        dynamodb.describeEndpoints();  // Pre-warm connection before snapshot
    }

    @Override
    public void afterRestore(org.crac.Context<? extends Resource> ctx) {
        // Re-validate connections after restore
    }

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event,
            com.amazonaws.services.lambda.runtime.Context context) {
        // Business logic runs against the pre-warmed client
        return new APIGatewayProxyResponseEvent().withStatusCode(200);
    }
}

Anti-Patterns

Wrong: Using default 3-second timeout in production

# ❌ BAD — 3-second default is almost never enough [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      # No Timeout — defaults to 3 seconds; cold start + any call > 3s = failure

Correct: Set appropriate timeout with margin

# ✅ GOOD — explicit timeout with memory tuning [src1]
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      MemorySize: 1769    # 1 full vCPU
      Timeout: 30         # 30s for API backends

Wrong: Importing entire AWS SDK

// ❌ BAD — imports entire SDK (~60MB), dramatically increases cold start [src7, src8]
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();
// Init Duration: 800-1500ms due to massive import

Correct: Import only needed clients

// ✅ GOOD — modular imports, minimal cold start impact [src7, src8]
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
// Init Duration: 150-300ms — only loads what's needed
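You can measure the cost of heavy imports locally before deploying. A sketch — results vary by machine and are only a proxy for Lambda INIT cost, and the evicted-module trick forces a real import rather than a cache hit:

```python
import importlib
import sys
import time

def timed_import(module_name: str) -> float:
    """Import a module fresh and return the wall-clock seconds it took."""
    sys.modules.pop(module_name, None)  # evict so the import actually runs
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start
```

Comparing, say, `timed_import('boto3')` against a single-client import on your build machine gives a rough preview of the Init Duration difference you will see in REPORT logs.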

Wrong: Creating SDK clients inside the handler

# ❌ BAD — new client every invocation, wasting warm-start reuse [src3, src8]
def handler(event, context):
    dynamodb = boto3.resource('dynamodb')  # NEW client every time
    table = dynamodb.Table('my-table')
    return table.get_item(Key={'id': event['id']})

Correct: Initialize clients outside the handler

# ✅ GOOD — client created once in INIT, reused across warm invocations [src3, src8]
import boto3
dynamodb = boto3.resource('dynamodb')      # Created once during INIT
table = dynamodb.Table('my-table')

def handler(event, context):
    return table.get_item(Key={'id': event['id']})  # Reuses warm connection
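When a client is only needed on some code paths, a lazy-initialization wrapper keeps INIT fast while still reusing the client across warm invocations. A sketch; the commented factory is a stand-in for a real boto3 call:

```python
def lazy_client(factory):
    """Defer client creation to first use, then cache it for warm invocations."""
    cache = {}
    def get():
        if 'client' not in cache:
            cache['client'] = factory()  # runs once, on the first invocation that needs it
        return cache['client']
    return get

# get_table = lazy_client(lambda: boto3.resource('dynamodb').Table('my-table'))
# ...inside the handler: get_table().get_item(Key={'id': item_id})
```

The trade-off: the first invocation that touches the client pays the creation cost inside the handler (billed duration) instead of during INIT, so prefer module-level creation for clients used on every invocation.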

Wrong: VPC Lambda without NAT for external calls

# ❌ BAD — VPC function with no internet path silently times out [src5]
VpcConfig:
  SecurityGroupIds: [!Ref SG]
  SubnetIds: [!Ref PrivateSubnet]
# No NAT gateway — all external HTTP calls fail with ETIMEDOUT

Correct: Route VPC traffic through NAT or VPC endpoints

# ✅ GOOD — private subnet routes to NAT for internet; VPC endpoints for AWS [src5]
NATGateway:
  Type: AWS::EC2::NatGateway
  Properties:
    SubnetId: !Ref PublicSubnet
    AllocationId: !GetAtt EIP.AllocationId
PrivateRoute:
  Type: AWS::EC2::Route
  Properties:
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NATGateway

Diagnostic Commands

# Check current function configuration
aws lambda get-function-configuration --function-name my-function \
  --query '{Memory: MemorySize, Timeout: Timeout, Runtime: Runtime, VPC: VpcConfig.SubnetIds, SnapStart: SnapStart}'

# Find recent timeouts (CloudWatch Logs Insights)
fields @timestamp, @message
| filter @message like /Task timed out/
| sort @timestamp desc | limit 50

# Analyze cold start frequency
fields @timestamp, @initDuration, @duration, @maxMemoryUsed, @memorySize
| filter ispresent(@initDuration)
| stats count(*) as coldStarts, avg(@initDuration) as avgInitMs,
        max(@initDuration) as maxInitMs, pct(@initDuration, 99) as p99InitMs
  by bin(1h)

# Check memory utilization
fields @maxMemoryUsed, @memorySize, @duration
| stats avg(@maxMemoryUsed) as avgMemUsed, max(@maxMemoryUsed) as maxMemUsed

# Check provisioned concurrency status
aws lambda get-provisioned-concurrency-config \
  --function-name my-function --qualifier live

# Check VPC route table for NAT gateway
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-xxxxx" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'

# Test invocation
aws lambda invoke --function-name my-function \
  --payload '{"test": true}' --cli-read-timeout 60 response.json

Version History & Compatibility

| Feature | Available Since | Notes |
|---------|-----------------|-------|
| Lambda timeout (max 15 min) | 2018 | Increased from 5 min; all runtimes [src1] |
| VPC Hyperplane ENI | 2019 | Eliminated ~10s VPC cold start penalty [src5] |
| Provisioned concurrency | Dec 2019 | All runtimes; eliminates cold starts [src3] |
| ARM64/Graviton support | 2021 | 20% faster cold starts vs x86; lower cost [src6] |
| SnapStart for Java | Nov 2022 | Java 11+ Corretto; free [src4] |
| INIT phase logging (INIT_REPORT) | Nov 2023 | Explicit Init/Restore phase error reporting [src3] |
| SnapStart for Python | Dec 2024 | Python 3.12+; caching/restore charges apply [src4] |
| SnapStart for .NET | Dec 2024 | .NET 8+; requires Annotations v1.6.0+ [src4] |
| Lambda Managed Instances | 2025 preview | Multi-concurrent execution on EC2-class instances |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|----------|----------------|-------------|
| Timeout <15 min and stateless workload | Processing >15 min | AWS Step Functions or ECS/Fargate |
| Cold start <2s is acceptable | Hard real-time <10ms requirement | EC2, ECS, or always-on containers |
| Traffic is spiky or unpredictable | Steady >1000 req/s sustained | ECS/Fargate with ALB (cheaper at scale) |
| SnapStart available for your runtime | Need zero cold starts guaranteed | Provisioned concurrency or containers |
| API Gateway integration <29s | Long-running API response >29s | Async invocation + polling, or WebSocket API |
