What is the difference between taskRoleArn and executionRoleArn in ECS?

The executionRoleArn is used by the ECS/Fargate agent to set up the task: pull images from ECR, write logs to CloudWatch, fetch secrets from SSM or Secrets Manager at startup. The taskRoleArn is used by your application code running inside the container to call AWS APIs like S3, DynamoDB, and SQS. Both require a trust policy allowing ecs-tasks.amazonaws.com to assume them.

What are the valid Fargate CPU and memory combinations?

Fargate has 8 CPU tiers: 256 (.25 vCPU, 512 MiB/1 GB/2 GB only), 512 (.5 vCPU, 1–4 GB), 1024 (1 vCPU, 2–8 GB in 1 GB steps), 2048 (2 vCPU, 4–16 GB), 4096 (4 vCPU, 8–30 GB), 8192 (8 vCPU, 16–60 GB in 4 GB steps), 16384 (16 vCPU, 32–120 GB in 8 GB steps), 32768 (32 vCPU, 60/120/244 GB). Invalid combinations fail with: No Fargate configuration exists for given values.

Why does my ECS task keep restarting?

Most likely causes: (1) health check startPeriod is too short — app not ready before retries exhaust; (2) essential container exiting — if any essential container stops, all containers stop; (3) OOM kill — container exceeds its memory hard limit; (4) missing secrets — executionRoleArn lacks permission to fetch from SSM or Secrets Manager, causing ResourceInitializationError at launch.

How do I pass secrets to ECS containers without storing them in plaintext?

Use the secrets array in containerDefinitions with valueFrom pointing to an SSM Parameter Store ARN or Secrets Manager ARN. The ECS agent fetches the value at task launch using the executionRoleArn and injects it as an environment variable. The plaintext value is never stored in the task definition. SSM Standard is free up to 10,000 parameters. Secrets Manager costs $0.40 per secret per month.

How do I run a DB migration before the app container starts in ECS?

Set essential: false on the migration container, define a healthCheck on it if possible, and add dependsOn with condition SUCCESS to the app container. The migration runs first, exits with code 0, and the app container starts after. Without essential: false, the task stops when the migration container exits — even on success.

What is stopTimeout in ECS and what is the maximum for Fargate?

stopTimeout is how long ECS waits after sending SIGTERM before sending SIGKILL. The default is 30 seconds. The maximum on Fargate is 120 seconds. Set it to match your app's graceful shutdown time — JVM-based services and apps with long-running requests often need more than 30 seconds.

Guide · June 12, 2026 · 10 min read

Tags: ecs-task-definition, ecs-task-definition-guide, fargate-task-definition

ECS Task Definitions: Every Field, Common Mistakes, Best Practices

The AWS docs are a reference, not a guide. This covers the 8 mistakes that break ECS deployments — wrong IAM role, invalid Fargate CPU/memory combos, health checks that restart forever, secrets that don't rotate. Each one: what fails, what the error looks like, what the fix is.

Matt S

Platform engineer · Fortem

TL;DR

01 executionRoleArn = ECS agent (ECR pull, CloudWatch, secrets fetch). taskRoleArn = your app code (S3, DynamoDB, SQS). Wrong role = AccessDenied that's hard to trace.
02 Fargate CPU/memory combos are not ranges: 256 CPU accepts only 3 memory values. Invalid combos fail at deploy time with a cryptic error.
03 Secrets are injected once at task start. Secret rotation does not update running containers. You must force a new deployment.
04 Health check startPeriod is off by default. JVM apps die before they finish booting. Set startPeriod to at least 1.5× your worst-case startup time.
05 Omitting essential on a container defaults it to true. An init/migration container that exits successfully will stop the entire task.
06 sharedMemorySize and tmpfs are not supported on Fargate per AWS docs. ML workloads needing large /dev/shm must use EC2 launch type instead.

Two IAM roles, one common mistake

ECS task definitions have two IAM roles: executionRoleArn lets the Fargate agent pull images from ECR and fetch secrets; taskRoleArn lets your app code call S3, DynamoDB, and SQS. They look similar and use the same trust principal (ecs-tasks.amazonaws.com), but they serve completely different parts of the system.

| Role | Used by | Purpose |
| --- | --- | --- |
| executionRoleArn | ECS / Fargate agent | Pull image from ECR, write logs to CloudWatch, fetch secrets from SSM / Secrets Manager at task startup |
| taskRoleArn | Your app code | Call AWS APIs from inside the container: S3 GetObject, DynamoDB Query, SQS SendMessage, etc. |

The mistake: the ECS console presents executionRoleArn prominently during task definition creation. Teams add S3 or DynamoDB permissions there. The task launches fine — but every AWS SDK call from the app returns AccessDeniedException. The task is running, logs are flowing, health check passes. The only symptom is API calls failing inside the app.

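The execution role needs AmazonECSTaskExecutionRolePolicy (covers ECR + CloudWatch) plus explicit permissions for any secrets you reference. kms:Decrypt is only needed when the secret or parameter is encrypted with a customer managed KMS key, and it only takes effect if that key's ARN is listed as a resource (the key ID below is a placeholder):

json

{
  "Effect": "Allow",
  "Action": [
    "ssm:GetParameters",
    "secretsmanager:GetSecretValue",
    "kms:Decrypt"
  ],
  "Resource": [
    "arn:aws:ssm:us-east-1:123456789012:parameter/prod/*",
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*",
    "arn:aws:kms:us-east-1:123456789012:key/<key-id>"
  ]
}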

Key insight

If your app can't reach S3 or DynamoDB, check the task role first. If the task fails to start entirely (no logs, no health checks), check the execution role. Different roles, different failure modes.
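The mistake described above can be caught with a small lint before deploying. This is a minimal sketch: the function name and the prefix list are illustrative assumptions, not an AWS-defined list, and a hit is a prompt to look rather than proof of a mistake (for example, the execution role can legitimately need S3 access for environment files).

```python
# Hypothetical lint: flag application-level actions attached to the execution role.
# APP_PREFIXES is an assumption based on the services named in this section.
APP_PREFIXES = ("s3:", "dynamodb:", "sqs:")


def misplaced_actions(execution_role_actions):
    """Return actions that look like app-level API calls and likely belong on the task role."""
    return sorted(
        action
        for action in execution_role_actions
        if action.lower().startswith(APP_PREFIXES)
    )


example_actions = [
    "ecr:GetAuthorizationToken",
    "logs:PutLogEvents",
    "ssm:GetParameters",
    "s3:GetObject",
    "dynamodb:Query",
]
suspects = misplaced_actions(example_actions)
```

Anything returned in suspects is worth moving to the taskRoleArn policy unless the agent itself needs it.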

Source: Amazon ECS task execution IAM role · Amazon ECS task IAM role

Fargate CPU and memory — the combinations that don't exist

Fargate CPU and memory are not free-form ranges: there are 8 CPU tiers, each with a fixed set of valid memory values. Anything outside this table fails at deployment with: No Fargate configuration exists for given values.

| CPU value | vCPU | Valid memory |
| --- | --- | --- |
| 256 | 0.25 | 512 MiB, 1 GB, 2 GB only (3 discrete values) |
| 512 | 0.5 | 1–4 GB (1 GB steps) |
| 1024 | 1 | 2–8 GB (1 GB steps) |
| 2048 | 2 | 4–16 GB (1 GB steps) |
| 4096 | 4 | 8–30 GB (1 GB steps) |
| 8192 | 8 | 16–60 GB (4 GB steps); Linux, platform 1.4.0+ |
| 16384 | 16 | 32–120 GB (8 GB steps); Linux, platform 1.4.0+ |
| 32768 | 32 | 60, 120, or 244 GB; Linux, platform 1.4.0+ |

Gotcha 1 — 256 CPU is not a range

The 256 CPU tier accepts exactly 512 MiB, 1024 MiB, or 2048 MiB. You cannot specify 768 MiB or any value between. If you need 1.5 GB, step up to the 512 CPU tier.

Gotcha 2 — 8192+ CPU uses non-1 GB increments

The 8192 tier (8 vCPU) uses 4 GB steps. Requesting 17 GB fails — you must choose 16 GB or 20 GB. The 16384 tier uses 8 GB steps. The 32768 tier only accepts 60, 120, or 244 GB.

Gotcha 3 — Terraform memory is in MiB, not GB

Terraform's aws_ecs_task_definition takes memory as an integer in MiB. Writing memory = 4 means 4 MiB, not 4 GB. The deployment will fail. Use memory = 4096 for 4 GB.
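The three gotchas above can be checked mechanically before a deploy. The sketch below transcribes the table from this section into a lookup keyed by CPU units, with memory in MiB as the task definition expects. The names VALID_MEMORY and is_valid_fargate_size are hypothetical, and the values are only as current as the table they were copied from.

```python
GIB = 1024  # MiB per GB, because task definitions and Terraform express memory in MiB

# Valid memory values (MiB) per CPU tier, transcribed from the table in this section.
VALID_MEMORY = {
    256: [512, 1 * GIB, 2 * GIB],
    512: list(range(1 * GIB, 4 * GIB + 1, GIB)),
    1024: list(range(2 * GIB, 8 * GIB + 1, GIB)),
    2048: list(range(4 * GIB, 16 * GIB + 1, GIB)),
    4096: list(range(8 * GIB, 30 * GIB + 1, GIB)),
    8192: list(range(16 * GIB, 60 * GIB + 1, 4 * GIB)),
    16384: list(range(32 * GIB, 120 * GIB + 1, 8 * GIB)),
    32768: [60 * GIB, 120 * GIB, 244 * GIB],
}


def is_valid_fargate_size(cpu, memory_mib):
    """True if (cpu, memory_mib) appears in the table; unknown CPU tiers are invalid."""
    return memory_mib in VALID_MEMORY.get(cpu, [])
```

For example, is_valid_fargate_size(256, 768) and is_valid_fargate_size(8192, 17 * GIB) are both rejected, as is the Terraform slip is_valid_fargate_size(1024, 4).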

Source: Troubleshoot Amazon ECS invalid CPU or memory errors · Fargate task definitions

Secrets: SSM vs Secrets Manager, and why rotation does nothing

Use the secrets array with valueFrom pointing to an SSM or Secrets Manager ARN. Never store credentials in the environment array, where values are plaintext in the task definition. The syntax difference is value vs valueFrom, and mixing them up causes a silent empty variable. Plaintext secrets in environment variables are also what AWS Security Hub control ECS.8 flags, which matters for SOC 2: if you're heading into an audit, see how to prepare ECS Fargate for SOC 2.

json

"containerDefinitions": [{
  "environment": [
    {"name": "LOG_LEVEL", "value": "warn"},
    {"name": "PORT",      "value": "8080"}
  ],
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/db/password"
    },
    {
      "name": "API_KEY",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api-key-AbCdEf"
    },
    {
      "name": "DB_USER",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:username::"
    }
  ]
}]

The last entry shows JSON key extraction from Secrets Manager — appending :username:: to the ARN injects only the username field from a JSON secret. The trailing :: (empty version-stage and version-id) are required. Requires Fargate platform 1.4.0+.
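To make the colon-separated layout concrete, here is a small sketch that splits a Secrets Manager valueFrom string into its parts. The function name parse_secret_ref and the returned dictionary keys are hypothetical; the field order (JSON key, version stage, version ID) follows the description above.

```python
def parse_secret_ref(value_from):
    """Split a Secrets Manager valueFrom into the secret ARN and its optional suffix fields."""
    parts = value_from.split(":")
    # A plain Secrets Manager ARN has exactly 7 colon-separated fields.
    if len(parts) < 7 or parts[2] != "secretsmanager":
        raise ValueError("not a Secrets Manager ARN: " + value_from)
    extras = parts[7:]
    if len(extras) > 3:
        raise ValueError("too many fields after the secret name")
    extras += [""] * (3 - len(extras))  # pad so missing fields read as empty
    json_key, version_stage, version_id = extras
    return {
        "secret_arn": ":".join(parts[:7]),
        "json_key": json_key or None,
        "version_stage": version_stage or None,
        "version_id": version_id or None,
    }


ref = parse_secret_ref(
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:username::"
)
```

Here ref["json_key"] is "username" and both version fields are empty, which is what the trailing :: expresses.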

| | SSM Parameter Store | Secrets Manager |
| --- | --- | --- |
| Cost | Free (Standard, up to 10k params) | $0.40 / secret / month |
| IAM action | ssm:GetParameters (plural) | secretsmanager:GetSecretValue |
| Max size | 4 KB (Standard), 8 KB (Advanced) | 64 KB |
| Built-in rotation | No (custom Lambda needed) | Yes (RDS, Redshift, DocumentDB) |
| JSON key extraction | No | Yes (ARN:key:: syntax) |

Key insight

Secrets are fetched once at task startup and injected as environment variables. Rotating a secret in SSM or Secrets Manager does not update any running container. To pick up rotated credentials, you must force a new deployment: aws ecs update-service --force-new-deployment.
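One way to reason about which tasks still hold the old credential: any task that started before the rotation timestamp is running with the pre-rotation value. The sketch below assumes you already have each task's start time (for instance from describe-tasks output); the function name and input shape are hypothetical.

```python
from datetime import datetime, timezone


def tasks_started_before_rotation(task_start_times, rotated_at):
    """Return IDs of tasks that started before the secret was rotated.

    task_start_times maps a task ID to its timezone-aware start time.
    These tasks still hold the old value in their environment.
    """
    return sorted(
        task_id
        for task_id, started_at in task_start_times.items()
        if started_at < rotated_at
    )


rotated = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
running = {
    "task-a": datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc),
    "task-b": datetime(2026, 6, 1, 13, 0, tzinfo=timezone.utc),
}
stale = tasks_started_before_rotation(running, rotated)
```

If stale is non-empty, the service needs a forced new deployment before the old credential is revoked.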

At scale, the cost difference is significant. 100 secrets: SSM is ~$0/month, Secrets Manager is ~$40/month. The common pattern: use SSM for simple key-value secrets, Secrets Manager when you need built-in rotation or JSON key extraction. Note that ssm:GetParameters (plural) is the permission the execution role needs. Granting only ssm:GetParameter (singular) is not enough: the fetch is denied and the task fails to start.

Source: Pass sensitive data to an Amazon ECS container · Pass Secrets Manager secrets through ECS environment variables

Health checks — three ways to loop forever

Three mistakes compound to create an endless restart loop: startPeriod too short, timeout ≥ interval, and the health check binary not installed in the image. All three are ECS defaults or copy-paste traps.

| Field | Default | Notes |
| --- | --- | --- |
| interval | 30 s | Time between checks. Min: 5 s |
| timeout | 5 s | Must be less than interval. Min: 2 s |
| retries | 3 | Failures before UNHEALTHY. Max: 10 |
| startPeriod | off (0) | Grace period: failures don't count against retries. Max: 300 s |

Mistake 1 — startPeriod off by default

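A JVM app takes 45 seconds to boot. With startPeriod: 0, every failed check counts against retries from the first probe. At the defaults (retries: 3 at interval: 30) the container is marked UNHEALTHY after roughly 90 seconds of failures, so any boot slower than that never survives. Tighten the interval to 10 seconds and the budget drops to about 30 seconds: the 45-second app is killed before it finishes booting, the service launches a replacement, and the replacement dies the same way.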

Fix: set startPeriod to 1.5× your worst-case startup time.
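The arithmetic behind this fix can be sketched in a few lines. Both helpers are hypothetical and deliberately rough: the failure budget is approximated as startPeriod plus interval × retries and ignores the check timeout, so treat the numbers as estimates rather than exact ECS behavior.

```python
import math

MAX_START_PERIOD_S = 300  # maximum startPeriod listed in the table above


def recommended_start_period(worst_case_startup_s, factor=1.5):
    """Apply the 1.5x rule of thumb, capped at the startPeriod maximum."""
    return min(MAX_START_PERIOD_S, math.ceil(worst_case_startup_s * factor))


def failure_budget_s(interval_s=30, retries=3, start_period_s=0):
    """Rough time from container start until enough failed checks mark it UNHEALTHY."""
    return start_period_s + interval_s * retries


start_period = recommended_start_period(45)
budget = failure_budget_s(interval_s=10, retries=3, start_period_s=start_period)
```

For a 45-second boot this suggests a startPeriod of 68 seconds. If the recommendation hits the 300-second cap, the app starts too slowly for a grace period alone to cover it.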

Mistake 2 — timeout ≥ interval

timeout: 30 with interval: 30 is invalid. The timeout must be strictly less than the interval — ECS needs time between when a check times out and when the next one starts. Use timeout: 5, interval: 30 as a baseline.
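A pre-deploy check for these field rules might look like the sketch below. It only encodes the defaults and bounds listed in the table above; the function name and the messages are hypothetical, and it is not a substitute for the validation ECS performs itself.

```python
def health_check_problems(health_check):
    """Return a list of human-readable problems for a healthCheck block (empty if none)."""
    interval = health_check.get("interval", 30)
    timeout = health_check.get("timeout", 5)
    retries = health_check.get("retries", 3)
    start_period = health_check.get("startPeriod", 0)

    problems = []
    if interval < 5:
        problems.append("interval is below the 5 s minimum")
    if timeout < 2:
        problems.append("timeout is below the 2 s minimum")
    if timeout >= interval:
        problems.append("timeout must be less than interval")
    if retries > 10:
        problems.append("retries is above the maximum of 10")
    if start_period > 300:
        problems.append("startPeriod is above the 300 s maximum")
    return problems


baseline = {"interval": 30, "timeout": 5, "retries": 3, "startPeriod": 60}
broken = {"interval": 30, "timeout": 30}
```

health_check_problems(baseline) comes back empty, while health_check_problems(broken) reports the timeout rule.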

Mistake 3 — curl not in the image

Minimal images (distroless, alpine without extras) don't have curl. The health check returns exit code 127 (command not found) on every attempt, and distroless images have no shell for CMD-SHELL to run in the first place. startPeriod only delays the outcome: once it ends, the retries exhaust and the container is marked UNHEALTHY. Fix: use your app's native runtime (see examples below), or add a compiled healthcheck binary to the image.

Build a proper /health endpoint

The best health check calls an endpoint your app already serves. If the app can't respond to HTTP, the container isn't healthy regardless of what curl says. Here's a minimal /health endpoint in Go:

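go

package main

import (
    "encoding/json"
    "log"
    "net/http"
    "os"
)

func main() {
    // Liveness endpoint: returns 200 with a small JSON body whenever the server is up.
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
    })

    // Read the port from the PORT environment variable, defaulting to 8080.
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }
    // ListenAndServe only returns on failure, so log the error instead of exiting silently.
    log.Fatal(http.ListenAndServe(":"+port, nil))
}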

Then reference it in the task definition:

json

"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}

For images without curl, call the endpoint with the app's own runtime or a compiled binary instead of curl:

json

"command": ["CMD-SHELL", "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8080/health')\""]
// or Go binary:
"command": ["CMD", "/bin/healthcheck"]

Note: for service tasks, an UNHEALTHY container is deregistered from the load balancer and replaced. For standalone tasks (run manually), health status is reported but the task is NOT stopped automatically — health checks are advisory only.
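For Python images without curl, the one-liner shown earlier can be expanded into a small module that is easier to read and test. This is a sketch under assumptions: the module name healthcheck.py, the default URL, and the 3-second timeout are all illustrative, and it treats only 2xx responses as healthy.

```python
import http.client
import urllib.request

DEFAULT_URL = "http://localhost:8080/health"  # assumed to match the endpoint above


def is_healthy(url=DEFAULT_URL, timeout_s=3.0):
    """Return True only if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all count as unhealthy.
        return False


def exit_code(url=DEFAULT_URL):
    """Map health to the exit status ECS expects: 0 for healthy, 1 for unhealthy."""
    return 0 if is_healthy(url) else 1
```

Assuming the file is importable from the container's working directory, a health check command along the lines of ["CMD", "python", "-c", "import healthcheck; raise SystemExit(healthcheck.exit_code())"] would use it without needing a shell.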

Source: Container health checks · HealthCheck API reference

Container ordering with dependsOn — and the essential trap

The dependsOn field controls container startup order, while essential controls what happens when a container exits. Misunderstanding either causes tasks to restart for no obvious reason.

| Condition | Meaning | Gotcha |
| --- | --- | --- |
| START | Dependency is in RUNNING state | Weakest: container may crash immediately after entering RUNNING |
| COMPLETE | Dependency exited (any code) | Cannot be set on essential containers |
| SUCCESS | Dependency exited with code 0 | Cannot be set on essential containers; for init/migration patterns |
| HEALTHY | Dependency's healthCheck passes | Confirmed at startup only; not monitored continuously after |

The essential field defaults to true when omitted. An init container that runs a DB migration and exits with code 0 will — by default — stop the entire task. This is the correct behavior for application crashes, but wrong for intentional short-lived containers.

DB migration pattern:

json

"containerDefinitions": [
  {
    "name": "db-migration",
    "image": "my-app:latest",
    "command": ["python", "manage.py", "migrate"],
    "essential": false,
    "startTimeout": 120,
    "healthCheck": {
      "command": ["CMD-SHELL", "exit 0"],
      "startPeriod": 5
    }
  },
  {
    "name": "app",
    "image": "my-app:latest",
    "essential": true,
    "dependsOn": [
      {"containerName": "db-migration", "condition": "SUCCESS"}
    ]
  }
]

startTimeout (max 120 s on Fargate) sets a deadline for resolving the dependency. Per the ContainerDefinition reference, it is specified on the container being waited on, here db-migration: if the migration hasn't exited with SUCCESS within 120 s, the app container gives up and the task stops. Without it, a hung migration can leave the task waiting indefinitely. Shutdown order is the reverse of startup order: ECS sends SIGTERM to the app first, then the migration container.
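The ordering and the essential trap can both be checked offline against a containerDefinitions list. The sketch below uses the standard library's graphlib (Python 3.9+) to derive a startup order and flags SUCCESS or COMPLETE dependencies on containers that are essential, including those that omit the field. The function names are hypothetical and this is not how ECS itself resolves dependencies.

```python
from graphlib import TopologicalSorter


def startup_order(containers):
    """Return container names ordered so every dependency starts before its dependent."""
    graph = {
        c["name"]: {dep["containerName"] for dep in c.get("dependsOn", [])}
        for c in containers
    }
    return list(TopologicalSorter(graph).static_order())


def essential_conflicts(containers):
    """Flag SUCCESS/COMPLETE dependencies whose target is essential (the default)."""
    by_name = {c["name"]: c for c in containers}
    problems = []
    for c in containers:
        for dep in c.get("dependsOn", []):
            target = by_name[dep["containerName"]]
            if dep["condition"] in ("SUCCESS", "COMPLETE") and target.get("essential", True):
                problems.append(
                    f'{c["name"]} waits for {dep["condition"]} on essential container {target["name"]}'
                )
    return problems


containers = [
    {"name": "db-migration", "essential": False},
    {"name": "app", "dependsOn": [{"containerName": "db-migration", "condition": "SUCCESS"}]},
]
```

With the definitions above, startup_order(containers) puts db-migration first and essential_conflicts(containers) is empty; dropping "essential": False from the migration container produces a conflict.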

Source: ContainerDependency API reference

Linux parameters — what works and what silently does nothing

Several linuxParameters fields (sharedMemorySize, tmpfs, devices) are accepted by Fargate without error but silently ignored at runtime, causing hard-to-debug failures. Knowing which ones actually work prevents those failures, especially on ML workloads.

| Parameter | Fargate support | Notes |
| --- | --- | --- |
| initProcessEnabled | ✓ Supported | Runs /sbin/docker-init as PID 1 (recommended) |
| capabilities.add: SYS_PTRACE | ✓ Supported | Only addable capability on Fargate; needed for debuggers, strace |
| capabilities.drop | ✓ Supported | Drop any default Linux capability |
| readonlyRootFilesystem | ✓ Supported | Hardens the container; apps writing to / need a bind mount volume |
| sharedMemorySize | ✗ Not supported | Not supported on Fargate per AWS docs. /dev/shm is limited on Fargate; move workloads needing large shared memory to EC2 launch type |
| tmpfs | ✗ Not supported | Not supported on Fargate per AWS docs |
| devices | ✗ Not supported | Host device passthrough unavailable on Fargate |
| maxSwap / swappiness | ✗ Not supported | Fargate manages swap internally |

initProcessEnabled: true runs /sbin/docker-init as PID 1. It forwards signals (so SIGTERM actually reaches your app), and reaps zombie processes. Without it, if your app spawns child processes, SIGTERM may not reach them — stopTimeout expires and SIGKILL fires instead of a graceful shutdown.

Key insight

sharedMemorySize and tmpfs are listed in the AWS docs under linuxParameters but are explicitly not supported on Fargate. PyTorch multi-process dataloaders and other workloads that need large /dev/shm will fail on Fargate. Move these workloads to EC2 launch type.
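Since these fields are easy to carry over from an EC2 task definition, a small lint can flag them before a Fargate deploy. The sketch below encodes only the rows of the table above; the names are hypothetical.

```python
# Fields from the table above that Fargate does not support.
UNSUPPORTED_ON_FARGATE = ("sharedMemorySize", "tmpfs", "devices", "maxSwap", "swappiness")


def fargate_linux_parameter_problems(linux_parameters):
    """Return a list of problems for a linuxParameters block destined for Fargate."""
    problems = [
        f"{field} is not supported on Fargate"
        for field in UNSUPPORTED_ON_FARGATE
        if field in linux_parameters
    ]
    added = linux_parameters.get("capabilities", {}).get("add", [])
    problems += [
        f"capability {cap} cannot be added on Fargate"
        for cap in added
        if cap != "SYS_PTRACE"  # the only addable capability per the table
    ]
    return problems


ml_params = {"initProcessEnabled": True, "sharedMemorySize": 2048}
```

fargate_linux_parameter_problems(ml_params) reports the sharedMemorySize entry and leaves initProcessEnabled alone.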

stopTimeout defaults to 30 seconds, max 120 seconds on Fargate. Set it to match your graceful shutdown time. Combined with initProcessEnabled: true, your app will actually receive SIGTERM and have time to act on it. This directly affects how cleanly environments drain during scheduled shutdowns.
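On the application side, receiving SIGTERM only helps if the process handles it. A minimal sketch of a graceful shutdown loop follows; the run function and its do_work and drain callbacks are hypothetical placeholders for your own request loop and cleanup.

```python
import signal
import threading

shutdown_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Keep the handler tiny: record the request and let the main loop react.
    shutdown_requested.set()


def run(do_work, drain):
    """Process work until SIGTERM arrives, then drain before returning."""
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, _on_sigterm)
    while not shutdown_requested.is_set():
        do_work()
    # drain() has to finish within stopTimeout, otherwise SIGKILL ends the process.
    drain()
```

The design choice is to do no real work inside the handler: it only flips a flag, and the loop finishes the current unit of work before draining.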

Source: LinuxParameters API reference · KernelCapabilities API reference

Networking in awsvpc mode

Fargate requires awsvpc network mode, which gives each task its own elastic network interface (ENI) with a private IP and security group. Containers in the same task share that ENI and communicate via localhost, not service names.

Container-to-container within the same task

Use localhost (or 127.0.0.1). Containers in the same task share one ENI. The hostname parameter in containerDefinitions is not supported in awsvpc mode.

Private subnet with no NAT gateway

ECS needs to pull images from ECR and reach AWS APIs. In a private subnet without NAT, you need VPC endpoints: com.amazonaws.region.ecr.api, com.amazonaws.region.ecr.dkr, and the S3 gateway endpoint (ECR uses S3 for image layers). Missing endpoints cause image pull failures with cryptic timeout errors.
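The endpoint names follow a fixed pattern, so the list for a given region can be generated rather than typed. In this sketch the function name and flags are hypothetical. The three ECR-related entries come from the paragraph above; the optional logs, Secrets Manager, and SSM entries are additions that only apply if the task uses those services.

```python
def required_vpc_endpoints(region, uses_awslogs=False, uses_secrets_manager=False, uses_ssm=False):
    """Return VPC endpoint service names a NAT-less private subnet needs for ECS tasks."""
    interface = [
        f"com.amazonaws.{region}.ecr.api",
        f"com.amazonaws.{region}.ecr.dkr",
    ]
    # Optional interface endpoints, depending on what the task definition references.
    if uses_awslogs:
        interface.append(f"com.amazonaws.{region}.logs")
    if uses_secrets_manager:
        interface.append(f"com.amazonaws.{region}.secretsmanager")
    if uses_ssm:
        interface.append(f"com.amazonaws.{region}.ssm")
    # ECR stores image layers in S3, which is reached through a gateway endpoint.
    gateway = [f"com.amazonaws.{region}.s3"]
    return {"interface": interface, "gateway": gateway}


endpoints = required_vpc_endpoints("us-east-1", uses_awslogs=True)
```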

Security groups are per-task, not per-container

The ENI belongs to the task. All containers in the task share the same security group rules. You cannot apply different inbound rules to different containers in the same task — if you need isolation, put them in separate tasks.

portMappings in awsvpc mode

Specify containerPort only. The hostPort field can be left out; if provided, it must equal containerPort. There is no port remapping: the container port is the port exposed on the task's IP.
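A quick check for port mappings copied from bridge-mode task definitions, where remapping was allowed, could look like this. The function name is hypothetical.

```python
def port_mapping_problems(port_mappings):
    """Flag portMappings whose hostPort differs from containerPort (invalid in awsvpc mode)."""
    problems = []
    for mapping in port_mappings:
        container_port = mapping["containerPort"]
        host_port = mapping.get("hostPort")  # omitted is fine in awsvpc mode
        if host_port is not None and host_port != container_port:
            problems.append(
                f"hostPort {host_port} must equal containerPort {container_port}"
            )
    return problems


mappings = [
    {"containerPort": 8080},
    {"containerPort": 8080, "hostPort": 80},
]
```

Only the second mapping is reported, because it tries to remap port 80 to 8080.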

Source: Task networking with the awsvpc network mode

Log configuration — the one line that breaks everything

ECS defaults to the awslogs driver, creates log groups with Never Expire retention, and requires the log group to exist before task launch; if it doesn't, the task fails to start. Three traps in one feature.

json

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group":          "/ecs/prod-api",
    "awslogs-region":         "us-east-1",
    "awslogs-stream-prefix":  "ecs",
    "awslogs-create-group":   "true",
    "awslogs-multiline-pattern": "^(ERROR|WARN|INFO)"
  }
}

Trap 1 — awslogs-create-group needs a permission

If the log group doesn't exist and awslogs-create-group is omitted (or false), the task fails at startup with: ResourceInitializationError: failed to configure log driver. Set awslogs-create-group: "true" AND add logs:CreateLogGroup to the execution role. Or pre-create log groups via Terraform with an explicit retention_in_days.

Trap 2 — Never Expire is the default

ECS creates log groups with no retention policy. Logs accumulate forever. A 15-service fleet at INFO level generates ~$135/month in CloudWatch costs from storage alone. See the full breakdown in the CloudWatch cost guide.

Trap 3 — json-file and syslog are not supported on Fargate

Only awslogs, splunk, and awsfirelens work on Fargate. Specifying json-file (the Docker default) will fail silently or cause launch errors. The awslogs-stream-prefix names each stream as: {prefix}/{container-name}/{task-id}.

The awslogs-multiline-pattern option groups multi-line log events — useful for Java stack traces and Python tracebacks. Without it, each line in a stack trace is a separate CloudWatch event, which makes Insights queries harder and slightly increases ingestion cost. Full steps to cut CloudWatch Logs costs on ECS are in the dedicated guide.
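To see what a pattern will do before deploying it, the grouping can be emulated locally. The sketch below is an approximation written for illustration, not the log driver's implementation: a line matching the pattern starts a new event, and every other line is appended to the current one.

```python
import re


def group_events(lines, pattern=r"^(ERROR|WARN|INFO)"):
    """Group log lines into events; a line matching the pattern starts a new event."""
    starts_event = re.compile(pattern)
    events = []
    for line in lines:
        if starts_event.search(line) or not events:
            events.append(line)
        else:
            # Continuation line (for example a stack frame): attach to the current event.
            events[-1] += "\n" + line
    return events


sample = [
    "INFO started",
    "ERROR request failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 1, in <module>',
    "ValueError: bad input",
    "INFO recovered",
]
events = group_events(sample)
```

The six sample lines collapse into three events, with the traceback attached to the ERROR line that precedes it.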

If you read this, you might also want to know

Can I update a task definition without a new deployment?

No. Registering a new task definition revision doesn't update running tasks. You must update the service (aws ecs update-service --task-definition family:revision) and trigger a deployment. ECS replaces old tasks with new ones according to the service's deployment configuration (rolling update or blue/green). The old revision stays registered and can be rolled back to.

How many containers can I run in a single task?

ECS caps a task definition at 10 containers, on Fargate and EC2 alike. In practice, more than 3–4 containers per task is a design smell, because every container in a task shares the CPU, memory, and network of that task. Separate concerns that scale independently into separate services with separate tasks.

What happens to my task definition if I delete the ECR image it references?

The task definition remains registered and the revision persists. Running tasks that already pulled the image are unaffected. New tasks launched from that definition will fail at image pull with a 404. ECS does not validate image existence at registration time — only at task launch.

