Guide · June 12, 2026 · 10 min read

ECS Task Definitions: Every Field, Common Mistakes, Best Practices

The AWS docs are a reference, not a guide. This covers the 8 mistakes that break ECS deployments — wrong IAM role, invalid Fargate CPU/memory combos, health checks that restart forever, secrets that don't rotate. Each one: what fails, what the error looks like, what the fix is.

Matt S
Platform engineer · Fortem
TL;DR
  • 01 executionRoleArn = ECS agent (ECR pull, CloudWatch, secrets fetch). taskRoleArn = your app code (S3, DynamoDB, SQS). Wrong role = AccessDenied that's hard to trace.
  • 02 Fargate CPU/memory combos are not ranges: 256 CPU only accepts 3 memory values. Invalid combos fail at deploy time with a cryptic error.
  • 03 Secrets are injected once at task start. Secret rotation does not update running containers. You must force a new deployment.
  • 04 Health check startPeriod is off by default. JVM apps die before they finish booting. Set startPeriod to at least 1.5× your worst-case startup time.
  • 05 Omitting essential on a container defaults it to true. An init/migration container that exits successfully will stop the entire task.
  • 06 sharedMemorySize and tmpfs are not supported on Fargate per AWS docs. ML workloads needing large /dev/shm must use EC2 launch type instead.

Two IAM roles, one common mistake

Every ECS task definition has two separate IAM roles. They look similar and use the same trust principal (ecs-tasks.amazonaws.com), but they serve completely different parts of the system.

Role | Used by | Purpose
executionRoleArn | ECS / Fargate agent | Pull image from ECR, write logs to CloudWatch, fetch secrets from SSM / Secrets Manager at task startup
taskRoleArn | Your app code | Call AWS APIs from inside the container: S3 GetObject, DynamoDB Query, SQS SendMessage, etc.

The mistake: the ECS console presents executionRoleArn prominently during task definition creation. Teams add S3 or DynamoDB permissions there. The task launches fine — but every AWS SDK call from the app returns AccessDeniedException. The task is running, logs are flowing, health check passes. The only symptom is API calls failing inside the app.

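The execution role needs AmazonECSTaskExecutionRolePolicy (covers ECR + CloudWatch) plus explicit permissions for any secrets you reference. kms:Decrypt is only needed when the secret or parameter is encrypted with a customer managed KMS key, and that key's ARN must then appear in Resource (the key ID below is a placeholder):

json
{
  "Effect": "Allow",
  "Action": [
    "ssm:GetParameters",
    "secretsmanager:GetSecretValue",
    "kms:Decrypt"
  ],
  "Resource": [
    "arn:aws:ssm:us-east-1:123456789012:parameter/prod/*",
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*",
    "arn:aws:kms:us-east-1:123456789012:key/<key-id>"
  ]
}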
Key insight
If your app can't reach S3 or DynamoDB, check the task role first. If the task fails to start entirely (no logs, no health checks), check the execution role. Different roles, different failure modes.
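
A small lint pass over the task definition can catch the role mix-up before anything is deployed. The sketch below is only an illustration: `check_roles` is a hypothetical helper (not part of any AWS SDK), and it assumes the task definition JSON has already been loaded into a Python dict.

```python
def check_roles(task_def: dict) -> list[str]:
    """Return warnings about the two IAM roles in a task definition dict."""
    warnings = []
    execution_role = task_def.get("executionRoleArn")
    task_role = task_def.get("taskRoleArn")

    if not execution_role:
        warnings.append("executionRoleArn missing: image pull, logging, and secrets fetch may fail")
    if not task_role:
        warnings.append("taskRoleArn missing: app code has no role for AWS API calls")
    if execution_role and execution_role == task_role:
        warnings.append("same role used for both: app code inherits agent permissions")
    return warnings


# Example: a definition that only sets the execution role.
example = {
    "family": "prod-api",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
}
for warning in check_roles(example):
    print(warning)
```

The check cannot tell whether the permissions inside each role are right; it only flags the structural cases where one role is absent or both point at the same ARN.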

Source: Amazon ECS task execution IAM role · Amazon ECS task IAM role

Fargate CPU and memory — the combinations that don't exist

Fargate CPU and memory are not free-form numbers. There are 8 CPU tiers, each with a fixed set of valid memory values. Anything outside this table fails at deployment with: No Fargate configuration exists for given values.

CPU value | vCPU | Valid memory
256 | 0.25 | 512 MiB, 1 GB, 2 GB only (3 discrete values)
512 | 0.5 | 1–4 GB (1 GB steps)
1024 | 1 | 2–8 GB (1 GB steps)
2048 | 2 | 4–16 GB (1 GB steps)
4096 | 4 | 8–30 GB (1 GB steps)
8192 | 8 | 16–60 GB (4 GB steps); Linux, platform 1.4.0+
16384 | 16 | 32–120 GB (8 GB steps); Linux, platform 1.4.0+
32768 | 32 | 60, 120, or 244 GB; Linux, platform 1.4.0+
Gotcha 1 — 256 CPU is not a range

The 256 CPU tier accepts exactly 512 MiB, 1024 MiB, or 2048 MiB. You cannot specify 768 MiB or any value between. If you need 1.5 GB, step up to the 512 CPU tier.

Gotcha 2 — 8192+ CPU uses non-1 GB increments

The 8192 tier (8 vCPU) uses 4 GB steps. Requesting 17 GB fails — you must choose 16 GB or 20 GB. The 16384 tier uses 8 GB steps. The 32768 tier only accepts 60, 120, or 244 GB.

Gotcha 3 — Terraform memory is in MiB, not GB

Terraform's aws_ecs_task_definition takes memory as an integer in MiB. Writing memory = 4 means 4 MiB, not 4 GB. The deployment will fail. Use memory = 4096 for 4 GB.
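
The three gotchas above can be checked mechanically before a deploy. This is a minimal sketch, assuming the lookup table is transcribed from the table in this section (worth re-checking against the current AWS docs); `FARGATE_MEMORY` and `is_valid_fargate_size` are hypothetical names, and memory is expressed in MiB as Terraform and the task definition expect.

```python
GIB = 1024  # MiB per GB, as used in task definitions

# Valid memory values (MiB) per CPU value, transcribed from the table above.
FARGATE_MEMORY = {
    256: [512, 1 * GIB, 2 * GIB],                 # three discrete values
    512: [g * GIB for g in range(1, 5)],          # 1-4 GB, 1 GB steps
    1024: [g * GIB for g in range(2, 9)],         # 2-8 GB, 1 GB steps
    2048: [g * GIB for g in range(4, 17)],        # 4-16 GB, 1 GB steps
    4096: [g * GIB for g in range(8, 31)],        # 8-30 GB, 1 GB steps
    8192: [g * GIB for g in range(16, 61, 4)],    # 16-60 GB, 4 GB steps
    16384: [g * GIB for g in range(32, 121, 8)],  # 32-120 GB, 8 GB steps
    32768: [60 * GIB, 120 * GIB, 244 * GIB],      # as listed in the table above
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """True if (cpu, memory_mib) appears in the table above."""
    return memory_mib in FARGATE_MEMORY.get(cpu, [])


print(is_valid_fargate_size(256, 768))         # between two valid values
print(is_valid_fargate_size(8192, 17 * GIB))   # not on a 4 GB step
print(is_valid_fargate_size(1024, 4))          # GB written where MiB is expected
print(is_valid_fargate_size(1024, 4096))       # 4 GB written correctly in MiB
```

Running a check like this in CI turns the deploy-time error into a failed build, which is usually easier to trace.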

Source: Troubleshoot Amazon ECS invalid CPU or memory errors · Fargate task definitions

Secrets: SSM vs Secrets Manager, and why rotation does nothing

Use the secrets array, not environment, for anything sensitive. The syntax difference is value vs valueFrom — and mixing them up causes a silent empty variable.

json
"containerDefinitions": [{
  "environment": [
    {"name": "LOG_LEVEL", "value": "warn"},
    {"name": "PORT",      "value": "8080"}
  ],
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/db/password"
    },
    {
      "name": "API_KEY",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api-key-AbCdEf"
    },
    {
      "name": "DB_USER",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:username::"
    }
  ]
}]

The last entry shows JSON key extraction from Secrets Manager: appending :username:: to the ARN injects only the username field from a JSON secret. The trailing :: (empty version-stage and version-id) is required. Requires Fargate platform 1.4.0+.
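
Because the trailing colons are easy to drop by hand, it can help to build the valueFrom string in code. A minimal sketch, assuming the suffix layout described above (json-key, then version-stage, then version-id); `secrets_manager_value_from` is a hypothetical helper name.

```python
def secrets_manager_value_from(secret_arn: str, json_key: str = "",
                               version_stage: str = "", version_id: str = "") -> str:
    """Build a valueFrom string of the form arn:json-key:version-stage:version-id."""
    if not (json_key or version_stage or version_id):
        return secret_arn  # whole secret, no suffix needed
    # Empty fields still need their separating colons.
    return f"{secret_arn}:{json_key}:{version_stage}:{version_id}"


arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf"
print(secrets_manager_value_from(arn, json_key="username"))
```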

Feature | SSM Parameter Store | Secrets Manager
Cost | Free (Standard, up to 10k params) | $0.40 / secret / month
IAM action | ssm:GetParameters (plural) | secretsmanager:GetSecretValue
Max size | 4 KB (Standard), 8 KB (Advanced) | 64 KB
Built-in rotation | No (custom Lambda needed) | Yes (RDS, Redshift, DocumentDB)
JSON key extraction | No | Yes (ARN:key:: syntax)
Key insight
Secrets are fetched once at task startup and injected as environment variables. Rotating a secret in SSM or Secrets Manager does not update any running container. To pick up rotated credentials, you must force a new deployment: aws ecs update-service --force-new-deployment.
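
The same forced deployment can be triggered from a rotation hook or a scheduled job. The sketch below assumes boto3's ECS client, whose update_service call accepts forceNewDeployment; the client is passed in as a parameter so the function can be exercised without AWS credentials. `force_redeploy`, `RecordingClient`, and the cluster and service names are hypothetical.

```python
def force_redeploy(ecs_client, cluster: str, service: str) -> dict:
    """Ask ECS to replace running tasks so they re-fetch secrets at startup."""
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )


class RecordingClient:
    """Stand-in for boto3.client("ecs") that records the call instead of making it."""

    def __init__(self):
        self.calls = []

    def update_service(self, **kwargs):
        self.calls.append(kwargs)
        return {"service": {"serviceName": kwargs["service"]}}


client = RecordingClient()
force_redeploy(client, "prod-cluster", "prod-api")
print(client.calls[0])
```

With real credentials, pass boto3.client("ecs") in place of the stand-in.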

At scale, the cost difference adds up. 100 secrets: SSM is ~$0/month, Secrets Manager is ~$40/month. The common pattern: use SSM for simple key-value secrets, Secrets Manager when you need built-in rotation or JSON key extraction. Note that ssm:GetParameters (plural) is the correct permission: ECS calls the GetParameters API, so an execution role that only grants ssm:GetParameter (singular) leaves the task unable to start, with an access-denied error on the secrets fetch.

Source: Pass sensitive data to an Amazon ECS container · Pass Secrets Manager secrets through ECS environment variables

Health checks — three ways to loop forever

Three mistakes compound to create an endless restart loop: startPeriod too short, timeout ≥ interval, and the health check binary not installed in the image. All three are ECS defaults or copy-paste traps.

Field | Default | Notes
interval | 30 s | Time between checks. Min: 5 s
timeout | 5 s | Must be less than interval. Min: 2 s
retries | 3 | Failures before UNHEALTHY. Max: 10
startPeriod | off (0) | Grace period: failures don't count against retries. Max: 300 s
Mistake 1 — startPeriod off by default

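A JVM app takes 45 seconds to boot. With startPeriod: 0, every failed check counts against retries from the first attempt. At the defaults (retries: 3, interval: 30) the container has roughly 90 seconds before it is marked UNHEALTHY, so a 45-second boot just makes it. Tighten the interval to 10 seconds, or let a cold start stretch past 90 seconds, and the container is marked UNHEALTHY before the app finishes booting. The service launches a replacement, which dies the same way.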

Fix: set startPeriod to 1.5× your worst-case startup time.
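
The arithmetic behind that rule is short enough to script. This is a rough sketch under stated assumptions: time-to-UNHEALTHY is approximated as startPeriod plus retries × interval (timeouts and scheduling jitter are ignored), and the 300-second cap comes from the table above. Both function names are hypothetical.

```python
import math

MAX_START_PERIOD = 300  # seconds, the maximum noted in the table above


def seconds_until_unhealthy(interval: int, retries: int, start_period: int = 0) -> int:
    """Approximate time until `retries` consecutive failures have been counted."""
    return start_period + interval * retries


def recommended_start_period(worst_case_startup_s: float, factor: float = 1.5) -> int:
    """Apply the 1.5x rule of thumb, capped at the allowed maximum."""
    return min(MAX_START_PERIOD, math.ceil(worst_case_startup_s * factor))


print(seconds_until_unhealthy(interval=30, retries=3))  # defaults, no grace period
print(seconds_until_unhealthy(interval=10, retries=3))  # tighter interval
print(recommended_start_period(45))                     # 45 s worst-case boot
```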

Mistake 2 — timeout ≥ interval

timeout: 30 with interval: 30 is invalid. The timeout must be strictly less than the interval — ECS needs time between when a check times out and when the next one starts. Use timeout: 5, interval: 30 as a baseline.

Mistake 3 — curl not in the image

Minimal images (distroless, alpine without extras) don't have curl. The health check exits with code 127 (command not found) on every attempt, so the container is marked UNHEALTHY as soon as startPeriod ends and the retries run out. Fix: use your app's native runtime (see examples below), or add a compiled healthcheck binary to the image.
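
The numeric rules from the table and from Mistake 2 can be validated before registration. A minimal sketch: `validate_health_check` is a hypothetical helper, the minimums and the timeout-below-interval rule follow the text above, and the upper bounds for interval (300) and timeout (60) plus the lower bound for retries (1) are taken from the HealthCheck API reference and are worth re-checking there.

```python
def validate_health_check(hc: dict) -> list[str]:
    """Check a healthCheck dict against the documented limits."""
    interval = hc.get("interval", 30)
    timeout = hc.get("timeout", 5)
    retries = hc.get("retries", 3)
    start_period = hc.get("startPeriod", 0)

    errors = []
    if not 5 <= interval <= 300:
        errors.append(f"interval ({interval}) must be between 5 and 300")
    if not 2 <= timeout <= 60:
        errors.append(f"timeout ({timeout}) must be between 2 and 60")
    if not 1 <= retries <= 10:
        errors.append(f"retries ({retries}) must be between 1 and 10")
    if not 0 <= start_period <= 300:
        errors.append(f"startPeriod ({start_period}) must be between 0 and 300")
    if timeout >= interval:
        errors.append(f"timeout ({timeout}) must be less than interval ({interval})")
    return errors


print(validate_health_check({"interval": 30, "timeout": 30}))
```

It does not detect Mistake 3; whether the health check binary exists can only be verified against the built image.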

Build a proper /health endpoint

The best health check calls an endpoint your app already serves. If the app can't respond to HTTP, the container isn't healthy regardless of what curl says. Here's a minimal /health endpoint in Go:

go
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "os"
)

func main() {
    // Liveness endpoint: returns 200 with a small JSON body.
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
    })

    // Listen on PORT if set, otherwise fall back to 8080.
    port := os.Getenv("PORT")
    if port == "" {
        port = "8080"
    }
    // ListenAndServe only returns on error; log it and exit non-zero.
    log.Fatal(http.ListenAndServe(":"+port, nil))
}

Then reference it in the task definition:

json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}

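For images without curl, call the app's own runtime instead. The CMD form runs the command directly, without a shell, which matters on images that don't ship one:

json
"command": ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]

Or, with a compiled health check binary baked into the image:

json
"command": ["CMD", "/bin/healthcheck"]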

Note: for service tasks, an UNHEALTHY container is deregistered from the load balancer and replaced. For standalone tasks (run manually), health status is reported but the task is NOT stopped automatically — health checks are advisory only.

Source: Container health checks · HealthCheck API reference

Container ordering with dependsOn — and the essential trap

dependsOn controls startup order. essential controls what happens when a container exits. Misunderstanding either causes tasks that restart for no obvious reason.

Condition | Meaning | Gotcha
START | Dependency is in RUNNING state | Weakest: container may crash immediately after entering RUNNING
COMPLETE | Dependency exited (any code) | Cannot be set on essential containers
SUCCESS | Dependency exited with code 0 | Cannot be set on essential containers; for init/migration patterns
HEALTHY | Dependency's healthCheck passes | Confirmed at startup only; not monitored continuously after

The essential field defaults to true when omitted. An init container that runs a DB migration and exits with code 0 will — by default — stop the entire task. This is the correct behavior for application crashes, but wrong for intentional short-lived containers.

DB migration pattern:

json
"containerDefinitions": [
  {
    "name": "db-migration",
    "image": "my-app:latest",
    "command": ["python", "manage.py", "migrate"],
    "essential": false,
    "startTimeout": 120,
    "healthCheck": {
      "command": ["CMD-SHELL", "exit 0"],
      "startPeriod": 5
    }
  },
  {
    "name": "app",
    "image": "my-app:latest",
    "essential": true,
    "dependsOn": [
      {"containerName": "db-migration", "condition": "SUCCESS"}
    ]
  }
]

startTimeout belongs on the dependency, here the migration container (max 120 s on Fargate). It sets a deadline for resolving the dependency: if the migration doesn't reach SUCCESS within 120 s, the app container gives up and never starts, and the task stops. Without a deadline, a hung migration can leave the task stuck waiting. Shutdown order is the reverse of startup order: ECS stops the app before the containers it depends on.
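
The essential trap is easy to detect statically: any container that another container waits on with SUCCESS or COMPLETE should have essential set to false. The sketch below is a hypothetical lint helper (`find_essential_traps` is not an AWS API) that assumes containerDefinitions has been loaded as a list of dicts.

```python
def find_essential_traps(container_defs: list[dict]) -> list[str]:
    """Names of containers awaited with SUCCESS/COMPLETE that are still essential."""
    # essential defaults to True when the field is omitted.
    essential = {c["name"]: c.get("essential", True) for c in container_defs}
    trapped = []
    for container in container_defs:
        for dep in container.get("dependsOn", []):
            target = dep["containerName"]
            exits_by_design = dep["condition"] in ("SUCCESS", "COMPLETE")
            if exits_by_design and essential.get(target, True) and target not in trapped:
                trapped.append(target)
    return trapped


containers = [
    {"name": "db-migration", "image": "my-app:latest"},  # essential omitted
    {
        "name": "app",
        "image": "my-app:latest",
        "dependsOn": [{"containerName": "db-migration", "condition": "SUCCESS"}],
    },
]
print(find_essential_traps(containers))
```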

Source: ContainerDependency API reference

Linux parameters — what works and what silently does nothing

Fargate accepts several linuxParameters fields without error but ignores them at runtime. Knowing which ones actually work prevents silent failures — especially on ML workloads.

Parameter | Fargate support | Notes
initProcessEnabled | ✓ Supported | Runs /sbin/docker-init as PID 1; recommended
capabilities.add: SYS_PTRACE | ✓ Supported | Only addable capability on Fargate; needed for debuggers, strace
capabilities.drop | ✓ Supported | Drop any default Linux capability
readonlyRootFilesystem | ✓ Supported | Hardens the container; apps writing to / need a bind mount volume
sharedMemorySize | ✗ Not supported | Not supported on Fargate per AWS docs; /dev/shm is limited on Fargate, so move workloads needing large shared memory to EC2 launch type
tmpfs | ✗ Not supported | Not supported on Fargate per AWS docs
devices | ✗ Not supported | Host device passthrough unavailable on Fargate
maxSwap / swappiness | ✗ Not supported | Fargate manages swap internally

initProcessEnabled: true runs /sbin/docker-init as PID 1. It forwards signals (so SIGTERM actually reaches your app), and reaps zombie processes. Without it, if your app spawns child processes, SIGTERM may not reach them — stopTimeout expires and SIGKILL fires instead of a graceful shutdown.

Key insight
sharedMemorySize and tmpfs are listed in the AWS docs under linuxParameters but are explicitly not supported on Fargate. PyTorch multi-process dataloaders and other workloads that need large /dev/shm will fail on Fargate. Move these workloads to EC2 launch type.
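
Since these fields are easy to carry over from an EC2 task definition, a pre-deploy check can flag them. A minimal sketch, assuming the unsupported list from the table above; `fargate_linux_parameter_issues` is a hypothetical helper and takes the linuxParameters object as a dict.

```python
# Fields the table above lists as not supported on Fargate.
UNSUPPORTED_ON_FARGATE = ("sharedMemorySize", "tmpfs", "devices", "maxSwap", "swappiness")


def fargate_linux_parameter_issues(linux_parameters: dict) -> list[str]:
    """List linuxParameters settings that the table above marks as unsupported."""
    issues = [
        f"{key} is not supported on Fargate"
        for key in UNSUPPORTED_ON_FARGATE
        if key in linux_parameters
    ]
    # SYS_PTRACE is the only capability that can be added on Fargate.
    added = linux_parameters.get("capabilities", {}).get("add", [])
    issues += [
        f"capability {cap} cannot be added on Fargate"
        for cap in added
        if cap != "SYS_PTRACE"
    ]
    return issues


print(fargate_linux_parameter_issues({
    "initProcessEnabled": True,
    "sharedMemorySize": 2048,
    "capabilities": {"add": ["NET_ADMIN"]},
}))
```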

stopTimeout defaults to 30 seconds, max 120 seconds on Fargate. Set it to match your graceful shutdown time. Combined with initProcessEnabled: true, your app will actually receive SIGTERM and have time to act on it. This directly affects how cleanly environments drain during scheduled shutdowns.

Source: LinuxParameters API reference · KernelCapabilities API reference

Networking in awsvpc mode

Fargate requires networkMode: awsvpc. Each task gets its own elastic network interface with a private IP and security group. Containers within the same task share this ENI and communicate via localhost.

Container-to-container within the same task

Use localhost (or 127.0.0.1). Containers in the same task share one ENI. The hostname parameter in containerDefinitions is not supported in awsvpc mode.

Private subnet with no NAT gateway

ECS needs to pull images from ECR and reach AWS APIs. In a private subnet without NAT, you need VPC endpoints: com.amazonaws.region.ecr.api, com.amazonaws.region.ecr.dkr, and the S3 gateway endpoint (ECR uses S3 for image layers). Tasks that use the awslogs driver or fetch secrets also need endpoints for CloudWatch Logs, Secrets Manager, or SSM. Missing endpoints surface as cryptic timeout errors at task start, most often on image pull.
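
The endpoint service names follow a fixed pattern, so the list for a given region can be generated rather than typed. This is a sketch under assumptions: `required_vpc_endpoints` is a hypothetical helper, the first three names are the ones listed above, and the optional logs, secretsmanager, and ssm endpoints are additions beyond the list above for tasks that use those services.

```python
def required_vpc_endpoints(region: str, uses_awslogs: bool = False,
                           uses_secrets_manager: bool = False,
                           uses_ssm: bool = False) -> list[str]:
    """VPC endpoint service names a Fargate task needs in a subnet without NAT."""
    prefix = f"com.amazonaws.{region}"
    endpoints = [
        f"{prefix}.ecr.api",  # ECR API calls (auth, image metadata)
        f"{prefix}.ecr.dkr",  # Docker registry endpoint for pulls
        f"{prefix}.s3",       # gateway endpoint: image layers live in S3
    ]
    if uses_awslogs:
        endpoints.append(f"{prefix}.logs")
    if uses_secrets_manager:
        endpoints.append(f"{prefix}.secretsmanager")
    if uses_ssm:
        endpoints.append(f"{prefix}.ssm")
    return endpoints


for name in required_vpc_endpoints("us-east-1", uses_awslogs=True):
    print(name)
```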

Security groups are per-task, not per-container

The ENI belongs to the task. All containers in the task share the same security group rules. You cannot apply different inbound rules to different containers in the same task — if you need isolation, put them in separate tasks.

portMappings in awsvpc mode

Specify containerPort only. hostPort can be omitted; if you set it, it must equal containerPort. There is no port remapping: the container port is the port exposed on the task's IP.

Source: Task networking with the awsvpc network mode

Log configuration — the one line that breaks everything

ECS defaults to the awslogs driver, creates log groups with Never Expire retention, and requires the log group to exist before task launch (otherwise the task fails to start). Three traps in one feature.

json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group":          "/ecs/prod-api",
    "awslogs-region":         "us-east-1",
    "awslogs-stream-prefix":  "ecs",
    "awslogs-create-group":   "true",
    "awslogs-multiline-pattern": "^(ERROR|WARN|INFO)"
  }
}
Trap 1 — awslogs-create-group needs a permission

If the log group doesn't exist and awslogs-create-group is omitted (or false), the task fails at startup with: ResourceInitializationError: failed to configure log driver. Set awslogs-create-group: "true" AND add logs:CreateLogGroup to the execution role. Or pre-create log groups via Terraform with an explicit retention_in_days.

Trap 2 — Never Expire is the default

ECS creates log groups with no retention policy. Logs accumulate forever. A 15-service fleet at INFO level generates ~$135/month in CloudWatch costs from storage alone. See the full breakdown in the CloudWatch cost guide.

Trap 3 — json-file and syslog are not supported on Fargate

Only awslogs, splunk, and awsfirelens work on Fargate. Specifying json-file (the Docker default) will fail silently or cause launch errors. The awslogs-stream-prefix names each stream as: {prefix}/{container-name}/{task-id}.

The awslogs-multiline-pattern option groups multi-line log events — useful for Java stack traces and Python tracebacks. Without it, each line in a stack trace is a separate CloudWatch event, which makes Insights queries harder and slightly increases ingestion cost. Full steps to cut CloudWatch Logs costs on ECS are in the dedicated guide.
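
Generating the logConfiguration block from one helper keeps the option names and the stream naming consistent across services. A minimal sketch: all three function names are hypothetical, the driver set is the one named in Trap 3, and the stream name follows the {prefix}/{container-name}/{task-id} pattern described above.

```python
FARGATE_LOG_DRIVERS = {"awslogs", "splunk", "awsfirelens"}  # the three named in Trap 3


def awslogs_configuration(group: str, region: str, stream_prefix: str,
                          create_group: bool = False) -> dict:
    """Build a logConfiguration block for the awslogs driver."""
    options = {
        "awslogs-group": group,
        "awslogs-region": region,
        "awslogs-stream-prefix": stream_prefix,
    }
    if create_group:
        # Also requires logs:CreateLogGroup on the execution role (Trap 1).
        options["awslogs-create-group"] = "true"
    return {"logDriver": "awslogs", "options": options}


def stream_name(stream_prefix: str, container_name: str, task_id: str) -> str:
    """Stream naming pattern described above: prefix/container-name/task-id."""
    return f"{stream_prefix}/{container_name}/{task_id}"


def is_fargate_log_driver(driver: str) -> bool:
    return driver in FARGATE_LOG_DRIVERS


config = awslogs_configuration("/ecs/prod-api", "us-east-1", "ecs", create_group=True)
print(config["options"]["awslogs-create-group"])
print(stream_name("ecs", "app", "0123456789abcdef"))
```

Pre-creating the log group with an explicit retention period, as suggested in Trap 1, remains the simpler route when the infrastructure is already managed in Terraform.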

If you read this, you might also want to know

Can I update a task definition without a new deployment?

No. Registering a new task definition revision doesn't update running tasks. You must update the service (aws ecs update-service --task-definition family:revision) and trigger a deployment. ECS replaces old tasks with new ones according to the service's deployment configuration (rolling update or blue/green). The old revision stays registered and can be rolled back to.

How many containers can I run in a single task?

A task definition can hold at most 10 containers; this is an ECS service quota, not a Fargate-specific setting. In practice, more than 3–4 containers per task is a design smell, because every container in a task shares that task's CPU, memory, and network. Separate concerns that scale independently into separate services with separate tasks.

What happens to my task definition if I delete the ECR image it references?

The task definition remains registered and the revision persists. Running tasks that already pulled the image are unaffected. New tasks launched from that definition will fail at image pull with a 404. ECS does not validate image existence at registration time — only at task launch.


Task definitions are one layer. Operating the fleet is the other 90%.

Correct task definitions get your containers running. Scheduling, cost visibility, environment cloning, and developer self-service keep 10+ environments from becoming a full-time job.