Matt S
Platform engineer at Fortem · 9 min read

ECS Deployment Strategies: When Rolling Updates Break and What to Do Instead

Rolling updates are the ECS default. For most stateless services, they work fine — ECS launches new tasks, waits for health checks, then drains the old ones. No downtime. But three specific situations break this model: incompatible database schema changes, long-lived connections like WebSockets, and teams that need canary testing before full traffic shift. AWS launched ECS Native Blue/Green in July 2025 to address all three. Here's how to pick the right strategy for your service.

TL;DR
  • ECS rolling update (default) launches new tasks first, waits for health checks, then drains old ones, so there is no service degradation with minimumHealthyPercent=100
  • Rolling update rollback takes 30–120 seconds per batch (new tasks must launch). Blue/green rollback takes seconds (ALB rule flip, blue tasks already warm)
  • 3 cases that need blue/green: incompatible DB schema migrations, WebSocket/gRPC connections, or canary testing at 10% before full traffic shift
  • ECS Native Blue/Green launched July 2025, with no CodeDeploy required. CodeDeploy was removed from the ECS console in early 2026
  • Canary and Linear strategies added October 2025. NLB support added February 2026

How rolling updates work

With minimumHealthyPercent=100 (the Fargate default), ECS launches all new tasks first, waits for ALB health checks to pass, then drains old tasks. Old tasks never lose traffic until new ones are healthy.

Two parameters control the rollout: minimumHealthyPercent sets the floor on healthy tasks ECS must keep running during a deployment (as a percentage of the desired count), and maximumPercent sets the ceiling on total tasks (old + new) ECS can run simultaneously. With the defaults (100/200) and 4 desired tasks, ECS spins up 4 new tasks, waits for all 4 to become healthy, then drains all 4 old ones.

Rolling update sequence (4 tasks, minimumHealthyPercent=100)

| Time  | Old tasks | New tasks | Event                                    |
|-------|-----------|-----------|------------------------------------------|
| T+0s  | 4         | 0         | Deploy triggered                         |
| T+30s | 4         | 4         | New tasks launching (PENDING)            |
| T+60s | 4         | 4         | New tasks pass health checks             |
| T+90s | 0         | 4         | Old tasks draining (300s deregistration) |
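
To make the two percentages concrete, here is a minimal sketch of the task-count limits they imply. It assumes the healthy floor rounds up and the total ceiling rounds down; `rolling_bounds` is a hypothetical helper for illustration, not an AWS API.

```python
def rolling_bounds(desired: int, min_healthy_pct: int = 100, max_pct: int = 200) -> dict:
    """Task-count limits ECS works within during a rolling update."""
    # Integer math avoids float surprises: floor rounds up, ceiling rounds down.
    min_healthy = (desired * min_healthy_pct + 99) // 100
    max_total = (desired * max_pct) // 100
    return {
        "min_healthy": min_healthy,
        "max_total": max_total,
        # Extra tasks ECS may start before it has to stop any old ones.
        "surge": max_total - desired,
    }


print(rolling_bounds(4))            # defaults (100/200): room for 4 extra tasks
print(rolling_bounds(4, 100, 125))  # tighter ceiling: room for 1 extra task
```

With the defaults, a 4-task service can double to 8 tasks before anything drains. At maximumPercent=125 the same service only has room for one new task at a time, so the rollout proceeds in single-task batches.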

The deregistration delay (default 300 seconds) is where long-lived connections live or die. During those 5 minutes, the ALB sends no new requests to draining tasks but allows existing connections to complete. HTTP APIs handle this fine. WebSocket connections that last longer than 300 seconds get forcibly closed.
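
As a rough illustration of that cutoff, the sketch below takes the remaining lifetime of each open connection at the moment draining starts and returns the ones that would still be open when the window ends. The connection lifetimes are made-up sample values.

```python
def dropped_on_deploy(remaining_seconds: list[int], deregistration_delay: int = 300) -> list[int]:
    """Connections still open when the drain window ends (these get closed)."""
    return [r for r in remaining_seconds if r > deregistration_delay]


# Two short HTTP requests, one 15-minute dashboard session, one 1-hour chat.
open_connections = [2, 45, 900, 3600]
print(dropped_on_deploy(open_connections))       # long-lived sessions only
print(dropped_on_deploy(open_connections, 30))   # a shorter delay drops more
```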

“ECS rolling updates use the minimum healthy percent and maximum percent settings to determine the deployment strategy, ensuring that the specified number of tasks continue to run during the deployment.”

AWS ECS Deployment Documentation, verified June 2026

When rolling update is enough

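Rolling update is the right choice for most ECS services: stateless APIs, background workers, and anything with short-lived HTTP connections. The ECS deployment circuit breaker handles bad deploys automatically once it is enabled with rollback. The console turns it on by default for new services, while the API, CLI, and Terraform leave it off unless you set it.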

The circuit breaker threshold is max(3, min(0.5 × desiredCount, 200)). For a 4-task service, ECS will roll back automatically after 3 consecutive task failures. For a 100-task service, the threshold is 50 failed tasks. Once triggered, ECS rolls back by redeploying the last successful task definition — no manual intervention.
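
The formula is easy to check by hand, but a small function makes the clamping visible. This is a sketch of the arithmetic above, assuming fractional results round up; `circuit_breaker_threshold` is a hypothetical name.

```python
def circuit_breaker_threshold(desired_count: int) -> int:
    """Failed-task count at which the deployment circuit breaker trips."""
    half = (desired_count + 1) // 2   # 0.5 x desiredCount, rounded up
    return max(3, min(half, 200))     # clamp to the 3..200 range


for count in (4, 25, 100, 1000):
    print(count, circuit_breaker_threshold(count))
```

Small services sit on the floor of 3 and very large ones on the cap of 200, so only the middle of the range scales with the desired count.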

Rolling update is sufficient when:
  • Stateless HTTP APIs: no session state means any task can handle any request, so mixed old/new versions are safe
  • Background workers / queues: workers pick up jobs independently, and a version mismatch doesn't cause errors if messages are compatible
  • Backward-compatible schema changes: adding nullable columns, new tables, or indexes, so old tasks can run against the new schema
  • A rollback that takes 2+ minutes is acceptable: the circuit breaker rolls back in the same time as a forward deploy, one cold-start cycle
  • Short HTTP connections only: a deregistration delay of 30–60s handles 99% of HTTP keep-alive sessions
Key insight

Enable the ECS deployment circuit breaker on every service. Set deploymentCircuitBreaker: { enable: true, rollback: true } in your service config or Terraform. It costs nothing and prevents a bad deploy from staying stuck in a deployment loop for hours.

The 3 cases where it breaks

Three specific patterns make rolling updates dangerous: breaking database migrations, long-lived connections that can't survive the deregistration window, and canary testing requirements where you need to validate 10% of traffic before committing to 100%.

01 Breaking database schema migrations

During a rolling update, v1 and v2 tasks run simultaneously for 30–90 seconds. If your v2 migration renames column user_name to username, the still-running v1 tasks throw column user_name does not exist. Not a partial failure — 100% of v1 task queries fail until they're drained.
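
The failure is easy to reproduce locally. The sketch below uses an in-memory SQLite database as a stand-in for the production database (an assumption for illustration; it needs SQLite 3.25 or newer for RENAME COLUMN) and runs the v1 query before and after the rename.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
db.execute("INSERT INTO users (user_name) VALUES ('ada')")

V1_QUERY = "SELECT user_name FROM users"  # what the old tasks still run
V2_QUERY = "SELECT username FROM users"   # what the new tasks run


def try_query(sql: str) -> str:
    """Run a query and report 'ok' or the database error."""
    try:
        db.execute(sql).fetchall()
        return "ok"
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


before = try_query(V1_QUERY)
db.execute("ALTER TABLE users RENAME COLUMN user_name TO username")  # v2 migration
after_v1 = try_query(V1_QUERY)
after_v2 = try_query(V2_QUERY)

print("v1 before migration:", before)
print("v1 after migration: ", after_v1)
print("v2 after migration: ", after_v2)
```

Every v1 query that touches the renamed column fails from the moment the migration commits, regardless of how many v1 tasks are still in rotation.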

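Blue/green fix: run the migration during the POST_SCALE_UP lifecycle hook, after green tasks launch but before any production traffic shifts, and validate green against the new schema there. If validation fails, the deployment aborts before green has served a single user. Blue still reads the same database until the shift, so keep a destructive change like this rename as close to the cutover as possible, or split it into backward-compatible expand and contract steps.

02 WebSocket and gRPC streaming connections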

The ALB deregistration delay (default 300 seconds) forcibly closes any connections still open after the window expires. WebSocket clients see close code 1006 (abnormal closure). gRPC streaming clients get UNAVAILABLE. Sessions longer than 5 minutes — real-time dashboards, chat, live data feeds — get dropped on every deploy.

Blue/green fix: green tasks take all new connections, while blue tasks keep running through the termination wait you configure, so existing clients can disconnect on their own schedule instead of being cut off at the 300-second deregistration mark.

03 Canary testing before full traffic shift

Rolling updates have no traffic control. ECS adds tasks and drains old ones — you can't say “send 10% of traffic to v2 and watch error rates for 5 minutes before committing.” With a 4-task service and maximumPercent=125, you get one new task (25%) or nothing — no granular control.

Blue/green fix: ECS Native Canary sends exactly 10% of ALB traffic to green for a configurable bake period. CloudWatch alarms automatically roll back if error rate or p99 latency spikes.

ECS Native Blue/Green (July 2025)

AWS launched built-in ECS blue/green on July 18, 2025. No CodeDeploy required, no AppSpec files, no separate application resource. Set strategy: BLUE_GREEN on your ECS service, point it at two ALB target groups, and ECS manages the rest.

The mechanism: ECS stores an immutable snapshot of your service config called a service revision. When you deploy, ECS creates a new “green” task set registered to the secondary target group. Blue continues serving 100% of production traffic. After the green tasks pass health checks, ECS shifts the ALB listener rule weight from 100/0 to 0/100 (all-at-once), or incrementally for canary and linear strategies.

Ready to use — minimum ECS Native Blue/Green setup

Terraform — requires two ALB target groups (blue/green) and an IAM role with AmazonECSInfrastructureRolePolicyForLoadBalancers.

hcl
resource "aws_ecs_service" "my_service" {
  name            = "my-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4

  # Attribute names assume the AWS provider's native blue/green schema
  # (6.x); check the aws_ecs_service docs for your provider version.
  deployment_configuration {
    strategy             = "BLUE_GREEN" # all-at-once shift
    bake_time_in_minutes = 5            # blue stays warm for rollback (300s)

    lifecycle_hook {
      hook_target_arn  = aws_lambda_function.smoke_test.arn
      role_arn         = aws_iam_role.ecs_hooks.arn
      lifecycle_stages = ["POST_SCALE_UP"]
    }
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.blue.arn
    container_name   = "my-service"
    container_port   = 8080

    # Second target group for green traffic, attached to the same service.
    # aws_lb_listener_rule.prod and aws_iam_role.ecs_lb are placeholder names;
    # the role carries AmazonECSInfrastructureRolePolicyForLoadBalancers.
    advanced_configuration {
      alternate_target_group_arn = aws_lb_target_group.green.arn
      production_listener_rule   = aws_lb_listener_rule.prod.arn
      role_arn                   = aws_iam_role.ecs_lb.arn
    }
  }

  # Required: prevent Terraform from reverting ECS-managed task def updates
  lifecycle {
    ignore_changes = [task_definition]
  }
}

The 7 lifecycle hooks (PRE_SCALE_UP, POST_SCALE_UP, TEST_TRAFFIC_SHIFT, and 4 more) are Lambda functions that ECS polls every 30 seconds. Each hook returns SUCCEEDED, FAILED, or IN_PROGRESS. A hook that returns FAILED aborts the deployment and triggers rollback. This is where you run smoke tests, DB migration validation, or health checks against the test listener — before any production user sees the new version.
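
A hook is just a Lambda handler that maps a check result onto one of those three statuses. The sketch below is one way to structure it, under stated assumptions: the `hookStatus` response key is my reading of the hook contract and should be confirmed against the ECS lifecycle hook documentation, and `make_hook_handler` and the `check` callable are hypothetical names.

```python
from typing import Callable, Optional


def make_hook_handler(check: Callable[[], Optional[bool]]):
    """Wrap a smoke test in a Lambda-style handler.

    `check` returns True (passed), False (failed), or None (still running).
    """
    def handler(event: dict, context: object) -> dict:
        try:
            result = check()
        except Exception:
            result = False  # a crashing check counts as a failed check

        if result is None:
            status = "IN_PROGRESS"  # ECS will poll the hook again
        elif result:
            status = "SUCCEEDED"
        else:
            status = "FAILED"       # aborts the deployment and rolls back
        # Assumed response shape; verify the key name against the ECS docs.
        return {"hookStatus": status}

    return handler


# Example: a placeholder check that always passes.
handler = make_hook_handler(lambda: True)
print(handler({}, None))
```

Treating an exception as FAILED is a deliberate choice here: a smoke test that cannot run should block the traffic shift instead of letting it through.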

Key insight

The TEST_TRAFFIC_SHIFT hook enables “dark canary” testing: a separate ALB listener (different port or header rule) routes test requests to the green task set before any production traffic shifts. An ALB does not mirror requests on its own, so the test traffic is whatever you send to that listener, such as synthetic checks or replayed production requests. Production impact stays at zero because blue still serves 100% of real users.

For teams already using CodeDeploy: AWS removed the CODE_DEPLOY controller type from the ECS console in early 2026 and published a migration guide. The native approach has no AppSpec files, longer lifecycle hook timeouts (up to 24 hours per stage vs CodeDeploy's 1-hour total limit), and supports ECS Service Connect which CodeDeploy never did.

Canary and Linear strategies

Canary sends a fixed percentage of traffic to green for a bake period, then shifts the rest. Linear moves traffic in equal increments on a schedule. Both were added to ECS Native Blue/Green on October 30, 2025. Both support automatic CloudWatch alarm rollback.

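| Strategy         | Duration | Traffic split      | When to use                                                |
|------------------|----------|--------------------|------------------------------------------------------------|
| All-at-once      | < 5 min  | 0% → 100%          | Dev/staging. Fast iteration. No validation needed.         |
| Canary 10%/5 min | ~5 min   | 10% → wait → 100%  | Production APIs. Catches error spikes at 1/10 blast radius. |
| Linear 10%/1 min | 10 min   | 10%, 20%, … 100%   | Regulated workloads. SOC 2 evidence of gradual rollout.    |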
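
The traffic splits in the table can be written out as step lists, which also makes the "1/10 blast radius" claim easy to quantify. This is an illustrative sketch only: the function names are hypothetical and the 200 requests per second figure is a made-up load.

```python
def canary_steps(canary_percent: int) -> list[int]:
    """A canary is two steps: the canary slice, then everything."""
    return [canary_percent, 100]


def linear_steps(step_percent: int) -> list[int]:
    """Green traffic percentages for a linear rollout with equal steps."""
    steps = list(range(step_percent, 100, step_percent))
    return steps + [100]  # always finish at full traffic


def requests_on_green(rps: float, green_percent: int, seconds: int) -> int:
    """Requests the new version serves while traffic is held at green_percent."""
    return round(rps * seconds * green_percent / 100)


print(canary_steps(10))
print(linear_steps(10))
# A 5-minute bake at 10% on a service handling 200 requests per second:
print(requests_on_green(200, 10, 300), "of", 200 * 300, "requests hit green")
```

During the bake, a bad release touches one request in ten. The same window under all-at-once would expose every request.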

Canary and Linear require two CloudWatch alarms — one on error rate, one on p99 latency — wired as rollback triggers. If either alarm fires during the bake period, ECS flips the ALB rule back to blue in seconds. Configure alarms on the green target group specifically, not the service as a whole, so the alarm fires on the new version's traffic only.

bash
# Set canary strategy on an existing service.
# Field names assume the current ECS UpdateService request shape;
# confirm with `aws ecs update-service help` before running.
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration '{
    "strategy": "CANARY",
    "bakeTimeInMinutes": 5,
    "canaryConfiguration": {
      "canaryPercent": 10,
      "canaryBakeTimeInMinutes": 5
    },
    "alarms": {
      "alarmNames": [
        "my-service-green-5xx-rate",
        "my-service-green-p99-latency"
      ],
      "enable": true,
      "rollback": true
    }
  }'

Decision table — pick your strategy

Start with rolling update. Add blue/green only when your service hits one of the three failure modes. Use canary in production when your team needs evidence of gradual rollout — either for operational confidence or compliance.

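| Your situation                                  | Rolling | Blue/Green All-at-once | Blue/Green Canary |
|-------------------------------------------------|---------|------------------------|-------------------|
| Stateless HTTP API, backward-compatible deploys |         |                        |                   |
| Background worker / queue consumer              |         |                        |                   |
| Need instant rollback (seconds, not minutes)    |         |                        |                   |
| DB migration that breaks old schema             |         |                        |                   |
| WebSocket / gRPC streaming                      |         |                        |                   |
| SOC 2 / PCI audit evidence of gradual rollout   |         |                        |                   |
| Want to catch errors at 10% blast radius first  |         |                        |                   |
| Dev and staging environments                    |         |                        |                   |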

Don't blue/green dev and staging environments — the ALB cost ($0.008/hour per LCU, roughly $7–15/month per ALB) adds up fast across 10+ environments. Use rolling update there, save blue/green for production. If you're already running environment scheduling to cut non-prod costs, adding a second ALB per environment would offset most of those savings.

If you read this, you might also want to know

Can I switch an existing ECS service from rolling to blue/green without downtime?

Yes. Update the service's deploymentConfiguration.strategy to BLUE_GREEN and add the second ALB target group. The change takes effect on the next deployment — the running tasks continue serving traffic unaffected. The first blue/green deployment after the switch will use the new strategy.

How does ECS blue/green interact with ECS Service Auto Scaling?

Auto Scaling stays active during blue/green deployments, but scaling events during a deployment can cause it to fail. If a scaling event fires while ECS is waiting for green tasks to reach steady state, ECS gives the service 5 minutes to stabilize — if it doesn't, the deployment fails. Pause scaling policies during critical deployments by setting min/max capacity to your desired count temporarily.

Does ECS Native Blue/Green work with AWS CDK?

Yes, as of late 2025. The CDK ECS module supports deploymentStrategy: ecs.DeploymentStrategy.BLUE_GREEN with all three traffic shifting options (all-at-once, canary, linear). The CDK approach is cleaner than CodeDeploy because there's no separate CodeDeploymentGroup resource — it's all on the FargateService construct.

What happens to blue tasks after a successful blue/green deployment?

Blue tasks stay running for the termination wait time you configured (default 5 minutes, max 2 days). During this window you can roll back instantly. After the wait, ECS drains and terminates the blue task set. You pay Fargate per-second for both sets during the overlap — at $0.04048/vCPU-hour, 5 minutes of a 2-vCPU service costs $0.007.
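
The overlap cost is simple enough to compute for your own numbers. The sketch below reproduces the arithmetic above for the vCPU charge only; Fargate bills memory separately, so treat the result as a lower bound. `overlap_cpu_cost` is a hypothetical helper and the default price is the one quoted in this article.

```python
def overlap_cpu_cost(vcpus: float, minutes: float, price_per_vcpu_hour: float = 0.04048) -> float:
    """vCPU cost in USD of keeping the blue task set alive for `minutes`."""
    return vcpus * price_per_vcpu_hour * minutes / 60


print(round(overlap_cpu_cost(2, 5), 3))        # the 5-minute default
print(round(overlap_cpu_cost(2, 2 * 24 * 60), 2))  # the 2-day maximum
```

Even at the maximum wait, the vCPU overlap for a 2-vCPU service stays under four dollars per deployment, so the termination wait can be sized for rollback safety instead of cost.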
