When should you use ECS with Fargate?

Use Fargate when you want zero server management — no AMI patching, no ASG tuning, no capacity planning. Best for: dev/staging environments, bursty workloads, teams under 5 people, any workload where ops simplicity matters more than per-vCPU cost. Run ECS on EC2 when you need GPU, specific instance types, or deep cost optimisation at scale.

What are the limitations of ECS Fargate?

No GPU support, fixed task size combos (you can't pick arbitrary vCPU/GB ratios), 20 GiB of ephemeral storage per task by default (configurable up to 200 GiB on platform version 1.4.0+), no privileged containers, no daemon scheduling. Fargate also costs more than EC2 for memory-heavy workloads because memory is charged separately ($0.00444/GB-hr).

Is ECS Fargate HIPAA compliant?

Yes — Fargate is covered under the AWS HIPAA BAA. You still need to configure encryption at rest (EFS/EBS) and in transit (TLS), use Fargate platform version 1.4.0+, and ensure your task IAM roles follow least privilege. Fargate's isolation model (per-task ENI) is actually stronger for compliance than shared EC2 instances.

What are ECS best practices for production?

Use a consistent naming convention (region-account-envname), separate AWS accounts for prod/non-prod, enable ECS Service Connect for inter-service communication, use Fargate Spot for non-critical tasks, set CloudWatch log retention to 30 days (not indefinite), and schedule non-prod environments off outside business hours.

Is ECS Fargate cheaper than EC2?

Per vCPU they're similar. EC2 includes memory in the instance price — Fargate charges separately. For memory-heavy workloads, EC2 is significantly cheaper. But Fargate eliminates bin-packing waste (15-25% on EC2) and ops overhead. The break-even depends on your workload shape and team size.

Guide

Matt S

Platform engineer at Fortem · June 1, 2026 · 9 min read

ecs-fargate-best-practices · ecs-fleet-best-practices · aws-ecs-production-tips

ECS Fargate Best Practices: Running a Fleet of 10+ Environments Without the Pain

Most ECS Fargate best practices guides tell you what to do. This one tells you what breaks between environment 5 and environment 20 — the point where platform engineering for ECS at 10+ environments stops being a nice-to-have — and gives you the exact fix for each. The numbers come from AWS published pricing, service quotas, and patterns we've seen managing fleets at scale. If you're running fewer than 5 environments, most of this won't matter yet. Bookmark it. And if you're still deciding whether a service belongs on Fargate at all, start with when Lambda stops being cheaper than Fargate — the duration-based breakeven decides it before any of this applies.

TL;DR

· Name everything consistently from day one; retrofitting naming across 10+ environments takes weeks.
· Fixed overhead is $85–100/mo per environment before a single container runs; at 50 envs that's $4,250–5,000/mo of invisible spend.
· Schedule dev/staging off-hours first. It cuts compute cost 60–70% and requires zero infrastructure changes.
· Set CloudWatch log retention before ingestion hits 15 TB/mo and you get a $7,500 bill.
· Isolate Terraform state per environment before the 25 MB threshold makes plans take 30+ minutes.

Start with naming and account structure

Use one naming convention (region-account-envname) on every resource, separate AWS accounts for prod and non-prod, and one ECS cluster per environment before you hit five.

At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is simultaneously a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.

The convention that scales: {region_short}-{account}-{envname}. Applied to every resource from day one. One Terraform local generates every downstream resource name — ECS cluster, task definition, SSM parameter path, IAM role, CloudWatch log group — all from one source.

Ready to use — copy this today

hcl

locals {
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}

resource "aws_ecs_cluster" "main" {
  name = local.env_prefix  # → "use1-prod-main"
}

resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

resource "aws_cloudwatch_log_group" "api" {
  name = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
}

Map naming to account structure. The most common pattern that works at 10+ environments: one AWS account for production, one for all non-prod. This separates Fargate vCPU quota pools, hardens IAM boundaries, and makes Cost Explorer attribution clean.

One constraint your naming convention must handle: ALB target group names are capped at 32 characters, and each ALB has a hard limit of 100 target groups. At 20 environments with 6 services each, you're at 120 target groups, which is past the limit. This forces per-environment ALBs sooner than you think and increases your fixed overhead; the shared-vs-per-service ALB tradeoff is worked through in how to put an ALB in front of ECS Fargate. A short naming prefix (use1-prod-api is 13 characters) leaves room for the target group suffix.
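One way to catch the 32-character cap before an apply is a precondition on the target group. This is a minimal sketch, assuming Terraform 1.2 or later; the variables `service_name`, `vpc_id`, and `container_port` and the `-tg` suffix are hypothetical and not part of the convention above.

```hcl
variable "region_short" { type = string } # e.g. "use1"
variable "account" { type = string }      # e.g. "prod"
variable "envname" { type = string }      # e.g. "main"
variable "service_name" { type = string } # hypothetical, e.g. "api"
variable "vpc_id" { type = string }       # hypothetical

variable "container_port" {
  type    = number
  default = 8080 # hypothetical default
}

locals {
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
  tg_name    = "${local.env_prefix}-${var.service_name}-tg"
}

resource "aws_lb_target_group" "service" {
  name        = local.tg_name
  port        = var.container_port
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks use awsvpc networking, so targets are IPs
  vpc_id      = var.vpc_id

  lifecycle {
    # Fail the plan with a readable message instead of a late API error
    precondition {
      condition     = length(local.tg_name) <= 32
      error_message = "Target group name '${local.tg_name}' is longer than 32 characters."
    }
  }
}
```

With the example values, `use1-prod-main-api-tg` is 21 characters, so there is headroom; a long `envname` is what usually pushes a name over the cap.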

For the full naming pattern table, including the 32-character target group constraint and per-resource examples, see the dedicated section on consistent naming conventions for ECS environments.

Know your fixed overhead per environment

Every ECS environment carries ~$85–100/mo in fixed costs — ALB, NAT Gateway, CloudWatch — before a single container task runs. At 10 environments that's $850–1,000/mo of invisible spend; at 50, $4,250–5,000/mo.

The full per-line-item breakdown (ALB $22, NAT $33–66, CloudWatch $3–15, data transfer) lives in what an ECS environment actually costs. The best practice that matters here is what to do about the biggest line item — NAT Gateway:

Key insight

NAT Gateway is the single most expensive fixed line item in any ECS environment — and the easiest to eliminate for non-prod. Teams that care about NAT cost switch non-prod environments to public subnet placement with strict security group rules and Network ACLs instead of private subnets with a NAT. This is meaningfully cheaper but does reduce your network boundary — regulated environments (PCI, HIPAA) and prod should keep the NAT. Evaluate your compliance posture before cutting this corner.

One more lever: VPC Endpoints. If your containers only need to reach AWS services (S3, ECR, CloudWatch, SSM), an Interface Endpoint costs ~$7.20/mo per endpoint per Availability Zone, roughly 1/5th of one NAT Gateway, and Gateway Endpoints (S3, DynamoDB) are free. ECR pulls need the ecr.api and ecr.dkr Interface Endpoints plus the S3 Gateway Endpoint, because image layers are served from S3; CloudWatch Logs pushes need the logs Interface Endpoint. Combined with the public-subnet approach above, this is the cheapest path to eliminating NAT entirely for non-prod. Strategy: use VPC Endpoints for AWS dependencies and public subnets for outbound internet, and you drop NAT from non-prod without sacrificing functionality.
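A minimal sketch of that endpoint set in Terraform. All variables are hypothetical, and the list of interface services is an assumption; trim it to what your containers actually call.

```hcl
variable "region" { type = string }
variable "vpc_id" { type = string }
variable "endpoint_subnet_ids" { type = list(string) }
variable "route_table_ids" { type = list(string) }
variable "endpoint_sg_id" { type = string } # must allow inbound 443 from the task security groups

# Gateway endpoint: no hourly charge. ECR serves image layers from S3.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.route_table_ids
}

# Interface endpoints: billed per endpoint per Availability Zone.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "logs", "ssm"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.endpoint_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true # lets the AWS SDK and ECR resolve to the endpoint without config changes
}
```

Count the endpoints before assuming a saving: four interface endpoints in a single Availability Zone come to roughly 4 × $7.20 = $28.80/mo, which is close to the fixed price of one NAT Gateway.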

Schedule dev/staging before the bill bleeds

Scheduling non-prod environments off outside business hours cuts compute cost 60–70% and requires zero code changes — typically $1,000–3,000/mo saved on a 10-environment fleet. It's the single largest ECS cost lever, and the first best practice to put in place.

The catch is operating it at fleet scale: AWS-native scheduling works at the service level, so 10 environments × 8 services = 160 Auto Scaling actions to create and maintain, and a DIY EventBridge + Lambda setup quietly rots past 15–20 environments. The 160-actions breakdown, the three fleet-scale failure modes, and the AWS-native-vs-environment-level comparison all live in the complete guide to ECS environment scheduling — this checklist just flags that scheduling comes first.
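To make the per-service multiplication concrete, here is a sketch of AWS-native scheduling for one environment using Application Auto Scaling scheduled actions. The variables, the 07:00 and 19:00 times, and the time zone are assumptions for illustration.

```hcl
variable "cluster_name" { type = string }
variable "service_names" { type = set(string) } # e.g. the 8 services in one environment

variable "daytime_task_count" {
  type    = number
  default = 1
}

resource "aws_appautoscaling_target" "svc" {
  for_each = var.service_names

  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster_name}/${each.key}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = var.daytime_task_count
}

# One "stop" action per service: scale to zero on weekday evenings
resource "aws_appautoscaling_scheduled_action" "stop" {
  for_each = aws_appautoscaling_target.svc

  name               = "${each.key}-stop-evening"
  service_namespace  = each.value.service_namespace
  resource_id        = each.value.resource_id
  scalable_dimension = each.value.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"
  timezone           = "Europe/London" # assumption

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# One "start" action per service: restore capacity on weekday mornings
resource "aws_appautoscaling_scheduled_action" "start" {
  for_each = aws_appautoscaling_target.svc

  name               = "${each.key}-start-morning"
  service_namespace  = each.value.service_namespace
  resource_id        = each.value.resource_id
  scalable_dimension = each.value.scalable_dimension
  schedule           = "cron(0 7 ? * MON-FRI *)"
  timezone           = "Europe/London" # assumption

  scalable_target_action {
    min_capacity = var.daytime_task_count
    max_capacity = var.daytime_task_count
  }
}
```

Eight services produce 16 scheduled actions per environment, so ten environments reach the 160 actions quoted above.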

Scheduling handles the calendar; scaling to live traffic is a separate best practice. Get the metric, cooldowns, and failure modes right with ECS Fargate autoscaling (target tracking and step scaling).

Isolate Terraform state before it isolates you

Use one Terraform state file per environment (S3 backend + DynamoDB locking) to limit blast radius to one environment and keep plan times under three minutes.

A single Terraform state file containing all environments starts fast. At 25–50 MB, plans take 30+ minutes. At the HCP Terraform hard limit of ~100 MB (from base64 encoding), Terraform stops working entirely.

The blast radius is worse than the speed problem: one module bug in a shared state file can take down every environment in a single apply. A typo in a variable that propagates to 10 environments creates 10 simultaneous incidents.

The fix is per-environment state, applied independently. One folder per environment, each with its own S3 backend. No shared state files, no workspaces, no extra tooling — directories you can see and reason about:

Folder-per-environment pattern

hcl

# Directory structure — one folder per environment, independent state
# terraform/environments/
#   prod/
#     backend.tf        → prod's own S3 backend (separate state file)
#     main.tf           → calls the shared module
#     terraform.tfvars
#   staging/
#     backend.tf        → staging's own S3 backend
#     main.tf
#     terraform.tfvars
#   dev-01/
#     ...

# environments/prod/backend.tf — each environment has its own state
terraform {
  backend "s3" {
    bucket         = "tfstate-org"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# environments/prod/main.tf — thin, calls the shared module
module "environment" {
  source = "../../modules/ecs-environment"

  env_name   = "prod"
  account_id = "111111111111"
  # Plans run independently, blast radius is one environment
}

Each environments/<name>/ folder is self-contained: its own backend, its own tfvars, its own plan/apply lifecycle. You can see the entire fleet structure by looking at the directory tree — no jumping between files to trace configuration inheritance. Adding an environment means copying one folder and changing three lines. This is the pattern teams converge on after workspaces stop scaling, and it works with vanilla Terraform — no extra tooling required.

How to know when to split — check your state file size:

bash

terraform state pull | wc -c

Under 5 MB — fine. 10–25 MB — start planning the migration. Over 25 MB — plans take 30+ minutes and locking contention becomes noticeable. The 3-minute plan threshold is also a strong signal: if a plan against one environment takes longer than 3 minutes, your state file is too large regardless of its byte count.

Practical guidance: teams managing 10+ environments should move to per-environment state before hitting 25 MB, not after. The migration is mechanical — extract each environment into its own directory, run one init per directory, and verify with a plan. It takes an afternoon and prevents a week of incidents. For the full implementation guide, see managing ECS Fargate with Terraform: what works and what doesn't.

Set CloudWatch retention on day one

Set CloudWatch log retention to 30 days for dev/staging and 90 days for production in Terraform — the default is never-expire, which silently compounds into thousands per month.

The default CloudWatch log group setting is “never expire.” Teams routinely forget to change this. At $0.50/GB ingested, a fleet of 50 containers each writing 5 GB/day generates $3,750/mo in ingestion costs alone, before storage and before metrics. Stored logs then accrue on top of that every month until a retention policy expires them.

CloudWatch Logs at scale

50 containers × 5 GB/day: 7,500 GB/mo × $0.50/GB = $3,750/mo

Double the fleet to 100 containers: 15 TB/mo = $7,500/mo. We've seen this.

Container Insights: billed per metric/month (~$0.30/metric) — check your CloudWatch Metrics line

The fix: set retention_in_days in Terraform. 30 days for dev/staging, 90 for prod. Never “never expire.” The Container Insights metrics line above has the same shape — it grows per task and per container, so monitoring ECS Fargate across 10+ environments is worth splitting prod from dev the same way retention is.

hcl

resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-${var.service_name}"
  retention_in_days = var.env_type == "prod" ? 90 : 30

  # Optional: switch non-prod to Infrequent Access for about 50% cheaper
  # ingestion, at the cost of a reduced feature set (e.g. no metric filters)
  log_group_class = var.env_type == "prod" ? "STANDARD" : "INFREQUENT_ACCESS"
}

Also: advanced-tier SSM parameters at $0.05/parameter/month creep unnoticed (standard-tier parameters are free). 10 environments × 8 services × 5 advanced parameters each is 400 parameters, or $20/mo. Small, but nobody accounts for it.

Key insight

We've seen teams discover a $7,500/mo CloudWatch bill six months after launching their 15th environment. The Terraform was deployed with default retention, and nobody looked at the CloudWatch line in Cost Explorer until the CFO asked. Set retention in your module defaults. It costs nothing to set and thousands to miss.

CloudWatch is one piece of the ECS cost puzzle. For the full picture — Fargate compute, data transfer, load balancing, and the 65% savings playbook — see how to cut AWS ECS Fargate costs by 65%.

Use Fargate Spot where it belongs

Fargate Spot gives a roughly 70% discount over On-Demand for dev, staging, CI/CD, and batch. Combine it with off-hours scheduling, and keep production on On-Demand.

Fargate Spot offers a 68% discount over on-demand: $0.01291/vCPU-hr vs $0.04048. The trade-off is a 2-minute interruption notice when AWS reclaims capacity, per the AWS Fargate pricing page (verified May 2026).

“Fargate Spot runs tasks on spare AWS EC2 capacity at up to a 70% discount compared to Fargate On-Demand. If AWS needs the capacity back, your running tasks will be given a two-minute warning and then stopped.”
— AWS Fargate Pricing, verified May 2026

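Real interruption rates: large instance families see under 5% interruption; common instance types see 5–15%. These figures describe EC2 Spot instance types. Fargate does not expose instance types, so treat them as a rough proxy rather than a Fargate Spot guarantee.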

Best practice: use a capacity provider strategy with a 70/30 or 80/20 Spot/On-Demand split. Spot for CI/CD runners, staging, automated tests, and non-interactive batch jobs. On-Demand for production, customer-facing staging, and demo environments.

To enable: create a capacity provider strategy that includes both FARGATE_SPOT and FARGATE. The base value is the minimum number of tasks placed on a provider (only one provider in a strategy can set a base), so put it on FARGATE to guarantee an On-Demand floor. The weight values set the ratio in which ECS distributes tasks beyond the base.

Capacity provider strategy with weighted split

hcl

# Define capacity providers for the ECS cluster
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

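  # Roughly 80/20 Spot/On-Demand for non-prod
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 4
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 0  # no guaranteed On-Demand floor for non-prod
  }
  # Note: ECS does not fall back to On-Demand when Spot capacity is
  # unavailable; a provider with weight 0 and no base receives no tasks.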
}

# Per-service: adjust weights based on workload criticality
# Prod services use base=2 + more FARGATE weight
# Non-prod services use base=0 + FARGATE_SPOT only
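
The per-service override described in the comments above could look like the following sketch for a production service. The variables, the service name, and the specific base and weight values are illustrative assumptions.

```hcl
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }

resource "aws_ecs_service" "api_prod" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 6

  # launch_type is omitted: it cannot be combined with a capacity provider strategy

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2 # the first two tasks always run On-Demand
    weight            = 3
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }
}
```

With six desired tasks, two land on On-Demand through the base and the remaining four are split 3:1, so about five tasks run On-Demand and one on Spot. Setting the Spot weight to 0 keeps a service fully On-Demand.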

One operational note: Fargate Spot sends SIGTERM along with the two-minute warning, but SIGKILL follows once the container's stopTimeout expires, which defaults to 30 seconds and can be raised to at most 120 seconds on Fargate. Your containers must handle graceful shutdown within this window: drain connections, flush buffers, checkpoint state. If your app takes longer than that to shut down, Spot tasks will be force-killed mid-flight. For CI/CD runners and stateless workers this is fine; for anything with in-flight state, On-Demand is the safer choice. For more on Spot savings strategy, see how to cut ECS Fargate costs by 65%.
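The length of the shutdown window is set per container through `stopTimeout`, which defaults to 30 seconds and maxes out at 120 seconds on Fargate. A minimal sketch, with a hypothetical family name, image variable, and task size:

```hcl
variable "execution_role_arn" { type = string }
variable "worker_image" { type = string } # hypothetical, e.g. an ECR image URI

resource "aws_ecs_task_definition" "worker" {
  family                   = "worker" # hypothetical
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "worker"
      image     = var.worker_image
      essential = true

      # Seconds between SIGTERM and SIGKILL; 120 is the Fargate maximum
      stopTimeout = 120
    }
  ])
}
```

The application still has to trap SIGTERM and finish its work inside that window; the setting only controls how long ECS waits before sending SIGKILL.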

Split your Fargate quota before dev takes down prod

Fargate vCPU quota is shared per region per account — dev load tests can exhaust the pool and block production from scaling; separate AWS accounts eliminate this entirely.

AWS has no native mechanism to reserve quota for production. The default Fargate On-Demand vCPU quota is 6 vCPUs per region (soft limit, increasable to 10,000+ via support ticket). Dev and prod compete for the same pool.

Key insight

Fargate quota sharing is invisible until it bites you. You won't know it happened until prod fails to scale during an incident. At that point, the fix takes hours — filing a support ticket and waiting for the quota increase to propagate. Account-level separation (prod in one account, non-prod in another) eliminates this class of incident.

The fix: separate accounts for prod vs non-prod. If that's not immediately feasible, monitor quota utilization proactively. Go to Service Quotas → AWS Fargate → Running On-Demand Fargate vCPUs in the AWS Console. Set a CloudWatch alarm at 70% utilization so you have time to react before hitting the limit. Quota increase requests can take 24–72 hours — at 70% you have days of runway; at 95% you have hours.
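A sketch of the 70% alarm using CloudWatch usage metrics and the `SERVICE_QUOTA` metric math function. The dimension values below are an assumption based on the Fargate usage metrics documentation, and `alarm_topic_arn` is a hypothetical variable; confirm the exact dimensions in the CloudWatch console under the AWS/Usage namespace before relying on it.

```hcl
variable "alarm_topic_arn" { type = string } # hypothetical SNS topic for notifications

resource "aws_cloudwatch_metric_alarm" "fargate_vcpu_quota" {
  alarm_name          = "fargate-ondemand-vcpu-quota-70pct"
  alarm_description   = "Fargate On-Demand vCPU usage is above 70% of the regional quota"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 70
  evaluation_periods  = 3
  alarm_actions       = [var.alarm_topic_arn]

  # Raw usage: running On-Demand Fargate vCPUs in this region and account
  metric_query {
    id          = "usage"
    return_data = false

    metric {
      namespace   = "AWS/Usage"
      metric_name = "ResourceCount"
      period      = 300
      stat        = "Maximum"

      # Assumed dimension values; verify against your account's metrics
      dimensions = {
        Service  = "Fargate"
        Type     = "Resource"
        Resource = "vCPU"
        Class    = "Standard/OnDemand"
      }
    }
  }

  # Usage as a percentage of the current quota
  metric_query {
    id          = "pct"
    expression  = "(usage / SERVICE_QUOTA(usage)) * 100"
    label       = "Fargate On-Demand vCPU quota utilization (%)"
    return_data = true
  }
}
```

Because the expression reads the live quota, the threshold stays at 70% after a quota increase without editing the alarm.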

Two more constraints that hit at fleet scale: (1) Fargate launch rate: 20 tasks/second sustained in older regions, 5/second in newer ones. If your scheduler tries to start 100 tasks across 10 environments simultaneously, you'll hit the throttle. Add jitter to scheduled starts. (2) ECS API throttle: 100 burst requests/second, 20 sustained (AWS documented defaults). Scripts that poll DescribeServices across 50 services in a tight loop will get rate-limited. Add exponential backoff and batch calls (DescribeServices accepts up to 10 services per request).

The ECS multi-environment strategy guide covers account structure patterns in detail, including when to split further. Once you're spread across accounts, managing ECS Fargate across multiple AWS accounts walks through the cross-account IAM, central ECR, and networking cost that follow.

Common questions

If you read this, you might also want to know

When should I use ECS Service Connect vs traditional service discovery?

Service Connect (launched in late 2022) is the newer, simpler option: it gives you service-to-service communication by short name without managing Cloud Map service registrations by hand, although it still uses a Cloud Map namespace under the hood. Use it for new deployments. Traditional Cloud Map service discovery is still needed for custom DNS or Route 53 integration.

How do I monitor ECS Fargate at fleet scale?

Container Insights gives you per-task CPU/memory. For per-environment cost and status across accounts, you need a fleet-level tool. The AWS-native approach (Cost Explorer + tags + CloudWatch dashboards) works at 3-5 environments but becomes unmaintainable at 10+.

What's the difference between Fargate platform versions?

Platform version 1.4.0 (current) supports ECS Exec, EFS volumes, and improved networking. Earlier versions lack these features. AWS handles platform version upgrades transparently.
