Guide
Matt S
Platform engineer at Fortem · 10 min read

How Do You Monitor ECS Fargate Across 10+ Environments?

A CPU alarm on one service is a five-minute job. The same three alarms on thirty services in five accounts — consistently, and without a Container Insights bill that quietly grows into four figures — is the actual work. This guide covers which metrics are worth paying for, the exact Container Insights cost math at fleet scale, and a Terraform pattern that alarms your whole fleet from one module.

TL;DR
  • Fargate needs no monitoring agent. Container Insights collects CPU, memory, network, and ephemeral-storage metrics with no sidecar; the catch is the per-metric bill.
  • Enhanced observability bills $0.07/metric/month across cluster, service, task-def, task, and container. Enabled everywhere on a 30-service fleet, that runs to a few hundred dollars a month and four figures a year.
  • Turn enhanced observability ON in prod, OFF in dev. Dev doesn't need container-level metrics, and that one split roughly halves the bill.
  • Alarm as code with a Terraform for_each: one module, N services, zero copy-paste. Hand-writing three alarms per service is 90 blocks at 30 services.
  • The metric that catches most incidents isn't CPU. It's RunningTaskCount below desired.
Ready to use — fleet alarms from one Terraform module
hcl
# One module, every service. Add a service to the map, get its 3 alarms.
variable "services" {
  type = map(object({
    cluster       = string
    desired_count = number
  }))
  # Example:
  # {
  #   api      = { cluster = "prod", desired_count = 4 }
  #   worker   = { cluster = "prod", desired_count = 2 }
  #   payments = { cluster = "prod", desired_count = 3 }
  # }
}

resource "aws_sns_topic" "alerts" {
  name = "ecs-fleet-alerts"
}

# CPU > 90% for 5 min
resource "aws_cloudwatch_metric_alarm" "cpu" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 90
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Memory > 80% for 5 min
resource "aws_cloudwatch_metric_alarm" "mem" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-mem-high"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Running tasks below desired — the incident signal that matters most
resource "aws_cloudwatch_metric_alarm" "running" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-tasks-low"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = each.value.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

CPU/memory come from the AWS/ECS namespace (free vended metrics). RunningTaskCount comes from ECS/ContainerInsights — it needs Container Insights on the cluster.

Why Fargate monitoring is different from EC2

On Fargate you can't SSH to a host or run node_exporter. Every metric comes from CloudWatch or an in-task sidecar — Container Insights is the no-agent default, and it's metered per metric.

With EC2-backed ECS you own the instance, so you can install the CloudWatch agent, Prometheus node exporter, or any host-level tooling you like, and you pay for the instance either way. Fargate takes the host away. You get no shell, no daemon service, no privileged sidecar reading the host cgroup. What you get instead is CloudWatch: the free vended metrics (CPUUtilization, MemoryUtilization at the service level) and, when you turn it on, Container Insights.

This is genuinely convenient — no agent to patch, no version drift across a hundred tasks — but it changes the cost model. On EC2 your monitoring is bundled into the instance you already pay for. On Fargate, deeper visibility is a separate line item that scales with the number of tasks and containers you run. The rest of this guide is about spending that line item where it earns its keep and not where it doesn't.

Key insight

The free service-level CPU and memory metrics in the AWS/ECS namespace are enough to alarm on. You do not need Container Insights to know a service is hot. You need Container Insights when you want to know which task or container inside that service is hot — and for RunningTaskCount and ephemeral-storage metrics.

The metrics that actually matter (and which to skip)

RunningTaskCount below desired catches more incidents than CPU. Alarm on running-task deficit, memory above 80%, and CPU above 90% — leave network and storage on a dashboard.

Most Fargate monitoring guides open with CPU and memory because those are the metrics everyone recognizes. But CPU is a slow-degradation signal — a throttled service is slow, not down. The signal that actually correlates with "customers are seeing errors" is a service that can't keep its desired number of tasks running: a crash loop, a failed health check, an image that won't pull. That shows up as RunningTaskCount dropping below DesiredTaskCount, long before CPU says anything.

Metric | Why it matters | Action
RunningTaskCount < desired | Service can't stay up; the #1 real incident signal | Alarm always
MemoryUtilization > 80% | OOM kills the task; no graceful degradation | Alarm always
CPUUtilization > 90% | Throttling: slow, not down | Alarm in prod
EphemeralStorageUtilized | Disk fills → task fails silently | Alarm if you write to disk
NetworkRx/TxBytes | Diagnostic, rarely actionable as an alarm | Dashboard only
DeploymentCount / TaskSetCount | Useful during blue/green, noisy otherwise | Dashboard only
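
The running-versus-desired comparison can also be written as a CloudWatch metric-math alarm instead of a fixed threshold. What follows is a minimal sketch under stated assumptions, not the module from this guide: Container Insights is enabled on the cluster (both metrics live in ECS/ContainerInsights), and the cluster name "prod", the service name "api", and the alarm_topic_arn variable are placeholders invented for illustration.

```hcl
# Hypothetical input: the SNS topic that should receive the notification.
variable "alarm_topic_arn" {
  type        = string
  description = "SNS topic ARN that receives alarm notifications"
}

resource "aws_cloudwatch_metric_alarm" "task_deficit" {
  alarm_name          = "ecs-api-task-deficit"
  comparison_operator = "LessThanThreshold"
  threshold           = 0
  evaluation_periods  = 3
  treat_missing_data  = "breaching"
  alarm_actions       = [var.alarm_topic_arn]

  # The alarm evaluates this expression; a negative value means
  # fewer tasks are running than the service asked for.
  metric_query {
    id          = "deficit"
    expression  = "running - desired"
    label       = "Running minus desired tasks"
    return_data = true
  }

  metric_query {
    id = "running"
    metric {
      namespace   = "ECS/ContainerInsights"
      metric_name = "RunningTaskCount"
      period      = 60
      stat        = "Average"
      dimensions  = { ClusterName = "prod", ServiceName = "api" } # placeholder names
    }
  }

  metric_query {
    id = "desired"
    metric {
      namespace   = "ECS/ContainerInsights"
      metric_name = "DesiredTaskCount"
      period      = 60
      stat        = "Average"
      dimensions  = { ClusterName = "prod", ServiceName = "api" } # placeholder names
    }
  }
}
```

Because the threshold is a difference rather than a fixed count, this variant should keep working when autoscaling changes the desired count. Three 60-second periods are there to ride out the brief deficit a normal deploy can cause; tune that to your rollout speed.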

Metrics are only half of observability. The other half is logs, and the two get correlated during an incident. If your log setup isn't solid, alarms just tell you something broke without telling you what. It's worth nailing down how to set up ECS logging the right way before you tune alarm thresholds, because a memory alarm with no readable logs behind it is a page you can't act on.

"EphemeralStorageReserved and EphemeralStorageUtilized ... are only available for tasks that run on Fargate Linux platform version 1.4.0 or later."

AWS Container Insights ECS metrics, verified June 2026

What Container Insights actually costs at fleet scale

Enhanced observability bills $0.07 per metric per month across cluster, service, task-def, task, and container. At 30 services that's thousands of metrics — real money, not a rounding error.

Here's the arithmetic nobody shows you. AWS bills Container Insights metrics as custom metrics, and enhanced observability (released December 2, 2024) reports them at every level: a handful per cluster and per service, and then, critically, a set per task and per container. A count in the low teens per container feels harmless in isolation, until you multiply it by a real fleet. Containers are the multiplier that hurts: a 30-service fleet at, say, three tasks per service and a couple of containers per task is already at 180 containers, each reporting its own metric set. The table below models that. The exact number moves with how many tasks you run and how many containers you pack per task, so treat it as an order-of-magnitude estimate, not a quote.

Fleet | Metrics reported | Cost / month | Cost / year
5 services | 475 | $33 | $399
15 services | 1,820 | $127 | $1,529
30 services | 4,800 | $336 | $4,032
Modeled estimate. Enhanced observability at $0.07/metric/month (AWS CloudWatch pricing, verified June 2026); per-resource metric counts derived from the AWS enhanced-observability metric table, GPU metrics excluded. Your count varies with containers-per-task.
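
To rerun the table's arithmetic with your own counts, a few Terraform locals are enough. This is a rough sketch: the metric counts are copied from the modeled table, the price is the $0.07 rate quoted above, and the local and output names are made up for illustration.

```hcl
locals {
  # Per-metric monthly price quoted in this guide.
  price_per_metric_month = 0.07

  # Metric counts from the modeled table; replace with your own fleet's numbers.
  modeled_metric_counts = {
    "5-services"  = 475
    "15-services" = 1820
    "30-services" = 4800
  }

  # Monthly cost = metric count x price per metric.
  monthly_cost_usd = {
    for fleet, metrics in local.modeled_metric_counts :
    fleet => format("%.2f", metrics * local.price_per_metric_month)
  }

  # Yearly cost = monthly cost x 12.
  yearly_cost_usd = {
    for fleet, metrics in local.modeled_metric_counts :
    fleet => format("%.2f", metrics * local.price_per_metric_month * 12)
  }
}

output "monthly_cost_usd" {
  value = local.monthly_cost_usd
}

output "yearly_cost_usd" {
  value = local.yearly_cost_usd
}
```

Running terraform apply (or evaluating the locals in terraform console) on this alone should give monthly figures of roughly $33, $127, and $336, matching the table. The useful part is swapping in the metric count you actually see in CloudWatch.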

And that's just the metric bill. It's separate from log ingestion and storage, which the logging side of observability owns: CloudWatch Logs is $0.50/GB ingested and $0.03/GB stored, and the default Never-Expire retention means storage never stops growing. Watching that side is a different exercise, with its own retention and ingestion levers for controlling CloudWatch Logs costs on ECS, and those costs stack on top of these metric numbers.

Key insight

Standard and enhanced Container Insights bill the same $0.07 per metric. The difference is entirely how many metrics each reports — enhanced adds per-task and per-container granularity, which is exactly where the count (and the bill) explodes. So the question is never "standard or enhanced?" in the abstract. It's "on which resources is per-container granularity worth $0.07 times a dozen-odd metrics each?"

Turn enhanced observability ON in prod, OFF in dev

Dev environments don't need container-level metrics. Enable enhanced observability on prod clusters only; leave dev on standard or off. That one split roughly halves the Container Insights bill.

When something breaks in a dev environment, you redeploy it — you don't run a forensic investigation into which container held memory for three seconds too long. The per-container metrics that justify their cost in production are pure waste in dev, where the tasks are often idle or scaled to one. Yet the most common Container Insights mistake is flipping it on cluster-wide, in every account, and never revisiting it.

Because Container Insights is a per-cluster setting, the split is trivial to express — set it in the Terraform that every environment shares, keyed off the environment name:

hcl
variable "environment" {
  type        = string
  description = "Environment name, for example prod, staging, or dev"
}

resource "aws_ecs_cluster" "this" {
  name = "${var.environment}-cluster"

  setting {
    name = "containerInsights"
    # enhanced only in prod; dev/staging get standard (cheaper) or "disabled"
    value = var.environment == "prod" ? "enhanced" : "enabled"
  }
}

If your dev environments are genuinely throwaway, "disabled" is a defensible value for them — the free AWS/ECS service-level CPU and memory metrics still exist without Container Insights, so you keep your CPU and memory alarms. You lose RunningTaskCount and ephemeral-storage metrics in dev, which is usually an acceptable trade.

Alarm as code — the fleet for_each pattern

Hand-writing three alarms per service means 90 blocks at 30 services. A Terraform for_each over a service map creates CPU, memory, and running-task alarms for the whole fleet from one module.

The Ready-to-use block above is the whole pattern: a map of services, three for_each alarm resources, one SNS topic. Adding the thirty-first service is a one-line map entry, not three copied-and-tweaked alarm blocks that drift out of sync the moment someone edits one and forgets the rest. That drift is the real failure mode of click-ops and copy-paste monitoring — not that the alarms are wrong on day one, but that they're inconsistent by month six.

A few things the pattern gets right that hand-written alarms usually miss. The running-task alarm sets treat_missing_data = "breaching": if the metric stops reporting entirely (a service was deleted, or Container Insights got turned off), that should page you, not silently resolve. It uses evaluation_periods = 3 at 60-second periods so a single blip during a deploy doesn't fire. And every alarm points at one SNS topic, so routing to PagerDuty or Slack is one subscription, not thirty.
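
As a rough illustration of "one subscription", the sketch below attaches a single HTTPS endpoint to the shared topic. Both variables are hypothetical names introduced here, and the endpoint URL is whatever your paging or chat integration gives you; none of this is part of the original module.

```hcl
# Hypothetical inputs for illustration.
variable "alarm_topic_arn" {
  type        = string
  description = "ARN of the shared fleet alerts SNS topic"
}

variable "pager_webhook_url" {
  type        = string
  description = "HTTPS endpoint provided by your paging or chat integration"
}

# One subscription fans out every alarm that targets the topic.
resource "aws_sns_topic_subscription" "pager" {
  topic_arn = var.alarm_topic_arn
  protocol  = "https"
  endpoint  = var.pager_webhook_url
}
```

HTTPS endpoints have to confirm the subscription before SNS delivers anything, so check how your integration handles that step.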

None of this is exotic — the CloudPosse ecs-cloudwatch-sns-alarms module wraps the same idea, and it's popular precisely because per-service alarm sprawl is a problem enough people hit to want it solved. Whether you use a module or the raw resources above, the principle is the same: the alarm definition lives in one place and applies to the fleet.

The newly-created-environment coverage gap

Alarms defined per service don't cover the environment someone spins up next week. Either enforce monitoring in the module every environment uses, or scope a Lambda to auto-alarm any tagged cluster.

Here's the failure that a service map quietly introduces: it only monitors the services in the map. When a developer spins up a preview environment, or a new team stands up a service in a fresh account, that workload is invisible until someone remembers to add it — and nobody remembers. You find the gap during the incident, when you go looking for the alarm that should have fired and it was never created.

There are two honest fixes. The first is to make monitoring non-optional in the shared module every environment is built from — if you can't create an ECS service without also creating its alarms, there's no gap to forget. The second, for teams whose environments aren't all built the same way, is AWS's own tag-scoped pattern: a Lambda that watches for new clusters and attaches a standard alarm set to any cluster carrying a given tag. AWS ships this because the gap is real enough to need a system, not discipline.
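
The first fix can be sketched as a shared module that creates the service and its alarm together, so neither can exist without the other. This is a minimal illustration under assumptions, not a drop-in module: the file layout, every variable name, and the bare-bones network configuration are hypothetical, and a real service will need more arguments (security groups, load balancer, and so on).

```hcl
# modules/ecs-service/main.tf (hypothetical layout)
variable "name"                { type = string }
variable "cluster_name"        { type = string }
variable "cluster_arn"         { type = string }
variable "task_definition_arn" { type = string }
variable "desired_count"       { type = number }
variable "subnet_ids"          { type = list(string) }
variable "alarm_topic_arn"     { type = string }

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets = var.subnet_ids
  }
}

# The alarm references the service resource, so Terraform always
# creates (and destroys) the two together.
resource "aws_cloudwatch_metric_alarm" "running" {
  alarm_name          = "ecs-${aws_ecs_service.this.name}-tasks-low"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = var.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.alarm_topic_arn]

  dimensions = {
    ClusterName = var.cluster_name
    ServiceName = aws_ecs_service.this.name
  }
}
```

The design choice is that alarm_topic_arn has no default: a caller who forgets monitoring gets a plan error rather than an unmonitored service.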

Key insight

Coverage is a fleet property, not a per-service one. The right question isn't "does this service have alarms?" — it's "can a service exist in this org without alarms?" If the answer is yes, your monitoring has a hole shaped like every environment you haven't manually added yet.

Beyond metrics: events and per-environment cost

Metrics tell you a task is unhealthy; EventBridge tells you why it stopped. Route ECS task-state-change events to SNS, and tag every service so per-environment cost shows up in Cost Explorer.

A metric alarm says RunningTaskCount dropped. It doesn't say the task was killed by an OOM, failed its health check, or couldn't pull its image. That reason lives in the ECS Task State Change event, which ECS emits to EventBridge for free. A single EventBridge rule matching stoppedReason and routing to the same SNS topic turns "a task stopped" into "a task stopped because OutOfMemoryError" — the difference between a page you can act on and one you have to investigate.
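
A minimal sketch of that rule in Terraform might look like the following. The resource names and the alarm_topic_arn variable are placeholders chosen for illustration, and the message template is just one way to surface stoppedReason.

```hcl
# Hypothetical input: the SNS topic that already receives the metric alarms.
variable "alarm_topic_arn" {
  type = string
}

# Match ECS tasks that have reached the STOPPED state.
resource "aws_cloudwatch_event_rule" "task_stopped" {
  name        = "ecs-task-stopped"
  description = "ECS tasks that reached STOPPED"

  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
    detail = {
      lastStatus = ["STOPPED"]
    }
  })
}

# Send a one-line summary, including the stop reason, to the topic.
resource "aws_cloudwatch_event_target" "to_sns" {
  rule = aws_cloudwatch_event_rule.task_stopped.name
  arn  = var.alarm_topic_arn

  input_transformer {
    input_paths = {
      task    = "$.detail.taskArn"
      cluster = "$.detail.clusterArn"
      reason  = "$.detail.stoppedReason"
    }
    input_template = "\"ECS task <task> in <cluster> stopped: <reason>\""
  }
}

# EventBridge needs permission to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    actions   = ["sns:Publish"]
    resources = [var.alarm_topic_arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "alerts" {
  arn    = var.alarm_topic_arn
  policy = data.aws_iam_policy_document.allow_eventbridge.json
}
```

Two caveats. As written, the rule matches every stopped task, including routine deploys and scale-in, so in practice you would probably narrow the pattern before paging anyone. And aws_sns_topic_policy sets the topic's whole policy, so merge this statement into any policy the topic already has.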

The other blind spot at fleet scale is cost per environment. CloudWatch shows you utilization, not dollars, and AWS bills Fargate at the account level — so a staging environment burning money looks identical to a busy prod one until you've tagged everything. Consistent Environment and Service tags on every task, activated as cost-allocation tags, are what make per-environment spend visible in Cost Explorer — and they're the same tags the Lambda coverage pattern keys off. Monitoring and cost attribution end up being the same tagging discipline.
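
Here is one way that tagging discipline might look in Terraform; treat it as a sketch under assumptions. The variable names are hypothetical, the service definition is stripped to the arguments needed to show tagging, and activating cost-allocation tags generally has to happen in the billing (payer) account after the tag keys have appeared in billing data.

```hcl
# Hypothetical inputs for illustration.
variable "environment"         { type = string }
variable "service_name"        { type = string }
variable "cluster_arn"         { type = string }
variable "task_definition_arn" { type = string }
variable "subnet_ids"          { type = list(string) }

resource "aws_ecs_service" "this" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets = var.subnet_ids
  }

  # Tag the service, then copy those tags onto every task it launches
  # so Fargate usage carries them into billing data.
  tags = {
    Environment = var.environment
    Service     = var.service_name
  }
  propagate_tags = "SERVICE"
}

# Mark both tag keys as active cost-allocation tags for Cost Explorer.
resource "aws_ce_cost_allocation_tag" "fleet" {
  for_each = toset(["Environment", "Service"])

  tag_key = each.key
  status  = "Active"
}
```

Once the tags are active, grouping by Environment in Cost Explorer should separate staging spend from prod.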

If you read this, you might also want to know

Do I need Datadog if I already have Container Insights?

Not for infrastructure metrics — Container Insights covers CPU, memory, network, storage, and task counts across the fleet. You reach for Datadog (or an OpenTelemetry sidecar) when you need custom application metrics, distributed traces, or a single pane across non-AWS systems. Many teams run Container Insights for infra and a sidecar only on the handful of services that need APM.

How do I monitor a Fargate task that exits immediately on startup?

Metrics won't help — a task that dies in seconds never reports a meaningful data point. The signal is the ECS Task State Change event in EventBridge, which carries the stoppedReason (image pull failure, essential container exited, OOM). Route that event to SNS and read the reason; the logs, if the log driver was configured, hold the stack trace.

Can I alarm on a metric across all environments at once?

Yes, with a metric-math or aggregate alarm — for example, total RunningTaskCount across a cluster versus total DesiredTaskCount. It's useful for a fleet-health top-line, but keep the per-service alarms too: an aggregate stays green while one critical service is fully down, because the healthy services mask it.


Every ECS environment, one view

Container Insights and per-service alarms tell you about one cluster at a time. Fortem puts every ECS Fargate environment — health, running tasks, and idle spend — on one screen, so the gap-shaped hole isn't there to fall into. Book a 20-minute call and we'll walk your fleet.
