Guide
Matt S
Platform engineer at Fortem · 8 min read
Tags: aws-cost-anomaly-detection, ecs-cost-monitoring, fargate-cost-alerts

AWS Cost Anomaly Detection for ECS Teams: What It Catches, What It Misses, and How to Set It Up

AWS Cost Anomaly Detection is free, ships with ML-based pattern detection, and can catch ECS spend spikes automatically. The catch: it runs on billing data that's up to 24 hours old, and the default setup monitors all ECS spend as one pooled number — not per environment. A spike in staging looks identical to a spike in prod. This guide covers how CAD actually works, how to wire it to your environment tags for per-environment alerts, what Terraform to drop in, and where the tool has real blind spots.

TL;DR
  • CAD is free and uses ML: no static thresholds to maintain, no per-service configuration by default.
  • The default AWS service monitor pools all ECS spend together. Set up a tag-based monitor to get per-environment alerts.
  • Detection takes up to 24 hours after a spike appears in billing data. Sub-12-hour spikes often go undetected.
  • IMMEDIATE alerts require an SNS topic, not an email address. Email-only subscriptions get a ValidationException.
  • CAD is your monthly fire detector. It won't catch a runaway task that was killed before the billing data arrived.
Ready to use — drop this into your Terraform today

Tag-based monitor on your environment key, SNS topic with correct IAM policy, and an IMMEDIATE subscription with combined $ + % threshold. Replace [email protected] with your on-call address.

hcl
# SNS topic for cost anomaly alerts
resource "aws_sns_topic" "cost_anomaly" {
  name = "ecs-cost-anomaly-alerts"
}

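# Current account ID, referenced by the aws:SourceAccount condition below
data "aws_caller_identity" "current" {}

# Required: grant CAD permission to publish to SNS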
data "aws_iam_policy_document" "cost_anomaly_sns" {
  statement {
    sid     = "AllowCostAnomalyDetection"
    effect  = "Allow"
    actions = ["SNS:Publish"]
    principals {
      type        = "Service"
      identifiers = ["costalerts.amazonaws.com"]
    }
    resources = [aws_sns_topic.cost_anomaly.arn]
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
  }
}

resource "aws_sns_topic_policy" "cost_anomaly" {
  arn    = aws_sns_topic.cost_anomaly.arn
  policy = data.aws_iam_policy_document.cost_anomaly_sns.json
}

resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.cost_anomaly.arn
  protocol  = "email"
  endpoint  = "[email protected]"
}

# Tag-based monitor — one ML baseline per environment tag value
resource "aws_ce_anomaly_monitor" "env_monitor" {
  name         = "ecs-per-environment-monitor"
  monitor_type = "CUSTOM"

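  monitor_specification = jsonencode({
    Tags = {
      Key          = "environment"      # must match your cost allocation tag key
      # Assumption: a customer managed (CUSTOM) monitor needs explicit tag values;
      # list your own environments here (up to 10 per monitor)
      Values       = ["dev", "staging", "prod"]
      MatchOptions = ["EQUALS"]
    }
  })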
}

# Subscription: IMMEDIATE via SNS (email-only = ValidationException)
resource "aws_ce_anomaly_subscription" "env_alerts" {
  name      = "ecs-environment-anomaly-alerts"
  frequency = "IMMEDIATE"

  monitor_arn_list = [aws_ce_anomaly_monitor.env_monitor.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_anomaly.arn
  }

  depends_on = [aws_sns_topic_policy.cost_anomaly]

  # AND logic: both conditions must be met to reduce alert noise
  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        values        = ["30"]           # $30 minimum impact
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        values        = ["25"]           # 25% above expected
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }
}

How AWS Cost Anomaly Detection works

CAD uses ML to model your normal spend per dimension, runs approximately 3× daily on billing data that's up to 24 hours old, and alerts when actual spend deviates from expected by more than your configured threshold.

The service launched in 2020 and has been updated significantly since. The November 2025 update switched from calendar-day batches to rolling 24-hour windows — meaning the model now compares your current spend against the same time of day in previous periods, rather than against a full-day total. For ECS workloads with business-hours patterns, this reduces false positives on Monday mornings when spend jumps from a quiet weekend.

Monitor dimension | What it tracks | ECS use
AWS services | All ECS spend pooled across all envs | Default; too coarse for fleets
Linked accounts | Per AWS account spend | Useful for account-per-env setups
Cost allocation tags | Per tag value (e.g. per environment) | Best for ECS fleets using env tags
Cost categories | Per business unit or product | Useful for multi-product orgs

The ML model adjusts for trends and seasonality automatically. You don't set a fixed budget cap — the model learns what "normal" looks like for your specific spend pattern and alerts only when that pattern breaks. The tradeoff: the model needs at least 10 days of history per dimension before it can fire. A brand-new ECS environment with zero history gets no anomaly alerts until day 11.

What ECS cost spikes CAD actually catches

CAD catches ECS spend anomalies at AWS service level by default — meaning all your environments pooled together. It reliably catches sustained scale-out events, forgotten running environments, and Fargate On-Demand vs Spot fall-through spikes that last longer than one billing cycle.

The "sustained" qualifier matters. Per-environment cost visibility on ECS is already hard, and CAD makes it harder when all environments share one anomaly baseline. A $200 spike in your dev environment looks like noise when your prod environment spends $2,000.

Scenario | Spike type | CAD (default) | Delay
Dev env running 24/7 after sprint ends | Sustained (3+ days) | Catches it | 24–48 hours
Fargate Spot falls back to On-Demand for 8 hours | Sustained (8+ hours) | Usually catches it | 24 hours
Runaway task scales to 50 replicas for 3 hours | Short spike (<6h) | Often misses | Task gone before detection
NAT Gateway burst from one env's batch job | Single-env spike | Misses (pools with others) | No per-env alert without tag monitor
Key insight
The 24-hour delay is the biggest constraint for ECS teams. A Fargate task that scaled out at 9am and was killed by 3pm is gone before CAD can react: the spend lands in a single day's billing data, which CAD reads with up to a 24-hour lag. At best the alert arrives after the fact; often a spike that short is not flagged at all.

Setting up a per-environment tag monitor

Create a Cost Allocation Tag monitor on your environment key. Each tag value (dev, staging, prod) gets its own ML baseline and can fire independently without one environment's spend pattern polluting another's alert.

Before creating a tag-based monitor, your cost allocation tags must be activated. Tags only appear in Cost Explorer — and therefore in CAD — after activation. This catches teams off guard: you've been tagging your ECS tasks for months, but none of that data flows into CAD until you flip the switch.

Prerequisite — activate cost allocation tags
  1. Open the AWS Billing and Cost Management console
  2. Navigate to Cost allocation tags → "AWS-generated tags" and "User-defined tags" tabs
  3. Find your environment tag key (e.g. "environment") → click Activate
  4. Wait up to 24 hours for historical data to appear in Cost Explorer
  5. Then create your CAD monitor; do not create it before activation
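
If you would rather keep activation in code than click through the console, the AWS provider has a resource for it. This is a minimal sketch, assuming the tag key is "environment" and that Billing has already seen the key on at least one billed resource; in an AWS Organization it would most likely need to run against the management (payer) account.

```hcl
# Activates the user-defined "environment" tag key for cost allocation.
# Assumption: the key already exists on billed resources; a key that
# Billing has never seen cannot be activated.
resource "aws_ce_cost_allocation_tag" "environment" {
  tag_key = "environment"
  status  = "Active"
}
```

Activation still takes time to show up in Cost Explorer, so the wait in step 4 applies however the tag is activated.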

Choose an AWS managed monitor (not customer managed) for your environment tag. Managed monitors automatically discover new tag values as you add environments: if you spin up a new "staging-eu" environment next month, the monitor picks it up without any config change. The trade-off is that all tag values share one alert threshold. If you need different thresholds for prod vs dev, use customer managed monitors, which cap at 10 tag values per monitor.

New tag values need 10 days of billing history before CAD can model normal spend and fire alerts. Plan for this when spinning up a new environment — don't expect anomaly alerts in the first two weeks.

Terraform: the full config

Two core resources: aws_ce_anomaly_monitor (tag-based) and aws_ce_anomaly_subscription (SNS, IMMEDIATE). Email-only subscriptions cannot use IMMEDIATE frequency — you need an SNS topic with the correct IAM policy first.

The Terraform block at the top of this article is the complete production config. Two things to get right:

SNS topic policy

The SNS topic must explicitly grant costalerts.amazonaws.com permission to publish. Without this policy, CAD sends no error — alerts fail silently and you get nothing. The aws:SourceAccount condition limits the permission to your own account only.

CUSTOM vs DIMENSIONAL monitor type

Tag-based monitors use monitor_type = "CUSTOM" with a monitor_specification JSON block. Service-level monitors use monitor_type = "DIMENSIONAL" with monitor_dimension = "SERVICE". These are different resource shapes in the Terraform provider — using the wrong type will error at apply time.
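
For comparison with the CUSTOM monitor at the top of this article, here is a sketch of the service-level shape. The resource and monitor names are illustrative. Note there is no monitor_specification at all.

```hcl
# Service-level monitor: one baseline per AWS service, all environments pooled.
# DIMENSIONAL monitors take monitor_dimension instead of monitor_specification.
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "aws-service-monitor" # illustrative name
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}
```

As far as we know, AWS allows only one AWS services monitor per account, so if the console already created one it is probably safer to import it than to apply a second.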

For teams using consistent cost allocation tagging across environments, the tag-based monitor can also be created as an AWS managed monitor (not customer managed) — which auto-discovers new environments. The Terraform resource for a managed TAG monitor looks slightly different: omit the monitor_specification block and instead use monitor_type = "DIMENSIONAL" with monitor_dimension = "TAG". Check the AWS Cost Anomaly Detection docs for the current provider version syntax.

Threshold strategy for ECS fleets

For a 10-environment fleet, start with $30 absolute AND 25% relative (AND logic). Production alone warrants a lower absolute threshold — $20 with 20% catches real incidents without drowning in dev noise. The AWS default (40% + $100) is too blunt for ECS environments with variable baselines.

The problem with the $100 default: a dev environment spending $40/month on idle Fargate tasks can spike to $120, a 200% increase, and never trigger an alert because the $80 impact falls short of the $100 absolute threshold. For small environments, a low absolute floor paired with a percentage threshold catches what a $100 floor misses.

Environment | Absolute ($) | Percentage (%) | Logic
Production | $20 | 20% | AND
Staging | $30 | 25% | AND
Dev / ephemeral | $15 | 30% | AND
All environments (fallback) | $30 | 25% | AND

Use AND, not OR. OR logic on a percentage threshold fires every time a tiny environment has any activity after a quiet weekend, because any spend above a near-$0 baseline is an enormous percentage increase. AND requires both the dollar amount and the percentage to be exceeded simultaneously, which dramatically reduces noise from small environments with variable usage.
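
One way to apply the per-environment thresholds from the table above is one customer managed monitor and one subscription per environment, generated with for_each. This is a sketch, not a drop-in: it assumes your tag values are literally prod, staging and dev, and it reuses the aws_sns_topic.cost_anomaly and aws_sns_topic_policy.cost_anomaly resources from the config at the top of this article.

```hcl
locals {
  # Thresholds from the table above, keyed by assumed tag values.
  env_thresholds = {
    prod    = { absolute = "20", percentage = "20" }
    staging = { absolute = "30", percentage = "25" }
    dev     = { absolute = "15", percentage = "30" }
  }
}

# One customer managed monitor per environment tag value
resource "aws_ce_anomaly_monitor" "per_env" {
  for_each     = local.env_thresholds
  name         = "ecs-${each.key}-monitor"
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    Tags = {
      Key          = "environment"
      Values       = [each.key]
      MatchOptions = ["EQUALS"]
    }
  })
}

# One subscription per environment, each with its own AND threshold
resource "aws_ce_anomaly_subscription" "per_env" {
  for_each  = local.env_thresholds
  name      = "ecs-${each.key}-anomaly-alerts"
  frequency = "IMMEDIATE"

  monitor_arn_list = [aws_ce_anomaly_monitor.per_env[each.key].arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_anomaly.arn
  }

  depends_on = [aws_sns_topic_policy.cost_anomaly]

  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        values        = [each.value.absolute]
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        values        = [each.value.percentage]
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }
}
```

The cost of this layout is that new environments are not discovered automatically; each one needs a new entry in the map.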

After the first few weeks, mark detected anomalies as "Accurate anomaly" or "Not an issue" in the console. CAD uses this feedback to tune the model. A model trained on your team's feedback converges on your actual noise floor faster than one running without it.

Where CAD falls short for ECS teams

CAD won't catch a Fargate task that scales to 50 replicas, runs for 6 hours, and is killed before billing data arrives. It also can't alert on per-service cost within an environment — only on per-environment total spend.

Three hard limits to plan around:

No real-time detection

Cost Explorer has up to a 24-hour data lag. CAD runs 3× per day on that data. A Fargate task that spends $300 between 8am and 5pm on a Tuesday won't appear in CAD until Wednesday at the earliest — and only if the spending pattern looks anomalous relative to your history. Real-time cost monitoring requires CloudWatch metrics and billing alarms, which operate on estimated charges with a different (faster) refresh cycle.
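
A CloudWatch billing alarm along those lines might look like the sketch below. It assumes "Receive Billing Alerts" is enabled in your billing preferences, and the $500 threshold, the resource names, and the separate SNS topic are all illustrative. EstimatedCharges is a month-to-date, account-wide figure, so this is a static ceiling rather than a per-environment or pattern-based check.

```hcl
# Billing metrics are only published in us-east-1, so the alarm and the
# topic it notifies both live there.
provider "aws" {
  alias  = "billing"
  region = "us-east-1"
}

resource "aws_sns_topic" "billing_alarm" {
  provider = aws.billing
  name     = "billing-alarm" # illustrative name
}

resource "aws_cloudwatch_metric_alarm" "estimated_charges" {
  provider            = aws.billing
  alarm_name          = "estimated-charges-over-500-usd"
  namespace           = "AWS/Billing"
  metric_name         = "EstimatedCharges"
  statistic           = "Maximum"
  period              = 21600 # 6 hours, in seconds
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 500 # hypothetical month-to-date ceiling in USD

  dimensions = {
    Currency = "USD"
  }

  alarm_actions = [aws_sns_topic.billing_alarm.arn]
}
```

Because the metric is cumulative across the month, the alarm fires once the running total crosses the ceiling. It complements CAD but does not replace it.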

No service-level granularity within an environment

The tag-based monitor fires when total spend for the "dev" tag value deviates from normal. It cannot tell you which ECS service within "dev" caused the spike. Root cause analysis surfaces up to 10 contributing factors (service, region, account, usage type) — but these are dimensions in Cost Explorer, not ECS service names. You still need Cost Explorer or a per-service tagging strategy to narrow it down.

Scheduled environments create false anomalies

If you schedule non-prod environments to stop outside business hours, CAD sees a cost of $0 at night and a spike every morning when they restart. The ML model learns this pattern over time — but the first 2–4 weeks after introducing scheduling will generate false positive alerts. Disable alerts during the model warm-up period or set a higher absolute threshold temporarily.

Key insight
CAD is your monthly fire detector. It catches sustained burns — a forgotten environment left running, a Spot fallback that held for three days. Fortem's per-environment cost tracking is your smoke alarm: it sees what's happening now, before it becomes a billing-cycle problem.

"AWS Cost Anomaly Detection runs approximately three times a day after your billing data is processed. Anomaly detection relies on the data from Cost Explorer which has a latency of up to 24 hours. Therefore, it can take up to 24 hours to detect an anomaly after the anomalous usage happens."

AWS Cost Anomaly Detection FAQ, verified June 2026

If you read this, you might also want to know

Can I use CAD with a multi-account ECS setup?

Yes. In a management account, create a linked account monitor to track per-member-account spend. Combine it with a tag-based monitor per account if you want both dimensions. Member accounts can only create an AWS service monitor — linked account and tag monitors require the management account.
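
A linked account monitor follows the same CUSTOM shape as the tag monitor, filtering on the LINKED_ACCOUNT dimension instead of a tag. This is a sketch under assumptions: the account ID is a placeholder, the names are illustrative, and it is meant to be applied from the management account.

```hcl
# Customer managed monitor scoped to one member account.
# Assumption: 111111111111 stands in for a real member account ID.
resource "aws_ce_anomaly_monitor" "linked_account" {
  name         = "ecs-member-account-monitor"
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    Dimensions = {
      Key    = "LINKED_ACCOUNT"
      Values = ["111111111111"]
    }
  })
}
```

Attach it to a subscription the same way as the tag monitor, by adding its ARN to monitor_arn_list.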

What if my ECS tasks don't have environment tags yet?

Tag-based monitoring only works on costs that are tagged. Untagged ECS tasks appear in the 'no tag value' bucket. The fastest path: add a default_tags block to your Terraform AWS provider — every resource gets the environment tag automatically without changing individual resource configs.
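
A minimal sketch of that default_tags block, assuming the environment name arrives through a variable and that the region shown is a placeholder for your own:

```hcl
variable "environment" {
  type        = string
  description = "Environment name, e.g. dev, staging, prod"
}

provider "aws" {
  region = "us-east-1" # assumption: replace with your region

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = {
      environment = var.environment
    }
  }
}
```

One caveat for ECS: tags on a service do not automatically reach the tasks it launches. Setting propagate_tags on aws_ecs_service ("SERVICE" or "TASK_DEFINITION") is the usual way to carry them through to task-level usage.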

Does CAD replace AWS Budgets?

No — they answer different questions. Budgets: 'alert me when I cross $X.' CAD: 'alert me when I'm abnormally above my historical pattern, even if I haven't hit a fixed cap.' Use Budgets for hard financial limits and CAD for pattern deviation. A $50 spike in a normally-$10 environment is an anomaly even if it's well below your budget cap.
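
To make the contrast concrete, here is a sketch of the Budgets side: a fixed monthly cap on one environment. The $100 limit, the 80% notification point, and the email address are hypothetical, and the user:environment$dev filter value assumes a user-defined tag key "environment" with value "dev".

```hcl
# Hard monthly cap on the dev environment, independent of spend history
resource "aws_budgets_budget" "dev_monthly" {
  name         = "dev-monthly-cap"
  budget_type  = "COST"
  limit_amount = "100" # hypothetical cap
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Assumed format: user:<tag key>$<tag value>
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:environment$dev"]
  }

  # Notify when actual spend passes 80% of the cap
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["oncall@example.com"] # placeholder address
  }
}
```

The budget fires at $80 no matter how normal that spend is for dev. CAD would stay silent at $80 if that were the usual pattern and fire at $50 if the usual figure were $10.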

See what Fortem shows you that CAD doesn't

CAD catches sustained billing anomalies. Fortem shows you per-environment cost in real time — which service, which environment, which task definition changed. 20 minutes to review your fleet.
