Guide
Matt S
Platform engineer at Fortem · 7 min read

Why Do AWS Staging Environments Cost So Much?

You have 10 ECS environments. Most of them are staging, QA, or dev. No one is using them at 2am on Saturday. But Fargate bills by the second, and by the time the monthly invoice arrives the number is larger than expected. This isn't an infrastructure design problem — it's an idle compute problem. Here's where the money goes, and what moves the needle.

TL;DR
  • 01 Non-prod ECS environments run 168 hours a week. Your team works 40. That's 128 hrs/week of idle compute per environment.
  • 02 Fargate compute is ~68% of your ECS bill. The rest (CloudWatch Logs, ALB baseline) doesn't stop when the environment sits idle.
  • 03 NAT Gateway, VPC, and often ALB are shared across environments, so that overhead doesn't multiply. Compute does.
  • 04 Fargate Spot cuts non-prod compute by up to 70% for fault-tolerant tasks. Not suitable for demo environments or shared QA sessions.
  • 05 Business-hours scheduling (Mon–Fri 09:00–19:00) cuts active compute time to ~30% of the 24/7 baseline with zero architecture changes.
Ready to use — drop this into your Terraform today

Application Auto Scaling scheduled actions for ECS: stop all tasks at 19:00 UTC and restart them at 09:00 UTC, Mon–Fri. No Lambda required. Replace your-cluster and your-service with your values. Repeat the aws_appautoscaling_* blocks for each service.

```hcl
# Register the ECS service as a scalable target
resource "aws_appautoscaling_target" "staging_svc" {
  max_capacity       = 4
  min_capacity       = 0
  resource_id        = "service/your-cluster/your-service"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Stop at 19:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "stop_evening" {
  name               = "stop-staging-evening"
  service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
  resource_id        = aws_appautoscaling_target.staging_svc.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"

  # Capping min and max at 0 scales the service down to zero tasks
  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Restart at 09:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "start_morning" {
  name               = "start-staging-morning"
  service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
  resource_id        = aws_appautoscaling_target.staging_svc.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
  schedule           = "cron(0 9 ? * MON-FRI *)"

  # Raising min to 1 brings at least one task back up
  scalable_target_action {
    min_capacity = 1
    max_capacity = 4
  }
}

# Optional: Fargate Spot capacity provider for non-prod
resource "aws_ecs_service" "staging_svc" {
  # ... your existing service config ...

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 0
    base              = 0
  }

  # Let the schedule own the task count; otherwise a later
  # terraform apply can reset desired_count to the configured value.
  lifecycle {
    ignore_changes = [desired_count]
  }
}
```
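
Repeating the scheduling blocks by hand gets tedious past a handful of services. A minimal sketch of the same schedule with for_each, assuming a hypothetical staging_services variable that maps each service name to its cluster name (the variable, its defaults, and the resource labels are illustrative and not part of the config above):

```hcl
# Hypothetical input: service name => cluster name
variable "staging_services" {
  type = map(string)
  default = {
    "api"    = "staging-cluster"
    "worker" = "staging-cluster"
  }
}

# One scalable target per service
resource "aws_appautoscaling_target" "nonprod" {
  for_each           = var.staging_services
  max_capacity       = 4
  min_capacity       = 0
  resource_id        = "service/${each.value}/${each.key}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Stop every registered service at 19:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "nonprod_stop" {
  for_each           = aws_appautoscaling_target.nonprod
  name               = "stop-${each.key}-evening"
  service_namespace  = each.value.service_namespace
  resource_id        = each.value.resource_id
  scalable_dimension = each.value.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Restart every registered service at 09:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "nonprod_start" {
  for_each           = aws_appautoscaling_target.nonprod
  name               = "start-${each.key}-morning"
  service_namespace  = each.value.service_namespace
  resource_id        = each.value.resource_id
  scalable_dimension = each.value.scalable_dimension
  schedule           = "cron(0 9 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 4
  }
}
```

Adding a service to the map is then the only change needed to put it on the schedule.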
Monthly compute cost: 10 non-prod environments (80 services, 0.5 vCPU each)
us-east-1, Linux x86, on-demand rates June 2026
  • 24/7 on-demand: $1,442/mo
  • Business hours on-demand: $428/mo (−70%)
  • Business hours + Fargate Spot: $128/mo (−91%)

Business hours = Mon–Fri 09:00–19:00 (50 hrs/wk, ~217 hrs/mo). Fargate Spot at 70% discount. Shared infrastructure (NAT Gateway, VPC, ALB) not included — shared cost does not multiply per environment.

Why non-prod spend stays invisible

Non-prod costs get lumped into a single “infrastructure” line item with no per-environment breakdown. No one owns the number, so it doesn't get fixed.

Production gets optimized after a big bill. Staging gets the same config it had when the second engineer joined and no one has touched it since. The reason isn't negligence — it's visibility. AWS Cost Explorer shows you ECS as a service total. Without per-environment cost allocation tags, there's no way to see that your staging environment costs more than your QA environment, or that three dev environments have been running since February with no active work behind them.

The result: non-prod spend is invisible in reviews, gets absorbed into the overall AWS bill, and is deferred indefinitely with “it's just staging, we'll fix it later.”

Key insight
“Nobody noticed because staging bills get lumped into ‘infrastructure costs’ and nobody questions them.” — practitioner, dev.to

Where the money goes on Fargate

Fargate compute is ~68% of a typical ECS bill at $0.04048/vCPU-hr and $0.004445/GB-hr. The remaining 32% — CloudWatch Logs at $0.50/GB ingested, ALB baseline at $0.0225/hr — doesn't scale to zero when tasks are idle.

The big number is compute, and compute is the lever. But a few non-obvious charges compound the problem for non-prod environments specifically:

  • 01 CloudWatch Logs: verbose by default

    Non-prod environments often run at DEBUG log level. A service generating 1 GB/day of logs costs $15/month in ingestion alone. Across 8 services and 10 environments that is $1,200/month, a meaningful line item that has nothing to do with compute.

  • 02 Container Insights: charged per observation

    Container Insights is on by default on many clusters. For non-prod, it adds cost without adding value. Turn it off on dev and staging clusters.

  • 03 ALB dedicated to one environment

    If each environment has its own ALB, the $0.0225/hr base charge ($16.43/mo) runs regardless of traffic. Teams running 10 environments with dedicated ALBs pay $164/mo in ALB base charges before a single request is processed.
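
Turning Container Insights off is a one-line cluster setting. A minimal sketch, assuming the cluster is managed in Terraform and using a hypothetical cluster name:

```hcl
resource "aws_ecs_cluster" "dev" {
  name = "dev-cluster" # hypothetical name

  # Explicitly disable Container Insights for this non-prod cluster
  setting {
    name  = "containerInsights"
    value = "disabled"
  }
}
```

If the cluster already exists outside Terraform, the same setting can be changed in place; importing the cluster first avoids recreating it.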

The 168-hour problem

A non-prod environment running 24/7 runs 168 hours a week. Your team works 40. That gap — 128 hours per week of idle compute per environment — is the real cost driver on Fargate.

Let's do the math on a realistic fleet. Ten non-prod environments, each running 8 services at 0.5 vCPU and 1 GB memory:

| Scenario | Hrs/mo active | Compute/mo | vs 24/7 |
|---|---|---|---|
| 24/7 on-demand | 730 | $1,442 | |
| Business hours on-demand | ~217 | $428 | −70% |
| Business hours + Spot | ~217 | ~$128 | −91% |

80 services × 0.5 vCPU × $0.04048/hr + 80 × 1 GB × $0.004445/hr. Business hours = Mon–Fri 09:00–19:00 UTC (~217 hrs/mo).
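
The arithmetic behind the table can be reproduced in Terraform itself, which makes it easy to plug in your own fleet size. This is a sketch under the article's assumptions: the rates are the ones quoted here, the flat 70% Spot discount is a simplification, and every local name is illustrative.

```hcl
locals {
  service_count = 80 # 10 environments x 8 services
  vcpu_per_task = 0.5
  gb_per_task   = 1

  vcpu_hour_usd = 0.04048  # rate quoted in this article
  gb_hour_usd   = 0.004445 # rate quoted in this article

  # Hourly cost of the whole fleet while every task is running
  fleet_hourly_usd = local.service_count * (
    local.vcpu_per_task * local.vcpu_hour_usd +
    local.gb_per_task * local.gb_hour_usd
  )

  hours_always_on = 730
  hours_business  = 50 * 52 / 12 # 50 hrs/wk averaged over a month

  spot_discount = 0.70 # assumption: flat discount; real Spot pricing varies
}

output "monthly_compute_usd" {
  value = {
    always_on           = local.fleet_hourly_usd * local.hours_always_on
    business_hours      = local.fleet_hourly_usd * local.hours_business
    business_hours_spot = local.fleet_hourly_usd * local.hours_business * (1 - local.spot_discount)
  }
}
```

The fleet costs about $1.97 per hour while running, so the three outputs come to roughly $1,442, $428, and $128, matching the table.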

Key insight
The compute in a non-prod environment doesn't know it's 2am on Sunday. It charges the same rate as a Tuesday afternoon.

Fargate bills by the second, with a one-minute minimum per task. A task stopped at 19:00 accrues no compute charge until it restarts at 09:00. That's not an approximation; it's how the billing model works. The savings from scheduling are immediate and predictable.

What shared infrastructure changes (and doesn't change)

NAT Gateway, VPC, and often ALB are shared across environments. That overhead doesn't multiply per environment. What multiplies is compute — one set of running tasks per environment, billed independently.

A well-structured ECS fleet shares:

  • NAT Gateway — one per VPC, ~$32.85/mo base. Shared across all environments. $3.29/env at 10 environments.
  • ALB with host-based routing — one ALB routes to all environments via hostname rules. $16.43/mo base total, not per environment.
  • VPC, subnets, security groups — no per-environment charge.
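
Host-based routing on a shared ALB can be sketched as one listener rule per environment. The example assumes the shared listener and the per-environment target groups already exist; the variable names, and the choice to key the map by hostname, are illustrative.

```hcl
variable "shared_listener_arn" {
  description = "ARN of the listener on the shared ALB (assumed to exist)"
  type        = string
}

# Hypothetical map: hostname => target group and rule priority
variable "environments" {
  type = map(object({
    target_group_arn = string
    priority         = number
  }))
  default = {}
}

resource "aws_lb_listener_rule" "per_environment" {
  for_each     = var.environments
  listener_arn = var.shared_listener_arn
  priority     = each.value.priority

  # Forward requests for this hostname to the environment's target group
  action {
    type             = "forward"
    target_group_arn = each.value.target_group_arn
  }

  condition {
    host_header {
      values = [each.key]
    }
  }
}
```

Each new environment adds a rule and a target group, not another $16.43/mo load balancer.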

What doesn't share: Fargate task hours, CloudWatch Logs ingestion per environment, and ECR image pull data. These are the numbers that multiply at fleet scale — and they're all driven by idle compute.

This is why the fix is scheduling tasks, not redesigning network architecture. Once you understand that shared infra is already cheap per environment, the question becomes: how do you stop paying for 128 idle compute hours per week?

You can set up per-environment cost allocation tags with AWS Cost Anomaly Detection to get alerted when any single environment deviates from its historical spend baseline — useful once you have scheduling in place and want to catch drift.
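
A per-environment anomaly monitor might look like the sketch below. It assumes an Environment tag that has already been activated as a cost allocation tag; the tag key, tag value, and monitor name are placeholders.

```hcl
resource "aws_ce_anomaly_monitor" "staging_env" {
  name         = "staging-environment-spend" # hypothetical name
  monitor_type = "CUSTOM"

  # Scope the monitor to resources tagged Environment = staging
  monitor_specification = jsonencode({
    And            = null
    CostCategories = null
    Dimensions     = null
    Not            = null
    Or             = null

    Tags = {
      Key          = "Environment" # assumed tag key
      MatchOptions = null
      Values       = ["staging"]
    }
  })
}
```

The monitor only detects anomalies. To receive alerts, attach it to an anomaly subscription with an email or SNS subscriber and a threshold that suits your spend.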

Fargate Spot for non-prod: when it works, when it doesn't

Fargate Spot runs non-prod tasks on spare AWS capacity at up to 70% off on-demand rates. It works well for dev and QA. Avoid it for environments used for customer demos or with stateful in-memory work that can't tolerate a restart.

The mechanics: AWS gives a two-minute warning before reclaiming Spot capacity, delivered as a task state change event and as a SIGTERM to the running task. ECS stops the task with stop code SpotInterruption and, if desired count is still > 0, launches a replacement.

| Environment type | Fargate Spot? | Reason |
|---|---|---|
| Dev environments | ✓ Yes | Stateless, restartable, no active users |
| Feature branch preview | ✓ Yes | Ephemeral, restartable on interrupt |
| CI / integration tests | ✓ Yes | Short-lived tasks, retry on failure |
| QA (automated) | ✓ Yes | Tests restart automatically on failure |
| QA (live session) | ✗ Risky | Interrupt kills active QA session |
| Demo environment | ✗ No | Customer impact if interrupted |
| Staging (production-like) | ✗ Usually not | Used for final validation, needs stability |

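The capacity provider strategy in the Terraform block above sets FARGATE_SPOT weight=1, FARGATE weight=0: pure Spot. For environments that need some stability, set Spot weight to 3 and on-demand weight to 1 so that roughly three of every four tasks run on Spot. Weights only split task placement; ECS does not fall back to on-demand automatically when Spot capacity is unavailable, so set a base of 1 on FARGATE if at least one task must always run on-demand.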
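
A mixed strategy could be sketched as follows. It assumes the cluster already has both the FARGATE and FARGATE_SPOT capacity providers associated; the service name, variables, and desired count are hypothetical.

```hcl
variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_ecs_service" "qa_svc" {
  name            = "qa-service" # hypothetical name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 4

  # The first task always lands on on-demand Fargate
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 1
    weight            = 1
  }

  # Tasks beyond the base are split 3:1 in favor of Spot
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }

  # Fargate tasks require awsvpc networking
  network_configuration {
    subnets = var.subnet_ids
  }

  # Leave the task count to scheduled scaling
  lifecycle {
    ignore_changes = [desired_count]
  }
}
```

With a base of 1 on FARGATE, a Spot interruption can shrink the service but should not take it to zero.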

Business-hours scheduling: the fastest ROI

Scheduling ECS tasks to stop at 19:00 and restart at 09:00 Mon–Fri cuts active compute time from 730 hours/month to ~217 hours — a 70% reduction with no architecture changes required.

The AWS-native approach uses ECS Application Auto Scaling scheduled actions. No Lambda function, no custom scheduler, no third-party tool — this is a first-class ECS feature. The Terraform block at the top of this article implements it exactly.

A few operational details worth knowing before you deploy:

  • Deregistration delay. ALB target groups have a default 300-second deregistration delay. Reduce this to 30 seconds on non-prod target groups so environments stop promptly at 19:00 instead of draining for 5 minutes.
  • Stateful services. RDS and ElastiCache run independently — they're not stopped by this config. Data persists across task restarts. EFS mounts reattach on task start.
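  • Timezone offset. Application Auto Scaling cron schedules default to UTC. Mon–Fri 09:00–19:00 ET is 13:00–23:00 UTC during daylight saving time and 14:00–00:00 UTC in winter. Adjust the cron expressions for your team's timezone, or set the scheduled action's timezone argument so the schedule follows local time.
  • Override capability. The scheduled action changes the scalable target's minimum and maximum capacity, which brings desired count to 0 at 19:00. Any engineer can manually set desired count back to 1 for an after-hours session. The schedule resumes as normal the next morning.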
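
Two of the details above translate directly into Terraform. This is a sketch and not a drop-in config: the target group name, port, and vpc_id variable are assumptions, and the scheduled action expects the service to be registered as a scalable target already.

```hcl
variable "vpc_id" {
  type = string
}

# Shorter draining so tasks stop promptly at the end of the day
resource "aws_lb_target_group" "staging_api" {
  name                 = "staging-api" # hypothetical name
  port                 = 8080          # hypothetical container port
  protocol             = "HTTP"
  target_type          = "ip" # Fargate tasks use awsvpc networking
  vpc_id               = var.vpc_id
  deregistration_delay = 30
}

# Same stop schedule, expressed in local time instead of UTC
resource "aws_appautoscaling_scheduled_action" "stop_evening_local" {
  name               = "stop-staging-evening-local"
  service_namespace  = "ecs"
  resource_id        = "service/your-cluster/your-service"
  scalable_dimension = "ecs:service:DesiredCount"
  schedule           = "cron(0 19 ? * MON-FRI *)"
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}
```

Setting timezone keeps the 19:00 stop aligned with local time across daylight saving changes, so the cron expression does not need editing twice a year.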

At 10+ environments, this math becomes unavoidable

One staging environment running 24/7 is an annoyance. Ten of them is a line item that starts appearing in board decks. The fix doesn't scale manually.

Manual scheduling via the AWS console or one-off Terraform blocks works at 1–2 environments. At 10+, the operational overhead compounds:

  • Schedule drift — different engineers set different start/stop times, no one audits
  • Environment-specific hours — the ML team needs their env at 6am, QA needs theirs until 9pm
  • On-demand overrides — “can you keep staging up tonight, we have a client demo” — sent in Slack, forgotten in Terraform
  • New environments inherit no schedule by default — the next dev environment someone spins up runs 24/7 until someone notices

This is where fleet-level tooling pays for itself. Fortem manages scheduling across all non-prod environments from one interface — with override capability per environment, audit log of who changed what, and defaults that apply to new environments automatically.

See which environments in your fleet are burning budget right now.

Talk to us about your fleet

Questions this article doesn't answer

How do I actually see which environment is costing what in AWS?
Enable cost allocation tags for your environment key in the AWS Billing console, then use Cost Explorer with a Group by filter on that tag. You'll see per-environment spend broken out as individual rows. Our article on per-environment cost visibility walks through the exact steps.
Can I automatically stop ECS environments when there's no active deployment or open PR?
Not with native ECS scheduling alone — you'd need to wire EventBridge to your CI/CD events. A GitHub Actions workflow can call the ECS UpdateService API to set desired count to 0 when a PR is closed and back to 1 when a new deployment completes. Some teams add this to their deploy pipeline directly.
What's the difference between desired count = 0 and deleting the ECS service entirely?
Setting desired count to 0 stops all running tasks but preserves the service definition, IAM roles, capacity provider strategies, and auto-scaling rules. The service restarts exactly as configured. Deleting the service removes all of this and you'd need to recreate it from Terraform. For scheduling, use desired count = 0 — not service deletion.
Does stopping and restarting ECS tasks affect RDS or other stateful services?
RDS, ElastiCache, and other stateful services run independently of ECS task count. Stopping tasks at 19:00 has no effect on your database — it continues running (and billing) until you separately stop it. Data persists across task restarts. EFS volumes reattach automatically when tasks start again.
