Why Do AWS Staging Environments Cost So Much?
You have 10 ECS environments. Most of them are staging, QA, or dev. No one is using them at 2am on Saturday. But Fargate bills by the second, and by the time the monthly invoice arrives the number is larger than expected. This isn't an infrastructure design problem — it's an idle compute problem. Here's where the money goes, and what moves the needle.
- 01 Non-prod ECS environments run 168 hours a week. Your team works 40. That's 128 hrs/week of idle compute per environment.
- 02 Fargate compute is ~68% of your ECS bill. The rest (CloudWatch Logs, ALB baseline) doesn't stop when the environment sits idle.
- 03 NAT Gateway, VPC, and often ALB are shared across environments, so that overhead doesn't multiply. Compute does.
- 04 Fargate Spot cuts non-prod compute by up to 70% for fault-tolerant tasks. Not suitable for demo environments or shared QA sessions.
- 05 Business-hours scheduling (Mon–Fri 09:00–19:00) cuts active compute time to ~30% of the 24/7 baseline with zero architecture changes.
The Terraform below uses ECS Application Auto Scaling scheduled actions to stop all tasks at 19:00 UTC and restart them at 09:00 UTC, Mon–Fri. No Lambda required. Replace your-cluster and your-service with your values. Repeat the aws_appautoscaling_* blocks for each service.

    # Register the ECS service as a scalable target
    resource "aws_appautoscaling_target" "staging_svc" {
      max_capacity       = 4
      min_capacity       = 0
      resource_id        = "service/your-cluster/your-service"
      scalable_dimension = "ecs:service:DesiredCount"
      service_namespace  = "ecs"
    }

    # Stop at 19:00 UTC Mon–Fri
    resource "aws_appautoscaling_scheduled_action" "stop_evening" {
      name               = "stop-staging-evening"
      service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
      resource_id        = aws_appautoscaling_target.staging_svc.resource_id
      scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
      schedule           = "cron(0 19 ? * MON-FRI *)"

      scalable_target_action {
        min_capacity = 0
        max_capacity = 0
      }
    }

    # Restart at 09:00 UTC Mon–Fri
    resource "aws_appautoscaling_scheduled_action" "start_morning" {
      name               = "start-staging-morning"
      service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
      resource_id        = aws_appautoscaling_target.staging_svc.resource_id
      scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
      schedule           = "cron(0 9 ? * MON-FRI *)"

      scalable_target_action {
        min_capacity = 1
        max_capacity = 4
      }
    }

    # Optional: Fargate Spot capacity provider for non-prod
    resource "aws_ecs_service" "staging_svc" {
      # ... your existing service config ...

      capacity_provider_strategy {
        capacity_provider = "FARGATE_SPOT"
        weight            = 1
      }

      capacity_provider_strategy {
        capacity_provider = "FARGATE"
        weight            = 0
        base              = 0
      }
    }

Business hours = Mon–Fri 09:00–19:00 (50 hrs/wk, ~217 hrs/mo). Fargate Spot at 70% discount. Shared infrastructure (NAT Gateway, VPC, ALB) not included; shared cost does not multiply per environment.
Why non-prod spend stays invisible
Non-prod costs get lumped into a single “infrastructure” line item with no per-environment breakdown. No one owns the number, so it doesn't get fixed.
Production gets optimized after a big bill. Staging keeps the config it had when the second engineer joined, and no one has touched it since. The reason isn't negligence; it's visibility. AWS Cost Explorer shows ECS as a service total. Without per-environment cost allocation tags, there's no way to see that your staging environment costs more than your QA environment, or that three dev environments have been running since February with no active work behind them.
The result: non-prod spend is invisible in reviews, gets absorbed into the overall AWS bill, and is deferred indefinitely with “it's just staging, we'll fix it later.”
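One way to get a per-environment breakdown is to tag every task and activate that tag for cost allocation. The sketch below assumes a hypothetical tag key `Environment`, a placeholder region, and a recent AWS provider; the service body is elided the same way as in the block at the top of this article.

```hcl
# Apply a hypothetical "Environment" tag to everything this provider creates.
provider "aws" {
  region = "us-east-1" # assumption: use your own region

  default_tags {
    tags = {
      Environment = "staging"
    }
  }
}

resource "aws_ecs_service" "staging_svc" {
  # ... your existing service config ...

  # Copy the service's tags onto each task so Fargate usage carries them.
  propagate_tags = "SERVICE"
}

# Activate the tag key for cost allocation. This has to run in the
# management (payer) account, and the key only becomes selectable after
# it has appeared on billed resources.
resource "aws_ce_cost_allocation_tag" "environment" {
  tag_key = "Environment"
  status  = "Active"
}
```

Once the tag is active, Cost Explorer can group ECS spend by `Environment`, which is what makes the staging-versus-QA comparison possible.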
Where the money goes on Fargate
Fargate compute is ~68% of a typical ECS bill at $0.04048/vCPU-hr and $0.004445/GB-hr (us-east-1 on-demand rates for Linux/x86). The remaining 32% (CloudWatch Logs at $0.50/GB ingested, ALB baseline at $0.0225/hr) doesn't scale to zero when tasks are idle.
The big number is compute, and compute is the lever. But a few non-obvious charges compound the problem for non-prod environments specifically:
- 01 CloudWatch Logs: verbose by default
  Non-prod environments often run at DEBUG log level. A service generating 1 GB/day of logs costs $15/month in ingestion alone. Multiply by 8 services and 10 environments and you have a meaningful line item that has nothing to do with compute.
- 02 Container Insights: charged per observation
  Container Insights is on by default on many clusters. For non-prod, it adds cost without adding value. Turn it off on dev and staging clusters.
- 03 ALB dedicated to one environment
  If each environment has its own ALB, the $0.0225/hr base charge ($16.43/mo) runs regardless of traffic. Teams running 10 environments with dedicated ALBs pay $164/mo in ALB base charges before a single request is processed.
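Two of these are small Terraform changes. The sketch below turns Container Insights off on a non-prod cluster and routes one environment through a shared ALB by hostname instead of giving it a dedicated load balancer. The cluster name, hostname, priority, and both variables are placeholders, not values from your setup.

```hcl
variable "shared_listener_arn" {
  type        = string
  description = "HTTPS listener on the ALB shared by non-prod environments"
}

variable "qa_target_group_arn" {
  type        = string
  description = "Target group for the QA environment's service"
}

# Non-prod cluster with Container Insights switched off.
resource "aws_ecs_cluster" "nonprod" {
  name = "nonprod" # hypothetical name

  setting {
    name  = "containerInsights"
    value = "disabled"
  }
}

# Host-based routing: one rule per environment on the shared listener.
resource "aws_lb_listener_rule" "qa" {
  listener_arn = var.shared_listener_arn
  priority     = 110

  condition {
    host_header {
      values = ["qa.example.internal"] # hypothetical hostname
    }
  }

  action {
    type             = "forward"
    target_group_arn = var.qa_target_group_arn
  }
}
```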
The 168-hour problem
A non-prod environment running 24/7 runs 168 hours a week. Your team works 40. That gap — 128 hours per week of idle compute per environment — is the real cost driver on Fargate.
Let's do the math on a realistic fleet. Ten non-prod environments, each running 8 services at 0.5 vCPU and 1 GB memory:
| Scenario | Hrs/mo active | Compute/mo | vs 24/7 |
|---|---|---|---|
| 24/7 on-demand | 730 | $1,442 | — |
| Business hours on-demand | ~217 | $428 | −70% |
| Business hours + Spot | ~217 | ~$128 | −91% |
80 services × 0.5 vCPU × $0.04048/hr + 80 × 1 GB × $0.004445/hr. Business hours = Mon–Fri 09:00–19:00 UTC (~217 hrs/mo).
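If you want to rerun the table with your own fleet size, the arithmetic fits in a few Terraform locals. This is only a calculator sketch using the rates quoted in this article; the local and output names are made up for the example, and the 70% Spot discount is the "up to" figure, not a guaranteed price.

```hcl
locals {
  task_count    = 10 * 8 # 10 environments x 8 services
  vcpu_per_task = 0.5
  gb_per_task   = 1

  # Rates quoted in this article (USD per hour).
  vcpu_rate = 0.04048
  gb_rate   = 0.004445

  spot_discount = 0.70 # assumption: best-case Spot discount

  hours_always_on      = 730
  hours_business_hours = 50 * 52 / 12 # 50 hrs/week, about 217 hrs/month

  # Hourly compute cost for the whole fleet.
  fleet_hourly = local.task_count * (
    local.vcpu_per_task * local.vcpu_rate + local.gb_per_task * local.gb_rate
  )
}

output "monthly_compute_usd" {
  value = {
    always_on           = format("%.0f", local.fleet_hourly * local.hours_always_on)
    business_hours      = format("%.0f", local.fleet_hourly * local.hours_business_hours)
    business_hours_spot = format("%.0f", local.fleet_hourly * local.hours_business_hours * (1 - local.spot_discount))
  }
}
```

With the inputs above, `terraform apply` (or evaluating the locals in `terraform console`) should land on the same three figures as the table.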
Fargate bills per second, with a one-minute minimum per task. A task stopped at 19:00 accrues no compute charges until it restarts at 09:00. That's how the billing model works, not an approximation, so the savings from scheduling show up immediately.
Fargate Spot for non-prod: when it works, when it doesn't
Fargate Spot runs non-prod tasks on spare AWS capacity at up to 70% off on-demand rates. It works well for dev and QA. Avoid it for environments used for customer demos or with stateful in-memory work that can't tolerate a restart.
The mechanics: AWS gives a two-minute warning before reclaiming Spot capacity, delivered as a SIGTERM to the running task and as a task state change event. ECS records the stop with the SpotInterruption stop code and, if desired count is still > 0, launches a replacement.
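To give a container time to use that warning, you can raise its stop timeout, which defaults to 30 seconds. Here is a minimal task definition sketch, assuming a single container; the family, container name, image, and `execution_role_arn` variable are placeholders.

```hcl
variable "execution_role_arn" {
  type = string
}

resource "aws_ecs_task_definition" "staging_svc" {
  family                   = "staging-svc"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512  # 0.5 vCPU
  memory                   = 1024 # 1 GB
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "example.com/your-image:latest"
      essential = true

      # Seconds ECS waits after SIGTERM before sending SIGKILL.
      # Fargate allows up to 120.
      stopTimeout = 120
    }
  ])
}
```

The application still has to handle SIGTERM itself (finish in-flight requests, flush state); the timeout only controls how long it gets to do so.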
| Environment type | Fargate Spot? | Reason |
|---|---|---|
| Dev environments | ✓ Yes | Stateless, restartable, no active users |
| Feature branch preview | ✓ Yes | Ephemeral, restartable on interrupt |
| CI / integration tests | ✓ Yes | Short-lived tasks, retry on failure |
| QA (automated) | ✓ Yes | Tests restart automatically on failure |
| QA (live session) | ✗ Risky | Interrupt kills active QA session |
| Demo environment | ✗ No | Customer impact if interrupted |
| Staging (production-like) | ✗ Usually not | Used for final validation, needs stability |
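The capacity provider strategy in the Terraform block above sets FARGATE_SPOT weight=1, FARGATE weight=0, which is pure Spot. For environments that need more stability, set Spot weight to 3 and on-demand weight to 1: ECS then places roughly three of every four tasks on Spot and the rest on on-demand. Weights split placement; they are not a fallback. If Spot capacity is unavailable, ECS does not automatically move those tasks to on-demand, so set base on the FARGATE provider when a minimum number of tasks must stay up.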
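A mixed strategy might look like the sketch below. The `base = 1` is an addition for illustration: it pins the first task to on-demand before the 3:1 weights apply to the remaining tasks.

```hcl
resource "aws_ecs_service" "staging_svc" {
  # ... your existing service config ...

  # The first task always runs on on-demand Fargate.
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 1
    weight            = 1
  }

  # Beyond the base, about 3 of every 4 tasks go to Spot.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }
}
```

Only one provider in a strategy can carry a base, and with a desired count of 1 this configuration runs entirely on on-demand, so it mainly pays off for services that run several tasks.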
Business-hours scheduling: the fastest ROI
Scheduling ECS tasks to stop at 19:00 and restart at 09:00 Mon–Fri cuts active compute time from 730 hours/month to ~217 hours — a 70% reduction with no architecture changes required.
The AWS-native approach uses ECS Application Auto Scaling scheduled actions. No Lambda function, no custom scheduler, no third-party tool — this is a first-class ECS feature. The Terraform block at the top of this article implements it exactly.
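One thing to watch when Terraform and scheduled actions manage the same service: the schedule changes the desired count, and the scalable target's min/max, outside Terraform, so a later `terraform apply` can try to revert them. A common way to handle this, sketched here as additions to the blocks at the top of this article, is to ignore those attributes after creation.

```hcl
resource "aws_ecs_service" "staging_svc" {
  # ... your existing service config ...
  desired_count = 1 # initial value only

  lifecycle {
    # Let Application Auto Scaling own the task count after creation.
    ignore_changes = [desired_count]
  }
}

resource "aws_appautoscaling_target" "staging_svc" {
  max_capacity       = 4
  min_capacity       = 0
  resource_id        = "service/your-cluster/your-service"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  lifecycle {
    # Scheduled actions rewrite min/max; don't treat that as drift.
    ignore_changes = [min_capacity, max_capacity]
  }
}
```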
A few operational details worth knowing before you deploy:
- Deregistration delay. ALB target groups have a default 300-second deregistration delay. Reduce this to 30 seconds on non-prod target groups so environments stop promptly at 19:00 instead of draining for 5 minutes.
- Stateful services. RDS and ElastiCache run independently, so they're not stopped by this config. Data persists across task restarts. EFS mounts reattach on task start.
- Timezone offset. Application Auto Scaling cron expressions are evaluated in UTC by default. Mon–Fri 09:00–19:00 ET is 13:00–23:00 UTC during daylight saving time and 14:00–00:00 UTC in winter. Either adjust the cron expressions for your team's timezone or set the scheduled action's timezone argument so the schedule follows local time.
- Override capability. The scheduled action sets the service's min and max capacity to 0, which drives desired count to 0. Any engineer can manually set desired count back to 1 for an after-hours session. The schedule resumes as normal the next morning.
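The deregistration delay and timezone points above might look like this in Terraform. The target group name, port, and `var.vpc_id` are placeholders. The second resource is a variant of the `start_morning` action from the top of this article and still refers to the `aws_appautoscaling_target.staging_svc` defined there. Its optional `timezone` argument takes an IANA zone name, so the cron expression can stay in local time.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "staging_svc" {
  name        = "staging-svc"
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks register by IP
  vpc_id      = var.vpc_id

  # Drain for 30 seconds instead of the default 300.
  deregistration_delay = 30
}

resource "aws_appautoscaling_scheduled_action" "start_morning" {
  name               = "start-staging-morning"
  service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
  resource_id        = aws_appautoscaling_target.staging_svc.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension

  # 09:00 Eastern all year; daylight saving is handled by the time zone.
  schedule = "cron(0 9 ? * MON-FRI *)"
  timezone = "America/New_York"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 4
  }
}
```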
At 10+ environments, this math becomes unavoidable
One staging environment running 24/7 is an annoyance. Ten of them is a line item that starts appearing in board decks. The fix doesn't scale manually.
Manual scheduling via the AWS console or one-off Terraform blocks works at 1–2 environments. At 10+, the operational overhead compounds:
- Schedule drift: different engineers set different start/stop times, and no one audits them
- Environment-specific hours: the ML team needs their env at 6am, QA needs theirs until 9pm
- On-demand overrides: “can you keep staging up tonight, we have a client demo” gets sent in Slack and forgotten in Terraform
- New environments inherit no schedule by default: the next dev environment someone spins up runs 24/7 until someone notices
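If you stay in plain Terraform, the usual way to blunt the first, second, and fourth problems is to drive every environment from one map with default hours. The sketch below is one possible shape, not a complete solution: the variable name and its fields are invented for the example, and `optional()` defaults need Terraform 1.3 or later. It does nothing for ad hoc overrides or auditing.

```hcl
variable "environments" {
  description = "One entry per non-prod service; hours default to 09:00-19:00 UTC"
  type = map(object({
    cluster    = string
    service    = string
    start_cron = optional(string, "cron(0 9 ? * MON-FRI *)")
    stop_cron  = optional(string, "cron(0 19 ? * MON-FRI *)")
    max_tasks  = optional(number, 4)
  }))
}

resource "aws_appautoscaling_target" "env" {
  for_each = var.environments

  min_capacity       = 0
  max_capacity       = each.value.max_tasks
  resource_id        = "service/${each.value.cluster}/${each.value.service}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_scheduled_action" "stop" {
  for_each = var.environments

  name               = "stop-${each.key}"
  service_namespace  = aws_appautoscaling_target.env[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.env[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.env[each.key].scalable_dimension
  schedule           = each.value.stop_cron

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

resource "aws_appautoscaling_scheduled_action" "start" {
  for_each = var.environments

  name               = "start-${each.key}"
  service_namespace  = aws_appautoscaling_target.env[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.env[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.env[each.key].scalable_dimension
  schedule           = each.value.start_cron

  scalable_target_action {
    min_capacity = 1
    max_capacity = each.value.max_tasks
  }
}
```

A new environment added to the map picks up the default schedule automatically, and a team with different hours overrides only `start_cron` or `stop_cron` for its own entry.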
This is where fleet-level tooling pays for itself. Fortem manages scheduling across all non-prod environments from one interface — with override capability per environment, audit log of who changed what, and defaults that apply to new environments automatically.