Matt S
June 14, 2026 · 9 min read
Platform Engineering for ECS Teams: What It Actually Means at 10+ Environments
"Platform engineering" gets used to mean everything from Backstage portals to golden paths to internal tooling teams. For ECS Fargate teams, it means something more specific: closing the gap between what Terraform provisions and what your environments need to operate at scale. At 10 environments the gap is annoying. At 30 it's a full-time job. Here's what that gap looks like and what to do about it.
TL;DR
- Terraform provisions ECS environments. It doesn't operate them: no scheduling, no self-service, no fleet visibility, no cost attribution per environment.
- The "operations gap" opens at ~10 environments and gets worse with every new environment you add.
- Platform engineering for ECS = closing that gap. It doesn't require Backstage, a portal, or a 5-person dedicated team.
- Three things every ECS platform team needs: environment scheduling, developer self-service with scoped access, fleet visibility with cost attribution.
- Build vs buy: custom Lambda + EventBridge scheduling works at 3 environments. At 20 it's a maintenance burden.
What "platform engineering" actually means for an ECS team
Platform engineering solves problems that recur across every service and environment — so developers stop solving them individually. For ECS teams, those problems are operational, not organizational.
The best one-sentence definition comes from a Hacker News thread on the topic: "common problems that your software engineers are having to solve that aren't about the unique value of the system they're building — solved once, for everybody, in a coherent and managed way." That's it. The label doesn't matter.
For an ECS Fargate team, those recurring problems are almost always operational. You have 15 environments. Each one needs to be started in the morning, stopped at night, cloneable for QA, and visible as a fleet. Each developer needs to be able to restart their own environment without asking you. Each environment needs a cost number attached to it.
What platform engineering does not mean for most ECS teams: a Backstage portal, Score language, landing zones, or cloud account governance. Those are enterprise IDP problems — appropriate for a 200+ engineer org running workloads across AWS, GCP, and Azure with five dedicated platform engineers. For a 50-person company running 20 ECS Fargate environments on Terraform, that's the wrong solution to the wrong problem.
The reframe that makes this concrete: platform engineering for ECS = the operational layer that sits on top of Terraform. Terraform provisions. The platform layer operates.
The operations gap — what Terraform can't do
Terraform provisions ECS infrastructure. It has no concept of "stop this environment at 7pm" or "show me which environments are idle right now." That gap widens with every new environment.
This isn't a criticism of Terraform. IaC is the right tool for provisioning. The problem is that provisioning is only half of the job. Once an environment exists, someone has to operate it — and Terraform has no primitives for that.
Here's what the operations gap looks like concretely at 10+ ECS environments:
| Gap | What it costs | DIY fix (and its price) |
|---|---|---|
| Scheduling | Environments run 168 hrs/week; team works ~55 | Lambda + EventBridge + CW cron per environment — 20 separate stacks to maintain at 20 envs |
| Self-service | Developers open Slack tickets to restart staging on Friday at 6pm | Per-developer IAM policies — updated manually every time a new environment or developer is added |
| Visibility | No single view of which environments are running, drifted, or healthy | CloudWatch dashboards per environment — manually created, quickly stale |
| Cost attribution | Cost Explorer shows total Fargate spend, not per-environment cost | Custom cost allocation tags + Cost Explorer grouping — requires consistent tagging across all resources from day one |
| Orphan detection | $200–$400/month per dead environment nobody shut down | Manual audit — someone opens the console and checks last-used timestamps quarterly |
The state sprawl problem compounds all of this. At 50 environments, you're looking at roughly 1,500 Terraform resources. A terraform plan across the full fleet takes 4+ minutes. Adding a new environment requires updating a checklist of steps, not running a single command.
None of this is Terraform's fault. These are operations problems. Terraform was never designed to solve them.
For more on the Terraform state sprawl problem at ECS scale, see Managing ECS Fargate with Terraform: What Works and What Doesn't.
Three things every ECS platform team needs
Environment scheduling, developer self-service with scoped access, and fleet visibility with cost attribution. Everything else is optional until you've solved these three.
These are the same three things every ECS team at 10+ environments independently discovers they need — usually in this order, usually after a painful incident or an unexpected AWS bill.
1. Environment scheduling
A stopped ECS service costs $0 in Fargate compute. Your dev and staging environments run 168 hours a week. Your team works roughly 55. The other 113 hours — evenings, nights, weekends — those environments are billing you for compute nobody is using.
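Scheduling those environments off outside business hours removes up to 113 of the 168 weekly hours (about 67%), so it cuts their Fargate compute spend by 60–70%. For a team with 12 dev environments at $200/month each ($2,400/month in total), that's roughly $1,440–$1,680/month saved, or about $17,000–$20,000 a year, which can be more than an off-the-shelf tool costs.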
Without scheduling: you're paying for environments 113 hours a week that nobody is using. This is the single highest-leverage platform engineering action for any ECS team.
2. Developer self-service with scoped access
At some point a developer needed to restart staging on a Friday evening, couldn't, and sent you a Slack message. If that's happened once, it happens regularly. You are the bottleneck.
The solution is scoped IAM: each developer can restart, stop, or start services within their own environment — and only their environment. They can't touch production. They can't modify infrastructure. They can do the thing they need to do without opening a ticket.
Without self-service: your platform team fields operational support tickets instead of building platform infrastructure. Every environment you add makes the problem worse.
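One way to express that scoping in Terraform is an IAM policy limited to the services of a single cluster. This is a minimal sketch, not a complete access model. It assumes one ECS cluster per environment, named after the environment, and the ECS service ARN format that includes the cluster name. The variable names and the policy name are hypothetical.

```hcl
# Hypothetical inputs: one cluster per environment, named after the environment.
variable "environment" {
  type        = string
  description = "Environment and ECS cluster name, for example dev-alice"
}

variable "aws_region" {
  type        = string
  description = "Region the cluster runs in"
}

data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "env_operator" {
  statement {
    sid    = "OperateOwnEnvironment"
    effect = "Allow"

    # UpdateService covers stop (desired count 0), start, and restart
    # (force new deployment). DescribeServices lets the developer check status.
    actions = [
      "ecs:UpdateService",
      "ecs:DescribeServices",
    ]

    # Only services inside this environment's cluster match this ARN pattern.
    resources = [
      "arn:aws:ecs:${var.aws_region}:${data.aws_caller_identity.current.account_id}:service/${var.environment}/*",
    ]
  }
}

resource "aws_iam_policy" "env_operator" {
  name   = "ecs-operator-${var.environment}"
  policy = data.aws_iam_policy_document.env_operator.json
}
```

Attached to a developer's role, a policy like this grants nothing outside the named cluster, so production services and infrastructure changes stay out of reach by default. Services managed by Application Auto Scaling would likely need additional permissions that this sketch does not cover.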
3. Fleet visibility with cost attribution
CloudWatch shows you metrics per service. Cost Explorer shows you total Fargate spend. Neither tells you: "which of my 20 environments is running right now, and how much does each one cost per month?"
Fleet visibility means one view: all environments, their running state, any drift from their expected configuration, and cost per environment per month. Without it, orphaned environments — those spun up for a feature branch six months ago that nobody shut down — accumulate invisibly at $200–$400/month each.
Without cost attribution: you can't answer "how much does the QA environment cost?" You can't make informed decisions about which environments to keep, which to schedule more aggressively, or which to shut down entirely.
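Per-environment cost numbers depend on every resource carrying the same environment tag. The sketch below shows one way to get there in Terraform, assuming one Terraform configuration per environment and a tag key of `Environment` (a hypothetical choice, as are the variable names).

```hcl
variable "environment" {
  type = string
}

variable "aws_region" {
  type = string
}

provider "aws" {
  region = var.aws_region

  # Every taggable resource created through this provider inherits the tag.
  default_tags {
    tags = {
      Environment = var.environment
    }
  }
}

# Activates the key as a cost allocation tag so Cost Explorer can group by it.
# Assumptions: this runs with access to the billing (management) account, and
# the tag has already appeared on billed resources; activation may fail otherwise.
resource "aws_ce_cost_allocation_tag" "environment" {
  tag_key = "Environment"
  status  = "Active"
}
```

Fargate usage is billed per task, so it may also be necessary to set `propagate_tags = "SERVICE"` on each `aws_ecs_service` so that tasks carry the service's tags. Costs appear under the tag only from the point of activation onward, which is why consistent tagging from day one matters.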
Build vs buy — the honest breakdown
Building scheduling and self-service yourself works at 3 environments. At 15 it's a maintenance burden. At 30 it's a second product your platform team maintains instead of shipping.
The build path is real and reasonable at small scale. Lambda + EventBridge for scheduling, custom IAM policies per developer, CloudWatch dashboards per environment — doable, well-understood, free. Total engineering cost: 2–4 weeks to build, plus ongoing maintenance.
The maintenance cost is what gets teams. At 20 environments, you have 20 separate scheduling stacks. Adding a new environment means a 30-minute checklist. Changing the schedule logic means 20 updates. A developer joins — you update 20 sets of IAM policies. A developer leaves — same thing. Each of these tasks is small. Collectively, across 20 environments, they fill the platform team's week.
| Dimension | Build (DIY) | Buy (control plane) |
|---|---|---|
| Initial cost | 2–4 weeks engineering | Days to connect; $790–$2,490/mo |
| Per-environment overhead | ~30 min setup, ongoing maintenance | Tag-based discovery — zero per-env config |
| Schedule logic change | Update N Lambda functions | Change once, applies fleet-wide |
| Terraform requirement | No change | No change — reads existing state via tags |
| Right for | <5 environments | 10+ environments |
Here's what the build path looks like for one environment — a business-hours schedule that stops ECS services at 8pm and starts them at 8am:
    # Stop all services in dev environment at 8pm Mon–Fri.
    # Note: EventBridge rule cron expressions are evaluated in UTC.
    resource "aws_cloudwatch_event_rule" "stop_dev" {
      name                = "stop-dev-environment"
      schedule_expression = "cron(0 20 ? * MON-FRI *)"
    }

    resource "aws_cloudwatch_event_target" "stop_dev" {
      rule = aws_cloudwatch_event_rule.stop_dev.name
      arn  = aws_lambda_function.ecs_scaler.arn

      # Payload handed to the Lambda function on each invocation.
      input = jsonencode({
        cluster     = "dev"
        action      = "stop"
        environment = "dev"
      })
    }

    # Start all services at 8am Mon–Fri
    resource "aws_cloudwatch_event_rule" "start_dev" {
      name                = "start-dev-environment"
      schedule_expression = "cron(0 8 ? * MON-FRI *)"
    }

    resource "aws_cloudwatch_event_target" "start_dev" {
      rule = aws_cloudwatch_event_rule.start_dev.name
      arn  = aws_lambda_function.ecs_scaler.arn

      input = jsonencode({
        cluster     = "dev"
        action      = "start"
        environment = "dev"
      })
    }

    # Plus: the Lambda function, its IAM role, the permission to invoke it,
    # and the logic to iterate over every service in the cluster.
    # Multiply everything above by the number of environments you have.

That's ~50 lines of Terraform for one environment. At 20 environments, you have 20 copies of this, each with its own EventBridge rules, Lambda permissions, and IAM roles. When you want to change the stop time from 8pm to 7pm, you update 20 files. When you add a new environment, you copy-paste and rename. This is the maintenance burden that accumulates.
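The closing comment in the snippet mentions the permission to invoke the Lambda function and its IAM role without showing them. A minimal sketch of those two pieces follows. It builds on the `stop_dev` and `start_dev` rules and the `ecs_scaler` function from the snippet. `aws_iam_role.ecs_scaler` is a hypothetical name for the function's execution role, which the snippet does not define.

```hcl
# EventBridge needs explicit permission to invoke the function, one grant per rule.
resource "aws_lambda_permission" "allow_stop_dev" {
  statement_id  = "AllowStopDevRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ecs_scaler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_dev.arn
}

resource "aws_lambda_permission" "allow_start_dev" {
  statement_id  = "AllowStartDevRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ecs_scaler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_dev.arn
}

# What the function itself must be allowed to do: find the services and
# change their desired count. Resources are left broad in this sketch.
data "aws_iam_policy_document" "ecs_scaler" {
  statement {
    effect    = "Allow"
    actions   = ["ecs:ListServices", "ecs:UpdateService"]
    resources = ["*"]
  }
}

# Assumption: aws_iam_role.ecs_scaler is the function's execution role (hypothetical name).
resource "aws_iam_role_policy" "ecs_scaler" {
  name   = "ecs-scaler-dev"
  role   = aws_iam_role.ecs_scaler.id
  policy = data.aws_iam_policy_document.ecs_scaler.json
}
```

Stopping sets each service's desired count to 0. Starting requires the function to know which count to restore, a detail this sketch leaves to the function's own logic.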
The decision rule is simple: if you have fewer than 5 environments, build. The DIY approach is fast, cheap, and fits the problem. If you have 10+ environments and your platform team is spending meaningful time on operational maintenance instead of platform infrastructure, the math shifts.
For a full breakdown of ECS environment scheduling options — AWS-native approaches, EventBridge rules, and what each costs to maintain — see ECS Environment Scheduling: The Complete Guide.
Do you need a full IDP?
A Backstage portal or Humanitec-style platform orchestrator solves developer-experience problems across a whole org. Most ECS teams at 10–50 environments have an operations problem, not a portal problem.
A full Internal Developer Platform — Backstage, Port, Humanitec, Cortex — is the right answer when you have 50+ engineers across multiple platforms (AWS, GCP, Azure, Kubernetes), a dedicated platform team of 5+, and a mandate to standardize how all of engineering provisions and deploys. That's a real problem and these are real solutions.
For a team of 30–150 people running primarily ECS Fargate with 1–3 platform engineers, an IDP is almost certainly overkill. The build cost alone — Backstage requires weeks of customization before it reflects your actual stack — is disproportionate to the problem. Developers aren't asking for a software catalog. They're asking to restart their own staging environment without opening a Slack ticket.
IDP vs operational layer — when each fits
Full IDP is right when:
- 50+ engineers across multiple platforms
- Dedicated platform team of 5+
- Multi-cloud or mixed Kubernetes + ECS
- Mandate to standardize all of engineering
- 3–6 month implementation timeline is acceptable
Operational layer is right when:
- Primarily or entirely ECS Fargate
- 1–3 platform engineers
- 10–80 environments growing over time
- Terraform stays source of truth
- Need results in days, not months
One tool worth flagging: AWS Proton was positioned as a platform engineering layer for ECS teams. AWS closed it to new customers in October 2025, so if you were evaluating Proton, it's off the table. See the AWS Proton deprecation guide for migration options.
The middle ground that covers 90% of ECS platform engineering needs at 10–50 environments: a lightweight operational layer (scheduling + self-service + fleet visibility) on top of Terraform + your existing CI/CD pipeline. No new abstraction layers. No proprietary config formats. No 6-month implementation.
Questions this raises
- How do I convince my CTO we need a platform team?
- What's the difference between platform engineering and DevOps?
- When should an ECS team actually build a developer portal?
10+ ECS environments and a 2-person platform team?
That's exactly who Fortem is built for. We'll show you what fleet operations looks like for your specific stack — scheduling, self-service, visibility — without replacing Terraform.