Who Restarted Prod?
How to Find It in CloudTrail
Your ECS service restarted. Or a task was manually stopped. Or desiredCount dropped to zero and nobody admits it. The ECS console shows WHAT happened — not WHO. CloudTrail has the answer, and three CLI commands get you there in under two minutes.
- 01 CloudTrail captures every ECS API call (UpdateService, StopTask, RunTask, RegisterTaskDefinition) with who, when, and from where.
- 02 Event History is free for the last 90 days. Three CLI commands find the culprit in under 2 minutes.
- 03 The userIdentity field tells you human vs CI/CD vs AWS service. Root account activity in ECS is always suspicious.
- 04 Download the skill file: an AI agent runs the full fleet audit and produces a structured report automatically.
Why the ECS events tab doesn't tell you who did it
ECS events show WHAT happened — "service updated", "task stopped" — but not WHO. The userIdentity lives in CloudTrail, not in the ECS console. That's the gap most teams waste an hour trying to bridge.
You open the ECS service page. Under Events: "service my-api has started 1 tasks" at 14:23, "service my-api has stopped 1 running tasks" at 14:21. Something stopped your service and triggered a redeploy. The ECS console stops there — it doesn't record the API caller, the IAM identity, or whether it was a human clicking the console or Terraform applying a change.
Three commands to find the culprit in under 2 minutes
aws cloudtrail lookup-events with AttributeKey=EventName filters to specific actions. Pipe through jq to extract userIdentity.userName, eventTime, and sourceIPAddress. Covers the last 90 days at no charge.
# Who stopped a task? Lists StopTask events from Event History (last 90 days)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=StopTask \
  --query 'Events[*].CloudTrailEvent' \
  --output text | \
jq -r '. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  from: .sourceIPAddress,
  via: .userAgent,
  task: .requestParameters.task
}'

# Who changed a service? Lists UpdateService events with the requested desiredCount
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateService \
  --query 'Events[*].CloudTrailEvent' \
  --output text | \
jq -r '. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  via: .userAgent,
  service: .requestParameters.service,
  desiredCount: .requestParameters.desiredCount
}'

# Find everything a specific IAM user did in the last 24h
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=john.smith \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-24H +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Events[*].{Time:EventTime,Event:EventName}' \
  --output table

lookup-events is capped at 2 requests/second per account per region. If you're scripting across many event types, add a 0.5s sleep between calls or use --next-token for pagination. Max 50 events per request; paginate if you need more.

Which ECS events map to which actions
UpdateService = scale change or deployment. StopTask = manual kill. RegisterTaskDefinition = new image or config. RunTask = standalone task launch. Each has a different userIdentity pattern worth knowing.
The most ambiguous one is StopTask. It appears in CloudTrail when a human manually stops a task, when a script does it, and when ECS itself stops a task during a rolling deployment. Check userIdentity.invokedBy — if it says ecs.amazonaws.com, ECS triggered the stop internally during service orchestration, not a human.
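The invokedBy check can be sketched in a few lines of Python. This is a minimal illustration, assuming each record has already been parsed from the CloudTrailEvent JSON string into a dict; the function name stop_task_origin and the sample records are hypothetical, and real events carry many more fields.

```python
def stop_task_origin(event: dict) -> str:
    """Classify a StopTask record by who initiated it (simplified heuristic)."""
    identity = event.get("userIdentity", {})
    invoked_by = identity.get("invokedBy", "")
    if invoked_by == "ecs.amazonaws.com":
        # ECS stopped the task itself, e.g. during a rolling deployment.
        return "ecs-orchestration"
    if invoked_by:
        # Some other AWS service made the call on someone's behalf.
        return f"aws-service:{invoked_by}"
    # No invokedBy: a user, role, or script called the API directly.
    return "direct-caller"


# Hypothetical, trimmed-down sample records.
events = [
    {"eventName": "StopTask",
     "userIdentity": {"type": "AWSService", "invokedBy": "ecs.amazonaws.com"}},
    {"eventName": "StopTask",
     "userIdentity": {"type": "IAMUser", "userName": "john.smith"}},
]

needs_follow_up = [e for e in events if stop_task_origin(e) == "direct-caller"]
print(len(needs_follow_up))  # 1
```

Filtering out the orchestration stops first usually leaves a short list of direct callers to look at.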
Decoding userIdentity: human, CI/CD, or AWS service
userIdentity.type tells you who called the API: IAMUser = usually a human with long-lived credentials, AssumedRole = CI/CD, Lambda, or a federated/SSO user, AWSService = autoscaler or ECS itself. Root type should never appear in ECS; alert immediately if it does.
The tricky one is AssumedRole. When a GitHub Actions pipeline runs aws ecs update-service, the CloudTrail event shows type: AssumedRole and the ARN of the role. The human-readable role name is in sessionContext.sessionIssuer.userName. That's the field to surface in your audit report — not the full ARN.
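As a rough sketch of that decoding, the Python function below maps a parsed CloudTrail record to a category and a short display name. It mirrors the jq filter used in the CLI commands; the function name describe_actor and the sample record (including the account ID and session name) are made up for illustration.

```python
def describe_actor(event: dict) -> tuple[str, str]:
    """Return (category, display_name) for a parsed CloudTrail record."""
    identity = event.get("userIdentity", {})
    id_type = identity.get("type", "Unknown")

    if id_type == "IAMUser":
        return ("iam-user", identity.get("userName", "unknown"))
    if id_type == "AssumedRole":
        issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
        # Prefer the short role name; fall back to the full ARN if it is absent.
        return ("assumed-role", issuer.get("userName") or identity.get("arn", "unknown"))
    if id_type == "AWSService":
        return ("aws-service", identity.get("invokedBy", "unknown"))
    if id_type == "Root":
        return ("root", "root")
    return ("other", id_type)


# Hypothetical record shaped like a pipeline-driven UpdateService call.
sample = {
    "eventName": "UpdateService",
    "userIdentity": {
        "type": "AssumedRole",
        "arn": "arn:aws:sts::111122223333:assumed-role/deploy-prod/GitHubActions",
        "sessionContext": {"sessionIssuer": {"userName": "deploy-prod"}},
    },
}
print(describe_actor(sample))  # ('assumed-role', 'deploy-prod')
```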
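To distinguish console vs CLI vs Terraform, use the userAgent field:
- console.amazonaws.com: a change made in the AWS console
- aws-cli/2.x: a human or script using the AWS CLI
- Terraform/1.x.x (+https://www.terraform.io) terraform-provider-aws/5.x.x: a Terraform apply
If userIdentity.type is Root, stop everything else and investigate. Root credentials should never be used for routine ECS operations. A Root call in CloudTrail means either someone is using the root account directly (a security failure) or credentials were compromised.
Alerting in real time: EventBridge rule for critical ECS changes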
EventBridge can trigger a notification within seconds of a StopTask or UpdateService call, before you notice the incident. Two Terraform resources (a rule and a target) set it up, assuming you already have an SNS topic to publish to.
Searching CloudTrail after an incident is reactive. EventBridge makes it proactive: you define a rule that matches specific CloudTrail events, and EventBridge triggers an SNS notification, Lambda, or Slack webhook immediately when the event occurs. For teams running 10+ ECS environments, catching a DeleteService before the on-call rotation starts saves significant incident response time.
resource "aws_cloudwatch_event_rule" "ecs_critical" {
  name        = "ecs-critical-changes"
  description = "Alert on destructive or suspicious ECS API calls"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ecs.amazonaws.com"]
      eventName = [
        "StopTask",
        "DeleteService",
        "DeleteCluster",
        "UpdateService"
      ]
    }
  })
}

# Assumes aws_sns_topic.alerts is defined elsewhere and that its topic policy
# allows events.amazonaws.com to publish.
resource "aws_cloudwatch_event_target" "ecs_critical_sns" {
  rule      = aws_cloudwatch_event_rule.ecs_critical.name
  target_id = "SendToSNS"
  arn       = aws_sns_topic.alerts.arn

  input_transformer {
    input_paths = {
      event   = "$.detail.eventName"
      who     = "$.detail.userIdentity.sessionContext.sessionIssuer.userName"
      time    = "$.time"
      service = "$.detail.requestParameters.service"
    }
    # The template must itself be a valid JSON string, hence the escaped quotes.
    input_template = "\"ECS alert: <event> on <service> by <who> at <time>\""
  }
}

For UpdateService, add a second rule specifically for scale-to-zero: filter where requestParameters.desiredCount = 0. That's the most common accidental incident: someone running a cleanup script that hits the wrong environment.
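One way the scale-to-zero pattern might look is sketched below in Python, so it can be checked locally before it goes into a rule. It assumes desiredCount appears as a number under requestParameters in the CloudTrail record. The matches helper is a hypothetical, heavily simplified stand-in for EventBridge matching that handles only exact-value lists and nested objects, not EventBridge's real engine.

```python
import json

# Event pattern: UpdateService calls that set desiredCount to exactly 0.
SCALE_TO_ZERO_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ecs.amazonaws.com"],
        "eventName": ["UpdateService"],
        "requestParameters": {"desiredCount": [0]},
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def sample_event(desired_count: int) -> dict:
    """Build a hypothetical, trimmed-down EventBridge event for testing."""
    return {
        "source": "aws.ecs",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {
            "eventSource": "ecs.amazonaws.com",
            "eventName": "UpdateService",
            "requestParameters": {"service": "my-api", "desiredCount": desired_count},
        },
    }


print(matches(SCALE_TO_ZERO_PATTERN, sample_event(0)))  # True
print(matches(SCALE_TO_ZERO_PATTERN, sample_event(2)))  # False
print(json.dumps(SCALE_TO_ZERO_PATTERN, indent=2))      # JSON for the rule's event pattern
```

The printed JSON has the same shape that jsonencode produces for event_pattern in the Terraform rule, so the two can be compared directly.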
The Oct 2025 addition: ECS CloudTrail data events
Since October 2025, ECS supports CloudTrail data events for ContainerInstance agent API activity (ecs:Poll, ecs:StartTelemetrySession). These aren't in Event History — they require a CloudTrail trail or CloudTrail Lake.
AWS management events (UpdateService, StopTask, etc.) are what most teams need for incident response. The October 2025 addition is different: ECS now supports CloudTrail data events for ContainerInstance agent API calls — the low-level polling activity between the ECS agent and the control plane.
For most ECS Fargate teams, data events aren't needed for incident response — management events cover UpdateService and StopTask which is where incidents come from. Data events matter if you run EC2 launch type and need to audit ContainerInstance registration activity, or if compliance requires a full record of agent-to-control-plane communication. Enable them only if you have a specific requirement — at scale, ContainerInstance polling events generate significant volume and cost. Details in the ECS CloudTrail logging docs.
Download the skill file — let the AI agent do the audit
The skill file instructs an AI agent to pull all critical ECS CloudTrail events from the last 24 hours across every cluster in your account and produce a structured "who did what" report. Read-only — no changes applied.
The agent lists all clusters, runs lookup-events for each critical event type, decodes the userIdentity, and produces a structured output: "Service X was updated at HH:MM by role deploy-prod via GitHub Actions from IP 140.82.114.3." It also flags Root account activity, unexpected source IPs, and scale-to-zero incidents. For teams where "who did this?" is a recurring post-incident question, this is the 2-minute version of the 20-minute manual process.
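The flagging step could look roughly like the Python sketch below. It is not the skill file's actual logic; the function name audit_flags, the flag labels, and the allowed_ips allowlist are assumptions made for illustration, and the sample IP 203.0.113.9 is a documentation address.

```python
def audit_flags(event: dict, allowed_ips: set[str]) -> list[str]:
    """Return the audit flags raised by one parsed CloudTrail record."""
    identity = event.get("userIdentity", {})
    params = event.get("requestParameters") or {}  # may be null in real records
    flags = []

    if identity.get("type") == "Root":
        flags.append("ROOT")

    if event.get("eventName") == "UpdateService" and params.get("desiredCount") == 0:
        flags.append("SCALE_TO_ZERO")

    source_ip = event.get("sourceIPAddress", "")
    # Calls made by AWS services carry a service hostname instead of an IP.
    is_aws_service = source_ip.endswith(".amazonaws.com")
    if source_ip and not is_aws_service and source_ip not in allowed_ips:
        flags.append("UNEXPECTED_IP")

    return flags


# Hypothetical record: a scale-to-zero from an address outside the allowlist.
suspicious = {
    "eventName": "UpdateService",
    "sourceIPAddress": "203.0.113.9",
    "userIdentity": {"type": "IAMUser", "userName": "john.smith"},
    "requestParameters": {"service": "my-api", "desiredCount": 0},
}
print(audit_flags(suspicious, allowed_ips={"140.82.114.3"}))
# ['SCALE_TO_ZERO', 'UNEXPECTED_IP']
```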
"To identify the user who initiates a StopTask API call, view StopTask in AWS CloudTrail for userIdentity information."
— AWS Knowledge Center: Troubleshoot running task count changes in ECS
FAQ
If you read this, you might also want to know
Can I search CloudTrail events older than 90 days?
Not with lookup-events — it only covers the last 90 days. For older events, you need a CloudTrail trail delivering to S3. Query the S3 bucket with Athena using the cloudtrail_logs partition table, or use CloudTrail Lake if you enabled it. Both options incur additional costs: S3 storage + Athena query costs, or CloudTrail Lake ingestion charges.
How do I tell if a change was made by Terraform vs a human?
Check the userAgent field in the CloudTrail event. Terraform calls show 'Terraform/1.x.x (+https://www.terraform.io) terraform-provider-aws/5.x.x'. A human via CLI shows 'aws-cli/2.x'. The AWS console shows 'console.amazonaws.com'. This works even when both Terraform and a human share the same IAM role — the userAgent tells them apart.
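A small Python sketch of that check follows. It uses simple substring and prefix tests on the three userAgent shapes quoted above; the function name classify_user_agent and the returned labels are hypothetical, and real userAgent strings contain extra tokens that a production classifier would need to handle more carefully.

```python
def classify_user_agent(user_agent: str) -> str:
    """Guess the tool behind a CloudTrail userAgent string (heuristic)."""
    ua = (user_agent or "").lower()
    if "terraform" in ua:
        return "terraform"
    if ua.startswith("aws-cli/"):
        return "aws-cli"
    if "console.amazonaws.com" in ua:
        return "console"
    return "other"


for ua in (
    "Terraform/1.x.x (+https://www.terraform.io) terraform-provider-aws/5.x.x",
    "aws-cli/2.x",
    "console.amazonaws.com",
):
    print(classify_user_agent(ua))
# terraform
# aws-cli
# console
```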
What if the ECS event was triggered by autoscaling — does it show in CloudTrail?
Yes — Application Auto Scaling calling UpdateService appears in CloudTrail with userIdentity.type = AWSService and invokedBy = application-autoscaling.amazonaws.com. You can distinguish autoscaling actions from human actions by filtering on invokedBy. This is important when investigating 'who scaled my service' — it might be auto scaling doing its job, not a person.
Can I set up a CloudTrail alert that fires before the on-call gets paged?
Yes. The EventBridge rule described above fires within seconds of the CloudTrail event, which is typically 1-2 minutes after the API call. EventBridge → SNS → PagerDuty (or directly to your alerting platform) gives you a notification before the monitoring system catches the downstream effects. For DeleteService or scale-to-zero, this is the difference between 1-minute and 5-minute detection.
Stop hunting CloudTrail
at 2am.
Book 20 minutes — we'll show you what Fortem surfaces across your ECS environments so you know who changed what before your monitoring even fires.