Your service just spiked to 90% CPU and alarms are screaming. You pop open the dashboard, but metrics take forever to load. This is where a good ECS Prometheus setup either saves your day or ruins it entirely. The combination looks simple—Amazon ECS runs your containers, Prometheus scrapes their metrics—but getting them to cooperate securely and reliably takes some care.
ECS handles the orchestration. It schedules and scales containers across your compute fleet. Prometheus does the watching. It pulls metrics from targets and keeps time-series data so you can graph, alert, and troubleshoot. Put the two together and you get live observability for container workloads without duct-tape scripts or random sidecars.
The workflow starts with service discovery. Prometheus needs to know which ECS tasks exist and where they live, but it ships with no built-in ECS discovery mechanism. The usual pattern is a small discovery process that calls the ECS APIs (ListTasks and DescribeTasks), writes the resulting endpoints to a file, and lets Prometheus pick them up through file-based service discovery (`file_sd_configs`). Next come IAM permissions. The discovery process needs a task IAM role whose policy allows listing and describing ECS tasks; using the task role keeps credentials out of images and environment variables. When metrics are fetched successfully, they flow into Prometheus's time-series store and can be queried from tools like Grafana or routed to Alertmanager.
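To make the discovery step concrete, here is a minimal sketch of the translation it performs: turning the `tasks` list from an ECS `DescribeTasks` response into target groups in the JSON shape `file_sd_configs` expects. The field names follow the real `DescribeTasks` response for `awsvpc`-mode tasks; the metrics port, the helper name, and the label names are assumptions for illustration, and a real sidecar would also call the AWS APIs and write the JSON to disk on a timer.

```python
import json

METRICS_PORT = 9090  # assumed port your containers expose /metrics on


def tasks_to_file_sd(tasks, port=METRICS_PORT):
    """Convert ECS DescribeTasks 'tasks' entries into file_sd target groups."""
    groups = []
    for task in tasks:
        # For awsvpc tasks, the private IP lives in the ENI attachment details.
        for attachment in task.get("attachments", []):
            if attachment.get("type") != "ElasticNetworkInterface":
                continue
            ip = next(
                (d["value"] for d in attachment.get("details", [])
                 if d.get("name") == "privateIPv4Address"),
                None,
            )
            if ip is None:
                continue
            groups.append({
                "targets": [f"{ip}:{port}"],
                "labels": {
                    # Task metadata becomes labels Prometheus can relabel on.
                    "ecs_cluster": task.get("clusterArn", ""),
                    "ecs_task_definition": task.get("taskDefinitionArn", ""),
                },
            })
    return groups


if __name__ == "__main__":
    sample_tasks = [{
        "clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/demo",
        "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/web:7",
        "attachments": [{
            "type": "ElasticNetworkInterface",
            "details": [{"name": "privateIPv4Address", "value": "10.0.1.5"}],
        }],
    }]
    print(json.dumps(tasks_to_file_sd(sample_tasks), indent=2))
```

Prometheus re-reads the generated file on its own schedule, so the discovery loop and the scrape loop stay decoupled.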
Run this in production and you quickly learn two lessons. First, missing IAM permissions fail silently: the discovery call returns nothing, the target list goes empty, and no scrape error ever fires. Second, aggressive scrape intervals against short-lived tasks inflate load and churn out stale series faster than they age away. The fix is to lengthen scrape intervals to what your alerting actually needs and enforce least-privilege roles. Keep EC2 instance metrics in a separate job from container metrics, and define explicit relabel rules that tag each target with its service, version, and environment. That keeps dashboards readable and alerts meaningful.
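The tuning above maps onto a few lines of scrape configuration. This is a sketch, not a drop-in file: the job name, file path, interval values, and the `my-service` filter are assumptions, and the `ecs_*` source labels presume your discovery process attached them to each target.

```yaml
scrape_configs:
  - job_name: ecs-tasks            # container metrics only; EC2 node metrics get their own job
    scrape_interval: 30s           # longer interval eases load from short-lived tasks
    file_sd_configs:
      - files:
          - /etc/prometheus/ecs-targets/*.json   # written by the discovery process
        refresh_interval: 60s
    relabel_configs:
      # Keep only targets belonging to the service this job should watch.
      - source_labels: [ecs_task_definition]
        regex: .*my-service.*      # hypothetical service filter
        action: keep
      # Promote discovery metadata to a stable, readable label.
      - source_labels: [ecs_cluster]
        target_label: cluster
```

Because `keep` rules drop targets before the first scrape, they are also the cheapest place to enforce the service/environment separation mentioned above.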
Five tangible benefits of a clean ECS Prometheus integration: