The lights in the data center go red at 2:14 a.m. Your phone buzzes. You are the on‑call engineer staring at an alert fired from Domino Data Lab’s compute cluster. PagerDuty is already calling, routing, and escalating before the logs finish syncing. That is what controlled chaos looks like when managed well.
Domino Data Lab powers serious ML workloads: model training, experiment tracking, reproducible notebooks. PagerDuty orchestrates human response to machine trouble. Together they turn “what just broke?” into “who already owns it?” with less coffee spilled in between. Integration isn’t just convenience. It is how teams enforce accountability across sprawling data infrastructure.
At its core, Domino schedules compute jobs, spins up containers, and logs metrics. PagerDuty sits one layer higher, listening for incidents from Domino’s event stream or monitoring tools like Prometheus. When thresholds trip, PagerDuty routes alerts based on schedules, escalation policies, and business impact. The connection is simple: Domino emits, PagerDuty decides, humans react.
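The "Domino emits, PagerDuty decides" half can be sketched with the PagerDuty Events API v2 envelope. This is a minimal illustration, not Domino's built-in emitter; the routing key is a placeholder for a real integration key from the PagerDuty service you want alerts routed to.

```python
# Placeholder: a real integration key comes from the target PagerDuty service.
ROUTING_KEY = "YOUR_PAGERDUTY_INTEGRATION_KEY"

def build_trigger_event(summary: str, source: str, severity: str = "error") -> dict:
    """Shape a failure into a PagerDuty Events API v2 trigger payload."""
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # human-readable: what broke
            "source": source,      # emitting node or cluster
            "severity": severity,  # critical | error | warning | info
        },
    }

event = build_trigger_event("Training job failed on node gpu-07", "domino-compute")
# Actually sending it is a single POST (requires the `requests` package):
# requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
```

From there, PagerDuty's routing rules and escalation policies take over; the emitter's only job is to produce a well-formed event.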
How the integration works
Domino Data Lab sends job or node failures through a webhook that PagerDuty consumes. Each event carries metadata such as project, owner, and runtime context. PagerDuty transforms that payload into an incident, applies routing logic, and triggers the appropriate responder channel—Slack, email, or a direct mobile alert. The cycle completes when resolution updates flow back, automatically closing the loop inside Domino’s UI so teams can track mean time to recovery without a spreadsheet audit.
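A rough sketch of that transformation step, under stated assumptions: the field names here (`project`, `owner`, `run_id`, `status`, `node`) are illustrative, not Domino's actual webhook schema. A stable deduplication key derived from the project and run lets a later success event resolve the same incident, which is the loop-closing behavior described above.

```python
import hashlib

def domino_webhook_to_event(payload: dict, routing_key: str) -> dict:
    """Translate a Domino-style webhook payload into a PagerDuty event.

    Assumed payload fields: project, owner, run_id, status, node, runtime.
    The real Domino webhook schema may differ.
    """
    # Stable dedup key: retries or recovery of the same run hit one incident.
    dedup = hashlib.sha256(
        f"{payload['project']}:{payload['run_id']}".encode()
    ).hexdigest()[:32]
    action = "resolve" if payload["status"] == "succeeded" else "trigger"
    event = {"routing_key": routing_key, "event_action": action, "dedup_key": dedup}
    if action == "trigger":
        event["payload"] = {
            "summary": f"{payload['project']}: run {payload['run_id']} {payload['status']}",
            "source": payload.get("node", "domino"),
            "severity": "error",
            # Carry the metadata responders need straight into the incident.
            "custom_details": {
                "owner": payload["owner"],
                "runtime": payload.get("runtime", {}),
            },
        }
    return event

failure = {"project": "churn-model", "owner": "dana", "run_id": "r-8841",
           "status": "failed", "node": "gpu-07"}
incident = domino_webhook_to_event(failure, "KEY")
recovery = domino_webhook_to_event({**failure, "status": "succeeded"}, "KEY")
```

Because `incident` and `recovery` share a dedup key, the resolve event closes the incident the failure opened, so MTTR tracking falls out of the event stream for free.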
Best practices that keep it sane
Tie PagerDuty service keys to project namespaces, not individuals. Map Domino’s role-based access control (RBAC) directly to PagerDuty escalation paths to preserve the audit trail. Rotate tokens the same way you manage AWS IAM credentials. Keep alert titles structured so responders can parse them at a glance.
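Two of those practices, namespace-scoped routing and structured titles, fit in a few lines. The namespace-to-key mapping and the title format below are hypothetical conventions, not anything Domino or PagerDuty prescribes.

```python
# Hypothetical mapping: Domino project namespaces to PagerDuty service
# integration keys, so alerts route to teams rather than individuals.
SERVICE_KEYS = {
    "fraud-models": "KEY_FRAUD_TEAM",
    "churn-models": "KEY_GROWTH_TEAM",
}

def routing_key_for(namespace: str) -> str:
    """Resolve a project namespace to its team's PagerDuty service key."""
    try:
        return SERVICE_KEYS[namespace]
    except KeyError:
        # Fail loudly on unmapped namespaces instead of silently dropping alerts.
        raise LookupError(f"No PagerDuty service mapped for namespace {namespace!r}")

def alert_title(namespace: str, job: str, reason: str) -> str:
    """Structured title a responder can parse at a glance."""
    return f"[{namespace}] {job}: {reason}"

title = alert_title("fraud-models", "train-v3", "OOM on gpu-02")
key = routing_key_for("fraud-models")
```

Keeping the mapping in version-controlled config, next to the RBAC definitions it mirrors, is what preserves the audit trail when teams reorganize.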