I quit my job after deleting a Kubernetes namespace in production
I'm staring at my screen in disbelief. I've just deleted a production Kubernetes namespace, and now millions of users can't make payments. How did I get here? And more importantly, how do we prevent this from happening again?
SSH's Staying Power
As I frantically try to restore the namespace, I wonder why we're still using SSH. It persists because the alternatives aren't ten times better. Solving SSO or recording sessions on its own doesn't drive adoption, and a secure connection with well-managed passwords works fine for most teams. Passwords are flawed, but they're familiar. The passwordless transition hasn't even happened for web apps, so it's unlikely to happen here just for the sake of safety. Switching requires an experience that is ten times better.
Beyond Secure Connections
While waiting for systems to come back online, I start thinking about improvements beyond secure connections. The problem seems daunting at first, but as I look at the aftermath of my mistake, four critical areas emerge:
- Authorization
- Automation
- Collaboration
- Reliability and security guard-rails
Breaking Down Challenges
Authorization
As I regain access to our systems, I consider authorization. It asks: can this person perform this action on this resource? It boils down to reads and writes.
Reads: We need to limit access to sensitive data. General data access isn't a problem; in fact, we want to encourage it.
Writes: I realize how much time our security teams sink into writing policies, work that sometimes falls on developers too. At cloud scale, true least-access policies are impossible to maintain, so we should move away from resource-by-resource policies toward an intent-based approach, with granular permissions for the different kinds of writes (append, modify, delete), as sketched below.
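Here is a minimal sketch, in Go, of what an intent-based write check could look like. The WriteIntent classification and the allow rules are illustrative assumptions, not an existing policy engine:

```go
package main

import "fmt"

// WriteIntent classifies a write by its blast radius instead of by resource path.
type WriteIntent int

const (
	Append WriteIntent = iota // adds new data; nothing existing is touched
	Modify                    // changes existing data in place
	Delete                    // destroys data or infrastructure
)

// Request is a simplified access request: who wants to do what, and where.
type Request struct {
	User   string
	Intent WriteIntent
	Target string // e.g. "k8s/namespace/payments"
}

// allow is a hypothetical intent-based check: appends are open, modifications
// need a role, and deletes always need a second approval.
func allow(r Request, hasRole, approved bool) bool {
	switch r.Intent {
	case Append:
		return true
	case Modify:
		return hasRole
	case Delete:
		return hasRole && approved
	default:
		return false
	}
}

func main() {
	req := Request{User: "me", Intent: Delete, Target: "k8s/namespace/payments"}
	fmt.Println("allowed:", allow(req, true, false)) // false: no approval yet
}
```

The point is that three intent classes cover most of what a resource-by-resource policy tries to express, at a fraction of the maintenance cost.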
Automation
As I manually restore services, I think about automation. Any action repeated across similar resources should be automated, but automation itself takes time: CI/CD and rollout pipelines need extensive building and ongoing maintenance, and the sheer number of technologies in play makes complete automation practically impossible. The core idea, one action fanned out over similar resources, is sketched below.
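As a sketch of that fan-out idea, here is a small Go helper that applies one action to a list of similar resources, with a dry-run guard. Resource, Action, and applyAll are hypothetical names used only for illustration:

```go
package main

import "fmt"

// Resource is any target the same action applies to: a service, a queue, a table.
type Resource struct {
	Name string
	Kind string
}

// Action is the operation we would otherwise repeat by hand in each console.
type Action func(Resource) error

// applyAll fans one action out over similar resources, with a dry-run guard so
// the automation can be inspected before it touches anything.
func applyAll(resources []Resource, act Action, dryRun bool) {
	for _, r := range resources {
		if dryRun {
			fmt.Printf("would run on %s/%s\n", r.Kind, r.Name)
			continue
		}
		if err := act(r); err != nil {
			fmt.Printf("failed on %s/%s: %v\n", r.Kind, r.Name, err)
		}
	}
}

func main() {
	restart := func(r Resource) error { fmt.Println("restarting", r.Name); return nil }
	services := []Resource{{Name: "payments", Kind: "deployment"}, {Name: "checkout", Kind: "deployment"}}
	applyAll(services, restart, true) // dry-run first, then flip to false
}
```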
Collaboration
The incident response team joins me, and I'm grateful for the oversight. Critical environments need it, along with peer reviews, but that isn't enough. I keep wondering: why was no one notified when I executed that command? And why didn't the client itself warn me that I was about to perform a high-impact deletion? A guard like the one sketched below would have caught it.
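A rough sketch of the guard I wished for: intercept the command, flag anything high-impact, notify the team, and demand explicit confirmation. The regex and the notifyTeam stub are assumptions for illustration, not a feature of any real client:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// highImpact matches commands that destroy a whole scope; the patterns are illustrative.
var highImpact = regexp.MustCompile(`delete\s+(namespace|database|bucket)\b`)

// notifyTeam stands in for posting to a Slack channel or an audit log.
func notifyTeam(user, command string) {
	fmt.Printf("[notify] %s is about to run: %q\n", user, command)
}

// guard lets routine commands pass, but for high-impact ones it notifies the
// team and demands an explicit "yes" before execution.
func guard(user, command string) bool {
	if !highImpact.MatchString(command) {
		return true
	}
	notifyTeam(user, command)
	fmt.Print("High-impact command detected. Type 'yes' to continue: ")
	answer, _ := bufio.NewReader(os.Stdin).ReadString('\n')
	return strings.TrimSpace(answer) == "yes"
}

func main() {
	cmd := "kubectl delete namespace payments"
	if guard(os.Getenv("USER"), cmd) {
		fmt.Println("confirmed, executing:", cmd)
	} else {
		fmt.Println("aborted")
	}
}
```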
Reliability and Security Guard-Rails
As we stabilize the system, I think about guard-rails. They have always been hard to build, so teams default to denying access instead, and access almost always gets tightened after a production incident like this one. The blameless-culture philosophy exists, but "everyone has a plan until they get punched in the face." Most companies won't stay blameless, because revoking access costs less than building automation.
Solving Problems Inside Secure Connections
As the dust settles, I start thinking about solutions. We need to shift our perspective on secure connections: all the data we need already flows through them, so we should be able to extract insight from it without compromising security. The missing piece is an easy way to tap into that stream.
I remember struggling with packet manipulation in the past. It has always been hard, and the sheer number of protocols made doing it generically nearly impossible. What we need are protocol-agnostic packet analysis tools and standardized interfaces for cross-protocol packet manipulation.
Middlewares catch my attention. We could act as a "benevolent MITM," enhancing security and functionality without compromising encryption, by attaching custom logic at the connection layer the way API gateways and reverse proxies already do. Why not do the same for non-HTTP protocols? We should design protocol-specific middleware for common infrastructure protocols (SSH, RDP, SFTP) and enforce policy at that middleware layer, as sketched below.
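To make the middleware idea concrete, here is a minimal Go sketch of policy enforcement at that layer. It assumes the proxy has already terminated the session and sees plaintext commands; the Handler and Middleware types and the deny rule are illustrative, not an existing implementation:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Handler executes one command inside an already-terminated session,
// the way an API gateway handles one HTTP request.
type Handler func(user, command string) (string, error)

// Middleware wraps a Handler with extra behavior, mirroring HTTP middleware chains.
type Middleware func(Handler) Handler

// auditLog records every command before it reaches the backend.
func auditLog(next Handler) Handler {
	return func(user, command string) (string, error) {
		fmt.Printf("[audit] user=%s command=%q\n", user, command)
		return next(user, command)
	}
}

// denyDestructive blocks commands that match a simple deny-list.
// A real policy engine would be far richer; this is just the enforcement point.
func denyDestructive(next Handler) Handler {
	return func(user, command string) (string, error) {
		if strings.Contains(command, "delete namespace") {
			return "", errors.New("blocked by policy: namespace deletion requires review")
		}
		return next(user, command)
	}
}

func main() {
	backend := func(user, command string) (string, error) {
		return "ok: " + command, nil // stand-in for forwarding to the real server
	}
	// Chain: audit first, then policy, then the backend.
	handler := auditLog(denyDestructive(backend))

	if out, err := handler("me", "kubectl delete namespace payments"); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println(out)
	}
}
```

The chain composes exactly like HTTP middleware, which is the point: once a protocol is terminated in one place, the same pattern applies regardless of what the protocol is.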
Thinking about our monitoring gaps, I consider instrumentation. We should apply site-reliability concepts to infrastructure connections themselves: visibility and actionability are game-changers, so we need real-time monitoring and alerting on sessions, plus dashboards of connection metrics. Even something as simple as the sketch below would have flagged my deletion.
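A small sketch of connection instrumentation using the Prometheus Go client: a gauge for open sessions and a counter for commands, labeled by protocol. The metric and label names are my own assumptions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Connection-level metrics: how many sessions are open and how many commands
// flow through them, broken down by protocol. Names and labels are illustrative.
var (
	openSessions = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "infra_open_sessions", Help: "Currently open sessions."},
		[]string{"protocol"},
	)
	commandsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "infra_commands_total", Help: "Commands executed over sessions."},
		[]string{"protocol", "verb"},
	)
)

func main() {
	prometheus.MustRegister(openSessions, commandsTotal)

	// The connection proxy would call these as sessions open and commands pass through.
	openSessions.WithLabelValues("ssh").Inc()
	commandsTotal.WithLabelValues("ssh", "delete").Inc()

	// Expose the metrics for scraping; an alert can then fire on, say, a spike
	// in "delete" commands against production.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```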
For future-proofing, I think about extensibility. We need a publicly available API for adding new protocols: contributing support for a protocol should be simple, and everything else should follow from it. That means a plugin architecture for easy protocol integration and a community-driven repository of protocol implementations, along the lines of the interface sketched below.
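Finally, a sketch of what a protocol plugin contract could look like. The Protocol interface, Command shape, and registry are hypothetical, meant only to show how one small contribution could unlock policy, audit, and metrics for a new protocol:

```go
package main

import "fmt"

// Protocol is the contract a community plugin implements to add support for a
// new wire protocol. These names are hypothetical, not an existing API.
type Protocol interface {
	Name() string
	// Parse turns raw session bytes into a structured command that the policy,
	// audit, and metrics layers can all reason about.
	Parse(raw []byte) (Command, error)
}

// Command is the protocol-agnostic shape everything downstream consumes.
type Command struct {
	Verb   string // e.g. "read", "write", "delete"
	Target string // e.g. "namespace/payments"
}

// registry holds plugins keyed by protocol name.
var registry = map[string]Protocol{}

// Register is called by each plugin, typically from its init() function.
func Register(p Protocol) { registry[p.Name()] = p }

// sshPlugin is a toy implementation; a real one would parse the SSH channel data.
type sshPlugin struct{}

func (sshPlugin) Name() string { return "ssh" }
func (sshPlugin) Parse(raw []byte) (Command, error) {
	return Command{Verb: "exec", Target: string(raw)}, nil
}

func main() {
	Register(sshPlugin{})
	cmd, _ := registry["ssh"].Parse([]byte("kubectl get pods"))
	fmt.Printf("%+v\n", cmd)
}
```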
Given the complexity of what we're facing, I realize the problem is too big to ignore.
From Incident to Innovation
In the weeks following the incident, I couldn't shake the feeling that there was a massive opportunity here. I realized I could make a bigger impact. So, I made a decision that surprised even myself: I quit my job.
Today, Hoop.dev solves most of these problems:
- We've developed a command and query copilot: A system that assists with protocol-specific commands and configurations across languages and systems.
- We've implemented AI data masking: A plug-and-play solution for read authorization that dynamically adjusts masking based on user context and authorization.
- We've set up critical command review: It closes the automation gap, making CI/CD and rollouts transparent and working out of the box for any technology.
- We've created smart runbooks: Version-controlled scripts for any infrastructure, exposed via a simple REST API, with low-code/no-code automation tools.
- We've built an Access IDE: Purpose-built for new access use cases, featuring a collaborative, cloud-based IDE with built-in security checks and real-time impact analysis.
- We've developed a collaborative troubleshooting platform: It enables real-time incident response across multiple protocols.
As I look back at the incident that started this journey, I realize how far we've come. The next time someone accidentally deletes a critical namespace, our systems will be ready. And maybe, just maybe, there won't be a next time.