Rethinking RBAC: Part 1

There are two security features any piece of software implements: authentication and authorization. Authentication systems have evolved a lot over the past years: OpenID, SSO, MFA, magic links. But authorization is the forgotten child of security. Today we use the same authorization system created in the 70s.

AWS IAM. Kubernetes RBAC. Postgres/MySQL roles. All use the same mechanism as the first RBAC systems. The limitations of current RBAC systems block productivity and reduce security. The good news is that fixing them is feasible and unlocks huge potential.

In this three-part series we will explore RBAC systems. Part 1 covers RBAC's limitations and hints at what fixing them could unlock. Part 2 explores solutions in depth: the existing attempts to extend RBAC, with their good and bad parts. Part 3 looks at what is possible if we design a system that addresses all of these problems instead of niche ones. There is a general and simple system, just like traditional RBAC, that solves for most use cases.

What is RBAC

“Ad hoc forms of role-based access control were implemented on many systems beginning in the 1970s.” However, a formal model wasn't proposed until 1992, according to the National Institute of Standards and Technology (NIST).

The roots of RBAC say a lot about its current state. RBAC is the formalization of an existing practice: documentation of the easiest implementation engineers came up with to get the problem out of their way.

image: Example of RBAC policy
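For readers who prefer code to a diagram, here is a minimal sketch of what a traditional RBAC check boils down to. The role names, users, and the `orders-db` resource are hypothetical and exist only for illustration.

```python
# Hypothetical roles: each role maps to a set of (resource, action) pairs.
ROLE_PERMISSIONS = {
    "developer": {("orders-db", "read")},
    "sre": {("orders-db", "read"), ("orders-db", "write")},
}

# Hypothetical users: each user is assigned one or more roles.
USER_ROLES = {
    "alice": {"developer"},
    "bob": {"sre"},
}


def is_allowed(user: str, resource: str, action: str) -> bool:
    """Return True if any role assigned to the user grants the permission."""
    for role in USER_ROLES.get(user, set()):
        if (resource, action) in ROLE_PERMISSIONS.get(role, set()):
            return True
    return False
```

The whole model is two lookup tables and a membership test, which is a big part of why it is so easy to adopt.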

RBAC is simple and does the job. It works well as a base protocol, but we can't be limited to it. As with any other base protocol, like TCP and UDP, the value comes from the applications built on top. Using RBAC as it is today is like writing the binary protocol to connect to a database for every new application instead of using a JDBC lib. Not practical.

But let's start by covering what is good about traditional RBAC.

Simple

RBAC is simple. It supports teams from two to hundreds of thousands of engineers today. A simple system that is able to serve such a wide spectrum of use cases is impressive. Understanding how it works and how to use it is intuitive. It is no surprise that engineers had been building RBAC into their systems long before it was formalized.

Explicit

You understand exactly what a policy is doing by looking at it. RBAC is declarative and explicit. If a policy says it does X, unless someone updates the policy it continues to do X. No external factors can change the behavior. And this is good for troubleshooting problems.

Battle Tested

RBAC was created before the Cloud. And it aged well. Today RBAC is the standard for cloud providers' authorization systems. Even though it was created in the 70s, the same system survived many years of technology change.

Problems

There is no free lunch. RBAC did leave behind a lot by not changing all these years. Yes, it is impressive that it still works, but there are downsides.

The problem RBAC proposes to solve scales in proportion to the number of cloud resources a company uses. And the management of these policies is manual. Companies need to grow the headcount that manages policies (the security team). Many times they don't. So the few security engineers limit people's access just to keep the policies manageable.

RBAC was fine when applications ran on one or two on-prem servers and a database. But now with every app using queues, blob storage, cache, and multiple DBs, multiplied by many applications (microservices), RBAC alone doesn't solve the problem anymore.

RBAC is simple and easy to implement. But it has its tradeoffs. If you need to define and manage many granular policies, things get messy. When an organization needs more complex rules, a system has to be built in-house so administrators can define more granular policies. Kubernetes has an RBAC implementation plus something called Admission Controllers that let you define custom policies using an external system. But very few companies have the resources to do so.

Let's break down these problems.

Unmanageable

We could stop here. It is impossible to manage RBAC policies. You either grant root access to everyone or give up. Anything in between is a time sink and generates limited results. It's math. Let's consider this scenario:

Say you work for a payments company and your team is in charge of refunds. The team has 2 backend and 2 frontend engineers. As a backend engineer your focus is on the routine jobs, and the other backend engineer works on the BFF. You review each other's code. But you work mostly on separate codebases.

Here is how to solve it with today's RBAC: Create a group named refunds. Grant the four engineers access to the refunds systems. This is RBAC. But under Least Access [1], only the backend engineers would have any type of cloud access. And even between them, access should differ. But this is impossible to implement with RBAC. Remember: a company with 50 engineers ends up with more than 10 thousand different policies for access control if implementing least access [2].
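One rough way to see how a number of that size can come about is to count individual access decisions. This is only a sketch: the 25 resources and 10 actions per resource are illustrative guesses, not figures from this post, and the function name is made up.

```python
def least_access_rules(engineers: int, resources: int, actions: int) -> int:
    """Upper bound on distinct (engineer, resource, action) decisions.

    Under strict least access every combination may need its own answer,
    so the count grows multiplicatively.
    """
    return engineers * resources * actions


# Assumed inputs for illustration only.
total = least_access_rules(engineers=50, resources=25, actions=10)
print(total)  # 12500
```

With a single shared group, the same company manages one policy. The gap between those two numbers is the work nobody has time to do by hand.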

Static

RBAC is static. It requires human intervention to change the system's behavior. If a developer needs access to a new table, someone needs to manually update a policy. At most companies this takes days, if it ever happens. [3] Managing these policies is complex. Instead of facing the problem, many companies decide to centralize access with a few more experienced people. Anybody that needs to do something using privileged access needs to go through them. A gated process.

Binary

Access denied. This is the only result out of an RBAC system besides success. Is that representative of the real-world? No. The output of an engineer accessing something has many paths in practice. If you don't have access to a resource, you have to manually do something today:

- File an access request ticket.
- Share my script with a colleague that has the access.
- Open a Zoom call with the team with privileges and have them screen share the database client to run my query.

Sometimes worse.

1. Someone from the DBA or infra team copies and pastes the script into production.
2. Then shares the results in the ticket comments.
3. File a Jira ticket pasting the script I need to execute.

How terrible is that? Having someone from a different team check a script before it executes is one output of the authorization process. A proper authorization system should say:

You want to do this? Fine. I trust you (authentication). But first I need to check with some people. Or check with an API. Or do whatever. My goal is to unblock, not block you.
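A minimal sketch of what a non-binary answer could look like in code. The `Outcome` values and the `authorize` function are hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    ALLOW = "allow"
    NEEDS_REVIEW = "needs_review"  # hold the command until someone approves it
    DENY = "deny"


@dataclass(frozen=True)
class Decision:
    outcome: Outcome
    reason: str


def authorize(has_access: bool, reviewers: List[str]) -> Decision:
    """Prefer routing to a reviewer over a flat denial."""
    if has_access:
        return Decision(Outcome.ALLOW, "caller already holds the permission")
    if reviewers:
        names = ", ".join(reviewers)
        return Decision(Outcome.NEEDS_REVIEW, f"waiting for approval from: {names}")
    return Decision(Outcome.DENY, "no access and nobody available to approve")
```

The point of the third outcome is that "no" becomes the last resort instead of the default.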

By not having the real-world scenarios baked into the system, RBAC creates manual work. But let's not get ahead of ourselves and finish exploring the problems of RBAC before we dive into how to fix them.

Reactive

Security Information and Event Management (SIEM) systems monitor every user and system event in a company. They try to find problems. Different types of correlation analyze whatever systems or users are doing. The goal is finding weird patterns. Audit trails from every system run through the SIEM.

Can you imagine what it is like to enable audit trails for multiple Kubernetes clusters, databases, servers, and containers? As is the case with many security features, audit trails don't directly impact functionality, so they get left for later. Say your company didn't postpone this. There is still a long tail of places that are hard to monitor, like the syslog of the containers in a Kubernetes pod. ✍️ It's hard.

Whenever the SIEM detects a problem, the security team receives an alert. Then what? The person that saw the alert has to manually find the system with the problem and work on a solution. Many times this solution is to disable the account while they investigate the incident further. Some more advanced security teams automate this process. When an alert is triggered the associated account is automatically disabled. Keep in mind that very few teams are able to do this. Usually only the ones with millions invested in security. The rest of the world, those that even have a security team, are struggling to enable audit trails across all the systems they have to manage. Or trying to get them into their SIEM tool. When they have one.

But despite all this work, this system is inefficient. Remediation happens after the fact. Systems have been broken into by the time the alerts go off. Damage is done. This model can only remediate the problem, not solve it.

Why are we monitoring the events after they happened and not before? Why can't RBAC systems check with the SIEM tool if it is ok to allow something to run before it runs? We'll explore what it means to rethink RBAC in a minute.
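To make the question concrete, here is a sketch of a check that consults a risk source before a command runs. Everything here is an assumption: `fake_siem` is a stand-in we made up, the 0 to 1 risk score and the 0.7 threshold are arbitrary, and a real SIEM integration would look different.

```python
from typing import Callable

# Hypothetical signature: takes (user, command) and returns a risk score in [0, 1].
RiskLookup = Callable[[str, str], float]


def allow_before_run(user: str, command: str, risk_lookup: RiskLookup,
                     threshold: float = 0.7) -> bool:
    """Ask the risk source first; let the command run only below the threshold."""
    return risk_lookup(user, command) < threshold


def fake_siem(user: str, command: str) -> float:
    """Stand-in for a real SIEM query; flags destructive SQL as risky."""
    return 0.9 if "drop table" in command.lower() else 0.1
```

The order of operations is the whole idea: the lookup happens before execution, so a suspicious command is stopped instead of reported.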

Permanent

Nobody should have permanent access to anything. You shouldn't have permissions to update the database [4] if there is no incident or support ticket assigned to you. But implementing this today is hard. Most of the time a security engineer has to manually provision temporary access. And then revoke it later. Just enough time so they can fix an incident. But in many cases the access is never revoked. Manual work. RBAC is permanent.

HashiCorp Vault and a few other solutions ease this problem by issuing temporary credentials that are automatically revoked after some time. A way better solution. But this isn't the ultimate level of least access. Say the credentials are valid for 15 minutes. I can do way more than what I was supposed to in this time.
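A minimal sketch of a time-boxed grant that is also narrowed to a single resource, so the window limits what can be touched as well as for how long. This is not how Vault works internally; `Grant` and `issue_grant` are hypothetical names, and the 15 minute default just mirrors the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    resource: str
    expires_at: datetime

    def is_valid(self, user: str, resource: str, now: datetime) -> bool:
        """Valid only for the same user, the same resource, and before expiry."""
        return (user == self.user
                and resource == self.resource
                and now < self.expires_at)


def issue_grant(user: str, resource: str, now: datetime,
                ttl: timedelta = timedelta(minutes=15)) -> Grant:
    """Create a grant that expires on its own, with no manual revocation step."""
    return Grant(user, resource, now + ttl)
```

Expiry removes the "never revoked" failure mode. Scoping the grant to one resource is a first step toward shrinking what those 15 minutes allow.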

Impacts

Not fit for Least Access

At this point it should be clear why it is impossible to implement Least Access using RBAC. There is no way a team can manage thousands of unique policies. As a result, people get more privileges than they should. It is not enough to have two profiles, one for developers and one for SRE. Some companies go further and create one group per team. Still, not enough. No two people in a company should have the same level of access. No two engineers do the exact same work. If they do, something is off. You can't implement least access with RBAC.

Ignore available data

We talked about the limitations of RBAC considering only its design. But if we go a step further and consider what it is missing, things get worse. The software development process is full of data. Many systems are used to help organize teams and code. A lot of data about the current context is available at any given time.

Chat. Alerts. Monitoring. Feature Flag. Incident Management. Project Management.

All this data is available behind a simple API and being completely ignored by RBAC systems.

Going back to the scenario we discussed in the section on binary output (access denied): what happens after the access is denied is a bunch of manual work across all these tools. Manual work that could be baked into the RBAC system. Example:

1. The system detects I don't have access to something.
2. It finds a Slack channel where people that currently have access hang out.
3. It shares what I'm trying to do with them for code review, or even files a Jira ticket so I don't have to open Jira.
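The steps above could be sketched roughly like this. `FakeChat` is a stand-in for a real chat client, and the `OWNER_CHANNELS` mapping and channel name are invented for the example; a real version would call the chat vendor's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeChat:
    """Stand-in for a chat client; it only records what would be posted."""
    sent: List[Tuple[str, str]] = field(default_factory=list)

    def post(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


# Hypothetical mapping from a resource to the channel of the team that owns it.
OWNER_CHANNELS = {"refunds-db": "#refunds-oncall"}


def request_review(user: str, resource: str, command: str, chat: FakeChat) -> bool:
    """On a denied access, share the command with the owners instead of stopping."""
    channel = OWNER_CHANNELS.get(resource)
    if channel is None:
        return False  # nobody to ask; the caller falls back to a manual ticket
    chat.post(channel, f"{user} wants to run on {resource}: {command}")
    return True
```

The engineer never leaves their terminal, and the people with access see the exact command they are being asked to approve.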

Further: if the team always grants me temporary access when I'm working on an incident, why isn't the system itself doing it already? No need to have humans in the loop.

Rethinking RBAC

Enough with problems. There are more of them, but we covered the biggest. Time to talk about solutions. How to rethink RBAC.

Programmability

It all starts with programmability. RBAC systems are analog. It makes sense: when RBAC was created, most systems were analog. Digital systems with infinite possibilities were just starting to come out. But we need to get RBAC out of analog mode. It needs to be programmable (check part 2 to understand why OPA isn't the solution here).

We should be able to write a piece of code that runs before a given access command executes. Then use all the context of the command in this code. User. System. Time. Location. Without this foundation, no other improvements are possible. We could have extra features added to RBAC. But being limited by what the providers of these systems can deliver isn't gonna change things. The true change comes from programmability.
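As a sketch of the idea, assuming a context object with the four dimensions mentioned above: the hooks below are ordinary functions that run before a command and can veto it. `AccessContext`, the hook names, and the rules themselves are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass(frozen=True)
class AccessContext:
    user: str
    system: str
    command: str
    time: datetime
    location: str


# A hook receives the full context and returns True to let the command proceed.
Hook = Callable[[AccessContext], bool]


def business_hours_only(ctx: AccessContext) -> bool:
    """Example rule: allow only between 09:00 and 18:00."""
    return 9 <= ctx.time.hour < 18


def no_prod_from_unknown_location(ctx: AccessContext) -> bool:
    """Example rule: block production systems when the location is unknown."""
    return not (ctx.system.startswith("prod-") and ctx.location == "unknown")


def run_hooks(ctx: AccessContext, hooks: List[Hook]) -> bool:
    """The command executes only if every hook agrees."""
    return all(hook(ctx) for hook in hooks)
```

Because hooks are plain code, a team can add or swap rules without waiting for the provider of the authorization system to ship a feature.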

Extensibility

Being able to program enables extension. Create rules fit for your business on how authorization should behave. Share these rules with others. Your IAM policy template is useless to most people. But what if you could code policies with Python? Instead of fixed ARNs you could use resource names, tags, accounts, and any other dimension. Enabling reuse increases sharing. Sharing creates ecosystems.
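A small sketch of what a shareable, tag-based policy could look like. The `Resource` shape and the `team` tag convention are assumptions made for the example, not part of any provider's API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Resource:
    name: str
    account: str
    tags: Dict[str, str] = field(default_factory=dict)


def same_team_policy(user_team: str, resource: Resource) -> bool:
    """Allow when the resource carries a `team` tag matching the user's team.

    Nothing here mentions a specific account or identifier, so the same
    function can be reused by any company that tags resources this way.
    """
    return resource.tags.get("team") == user_team
```

A policy written against tags keeps working when resources are created or renamed, which a list of fixed identifiers cannot do.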

Integrations

With a programming environment and all the context about an execution, you can call an external API. You don't need to be limited by what you have in context. Call other APIs. Get more data. Send data to other places. Program the authorization flow to whatever fits your needs.

Leverage data

Remember our example on checking if someone is on-call and granting temporary access? By using programmability we can call the Incident Management system API. Check if the user is on-call and working on a critical incident. Then give them access to what they need to get our systems back up. Having someone else wake up at 3am only to grant this person access isn't helping. The engineer that woke up isn't happy. The customer that is waiting without access to the service isn't happy. The powerless engineer isn't happy.
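Here is a sketch of that check, written against an interface we made up. `IncidentClient` and its two methods are hypothetical; a real incident management tool exposes its own API, and the fake client below exists only so the example runs.

```python
from typing import Protocol, Set


class IncidentClient(Protocol):
    """Hypothetical interface to an incident management system."""

    def is_on_call(self, user: str) -> bool: ...

    def has_open_critical_incident(self, user: str) -> bool: ...


class FakeIncidentClient:
    """In-memory stand-in used in place of a real API client."""

    def __init__(self, on_call: Set[str], critical: Set[str]) -> None:
        self._on_call = on_call
        self._critical = critical

    def is_on_call(self, user: str) -> bool:
        return user in self._on_call

    def has_open_critical_incident(self, user: str) -> bool:
        return user in self._critical


def grant_incident_access(user: str, client: IncidentClient) -> bool:
    """Grant access only to someone on-call who is working a critical incident."""
    return client.is_on_call(user) and client.has_open_critical_incident(user)
```

Nobody has to be woken up to approve the request, because the data that justifies the access already lives in another system.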

Do you need to register everything that users do on Jira so you can get your PCI certification by the end of the year? Fine. Have your programmable RBAC system call the Jira API to register a ticket for all requests.

What's next

RBAC is a solid standard. It is solving a problem today for many complex use cases. But we need more. In Part 2 of this series we cover all the existing solutions trying to solve these problems. From new standards to open-source tools and vendors. The good and bad parts of each system. And part 3 proposes a new simple and general system that keeps the simplicity of RBAC and fixes its limitations.

Thank you for reading and please let us know what you think. Do you agree? Not? Why? We have discussions happening on Hacker News, Reddit, and other communities.

Make sure to subscribe to our newsletter and get parts 2 & 3 in your inbox.



[1] Least access is an Information Security concept where systems are designed in a way so that agents, humans or machines, only have the minimum amount of access to perform their role at any given time.

[2] Some people could say: "we don't give people access unless they request it". We won't discuss this approach in this post. It is a reactive anti-pattern. It breaks any form of healthy culture and productivity. We assume that the security team is working hard to get out of the way of engineers building customer-facing features.

[3] It could be that you work for one of the few companies that keep policies under version control and let the engineer that needs the access submit a PR requesting it, so the security team only reviews them. The focus of this post is the broader industry state, not the small subset that had the time and resources to build in-house automations.

[4] Ideally one should never change things directly in the database, yet 99% of companies do, but we won't discuss it here.