How To Craft An IAM Strategy: 6 Learnings From Hundreds Of DevOps Managers

Andrios Robert

23 Mar 2023 • 5 min read

This article explores how to create an access management strategy that gets your team 5x the results you're currently getting with the same effort.

If you implement this framework, you'll see a few benefits:

3 high-risk security problems mitigated
2x more time invested in cloud cost optimization
7 hours saved per engineer in the company
5x reduction in production ad-hoc access

By organizing your team around this framework, you'll have a clear roadmap for access management that is aligned with the business vision and priorities.

Unfortunately, by investing in high-noise but lower-risk problems, DevSecOps Managers don't address the root cause. Because of that, they keep wasting quarter after quarter of their teams' effort on inefficient initiatives.

They lack measurable risks to set priorities.

Understand the problem before creating a strategy

If your DevOps team spends 50% of its time on manual one-off tasks, the solution isn't hiring more people.

Here's a typical scenario: the VP of Engineering complains to the Cloud Manager about slow DevOps requests. Engineers are blocked on DevOps tasks when resolving bugs and incidents. One-off DevOps requests coming from Slack or some sort of service desk seem too distinct to automate. The DevOps Manager decides to hire more people to fix the problem.

Bad idea. Requests aren't that different if you look at them the right way.

But maybe you categorize one-off requests and spend energy on the type of request with the highest count. Say, database queries. Wrong again; this isn't the problem. Access management drains infinite energy, and the wrong strategy makes you waste effort on the wrong problem.

Here's why most teams fall into this trap:

Relying on traditional legacy technology to solve the problem
Lack of a single map with 100% of the access surface
Wasting time on the first and most obvious solution
Conflicting Security and DevOps strategies
Wasting energy on higher noise problems

After talking to hundreds of DevOps and security managers in the process of building Hoop.dev, I found that the best teams use a simple framework. They break the problem down into 5 smaller and simpler problems. Then they define their strategy and make progress with the right priorities.

Here’s how, step by step:

Step 1: Define The Access Surface

Asking the security team how to do things will get you into an endless fight. Ask what, not how.

At one point, when I was leading a DevOps team at a high-growth startup, we were close to having half our priorities defined by the Security team. They gave us all sorts of urgent work because compliance audits were approaching or large customers wanted security reports. A lot of it was access-related.

But we weren’t working on the right thing.

I got tired of this and called a meeting with the CTO and the Security lead.

Before the meeting, I listed all the things we needed to touch in the infrastructure, from developers accessing containers, to PMs running database reports, to CS making repeated database change requests. When we looked at the whole picture, the strategy changed. We opened two security engineer positions and grouped the things we were fixing into buckets that solved more than one problem with the same amount of effort.

Lesson learned: map the access surface before starting to fix one access problem. You’ll find common problems and solutions.
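A minimal sketch of what such a map could look like, assuming a flat inventory of who touches what. The `AccessPath` fields, the `kind` categories, and the sample entries are hypothetical, loosely based on the examples above, and are not a schema from any real tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    who: str     # team or role that needs access (hypothetical labels)
    target: str  # what they access
    kind: str    # resource category, used to find shared solutions


# Hypothetical inventory built from the examples in this step.
ACCESS_SURFACE = [
    AccessPath("developers", "production containers", "container"),
    AccessPath("product managers", "reporting queries", "database"),
    AccessPath("customer success", "data change requests", "database"),
]


def group_by_kind(paths):
    """Bucket access paths by resource kind."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[path.kind].append(path)
    return dict(buckets)


def shared_buckets(paths):
    """Return the kinds where a single fix would serve more than one group."""
    shared = {}
    for kind, members in group_by_kind(paths).items():
        groups = sorted({p.who for p in members})
        if len(groups) > 1:
            shared[kind] = groups
    return shared


if __name__ == "__main__":
    for kind, groups in shared_buckets(ACCESS_SURFACE).items():
        print(f"{kind}: one solution could serve {', '.join(groups)}")
```

Even with three entries, the grouping shows that the PM reports and the CS change requests are the same database access problem seen from two teams.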

Step 2: Prioritize Edge Authentication

Authentication is not as simple as adding SSO to everything that supports it.

Many services your engineers access today don’t support SSO. Edge Authentication is a big challenge, and it is where the biggest risks live. The one or two engineers with static credentials on their computers, gatekeeping a weak entry point, are your largest security risk. Think about the things you restrict to only a handful of people “because it is too sensitive” or less secure.

Human error accounts for 85% of data breaches. Leaving weak spots behind and protecting them with trusted employees isn’t a good idea.

Step 3: Go Beyond RBAC for Authorization

Least Access isn’t feasible using static IAM/RBAC rules.

A microservices-based product built by 500 engineers generates more than 2000 unique cloud resources to manage. Each database table, schema, row, and view. Each container app, namespace, and config. Message queues and dead-letter queues. Search engine indexes and queries. The list goes on and on. To achieve Least Access, you average 3 profiles for each team. In our example of 500 engineers, with roughly 10 engineers per team, that is 50 teams, so you have to manually manage 150 policies with about 40 resources in each.
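The arithmetic behind those figures can be reproduced in a few lines. The team size of 10 is an assumption: the article does not state it, but it is the value that makes 500 engineers and 3 profiles per team come out to 150 policies. The even split of resources across teams is also an assumption.

```python
def policy_load(engineers, resources, profiles_per_team, team_size):
    """Estimate the static policy burden for a Least Access setup.

    team_size is an assumption, and resources are assumed to be
    split evenly across teams.
    """
    teams = engineers // team_size
    policies = teams * profiles_per_team
    resources_per_policy = resources // teams
    # Every (policy, resource) pair is a rule someone has to keep correct.
    rules_to_maintain = policies * resources_per_policy
    return teams, policies, resources_per_policy, rules_to_maintain


if __name__ == "__main__":
    teams, policies, per_policy, rules = policy_load(
        engineers=500, resources=2000, profiles_per_team=3, team_size=10
    )
    print(f"{teams} teams, {policies} policies, {per_policy} resources each")
    print(f"{rules} policy-resource rules to maintain by hand")
```

Under these assumptions the 150 policies translate into 6000 individual policy-resource rules, which is the real size of the manual work.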

You need something more flexible than RBAC.

Some solutions that can reduce this complexity:

Multi-team access grants and Runbooks execution workflows
Automated execution of one-off scripts
Temporary elevated access grants
Simple to build and use Runbooks
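As an illustration of the temporary elevated access idea from the list above, here is a minimal in-memory sketch in which every grant carries an expiry time. `Grant` and `GrantStore` are hypothetical names made up for this example; a real system would persist grants and enforce them at the connection layer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    resource: str
    expires_at: datetime

    def is_active(self, now):
        # A grant stops working at its expiry time, with no cleanup needed.
        return now < self.expires_at


class GrantStore:
    """Hypothetical in-memory store of time-boxed access grants."""

    def __init__(self):
        self._grants = []

    def grant(self, user, resource, minutes, now=None):
        if now is None:
            now = datetime.now(timezone.utc)
        new_grant = Grant(user, resource, now + timedelta(minutes=minutes))
        self._grants.append(new_grant)
        return new_grant

    def is_allowed(self, user, resource, now=None):
        if now is None:
            now = datetime.now(timezone.utc)
        return any(
            g.user == user and g.resource == resource and g.is_active(now)
            for g in self._grants
        )


if __name__ == "__main__":
    store = GrantStore()
    store.grant("ana", "orders-db", minutes=30)
    print(store.is_allowed("ana", "orders-db"))  # access within the window
    print(store.is_allowed("bob", "orders-db"))  # no grant was issued
```

The design choice is that access is denied by default and expires on its own, so nobody has to remember to revoke it or maintain a standing policy for a one-off need.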

Step 4: Leverage Audit For More Than Compliance

Audit logs must be uniformly structured and queryable to generate value.

Gradually reduce ad-hoc access in favor of automation. "How many repeated production changes did we make that were very similar and should be turned into application code or automation?" Indexed audit logs that support flexible querying have the answer. A few other ways to leverage audit logs in your strategy:

Faster recovery from outages. What caused this outage? Did someone run a bad query that took us down? Let’s check the history of updates made by human db users in the last 3 hours.

Learning from mistakes. Create a timeline of commands executed by all users during an incident. Look back and find out what has to be different in the next incident. Did we take too long to try something? Do we need a new Runbook?

Zero Compliance TOIL. An audit strategy saves time. It is one thing to have the default audit logs enabled for your most important services. It is a completely different thing to have audit logs structured in a uniform format across your infrastructure, somewhere you can query them. It is the difference between spending weeks downloading audit logs and consolidating them by hand in spreadsheets, and running a query that returns all the information requested by the auditor in seconds.
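To make the value of a uniform format concrete, here is a small sketch that answers two of the questions above from structured events: which ad-hoc changes keep repeating, and what was run against a given target in the last 3 hours. The event fields (`ts`, `user`, `target`, `command`) and the sample data are assumptions for illustration, not the schema of any particular audit system.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical audit events in one uniform shape.
AUDIT_LOG = [
    {"ts": datetime(2023, 3, 1, 10, 0), "user": "ana", "target": "orders-db",
     "command": "UPDATE orders SET status = 'refunded' WHERE id = 101"},
    {"ts": datetime(2023, 3, 1, 14, 30), "user": "bob", "target": "orders-db",
     "command": "UPDATE orders SET status = 'cancelled' WHERE id = 202"},
    {"ts": datetime(2023, 3, 1, 15, 0), "user": "ana", "target": "sessions-db",
     "command": "DELETE FROM sessions WHERE user_id = 7"},
]


def normalize(command):
    """Replace literals with placeholders so similar commands compare equal."""
    command = re.sub(r"'[^']*'", "?", command)  # quoted strings
    command = re.sub(r"\b\d+\b", "?", command)  # standalone numbers
    return re.sub(r"\s+", " ", command).strip().lower()


def repeated_changes(events, min_count=2):
    """Command shapes seen at least min_count times: automation candidates."""
    counts = Counter(normalize(e["command"]) for e in events)
    return [(shape, n) for shape, n in counts.most_common() if n >= min_count]


def recent_by_target(events, target, now, hours=3):
    """Events against one target within the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    return [e for e in events if e["target"] == target and e["ts"] >= cutoff]


if __name__ == "__main__":
    for shape, n in repeated_changes(AUDIT_LOG):
        print(f"{n}x {shape}")
    for event in recent_by_target(AUDIT_LOG, "orders-db", datetime(2023, 3, 1, 16, 0)):
        print(event["user"], event["command"])
```

The regular expressions are a deliberately crude normalizer. The point is that once every service logs in the same shape, both questions become a few lines of querying instead of a spreadsheet exercise.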

Step 5: Simplify Automation To Reduce Ad-hoc

No company survives without ad-hoc access. Some companies use it for exception scenarios, others rely on it for operations.

The difference is that some companies create incentives for engineers to use less ad-hoc access and more automation. For that, you need observability to understand how ad-hoc access is happening (structured audit logs) so you can make decisions. Then comes the automation platform. It has to be easy. The right way has to be the easiest way. Is it easier for an engineer to create an automation than it is to run an ad-hoc query? If not, make it so. Everyone likes automations: they are safer and faster to run. The problem could be that your platform is blocking engineers from building them.

You can't blame developers for using too much ad-hoc access if you have the wrong incentives in place.

Step 6: Combine The 5 Pillars Into The Access Framework

Building a matrix with these 5 pillars and the technologies in your stack gives you a simple framework to build your strategy.

Here’s an example of what it looks like:

And here’s what many companies do when trying to fix access:

They focus on a small but noisy problem, leaving the areas that carry high security risk and could cut the most TOIL untouched.
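The matrix described in this step can also be sketched as a small data structure. The pillar names below follow the five steps of this article; the technologies and the covered/uncovered values are hypothetical placeholders, not data from a real stack or from the spreadsheet.

```python
PILLARS = ["access surface", "authentication", "authorization", "audit", "automation"]

# Hypothetical stack. True means the pillar is covered for that technology.
MATRIX = {
    "postgres": {
        "access surface": True, "authentication": False,
        "authorization": False, "audit": False, "automation": True,
    },
    "kubernetes": {
        "access surface": True, "authentication": True,
        "authorization": False, "audit": False, "automation": False,
    },
    "message queue": {
        "access surface": True, "authentication": False,
        "authorization": False, "audit": False, "automation": False,
    },
}


def gaps_by_pillar(matrix):
    """For each pillar, list the technologies where it is not covered."""
    return {
        pillar: [tech for tech, row in matrix.items() if not row.get(pillar, False)]
        for pillar in PILLARS
    }


def coverage(matrix):
    """Fraction of (technology, pillar) cells that are covered."""
    total = len(matrix) * len(PILLARS)
    covered = sum(
        1 for row in matrix.values() for pillar in PILLARS if row.get(pillar, False)
    )
    return covered / total


if __name__ == "__main__":
    for pillar, techs in gaps_by_pillar(MATRIX).items():
        print(f"{pillar}: {len(techs)} gaps {techs}")
    print(f"overall coverage: {coverage(MATRIX):.0%}")
```

Reading the gaps per pillar rather than per technology is what exposes the pattern: one well-automated database can look like progress while audit and authorization stay uncovered everywhere.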

I built a spreadsheet to make it easier to define the strategy using the 5A framework. Fill it in and you’ll understand the real size of your problem and be able to craft the strategy that makes sense for your company. You can download it here.