PII/PHI redaction for AI coding agents on BigQuery

A data‑science contractor was off‑boarded, but the automated code‑generation pipeline that powers your internal analytics still holds the same service‑account key used to query BigQuery, and nothing between the pipeline and the warehouse redacts PII or PHI. The next day the pipeline produces a model that, when run, pulls patient records, merges them with public data, and writes the combined set to a downstream bucket. No human ever sees the raw query, yet the pipeline has full read access to protected health information.

AI‑driven coding agents are attractive because they can write and execute SQL on demand, but they also become a blind spot. When an agent talks directly to BigQuery with a shared credential, every row that contains names, social security numbers, or medical codes flows unfiltered back to the agent’s runtime. The organization loses visibility into who read what, and there is no guarantee that sensitive fields are stripped before the data is used elsewhere.

In the most common deployment, the agent authenticates with a static Google service‑account key that is baked into the CI configuration. The key is distributed to every build runner, and the runner launches the BigQuery client directly against the Google endpoint. Because the client connects straight to Google, nothing in your own infrastructure sits on the data path: there is no place to intervene, no place to redact fields, and no place to capture an audit trail of the exact queries and result sets.

One improvement teams often make is to replace the static key with per‑user OAuth tokens or GCP IAM federation. This limits the credential surface and ties each request to an individual identity, but the token still travels straight to BigQuery. The gateway that could inspect the traffic is missing, so the request still reaches the database unmediated, still returns raw rows, and still leaves the organization without a record of what data left the warehouse.

What you need is a control point that sits on the data path, inspects every query and response, and applies policy before the data reaches the AI agent. That control point must be able to hold the credential, enforce just‑in‑time approval, mask protected fields inline, and record the entire session for later replay.

Why PII/PHI redaction matters for AI coding agents

Regulatory frameworks such as HIPAA and GDPR treat health and personal data as highly sensitive. When an AI agent can retrieve that data without any transformation, a single mis‑configuration can expose thousands of records. Inline redaction ensures that the agent only ever sees a sanitized view, reducing the risk of accidental leakage or downstream misuse.

Beyond compliance, redaction limits the blast radius of a compromised agent. If an attacker hijacks the CI runner, the worst‑case scenario is that they can only see masked identifiers, not the underlying personal data. This makes the breach less damaging and easier to contain.

Architectural pattern for inline PII/PHI redaction

hoop.dev implements a layer‑7 gateway that proxies BigQuery connections. The gateway runs as a network‑resident service, typically deployed via Docker Compose for a quick start or as a Kubernetes pod for production. A hoop.dev agent process (a separate component, not the AI agent) runs in the same network and holds the actual Google credential; users and AI agents never see the secret.

When an AI coding agent initiates a query, it connects to the hoop.dev endpoint instead of the native BigQuery endpoint. The gateway authenticates the request using the OIDC token presented by the agent, maps the token to an identity, and checks any just‑in‑time approval workflow that you have defined. If approval is required, the request is paused until a human reviewer grants it.
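The authorization step above can be sketched roughly as follows. This is a minimal illustration, not hoop.dev's implementation: `Claims`, `ApprovalStore`, `JIT_DATASETS`, `ALLOWED_GROUPS`, and `authorize` are hypothetical names, and the OIDC token is assumed to have been signature-verified already.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    PENDING = "pending"  # request is paused until a reviewer approves
    DENY = "deny"


@dataclass
class Claims:
    # Hypothetical subset of already-verified OIDC claims.
    subject: str
    groups: tuple = ()


@dataclass
class ApprovalStore:
    # Hypothetical in-memory store of reviewer grants: (subject, dataset) pairs.
    approved: set = field(default_factory=set)

    def grant(self, subject, dataset):
        self.approved.add((subject, dataset))


# Hypothetical policy values, chosen only for illustration.
JIT_DATASETS = {"clinical"}               # datasets that need just-in-time approval
ALLOWED_GROUPS = {"analytics", "ai-agents"}


def authorize(claims, dataset, approvals):
    """Map a verified identity to a decision for one dataset."""
    if not ALLOWED_GROUPS.intersection(claims.groups):
        return Decision.DENY
    needs_jit = dataset in JIT_DATASETS
    if needs_jit and (claims.subject, dataset) not in approvals.approved:
        return Decision.PENDING
    return Decision.ALLOW
```

A caller would hold the request while the decision is `PENDING` and re-check once a reviewer has called `grant`.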

Once the request is authorized, hoop.dev forwards the query to BigQuery using the stored service credential. The response streams back through the gateway, where a masking engine examines each row. Fields that match patterns for names, medical codes, or other protected identifiers are redacted in real time before the data is handed to the AI agent. The masking policy is declarative, allowing you to add or modify rules without touching the agent code.
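To make the masking step concrete, here is a small sketch of row-level redaction over a stream of result rows. It is an assumption-laden illustration rather than hoop.dev's masking engine: the column names, the SSN-like pattern, and the `[REDACTED]` placeholder are all hypothetical, and a real rule set would be broader.

```python
import re

MASK = "[REDACTED]"

# Hypothetical declarative rules: columns masked wholesale, plus value patterns
# that are scrubbed wherever they appear inside string fields.
COLUMN_RULES = {"patient_name", "ssn", "diagnosis_code"}
VALUE_RULES = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # strings shaped like a US SSN
]


def mask_value(value):
    """Scrub pattern matches inside a string; leave other types untouched."""
    if not isinstance(value, str):
        return value
    for pattern in VALUE_RULES:
        value = pattern.sub(MASK, value)
    return value


def mask_row(row):
    """Return a copy of the row with protected columns and values redacted."""
    return {
        col: MASK if col.lower() in COLUMN_RULES else mask_value(val)
        for col, val in row.items()
    }


def mask_stream(rows):
    """Mask rows lazily so results can stream through without buffering."""
    for row in rows:
        yield mask_row(row)
```

Because the rules live in data (`COLUMN_RULES`, `VALUE_RULES`) rather than in the agent, they can change without touching the code that issues queries.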

Every command and every result set is recorded by hoop.dev. The session log includes the identity of the requester, the exact SQL text, timestamps, and the masked result set. Because the gateway owns the session, you can replay it later for forensic analysis or audit purposes.
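A session record with the fields listed above might look something like the following. The `SessionRecord` shape and the JSON-lines log are assumptions made for illustration; hoop.dev's actual storage format is not described in this post.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class SessionRecord:
    # Hypothetical record layout: who asked, what they ran, what they got back.
    identity: str
    sql: str
    masked_rows: list
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(log_lines, record):
    """Append one JSON object per line to an append-only log."""
    log_lines.append(json.dumps(asdict(record), sort_keys=True))


def replay(log_lines, identity):
    """Yield every recorded session for one identity, in log order."""
    for line in log_lines:
        entry = json.loads(line)
        if entry["identity"] == identity:
            yield entry
```

Only the masked result set is stored in this sketch, so the log itself does not become a second copy of the protected data.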

The same gateway also enforces guardrails, such as intercepting dangerous commands (e.g., DROP TABLE) and holding them for manual review instead of executing them. These controls are applied uniformly, regardless of whether the caller is a human engineer, an automated CI job, or an AI‑driven coding assistant.
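A guardrail of this kind can be approximated with a simple statement check. The sketch below is deliberately naive and is not how hoop.dev evaluates queries: `DESTRUCTIVE` and `check_query` are hypothetical, and a production guardrail would parse the SQL rather than rely on a regular expression.

```python
import re

# Hypothetical rule: destructive DDL is held for manual review.
DESTRUCTIVE = re.compile(
    r"^\s*(DROP|TRUNCATE)\s+(TABLE|SCHEMA|VIEW)\b", re.IGNORECASE
)


def check_query(sql):
    """Return 'review' if any statement looks destructive, else 'allow'."""
    # Naive split on ';' that ignores semicolons inside literals or comments.
    for statement in sql.split(";"):
        if DESTRUCTIVE.match(statement):
            return "review"
    return "allow"
```

Anchoring the pattern at the start of each statement keeps a harmless string literal such as `'drop table'` inside a SELECT from triggering the rule.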

How the flow looks from an AI agent’s perspective

  • The agent’s code points its BigQuery client at the hoop.dev endpoint.
  • The gateway validates the OIDC token and, if needed, triggers a JIT approval step.
  • The query is forwarded to BigQuery using the gateway’s stored credential.
  • Responses are inspected, sensitive fields are redacted, and the sanitized data is returned to the agent.
  • The entire interaction is recorded for later replay.
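The ordering of these steps can be sketched as one function. Everything here is hypothetical scaffolding: `handle_request` and its callable parameters stand in for the gateway's real components, which this post does not specify.

```python
def handle_request(token, sql, *, verify, approve, run_query, mask_row, record):
    """Run one query through the gateway steps in order.

    All callables are injected stand-ins:
      verify(token) -> identity
      approve(identity, sql) -> bool
      run_query(sql) -> iterable of row dicts (uses the gateway's credential)
      mask_row(row) -> redacted row
      record(identity, sql, masked_rows_or_None) -> None
    """
    identity = verify(token)                  # validate the OIDC token
    if not approve(identity, sql):            # just-in-time approval gate
        record(identity, sql, None)           # denied attempts are logged too
        raise PermissionError("approval required")
    rows = run_query(sql)                     # forwarded to the warehouse
    masked = [mask_row(r) for r in rows]      # redact before returning
    record(identity, sql, masked)             # store the masked result
    return masked
```

The raw rows exist only inside `handle_request`; the caller and the log both receive the masked copy.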

This pattern keeps the credential out of the agent’s environment, ensures that fields matched by the masking policy are redacted before any data leaves the gateway, and provides a complete audit trail. The only change on the agent side is the endpoint its BigQuery client connects to.

Getting started with hoop.dev and BigQuery

To try this approach, begin with the official getting‑started guide. The quick‑start uses Docker Compose to spin up the gateway, configure OIDC authentication, and register a BigQuery connection. The documentation walks you through adding a masking rule set for common health‑care identifiers and enabling session recording.

For deeper dives into policy configuration, guardrail examples, and how to integrate the built‑in MCP server for AI agents, explore the learn section. All of the configuration is expressed in declarative YAML files, so you can version‑control your security posture alongside your infrastructure code.

FAQ

  • Does hoop.dev store my Google service‑account key? The gateway holds the credential in its runtime environment, never exposing it to the calling agent or AI process.
  • Can I use per‑user OAuth instead of a shared key? Yes. hoop.dev can be configured to obtain per‑user tokens via GCP IAM federation, and the gateway will still apply masking and recording because it remains on the data path.
  • How is the masked data verified? The masking engine runs on every response row, applying the same rule set consistently. You can audit the recorded session logs (which are stored securely) to confirm that the redaction behaved as expected.

Implementing inline PII/PHI redaction at the gateway level gives you confidence that AI coding agents never see raw protected data, while still enabling them to generate useful insights.

Ready to see the code? Visit the open‑source repository on GitHub: github.com/hoophq/hoop.

Hoop is MIT-licensed infrastructure for controlling how AI agents reach production data. Star hoophq/hoop so you can inspect it, deploy it, or share it when your team starts governing agent access.
