PII/PHI redaction for AI agents on BigQuery

A support-summarization agent queries a BigQuery table that happens to include a free-text notes column. Buried in that column are patient identifiers a clinician typed months ago. The agent did not ask for protected health information, but the query returned it, and now regulated data is sitting in an LLM context and possibly a downstream log. The data class changed the stakes entirely.

PII/PHI redaction for AI agents on BigQuery is the control that keeps regulated fields out of the agent's hands. Sensitive values are detected and redacted before the result returns, so the agent works with safe data and the protected fields never cross the boundary.

Regulated data raises the bar on where redaction happens

With ordinary data, a leak is a problem. With PII and PHI, it is a reportable event with legal weight. That raises the requirement: you cannot depend on the agent to avoid selecting sensitive columns, because the cost of one over-broad query is too high. Redaction has to be guaranteed by a layer the agent does not control, and it has to catch sensitive data even when it hides in free-text fields nobody flagged as PHI.

Pattern-only rules miss that long tail. You want classification that recognizes identifiers in unstructured text, not just in columns named ssn.
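To make that gap concrete, here is a minimal sketch with a made-up note and a single fixed pattern for US Social Security numbers. The note text, the ssn_pattern variable, and the pattern_hits helper are all hypothetical; the point is only what a pattern-only rule does and does not flag.

```shell
#!/bin/sh
# Hypothetical free-text note; every value is invented for illustration.
note='Pt Jane Roe, MRN 00482913, called re: refill. SSN on file 123-45-6789.'

# One fixed pattern: US SSNs written as NNN-NN-NNNN.
ssn_pattern='[0-9]{3}-[0-9]{2}-[0-9]{4}'

# Print each substring the pattern flags, one per line.
pattern_hits() {
  printf '%s\n' "$1" | grep -Eo "$ssn_pattern"
}

pattern_hits "$note"
```

Only the SSN is flagged. The patient name and the record number sit in the same sentence and go undetected, which is the long tail a classifier is meant to cover.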

Why PII/PHI redaction cannot be left to the agent

It is worth being precise about why the agent is the wrong place for this control. An agent's instructions are a request, not a guarantee. You can tell an agent to avoid protected fields, and most of the time it will, but a security control measured by "most of the time" is not a control when the data is regulated. One jailbreak, one over-broad exploratory query, one schema the agent did not expect, and the protected values are returned.

PII/PHI redaction has to be a guarantee, which means it cannot depend on the agent's cooperation. It has to happen on the way back from BigQuery, applied to every result that crosses the boundary, so the agent's behavior is irrelevant to whether regulated data escapes. The agent can ask for anything; it simply cannot receive the raw protected fields.

Inline redaction on the BigQuery connection

hoop.dev proxies the connection to BigQuery, so result sets flow back through the gateway. With redaction configured, hoop.dev streams that content to a DLP provider (Presidio or Google DLP), which classifies the data with ML rather than fixed regex and redacts the matched PII and PHI inline before results reach the agent. The agent receives clean output; the regulated values stay behind the boundary.

Redaction is not on by default. It is configured per BigQuery connection with a DLP provider attached, so you choose which data classes get redacted, and you should confirm that coverage includes free-text fields.
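The shape of that data path can be sketched in a few lines of shell. This is not hoop.dev's implementation: fake_results, redact_stream, and agent_view are hypothetical names, the rows are made up, and a single sed substitution stands in for the DLP provider purely so the sketch runs anywhere. What it shows is placement: the consumer reads only the filter's output.

```shell
#!/bin/sh
# Stand-in for a result set coming back from the warehouse (made-up rows).
fake_results() {
  printf '%s\n' \
    '{"case_id":1,"notes":"SSN 123-45-6789 on file","status":"open"}' \
    '{"case_id":2,"notes":"printer jam","status":"closed"}'
}

# Stand-in for the DLP step. A real provider classifies the content;
# this placeholder rewrites one fixed pattern only so the sketch is runnable.
redact_stream() {
  sed -E 's/[0-9]{3}-[0-9]{2}-[0-9]{4}/[REDACTED]/g'
}

# The consumer (the agent) is only ever connected to the filter's output,
# never to fake_results directly.
agent_view() {
  fake_results | redact_stream
}

agent_view
```

Because the filter sits in the return path, nothing the consumer does changes whether the raw value is delivered. Row 2, which has nothing sensitive in it, passes through unchanged.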

Steps

  1. Run the hoop.dev agent near your GCP project, connecting outbound to the gateway.
  2. Register a BigQuery connection, set CLOUDSDK_CORE_PROJECT, and enable GCP IAM federation for per-user OAuth.
  3. Attach a DLP provider (Presidio or Google DLP) and enable redaction, selecting the PII and PHI classes to detect.
  4. Route the agent's bq queries through the gateway.

# Identifiers in the notes column are redacted before the agent sees them.
bq query --use_legacy_sql=false \
  'SELECT case_id, notes, status FROM support.tickets LIMIT 500'

Verify

As the agent, query a table you know contains identifiers, including some inside a free-text column. Confirm the identifiers come back redacted while non-sensitive fields pass through, and confirm the same query without redaction would have exposed them.
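One way to make this check repeatable is to plant known canary values in a test table and fail loudly if any of them appear in what the agent receives. The sketch below assumes you have saved the agent's query output to a file; check_redacted and the file name agent_view.json are hypothetical, and the canary strings are invented.

```shell
#!/bin/sh
# check_redacted FILE CANARY...
# Returns 0 when none of the canary strings appear in FILE, 1 otherwise.
check_redacted() {
  file=$1
  shift
  status=0
  for canary in "$@"; do
    # -F matches the canary as a literal string, not as a pattern.
    if grep -qF -- "$canary" "$file"; then
      # Report the leak without echoing the sensitive value itself.
      printf 'LEAK: a canary value is present in %s\n' "$file" >&2
      status=1
    fi
  done
  return "$status"
}

# Example usage (assumes the query is routed through the gateway):
#   bq --format=json query --use_legacy_sql=false \
#     'SELECT case_id, notes, status FROM support.tickets LIMIT 500' > agent_view.json
#   check_redacted agent_view.json '123-45-6789' 'MRN 00482913'
```

Matching on literal canaries avoids guessing how the provider formats its replacement text. Running the same check against output captured without redaction should fail, which covers the second half of the verification.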

Pitfalls

  • Do not assume redaction is on by default. It is per connection and needs a DLP provider attached.
  • Do not redact only named columns. Configure the DLP classes to catch identifiers inside free-text fields too.
  • Do not trust prompt instructions to keep an agent away from PHI. A prompt is not a boundary.

hoop.dev is open source and can support your data-protection program: it generates the evidence teams use for frameworks like HIPAA, without claiming a certification it does not hold. Read the getting started guide, see how redaction relates to data masking for AI agents on BigQuery, then start at github.com/hoophq/hoop and test redaction against a known PHI column.

FAQ

Does PII/PHI redaction modify my BigQuery data?

No. The stored tables are untouched. hoop.dev redacts in the result stream on the way back to the agent, so the data at rest is unchanged and the agent simply never receives the protected values.

Can redaction catch identifiers in free-text columns?

Yes. hoop.dev streams content to a DLP provider that classifies with ML rather than fixed patterns, so it can detect identifiers inside unstructured text and redact them inline.

Open source

Save the open-source gateway for agent data access

Hoop is MIT-licensed infrastructure for controlling how AI agents reach production data. Star hoophq/hoop so you can inspect it, deploy it, or share it when your team starts governing agent access.