Configuring AI agent access to BigQuery with data masking

An analytics agent runs SELECT * FROM users.accounts LIMIT 1000 to summarize signups, and BigQuery hands back a thousand rows of email addresses, phone numbers, and billing details. The agent only needed counts by region. It now holds, in its context window and quite possibly in a downstream prompt or log, a pile of raw personal data it had no business seeing.

Data masking is the control that stops this. For an AI agent on BigQuery, data masking means sensitive columns are redacted before the result ever reaches the agent, so the agent works with usable data and never touches the raw values.

Why masking at the agent is too late

You can ask the agent to filter columns. You can write a prompt that says never select PII. Neither is a control, because both depend on the agent doing what you asked, and an agent that is buggy, jailbroken, or simply over-eager will run the broad query anyway. By the time the agent could redact anything, BigQuery has already returned the raw rows and the exposure has happened.

Redaction has to occur on the path back from BigQuery, before the result reaches the agent, in a layer the agent does not control.

Why data masking belongs on the connection, not in the query

There is a tempting alternative: write careful queries that never select sensitive columns, or build views that exclude them. Those help, but they are not data masking and they do not hold up under an autonomous agent. A view depends on the agent querying the view and not the base table. A careful query depends on the agent staying careful. The first time an agent runs an exploratory SELECT * to understand a schema, the raw values are out.

Data masking on the connection removes that dependency. It does not matter which table the agent hits or how broad its query is, because redaction happens to the result stream regardless. The control is a property of the path, not of the agent's good behavior, and that is exactly why it survives an agent that misbehaves.

How inline masking works on the connection

hoop.dev proxies the connection to BigQuery, so the result set flows back through the gateway. With masking configured, hoop.dev streams that content to a DLP provider, Presidio or Google DLP, for classification and redacts the matched fields inline before results return to the agent. The agent receives a masked result. The raw values never leave the boundary.

Masking on BigQuery connections is configured per connection rather than on by default, so you turn it on with a DLP provider attached and decide which classes of data get redacted.
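
The flow described above can be sketched in a few lines of Python. This is only an illustration of redacting a result stream in a layer between the warehouse and the agent, not hoop.dev's implementation: the regex table is a hypothetical stand-in for a DLP provider such as Presidio or Google DLP, and the names PATTERNS, redact_value, and mask_stream are invented for this example.

```python
import re

# Hypothetical stand-in for a DLP provider: one regex per data class.
# A real deployment would send content to Presidio or Google DLP instead.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_value(value, classes):
    """Replace every match of the selected data classes with a placeholder."""
    if not isinstance(value, str):
        return value  # this sketch only scans string values
    for name in classes:
        value = PATTERNS[name].sub(f"<{name}>", value)
    return value


def mask_stream(rows, classes):
    """Yield rows with matched content redacted, whatever the column is called."""
    for row in rows:
        yield {col: redact_value(val, classes) for col, val in row.items()}


# Rows as they might come back from BigQuery, before reaching the agent.
raw_rows = [
    {"email": "ana@example.com", "phone": "+1 415 555 0100", "region": "us-west"},
]
masked_rows = list(mask_stream(raw_rows, ["EMAIL_ADDRESS", "PHONE_NUMBER"]))
```

The point of the design is that mask_stream never looks at the query. It sees only the rows coming back, so a broad SELECT * is treated the same way as a narrow query.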

Configure it

  1. Run the hoop.dev agent near your GCP project, connecting outbound to the gateway.
  2. Create a BigQuery connection with CLOUDSDK_CORE_PROJECT set, and enable GCP IAM federation for per-user OAuth.
  3. Attach a DLP provider (Presidio or Google DLP) to the connection and turn on masking, choosing the data classes to redact.
  4. Route the agent's bq queries through the gateway.

# the agent gets masked output; raw PII never reaches it
bq query --use_legacy_sql=false \
  'SELECT email, phone, region FROM users.accounts LIMIT 1000'
# email/phone return redacted; region returns intact

Verify the redaction

Run a query as the agent that selects a known PII column, and confirm the returned values are redacted while non-sensitive columns pass through. Then confirm that the same query run without masking exposes the raw data, so you know the gateway, not the agent, did the work.
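
One way to make this check repeatable is a small script that scans the rows the agent received for values that still look raw. The sketch below assumes the rows have already been parsed into Python dictionaries (for instance from JSON-formatted bq output). The patterns and the find_leaks helper are hypothetical, and nothing is assumed about what the redaction placeholder looks like: the check only asserts that raw-looking values are absent.

```python
import re

# Hypothetical patterns for what a raw value looks like in each sensitive column.
RAW_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_leaks(rows):
    """Return (row_index, column) pairs where a sensitive column still looks raw."""
    leaks = []
    for index, row in enumerate(rows):
        for column, pattern in RAW_PATTERNS.items():
            value = row.get(column)
            if isinstance(value, str) and pattern.search(value):
                leaks.append((index, column))
    return leaks


# Illustrative rows only: one set as seen through a masked connection,
# one set as the same query would return without masking.
masked = [{"email": "[REDACTED]", "phone": "[REDACTED]", "region": "us-west"}]
unmasked = [{"email": "ana@example.com", "phone": "+1 415 555 0100", "region": "us-west"}]

masked_leaks = find_leaks(masked)
unmasked_leaks = find_leaks(unmasked)
```

An empty list for the masked connection together with a non-empty list for the unmasked one is the evidence you want: the same query exposes raw values without the gateway and does not with it.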

Pitfalls

  • Do not assume masking is on by default for BigQuery. It is configured per connection and needs a DLP provider attached.
  • Do not rely on column-level prompts to the agent. A prompt is guidance, not a boundary.
  • Do not mask only the obvious fields. Configure the DLP classes to catch the long tail as well: free-text columns often carry names and identifiers too.
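
The last pitfall is easy to see in a small example. The sketch below contrasts masking by column name with masking by content, using a hypothetical support_note column. The regex is a simplified stand-in for a DLP detector, and both helper functions are invented for illustration. A real DLP provider also covers classes, such as personal names, that a single regex like this cannot.

```python
import re

# Simplified stand-in for a DLP email detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# The "obvious fields" approach: redact only columns with known names.
SENSITIVE_COLUMN_NAMES = {"email", "phone"}


def mask_by_column_name(row):
    """Redact whole values, but only in columns whose names are listed."""
    return {
        col: "<REDACTED>" if col in SENSITIVE_COLUMN_NAMES else val
        for col, val in row.items()
    }


def mask_by_content(row):
    """Scan every string value, so free-text columns are covered too."""
    return {
        col: EMAIL.sub("<EMAIL_ADDRESS>", val) if isinstance(val, str) else val
        for col, val in row.items()
    }


row = {
    "email": "ana@example.com",
    "support_note": "Customer asked to use ana.alt@example.org instead",
    "region": "us-west",
}

by_name = mask_by_column_name(row)
by_content = mask_by_content(row)
```

Here by_name still carries the second address inside support_note, while by_content redacts it. That gap is what the long tail of DLP classes is meant to close.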

hoop.dev is open source, so you can verify where redaction happens before you route real data through it. See the getting started guide and how masking supports PII and PHI redaction for AI agents on BigQuery. Get the source at github.com/hoophq/hoop and test masking against a known sensitive column.

FAQ

Does data masking change my BigQuery tables?

No. The tables are untouched. hoop.dev redacts in the result stream on the way back to the agent, so the stored data is unchanged and the agent simply never receives the raw values.

Is masking automatic on every BigQuery query?

Masking is configured per connection with a DLP provider, not enabled by default. Once configured, it applies inline to the results that flow through the gateway.

Open source

Save the open-source gateway for agent data access

Hoop is MIT-licensed infrastructure for controlling how AI agents reach production data. Star hoophq/hoop so you can inspect it, deploy it, or share it when your team starts governing agent access.