Data masking vs tokenization: which actually controls AI agent risk (on Postgres)

Many engineers assume that tokenization alone can keep AI agents from seeing sensitive PostgreSQL data, but the reality is more nuanced.

Tokenization replaces a value with a reversible placeholder, while data masking substitutes the value with a non‑reversible surrogate at query time.

Both techniques aim to reduce the exposure of personally identifiable information, credit card numbers, or any field that should not be visible to downstream consumers.

When a large language model or other autonomous agent is given direct access to a database, the risk profile changes.

The agent can issue ad‑hoc queries, synthesize new statements, and even iterate on results.

If the underlying data is not protected, the model can inadvertently learn or leak secrets, creating compliance and reputational hazards.

In many organizations, the default practice is to hand an AI service a static database user and password, let it connect straight to PostgreSQL, and trust that the model will behave.

The connection is an ordinary TCP session, the agent uses the same client libraries a human would, and nothing independent of the database records which queries were issued or which rows were returned.

Why tokenization alone is often insufficient

Tokenization works well when the consumer only needs to reference a value without ever needing the original. For example, an order‑processing workflow might store a tokenized credit‑card number and later match it against a token supplied by a payment gateway. The AI agent can join on the token, count occurrences, or filter rows, but it never sees the clear text.
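The matching workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `TokenVault` class, its in-memory storage, the `tok_` prefix, and the demo key are all hypothetical, and a real vault would persist its mappings and restrict who may call `detokenize`.

```python
import hashlib
import hmac


class TokenVault:
    """Toy in-memory vault: deterministic tokens, reversible only through the vault."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._clear_by_token = {}  # token -> clear text, held only inside the vault

    def tokenize(self, value: str) -> str:
        # A keyed HMAC is deterministic, so equal inputs always yield equal tokens.
        digest = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        token = f"tok_{digest[:16]}"
        self._clear_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only a caller with access to the vault can recover the clear text.
        return self._clear_by_token[token]


vault = TokenVault(key=b"demo-key-not-for-production")

# What the agent sees: order rows that carry tokens, never card numbers.
orders = [
    {"order_id": 1, "card": vault.tokenize("4111111111111111")},
    {"order_id": 2, "card": vault.tokenize("5500005555555559")},
    {"order_id": 3, "card": vault.tokenize("4111111111111111")},
]

# A token supplied by the payment gateway can be matched without detokenizing.
gateway_token = vault.tokenize("4111111111111111")
matching_orders = [o["order_id"] for o in orders if o["card"] == gateway_token]
```

Because the tokens are deterministic, equality filters, joins, and counts keep working on the tokenized column, which is exactly the set of operations this workflow needs.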

However, AI agents frequently need to explore data patterns, generate summaries, or perform statistical analysis. Those workloads require access to the actual column contents. If the column is tokenized, the agent receives opaque strings that break most analytical functions, leading to failed queries or misleading results. Moreover, tokenization protects only the columns that were tokenized in advance: a default SELECT * still returns every other column in clear text, and the placeholder values themselves are still transmitted over the wire.
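A small sketch shows how an analytical query degrades on tokenized data. The `tokenize` helper, the demo key, and the sample addresses are hypothetical; the point is only that a token keeps equality but drops the internal structure that an analysis relies on.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def tokenize(value: str) -> str:
    # Deterministic token: preserves equality, discards all internal structure.
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def domain_counts(values):
    # Count values per e-mail domain; anything without "@" falls into "<unknown>".
    return Counter(v.rsplit("@", 1)[1] if "@" in v else "<unknown>" for v in values)


emails = ["ana@example.com", "bo@example.com", "cy@example.org"]
tokens = [tokenize(e) for e in emails]

clear_counts = domain_counts(emails)   # grouping by domain works on clear text
token_counts = domain_counts(tokens)   # every token collapses into "<unknown>"
```

The query still runs against the tokens, but its answer is meaningless, which is the "misleading results" failure mode rather than an outright error.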

When data masking provides the needed protection

Data masking operates at the protocol level, intercepting responses and substituting sensitive fields with masked equivalents before they reach the client. This means the AI agent can still run arbitrary SELECT statements, aggregate functions, or joins, but any column flagged as sensitive will be replaced with a safe value such as "XXXXX" or a format‑preserving mask.
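As a rough illustration of response-side masking, the sketch below rewrites result rows before they are handed to the client. It works on Python dictionaries rather than on the PostgreSQL wire protocol, and the rule table, column names, and helper functions are hypothetical.

```python
def mask_full(value: str) -> str:
    # Fixed replacement: hides both the content and the length of the value.
    return "XXXXX"


def mask_keep_format(value: str, visible: int = 4) -> str:
    """Replace letters and digits with 'X', keeping separators and the last `visible` characters."""
    total = sum(ch.isalnum() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isalnum():
            seen += 1
            out.append(ch if seen > total - visible else "X")
        else:
            out.append(ch)  # separators such as "-" keep the shape of the value
    return "".join(out)


# Hypothetical policy: which columns are sensitive and how each one is masked.
MASKING_RULES = {"email": mask_full, "phone": mask_keep_format}


def mask_row(row: dict) -> dict:
    # Columns without a rule pass through unchanged; the source row is not modified.
    return {
        col: MASKING_RULES[col](str(val)) if col in MASKING_RULES and val is not None else val
        for col, val in row.items()
    }


rows = [{"id": 1, "email": "ana@example.com", "phone": "555-867-5309"}]
masked = [mask_row(r) for r in rows]
# masked == [{"id": 1, "email": "XXXXX", "phone": "XXX-XXX-5309"}]
```

The stored rows are never changed; only the copy that travels to the client is rewritten.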

Masking is particularly valuable for ad‑hoc exploratory queries, data‑science notebooks, or any scenario where the agent must see the shape of the data without the raw values.

Because the transformation happens on the fly, the original database is left untouched, and the same user can be granted broader query capabilities without increasing the risk of data leakage.

Combining tokenization and masking for layered defense

In practice, the strongest posture uses both techniques. Tokenize columns that the agent must be able to join or match on but must never read in clear text, such as Social Security numbers, and apply masking to other regulated fields like email addresses or phone numbers. The AI agent can still perform joins on the tokenized keys while receiving masked versions of the auxiliary data. This layered approach reduces the attack surface and satisfies a wider range of compliance requirements.
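The layered setup can be sketched as follows. The table contents, the `tokenize` and `mask_email` helpers, and the demo key are all hypothetical; the sketch only shows that a join on a tokenized key still works while the auxiliary column arrives masked.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only


def tokenize(value: str) -> str:
    # Deterministic token, so the same SSN maps to the same token in every table.
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def mask_email(value: str) -> str:
    # Format-preserving mask: hide the local part, keep the domain visible.
    local, _, domain = value.partition("@")
    return "X" * len(local) + "@" + domain


customers = [
    {"ssn": "123-45-6789", "email": "ana@example.com"},
    {"ssn": "987-65-4321", "email": "bo@example.org"},
]
claims = [
    {"ssn": "123-45-6789", "amount": 120},
    {"ssn": "123-45-6789", "amount": 80},
    {"ssn": "987-65-4321", "amount": 50},
]

# The agent's view: SSNs tokenized in both tables, e-mail addresses masked.
agent_customers = [
    {"ssn_token": tokenize(c["ssn"]), "email": mask_email(c["email"])} for c in customers
]
agent_claims = [{"ssn_token": tokenize(c["ssn"]), "amount": c["amount"]} for c in claims]

# Join on the token and total the claims per customer.
totals = {}
for claim in agent_claims:
    totals[claim["ssn_token"]] = totals.get(claim["ssn_token"], 0) + claim["amount"]

report = [
    {"email": c["email"], "total": totals.get(c["ssn_token"], 0)} for c in agent_customers
]
# report == [{"email": "XXX@example.com", "total": 200},
#            {"email": "XX@example.org", "total": 50}]
```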

Where the enforcement must live

The controls described above only work if they sit in the data path between the identity initiating the request and the PostgreSQL server fulfilling it. Without a gateway, the database itself cannot reliably apply per‑request masking policies, and tokenization must be baked into the schema, which is brittle and hard to change.

hoop.dev provides that exact data‑path enforcement point. It proxies every PostgreSQL connection, inspects the wire‑level protocol, and applies inline masking, token substitution, just‑in‑time approval workflows, and session recording. Because hoop.dev is the only component that sees the traffic, it guarantees that no raw secret leaves the database without first being transformed.

When an AI agent attempts a query, hoop.dev evaluates the request against the configured policy. If the query touches a masked column, hoop.dev rewrites the response on the fly. If the query includes a tokenized field, hoop.dev enforces that only token values are returned. The gateway also logs the full statement and the masked result, creating an audit trail for later review.
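The evaluation flow described above might look roughly like the sketch below. It is not hoop.dev's implementation or API: `Policy`, `AuditLog`, `to_token`, and `enforce` are hypothetical names, and the rows are plain dictionaries standing in for a parsed query result.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Policy:
    masked_columns: frozenset
    tokenized_columns: frozenset


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)


def to_token(value: str) -> str:
    # Stand-in for a vault lookup; a bare hash is NOT a safe token scheme for real data.
    return "tok_" + hashlib.sha256(value.encode()).hexdigest()[:12]


def enforce(statement: str, rows: list, policy: Policy, audit: AuditLog) -> list:
    """Rewrite result rows according to the policy, then record what was returned."""
    safe_rows = []
    for row in rows:
        safe = {}
        for column, value in row.items():
            if column in policy.masked_columns:
                safe[column] = "XXXXX"
            elif column in policy.tokenized_columns:
                safe[column] = to_token(str(value))
            else:
                safe[column] = value
        safe_rows.append(safe)
    # Only the transformed rows are logged, so the audit trail holds no raw values.
    audit.entries.append({"statement": statement, "result": safe_rows})
    return safe_rows


policy = Policy(masked_columns=frozenset({"email"}), tokenized_columns=frozenset({"ssn"}))
audit = AuditLog()
raw_rows = [{"id": 7, "email": "ana@example.com", "ssn": "123-45-6789"}]
result = enforce("SELECT id, email, ssn FROM customers", raw_rows, policy, audit)
```

Logging after the transformation, rather than before it, is the design choice that keeps the audit trail itself from becoming a second copy of the sensitive data.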

Beyond masking, hoop.dev blocks dangerous commands such as DROP DATABASE or ALTER USER, routes risky statements to a human approver, and records the entire session for replay. All of these outcomes become possible only because hoop.dev sits in the data path, not in the identity or credential provisioning layer.
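A statement guard of this kind can be sketched as a three-way classifier. The patterns below are assumptions chosen for illustration: the two blocked commands come from the text, while the choice of DELETE, UPDATE, and TRUNCATE as statements needing review is hypothetical. A real gateway would parse the SQL rather than rely on regular expressions, which miss cases such as multi-statement strings.

```python
import re
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "review"  # hold the statement until a human approves it
    BLOCK = "block"


# Commands that are rejected outright.
BLOCKED = [
    re.compile(r"^\s*DROP\s+DATABASE\b", re.IGNORECASE),
    re.compile(r"^\s*ALTER\s+USER\b", re.IGNORECASE),
]

# Hypothetical choice of statements considered risky enough for approval.
NEEDS_REVIEW = [
    re.compile(r"^\s*(DELETE|UPDATE|TRUNCATE)\b", re.IGNORECASE),
]


def classify(statement: str) -> Decision:
    # Check the strictest rule set first so a blocked command is never downgraded.
    if any(p.search(statement) for p in BLOCKED):
        return Decision.BLOCK
    if any(p.search(statement) for p in NEEDS_REVIEW):
        return Decision.REVIEW
    return Decision.ALLOW


decisions = {
    stmt: classify(stmt)
    for stmt in ("drop database prod", "DELETE FROM orders", "SELECT * FROM orders")
}
```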

Getting started with hoop.dev for PostgreSQL

To adopt this model, start by deploying the hoop.dev gateway using the official Docker Compose quick‑start. The documentation guides you through configuring OIDC authentication, registering a PostgreSQL target, and defining masking rules for specific columns. You can find detailed guidance in the getting‑started guide and the broader learn section.

Once the gateway is running, your AI agents connect to the PostgreSQL endpoint through hoop.dev using their standard client libraries. From that point forward, hoop.dev enforces the policies you have defined on every query, and it records every session for compliance.

FAQ

  • Does tokenization protect against accidental exposure in logs? Yes, for tokenized columns, because the token value is stored instead of the clear text. Other sensitive columns can still be exposed if a log captures the raw response before masking. hoop.dev ensures that masking happens before any downstream logging.
  • Can I apply masking to only a subset of rows? Masking policies are defined at the column level, but you can combine them with row‑level filters in the policy engine so that only rows matching certain criteria are masked.
  • Is the audit trail protected against tampering? hoop.dev generates the audit records after the data has passed through the gateway. While it does not claim cryptographic immutability, you can forward the logs to storage of your choice for additional protection.

Explore the open‑source code and contribute to the project on GitHub.
