DLP for Embeddings

An offboarded contractor still has a CI job that pushes newly generated text embeddings into a shared vector store, and nothing in the DLP program covers it. The job runs with a static API key that was never rotated, and the store is exposed through an unprotected HTTP endpoint. If that key is later compromised, an attacker can pull the embeddings, extract personally identifiable information, and feed it to downstream models.

That scenario illustrates a common pattern: teams treat embeddings like any other artifact, store them in a database or vector service, and give every consumer a blanket credential. The connection is direct, the credential is long‑lived, and there is no visibility into who queried which vector or what data was returned. When a breach occurs, the lack of audit trails and data redaction makes containment and forensics almost impossible.

Why DLP matters for embeddings

Embeddings are dense representations of raw text, images, or audio. Because they retain semantic similarity, they can inadvertently leak sensitive phrases, names, or health information even after the original source is removed. Data loss prevention (DLP) for embeddings therefore requires two capabilities: (1) the ability to inspect the payload at the protocol level and mask or block fields that match a policy, and (2) a persistent record of every query and response for later review.

Most existing pipelines rely on the identity layer alone (OIDC, SAML, or service accounts) to decide who can connect. That setup enforces “who may start a session” but does not inspect the data flowing through the connection. Without a gateway that sits in the data path, the request travels straight to the vector store, bypassing any opportunity to enforce DLP rules.

Architectural requirement: a data‑path gateway

The missing piece is a Layer 7 gateway that intercepts every embedding request, applies policy, and records the interaction. The gateway must be positioned between the authenticated identity and the vector service, because only there can it see the actual vectors being sent or returned. If the gateway were placed after the service, the data would already have been delivered; if it were placed before authentication, it could not tie actions to a user.

In practice this means deploying a proxy that runs inside the same network as the vector store, registers the store as a protected resource, and trusts an external IdP for authentication. The IdP decides *who* may initiate a session, but the proxy decides *what* that session may do.
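The split between "who" and "what" can be sketched in a few lines of Python. This is an illustration only, not hoop.dev's implementation: the group names, the action names, and the `Session` type are all hypothetical, and the sketch assumes the IdP has already authenticated the user and asserted group claims.

```python
from dataclasses import dataclass

# Hypothetical mapping from IdP group to the operations a session may perform.
# Both the group names and the action names are assumptions for illustration.
GROUP_ACTIONS = {
    "data-science-read": {"query"},
    "data-science-admin": {"query", "upsert", "delete"},
}


@dataclass(frozen=True)
class Session:
    user: str
    groups: tuple  # group claims asserted by the IdP after authentication


def is_allowed(session: Session, action: str) -> bool:
    """Return True if any of the session's groups permits the action."""
    return any(action in GROUP_ACTIONS.get(group, set()) for group in session.groups)


reader = Session(user="alice@example.com", groups=("data-science-read",))
print(is_allowed(reader, "query"))   # True
print(is_allowed(reader, "delete"))  # False
```

The IdP supplies `groups`; the proxy owns `GROUP_ACTIONS`. Keeping the two apart means a valid login alone never implies permission to run a particular operation.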

How hoop.dev fulfills the requirement

hoop.dev implements exactly the data‑path gateway described above. It sits on the network edge, authenticates users via OIDC or SAML, and then proxies connections to the vector store. While the traffic passes through hoop.dev, the platform can:

  • Inspect each embedding payload and apply DLP policies that redact or block fields matching sensitive patterns.
  • Require a just‑in‑time approval workflow for queries that exceed a risk threshold, such as those that request large batches of vectors.
  • Record every request and response, storing a replayable session that auditors can review.
  • Enforce least‑privilege access by issuing short‑lived credentials that expire when the session ends.

Because every request to the vector store passes through hoop.dev, the masking and approval steps apply to every request, provided the store is not reachable by any other network path. If hoop.dev were removed, the connection would go directly to the vector store, and none of the DLP controls would be applied.
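The approval step for large batches can be pictured as a three-way decision. The sketch below is a minimal illustration, not hoop.dev's logic: both thresholds and the `decide` function are assumptions chosen for the example.

```python
from enum import Enum


class Decision(Enum):
    FORWARD = "forward"
    NEEDS_APPROVAL = "needs_approval"
    BLOCK = "block"


# Assumed thresholds; tune them to your own risk model.
APPROVAL_BATCH_SIZE = 100   # above this, a human must approve the query
MAX_BATCH_SIZE = 10_000     # above this, the query is rejected outright


def decide(batch_size: int) -> Decision:
    """Classify a vector query by the number of vectors it requests."""
    if batch_size <= 0 or batch_size > MAX_BATCH_SIZE:
        return Decision.BLOCK
    if batch_size > APPROVAL_BATCH_SIZE:
        return Decision.NEEDS_APPROVAL
    return Decision.FORWARD


print(decide(10))       # Decision.FORWARD
print(decide(500))      # Decision.NEEDS_APPROVAL
print(decide(50_000))   # Decision.BLOCK
```

Batch size is only one possible risk signal; a real policy might also weigh the collection being queried or the time of day.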

Putting the pieces together

To retrofit DLP for embeddings, start with the existing identity provider. Configure OIDC groups that represent the different risk levels of your teams, e.g., data‑science‑read versus data‑science‑admin. Next, deploy hoop.dev near your vector service and register the service as a connection. Define DLP policies in hoop.dev’s policy language to mask any field that contains email addresses, SSNs, or custom regexes that match your organization’s PII.
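To make the masking idea concrete, here is a minimal Python sketch of pattern-based redaction. It is not hoop.dev's policy language, whose syntax is covered in the project's own documentation. It assumes each record returned by the vector store carries a `metadata` dictionary holding source text next to the vector; the regexes are deliberately simple and the record shape is hypothetical.

```python
import re

# Simplified patterns for illustration; production rules need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PATTERNS = [("EMAIL", EMAIL), ("SSN", SSN)]


def mask_text(text: str) -> str:
    """Replace every match of a sensitive pattern with a labeled placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


def mask_record(record: dict) -> dict:
    """Return a copy of the record with string metadata masked.

    The vector itself is left untouched; only text metadata is scanned.
    """
    masked = dict(record)
    masked["metadata"] = {
        key: mask_text(value) if isinstance(value, str) else value
        for key, value in record.get("metadata", {}).items()
    }
    return masked


record = {
    "id": "doc-1",
    "vector": [0.12, -0.48, 0.33],
    "metadata": {"text": "Contact jane.doe@example.com, SSN 123-45-6789"},
}
print(mask_record(record)["metadata"]["text"])
# Contact [REDACTED:EMAIL], SSN [REDACTED:SSN]
```

Regex masking covers the text that travels alongside a vector. It does not address what could be inferred from the vector values themselves, which is why restricting bulk retrieval still matters.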

When a user runs a query through the standard client (curl, python‑requests, or a library that respects HTTP proxies), the request is routed to hoop.dev. The gateway validates the user’s token, checks the request against the DLP policy, and either forwards the request, prompts for approval, or blocks it. The response is examined again; any sensitive values are redacted before they reach the client. Simultaneously, hoop.dev logs the full session, including the user identity, request parameters, and the masked response.
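The logging step can be pictured as one structured record per request. The sketch below is illustrative only: the field names and the `audit_entry` helper are assumptions, not hoop.dev's session format.

```python
import json
from datetime import datetime, timezone


def audit_entry(user: str, request: dict, decision: str, masked_response: dict) -> str:
    """Serialize one gateway interaction as a JSON line for later review."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,                  # identity asserted by the IdP
        "request": request,            # parameters of the query
        "decision": decision,          # e.g. "forward", "needs_approval", "block"
        "response": masked_response,   # only the redacted response is recorded
    }
    return json.dumps(entry, sort_keys=True)


line = audit_entry(
    user="alice@example.com",
    request={"collection": "docs", "top_k": 5},
    decision="forward",
    masked_response={"matches": 5},
)
print(line)
```

Recording the masked response rather than the raw one keeps the audit trail itself from becoming a second copy of the sensitive data.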

This approach satisfies three critical goals:

  1. Protection: Sensitive data never leaves the vector store unmasked.
  2. Visibility: Every access is recorded, enabling forensic analysis and compliance reporting.
  3. Control: Access is granted on a just‑in‑time basis, reducing the blast radius of compromised credentials.

Getting started

hoop.dev is open source and MIT‑licensed, so you can self‑host the gateway in your environment. The quick‑start guide walks you through deploying the Docker Compose stack, wiring OIDC authentication, and registering a vector store as a protected connection. Detailed feature documentation lives in the Learn section, where you can explore DLP policy syntax and approval workflow configuration.

Once the gateway is running, point your embedding clients at the hoop.dev endpoint instead of the raw store URL. The rest of your application code remains unchanged; hoop.dev handles the security layer transparently.
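One way to keep application code unchanged is to read the store's base URL from configuration, so that switching to the gateway is a deployment change rather than a code change. The sketch below assumes a hypothetical `VECTOR_STORE_URL` environment variable and a hypothetical query path; neither is defined by hoop.dev or by any particular vector database.

```python
import os
from urllib.parse import urljoin

# Hypothetical direct address of the vector store, used only as a fallback.
DEFAULT_STORE_URL = "http://vector-store.internal:8080/"


def store_base_url() -> str:
    """Return the configured base URL, always ending with a slash."""
    base = os.environ.get("VECTOR_STORE_URL", DEFAULT_STORE_URL)
    return base if base.endswith("/") else base + "/"


def query_url(collection: str) -> str:
    """Build the query endpoint for a collection (the path is an assumption)."""
    return urljoin(store_base_url(), f"collections/{collection}/query")


# Pointing clients at the gateway is then a configuration change:
os.environ["VECTOR_STORE_URL"] = "https://hoop-gateway.internal:8443"
print(query_url("docs"))
# https://hoop-gateway.internal:8443/collections/docs/query
```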

FAQ

Does hoop.dev store the raw embeddings?
No. The gateway only buffers data long enough to apply masking and then forwards the sanitized payload. The original vectors remain in the downstream store.

Can I use hoop.dev with any vector database?
hoop.dev supports any target that can be reached via a standard protocol (HTTP, gRPC, or a database wire protocol). As long as the vector service is reachable from the gateway’s network, you can register it as a connection.

How does hoop.dev integrate with existing CI pipelines?
Configure the pipeline’s HTTP client to use the hoop.dev proxy URL. The pipeline’s service account will be authenticated by the IdP, and hoop.dev will enforce DLP policies on every automated query.

For full installation instructions and to explore the source code, visit the GitHub repository. The getting‑started guide provides a step‑by‑step walkthrough to secure your embeddings with DLP today.
