Data Classification for Embeddings

Unclassified embeddings can leak sensitive information across every downstream model. When a vector contains personally identifiable data, a recommendation engine, search service, or another AI component can inadvertently expose that data to users who should never see it.

Data classification is the process of labeling raw inputs according to confidentiality, regulatory, or business impact. In traditional data pipelines, classification tags drive encryption, storage segregation, and access controls. Embedding pipelines, however, often flatten text into high‑dimensional vectors without preserving the original labels, making downstream enforcement blind to the source’s sensitivity.
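
One way to keep the label from being lost is to carry it alongside the vector from the moment the embedding is created. The sketch below is illustrative only: the `Sensitivity` levels, the `LabeledText` and `LabeledEmbedding` types, and the `toy_embed` stand-in for a real model are all hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Tuple


class Sensitivity(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class LabeledText:
    text: str
    label: Sensitivity


@dataclass(frozen=True)
class LabeledEmbedding:
    vector: Tuple[float, ...]
    label: Sensitivity   # copied from the source text, never recomputed
    source_id: str       # lets an auditor trace the vector back to its origin


def embed_with_label(
    doc: LabeledText,
    source_id: str,
    embed_fn: Callable[[str], List[float]],
) -> LabeledEmbedding:
    """Embed the text and attach the source's classification to the result."""
    vector = tuple(embed_fn(doc.text))
    return LabeledEmbedding(vector=vector, label=doc.label, source_id=source_id)


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: text length and vowel count."""
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


doc = LabeledText("patient id 123", Sensitivity.RESTRICTED)
emb = embed_with_label(doc, "doc-1", toy_embed)
```

Whatever store receives `emb` can then filter or segregate on `emb.label` instead of treating every vector as equally harmless.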

The result is a hidden compliance gap: engineers may assume that once data has been encoded into an embedding, the risk disappears. Auditors, on the other hand, still need evidence that every vector originated from a properly classified source, and that any downstream query respects those classifications.

Why data classification matters for embeddings

To close that gap, an enforcement point must sit on the path between the client that generates or consumes embeddings and the model‑serving infrastructure. The enforcement layer can read the classification label attached to each request, decide whether the operation requires additional approval, mask or drop sensitive fields, and log the transaction for later review. Without a dedicated data‑path component, the request travels directly to the model server, bypassing any policy check and leaving no immutable audit trail.
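
As a rough illustration of what "sitting on the path" means, the sketch below wraps the model call so that every request is checked and logged before it is forwarded. It is a minimal in-process analogy, not a network gateway; `make_enforced_call`, the request fields (`user`, `label`, `text`), and the `is_allowed` callback are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]


def make_enforced_call(
    model_call: Callable[[str], List[float]],
    is_allowed: Callable[[Request], bool],
    audit_log: List[Dict[str, Any]],
) -> Callable[[Request], List[float]]:
    """Return a function that is the only route to model_call."""

    def enforced_call(request: Request) -> List[float]:
        allowed = is_allowed(request)
        # Log the decision before acting on it, so denied requests leave a trace too.
        audit_log.append({
            "user": request.get("user"),
            "label": request.get("label"),
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError("blocked by classification policy")
        return model_call(request["text"])

    return enforced_call
```

Callers that only ever receive `enforced_call` cannot reach the model without passing through the check, which is the property the data-path placement is meant to guarantee.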

Where the enforcement must live

Setup components such as OIDC identity providers, group memberships, and role‑based permissions determine who is allowed to start a request. Those components are essential for authentication, but they cannot enforce per‑vector policies on their own. The only place enforcement can reliably happen is in the data path that all traffic traverses before reaching the model.

How hoop.dev enforces classification at the gateway

hoop.dev provides exactly that data‑path gateway. It proxies embedding requests, reads the user’s OIDC token, applies classification‑aware policies, and can block, mask, or route the request for manual approval before it reaches the model. Because the gateway sits outside the model container, the model never sees raw sensitive data unless the policy explicitly allows it. hoop.dev also records each session, enabling replay and audit without exposing credentials to the client.

When a request arrives, hoop.dev extracts the user’s identity and any group tags that encode classification levels. Policy rules, defined centrally, map those levels to actions such as:

  • Allowing the vector to be stored in a low‑risk vector store.
  • Masking fields that contain regulated identifiers before they are indexed.
  • Escalating the request to a human approver when high‑risk data is involved.
  • Rejecting the operation outright if it violates a compliance rule.
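
The four outcomes above can be sketched as a small lookup from classification tags to actions. The tag names (`class:public` and so on) and the "most restrictive action wins, unknown tags are rejected" rule are assumptions chosen for this example, not hoop.dev's actual policy format.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Action(Enum):
    ALLOW = "allow"
    MASK = "mask"
    REQUIRE_APPROVAL = "require_approval"
    REJECT = "reject"


# Ordered from least to most restrictive.
SEVERITY: List[Action] = [
    Action.ALLOW,
    Action.MASK,
    Action.REQUIRE_APPROVAL,
    Action.REJECT,
]

# Hypothetical tag names; a real deployment would use its own scheme.
POLICY: Dict[str, Action] = {
    "class:public": Action.ALLOW,
    "class:regulated": Action.MASK,
    "class:high-risk": Action.REQUIRE_APPROVAL,
    "class:prohibited": Action.REJECT,
}


def decide(tags: Iterable[str]) -> Action:
    """Pick the most restrictive action among the tags; fail closed otherwise."""
    actions = [POLICY.get(tag, Action.REJECT) for tag in tags]
    if not actions:
        return Action.REJECT  # no classification at all is treated as a violation
    return max(actions, key=SEVERITY.index)
```

Failing closed on missing or unknown tags is a design choice: it trades some friction for the guarantee that unclassified data never slips through by default.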

All decisions are logged by the gateway, creating a tamper‑evident audit trail that auditors can query. The logs include the identity of the requester, the classification label, the action taken, and a replayable session capture.
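
To make "tamper-evident" concrete, one common technique is a hash chain, where each record includes the hash of the previous one so that any later modification breaks verification. The sketch below illustrates that general idea only; it is not a description of how hoop.dev stores its logs, and `append_entry` and `verify` are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash for the first record


def _digest(prev_hash: str, entry: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    payload = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], entry: Dict[str, Any]) -> None:
    """Append an entry whose hash covers both its content and its predecessor."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "entry": entry,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, entry),
    })


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited, removed, or reordered record fails."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True
```

A chain like this shows that records were altered after the fact, but it does not prevent alteration; anchoring the latest hash somewhere the writer cannot modify is what makes the evidence useful.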

Replay capability lets security teams reconstruct exactly what a user typed and what the model returned, down to each vector payload. That level of granularity is essential for forensic investigations, for demonstrating compliance during audits, and for training purposes. The recorded sessions can be streamed into a SIEM or log‑analysis platform, so that anomalous patterns can be flagged in near real time.

Policies are defined centrally in a declarative file or through the web UI. They can reference identity attributes such as department, security clearance, or custom tags that encode the data classification level. Because the gateway evaluates policies on every request, administrators can tighten or relax rules without redeploying the underlying services, achieving true just‑in‑time governance.
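
A declarative policy that references identity attributes might look something like the sketch below. The JSON layout, the attribute names (`department`, `clearance`), and the first-match-wins evaluation are assumptions for illustration; consult the hoop.dev documentation for the real policy format.

```python
import json
from typing import Any, Dict, List

# Hypothetical policy document: rules are checked top to bottom.
POLICY_JSON = """
[
  {"match": {"department": "research", "clearance": "high"}, "action": "allow"},
  {"match": {"department": "research"}, "action": "mask"},
  {"match": {}, "action": "reject"}
]
"""


def load_rules(text: str) -> List[Dict[str, Any]]:
    """Parse the policy text; calling this again picks up edited rules."""
    return json.loads(text)


def evaluate(rules: List[Dict[str, Any]], identity: Dict[str, str]) -> str:
    """Return the action of the first rule whose attributes all match."""
    for rule in rules:
        if all(identity.get(key) == value for key, value in rule["match"].items()):
            return rule["action"]
    return "reject"  # fail closed if no rule matches


rules = load_rules(POLICY_JSON)
```

Because `evaluate` runs on every request against whatever `load_rules` last returned, changing the policy text changes behavior on the next request without touching the services behind it.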

In large organizations, multiple teams may share the same gateway instance while maintaining isolated policy scopes. The gateway isolates traffic per connection, ensuring that one team’s classification rules never leak into another’s workload. Horizontal scaling is achieved by adding more agent instances behind a load balancer, preserving low latency while keeping a single point of policy enforcement.

As AI models become more capable, the surface area for data leakage expands. Embedding gateways that enforce classification at the protocol level will remain a critical control, providing a consistent audit surface even as underlying services evolve. hoop.dev’s open‑source nature lets teams extend the policy engine to cover new data‑sensitivity tags or integrate with emerging identity standards.

Getting started with hoop.dev

Deploy the gateway using the documented quick‑start, configure your embedding service as a connection, and define classification policies in the policy editor. The getting started guide walks through the steps, and the learn page provides deeper insight into masking and approval workflows.

FAQ

Does hoop.dev modify the embedding itself?

No. hoop.dev inspects the request metadata and can mask or drop fields before they are encoded as vectors, but it never alters the mathematical representation of an already‑generated embedding.
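
Masking before encoding can be pictured as a text transformation applied to the input, leaving the embedding step itself untouched. The sketch below uses two simple regular expressions (email addresses and US-style SSN patterns) purely as an example; the patterns and the `mask_text` helper are assumptions, and real detection of regulated identifiers usually needs more than regexes.

```python
import re

# Deliberately simple patterns for illustration; they will miss edge cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_text(text: str) -> str:
    """Replace matching identifiers with placeholders before the text is embedded."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


masked = mask_text("Contact jane@example.com, SSN 123-45-6789")
# masked == "Contact [EMAIL], SSN [SSN]"
```

The model then encodes only the masked string, so the resulting vector never contained the raw identifier in the first place.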

Can I use hoop.dev with any vector database?

hoop.dev can proxy connections to supported database targets, including PostgreSQL, MySQL, and MongoDB, which are commonly used as vector stores. For other stores, you can expose them through a supported protocol or build a custom connector.

What evidence does hoop.dev provide for compliance audits?

Each session is recorded with identity, classification tags, policy decisions, and a replayable stream of the request and response. Those records can serve as evidence for many data‑protection frameworks, although the exact requirements depend on the framework and the auditor.

Explore the open‑source code on GitHub.
