When a contractor leaves a company, the CI pipelines they built often keep running. Those pipelines still hold the service account token that was used to call the organization’s embedding API, and they continue to send raw documents for vectorization. The result is a steady stream of proprietary text flowing through an uncontrolled channel, invisible to anyone who now owns the data.
A machine identity is any non‑human principal (a service account, CI token, or automated key) that authenticates software to a backend. Unlike a human user, a machine identity rarely falls under a credential rotation policy, and it is typically granted broad permissions so that developers can move fast.
Embeddings turn raw text into high‑dimensional vectors that power semantic search, recommendation, and downstream LLM prompts. The original text is often not stored alongside the vectors, but the raw input sent for vectorization may still contain personally identifiable information, trade secrets, or regulated data.
Why static machine identities are a liability for embeddings
Because a static machine identity is issued once and never expires, an attacker who discovers the token can keep replaying embedding requests until someone notices and revokes it. The token also bypasses any human‑in‑the‑loop review, so sensitive snippets can be vectorized and later surfaced through similarity queries. In many organizations the token is hard‑coded into build scripts, meaning that even after a developer departs the credential remains active.
Beyond theft, static credentials make it impossible to know which piece of code generated a particular vector. Auditors cannot answer questions like “who sent this document to the embedding service on Tuesday?” without a dedicated logging layer, and the organization loses the ability to enforce data‑loss‑prevention policies on the fly.
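A minimal sketch of what such a logging layer might record, so the auditor's question becomes answerable. The field names, the `audit_record` and `sent_by` helpers, and the identity strings are all hypothetical and chosen for illustration; they are not taken from any particular product.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List, Optional


def audit_record(identity: str, pipeline: str, document: str,
                 when: Optional[datetime] = None) -> Dict[str, object]:
    """Build one audit entry that ties an embedding request to its caller."""
    when = when or datetime.now(timezone.utc)
    raw = document.encode("utf-8")
    return {
        "timestamp": when.isoformat(),
        "identity": identity,   # which machine identity made the call
        "pipeline": pipeline,   # which piece of automation it ran in
        # Store a digest rather than the text, so the log is not a second copy of the data.
        "document_sha256": hashlib.sha256(raw).hexdigest(),
        "document_bytes": len(raw),
    }


def sent_by(log: List[Dict[str, object]], document: str) -> List[str]:
    """Return the identities that submitted a given document."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
    return [str(e["identity"]) for e in log if e["document_sha256"] == digest]
```

With entries like these, "who sent this document on Tuesday?" reduces to a lookup by digest and timestamp. The lookup is only meaningful if each pipeline has its own identity rather than sharing one token.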
Machine identity requirements for a secure data path
The first line of defense is to treat the embedding service as a protected resource and place a gate in front of it that every request must cross. That gate should:
- Validate the machine identity against an identity provider (IdP) at the moment of request.
- Issue short‑lived, just‑in‑time credentials so that a token cannot be reused after the operation completes.
- Record the full request and response metadata for later replay or audit.
- Inspect the payload and mask any fields that match regulated patterns before they reach the embedding engine.
- Require an explicit approval step for high‑risk inputs, such as documents containing health information or financial data.
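The short‑lived credential requirement can be sketched as follows. This is an illustration built on the Python standard library, not a production token format: the `issue_token` and `verify_token` helpers, the claim names, and the hard‑coded `SIGNING_KEY` are assumptions made for the example. A real gate would more likely rely on its IdP's own token mechanism.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical signing key held only by the gate; a real one would come from a secret store.
SIGNING_KEY = b"demo-key-do-not-use"


def _sign(body: str) -> str:
    return hmac.new(SIGNING_KEY, body.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(identity: str, ttl_seconds: int = 60, now: Optional[float] = None) -> str:
    """Mint a token that names the caller and expires after ttl_seconds."""
    issued = time.time() if now is None else now
    claims = json.dumps({"sub": identity, "exp": issued + ttl_seconds})
    body = base64.urlsafe_b64encode(claims.encode("utf-8")).decode("ascii")
    return f"{body}.{_sign(body)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the caller identity, or None if the token is forged or expired."""
    body, _, sig = token.rpartition(".")
    # Check the signature before decoding anything the caller supplied.
    if not hmac.compare_digest(sig.encode("utf-8"), _sign(body).encode("utf-8")):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None
```

A short expiry bounds how long a stolen token is useful, but it does not by itself make the token single‑use. Strict one‑time use would additionally need the gate to remember which tokens it has already accepted.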
Only a component that sits in the data path can guarantee that every call is subject to those controls. Identity verification alone, performed by an IdP, does not enforce masking or approval because the request flows directly to the embedding backend.
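To make the in‑path idea concrete, here is a rough sketch of a gate function that masks the payload and enforces approval before anything reaches the backend. The regular expressions, the high‑risk keyword list, and the `gated_embed` and `ApprovalRequired` names are illustrative assumptions. The `embed` argument stands in for whatever client actually calls the embedding service.

```python
import re
from typing import Callable, List

# Illustrative patterns only; a real deployment would use a vetted DLP rule set.
MASK_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Hypothetical keywords that mark an input as high risk.
HIGH_RISK = re.compile(r"\b(diagnosis|patient|account number)\b", re.IGNORECASE)


class ApprovalRequired(Exception):
    """Raised when a high-risk input arrives without an explicit approval."""


def mask(text: str) -> str:
    """Replace every match of a regulated pattern with a labeled placeholder."""
    for label, pattern in MASK_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def gated_embed(text: str, embed: Callable[[str], List[float]],
                approved: bool = False) -> List[float]:
    """Enforce approval, mask the payload, and only then call the backend."""
    if HIGH_RISK.search(text) and not approved:
        raise ApprovalRequired("high-risk input needs explicit approval")
    # The backend only ever sees the masked text.
    return embed(mask(text))
```

Because callers can reach the backend only through `gated_embed`, no request can skip masking or approval. An identity check performed off to the side cannot give that guarantee.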
