This guide covers production deployment of Microsoft Presidio using the official Helm chart, with detailed rationale for each configuration decision based on Presidio's internal resource model.

Documentation Index
Fetch the complete documentation index at: https://mintlify.hoop.dev/docs/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Hoop integrates with Presidio through a configuration interface that allows users to select which entity types will be used to perform redaction analysis. Based on this configuration, the Agent component parses the protocol (Postgres, Mongo, terminal, etc.) in real time and constructs a structured payload to analyze the protocol's contents. Any findings are then anonymized and the content is redacted back into the original protocol format. Redaction statistics are also collected and sent to the gateway, where they are stored in the database for further analysis.

Architecture
Presidio is composed of three deployable components in this chart:
- Analyzer — NLP-heavy service; holds a spaCy model in memory and performs inference per request
- Anonymizer — Lightweight string transformation; no ML model, negligible resource cost
- Envoy Proxy — Reverse proxy using least-connections load balancing to distribute traffic efficiently across Analyzer pods
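The Agent's analyze-then-anonymize round trip can be sketched as two plain request payloads. The field names below follow Presidio's REST API (`POST /analyze` and `POST /anonymize`); treat the exact shapes as an assumption, and note that the sample text, offsets, and helper names are invented for illustration:

```python
# Sketch of the two-step Presidio flow performed per protocol payload.
# Field names follow Presidio's REST API; sample data is invented.

def build_analyze_request(text: str, language: str = "en") -> dict:
    """Payload for POST /analyze on the Analyzer service."""
    return {"text": text, "language": language}

def build_anonymize_request(text: str, analyzer_results: list) -> dict:
    """Payload for POST /anonymize on the Anonymizer service.

    `analyzer_results` is the list of findings returned by the Analyzer:
    an entity type plus character offsets into the original text.
    """
    return {
        "text": text,
        "analyzer_results": analyzer_results,
        # Replace every finding with its entity type, e.g. "<EMAIL_ADDRESS>".
        "anonymizers": {"DEFAULT": {"type": "replace"}},
    }

text = "Contact bob@example.com for access"
analyze_req = build_analyze_request(text)
# A finding the Analyzer would typically return for the email above:
findings = [{"entity_type": "EMAIL_ADDRESS", "start": 8, "end": 23, "score": 1.0}]
anonymize_req = build_anonymize_request(text, findings)
```

The Analyzer does the expensive work; the Anonymizer only applies the substitutions described by `analyzer_results`, which is why their resource profiles differ so sharply.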
Helm Chart Reference
To deploy using a full `values.yaml` file:
values.yaml (default)
- `presidio-analyzer` — The analyzer service that detects PII data in text.
- `presidio-anonymizer` — The anonymizer service that masks PII data in text.
- `presidio-envoy-proxy` — The Envoy proxy that load-balances connections to Presidio.
Release Information
For more information about new releases, consult the Presidio Helm Chart repository.

Generating Manifests
If you prefer managing raw manifests instead of Helm releases, we recommend this approach: it lets you track modifications to the chart whenever a new version appears, by diffing the regenerated manifests against your versioned files to identify what has changed.

Presidio Analyzer
For the default installation, the Analyzer component loads the `en_core_web_lg` spaCy model (~750MB) once at startup. Every request runs a full NLP pipeline:
- tokenizer → tagger → dependency parser → named entity recognizer → recognizer chain.
- Memory is mostly static after startup (dominated by the model)
- CPU is consumed per request, and scales linearly with token count
- More CPU cores do not speed up a single request — they allow more requests to run simultaneously
- CPU allocation controls throughput, not latency
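The throughput-versus-latency distinction can be made concrete with a small back-of-the-envelope model. The 100 ms per-request figure below is an illustrative assumption, not a measured value:

```python
# Each worker is synchronous: one in-flight request, pinned to one core.
# Adding cores (and matching workers) multiplies throughput, but the time
# to process any single request is unchanged.

PER_REQUEST_SECONDS = 0.100  # assumed single-document inference time

def max_throughput_rps(workers: int) -> float:
    """Upper bound on sustained requests/second for synchronous workers."""
    return workers / PER_REQUEST_SECONDS

# Doubling workers from 2 to 4 roughly doubles sustainable throughput
# (about 20 -> 40 req/s here), while per-request latency stays ~100 ms.
```

This is why the HPA scales on CPU utilization: saturated workers show up as sustained CPU pressure, not as slower individual inferences, until requests start queueing.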
NER Entity Cost Tiers
Not all entity types have the same CPU cost:

| Tier | Entity Types | Cost |
|---|---|---|
| Regex / rule-based | EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, IP_ADDRESS, URL, DATE_TIME, IBAN_CODE, CRYPTO, country-specific IDs | Low (a few ms each) |
| NER-backed | PERSON, LOCATION, ORGANIZATION, NRP | High (requires full spaCy NER pipeline) |
Prefer regex/rule-based entities when configuring the data masking resource on Hoop. This is one of the most effective ways to reduce per-request CPU time.
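As an illustration, an analyze request restricted to the low-cost tier might be built like this. The `entities` field is Presidio's mechanism for limiting which recognizers run; the helper name and the exact entity selection are a sketch:

```python
# Limiting analysis to regex/rule-based recognizers skips the expensive
# spaCy NER pass. Entity names below come from the cost table above.

LOW_COST_ENTITIES = [
    "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD",
    "US_SSN", "IP_ADDRESS", "URL",
]
NER_BACKED_ENTITIES = {"PERSON", "LOCATION", "ORGANIZATION", "NRP"}

def build_restricted_request(text: str) -> dict:
    """Analyzer payload that opts out of NER-backed entity types."""
    return {"text": text, "language": "en", "entities": LOW_COST_ENTITIES}

request = build_restricted_request("card 4111 1111 1111 1111")
# No NER-backed entity appears in the request:
assert not NER_BACKED_ENTITIES & set(request["entities"])
```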
Gunicorn Configuration
Enable Preload
Without `preload_app`, each Gunicorn worker independently loads the spaCy model at startup:
With `preload_app = True`, the master process loads the model once, then forks workers that inherit memory via Linux copy-on-write (CoW). Because the model weights are read-only during inference, these pages are never copied — they remain shared across all workers:
Approximate memory usage with and without `preload_app`:
| Workers | Without preload | With preload | Concurrent requests |
|---|---|---|---|
| 1 | ~1.2Gi | ~1.2Gi | 1 |
| 2 | ~2.4Gi | ~1.4Gi | 2 |
| 4 | ~4.8Gi | ~1.6Gi | 4 |
| 8 | ~9.6Gi | ~2.0Gi | 8 |
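The table follows a simple model: without preload, each worker pays the full per-process cost; with preload, workers share the model pages and add only private overhead. A rough reconstruction, where the ~1.2 Gi baseline and ~0.1 Gi per-worker overhead are approximations read off the table rather than measured constants:

```python
BASELINE_GI = 1.2          # spaCy model + Python runtime per process (approx.)
WORKER_OVERHEAD_GI = 0.1   # private pages each forked worker accumulates (approx.)

def rss_without_preload(workers: int) -> float:
    # Every worker loads its own full copy of the model.
    return workers * BASELINE_GI

def rss_with_preload(workers: int) -> float:
    # Master loads the model once; forked workers share it via copy-on-write.
    return BASELINE_GI + workers * WORKER_OVERHEAD_GI

for w in (2, 4, 8):
    print(w, rss_without_preload(w), rss_with_preload(w))
```

The key property is that with preload, memory grows roughly linearly in the small per-worker overhead rather than in the model size.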
Workers and CPU Requests
Set the number of workers to match the guaranteed CPU (`requests.cpu`, in whole cores):
The default is `requests.cpu: 1024m` (~1 core), so `workers = 2` provides a small amount of headroom. If you increase CPU requests to `2000m`, keep `workers = 2`; for `4000m`, set `workers = 4`.
If workers exceed guaranteed cores, they compete for CPU time under load. The kernel’s CFS scheduler throttles workers that exceed their quota window (100ms intervals), introducing latency spikes mid-inference.
Threads Configuration
`gthread` workers are thread-based and handle I/O-bound concurrency within a worker using multiple threads. Combined with `workers = 2`, this allows up to 8 concurrent connections with overlap during I/O phases (request parsing, response serialization). CPU-bound inference still blocks the thread, so effective CPU-saturating concurrency remains bounded by `workers`.
Threads also let endpoints that are not meant for inference, such as health checks, respond without blocking behind a CPU-bound request.
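Putting the settings above together, a `gunicorn.conf.py` for the Analyzer might look like the following sketch. The values mirror the defaults discussed in this guide and should be treated as a starting point, not a mandate:

```python
# gunicorn.conf.py for the Presidio Analyzer (sketch).

# Load the spaCy model once in the master; workers share it via CoW.
preload_app = True

# One synchronous inference slot per guaranteed CPU core (requests.cpu).
workers = 2

# Thread-based workers: threads overlap I/O (parsing, serialization) and
# keep non-inference endpoints such as health checks responsive.
worker_class = "gthread"
threads = 4
```

With `workers = 2` and `threads = 4`, the pod accepts up to 8 concurrent connections, but only 2 can be in CPU-bound inference at once.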
Kubernetes Resources
- CPU requests (`1024m`) determine guaranteed scheduling and should match your worker count. The scheduler places the pod assuming ~1 core is needed.
- CPU limits (`2500m`) allow burst to ~2.5 cores when the node has spare capacity. This benefits Presidio during traffic spikes before the HPA scales out new pods. However, burst is not guaranteed — on a fully loaded node, the pod receives exactly `1024m`. Always size workers for the request, not the limit.
- Memory requests (`1024Mi`) must accommodate the preloaded spaCy model (~750MB) plus worker overhead. This is the minimum viable allocation with `preload_app = True` and 2 workers.
- Memory limits (`2048Mi`) provide headroom for longer documents, traffic spikes, and Python GC overhead. OOM kills are destructive (mid-inference requests are dropped), so the limit should be meaningfully above the steady-state baseline.

The default configuration does not guarantee optimal resource allocation. While sufficient for evaluating the solution in most setups, production workloads with stricter requirements should always have CPU resources explicitly reserved based on the Gunicorn workers configuration.
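In standard Kubernetes resource notation, the allocation described above corresponds to the following fragment. Where exactly this nests in the chart's `values.yaml` depends on the chart; the field names themselves are plain Kubernetes `resources` fields:

```yaml
# Analyzer pod resources: size workers for the request, not the limit.
resources:
  requests:
    cpu: 1024m        # guaranteed; sized against the Gunicorn worker count
    memory: 1024Mi    # preloaded spaCy model (~750MB) + worker overhead
  limits:
    cpu: 2500m        # opportunistic burst during spikes; not guaranteed
    memory: 2048Mi    # headroom for long documents and GC before OOM
```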
Autoscaling
Prefer several small pods over a few large ones:
- Fault isolation: Losing a 2-worker pod drops 2 concurrent slots. Losing a 16-worker pod drops 16.
- Rolling deploy safety: Each pod restart incurs a ~15–30s model reload window. Smaller pods reduce the blast radius per restart.
- Scheduling flexibility: Smaller pods fit on more nodes, reducing pending risk during cluster autoscaler events.
`cpuAverageUtilization: 70` leaves 30% headroom before scaling, accounting for the fact that CPU usage spikes sharply during NER inference. Scaling at 90%+ would trigger only after latency has already degraded.
`scaleUpStabilizationWindowSeconds: 30` allows fast scale-up response to traffic bursts. Presidio CPU spikes are sudden and sustained.
`scaleDownStabilizationWindowSeconds: 120` prevents thrashing — each new pod incurs a 15–30s startup cost, so premature scale-down followed by immediate scale-up wastes time and causes dropped requests.
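As a values fragment, the autoscaling settings discussed above could look like this. The value names are the ones used in this guide; the `minReplicas`/`maxReplicas` range is illustrative:

```yaml
# Analyzer autoscaling (illustrative replica range).
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  cpuAverageUtilization: 70
  scaleUpStabilizationWindowSeconds: 30
  scaleDownStabilizationWindowSeconds: 120
```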
Image Configuration
By default, the latest version is used. If you want to use a specific image or pin the versions, refer to the configuration below:

Presidio Anonymizer
Kubernetes Resources
The Anonymizer receives text and a list of pre-detected entity positions, then applies string substitutions (redact, replace, encrypt). It holds no NLP model, performs no inference, and its CPU and memory usage are negligible. A small autoscaling range of `minReplicas: 2` to `maxReplicas: 4` is sufficient.
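Because the Anonymizer carries no model, its values fragment can stay minimal. The replica range below comes from the text above; the resource figures are illustrative placeholders, not chart defaults:

```yaml
# Anonymizer: pure string transformation, so minimal resources suffice.
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
resources:
  requests:
    cpu: 100m       # illustrative; no inference workload to size for
    memory: 128Mi   # illustrative; no model held in memory
```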
Gunicorn Configuration
`preload_app = True` has minimal impact here since there is no heavy model to share, but it does not hurt and keeps configuration consistent.
Image Configuration
By default, the latest version is used. If you want to use a specific image or pin the versions, refer to the configuration below:

Presidio Envoy Proxy
Standard round-robin distributes requests in rotation with no knowledge of backend occupancy. For Presidio, this is problematic for two reasons:
- Request processing time varies significantly with text length. A 100-token document completes in ~2ms; a 5,000-token document in ~100ms. A pod that receives two consecutive long-document requests is occupied for 200ms while round-robin keeps routing new requests to it.
- Workers are fully synchronous. A 2-worker pod with 2 active requests has zero available slots. Any additional request must queue behind the running ones.
The least-connections strategy was validated in benchmark tests to handle saturated Analyzer instances more efficiently than round-robin.
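In Envoy's own configuration model, this corresponds to selecting the `LEAST_REQUEST` load-balancing policy on the Analyzer cluster. A minimal static-config sketch, where the cluster name, address, and port are placeholders rather than values taken from the chart:

```yaml
# Envoy cluster for the Analyzer service, using least-request balancing
# so busy pods receive fewer new requests than idle ones.
clusters:
  - name: presidio_analyzer
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST      # Envoy's least-connections-style policy
    load_assignment:
      cluster_name: presidio_analyzer
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: presidio-analyzer   # placeholder service name
                    port_value: 3000             # placeholder port
```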
Envoy Resources
Autoscaling
`scaleUpStabilizationWindowSeconds: 60` prevents reactive scaling on short bursts; a single Envoy instance handles significant concurrency before becoming a bottleneck.
Image Configuration
By default, the latest version is used. If you want to use a specific image or pin the versions, refer to the configuration below:

Presidio Inference Models
Spacy
en_core_web_lg
The default installation comes with the base model `en_core_web_lg`, which is a spaCy large English model.

Flair
We maintain a custom build of Presidio that leverages Flair, which provides better accuracy in detecting PII data. To take advantage of it, deploy our custom build of the Presidio Analyzer.

Troubleshooting
HPA Field Conflict on Helm Upgrade
When upgrading a Helm release that toggles autoscaling on and then modifies HPA fields like `minReplicas`, the upgrade fails with a server-side apply conflict error:

How to Fix
Option 1: Disable Server-Side Apply
Pass `--server-side=false` to fall back to the classic client-side apply, which does not track field ownership:
Option 2: Force Ownership of the Conflicting Fields
Pass `--force-conflicts` to allow Helm to take ownership of the conflicting fields: