The query burned through the logs: unmasked data where it shouldn’t be. The integration tests had passed. The job had shipped. But in Databricks, real names still slipped through.
Integration testing with Databricks data masking is not optional. It is the control layer that catches leaks before they hit production. Databricks supports robust data masking policies across tables, views, and compute clusters, but these must be tested end-to-end. The gaps appear when transformations and joins bypass masking rules.
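The join-bypass failure mode is worth seeing concretely. The sketch below uses plain Python lists of dicts standing in for tables (the table names and values are hypothetical): a view masks `email`, but a downstream join back to the raw table on `user_id` re-attaches the clear value.

```python
# Hypothetical sketch: a masked view joined back to the raw table.
masked_view = [{"user_id": 1, "email": "***"}]
raw_table = [{"user_id": 1, "email": "alice@example.com"}]

# The join key (user_id) is unmasked, so the clear email re-enters
# the result set through the raw side of the join.
joined = [
    {**m, "raw_email": r["email"]}
    for m in masked_view
    for r in raw_table
    if m["user_id"] == r["user_id"]
]
# joined now carries the unmasked address -- exactly the leak
# that end-to-end tests must catch.
```

The same shape occurs in Spark SQL whenever a pipeline joins a governed view against an ungoverned staging table.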
Start with clear masking policies. In Unity Catalog, define column-level masks on sensitive fields; on the legacy Hive metastore, which lacks column masks, fall back to dynamic views that apply the same logic. Use built-in functions for deterministic masking or random obfuscation. Integration tests must validate both the policy itself and its execution during workflows.
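The deterministic-versus-random choice matters for joins. A minimal sketch of the two mask styles in plain Python (the function names and salt are illustrative, not a Databricks API):

```python
import hashlib
import secrets

def mask_deterministic(value: str, salt: str = "static-salt") -> str:
    # Same input always yields the same token, so joins and
    # group-bys on masked columns still line up across tables.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_random(value: str) -> str:
    # Random obfuscation: unlinkable between runs, which is safer
    # but breaks any downstream join on the masked column.
    return secrets.token_hex(6)
```

A masking function registered in Unity Catalog would encode the same decision; the tests that follow should assert whichever property the policy promises.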
Build tests that run against staging datasets in Databricks. Include scenarios with joins, aggregations, and filters. Verify masked fields remain masked after each step. Use PySpark or SQL-based assertions in notebooks or pipelines. Test against known datasets with controlled sensitive values to confirm accuracy.
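An assertion helper along these lines can run after each transformation step. This is a sketch in plain Python; in a real pipeline the rows would come from a Spark DataFrame (for example via `.collect()`), and the helper name and seed values are hypothetical.

```python
def assert_masked(rows, column, sensitive_values):
    """Fail if any known sensitive value leaks through `column`."""
    leaked = [r[column] for r in rows if r[column] in sensitive_values]
    assert not leaked, f"unmasked values in {column}: {leaked}"

# Controlled seed data: the sensitive values we planted upstream.
seed = [{"name": "Alice Smith"}, {"name": "Bob Jones"}]
sensitive = {r["name"] for r in seed}

# Simulated output of a join/aggregation step over the masked table.
step_output = [{"name": "***"}, {"name": "***"}]
assert_masked(step_output, "name", sensitive)
```

Because the seed values are controlled, the check is exact: any planted name that survives a join, aggregation, or filter fails the test immediately.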