The query returned nothing. The pipeline had failed. Data without proper masking had slipped into the lake, exposed and raw.
Databricks offers powerful tools for building and managing large-scale data pipelines, but control over sensitive data is not optional. Databricks has no official manpages for data masking in the classic UNIX sense, yet documenting your masking methods with the precision of a system manual is the fastest path to consistency and compliance.
Data Masking in Databricks
Data masking replaces sensitive fields—PII, financial records, health data—with obfuscated values while keeping schema and format intact. In Databricks, masking can be done at query time using SQL functions, during ETL transforms with PySpark, or at the ingestion layer via Delta Live Tables. Common strategies include:
- Static masking: Permanently overwrite sensitive fields with masked values during ingestion.
- Dynamic masking: Apply masking logic at runtime based on user roles or query context.
- Format-preserving masking: Maintain data type and length for downstream compatibility.
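As an illustration of the format-preserving idea, here is a minimal sketch in plain Python (not PySpark; the sample card number is hypothetical) that masks a card number while keeping its length and separator layout intact:

```python
def mask_card_number(card: str) -> str:
    """Format-preserving mask: replace every digit except the last four
    with 'X', keeping separators and overall length intact."""
    total_digits = sum(ch.isdigit() for ch in card)
    digits_seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            digits_seen += 1
            # Keep only the trailing four digits visible.
            out.append(ch if digits_seen > total_digits - 4 else "X")
        else:
            out.append(ch)  # preserve dashes/spaces for downstream parsers
    return "".join(out)

print(mask_card_number("4111-1111-1111-1234"))  # XXXX-XXXX-XXXX-1234
```

Because the masked value has the same data type, length, and grouping as the original, downstream validators and column schemas continue to work unchanged.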
Implementing SQL-Based Masking
Example for masking an email column (note that RAND() is non-deterministic, so the same row masks differently on every query and collisions across rows are possible):

SELECT
  CONCAT('user', CAST(RAND() * 1000 AS INT), '@example.com') AS masked_email,
  other_column
FROM raw_table;
This logic can be baked into views or temporary tables, ensuring consumers never see raw values.
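The dynamic variant can also live in transform code. Below is a minimal plain-Python sketch under stated assumptions: the role names are hypothetical, and the role check stands in for Databricks' group-membership functions rather than reproducing them. It uses a deterministic hash instead of RAND() so the same input always masks to the same value, which keeps joins on masked columns usable:

```python
import hashlib

PRIVILEGED_ROLES = {"pii_readers", "admins"}  # hypothetical role names

def mask_email(email: str) -> str:
    # Deterministic mask: hash the address so repeated rows mask identically.
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:8]
    return f"user{digest}@example.com"

def resolve_email(email: str, roles: set) -> str:
    """Dynamic masking: return the raw value only to privileged roles."""
    if roles & PRIVILEGED_ROLES:
        return email
    return mask_email(email)

print(resolve_email("alice@corp.com", {"analysts"}))     # masked pseudonym
print(resolve_email("alice@corp.com", {"pii_readers"}))  # alice@corp.com
```

The same pattern maps naturally onto a view or UDF in Databricks: evaluate the caller's role at query time, and return either the raw column or its masked form.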