Data privacy has become a core focus for engineering teams, especially when handling sensitive information like email addresses in application logs. Managing these data points responsibly is critical to keeping systems compliant with privacy regulations and maintaining user trust. Google Cloud’s BigQuery offers powerful tools for data masking, enabling you to restrict access to sensitive data.
This article provides a concise guide on masking email addresses in logs using BigQuery, ensuring sensitive information is protected while remaining accessible for analysis.
Why Mask Email Addresses in Logs?
Email addresses stored in logs might contain sensitive personal information. Having them visible in their entirety can lead to unintended leaks, violations of data protection policies, or unnecessary exposure to unauthorized parties.
Data masking makes email data usable for analytics while safeguarding its sensitive nature. For example, if an email address appears as ******@example.com in your logs instead of john.doe@example.com, you can still analyze patterns like domain usage over time without exposing the full user identity.
Implementing email masking allows teams to align technical systems with regulatory requirements (e.g., GDPR, CCPA) while maintaining critical analytical workflows.
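To illustrate the point, domain-level analytics still work on masked addresses. A minimal Python sketch (the log values here are hypothetical):

```python
from collections import Counter

# Hypothetical masked log entries: usernames hidden, domains intact.
masked_emails = [
    "******@example.com",
    "******@example.com",
    "******@corp.io",
]

# Domain usage can still be aggregated without exposing any identity.
domains = Counter(e.split("@", 1)[1] for e in masked_emails)
print(domains.most_common())  # → [('example.com', 2), ('corp.io', 1)]
```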
Steps to Mask Email Addresses in BigQuery
Here’s how you can mask email addresses in logs using BigQuery’s SQL features.
1. Extract the Domain from the Email
Before masking, break the email address into its username and domain parts. BigQuery’s SPLIT() or REGEXP_EXTRACT() functions allow you to isolate these components.
Example Query:
SELECT
REGEXP_EXTRACT(email, r'^(.*?)@') AS username,
REGEXP_EXTRACT(email, r'@(.*)$') AS domain
FROM
`your_dataset.your_table`;
This query separates the username from the domain, like:
john.doe@example.com → username: john.doe, domain: example.com
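The same extraction can be sanity-checked locally. Here is a Python equivalent of the two REGEXP_EXTRACT patterns (illustrative only; BigQuery applies them per row):

```python
import re

def split_email(email: str):
    """Mirror the two BigQuery REGEXP_EXTRACT patterns."""
    username = re.search(r"^(.*?)@", email).group(1)  # shortest prefix before '@'
    domain = re.search(r"@(.*)$", email).group(1)     # remainder after the first '@'
    return username, domain

print(split_email("john.doe@example.com"))  # → ('john.doe', 'example.com')
```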
2. Anonymize the Username
Use placeholder characters (******) to replace the username while keeping the domain intact. This ensures only the generalized info remains visible.
Update the query to look like this:
SELECT
'******' AS username,
REGEXP_EXTRACT(email, r'@(.*)$') AS domain,
CONCAT('******', '@', REGEXP_EXTRACT(email, r'@(.*)$')) AS masked_email
FROM
`your_dataset.your_table`;
Result:
From john.doe@example.com, it masks the username to produce:
******@example.com.
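The CONCAT logic above boils down to a single substitution; a hedged Python equivalent for local testing:

```python
import re

def mask_email(email: str) -> str:
    """Replace the username with '******', keeping the domain intact."""
    return re.sub(r"^.*?@", "******@", email)

print(mask_email("john.doe@example.com"))  # → ******@example.com
```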
3. Store the Masked Data
For production use, either write the masked data to a new dataset or table, or update the existing records in place after masking. For instance, create a new table containing only the masked data:
CREATE TABLE `your_dataset.masked_logs` AS
SELECT
CONCAT('******', '@', REGEXP_EXTRACT(email, r'@(.*)$')) AS masked_email
FROM
`your_dataset.your_table`;
This prevents unmasked information from being accessible in day-to-day queries.
Alternatively, apply masking on-the-fly during queries by embedding it directly in your SELECT statements.
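The pre-masked-table pattern can be modeled in plain Python (the row fields here are hypothetical stand-ins for your log schema):

```python
import re

def mask_email(email: str) -> str:
    """Replace the username with '******', keeping the domain intact."""
    return re.sub(r"^.*?@", "******@", email)

# Hypothetical raw log rows, standing in for `your_dataset.your_table`.
raw_logs = [
    {"email": "john.doe@example.com", "event": "login"},
    {"email": "jane@corp.io", "event": "logout"},
]

# Equivalent of CREATE TABLE ... AS SELECT: materialize a masked copy
# so day-to-day analysis never touches the raw email column.
masked_logs = [{**row, "email": mask_email(row["email"])} for row in raw_logs]
print(masked_logs[0]["email"])  # → ******@example.com
```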
4. Restrict Unmasked Data Access
Masking isn’t beneficial if the original unmasked data remains widely accessible. In BigQuery, column-level security is implemented with policy tags: you attach a policy tag from a Data Catalog taxonomy to the raw email column, and only principals granted the Fine-Grained Reader role on that tag can read it. Access to the table itself can be granted with SQL, for example:
GRANT `roles/bigquery.dataViewer`
ON TABLE `your_dataset.your_table`
TO 'group:log-analysts@example.com';
Combining masking with proper permissions ensures sensitive information is not exposed inadvertently.
5. Optimize for Large Log Volumes
BigQuery masking operations are efficient on small datasets, but for large-scale logs a few best practices help maintain query performance:
- Partition and cluster tables: Partition by a timestamp column such as date, and cluster on frequently filtered columns.
- Pre-process into dedicated tables: Querying pre-masked tables reduces overhead during analysis compared with on-the-fly masking.
- Limit regular expressions: Use simpler patterns where possible to reduce computation cost.
See It Live with Hoop.dev
Implementing masking policies shouldn’t slow engineers down. With Hoop.dev, you can simulate BigQuery data workflows, like email masking, in minutes. Build secure systems faster without sacrificing reliability or efficiency. Explore the solution and see how easily it integrates into your data pipelines.