Data privacy has become a core focus for engineering teams, especially when handling sensitive information like email addresses in application logs. Managing these data points responsibly is critical to keeping systems compliant with privacy regulations and maintaining user trust. Google Cloud’s BigQuery offers powerful tools for data masking, enabling you to restrict access to sensitive data.
This article provides a concise guide on masking email addresses in logs using BigQuery, ensuring sensitive information is protected while remaining accessible for analysis.
Why Mask Email Addresses in Logs?
Email addresses stored in logs might contain sensitive personal information. Having them visible in their entirety can lead to unintended leaks, violations of data protection policies, or unnecessary exposure to unauthorized parties.
Data masking makes email data usable for analytics while safeguarding its sensitive nature. For example, if an email address appears as ******@example.com in your logs instead of john.doe@example.com, you can still analyze patterns like domain usage over time without exposing the full user identity.
Implementing email masking allows teams to align technical systems with regulatory requirements (e.g., GDPR, CCPA) while maintaining critical analytical workflows.
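To illustrate the point, domain-level analytics still work on masked addresses. A minimal Python sketch (the log values here are hypothetical):

```python
from collections import Counter

# Hypothetical masked log entries: usernames hidden, domains intact.
masked_emails = [
    "******@example.com",
    "******@example.com",
    "******@corp.io",
]

# Domain usage can still be aggregated without exposing any identity.
domains = Counter(e.split("@", 1)[1] for e in masked_emails)
print(domains.most_common())  # → [('example.com', 2), ('corp.io', 1)]
```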
Steps to Mask Email Addresses in BigQuery
Here’s how you can mask email addresses in logs using BigQuery’s SQL features.
1. Extract the Domain from the Email
Before masking, break the email address into its username and domain parts. BigQuery’s SPLIT() or REGEXP_EXTRACT() functions allow you to isolate these components.
Example Query:
SELECT
REGEXP_EXTRACT(email, r'^(.*?)@') AS username,
REGEXP_EXTRACT(email, r'@(.*)$') AS domain
FROM
`your_dataset.your_table`;
This query separates the username from the domain, like:
john.doe@example.com → username: john.doe, domain: example.com
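The same extraction can be sanity-checked locally. Here is a Python equivalent of the two REGEXP_EXTRACT patterns (illustrative only; BigQuery applies them per row):

```python
import re

def split_email(email: str):
    """Mirror the two BigQuery REGEXP_EXTRACT patterns."""
    username = re.search(r"^(.*?)@", email).group(1)  # shortest prefix before '@'
    domain = re.search(r"@(.*)$", email).group(1)     # remainder after the first '@'
    return username, domain

print(split_email("john.doe@example.com"))  # → ('john.doe', 'example.com')
```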
2. Anonymize the Username
Use placeholder characters (******) to replace the username while keeping the domain intact. This ensures only the generalized info remains visible.
Update the query to look like this:
SELECT
'******' AS username,
REGEXP_EXTRACT(email, r'@(.*)$') AS domain,
CONCAT('******', '@', REGEXP_EXTRACT(email, r'@(.*)$')) AS masked_email
FROM
`your_dataset.your_table`;
Result:
From john.doe@example.com, it masks the username to produce:
******@example.com.
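The CONCAT logic above boils down to a single substitution; a hedged Python equivalent for local testing:

```python
import re

def mask_email(email: str) -> str:
    """Replace the username with '******', keeping the domain intact."""
    return re.sub(r"^.*?@", "******@", email)

print(mask_email("john.doe@example.com"))  # → ******@example.com
```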
3. Store the Masked Data
For production use, either write the masked data to a new dataset or table, or update the existing records in place after masking. For instance, create a new table containing only the masked data:
CREATE TABLE `your_dataset.masked_logs` AS
SELECT
CONCAT('******', '@', REGEXP_EXTRACT(email, r'@(.*)$')) AS masked_email
FROM
`your_dataset.your_table`;
This prevents unmasked information from being accessible in day-to-day queries.
Alternatively, apply masking on-the-fly during queries by embedding it directly in your SELECT statements.
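The pre-masked-table pattern can be modeled in plain Python (the row fields here are hypothetical stand-ins for your log schema):

```python
import re

def mask_email(email: str) -> str:
    """Replace the username with '******', keeping the domain intact."""
    return re.sub(r"^.*?@", "******@", email)

# Hypothetical raw log rows, standing in for `your_dataset.your_table`.
raw_logs = [
    {"email": "john.doe@example.com", "event": "login"},
    {"email": "jane@corp.io", "event": "logout"},
]

# Equivalent of CREATE TABLE ... AS SELECT: materialize a masked copy
# so day-to-day analysis never touches the raw email column.
masked_logs = [{**row, "email": mask_email(row["email"])} for row in raw_logs]
print(masked_logs[0]["email"])  # → ******@example.com
```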
4. Restrict Unmasked Data Access
Masking isn’t beneficial if the original unmasked data remains widely accessible. In BigQuery, column-level security is implemented with policy tags: you attach a policy tag from a Data Catalog taxonomy to the raw email column, and only principals granted the Fine-Grained Reader role on that tag can read it. Access to the table itself can be granted with SQL, for example:
GRANT `roles/bigquery.dataViewer`
ON TABLE `your_dataset.your_table`
TO 'group:log-analysts@example.com';
Combining masking with proper permissions ensures sensitive information is not exposed inadvertently.
5. Optimize for Large Log Volumes
BigQuery masking operations are efficient on small datasets, but for large-scale logs a few best practices help maintain query performance:
- Partition and cluster tables: Partition by a timestamp column such as date, and cluster on frequently filtered columns.
- Pre-process into dedicated tables: Querying pre-masked tables reduces overhead during analysis compared with on-the-fly masking.
- Limit regular expressions: Use simpler patterns where possible to reduce computation cost.
See It Live with Hoop.dev
Implementing masking policies shouldn’t slow engineers down. With Hoop.dev, you can simulate BigQuery data workflows, like email masking, in minutes. Build secure systems faster without sacrificing reliability or efficiency. Explore the solution and see how easily it integrates into your data pipelines.