BigQuery Data Masking with Rsync: How to Secure and Synchronize Your Data

BigQuery is a trusted solution for storing and analyzing massive datasets. However, handling sensitive information, such as personally identifiable information (PII) or financial records, often presents additional challenges. Data masking combined with efficient synchronization, like Rsync, ensures data security during storage and processing while keeping workflows streamlined.

In this post, we’ll explore how to implement data masking in BigQuery and use Rsync for seamless and secure data transfer. By the end, you'll have actionable methods to safeguard your data while maintaining its utility for analytics.


What Is Data Masking in BigQuery?

Data masking is the process of obfuscating sensitive information in a dataset while preserving its structure and usability. In BigQuery, masking supports compliance with privacy regulations such as GDPR or HIPAA without getting in the way of analysis.

By masking sensitive data, you reduce the risk of exposure while still enabling analysts or systems to work with the data. BigQuery offers built-in SQL functionality to implement masking with ease, allowing users to transform sensitive columns without altering the underlying schema.


Why Combine Data Masking with Rsync?

Rsync is a powerful utility for synchronizing files, ideal for moving datasets between different environments or backup systems. Combining BigQuery’s masking capabilities with Rsync results in:

  1. Enhanced Security: Before syncing the data, masking ensures only anonymized information is transferred to external systems.
  2. Controlled Access: Full data access is restricted; external teams or systems only see masked/obfuscated data where needed.
  3. Efficient Synchronization: Rsync allows incremental updates, avoiding the need to transfer entire datasets repeatedly.

By integrating these two tools, organizations can adhere to strict security demands while keeping their systems efficient.


Implementing Data Masking in BigQuery

Implementing data masking in BigQuery can be done natively using simple SQL queries. Here's a step-by-step approach:

1. Identify Sensitive Columns

Identify which columns in your dataset contain sensitive information, such as email addresses, credit card numbers, or social security numbers.

SELECT email, credit_card_number, user_id, transaction_date
FROM `project.dataset.transactions`
WHERE LENGTH(credit_card_number) > 0;
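
When a dataset has many tables, scanning column names can help build the list of candidates. The sketch below builds a query against the dataset's INFORMATION_SCHEMA.COLUMNS view. The function name sensitive_columns_query and the name patterns (email, card, ssn, phone) are assumptions made for illustration, not an exhaustive detector, and running the query assumes the bq CLI is installed and authenticated.

```shell
# Print a query that lists columns whose names suggest sensitive content.
# The patterns are assumptions for illustration; extend them for your schema.
sensitive_columns_query() {
  dataset="$1"  # fully qualified dataset, e.g. project.dataset
  printf '%s\n' \
    "SELECT table_name, column_name, data_type" \
    "FROM \`${dataset}.INFORMATION_SCHEMA.COLUMNS\`" \
    "WHERE REGEXP_CONTAINS(LOWER(column_name), r'email|card|ssn|phone')" \
    "ORDER BY table_name, column_name"
}

# Usage (needs the bq CLI and read access to the dataset):
#   bq query --use_legacy_sql=false "$(sensitive_columns_query project.dataset)"
```

Name matching only finds columns that are labeled clearly, so sensitive values stored under generic names such as notes or payload still need a manual review.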

2. Use Built-In Functions to Mask Data

BigQuery's string functions, such as SUBSTR and CONCAT, can replace or truncate parts of a sensitive value, and hash functions such as FARM_FINGERPRINT can stand in for identifiers. For example:

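- Masking email addresses, keeping only the first two characters of the local part (taking the prefix from the part before the @ avoids exposing the domain of short addresses):

SELECT
 CONCAT(SUBSTR(SPLIT(email, '@')[OFFSET(0)], 1, 2), '*****@domain.com') AS masked_email
FROM `project.dataset.transactions`;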

- Masking credit card numbers while leaving the last 4 digits visible for verification:

SELECT 
 CONCAT('****-****-****-', SUBSTR(credit_card_number, -4)) AS masked_credit_card
FROM `project.dataset.transactions`;

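- Hashing unique identifiers (FARM_FINGERPRINT takes STRING or BYTES input and is not a cryptographic hash; a salted SHA256 is the safer choice if identifiers must resist guessing):

SELECT
 FARM_FINGERPRINT(CAST(user_id AS STRING)) AS hashed_user_id
FROM `project.dataset.transactions`;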

3. Save the Masked Data

Save the masked dataset into a new BigQuery table. For example:

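CREATE TABLE `project.dataset.masked_transactions` AS
SELECT
 -- take the prefix from the local part so short addresses do not expose their domain
 CONCAT(SUBSTR(SPLIT(email, '@')[OFFSET(0)], 1, 2), '*****@domain.com') AS masked_email,
 CONCAT('****-****-****-', SUBSTR(credit_card_number, -4)) AS masked_credit_card,
 -- FARM_FINGERPRINT needs STRING or BYTES input, hence the cast
 FARM_FINGERPRINT(CAST(user_id AS STRING)) AS hashed_user_id,
 transaction_date
FROM `project.dataset.transactions`;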

This leaves the original table untouched and gives you a masked copy for further operations.


Synchronizing Data Securely with Rsync

After masking data in BigQuery, you may need to transfer it to an on-premise server, cloud storage bucket, or another system. This is where Rsync shines.

Steps to Use Rsync for Data Transfer:

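  1. Export the data:
    Export the masked BigQuery table to Google Cloud Storage using the bq extract command. A single file can hold up to 1 GB of exported data; for larger tables, use a wildcard URI such as gs://your-bucket-name/extracted_data-*.csv.
bq extract --destination_format CSV \
'project.dataset.masked_transactions' \
gs://your-bucket-name/extracted_data.csv
  2. Sync the exported file to a local directory:
    Standard Rsync cannot read gs:// URLs, so pull the bucket contents with gsutil rsync, which likewise copies only new or changed files.
gsutil -m rsync \
gs://your-bucket-name/ \
/local/system/path/
    To pass the files on to another server, run standard Rsync over SSH (user@remote-host and /remote/path/ are placeholders):
rsync -avz --progress \
/local/system/path/ \
user@remote-host:/remote/path/
  3. Automate the process:
    To keep the copy current, run these steps from a cron job or CI pipeline so that new or updated masked datasets are fetched on a regular schedule.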
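
Before a synced file is passed on to other teams or systems, it can be worth scanning it for values that slipped through unmasked. The following is a rough sketch using grep. The function name find_unmasked and both patterns are assumptions for illustration; pattern matching of this kind will miss some formats and occasionally flag harmless values.

```shell
# Scan a CSV export for values that look unmasked.
# Prints matching lines with line numbers.
# Returns 0 if the file looks clean, 1 if suspicious lines were found,
# and 2 if grep itself failed (for example, the file is unreadable).
find_unmasked() {
  file="$1"
  # A whole CSV field of 13 to 16 digits (optionally quoted): a likely raw card number.
  card_re='(^|,)"?[0-9]{13,16}"?(,|$)'
  # A plain address: the character before "@" is not the "*" used by the mask.
  email_re='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  status=0
  grep -nE -e "$card_re" -e "$email_re" "$file" || status=$?
  case "$status" in
    0) return 1 ;;  # grep matched: suspicious lines were printed above
    1) return 0 ;;  # grep found nothing
    *) return 2 ;;  # grep error
  esac
}

# Usage: stop before any onward transfer if the check does not pass.
#   find_unmasked /local/system/path/extracted_data.csv || exit 1
```

Treating a grep error as a failure rather than a pass means a missing export blocks the transfer instead of slipping through silently.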

Example automation script:

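#!/bin/bash
set -euo pipefail  # stop at the first failed command

# Export the masked table to Cloud Storage
bq extract --destination_format CSV \
'project.dataset.masked_transactions' \
gs://your-bucket-name/extracted_data.csv

# Standard rsync cannot read gs:// URLs; gsutil rsync copies only new or changed objects
gsutil -m rsync \
gs://your-bucket-name/ \
/local/system/path/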

Save the script and schedule it with cron or a CI server for regular execution.
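
Because the masked table is a snapshot, a scheduled job probably also needs to rebuild it before each export. The sketch below extends the script along those lines. The helper names run and sync_masked_data, the DRY_RUN variable, and the idea of passing in a CREATE OR REPLACE TABLE version of the masking query are all assumptions of this sketch rather than part of any tool. With DRY_RUN=1 the commands are printed instead of executed, so the job can be checked without touching BigQuery.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Rebuild the masked table, export it, then sync the bucket to a local directory.
# Each step runs only if the previous one succeeded.
sync_masked_data() {
  bucket="$1"      # e.g. gs://your-bucket-name
  local_dir="$2"   # e.g. /local/system/path (must already exist)
  mask_sql="$3"    # a CREATE OR REPLACE TABLE ... AS SELECT ... masking query
  run bq query --use_legacy_sql=false "$mask_sql" &&
    run bq extract --destination_format CSV \
      'project.dataset.masked_transactions' "$bucket/extracted_data.csv" &&
    run gsutil -m rsync "$bucket/" "$local_dir/"
}

# Usage (mask.sql is a hypothetical file holding the masking query):
#   DRY_RUN=1
#   sync_masked_data gs://your-bucket-name /local/system/path "$(cat mask.sql)"
```

A crontab entry such as `0 2 * * * /path/to/sync_masked.sh` (the path is hypothetical) would then run the job nightly at 02:00.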


Key Takeaways

  1. BigQuery’s SQL functions can mask sensitive columns, which supports compliance while keeping the data usable for analytics.
  2. Rsync-style synchronization moves masked data across environments without re-transferring unchanged files.
  3. Masking before export means that only masked data leaves BigQuery, so downstream teams and systems never receive the raw values.

If you're looking to streamline your data workflows while adhering to privacy standards, tools like Hoop.dev make it easy to automate BigQuery operations and securely share data within minutes. See how you can put it into action today.
