You have a data pipeline that hums nicely until a batch job drags your timelines into the next timezone. The culprit: long-running Spark jobs waiting on static infrastructure. This is where Cloud Run and Dataproc finally start playing on the same field.
Cloud Run is Google Cloud’s container runtime that scales to zero. It is built for event-driven workloads and web endpoints that should only live as long as they are needed. Dataproc, in contrast, runs distributed data processing with Spark, Hadoop, or Flink. It shines when chewing through terabytes of logs or feature extraction runs for ML models. The magic happens when you let Cloud Run orchestrate Dataproc, triggering and managing jobs dynamically without babysitting clusters.
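To make the orchestration idea concrete, here is a minimal sketch of the request body Cloud Run would send to the Dataproc `jobs.submit` REST method. The cluster name, bucket, and script path are hypothetical placeholders, not values from any real project:

```python
import json

def pyspark_job_body(cluster_name: str, main_uri: str, args=None) -> dict:
    """Build the request body for the Dataproc jobs.submit REST method,
    targeting an existing (or just-created ephemeral) cluster.
    Both arguments are illustrative placeholders."""
    return {
        "job": {
            "placement": {"clusterName": cluster_name},
            "pysparkJob": {
                "mainPythonFileUri": main_uri,
                "args": args or [],
            },
        }
    }

# Hypothetical cluster and script location for illustration.
body = pyspark_job_body(
    "ephemeral-etl",
    "gs://my-bucket/jobs/extract.py",
    args=["--date", "2024-01-01"],
)
print(json.dumps(body, indent=2))
```

In a real service, this dict would be POSTed to the Dataproc API with an OAuth token minted for the Cloud Run service account; the payload itself is the same either way.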
The integration is straightforward in principle. Cloud Run handles HTTP triggers or Pub/Sub messages, authenticates using a service account through IAM, and pushes a request to the Dataproc API. That API spins up ephemeral clusters or submits jobs to existing ones. Cloud Run then monitors job status and reports results back via Cloud Logging or BigQuery. You end up with a reactive workflow where compute exists only while your data needs it.
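The Pub/Sub half of that flow is worth seeing in code. Pub/Sub push subscriptions POST a JSON envelope to Cloud Run with the message data base64-encoded; the handler decodes it to recover the trigger parameters. The payload fields below (`job_script`, `input_path`) are hypothetical, a sketch of what your pipeline might pass along:

```python
import base64
import json

def decode_pubsub_push(envelope: dict) -> dict:
    """Extract the JSON payload from a Pub/Sub push envelope, the shape
    Cloud Run receives on its HTTP endpoint."""
    data = envelope["message"]["data"]  # base64-encoded message body
    return json.loads(base64.b64decode(data))

# Simulated envelope, as Pub/Sub would POST it to the Cloud Run URL.
payload = {"job_script": "gs://my-bucket/jobs/extract.py",
           "input_path": "gs://my-bucket/raw/2024-01-01/"}
envelope = {"message": {
    "data": base64.b64encode(json.dumps(payload).encode()).decode(),
    "messageId": "123",
}}

params = decode_pubsub_push(envelope)
print(params["job_script"])
```

The decoded parameters then feed straight into the Dataproc job request, so one Pub/Sub message maps to one on-demand Spark run.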
Access control is the part that usually trips teams up. Each Dataproc job submission must run under an identity with precisely scoped permissions—no more, no less. Grant your Cloud Run service account the Dataproc IAM roles it needs for job submission (for example, roles/dataproc.editor) plus read access to the relevant Cloud Storage buckets. Audit every call with Cloud Audit Logs or an external SIEM. For extra security, use Workload Identity Federation so nothing relies on static keys; secret rotation becomes automatic, and the compliance folks stay happier.
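As a sketch of what that scoped identity might look like, here is an IAM policy fragment granting a Cloud Run service account job-submission rights plus read access to input buckets. The service account name and project are hypothetical; adjust the roles to match what your jobs actually touch:

```json
{
  "bindings": [
    {
      "role": "roles/dataproc.editor",
      "members": ["serviceAccount:run-trigger@my-project.iam.gserviceaccount.com"]
    },
    {
      "role": "roles/storage.objectViewer",
      "members": ["serviceAccount:run-trigger@my-project.iam.gserviceaccount.com"]
    }
  ]
}
```

If your jobs only ever submit to preexisting clusters, a narrower custom role covering just `dataproc.jobs.create` and `dataproc.jobs.get` keeps the blast radius smaller still.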
Featured snippet: Cloud Run and Dataproc work best together when Cloud Run triggers Dataproc jobs on demand, passing data and identity securely through IAM so you process big data without leaving idle clusters running.