Keeping your Cloud Dataflow pipelines safe with customer-managed encryption keys


5. In the drop-down menu with customer-managed keys, select the key you want to use in your pipeline and click on Run Job.

What is protected by the Cloud KMS key you specify?
As a general principle, Cloud Dataflow uses your Cloud KMS key to encrypt the data that you supply. The only exception is the data keys used in key-based operations such as windowing, grouping, and joining. Encrypting these data keys is currently out of scope for CMEK, so if they contain personally identifiable information (PII), you should hash or otherwise transform them before they enter the Cloud Dataflow pipeline. The values of the key-value pairs remain in scope for CMEK encryption.
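One way to transform PII-bearing data keys before they reach the pipeline is a keyed hash. The sketch below is only an illustration, not a prescribed method: the function names and the idea of keeping a separate secret are assumptions, and it uses nothing beyond the Python standard library. Because the same input always maps to the same digest, grouping and joining on the transformed key still behave as before.

```python
import hashlib
import hmac


def pseudonymize_key(key: str, secret: bytes) -> str:
    """Return a hex-encoded HMAC-SHA256 digest of a data key.

    `secret` is a hypothetical value you manage outside the pipeline;
    using a keyed hash makes the digest harder to reverse by guessing.
    """
    return hmac.new(secret, key.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record, secret: bytes):
    """Replace the key of a (key, value) pair with its digest.

    The value is left untouched, since values are in scope for CMEK.
    """
    key, value = record
    return pseudonymize_key(key, secret), value
```

You would apply something like `protect_record` to each key-value pair upstream of the pipeline, or in its first step before any key-based operation.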

Once you specify the Cloud KMS key, Cloud Dataflow will use it to protect the following pipeline state storage locations:

  • Persistent Disks attached to Cloud Dataflow workers and used for Persistent Disk-based shuffle and streaming state storage, and

  • Cloud Dataflow Shuffle state for batch pipelines.

What about the Cloud Storage buckets where Cloud Dataflow stores temporary BigQuery export/import data (specified by the --tempLocation parameter) or binary files containing pipeline code (specified by the --stagingLocation parameter)? You do not need to specify the dataflowKmsKey parameter to protect these locations: if you define a default Cloud KMS key for these buckets in the Cloud Storage UI, Cloud Dataflow will respect that setting.
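To make the relationship between these parameters concrete, here is a small sketch that assembles them into an argument list for a pipeline launch. The parameter names come from the text above; the helper names and the project, key ring, key, and bucket values are hypothetical placeholders. The key is written in the usual Cloud KMS resource-name form.

```python
def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Build a Cloud KMS key resource name from its parts."""
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


def pipeline_args(kms_key: str, temp_location: str, staging_location: str) -> list:
    """Return launch arguments that set the CMEK key and bucket locations.

    The buckets are protected by their own default Cloud KMS key, if one
    is configured on them, rather than by dataflowKmsKey.
    """
    return [
        f"--dataflowKmsKey={kms_key}",
        f"--tempLocation={temp_location}",
        f"--stagingLocation={staging_location}",
    ]


# Hypothetical values for illustration only.
args = pipeline_args(
    kms_key_name("my-project", "us-central1", "my-key-ring", "my-key"),
    "gs://my-bucket/temp",
    "gs://my-bucket/staging",
)
```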

Currently, Cloud Dataflow Streaming Engine state cannot be protected by a Cloud KMS key and is encrypted by a Google-managed key instead. If you want all of your pipeline state to be protected by Cloud KMS keys, do not enable the optional Streaming Engine feature.

Tips and tricks for using Cloud KMS keys together with Cloud Dataflow
These details can be helpful to keep in mind as you’re using Cloud KMS keys with Cloud Dataflow.

Auditing: If you want to audit key usage by Cloud Dataflow, you can review the Cloud Audit Logs for log items related to key operations, such as encrypt and decrypt. These log items are tagged with the Cloud Dataflow Job ID, which allows you to track every time a specific Cloud KMS key is used for a Cloud Dataflow job.

Pricing: Each time the Cloud Dataflow service account uses your Cloud KMS key, that operation is billed at the rate for Cloud KMS key operations. Pricing information for key operations is available on the Cloud KMS pricing page.

Verifying CMEK key usage: To verify which Cloud KMS key was used to protect the state of your pipeline, look at the “Encryption key” section of the Job Details page, or run the gcloud dataflow jobs describe command for the job.
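The gcloud check can also be scripted. The following is a minimal sketch under stated assumptions: it assumes the gcloud CLI is installed and authenticated, that the job description is requested as JSON with the full view, and that the key appears under `environment.serviceKmsKeyName` as in the Dataflow API's job resource. The function names are hypothetical.

```python
import json
import subprocess


def describe_command(job_id: str, region: str) -> list:
    """Build the gcloud invocation that describes a Dataflow job as JSON."""
    return [
        "gcloud", "dataflow", "jobs", "describe", job_id,
        f"--region={region}", "--full", "--format=json",
    ]


def kms_key_from_job(job: dict):
    """Return the CMEK key name from a parsed job description, or None.

    Assumes the key is reported as environment.serviceKmsKeyName.
    """
    return job.get("environment", {}).get("serviceKmsKeyName")


def fetch_job_kms_key(job_id: str, region: str):
    """Run gcloud and extract the key name; requires gcloud on PATH."""
    result = subprocess.run(
        describe_command(job_id, region),
        check=True, capture_output=True, text=True,
    )
    return kms_key_from_job(json.loads(result.stdout))
```

A result of `None` would suggest that no customer-managed key is recorded for the job, though you should confirm this against the Job Details page.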
