Introducing advanced security options for Cloud Dataproc, now generally available

  1. Each GCP user is associated with a cloud identity. This authentication mechanism gives users the ability to SSH into a cluster, run jobs via the API, and create cloud resources (e.g., a Cloud Dataproc cluster).
  2. If you want to use a Kerberized “Hadoop” application, you have to obtain a Kerberos principal. A cross-realm trust with Microsoft Active Directory maps its users and groups to Cloud Dataproc Kerberos principals.
    Note: This setup requires Active Directory to be the source of truth for user identities. Cloud Identity is only a synchronized copy.
  3. When the “Hadoop” application needs to obtain data from Cloud Storage, a Cloud Storage Connector is invoked. The Cloud Storage Connector allows “Hadoop” to access Cloud Storage data at the block level as if it were a native part of Hadoop. This connector relies on a service account to authenticate against Cloud Storage.
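
To make the relationship between a Kerberos principal and a cloud identity concrete, here is a minimal Python sketch. It only assumes the standard Kerberos principal format, primary[/instance]@REALM. The realm-to-domain table, the example realm and domain, and both function names are hypothetical and exist only for illustration; in a real deployment Active Directory stays the source of truth and the mapping comes from its synchronization with Cloud Identity.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table for illustration only. In practice this mapping follows
# from the Active Directory to Cloud Identity synchronization.
REALM_TO_DOMAIN = {"CORP.EXAMPLE.COM": "example.com"}


@dataclass(frozen=True)
class Principal:
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(text: str) -> Principal:
    """Split 'primary[/instance]@REALM' into its parts."""
    name, sep, realm = text.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a fully qualified principal: {text!r}")
    primary, _, instance = name.partition("/")
    return Principal(primary, instance or None, realm)


def to_cloud_identity(principal: Principal) -> str:
    """Map a user principal to its synchronized cloud identity (an email)."""
    if principal.instance is not None:
        # Principals such as hdfs/host@REALM belong to services, not users.
        raise ValueError("service principals have no cloud identity in this sketch")
    try:
        domain = REALM_TO_DOMAIN[principal.realm]
    except KeyError:
        raise ValueError(f"untrusted realm: {principal.realm}") from None
    return f"{principal.primary}@{domain}"
```

For example, to_cloud_identity(parse_principal("alice@CORP.EXAMPLE.COM")) yields "alice@example.com", while a service principal such as hdfs/node1.example.com@CORP.EXAMPLE.COM is rejected because it does not represent a user.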

Standing on the shoulders of GCP security
Kerberos and Hadoop secure mode provide parity with legacy Hadoop security platforms, making it easy to port your existing procedures and policies. However, you may find that even if you keep your existing security practices unchanged, the overall security posture of your Hadoop and Spark environments improves greatly with the migration to GCP.

This is because Cloud Dataproc and GCP take advantage of the same secure-by-design infrastructure, built-in protection, and global network that Google uses to protect your information, identities, applications, and devices. GCP and Cloud Dataproc also offer further security features that help protect your data. The GCP-specific security features most commonly used with Cloud Dataproc include:

  • Default at-rest encryption means that GCP encrypts customer data stored at rest by default, with no additional action required from you. We offer a continuum of encryption key management options, including customer-managed encryption keys (CMEK), which let you create, use, and revoke the key encryption key (KEK).

  • Stackdriver Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Stackdriver collects and ingests metrics, events, and metadata from Cloud Dataproc clusters to bring you insights via dashboards and charts.

  • VPC Service Controls allow you to define a security perimeter around Cloud Dataproc and the data stored in Cloud Storage buckets. Datasets can be constrained within a VPC to help mitigate data exfiltration risks. With VPC Service Controls, you can keep sensitive data private and still take advantage of the fully managed storage and data processing capabilities of GCP.

These features and many others are certified by third-party auditors. Cloud Dataproc certifications cover the most widely recognized, internationally accepted independent security standards, including ISO standards for security controls, cloud security, and privacy, as well as SOC 1, 2, and 3. These certifications help us meet the demands of industry standards such as HIPAA and PCI. We continue to expand our list of certifications globally to assist our customers with their compliance obligations.

End-to-end authorization with GCP Token Broker
As a typical cloud best practice, we recommend that the GCP service accounts associated with the virtual machines (or cloud infrastructure) access datasets on behalf of a user. Many Cloud Dataproc customers choose to provision small autoscaling clusters for each Cloud Dataproc user. This way, the audit log clearly shows which user was on which cluster when it accessed a Cloud Storage dataset.

However, we also hear that many enterprise customers would prefer to use multi-tenant clusters and have strict compliance requirements that dictate that access to GCP resources (Cloud Storage, BigQuery, Cloud Bigtable, etc.) must be attributable to the individual user who initiated the request. In addition, to meet compliance requirements, this should be done in a way that ensures no long-lived credentials are stored on client machines or worker nodes.

To meet these customer goals, Google Cloud created an open source GCP Token Broker. The GCP Token Broker enables end-to-end Kerberos security and Cloud IAM integration for Hadoop workloads on GCP. You can use this open source software to bridge the gap between Kerberos and Cloud IAM to allow users to log in with Kerberos and access GCP resources.
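
The two requirements above, per-user attribution and no long-lived credentials, can be illustrated with a toy model in Python. This is not the GCP Token Broker's actual API or implementation: the class ToyTokenBroker, its methods, the one-hour lifetime, and the audit record layout are all assumptions made for this sketch. It also assumes the caller has already been authenticated with Kerberos, a step the sketch does not perform.

```python
import secrets
import time
from dataclasses import dataclass

# Assumed lifetime for illustration; the point is only that tokens expire.
TOKEN_LIFETIME_SECONDS = 3600


@dataclass(frozen=True)
class AccessToken:
    value: str
    cloud_identity: str
    expires_at: float

    def is_valid(self, now: float) -> bool:
        """A token is usable only strictly before its expiry time."""
        return now < self.expires_at


class ToyTokenBroker:
    """Hypothetical stand-in that trades a Kerberos principal for a short-lived token."""

    def __init__(self, identity_map, clock=time.time):
        self._identity_map = dict(identity_map)  # principal -> cloud identity
        self._clock = clock
        self.audit_log = []  # (timestamp, principal, cloud identity) tuples

    def issue_token(self, kerberos_principal: str) -> AccessToken:
        # Assumes Kerberos authentication of the caller has already succeeded.
        try:
            identity = self._identity_map[kerberos_principal]
        except KeyError:
            raise PermissionError(
                f"no cloud identity mapped for {kerberos_principal}"
            ) from None
        now = self._clock()
        token = AccessToken(
            value=secrets.token_urlsafe(32),
            cloud_identity=identity,
            expires_at=now + TOKEN_LIFETIME_SECONDS,
        )
        # Every token is recorded against the individual user, not the cluster.
        self.audit_log.append((now, kerberos_principal, identity))
        return token
```

Because each token carries the individual's cloud identity and expires on its own, a multi-tenant cluster in this model never needs to store a long-lived credential on a worker node, and every issuance is attributable to the requesting user.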

The following diagram illustrates the overall architecture for direct authentication.
