Cloud Dataproc Spark Jobs on GKE: How to get started


Cloud Dataproc is Google Cloud’s fully managed Apache Hadoop and Spark service. The mission of Cloud Dataproc has always been to make it simple and intuitive for developers and data scientists to apply their existing tools, algorithms, and programming languages to cloud-scale datasets. Its flexibility means you can keep using the skills and techniques you already know to explore data of any size. We hear from enterprises and SaaS companies around the world that they’re using Cloud Dataproc for data processing and analytics.

Cloud Dataproc now offers alpha access to Spark jobs on Google Kubernetes Engine (GKE). (Find out more about the program here.) This means you can take advantage of the latest approaches in machine learning and big data analysis (Apache Spark and Google Cloud) together with the state-of-the-art cloud management capabilities that developers and data scientists have come to rely upon with Kubernetes and GKE. Using these tools together can bring you flexibility, auto-healing jobs, and a unified infrastructure, so you can focus on workloads, not maintaining infrastructure. Email us for more information and to join the alpha program.

Let’s take a look at Cloud Dataproc in its current form and what the new GKE alpha offers.

Cloud Dataproc now: Cloud-native Apache Spark

Cloud Dataproc has democratized big data and analytics processing for thousands of customers, offering the ability to spin up a fully loaded and configured Apache Spark cluster in minutes. With Cloud Dataproc, features such as Component Gateway enable secure access to notebooks with zero setup or installation, letting you immediately start exploring data of any size. These notebooks, combined with Cloud Dataproc Autoscaling, make it possible to run ML training or process data of various sizes without ever having to leave your notebook or worry about how the job will get done. The underlying Cloud Dataproc cluster simply adjusts compute resources as needed, within predefined limits.
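
As an illustration, here is a minimal sketch of creating such a cluster with the gcloud CLI. The cluster name, region, and autoscaling policy name below are placeholders, and the exact flags may vary with your gcloud version:

    # Placeholder names: analytics-cluster, us-central1, example-autoscaling-policy.
    # Component Gateway plus Jupyter gives notebook access with no manual setup;
    # the autoscaling policy lets the cluster resize within predefined limits.
    gcloud dataproc clusters create analytics-cluster \
        --region=us-central1 \
        --image-version=1.4 \
        --enable-component-gateway \
        --optional-components=ANACONDA,JUPYTER \
        --autoscaling-policy=example-autoscaling-policy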

Once your ML model or data engineering job is ready for production, or for automated or recurring use, you can use the Cloud Dataproc Jobs API to submit a job to an existing Cloud Dataproc cluster with a jobs.submit call over HTTP, with the gcloud command-line tool, or in the Google Cloud Platform Console itself. Submitting your Spark code through the Jobs API ensures the jobs are logged and monitored, in addition to being managed across the cluster. It also makes it easy to separate the permission to submit jobs to a cluster from the permission to access the cluster itself, without needing a gateway node or an application like Livy.
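
For example, a production-ready PySpark job can be handed to the Jobs API with a single gcloud call; the cluster name, region, and Cloud Storage path below are placeholders:

    # Placeholders: analytics-cluster, us-central1, and the gs:// path.
    # The Jobs API logs and monitors the job; no gateway node or Livy is required.
    gcloud dataproc jobs submit pyspark gs://example-bucket/jobs/etl_job.py \
        --cluster=analytics-cluster \
        --region=us-central1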

Cloud Dataproc next: Extending the Jobs API with GKE

The Cloud Dataproc Jobs API has been a perfect match for companies that prefer to wrap their job automation and extract, transform, and load (ETL) processing in custom tooling such as Spotify’s Spydra or Cloud Dataproc’s Workflow Templates.

However, developers and data scientists who have embraced containerization and the cloud management capabilities of Kubernetes have started to demand more from their big data processing services. To automate a Spark job today, you either need to keep running the cluster where the job was created (which is expensive and forgoes the pay-as-you-need capability of the cloud), or you need to carefully track how to re-create that same cluster environment, which can become a complicated mixture of configurations, initialization scripts, conda environments, and library/package management scripts. This process is especially cumbersome in multi-tenant environments, where various software packages, configurations, and OS updates may conflict.

With Cloud Dataproc on Kubernetes, you can eliminate the need for multiple types of clusters that have various sets of software, and the complexity that’s involved. By extending the Cloud Dataproc Jobs API to GKE, you can package all the various dependencies of your job into a single Docker container. This Docker container allows you to integrate Spark jobs directly into the rest of your software development pipelines.

Additionally, by extending the Cloud Dataproc Jobs API for GKE, administrators have a unified management system where they can tap into their Kubernetes knowledge. You can avoid having a silo of Spark applications that need to be managed in standalone virtual machines or in Apache Hadoop YARN.

Kubernetes: Yet another resource negotiator?

Apache Hadoop YARN (introduced in 2012) is a resource negotiator commonly found in Spark platforms, both on-premises and in the cloud. YARN provides the core capability of scheduling compute resources in Cloud Dataproc clusters that are based on Compute Engine. By extending the Cloud Dataproc Jobs API to GKE, you can choose to replace YARN with Kubernetes as your resource manager. There are some key advantages to using Kubernetes over YARN:

1. Flexibility.

You can achieve greater flexibility in production jobs by embedding a consistent configuration of software libraries alongside the Spark code. Containerizing Spark jobs isolates dependencies and resources at the job level instead of the cluster level. This flexibility gives you more predictable workload cycles and makes it easier to target your troubleshooting when something does go wrong.

2. Auto-healing.

Kubernetes provides declarative configuration for your Spark jobs: you declare, when the job is submitted, the resources required to process it. If Kubernetes resources (for example, executor pods) become unhealthy, Kubernetes automatically restores them, and your job continues to run with the resources you declared at the outset. A concrete sketch appears at the end of this section.

3. Unified infrastructure.

At Google, we have used a system called Borg to unify all of our processing, whether it’s a data analytics workload, web site, or anything else. Borg’s architecture served as the basis for Kubernetes, which you can use to remove the need for a big data (YARN) silo.

By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes. At Google, having a single cluster manager system has led to more efficient use of resources and provided a unified logging and management framework. This same capability is now available to your organization.

Kubernetes is not just “yet another resource negotiator” for big data processing. It’s an entirely new way of approaching big data that can greatly improve the reliability and management of your data and analytics workloads.
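
As a concrete sketch of the declarative model described in point 2 above, the resources a job needs can be declared as Spark properties at submission time. The names and sizes below are placeholders, and the exact submission path may differ during the GKE alpha:

    # If an executor pod fails, Kubernetes restores it so the job keeps running
    # with the declared resources. All names and values below are illustrative.
    gcloud dataproc jobs submit spark \
        --cluster=example-gke-backed-cluster \
        --region=us-central1 \
        --class=com.example.WordCount \
        --jars=gs://example-bucket/jars/wordcount.jar \
        --properties=spark.executor.instances=4,spark.executor.cores=2,spark.executor.memory=4g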

Spark jobs on GKE in action

Let’s walk through what is involved with submitting an Apache Spark job to Cloud Dataproc on GKE during the alpha phase.

Step 0: Register your GKE cluster with Cloud Dataproc

Before you can execute Cloud Dataproc jobs on GKE, you must first register your GKE cluster with Cloud Dataproc. During the alpha, registration is completed with a Helm installation. Once the GKE cluster is registered, you can see it unified with the rest of your Cloud Dataproc clusters by running the following command:
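
A minimal sketch of that listing step, assuming the standard gcloud CLI (the region is a placeholder):

    # Registered GKE clusters appear alongside Compute Engine-based Cloud Dataproc clusters.
    gcloud dataproc clusters list --region=us-central1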
