Spark on Google Cloud: Serverless Spark jobs made seamless for all data users

October 18, 2021

Apache Spark has become a popular platform because it can serve data engineering, data exploration, and machine learning use cases alike. However, Spark still carries the on-premises burden of managing clusters and tuning infrastructure for each job. In addition, end-to-end use cases require Spark to be used alongside technologies like TensorFlow and languages like SQL and Python. Today, these operate in silos: Spark on unstructured data lakes, SQL on data warehouses, and TensorFlow on entirely separate machine learning platforms. This increases costs, reduces agility, and makes governance extremely hard, preventing enterprises from making insights available to the right users at the right time.

Announcing Spark on Google Cloud, now serverless and integrated

We are excited to announce Spark on Google Cloud, bringing the industry’s first autoscaling serverless Spark, seamlessly integrated with the best of Google Cloud and open source tools, so you can effortlessly power ETL, data science, and data analytics use cases at scale. Google Cloud has been running large-scale, business-critical Spark workloads for enterprise customers for more than six years, using open source Spark in Dataproc. Today, we are furthering that commitment by enabling customers to:

  1. Eliminate time spent managing Spark clusters: With serverless Spark, users submit their Spark jobs and let the service auto-provision and autoscale resources until the job completes (see the sketch after this list).

  2. Enable data users of all levels: Connect, analyze, and execute Spark jobs from the interface of the user’s choice, including BigQuery, Vertex AI, or Dataplex, in two clicks, without any custom integrations.

  3. Retain flexibility of consumption: No one size fits all. Use Spark serverless, deploy it on Google Kubernetes Engine (GKE), or run it on Compute Engine clusters, depending on your requirements.
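To make the first point concrete, here is a minimal sketch of submitting a PySpark batch to Dataproc Serverless with the google-cloud-dataproc Python client library. The project ID, bucket paths, and batch ID are placeholders for illustration; the gcloud CLI or the Cloud Console work equally well.

```python
# Sketch: submit a PySpark batch to Dataproc Serverless.
# Project, region, GCS paths, and batch ID below are hypothetical.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# Dataproc uses regional service endpoints.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/etl_job.py",  # hypothetical script
        args=["--date=2021-10-18"],
    )
)

# Note there is no cluster to create or size: the service provisions
# and autoscales Spark resources for the lifetime of the batch.
operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="example-etl-batch",
)
result = operation.result()  # blocks until the batch finishes
print(result.state)
```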

With Spark on Google Cloud, we are providing a way for customers to use Spark in a cloud-native, serverless manner, and to use it seamlessly with the tools data engineers, data analysts, and data scientists already rely on. Together, these capabilities help customers realize the data platform redesign they have embarked on.

“Deutsche Bank is using Spark for a variety of different use cases. Migrating to GCP and adopting Serverless Spark for Dataproc allows us to optimize our resource utilization and reduce manual effort so our engineering teams can focus on delivering data products for our business instead of managing infrastructure. At the same time we can retain the existing code base and knowhow of our engineers, thus boosting adoption and making the migration a seamless experience.” —Balaji Maragalla, Director Big Data Platform, Deutsche Bank

“We see serverless Spark playing a central role in our data strategy. Serverless Spark will provide an efficient, seamless solution for teams that aren’t familiar with big data technology or don’t need to bother with idiosyncrasies of Spark to solve their own processing needs. We’re excited about the serverless aspect of the offering, as well as the seamless integration with BigQuery, Vertex AI, Dataplex and other data services.” —Saral Jain, Director of Engineering, Infrastructure and Data, Snap Inc.

Dataproc Serverless for Spark

Per IDC, developers spend 40% of their time writing code and 60% tuning infrastructure and managing clusters. Furthermore, not all Spark developers are infrastructure experts, which drives up costs and drags down productivity. With serverless Spark, developers spend all of their time on code and logic. They do not need to manage clusters or tune infrastructure: they submit Spark jobs from their interface of choice, and processing autoscales to match the needs of the job. And while Spark users today pay for the time the infrastructure is running, with serverless Spark they pay only for the duration of the job.
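What does "all their time on the code and logic" look like in practice? Below is a minimal sketch of the kind of PySpark script you would submit as a serverless batch. The bucket paths and column names are hypothetical; the point is that the script contains only business logic, with no cluster setup or sizing.

```python
# A minimal PySpark job of the sort submitted to Dataproc Serverless.
# Paths and schema are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

# Read raw order events from a data lake bucket (hypothetical path).
orders = spark.read.json("gs://my-bucket/raw/orders/2021-10-18/*.json")

# Aggregate: revenue per customer for the day.
daily_revenue = (
    orders.groupBy("customer_id")
    .agg(F.sum("amount").alias("revenue"))
)

# Write curated results back to the lake; the service sizes and scales
# the executors for this work automatically.
daily_revenue.write.mode("overwrite").parquet(
    "gs://my-bucket/curated/daily_revenue/2021-10-18/"
)

spark.stop()
```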

Spark through BigQuery

BigQuery, the leading data warehouse, now provides a unified interface where data analysts can write either SQL or PySpark. The code executes on serverless Spark seamlessly, without any infrastructure provisioning. BigQuery pioneered serverless data warehousing, and it now supports serverless Spark for Spark-based analytics.
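As one way to picture the interplay between the two engines, here is a sketch of a PySpark job reading from and writing to BigQuery via the open source spark-bigquery connector. The project, dataset, and bucket names are hypothetical, and the assumption is that the connector is available on the runtime, as it is on Dataproc.

```python
# Sketch: PySpark reading from and writing to BigQuery through the
# spark-bigquery connector. Table and bucket names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-spark-example").getOrCreate()

# Load a BigQuery table as a Spark DataFrame.
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.events")
    .load()
)

# Transformations that are awkward in SQL can run in Spark here...
top_pages = events.groupBy("page").count().orderBy("count", ascending=False)

# ...and the result can be written straight back to BigQuery.
# The connector stages writes through a GCS bucket.
(
    top_pages.write.format("bigquery")
    .option("table", "my-project.analytics.top_pages")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```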
