This follows up on several releases designed to make it easier to run open source data processing on Kubernetes. This means you can reduce stack dependencies and have your jobs run across multi-cloud and hybrid environments. We started by releasing an open source Kubernetes operator for Apache Spark and followed up by integrating the Spark operator with the Dataproc Jobs API. This gives you a single place to securely manage containerized Spark workloads across various types of deployments, all with the support and SLAs that Dataproc provides.
Make open source easier with Google Cloud
Open source has always been a core pillar of Google Cloud’s data and analytics strategy. From the MapReduce paper in 2004 to more recent open source releases of TensorFlow for ML, Apache Beam for data processing, and even Kubernetes itself, we’ve built communities around open source technology and across company boundaries. To accompany these technologies, Google Cloud offers managed versions of the most popular open source software applications. Dataproc is one of them. Anthos, an open hybrid and multi-cloud application platform, lets you modernize your existing applications, build new ones, and run them anywhere in a secure manner. Anthos is also built on open source technologies pioneered by Google, including Kubernetes, Istio, and Knative.
Why should you run Apache Flink on Kubernetes?
Recently, our Dataproc team has been exploring how customers use open source data processing technologies like Apache Flink. We’ve heard several pain points: library and version dependencies that break systems, and the trade-off between isolating environments to avoid those conflicts and leaving resources idle as a result. These are challenges that Kubernetes and Anthos are well-positioned to address.
Kubernetes can improve the reliability of your infrastructure. This matters for Apache Flink because many Flink jobs are streaming applications that need to stay up 24/7 and be resilient to failure. By combining the features of Kubernetes with Apache Flink, operators have much more control over their architecture and can keep streaming jobs running while still performing updates, patches, and upgrades of their system. With containerization, you can even have different Flink jobs with conflicting versions and different dependencies all sharing the same Kubernetes cluster.
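To make the shared-cluster point concrete, here is a minimal sketch that generates two cluster definitions, each pinned to a different Flink image. The resource kind (`FlinkCluster`), the API group (`flinkoperator.k8s.io/v1beta1`), and the field names are assumptions based on the operator's sample manifests as we recall them, so check them against the CRD shipped with your operator version. The cluster names are hypothetical.

```python
import json


def flink_cluster_manifest(name, image, jar_file, task_managers=2):
    """Build a FlinkCluster custom resource as a plain dict.

    ASSUMPTION: the apiVersion, kind, and spec field names mirror the
    operator's sample manifests; verify them against your installed CRD.
    """
    return {
        "apiVersion": "flinkoperator.k8s.io/v1beta1",
        "kind": "FlinkCluster",
        "metadata": {"name": name},
        "spec": {
            # Each cluster pins its own container image, so Flink versions
            # and bundled dependencies never have to agree across jobs.
            "image": {"name": image},
            "jobManager": {"ports": {"ui": 8081}},
            "taskManager": {"replicas": task_managers},
            "job": {"jarFile": jar_file},
        },
    }


# Hypothetical jobs: same Kubernetes cluster, different Flink versions.
EXAMPLE_JAR = "/opt/flink/examples/streaming/WordCount.jar"
clusters = [
    flink_cluster_manifest("orders-stream", "flink:1.8.2", EXAMPLE_JAR),
    flink_cluster_manifest("clicks-stream", "flink:1.9.2", EXAMPLE_JAR),
]

for manifest in clusters:
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(manifest, indent=2))
```

Each printed document could be saved to a file and passed to `kubectl apply -f`. Because the image is set per cluster, upgrading one job's Flink version would not touch the other.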
The Apache Flink Runner for Apache Beam also makes Beam pipelines portable to nearly any public or private cloud environment. We hear that developers and data engineers love Google Cloud’s Dataflow for streaming pipelines because it offers a way to run Apache Beam data processing pipelines in the cloud with fully automated provisioning and management of resources. However, many companies have either technical or compliance constraints on what data can be taken to the cloud. Using the Kubernetes operator for Apache Flink makes it easy to deploy Flink jobs, including ones authored with the Beam SDK that target the Flink runner. This enables Flink users to run Beam pipelines in the cloud using a service like GKE, while still making it easy to run jobs on-premises with Anthos.
Apache Beam fills an important gap for Flink users who prefer a mature Python API. For example, if you are a machine learning engineer using TFX on-prem for your end-to-end machine learning lifecycle, you can author your pipeline using the Beam and TFX libraries, then run it on the Flink runner.
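As a rough illustration of authoring a pipeline with the Beam Python SDK and targeting the Flink runner, here is a minimal word-count sketch. It assumes the `apache_beam` package is installed and that a Flink JobManager is reachable at the address you pass in. The default `localhost:8081` is a placeholder, and the helper names are our own, not part of Beam.

```python
import re


def extract_words(line):
    """Split a line of text into lowercase words."""
    return re.findall(r"[a-z']+", line.lower())


def format_count(word_count):
    """Render a (word, count) pair as a printable string."""
    word, count = word_count
    return f"{word}: {count}"


def flink_options(flink_master):
    """Command-line style flags that point Beam at a Flink cluster."""
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # LOOPBACK runs the Python code in the submitting process; it is
        # meant for local testing, not for a production cluster.
        "--environment_type=LOOPBACK",
    ]


def run(flink_master="localhost:8081"):
    """Build the pipeline and submit it to the Flink runner."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(flink_options(flink_master))
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["to be or not to be"])
            | "Split" >> beam.FlatMap(extract_words)
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(format_count)
            | "Print" >> beam.Map(print)
        )
```

Calling `run("<jobmanager-host>:8081")` would submit the job. Only the options change when the same pipeline targets another runner, which is the portability described above.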
You can get started with the Flink Operator in Kubernetes by deploying it from the Google Cloud Marketplace today. For those interested in contributing to the project, find us on GitHub. Learn more in this video about the Flink on Kubernetes operator and take a look at the operations it provides.