Large-scale data processing workloads can be challenging to operationalize and orchestrate. Google Cloud has announced the release of a Firestore in Native Mode connector for Apache Beam that makes data processing easier for Firestore users. Apache Beam is a popular open-source project that supports large-scale data processing with a unified batch and streaming model. It’s portable, works with many different backend runners, and allows for flexible deployment. With the Firestore Beam I/O Connector, Firestore joins BigQuery, Bigtable, and Datastore as Google databases with Apache Beam connectors. The connector is included automatically with the Google Cloud Platform IO module of the Apache Beam Java SDK.
The Firestore connector can be used with a variety of Apache Beam backends, including Google Cloud Dataflow. Dataflow, an Apache Beam backend runner, gives developers a structure for solving “embarrassingly parallel” problems: problems that split into independent pieces of work needing no coordination with one another. Mutating every record of your database is one example. Using Beam pipelines removes much of the work of orchestrating the parallelization and lets developers focus instead on the transforms applied to the data.
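To make “embarrassingly parallel” concrete, here is a minimal plain-Java sketch of mutating every record by hand. It assumes an in-memory map standing in for a database and does not use Beam or Firestore at all; the class and method names (`ParallelMutationSketch`, `partition`, `mutateAll`) are hypothetical and chosen only for illustration. The partitioning, thread pool, and completion tracking are the kind of orchestration work a Beam runner takes over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: an in-memory map plays the role of the database.
public class ParallelMutationSketch {

    // Splits items into at most `parts` contiguous, non-overlapping slices.
    static <T> List<List<T>> partition(List<T> items, int parts) {
        List<List<T>> slices = new ArrayList<>();
        int size = (items.size() + parts - 1) / parts; // ceiling division
        for (int start = 0; start < items.size(); start += size) {
            slices.add(items.subList(start, Math.min(start + size, items.size())));
        }
        return slices;
    }

    // Applies `transform` to every value, one slice of keys per submitted task.
    // No slice depends on another, so the tasks never need to coordinate.
    static void mutateAll(Map<String, String> store, int workers,
                          UnaryOperator<String> transform) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<String> keys = new ArrayList<>(store.keySet());
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (List<String> slice : partition(keys, workers)) {
                pending.add(CompletableFuture.runAsync(() -> {
                    for (String key : slice) {
                        store.computeIfPresent(key, (k, v) -> transform.apply(v));
                    }
                }, pool));
            }
            // Wait for every slice to finish before returning.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new ConcurrentHashMap<>();
        for (int i = 0; i < 1000; i++) {
            store.put("doc-" + i, "value-" + i);
        }
        mutateAll(store, 4, String::toUpperCase);
        System.out.println(store.get("doc-42")); // prints VALUE-42
    }
}
```

Even this toy version has to decide how to split the keys, how many workers to run, and when the job is done. It leaves out retries, rate limiting, and scaling across machines, which is where handing the orchestration to a runner pays off.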
To better understand the use case for a Beam and Firestore pipeline, let’s look at an example that illustrates the value of using Google Cloud Dataflow for bulk operations on a Firestore database. Imagine you have a Firestore database with a collection group on which you want to perform a large number of operations, such as deleting every document in that collection group. Doing this on a single worker could take a while. What if we could instead use the power of Beam to do it in parallel?