Goodbye Hadoop. Building a streaming data processing pipeline on Google Cloud

Getting to this point took quite a bit of experimentation, though. As an early adopter of Google Cloud, we were one of the first businesses to test Cloud Bigtable, and we built Stash, our in-house key/value datastore, on top of it. In parallel, we had been building a similar version on HBase, but the ease of use and deployment, manageability, and flexibility we experienced with Bigtable convinced us of the power of a serverless architecture, setting us on the path to building our new infrastructure on managed services.
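For readers unfamiliar with Bigtable's data model, here is a minimal sketch of the kind of key/value access pattern a store like Stash builds on, using the google-cloud-bigtable Python client. The project, instance, table, and column names are placeholders, not Stash's actual schema.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("stash")

# Write: one row keyed by the item's key, a single cell in a column family.
row = table.direct_row(b"user#1234")
row.set_cell("kv", "value", b'{"plan": "premium"}')
row.commit()

# Read the value back by key.
fetched = table.read_row(b"user#1234")
print(fetched.cells["kv"][b"value"][0].value)
```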

As with many other Google Cloud services, we were fortunate to get early access to Cloud Dataflow and began scoping out a new streaming pipeline built on it. It took our developers only a week to build a functional end-to-end prototype. This ease of prototyping and validation cemented our decision to use Dataflow, since it let us iterate on ideas rapidly.
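To illustrate why prototyping went so quickly, here is a minimal sketch of a streaming Beam/Dataflow pipeline in Python that reads events from Pub/Sub and writes them to BigQuery. The topic, project, and table names are placeholders; this is not our production code.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A streaming pipeline: read raw Pub/Sub messages, parse the JSON payload,
# and append the resulting rows to a BigQuery table.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```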

We briefly experimented with building a hybrid platform, using GCP for the main data ingestion pipeline and another popular cloud provider for data warehousing. However, we quickly settled on BigQuery as our exclusive data warehouse: it scales effortlessly and on demand, lets us run powerful analytical queries in familiar SQL, and supports streaming writes, making it a perfect fit for our use case. At the time of writing, our continuously growing BigQuery storage has reached the 5 PB mark, and our queries process over 10 PB of data every month.
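As an illustration of the streaming-write and SQL sides of that workflow, here is a hedged sketch using the google-cloud-bigquery Python client; the dataset, table, and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.analytics.events"

# Streaming write: inserted rows become queryable within seconds.
rows = [{"client_id": "acme", "event_type": "page_view",
         "ts": "2019-07-08T12:00:00Z"}]
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError("BigQuery streaming insert failed: {}".format(errors))

# Analytical queries stay in plain SQL.
query = """
    SELECT client_id, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY client_id
"""
for row in client.query(query).result():
    print(row.client_id, row.events)
```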

Our data storage architecture requires that the events be routed to different BigQuery datasets based on client identifiers baked into the events. This, however, was not supported by the early versions of the Dataflow SDK, so we wrote some custom code for the BigQuery streaming write transforms to connect our data pipeline to BigQuery. Routing data to multiple BigQuery tables has since been added as a feature in the new Beam SDKs.
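For reference, here is roughly what that routing looks like with the newer Beam Python SDK, where WriteToBigQuery accepts a callable that computes the destination table per element. The client_id field and the dataset naming scheme below are hypothetical.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def client_table(event):
    # Route each event to a dataset derived from its client identifier,
    # e.g. {"client_id": "acme", ...} -> "my-project:client_acme.events".
    return "my-project:client_{}.events".format(event["client_id"])

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/firehose")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # WriteToBigQuery accepts a callable, so the destination table is
        # computed per element rather than fixed at pipeline construction.
        | "WriteRouted" >> beam.io.WriteToBigQuery(
            table=client_table,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```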

Within a couple of months, we had written a production-ready data pipeline and ingestion infrastructure. Along the way, we marveled at the ease of development, management, and maintenance that this serverless architecture offered, and saw some remarkable engineering and business-level optimizations. For example, we cut our engineering operational cost roughly in half. We no longer pay for idle machines or manually provision new ones as traffic increases; everything is configured to autoscale based on incoming traffic, so we pay only for what we use. We have handled massive traffic spikes (10-25x) during major retail and sporting events such as Black Friday, Cyber Monday, and Boxing Day for three years in a row without any hiccups.
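For context, autoscaling on Dataflow is configured through pipeline options. A minimal sketch with example values (the project, region, bucket, and worker cap are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# With THROUGHPUT_BASED autoscaling, Dataflow adds workers during spikes
# (Black Friday, Cyber Monday, ...) and releases them afterwards, so we are
# billed only for what we actually use.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=50,  # upper bound on workers during peak traffic
)
```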

Google’s fully managed services spare us the massive engineering effort required to scale and maintain our infrastructure, which is a huge win for our infrastructure team. Reduced management overhead means our SREs have more time to build useful automation and deployment tools.

It also means that our product teams can get proofs of concept out the door faster, enabling them to validate ideas quickly, reject the ones that don’t work, and rapidly iterate on the successful ones.

This serverless architecture has helped us build a federated data model fed by a central Cloud Pub/Sub firehose that serves all our teams internally, thus eliminating data silos. BigQuery serves as a single source of truth for all our teams and the data infrastructure that we built on Google Cloud powers our app and drives all our client-facing products.
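The fan-out behind such a firehose is simply one Pub/Sub topic with a subscription per consuming team. A minimal sketch with hypothetical project, topic, and team names, using the google-cloud-pubsub Python client:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# One central topic ("firehose"), one subscription per consuming team;
# each team receives its own copy of every event, so there are no silos.
topic_path = publisher.topic_path("my-project", "firehose")
for team in ("personalization", "reporting", "data-science"):
    sub_path = subscriber.subscription_path("my-project", "firehose-" + team)
    subscriber.create_subscription(name=sub_path, topic=topic_path)
```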

Our partnership with Google is fundamental to our success: underpinning our own technology stack, it ensures that every customer interaction on web, mobile, or in-app is smarter, sharper, and more personal. Serverless architecture has helped us build a powerful ingestion infrastructure that forms the basis of our personalization platform. In upcoming posts, we’ll look at some of the tools we developed to work with these managed services, such as an open source tool for launching dataflows and our in-house Pub/Sub event router. We’ll also look at how we monitor the platform. Finally, we’ll dive into some of the personalization solutions we built on this serverless architecture, such as our recommendation engine. In the interim, feel free to reach out to us with comments and questions.
