The first step is to ingest data into the pipeline. Here, the inputs take the form of aggregated portfolio and trade data. One key design goal was the ability to handle both batch and stream inputs. In the batch case, CSV files are uploaded to Google Cloud Storage, and each upload triggers a notification message on a Cloud Pub/Sub topic. In the streaming case, data is published directly to a Cloud Pub/Sub topic. Cloud Pub/Sub is a fully managed service that provides scalable, reliable, at-least-once delivery of messages for event-driven architectures. It enables loose coupling of application components and supports both push and pull message delivery.
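To illustrate how a single subscriber might treat both input modes uniformly, here is a minimal sketch in plain Python. It assumes batch messages are Cloud Storage notifications (which carry `eventType`, `bucketId` and `objectId` attributes) and that streamed records arrive as JSON; the `IngestEvent` envelope, the function name and the JSON encoding are hypothetical, not details from our implementation.

```python
import json
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IngestEvent:
    """Common envelope for both input modes (hypothetical shape)."""
    mode: str                       # "batch" or "stream"
    gcs_uri: Optional[str] = None   # batch: location of the uploaded CSV
    payload: Optional[dict] = None  # stream: the inline record


def to_ingest_event(attributes: Mapping[str, str], data: bytes) -> Optional[IngestEvent]:
    """Map one Pub/Sub message to an IngestEvent, or None if it should be skipped."""
    event_type = attributes.get("eventType")
    if event_type is not None:
        # Storage notification: react only to completed uploads of CSV files.
        if event_type != "OBJECT_FINALIZE":
            return None
        name = attributes["objectId"]
        if not name.lower().endswith(".csv"):
            return None
        return IngestEvent(mode="batch", gcs_uri=f"gs://{attributes['bucketId']}/{name}")
    # No storage attributes: assume a directly published JSON record.
    return IngestEvent(mode="stream", payload=json.loads(data.decode("utf-8")))
```

Downstream code can then branch on `mode` once, reading the file at `gcs_uri` for batch inputs and using `payload` directly for streamed ones.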
Those Cloud Pub/Sub messages feed a Cloud Dataflow pipeline for trade data preprocessing. Cloud Dataflow is a fully managed, auto-scaling service for transforming and enriching data in both stream and batch modes, based on open-source Apache Beam. The portfolio inputs are cleansed and split into individual trade elements, and the risk calculations required for each trade are determined. The individual trade elements are then published to downstream Cloud Pub/Sub topics to be consumed by the risk calculation engine.
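The cleanse-and-split step can be sketched as a plain Python generator. The field names (`trade_id`, `instrument_type`, `notional`) and the `REQUIRED_CALCS` mapping are illustrative assumptions; the actual schema and the rules for choosing calculations are not described here.

```python
from typing import Dict, Iterator, List

# Hypothetical mapping from instrument type to the risk measures it needs.
REQUIRED_CALCS: Dict[str, List[str]] = {
    "equity": ["delta"],
    "option": ["delta", "gamma", "vega"],
    "bond": ["dv01"],
}


def split_portfolio(portfolio: dict) -> Iterator[dict]:
    """Yield one cleansed trade element per valid trade in the portfolio."""
    for raw in portfolio.get("trades", []):
        trade_id = str(raw.get("trade_id", "")).strip()
        instrument = str(raw.get("instrument_type", "")).strip().lower()
        # Cleansing: drop rows with no id or an unknown instrument type.
        if not trade_id or instrument not in REQUIRED_CALCS:
            continue
        yield {
            "portfolio_id": portfolio["portfolio_id"],
            "trade_id": trade_id,
            "instrument_type": instrument,
            "notional": float(raw["notional"]),
            # Attach the calculations the risk engine should run for this trade.
            "calcs": list(REQUIRED_CALCS[instrument]),
        }
```

Because the function yields zero or more elements per input, in an Apache Beam pipeline it could be applied with `beam.FlatMap(split_portfolio)`, with each yielded element then published to the downstream topic.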
Intermediate results from the preprocessing steps are persisted to Cloud Datastore, a fully managed, serverless NoSQL document database. This pattern of checkpointing intermediate results to Cloud Datastore is repeated throughout the architecture. We chose Cloud Datastore for its flexibility, as it brings the scalability and availability of a NoSQL database alongside capabilities such as ACID transactions, indexes and SQL-like queries.
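Because Cloud Pub/Sub delivery is at-least-once, a stage may see the same message twice, so checkpoint writes benefit from being idempotent. The sketch below shows one way to do that with a deterministic key per trade and stage. It uses an in-memory class as a stand-in for the database; the `CheckpointStore` class, the key layout and the function names are all hypothetical, and a real implementation would perform the check-and-write inside a Cloud Datastore transaction.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (kind, trade_id, stage): hypothetical key layout


class CheckpointStore:
    """In-memory stand-in for a keyed document store such as Cloud Datastore."""

    def __init__(self) -> None:
        self._docs: Dict[Key, dict] = {}

    def put_if_absent(self, key: Key, doc: dict) -> bool:
        """Store doc under key; return False if the key was already written."""
        if key in self._docs:
            return False
        self._docs[key] = dict(doc)
        return True

    def get(self, key: Key) -> dict:
        return dict(self._docs[key])


def checkpoint_stage(store: CheckpointStore, trade_id: str, stage: str, result: dict) -> bool:
    # The key depends only on the trade and the stage, so a redelivered
    # message maps to the same entity instead of creating a duplicate.
    return store.put_if_absent(("TradeCheckpoint", trade_id, stage), result)
```

A stage can use the boolean return value to decide whether to publish downstream or to treat the message as a duplicate and simply acknowledge it.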
At the heart of the architecture sits the risk calculation engine, deployed on GKE. GKE is a managed, production-ready environment for deploying containerized applications. We knew we wanted to evaluate GKE, and Kubernetes more broadly, as a platform for risk computation for the following reasons:
The risk engine is a set of Kubernetes services designed to handle data enrichment, perform the required calculations, and output results. Pods are auto-scaled independently, using Stackdriver metrics on Cloud Pub/Sub queue depth, and the cluster itself is scaled based on overall CPU load. As in the preprocessing step, intermediate results are persisted to Cloud Datastore, and pods publish messages to Cloud Pub/Sub to move data through the pipeline. The pods can run inside a private cluster that is isolated from the internet but can still interact with other GCP services via Private Google Access.
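To make queue-depth scaling concrete, the sketch below computes a replica count from a subscription backlog. It assumes the rule used by the Kubernetes Horizontal Pod Autoscaler for an average-value target, where the desired replica count is the total metric value divided by the per-pod target, rounded up and clamped to configured bounds. The function name and the example numbers are illustrative, and the exact scaling policy of the platform is not specified here.

```python
import math


def desired_replicas(backlog: int, target_per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count that keeps the backlog per pod at or below the target."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Total undelivered messages divided by the per-pod target, rounded up.
    wanted = math.ceil(backlog / target_per_pod)
    # Clamp to the configured bounds so the service never scales to zero
    # or beyond what the cluster is allowed to run.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(950, 100, 1, 20))   # 10
print(desired_replicas(0, 100, 1, 20))     # 1
print(desired_replicas(5000, 100, 1, 20))  # 20
```

Each service can use its own target and bounds, which is what lets the enrichment, calculation and output pods scale independently of one another.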
Final calculation results output by the risk engine are published to a Cloud Pub/Sub topic, which feeds a Cloud Dataflow pipeline. Cloud Dataflow enriches the results with the portfolio and market data used for the calculations, creating full-featured snapshots. These snapshots are persisted to BigQuery, GCP’s serverless, highly scalable enterprise data warehouse. BigQuery allows analysis of the risk exposures at scale, using SQL and industry-standard tooling, driving customer use cases like regulatory reporting.
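The enrichment step is essentially a join between each risk result and the reference data it was computed from. A minimal sketch, assuming hypothetical field names (`desk`, `underlying`, `spot`, `as_of`) and in-memory lookups keyed by portfolio ID and underlying:

```python
from typing import Mapping


def build_snapshot(result: Mapping,
                   portfolios: Mapping[str, Mapping],
                   market: Mapping[str, Mapping]) -> dict:
    """Join one risk result with its portfolio and market-data context."""
    portfolio = portfolios[result["portfolio_id"]]
    quote = market[result["underlying"]]
    # Flatten everything into a single row so it can be queried with SQL.
    return {
        "trade_id": result["trade_id"],
        "portfolio_id": result["portfolio_id"],
        "desk": portfolio["desk"],
        "measure": result["measure"],
        "value": result["value"],
        "underlying": result["underlying"],
        "spot": quote["spot"],
        "as_of": quote["as_of"],
    }
```

Storing the market data alongside each result means a snapshot row is self-describing: a report can show both the exposure and the inputs that produced it without a further join.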
Lessons learned building a proof-of-concept data platform
We learned some valuable lessons while building out this platform:
What’s next for our risk solution
We built a modernized, cloud-native risk computation platform that offers several advantages over traditional grid-based architectures. The architecture is largely serverless, using managed services such as Cloud Dataflow, Cloud Pub/Sub and Cloud Datastore. The solution is open-source at its core, using Kubernetes and Apache Beam via GKE and Cloud Dataflow, respectively. BigQuery provides an easy way to store and analyze financial data at scale. The architecture handles both batch and stream inputs, and scales up and down to match load.
Using GCP, we addressed some of the key challenges associated with traditional risk approaches, namely inflexibility, high management overhead and reliance on expensive third-party tools. As our VP of financial services, Suranjan Som, put it: “The GCP risk analytics solution provides a scalable, open and cost-efficient platform to meet increasing risk and regulatory requirements.” We’re now planning further work to test the solution at production scale.