Real-time Change Data Capture for data replication into BigQuery

Migrate from Oracle to PostgreSQL with minimal downtime with Datastream
May 26, 2021
When Apple Admits macOS Malware Is A Problem – It’s Time To Take Notice
May 26, 2021
Migrate from Oracle to PostgreSQL with minimal downtime with Datastream
May 26, 2021
When Apple Admits macOS Malware Is A Problem – It’s Time To Take Notice
May 26, 2021

Businesses hoping to make timely, data-driven decisions know that the value of their data may degrade over time and can be perishable. This has created a growing demand to analyze and build insights from data the moment it becomes available, in real-time. Many will find that the operational and transactional data fuelling their business is often stored in relational databases, which work well for processing transactions, but aren’t designed or optimized for running real-time analytics at scale.

Traditional approaches to solving this challenge include replicating data from one source to another in scheduled bulk loads of entire, frequently large, datasets. This is often costly, strenuous on production systems, and can become a bottleneck to making timely and accurate decisions.

So, how can you run real-time analytics against operational and transactional data?

You can achieve this with a technique for data integration known as Change Data Capture (CDC). CDC identifies and captures changes in source databases (updates, inserts and deletes). This allows you to process only the data that has changed, at the moment it changes. CDC delivers a low-latency, near real-time, and cost-effective solution for data acquisition, replication, storage and analysis.

CDC can replicate transactional data into data warehouses, unlocking the potential to analyze the freshest data for operational reporting, streaming analytics, cache invalidation, event-driven architectures, and more. However, implementing CDC solutions can be complex, require expensive licenses, and be heavily reliant on niche technical expertise. In this blog, we’ll explore how you can take advantage of a completely cloud-native, end-to-end solution to this problem.

BigQuery is Google Cloud’s data warehouse which offers a serverless and cost-effective way to store large amounts of data, it is uniquely optimized for large-scale analytics. While BigQuery is a great solution for operational analytics, one of the biggest challenges is bringing in data in a reliable, timely, and easy-to-use manner. There have been scattered solutions in this area, but they have largely placed the burden of integration on customers.

The launch of Datastream, our new, serverless CDC and replication service, solves many of these challenges. Datastream synchronizes data across heterogeneous databases, applications, and storage systems with minimal latency. It supports data replication for a variety of use cases, including real-time analytics. Datastream integrates with our Data and Analytics services allowing you to create simple, end-to-end, cloud-native solutions that replicate your changed data into BigQuery:

  • Cloud Data Fusion is Google Cloud’s integration service for building ETL and ELT data pipelines. Data Fusion already supports the replication of data from SQL Server and MySQL to BigQuery through an easy-to-use, wizard-driven experience. Data Fusion now integrates with Datastream to support Oracle as a data source, without the need for expensive licenses or agents.

  • Dataflow is our fully managed service for unified stream and batch data processing. Dataflow’s integration with Datastream includes the launch of three new templates that replicate data to BigQuery, Cloud Spanner and Cloud SQL for PostgreSQL. You can also extend and customize the Dataflow templates that ingest and process changed data from Datastream sources. This is key if you need to do transformations or enrichments with data from another source before storing it in Google Cloud.

Leave a Reply

Your email address will not be published. Required fields are marked *