Moving a publishing workflow to BigQuery for new data insights

DACH businesses embrace Google Cloud for digital transformation
September 20, 2019
The Good, the Bad and the Ugly in Cybersecurity – Week 38
September 20, 2019

We used a simple Python script to read the issues from the API and then insert the entries into BigQuery using the streaming API. We chose the streaming API, rather than Cloud Pub/Sub or Cloud Dataflow, because we wanted to repopulate the BigQuery content with the latest data several times a day. The Google API Python client library was an obvious choice, because it provides an idiomatic way to interact with the Google APIs, including the BigQuery streaming API.

Since this data would only be used for reporting purposes, we opted to keep only the most recent version of the data as extracted. There were two reasons for this decision:

  1. Master data: There would never be any question about which data was the master version of the data.
  2. Historical data: We had no use cases that required capturing any historical data that wasn’t already captured in the data extract.

Following common extract, transform, load (ETL) best practices, we used a staging table and a separate production table so that we could load data into the staging table without impacting users of the data. The design we created based on ETL best practices called for first deleting all the records from the staging table, loading the staging table, and then replacing the production table with the contents.

When using the streaming API, the BigQuery streaming buffer remains active for about 30 to 60 minutes or more after use, which means that you can’t delete or change data during that time. Since we used the streaming API, we scheduled the load every three hours to balance getting data into BigQuery quickly and being able to subsequently delete the data from the staging table during the load process.

Once our data was in BigQuery, we could write SQL queries directly against the data or use any of the wide range of integrated tools available to analyze the data. We chose Data Studio for visualization because it’s well-integrated with BigQuery, offers customizable dashboard capabilities, provides the ability to collaborate, and of course, is free.

Because BigQuery datasets can be shared with users, this opened up the usability of the data for whomever was granted access and also had appropriate authorization. This also meant that we could combine this data in BigQuery with other datasets. For example, we track the online engagement metrics for our reference guides and load them into BigQuery. With both datasets in BigQuery, it made it easy to factor in the online engagement numbers to build dashboards.

One of the biggest reasons that we wanted to create reporting against our publishing process is to track the publishing process over time. Data Studio made it easy to build a dashboard with charts, similar to the two charts below. Building the dashboard in Data Studio allowed us to easily analyze our publication metrics over time and then share the specific dashboards with teams outside ours.

Leave a Reply

Your email address will not be published. Required fields are marked *