So, you find yourself executing the same Python script each day. Maybe you’re executing a query in BigQuery and dumping the results into Bigtable each morning to run an analysis. Or perhaps you need to update the data in a Pivot Table in Google Sheets to create a really pretty histogram to display your billing data. Regardless, no one likes doing the same thing every day if technology can do it for them. Behold the magic of Cloud Scheduler, Cloud Functions, and PubSub!
Cloud Scheduler is a managed Google Cloud Platform (GCP) product that lets you schedule a recurring job at a frequency you specify. In a nutshell, it is a lightweight managed task scheduler. The task can be an ad hoc batch job, a big data processing job, infrastructure automation tooling, you name it. The nice part is that Cloud Scheduler handles all the heavy lifting for you: it retries in the event of failure and even lets you run something at 4 AM, so you don’t need to wake up in the middle of the night to run a workload at off-peak hours.
When setting up the job, you choose what exactly you will “trigger” at runtime. This can be a PubSub topic, an HTTP endpoint, or an App Engine application. In this example, we will publish a message to a PubSub topic.
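To make the pieces of a job concrete, here is a minimal sketch that only builds a job description as a plain Python dict; it does not call any Google Cloud API. The field names (name, schedule, time_zone, pubsub_target) follow the Cloud Scheduler job resource as I understand it, so check them against the current API reference. The project, location, job, and topic IDs are made up for illustration, and the cron string assumes a job that fires at 8 AM and 8 PM Eastern.

```python
def build_scheduler_job(project_id, location, job_id, topic_id, schedule, time_zone):
    """Return a plain dict describing a scheduler job with a PubSub target.

    This is an illustrative helper, not part of any Google library.
    """
    return {
        # Fully qualified job name.
        "name": f"projects/{project_id}/locations/{location}/jobs/{job_id}",
        # Unix-cron format: minute hour day-of-month month day-of-week.
        "schedule": schedule,
        # The time zone the schedule is interpreted in.
        "time_zone": time_zone,
        # The "trigger": publish a message to this topic.
        "pubsub_target": {
            "topic_name": f"projects/{project_id}/topics/{topic_id}",
            "data": b"run",  # placeholder payload (assumption)
        },
    }


# All identifiers below are hypothetical.
job = build_scheduler_job(
    project_id="my-project",
    location="us-east1",
    job_id="daily-github-query",
    topic_id="run-script",
    schedule="0 8,20 * * *",  # minute 0 of hours 8 and 20
    time_zone="America/New_York",
)
print(job["pubsub_target"]["topic_name"])
```

The same three decisions (when to run, in which time zone, and what to trigger) are what you fill in when creating the job in the console.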
Our PubSub topic exists purely to connect the two ends of our pipeline: it is an intermediary between the Cloud Scheduler job and the Cloud Function, which holds the actual Python script that we will run. Essentially, the PubSub topic acts like a telephone line, letting the Cloud Scheduler job talk and the Cloud Function listen. The Cloud Scheduler job publishes a message to the topic, and the Cloud Function subscribes to that topic, so it is alerted whenever a new message is published. When it is alerted, it executes the Python script.
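The listening side can be sketched as follows. This assumes the background-function style of Python Cloud Function, where the entry point receives an event dict whose "data" field holds the base64-encoded message and a context object with metadata. The function name and the placeholder work inside it are mine, not taken from the script used in this post.

```python
import base64


def run_scheduled_script(event, context):
    """Hypothetical entry point for a PubSub-triggered Cloud Function.

    event["data"] carries the published message, base64-encoded.
    context holds event metadata and is unused in this sketch.
    """
    # Decode the payload if the publisher attached one.
    payload = ""
    if event.get("data"):
        payload = base64.b64decode(event["data"]).decode("utf-8")

    # Placeholder for the real work, such as running a query.
    result = f"triggered with payload: {payload!r}"
    print(result)
    return result


# Local simulation of the event the platform would pass in.
fake_event = {"data": base64.b64encode(b"run").decode("utf-8")}
run_scheduled_script(fake_event, None)
```

Calling the function locally with a hand-built event, as in the last two lines, is a cheap way to check the script before deploying it.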
For this example, I’ll show you a simple Python script that I want to run daily at 8 AM ET and 8 PM ET. The script is basic: it executes a SQL query in BigQuery to find popular GitHub repositories. Specifically, we will look for which owners created the repositories with the most forks and in which year those repositories were created. We will use data from the public dataset bigquery-public-data:samples, which holds data about repositories created between 2007 and 2012. Our SQL query looks like this: