Introducing the BigQuery Terraform module

It’s no secret software developers love to automate their work away, and cloud development is no different. Since the release of the Cloud Foundation Toolkit (CFT), we’ve offered automation templates with Deployment Manager and Terraform to help engineers get set up with Google Cloud Platform (GCP) quickly. But as useful as the Terraform offering was, it was missing a critical module for a critical piece of GCP: BigQuery.

Fortunately, those days are over. With the BigQuery module for Terraform, you can now automate the instantiation and deployment of your BigQuery datasets and tables. This means you have an open-source option to start using BigQuery for data analytics.

In building the module, we applied the flexibility and extensibility of Terraform throughout and adhered to the following principles:

  • Referenceable templates

  • Modular, loosely coupled design for reusability

  • Provisioning and association for both datasets and tables

  • Support for full unit testing (via Kitchen-Terraform)

  • Access control (coming soon)

By including the BigQuery Terraform module in your larger CFT scripts, it’s possible for you to go effectively from zero to ML in minutes, with significantly reduced barriers to implementation.

Let’s walk through how to set this up.

Building blocks: GCP and Terraform prerequisites

To use the BigQuery Terraform module, you’ll need (you guessed it) to have BigQuery and Terraform ready to go.

Note: The steps outlined below are applicable for Unix- and Linux-based devices, and have not been optimized for CI/CD systems or production use.

1. Download the Terraform binary that matches your system type and follow the Terraform installation process.
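
For example, on a 64-bit Linux machine the download-and-install might look like the following sketch (the version number is just an example; grab whatever release and platform you need from the Terraform downloads page):

  # Example: install a specific Terraform release on 64-bit Linux
  TF_VERSION=0.12.9
  curl -LO https://releases.hashicorp.com/terraform/${TF_VERSION}/terraform_${TF_VERSION}_linux_amd64.zip
  unzip terraform_${TF_VERSION}_linux_amd64.zip
  sudo mv terraform /usr/local/bin/
  terraform version  # confirm the binary is on your PATH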

2. Install Google Cloud SDK on your local machine.
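
If you don’t have the Cloud SDK installed yet, the interactive installer is one quick option on Unix-like systems (a sketch; see the Cloud SDK documentation for other install methods):

  # Download and run the interactive Cloud SDK installer
  curl https://sdk.cloud.google.com | bash

  # Restart your shell, then authenticate and set a default project
  exec -l $SHELL
  gcloud init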

3. Start by creating a GCP project in a folder within your organization. You can do this via Terraform with something like the following:

  resource "google_folder" "department-1" {
  display_name = "department-1"
  parent       = "organizations/1234567"
}

resource "google_project" "my-project-in-a-folder" {
  name       = "my-project"
  project_id = "your-project-id"
  folder_id  = "${google_folder.department-1.name}"
}

4. Let’s set up some environment variables to use. Ensure you update the values to accurately reflect your environment.

  SA_ID=bq-deployer
PROJECT_ID=your-project-id
SA_EMAIL=$SA_ID@$PROJECT_ID.iam.gserviceaccount.com

5. Go ahead and enable the BigQuery API (or use the helpers directory in the module instead):

  gcloud services enable bigquery-json.googleapis.com \
    --project $PROJECT_ID

6. Establish an identity with the required IAM permissions:

  gcloud iam service-accounts create $SA_ID \
    --display-name $SA_ID \
    --project $PROJECT_ID

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$SA_EMAIL \
  --role roles/bigquery.dataOwner \
  --user-output-enabled false

gcloud iam service-accounts keys create $HOME/.ssh/bq-key.json \
  --iam-account $SA_EMAIL \
  --project $PROJECT_ID

7. Browse through the examples directory to see the full range of deployments that are possible with the module.

What’s in the box: Get to know the Terraform module

The BigQuery module is packaged in a self-contained GitHub repository for you to easily download (or reference) and deploy. Included in the repo is a central module that supports both Terraform v0.12.X and v0.11.X, allowing users (both human and GCP service accounts) to dynamically deploy datasets with any number of tables attached to the dataset. (By the way, the BigQuery module has you covered in case you’re planning to partition your tables using a TIMESTAMP or DATE column to optimize for faster retrieval and lower query costs.)

To enforce naming standardization, the BigQuery module creates a single dataset that is referenced by each of the tables it creates, which streamlines the creation of multiple instances and generates individual Terraform state files per BigQuery dataset. This is especially useful for customers with hundreds of tables in dozens of datasets, who don’t want to get stuck with manual creation. That said, the module is fundamentally an opinionated method for setting up your datasets and table schemas; you’ll still need to handle your data ingestion or upload via any of the methods outlined here, as that’s not currently supported by Terraform.
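
For example, once the module has provisioned the dataset and tables, a one-off load from a local CSV might look like the following sketch. The foo.foo dataset and table IDs come from the example configuration later in this post, and data.csv is a hypothetical file:

  # Load a local CSV into the table created by the module
  # (foo.foo matches the example dataset/table IDs below; data.csv is hypothetical)
  bq --project_id=$PROJECT_ID load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    foo.foo \
    ./data.csv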

In addition, the repo is packaged with a rich set of test scripts that use Kitchen-Terraform plugins, robust examples on how to use the module in your deployments, major version upgrade guides, and helper files to get users started quickly.
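
If you want to try the test suite locally, the typical Kitchen-Terraform flow looks roughly like this (a sketch that assumes you have Ruby, the kitchen-terraform gem, and suitable service account credentials available; check the repo’s docs for the exact commands and required variables):

  # Run the Kitchen-Terraform suite against the repo's test fixtures
  kitchen create    # initialize Terraform for each test fixture
  kitchen converge  # terraform apply
  kitchen verify    # run the assertions against the deployed resources
  kitchen destroy   # tear everything down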

Putting them together: Deploying the module

Now that you have BigQuery and Terraform set up, it’s time to plug them together.

1. Start by cloning the repository:

  git clone https://github.com/terraform-google-modules/terraform-google-bigquery.git

2. If you didn’t enable the BigQuery API and create the service account with permissions earlier, run the setup-sa.sh quickstart script in the helpers directory of the repo. This will set up the service account and permissions, and enable the BigQuery API.

  cd ./terraform-google-bigquery/helpers/
. setup-sa.sh <organization name> $PROJECT_ID

3. Define your BigQuery table schema, or try out an example schema here.
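
If you don’t have a schema handy, here’s a minimal, hypothetical sample_bq_schema.json you could start from (the field names are illustrative; note the TIMESTAMP column, the kind of field mentioned earlier for partitioning):

  [
    {
      "name": "id",
      "type": "STRING",
      "mode": "REQUIRED",
      "description": "Unique record identifier"
    },
    {
      "name": "amount",
      "type": "NUMERIC",
      "mode": "NULLABLE",
      "description": "Transaction amount"
    },
    {
      "name": "created_at",
      "type": "TIMESTAMP",
      "mode": "NULLABLE",
      "description": "When the record was created"
    }
  ]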

4. Create a deployment (module) directory.

  mkdir /bq-basic-deployment

5. Create the deployment files: main.tf, variables.tf, outputs.tf, and optionally a terraform.tfvars (in case you want to override default vars in the variables.tf file):

  touch main.tf variables.tf outputs.tf terraform.tfvars

6. Populate the files as detailed below.

main.tf

  provider "google" {
  version     = "~> 2.5.0"
  # Update credentials to the correct location, alternatively set   GOOGLE_APPLICATION_CREDENTIALS=/path/to/.ssh/bq-key.json in your shell session and   remove the credentials attribute.
  credentials = file("/path/to/.ssh/bq-key.json") 
}

module "bigquery" {
  source            = "terraform-google-modules/bigquery/google" # Path to the module
  version           = "~> 2.0.0" # Specify the version of the module you require
  dataset_id        = "foo"
  dataset_name      = "foo"
  description       = "some description" # update the description accordingly
  expiration        = var.expiration
  project_id        = var.project_id
  location          = "US" # Update location if needed
  tables            = var.tables
  time_partitioning = var.time_partitioning
  dataset_labels    = var.dataset_labels
}

outputs.tf

  output "dataset_id" {
  value       = module.bigquery.dataset_id
  description = "Unique id for the dataset being provisioned"
}

output "dataset_name" {
  value       = module.bigquery.dataset_name
  description = "Friendly name for the dataset being provisioned"
}

output "dataset_project" {
  value       = module.bigquery.dataset_project
  description = "Project where the dataset and table are created"
}

output "table_id" {
  value       = module.bigquery.table_id
  description = "Unique id for the table being provisioned"
}

output "table_name" {
  value       = module.bigquery.table_name
  description = "Friendly name for the table being provisioned"
}

output "dataset_labels" {
  value       = module.bigquery.dataset_labels
  description = "Key value pairs in a map for dataset labels"
}

output "table_labels" {
  value       = module.bigquery.table_labels
  description = "Key value pairs in a map for table labels"
}

terraform.tfvars

  # Update with the project you are deploying the module to
project_id = "example-project"

# Time-based partitioning granularity for the table(s)
time_partitioning = "DAY"

# The labels for the dataset being deployed
dataset_labels = {
  env      = "dev"
  billable = "true"
  owner    = "janesmith"
}

# List of the tables that you are deploying to the dataset
tables = [
  {
    table_id = "foo",
    schema = "sample_bq_schema.json",
    labels = {
      env = "dev"
      billable = "true"
      owner = "joedoe"
    },
  },
]

variables.tf

  variable "expiration" {
  description = "TTL of tables using the dataset in MS"
  default     = null
}

variable "project_id" {
  description = "Project wheree the dataset and table are created"
}

variable "time_partitioning" {
  description = "Configures time-based partitioning for this table"
}

variable "dataset_labels" {
  description = "A mapping of labels to assign to the table"
  type        = map(string)
}

variable "tables" {
  description = "A list of maps that includes both table_id and schema in each element, the table(s) will be created on the single dataset"
  default     = []
  type        = list(object({
    table_id  = string,
    schema    = string,
    labels    = map(string),
}))
}

7. Navigate to the correct directory.

  cd /bq-basic-deployment

8. Initialize the directory and plan.

  terraform init
terraform plan -var-file terraform.tfvars

9. Apply the changes.

  terraform apply -var-file terraform.tfvars

What’s next?

That’s it! You’ve used the BigQuery Terraform module to deploy your dataset and tables, and you’re now ready to load in your data for querying. We think this fills a critical gap in our Cloud Foundation Toolkit so you can easily stand up BigQuery with an open-source, extensible solution. Set it and forget it, or update it anytime you need to change your schema or modify your table structure. Once you’ve given it a shot, if you have any questions, give us feedback by opening an issue. Watch or star the module to stay on top of future releases and enjoy all your newfound free time (we hear BQML is pretty fun).
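
Once you’ve loaded a few rows, you can sanity-check the deployment with a quick query. This is a hypothetical example using the example-project project and the foo dataset and table from the configuration above:

  # Hypothetical sanity check against the example dataset/table created above
  bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS row_count FROM `example-project.foo.foo`'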
