Skip the maintenance, speed up queries with BigQuery’s clustering

BFloat16: The secret to high performance on Cloud TPUs
August 23, 2019
The Good, the Bad and the Ugly in Cybersecurity – Week 34
August 23, 2019
BFloat16: The secret to high performance on Cloud TPUs
August 23, 2019
The Good, the Bad and the Ugly in Cybersecurity – Week 34
August 23, 2019

BigQuery is Google Cloud’s serverless data warehouse, automating much of the toil and complexity associated with setting up and managing an enterprise-grade data warehouse. If there’s a maintenance task to be done, BigQuery’s philosophy is to take care of it for our users autonomously.

To get the most out of BigQuery, one of our key best practices is table partitioning and clustering. In this blog post, we’ll explain the internal workings of clustered tables in BigQuery. We’ll also cover automatic re-clustering, which fixes a classic data warehousing concern in a completely transparent way. It does this as a background process unique to BigQuery that continuously optimizes clustered tables for best performance.

You’ll find partitioning and clustering vastly improve the cost and performance profiles of your workloads. In addition, automatic re-clustering seamlessly provides a practical benefit: making your clustered tables resilient to real-world demands such as continuous inserts. One less thing to worry about!

Within a partitioned table, individual partitions behave as independent tables–one per partition. As such, the behavior of clustering for each partition of a partitioned table automatically extends to the clustering of non-partitioned tables.

Clustering is supported on primitive non-repeated top-level columns, such as INT64, BOOL, NUMERIC, STRING, DATE, GEOGRAPHY, and TIMESTAMP.

In general, there are two typical usage patterns for clustering within a data warehouse:

  • Clustering on columns that have a very high number of distinct values, like userId or transactionId.

  • Clustering on multiple columns that are frequently used together. When clustering by multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data. You can filter by any prefix of the clustering columns and get the benefits of clustering, like regionId, shopId and productId together; or regionId and shopId; or just regionId.

Data in a BigQuery table is stored in Capacitor format blocks. This means that table clustering defines a “weak” sort order on these blocks. In other words, BigQuery attempts to distribute the data such that the blocks store non-overlapping ranges of values for the clustering keys. BigQuery automatically determines the boundaries of these new blocks within each partition as it is written to a table.

You can see below the layout of data in a table partitioned on the eventDate date column and clustered on the eventId clustering column:

Leave a Reply

Your email address will not be published. Required fields are marked *