First Steps: API request management
BigQuery exposes an asynchronous API for executing queries: callers insert a query job request and then poll it until it completes, as we discussed a few weeks ago. The API is REST-based and accepts queries encoded as JSON.
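To make the insert-then-poll flow concrete, here is a minimal sketch using the google-cloud-bigquery Python client, which wraps the REST protocol described above. The explicit polling loop is illustrative; calling job.result() alone would also block until completion.

```python
import time

from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Insert a query job; the call returns quickly with a job reference.
job = client.query("SELECT 1 AS x")

# Poll the job until the backend reports it is complete.
while not job.done():
    time.sleep(1)

# Fetch the results of the completed job.
for row in job.result():
    print(row.x)
```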
Before a query can run, some API processing must occur. Among other things, the request must be authenticated and authorized, and associated metadata, such as the SQL statement, cloud project, and query parameters, must be built and tracked.
Decoding the query text: Lexing and parsing
Lexing and parsing are common tasks for programming languages, and SQL is no different. Lexing refers to the process of scanning an array of bytes (the raw SQL statement) and converting it into a series of tokens. Parsing is the process of consuming those tokens to build up a syntactic representation of the query that can be validated and understood by BigQuery's software architecture.
If you’re super interested in this, we recommend checking out the ZetaSQL project, which includes the open source reference implementation of the SQL engine used by BigQuery and other GCP projects.
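As a toy illustration of lexing (nothing like ZetaSQL's real grammar), the sketch below splits a SQL fragment into keyword, identifier, string, and punctuation tokens. A parser would then consume this token stream to build a syntax tree.

```python
import re

# Toy lexer: split a SQL fragment into (kind, text) tokens.
# Illustrative only; a real engine handles far more of the grammar.
TOKEN_RE = re.compile(
    r"(?P<KEYWORD>\bSELECT\b|\bFROM\b|\bWHERE\b|\bLIKE\b)"
    r"|(?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<STRING>\"[^\"]*\")"
    r"|(?P<PUNCT>[(),.*])",
    re.IGNORECASE,
)

def lex(sql):
    for match in TOKEN_RE.finditer(sql):
        yield match.lastgroup, match.group()

print(list(lex('SELECT COUNT(*) FROM trips WHERE name LIKE "%Broadway%"')))
```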
Referencing resources: Catalog resolution
SQL commonly contains references to entities retained by the BigQuery system – such as tables, views, stored procedures and functions. For BigQuery to process these references, it must resolve them into something more comprehensible. This stage helps the query processing system answer questions like:
- Is this a valid identifier? What does it reference?
- Is this entity a managed table, or a logical view?
- What's the SQL definition for this logical view?
- What columns and data types are present in this table?
- How do I read the data present in the table? Is there a set of URIs I should consume?
Resolutions are often interleaved through the parsing and planning phases of query execution.
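BigQuery's internal catalog resolution isn't exposed directly, but the metadata API lets a client ask similar questions about a referenced entity. A short sketch with the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Resolving a table reference answers the catalog questions above:
# does it exist, is it a table or a view, and what is its schema?
table = client.get_table("bigquery-public-data.new_york.citibike_trips")

print(table.table_type)  # "TABLE" for a managed table, "VIEW" for a logical view
for field in table.schema:
    print(field.name, field.field_type)
```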
Building a blueprint: Query planning
As a more fully formed picture of the request is exposed via parsing and resolution, a query plan begins to emerge. Many techniques exist to refactor and improve a query plan to make it faster and more efficient. Algebraization, for example, converts the parse tree into a form that makes it possible to refactor and simplify subqueries. Other techniques can be used to optimize things further, moving tasks like pruning data closer to data reads (reducing the overall work of the system).
Another element of planning is adapting the plan to run as a set of distributed execution tasks. As we mentioned at the beginning of this post, BigQuery leverages large pools of query computation nodes, or workers. So, it must coordinate how different stages of the query plan share data through reading and writing from storage, and how to stage temporary data within the shuffle system.
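As a toy sketch of one such rewrite (pushing pruning closer to the data read, as described above), the snippet below moves a filter beneath a projection. This is not BigQuery's planner; nested tuples simply stand in for a plan tree.

```python
# Toy plan rewrite: push a filter below a projection so pruning happens
# next to the scan. Plans are tuples: (op, child, arg).
def push_filter_down(plan):
    if plan[0] == "FILTER" and plan[1][0] == "PROJECT":
        _, project, predicate = plan
        _, child, columns = project
        # FILTER(PROJECT(child)) -> PROJECT(FILTER(child))
        return ("PROJECT", push_filter_down(("FILTER", child, predicate)), columns)
    return plan

plan = ("FILTER",
        ("PROJECT", ("SCAN", "citibike_trips"), ["start_station_name"]),
        'start_station_name LIKE "%Broadway%"')
print(push_filter_down(plan))
```

Note the rewrite is only valid when the filter's columns survive the projection, as they do here; a real optimizer checks this before reordering.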
Doing the work: Query execution
Query execution is simply the process of working through the query stages in the execution graph, towards completion. A query stage may have a single unit of work, or it may be represented by many thousands of units of work, like when a query stage reads all the data in a large table composed of many separate columnar input files.
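Conceptually, such a stage behaves like a parallel map over its input files. The sketch below models this with a thread pool; the file names and count_matching_rows() are hypothetical placeholders, not BigQuery internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of one query stage: one unit of work per columnar input file.
input_files = [f"part-{i:05d}" for i in range(1000)]

def count_matching_rows(file_name):
    # A real worker would read the file and apply the query's filter.
    return 1  # placeholder partial count

with ThreadPoolExecutor(max_workers=32) as pool:
    partial_counts = list(pool.map(count_matching_rows, input_files))

# A later stage would sum these partial counts via the shuffle.
print(sum(partial_counts))
```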
Query management: scheduling and dynamic planning
Besides the workers that perform the work of the query plan itself, additional workers monitor and direct the overall progress of work throughout the system. Scheduling is concerned with how aggressively work is queued, executed and completed.
However, an interesting property of the BigQuery query engine is that it has dynamic planning capabilities. A query plan often contains ambiguities, and as a query progresses it may need further adjustment to ensure success. Repartitioning data as it flows through the system is one example of a plan adaptation that may be added, as it helps ensure that data is properly balanced and sized for subsequent stages to consume.
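A toy sketch of the repartitioning idea: if one shuffle partition grows much larger than its peers, split it so downstream stages consume balanced, similarly sized inputs. The sizes and structure here are invented purely for illustration.

```python
# Split any oversized partition into balanced chunks for the next stage.
MAX_ROWS_PER_PARTITION = 2

def repartition(partitions):
    balanced = []
    for rows in partitions:
        for i in range(0, len(rows), MAX_ROWS_PER_PARTITION):
            balanced.append(rows[i:i + MAX_ROWS_PER_PARTITION])
    return balanced

print(repartition([[1, 2, 3, 4, 5], [6]]))
# [[1, 2], [3, 4], [5], [6]]
```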
Finishing up: finalizing results
As a query completes, it often yields output artifacts in the form of results, or changes to tables within the system. Finalizing results includes the work to commit these changes back to the storage layer. BigQuery also needs to communicate back to you, the user, that the system is done processing the query. The metadata around the query is updated to note the work is done, or the error stream is attached to indicate where things went wrong.
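From the caller's perspective, finalization shows up as updated job metadata. A small sketch with the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT 1")
job.result()  # blocks until the job finalizes; raises if the query failed

# Finalization is visible as updated job metadata.
print(job.state)         # "DONE"
print(job.error_result)  # None when the job succeeded
```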
Armed with our new understanding of the life of a query, we can dive more deeply into query plans. First, let's look at a simple plan. Here, we are running a query against a public BigQuery dataset to count the total number of Citi Bike trips that began at stations with "Broadway" in the name.
```sql
SELECT COUNT(*)
FROM `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_name LIKE "%Broadway%"
```
Now let’s consider what is happening behind the scenes when BigQuery processes this query.
- First, a set of workers accesses distributed storage to read the table, filter the data, and generate partial counts. These workers then send their counts to the shuffle.
- The second stage reads those shuffle records as its input and sums them together. It then writes the output into a single file, which becomes accessible as the result of the query.