Who uses the MAW data and what do they use it for?
Yahoo executives, analysts, data scientists, and engineers all work with this data warehouse. Business users create and distribute Looker dashboards, analysts write SQL queries, data scientists perform predictive analytics, and data engineers manage the ETL pipelines. The fundamental questions to be answered and communicated generally include: How are Yahoo’s users engaging with the various products? Which products are working best for users? And how could we improve the products for a better user experience?
The Media Analytics Warehouse, and the analytics tools built on top of it, are used across different organizations in the company. Our editorial staff keeps an eye on article and video performance in real time, our business partnership team uses it to track live video shows from our partners, our product managers and statisticians use it for A/B testing and experimentation analytics to evaluate and improve product features, and our architects and site reliability engineers use it to track long-term trends in user latency metrics across native apps, web, and video. Use cases supported by this platform span almost all business areas in the company. In particular, we use analytics to discover trends in access patterns and in which partners are providing the most popular content, helping us assess our next investments. Since end-user experience is always critical to a media platform’s success, we continually track our latency, engagement, and churn metrics across all of our sites. Lastly, we assess which cohorts of users want which content by doing extensive analyses of clickstream user segmentation.
If this all sounds similar to the questions you ask of your data, read on. We’ll now get into the architecture of the products and technologies that allow us to serve our users and deliver these analytics at scale.
Identifying the problem with our old infrastructure
Rolling the clock back a few years, we encountered a big problem: we had too much data to process to meet our users’ expectations for reliability and timeliness. Our systems were fragmented, and the interactions between them were complex. This made reliability difficult to maintain and issues hard to track down during outages, which led to frustrated users, increasingly frequent escalations, and the occasional irate leader.
Managing massive-scale Hadoop clusters has always been Yahoo’s forte, so that was not an issue for us. Our massive-scale data pipelines process petabytes of data every day, and they worked just fine. That expertise and scale, however, were insufficient for our colleagues’ interactive analytics needs.
Deciding solution requirements for analytics needs
We sorted out the requirements of all our constituent users for a successful cloud solution. A disciplined tradeoff study of these various usage patterns led to four critical performance requirements:
Performance Requirements

- Loading data: load all of the previous day’s data by 9 am the next day. At forecasted volumes, this requires a capacity of more than 200 TB/day.
- Interactive query performance: 1 to 30 seconds for common queries.
- Daily-use dashboards: refresh in less than 30 seconds.
- Multi-week data: access and query in less than one minute.
The most critical criterion was that we would make these decisions based on user experience in a live environment, not on an isolated benchmark run by our engineers.
In addition to the performance requirements, we had several system requirements spanning the many concerns a modern data warehouse must accommodate: simplicity of architecture, scale, performance, reliability, interactive visualization, and cost.
System Requirements

Simplicity and architectural integrations
- ANSI SQL compliant
- No-op/serverless: the ability to add storage and compute without getting into cycles of determining the right server type, procuring, installing, launching, etc.
- Independent scaling of storage and compute

Reliability
- Reliability and availability: 99.9% monthly uptime

Scale
- Storage capacity: hundreds of PB
- Query capacity: an exabyte per month
- Concurrency: 100+ concurrent queries with graceful degradation and interactive response
- Streaming ingestion to support hundreds of TB/day

Visualization and interactivity
- Mature integration with BI tools
- Materialized views and query rewrite (see the sketch after this list)

Cost-efficient at scale
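To make the materialized views and query rewrite requirement concrete, here is a minimal sketch of the idea in BigQuery using the google-cloud-bigquery Python client. The dataset, table, and column names are placeholders, not our actual schema; the point is that once the view exists, BigQuery can transparently rewrite eligible aggregate queries against the base table to read the smaller pre-aggregated view instead.

```python
from google.cloud import bigquery

# Hedged sketch: the maw dataset, daily_events table, and the columns
# below are illustrative placeholders, not our actual schema.
client = bigquery.Client()

# Pre-aggregate the metrics that dashboards ask for most often.
client.query("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS maw.daily_engagement_mv AS
    SELECT event_date, product, COUNT(*) AS views
    FROM maw.daily_events
    GROUP BY event_date, product
""").result()

# Query rewrite: this query names the base table, but BigQuery can
# transparently serve eligible aggregations from the materialized view.
rows = client.query("""
    SELECT product, COUNT(*) AS views
    FROM maw.daily_events
    WHERE event_date >= '2021-01-01'
    GROUP BY product
""").result()
```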
Proof of concept: strategy, tactics, results
Strategically, we needed to prove to ourselves that our solution could meet the requirements described above at production scale. That meant we needed to use production data, and even production workflows, in our testing. To concentrate on our most critical use cases and user groups, we focused the proof-of-concept (POC) infrastructure on supporting dashboarding use cases. This allowed us to run multiple data warehouse (DW) backends, the old and the new, and dial traffic between them as needed. Effectively, this became our method of doing a staged rollout of the POC architecture to production: we could scale up traffic on the cloud data warehouse (CDW) and then cut over from the legacy system to the new one in real time, without needing to inform the users.
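As an illustration of that dial, here is a minimal sketch of weighted routing between the two backends. The class and client objects are hypothetical, not our production code; they just show how a dashboard layer could shift query traffic gradually from the legacy DW to the CDW.

```python
import random

# Hypothetical illustration of dialing traffic between warehouse
# backends; neither client object reflects our production code.
class DualBackendRouter:
    """Routes dashboard queries to the legacy DW or the new CDW.

    cdw_fraction is dialed up from 0.0 to 1.0 as confidence grows,
    letting the cutover happen without users noticing.
    """

    def __init__(self, legacy_client, cdw_client, cdw_fraction=0.0):
        self.legacy = legacy_client
        self.cdw = cdw_client
        self.cdw_fraction = cdw_fraction

    def run_query(self, sql):
        # Weighted coin flip decides which backend serves this query.
        backend = self.cdw if random.random() < self.cdw_fraction else self.legacy
        return backend.query(sql)

    def dial(self, fraction):
        # Staged rollout: e.g., 0.05 -> 0.25 -> 0.50 -> 1.0.
        self.cdw_fraction = min(max(fraction, 0.0), 1.0)
```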
Tactics: Selecting the contenders and scaling the data
Our initial approach to analytics on an external cloud was to move a three-petabyte subset of data. The dataset we selected also represented one complete business process, because we wanted to transparently switch a subset of our users to the new platform, and we did not want to struggle with managing multiple systems.
After an initial round of exclusions based on the system requirements, we narrowed the field to two cloud data warehouses and conducted our POC performance testing on BigQuery and an “Alternate Cloud” (AC). To scale the POC, we started by moving one fact table from MAW (note: we used a different dataset to test ingest performance; see below). Following that, we moved all the MAW summary data into both clouds. The plan was then to move three months of MAW data into the most successful cloud data warehouse, enabling all daily-use dashboards to run on the new system. That scope of data allowed us to evaluate all of the success criteria at the required scale of both data and users.
Performance testing results
Round 1: Ingest performance
The requirement was that the cloud load all of the daily data in time to meet the data-load service-level agreement (SLA) of “by 9 am the next day,” where a day is the local day for a specific time zone. Both clouds were able to meet this requirement.
Bulk ingest performance: Tie
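For context, here is a minimal sketch of what a daily bulk load into BigQuery can look like with the google-cloud-bigquery Python client. The bucket, project, dataset, and table names are placeholders, and our production pipelines are considerably more involved.

```python
from google.cloud import bigquery

# Minimal sketch of a daily bulk load; the bucket, project, dataset,
# and table names are placeholders, not our production identifiers.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Overwrite the day's partition if the load is re-run.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# "$20210211" is a partition decorator: the load (and the truncate)
# targets only that day's partition of the date-partitioned table.
load_job = client.load_table_from_uri(
    "gs://example-maw-staging/events/2021-02-11/*.avro",
    "example_project.maw.daily_events$20210211",
    job_config=job_config,
)
load_job.result()  # Blocks until the job finishes; raises on failure.
print(f"Loaded {load_job.output_rows} rows.")
```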
Round 2: Query performance
To get an apples-to-apples comparison, we followed best practices for BigQuery and AC to measure optimal performance on each platform. The charts below show the query response times for a test set of thousands of queries on each platform. This corpus of queries represents several different workloads on the MAW. BigQuery outperforms AC particularly strongly on very short and very complex queries. Nearly half (47%) of the queries tested on BigQuery finished in less than 10 seconds, compared to only 20% on AC. Even more starkly, only 5% of the thousands of queries tested took more than 3 minutes to run on BigQuery, whereas almost half (43%) of the queries tested on AC took 3 minutes or more to complete.
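As an aside, here is an illustrative sketch of how such a response-time distribution can be collected, assuming a corpus of test query strings; it is not the harness we used. The cache is disabled so repeated runs measure real execution time, and the buckets mirror the thresholds quoted above.

```python
import time
from google.cloud import bigquery

# Illustrative timing loop, not our actual benchmark harness: runs a
# corpus of test queries and buckets response times as quoted above.
client = bigquery.Client()

# Disable the query cache so repeated runs measure real execution time.
no_cache = bigquery.QueryJobConfig(use_query_cache=False)

def run_corpus(queries):
    durations = []
    for sql in queries:
        start = time.monotonic()
        client.query(sql, job_config=no_cache).result()  # wait for completion
        durations.append(time.monotonic() - start)
    n = len(durations)
    print(f"finished in under 10s: {sum(d < 10 for d in durations) / n:.0%}")
    print(f"took 3 minutes or more: {sum(d >= 180 for d in durations) / n:.0%}")
    return durations
```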