Who uses the MAW data and what do they use it for?
Yahoo executives, analysts, data scientists, and engineers all work with this data warehouse. Business users create and distribute Looker dashboards, analysts write SQL queries, data scientists perform predictive analytics, and data engineers manage the ETL pipelines. The fundamental questions to be answered and communicated generally include: How are Yahoo’s users engaging with the various products? Which products are working best for users? And how could we improve the products for a better user experience?
The Media Analytics Warehouse, and the analytics tools built on top of it, are used across different organizations in the company. Our editorial staff keeps an eye on article and video performance in real time, our business partnership team uses it to track live video shows from our partners, our product managers and statisticians use it for A/B testing and experimentation analytics to evaluate and improve product features, and our architects and site reliability engineers use it to track long-term trends in user latency metrics across native apps, web, and video. Use cases supported by this platform span almost all business areas in the company. In particular, we use analytics to discover trends in access patterns and in which partners are providing the most popular content, helping us assess our next investments. Since end-user experience is always critical to a media platform’s success, we continually track our latency, engagement, and churn metrics across all of our sites. Lastly, we assess which cohorts of users want which content by doing extensive analyses of clickstream user segmentation.
If this all sounds similar to the questions you ask of your data, read on. We’ll now get into the architecture of the products and technologies that allow us to serve our users and deliver these analytics at scale.
Identifying the problem with our old infrastructure
Rolling the clock back a few years, we encountered a big problem: we had too much data to process to meet our users’ expectations for reliability and timeliness. Our systems were fragmented, and the interactions between them were complex. This made reliability difficult to maintain and issues hard to track down during outages, which led to frustrated users, increasingly frequent escalations, and the occasional irate leader.
Managing massive-scale Hadoop clusters has always been Yahoo’s forte, so that was not an issue for us. Our massive-scale data pipelines process petabytes of data every day, and they worked just fine. That expertise and scale, however, were insufficient for our colleagues’ interactive analytics needs.
Deciding solution requirements for analytics needs
We sorted out the requirements of all our constituent users for a successful cloud solution. A disciplined tradeoff study of these various usage patterns led to four critical performance requirements:
Performance Requirements

- Loading data: load all of the previous day’s data by 9 am the next day. At forecasted volumes, this requires a capacity of more than 200 TB/day.
- Interactive query performance: 1 to 30 seconds for common queries.
- Daily-use dashboards: refresh in less than 30 seconds.
- Multi-week data: access and query in less than one minute.
The most critical criterion was that we would make these decisions based on user experience in a live environment, not on an isolated benchmark run by our engineers.
In addition to the performance requirements, we had several system requirements spanning the many concerns a modern data warehouse must accommodate: simplicity of architecture, scale, performance, reliability, interactive visualization, and cost.
System Requirements

Simplicity and architectural integrations
- ANSI SQL compliant
- No-op/serverless: the ability to add storage and compute without getting into cycles of determining the right server type, procuring, installing, launching, etc.
- Independent scaling of storage and compute

Reliability
- Reliability and availability: 99.9% monthly uptime

Scale
- Storage capacity: hundreds of PB
- Query capacity: an exabyte per month
- Concurrency: 100+ concurrent queries with graceful degradation and interactive response
- Streaming ingestion to support hundreds of TB/day

Visualization and interactivity
- Mature integration with BI tools
- Materialized views and query rewrite (see the sketch after this list)

Cost-efficient at scale
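To make the materialized views and query rewrite requirement concrete, here is a minimal sketch of the idea in BigQuery using the google-cloud-bigquery Python client. The dataset, table, and column names are placeholders, not our actual schema; the point is that once the view exists, BigQuery can transparently rewrite eligible aggregate queries against the base table to read the smaller pre-aggregated view instead.

```python
from google.cloud import bigquery

# Hedged sketch: the maw dataset, daily_events table, and the columns
# below are illustrative placeholders, not our actual schema.
client = bigquery.Client()

# Pre-aggregate the metrics that dashboards ask for most often.
client.query("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS maw.daily_engagement_mv AS
    SELECT event_date, product, COUNT(*) AS views
    FROM maw.daily_events
    GROUP BY event_date, product
""").result()

# Query rewrite: this query names the base table, but BigQuery can
# transparently serve eligible aggregations from the materialized view.
rows = client.query("""
    SELECT product, COUNT(*) AS views
    FROM maw.daily_events
    WHERE event_date >= '2021-01-01'
    GROUP BY product
""").result()
```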
Proof of concept: strategy, tactics, results
Strategically, we needed to prove to ourselves that our solution could meet the requirements described above at production scale. That meant we needed to use production data, and even production workflows, in our testing. To concentrate on our most critical use cases and user groups, we focused the proof-of-concept (POC) infrastructure on supporting dashboarding use cases. This allowed us to run multiple data warehouse (DW) backends, the old and the new, and dial traffic between them as needed. Effectively, this became our method of doing a staged rollout of the POC architecture to production: we could scale up traffic on the cloud data warehouse (CDW) and then cut over from the legacy system to the new one in real time, without needing to inform the users.
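As an illustration of that dial, here is a minimal sketch of weighted routing between the two backends. The class and client objects are hypothetical, not our production code; they just show how a dashboard layer could shift query traffic gradually from the legacy DW to the CDW.

```python
import random

# Hypothetical illustration of dialing traffic between warehouse
# backends; neither client object reflects our production code.
class DualBackendRouter:
    """Routes dashboard queries to the legacy DW or the new CDW.

    cdw_fraction is dialed up from 0.0 to 1.0 as confidence grows,
    letting the cutover happen without users noticing.
    """

    def __init__(self, legacy_client, cdw_client, cdw_fraction=0.0):
        self.legacy = legacy_client
        self.cdw = cdw_client
        self.cdw_fraction = cdw_fraction

    def run_query(self, sql):
        # Weighted coin flip decides which backend serves this query.
        backend = self.cdw if random.random() < self.cdw_fraction else self.legacy
        return backend.query(sql)

    def dial(self, fraction):
        # Staged rollout: e.g., 0.05 -> 0.25 -> 0.50 -> 1.0.
        self.cdw_fraction = min(max(fraction, 0.0), 1.0)
```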
Tactics: Selecting the contenders and scaling the data
Our initial approach to analytics on an external cloud was to move a three-petabyte subset of data. The dataset we selected also represented one complete business process, because we wanted to transparently switch a subset of our users to the new platform, and we did not want to struggle with managing multiple systems.
After an initial round of exclusions based on the system requirements, we narrowed the field to two cloud data warehouses and conducted our POC performance testing on BigQuery and an “Alternate Cloud” (AC). To scale the POC, we started by moving one fact table from MAW (note: we used a different dataset to test ingest performance; see below). Following that, we moved all the MAW summary data into both clouds. The plan was then to move three months of MAW data into the most successful cloud data warehouse, enabling all daily-use dashboards to run on the new system. That scope of data allowed us to evaluate all of the success criteria at the required scale of both data and users.
Performance testing results
Round 1: Ingest performance
The requirement was that the cloud load all of the daily data in time to meet the data-load service-level agreement (SLA) of “by 9 am the next day,” where a day is the local day for a specific time zone. Both clouds were able to meet this requirement.
Bulk ingest performance: Tie
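For context, here is a minimal sketch of what a daily bulk load into BigQuery can look like with the google-cloud-bigquery Python client. The bucket, project, dataset, and table names are placeholders, and our production pipelines are considerably more involved.

```python
from google.cloud import bigquery

# Minimal sketch of a daily bulk load; the bucket, project, dataset,
# and table names are placeholders, not our production identifiers.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # Overwrite the day's partition if the load is re-run.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# "$20210211" is a partition decorator: the load (and the truncate)
# targets only that day's partition of the date-partitioned table.
load_job = client.load_table_from_uri(
    "gs://example-maw-staging/events/2021-02-11/*.avro",
    "example_project.maw.daily_events$20210211",
    job_config=job_config,
)
load_job.result()  # Blocks until the job finishes; raises on failure.
print(f"Loaded {load_job.output_rows} rows.")
```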
Round 2: Query performance
To get an apples-to-apples comparison, we followed best practices for BigQuery and AC to measure optimal performance on each platform. The charts below show the query response times for a test set of thousands of queries on each platform. This corpus of queries represents several different workloads on the MAW. BigQuery outperforms AC particularly strongly on very short and very complex queries. Nearly half (47%) of the queries tested on BigQuery finished in less than 10 seconds, compared to only 20% on AC. Even more starkly, only 5% of the thousands of queries tested took more than 3 minutes to run on BigQuery, whereas almost half (43%) of the queries tested on AC took 3 minutes or more to complete.
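As an aside, here is an illustrative sketch of how such a response-time distribution can be collected, assuming a corpus of test query strings; it is not the harness we used. The cache is disabled so repeated runs measure real execution time, and the buckets mirror the thresholds quoted above.

```python
import time
from google.cloud import bigquery

# Illustrative timing loop, not our actual benchmark harness: runs a
# corpus of test queries and buckets response times as quoted above.
client = bigquery.Client()

# Disable the query cache so repeated runs measure real execution time.
no_cache = bigquery.QueryJobConfig(use_query_cache=False)

def run_corpus(queries):
    durations = []
    for sql in queries:
        start = time.monotonic()
        client.query(sql, job_config=no_cache).result()  # wait for completion
        durations.append(time.monotonic() - start)
    n = len(durations)
    print(f"finished in under 10s: {sum(d < 10 for d in durations) / n:.0%}")
    print(f"took 3 minutes or more: {sum(d >= 180 for d in durations) / n:.0%}")
    return durations
```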