March 4, 2020

Editor's note: We're hearing today from Discord, maker of a popular voice, video, and text chat app for gaming. They have to bring a great experience to millions of customers concurrently, and keep up with demand. Here's how they moved from Redshift to Google Cloud's BigQuery to support their growth.
At Discord, our chat app supports more than 50 million monthly users. We had been using Amazon Redshift as our data warehouse solution for several years, but due to both technical and business reasons, we migrated completely to BigQuery. Since migrating, we've been able to serve users faster, incorporate AI and ML capabilities, and ensure compliance.
The challenges that led us to migrate
Our team here at Discord began to consider alternative solutions once we realized we were encountering technical and cost limitations on Redshift. We knew that if we wanted our data warehouse to scale with our business, we had to find a new solution. On the technical side, we realized we were going to hit the maximum cluster size (128 compute nodes) for DC2 type nodes in six months, given our growing usage patterns. The cost for using Redshift was also becoming a challenge. We had been paying hundreds of thousands of dollars a month, not including storage and the cost of network ingress/egress between Google Cloud and AWS. (We’d been using Google Cloud for our chat application already.)
We looked at some Google Cloud-native solutions and identified that BigQuery would be a natural solution for us, given its large scale (with known customers that were larger than Discord), proximity to where our data resides, and the fact that Google Cloud already had pipelines in place for loading data. Another major reason for our choice of BigQuery was that it is completely serverless, so it wouldn’t require any upfront hardware provisioning and management. We were also able to take advantage of a brand-new feature called BigQuery Reservations to gain significant savings with fixed slot usage.
Migration tradeoffs and challenges
We had some preparation to do ahead of and during the migration. One initial challenge was that while both Redshift and BigQuery are designed to handle analytical workloads, they are very different.
As an example, in Redshift we had a denormalized set of tables where each of our application events ended up in its own table, and most of our analytics queries needed to join those tables together. Running an analytics query on user retention involved analyzing data across different events and tables, and this kind of JOIN-heavy workload performed differently out of the box. Previously we had relied on ORDER BY and ROW_NUMBER() over large swaths of data, but BigQuery supports that method only with limitations. Redshift and BigQuery also partition data differently, so joining on something like user ID isn't as fast, because the data layout is different. We therefore used timestamp partitioning and clustering on JOIN fields, which increased performance in BigQuery. Other aspects of BigQuery brought significant advantages right away, making the migration worthwhile: ease of management (one provider instead of multiple, no maintenance windows, no VACUUM/ANALYZE), scalability, and price for performance.
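As a rough sketch of what that looks like (the dataset, table, and column names here are hypothetical, not Discord's actual schema), a BigQuery events table partitioned by event timestamp and clustered on the user ID used in JOINs could be declared like this:

```sql
-- Hypothetical events table: partitioned by day on the event
-- timestamp, and clustered on user_id so that JOINs and filters
-- on user_id scan less data within each partition.
CREATE TABLE analytics.message_events (
  user_id    INT64,
  event_name STRING,
  event_ts   TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(event_ts, DAY)
CLUSTER BY user_id;
```

Queries that filter on event_ts can then prune whole partitions, and clustering keeps rows with the same user_id physically close together, which helps the JOIN-heavy retention queries described above.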
There were some other considerations we took into account when undertaking this migration. We had to convert more than a hundred thousand lines of SQL into BigQuery syntax, so we used the ZetaSQL library and a PostgreSQL parser to implement a conversion tool. To do this, we forked an open source parser and modified its grammar so it could parse all of our existing Redshift SQL. Building this was a non-trivial part of the migration. The tool walks an abstract syntax tree (also known as a parse tree) built from templated Redshift SQL and outputs the equivalent templated SQL for BigQuery. In addition, we re-architected the way we built our pre-aggregated views of data to support BigQuery. Moving to a fixed-slot model using BigQuery Reservations allowed for workload isolation, consistent performance, and predictable costs. The last step was getting used to the new paradigm after the migration and educating stakeholders on the new operating model.
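The real tool described above is built on ZetaSQL and a forked PostgreSQL parser, but the core idea of walking a parse tree and rewriting dialect-specific nodes can be illustrated with a much-simplified sketch. Everything here is hypothetical: a toy AST with only function calls and column references, and a tiny rule table mapping a couple of Redshift functions to their BigQuery equivalents.

```python
# Toy sketch of a SQL dialect translator. A real converter parses full
# SQL into an abstract syntax tree; here we model just two node types
# and rewrite function names via a small rule table.
from dataclasses import dataclass, field

@dataclass
class Func:                       # a function-call node in the tree
    name: str
    args: list = field(default_factory=list)

@dataclass
class Col:                        # a column reference (leaf node)
    name: str

# Hypothetical rewrite rules: Redshift function -> BigQuery equivalent.
REWRITES = {
    "GETDATE": "CURRENT_TIMESTAMP",
    "NVL": "IFNULL",
}

def to_bigquery(node):
    """Walk the tree, renaming functions according to the rule table."""
    if isinstance(node, Func):
        new_name = REWRITES.get(node.name, node.name)
        return Func(new_name, [to_bigquery(a) for a in node.args])
    return node                   # leaves pass through unchanged

def render(node):
    """Emit SQL text from the (rewritten) tree."""
    if isinstance(node, Func):
        return f"{node.name}({', '.join(render(a) for a in node.args)})"
    return node.name

# Redshift: NVL(last_seen, GETDATE())
tree = Func("NVL", [Col("last_seen"), Func("GETDATE", [])])
print(render(to_bigquery(tree)))  # IFNULL(last_seen, CURRENT_TIMESTAMP())
```

A production tool additionally has to handle templating, operator differences, type casts, and the long tail of dialect quirks, which is why it was a non-trivial part of the migration, but the rewrite-while-walking structure is the same.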