Where does our data come from and how do we use it?
Providing threat protection requires a giant feedback loop. As we detect and block cyberattacks in the field, those systems send us telemetry and samples: the type of threats, where they came from, and the damage they tried to cause. We sift through that telemetry to decide what’s bad, what’s good, what needs to be blocked, and which websites are safe or unsafe, then convert those verdicts into new protections that are pushed back out to our customers. And the cycle repeats.
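To make the shape of that loop concrete, here is a minimal sketch in Python. Everything in it is a hypothetical placeholder, not our actual pipeline: the Sample fields, the score threshold, and the triage function are illustrations of the automated first pass, nothing more.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"

@dataclass
class Sample:
    threat_type: str   # e.g. "ransomware" (placeholder fields)
    origin: str        # where the telemetry came from
    score: float       # e.g. output of an ML model, 0.0 to 1.0

def triage(samples: list[Sample], threshold: float = 0.8) -> dict[Verdict, list[Sample]]:
    """Automated first pass: ML scores decide most verdicts so analysts
    can focus on the newest and most dangerous threats."""
    verdicts: dict[Verdict, list[Sample]] = {Verdict.GOOD: [], Verdict.BAD: []}
    for s in samples:
        verdicts[Verdict.BAD if s.score >= threshold else Verdict.GOOD].append(s)
    return verdicts

# Verdicts would then be compiled into new protections and pushed
# back out to the fielded security engines, closing the loop.
batch = [Sample("ransomware", "email", 0.93), Sample("clean-app", "download", 0.12)]
print({v.name: len(s) for v, s in triage(batch).items()})
```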
In the early days, this was all done by people: experts mailing floppy disks around. But these days the number of threats and the amount of data are so overwhelming that we must use machine learning (ML) and automation to handle the vast majority of the analysis. This frees our people to focus on the newest and most dangerous threats. These new technologies are then introduced into the field to continue the cycle.
Shortcomings of the legacy data platform
Our legacy data platform had evolved from an on-prem solution and was built as a single, massive, relatively inflexible multi-tenant system. It worked well as long as a big infrastructure team maintained it, but it failed to take advantage of many capabilities built into Google Cloud. The design also imposed a number of limitations and even encouraged bad habits in our application teams. Accountability was challenging, changes and upgrades were painful, and performance ultimately suffered. We’d built the ecosystem on top of a specific vendor’s Apache Hadoop stack, so we were always limited by their point of view and had to coordinate every upgrade cycle across our entire user base.
Our data platform needed a transformation. We wanted to move away from a centralized platform to a cloud-based data lake that was decentralized, easy to operate, and cost-effective. We also wanted to implement a number of architectural transformations like Infrastructure as Code (IaC) and containerization.
“Divide and Conquer” with ephemeral clusters
When we built our data platform on Google Cloud, we went from a big, centrally managed, multi-tenant Hadoop cluster to running most of our applications on smaller, ephemeral Dataproc clusters. We realized that most of our applications follow the same execution pattern: they wake up periodically, operate on the most recent telemetry for a certain time window, and generate analytical results that are either consumed by other applications or pushed directly to our security engines in the field. The new design obviated the need to centrally plan the collective capacity of a common cluster by guessing individual application requirements. It also meant that application developers were free to choose their compute, storage, and software stack within the platform as they saw fit, a clear win for both sides.
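As an illustration, an ephemeral-cluster job of this kind can be expressed as a Cloud Composer (Airflow) DAG that creates a cluster, runs the job, and tears the cluster down. The project, bucket, machine types, and schedule below are placeholder values, not our production configuration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "my-project"                    # placeholder
REGION = "us-central1"                       # placeholder
CLUSTER_NAME = "telemetry-{{ ds_nodash }}"   # one short-lived cluster per run

with DAG(
    dag_id="ephemeral_dataproc_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(hours=6),    # wake up periodically
    catchup=False,
) as dag:
    # Each application sizes its own cluster; no shared capacity planning.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    # Process the most recent telemetry window.
    run_job = DataprocSubmitJobOperator(
        task_id="run_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/analyze_window.py"},
        },
    )

    # Tear the cluster down even if the job fails: compute is ephemeral.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule="all_done",
    )

    create_cluster >> run_job >> delete_cluster
```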
After the migration, we also switched to using Google Cloud and open-source solutions in our stack. The decentralized, cloud-based architecture of our data lake provides users with access to shared data in Cloud Storage, metadata services via a shared Hive Metastore, job orchestration via Cloud Composer, and authorization via IAM and Apache Ranger. A few use cases employ Cloud SQL and Bigtable; our critical systems based on HBase were easy to migrate to Bigtable, and performance has improved while the systems have become much easier to maintain. For containerized workloads we use Google Kubernetes Engine (GKE), and to store our secrets we use Secret Manager. Some of our team members also use Cloud Scheduler and Cloud Functions.
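Bigtable also offers an HBase-compatible Java client, which is one reason such migrations tend to be straightforward. As a hedged sketch of how little code direct access takes, here is a minimal read/write in Python; the project, instance, table, and column-family names are placeholders, not our schema:

```python
from google.cloud import bigtable

# Placeholder names; not our production instance or schema.
client = bigtable.Client(project="my-project", admin=False)
table = client.instance("threat-telemetry").table("samples")

# Write one cell, keyed the way an HBase row might be.
row = table.direct_row(b"sample#20210413#0001")
row.set_cell("stats", "verdict", b"bad")
row.commit()

# Read it back.
cells = table.read_row(b"sample#20210413#0001").cells
print(cells["stats"][b"verdict"][0].value)  # b'bad'
```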
Teaming up for speed
STAR has a large footprint on Google Cloud, with a diverse set of applications and over 200 data analytics team members. We needed a partner with an in-depth understanding of the technology stack and our security requirements. Google Cloud’s support accelerated what would otherwise have been a slow migration. Right from the start of our migration project, their professional services organization (PSO) team worked like an extension of our core team, participating in our daily stand-ups and providing the necessary support. The Google Cloud PSO team also helped us quickly and securely set up IaC. Some of our bleeding-edge requirements even made their way onto Google Cloud’s own roadmap, so it was a true partnership.
Previously, it took almost an entire year to coordinate just a single major-version upgrade of our data lake. With this Google Cloud transformation we can do much more in the same amount of time: it took about a year not only to move and re-architect the data lake and its applications, but also to migrate and optimize dozens of other similarly complex, mission-critical backend systems. It was a massive effort, but overall it went smoothly, and the Google Cloud team was there to work with us on any specific obstacles.
A cloud data lake for smoother sailing
Moving the data lake from this monolithic implementation to Google Cloud allowed the team to deliver a platform focused entirely on enabling app teams to do their jobs. This gives our engineers more flexibility in how they develop their systems while providing cost accountability, allowing app-specific performance optimization, and completely eliminating resource contention between teams.
Distributed control lets teams do more and make their own decisions, and it has proven to be much more cost-effective. Because users run their own persistent or ephemeral clusters, their compute resources are decoupled from the core data platform’s, and users can scale on their own. The same applies to user-specific storage needs.
We now also have portability across cloud providers to avoid vendor lock-in, and we like the flexibility and availability of Google Cloud-specific operators in Composer, which allow us to submit and run jobs on Dataproc or on an external GKE cluster.
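For the GKE side of that, Composer’s GKE operators can launch a containerized task on an existing cluster from within a DAG. This is a sketch under assumed placeholder names (project, cluster, image), not our production setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKEStartPodOperator,
)

with DAG(
    dag_id="gke_job_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a containerized workload on an external GKE cluster
    # instead of spinning up a Dataproc cluster.
    run_on_gke = GKEStartPodOperator(
        task_id="run_on_gke",
        project_id="my-project",           # placeholder
        location="us-central1-a",          # placeholder
        cluster_name="analytics-cluster",  # placeholder
        namespace="default",
        name="telemetry-task",
        image="gcr.io/my-project/telemetry-task:latest",  # placeholder
    )
```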
We’re at a great place after our migration. Processes are stable and our data lake customers are happy. Application owners can self-manage their systems, and the scaling problems we used to face are gone. On top of these benefits, we’re now taking a post-migration pass at our processes to optimize some of our costs.
With our new data lake built on Google Cloud, we’re excited about the opportunities that have opened up for us. Now we don’t need to spend a lot of time on managing our data and can devote more of our resources to innovation.
Learn more about Broadcom. Or check out our recent blog exploring how to migrate Apache Hadoop to Dataproc.