Our new architecture has more components, but each one is much less complex. The components communicate through a queue, where each step of the pipeline reports its success status. Each sub-step takes less than 10 seconds and can quickly resume from the previous state with no data loss.
How do Preemptible VMs fit in this picture?
Using preemptible resources might seem like an odd choice for a mission-critical service, but because of our microservices design, we were able to use Preemptible VMs and GPUs without losing data or having to write elaborate retry code. Using Cloud Pub/Sub (see 2. above) allows us to store the state of the job in the queue itself. If a service is notified that a node has been preempted, it finishes the current task (which, by design, is always shorter than the 30-second notification time), and simply stops pulling new tasks. Individual services don’t have to do anything else to manage potential interruptions. When the node is available again, services begin pulling tasks from the queue again, starting where they left off.
This new design means that preemptible nodes can be added, taken away, or exchanged for regular nodes without causing any noticeable interruption.
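The drain-on-preemption pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: an in-memory `queue.Queue` stands in for Cloud Pub/Sub, a `threading.Event` stands in for the preemption notice, and `process`, `run_worker`, and `demo` are hypothetical names chosen for the example.

```python
import queue
import threading
from typing import Callable, List


def process(task: str) -> str:
    """Hypothetical stand-in for one short pipeline sub-step."""
    return f"{task}:done"


def run_worker(
    tasks: "queue.Queue[str]",
    preempted: threading.Event,
    handle: Callable[[str], str] = process,
) -> List[str]:
    """Pull and process tasks until preempted or the queue is empty."""
    completed: List[str] = []
    # The flag is checked only between tasks, so a task that has already
    # been pulled is always finished before the worker stops.
    while not preempted.is_set():
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break
        completed.append(handle(task))
    return completed


def demo() -> tuple:
    """Simulate one node being preempted and a replacement taking over."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for i in range(1, 6):
        tasks.put(f"track-{i}")

    preempted = threading.Event()

    def handle_with_notice(task: str) -> str:
        # Simulate the preemption notice arriving while track-2 is in flight.
        if task == "track-2":
            preempted.set()
        return process(task)

    first = run_worker(tasks, preempted, handle_with_notice)
    left = tasks.qsize()
    # A replacement node starts with a fresh flag and picks up the rest.
    second = run_worker(tasks, threading.Event())
    return first, left, second


if __name__ == "__main__":
    print(demo())
```

In this sketch the first worker finishes the task it was holding when the notice arrived and then stops pulling; the remaining tasks stay in the queue until the replacement worker picks them up. With a real Cloud Pub/Sub subscription, a message that is pulled but never acknowledged is redelivered, which gives a similar guarantee even if a task does not finish in time.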
GKE’s Cluster Autoscaler also works very well with preemptible instances. By combining its autoscaling features (which automatically replace nodes that have been reclaimed) with node labels, we were able to achieve an architecture with >99.9% availability that runs primarily on preemptible nodes.
Finally…
We did all this over the course of a month: one week for design and three weeks for implementation. Was it worth all this effort? Yes!
With these changes, we increased our throughput from 100,000 to 7 million tracks per week, at the same cost as before! This is a 70-fold gain in efficiency (!), or a 6,900% increase, and it was a crucial step in making our business profitable.
Our goal as a company is to transform the way the music industry handles data and volume, and to make it efficient. With nearly 15 million songs being added to the global pool each year, access and accessibility are the new trend. Thanks to our new microservices architecture and the speed and reliability of Google Cloud, we are on our way to making this a reality.
Learn more about GKE on the Google Cloud Platform website.