Leader election is a commonly applied pattern for implementing distributed systems. For example, replicated relational databases such as MySQL, or distributed key-value stores such as Apache Zookeeper, choose a leader (sometimes referred to as master) among the replicas. All write operations go through the leader, so only a single node is writing to the system at any time. This is done to ensure no writes are lost and the database is not corrupted.
It can be challenging to choose a leader among the nodes of a distributed system due to the nature of networked systems and time synchronization. In this article, we’ll discuss why you need leader election (or more generally, “distributed locks”), explain why they are difficult to implement, and provide an example implementation that uses a strongly consistent storage system, in this case Google Cloud Storage.
Imagine a multithreaded program where each thread is interacting with a shared variable or data structure. To prevent data loss or corrupting the data structure, multiple threads should block and wait on each other while modifying the state. We ensure this with mutexes in a single-process application. Distributed locks are no different in this regard than mutexes in single-process systems.
A distributed system working on shared data still needs a locking mechanism to safely take turns while modifying shared data. However, we no longer have the notion of mutexes while working in a distributed environment. This is where distributed locks and leader elections come into the picture.
Typically leader election is used to ensure exclusive access by a single node to shared data, or to ensure a single node coordinates the work in a system.
For replicated database systems such as MySQL, Apache Zookeeper, or Cassandra, we need to make sure only one “leader” exists at any given time. All writes go through this leader to ensure writes happen in one place. Meanwhile, the reads can be served from the follower nodes.
Here’s another example. You have three nodes for an application that consumes messages from a message queue; however, only one of these nodes is to process messages at any time. By choosing a leader, you can appoint a node to fulfill that responsibility. If the leader becomes unavailable, other nodes can take over and continue the work. In this case, a leader election is needed to coordinate the work.
Many distributed systems take advantage of leader election or distributed lock patterns. However, choosing a leader is a nontrivial problem.
Distributed systems are like threads of a single-process program, except they are on different machines and they talk to each other over the network (which can be unreliable). As a result, they cannot rely on mutexes or similar locking mechanisms that use atomic CPU instructions and shared memory to implement the lock.
The distributed locking problem requires the participants to agree on who is holding the lock. We also expect a leader to be elected while some nodes in the system are unavailable. This may sound simple, but implementing such a system correctly can be quite difficult, in part due to the many edge cases. This is where distributed consensus algorithms come into the picture.
To implement distributed locking, you need a strongly consistent system to decide which node holds the lock. Because this must be an atomic operation, it requires consensus protocols such as Paxos, Raft, or the two-phase commit protocol. However, implementing these algorithms correctly is quite difficult, as the implementations must be extensively tested and formally proved. Furthermore, the theoretical properties of these algorithms often fail to withstand real-world conditions, which has led to more advanced research on the topic.
At Google, we achieve distributed locking using a service called Chubby. Across our stack, Chubby helps many teams at Google make use of distributed consensus without having to worry about implementing a locking service from scratch (and doing so correctly).
Instead of implementing your own consensus protocol, you can easily take advantage of a strongly consistent storage system that provides the same guarantees through a single key or record. By delegating the responsibility for atomicity to an external storage system, we no longer need the participating nodes to form a quorum and vote on a new leader.
For example, a distributed database record (or file) can be used to name the current leader, and when the leader has renewed its leadership lock. If there’s no leader in the record, or the leader has not renewed its lock, other nodes can run for election by attempting to write their name to the record. First one to come will win, because this record or file allows atomic writes.
Such atomic writes on files or database records are typically implemented using optimistic concurrency control, which lets you atomically update the record by providing its version number (if the record has changed since then, the write will be rejected). Similarly, the writes become immediately available to any readers. Using these two primitives (atomic updates and consistent reads), we can implement a leader election on top of any storage system.
In fact, many Google Cloud storage products, such as Cloud Storage and Cloud Spanner, can be used to implement such a distributed lock. Similarly, open source storage systems like Zookeeper (Paxos), etcd (Raft), Consul (Raft), or even properly configured RDBMS systems like MySQL or PostgreSQL can provide the needed primitives.
We can implement leader election using a single object (file) on Cloud Storage that contains the leader data, and require each node to read that file, or run for election based on the file. In this setup, the leader must renew its leadership by updating this file with its heartbeat.
My colleague Seth Vargo published such a leader election implementation – written in Go and using Cloud Storage – as a package within the HashiCorp Vault project. (Vault also has a leader election on top of other storage backends).
To implement leader election among distributed nodes of our application in Go, we can write a program that makes use of this package in just 50 lines of code: