Here is a recap of what we covered in the last blog:
If there was no need to worry about a majority quorum, we would have the flexibility to deploy any number of nodes we require. We can designate any subset of those nodes to be eligible leaders, and we can make durability decisions without being influenced by the above two decisions. This is exactly what many users have done with Vitess. The following use cases are loosely derived from real production workloads:
I have not seen anyone ask for a durability requirement of more than two nodes. But this may be due to difficulties dealing with corner cases that MySQL introduces due to its semi-sync behavior. On the other hand, these settings have served the users well so far. So, why become more conservative?
These configurations are all uncomfortable for a majority based consensus system. More importantly, these flexibilities will encourage users to experiment with even more creative combinations and allow them to achieve better trade-offs.
The configurations in the previous section seem to be all over the place. How do we design a system that satisfies all of them, and how do we future-proof ourselves against newer requirements?
There is a way to reason about why this flexibility is possible. This is because the two cooperating algorithms (Request and Election) share a common view of the durability requirements, but can otherwise operate independently.
For example, let us consider the five node system. If a user does not expect more than one node to fail at any given time, then they would specify their durability requirement as two nodes.
The leader can use this constraint to make requests durable: as soon as the data has reached one other node, it has become durable. We can return success to the client.
On the election side, if there is a failure, we know that no more than one node could have failed. This means that four nodes will be reachable. At least one of those will have the data for all successful requests. This will allow the election process to propagate that data to other nodes and continue accepting new requests after a new leader is elected.
In other words, a single durability constraint dictates both sides of the behavior; if we can find a formal way to describe the requirements, then a request has to fulfil those requirements. On the other hand, an election needs to reach enough nodes to intersect with the same requirements.
For example, if durability is achieved with 2/5 nodes, then the election algorithm needs to reach 4/5 nodes to intersect with the durability criteria. In the case of a majority quorum, both of these are 3/5. But our generalization will work for any arbitrary property.
In the above five node case, if two nodes fail, the failure tolerance has been exceeded. We can only reach three nodes. If we don’t know about the state of the other two nodes, we will have to assume the worst case scenario that a durable request could have been accepted by the two unreachable nodes. This will cause the election process to stall.
If this were to happen, the system has to allow for a compromise: abandon the two nodes and move forward. Otherwise, the loss of availability may become more expensive than the potential loss of that data.
A two-node durability does not always mean that the system will stall or lose data. A very specific sequence of failures have to happen:
This type of failure can happen if the leader and the recipient node are network partitioned from the rest of the cluster. We can mitigate this failure by requiring the ackers to live across network boundaries.
The likelihood of a replica node in one cell failing after an acknowledgment, and a master node failing in the other cell after returning success, is much lower. This failure mode is rare enough that many users treat this level of risk as acceptable.
The most common operation performed by a consensus system is the completion of requests. In contrast, a leader election generally happens in two cases: taking nodes down for maintenance, or upon failure.
Even in a dynamic cloud environment like Kubernetes, it would be surprising to see more than one election per day for a cluster, whereas such a system could be serving hundreds of requests per second. That amounts to many orders of magnitude in difference between a request being fulfilled and a leader election.
This means that we must do whatever it takes to fine tune the part that executes requests, whereas leader elections can be more elaborate and slower. This is the reason why we have a bias towards reducing the durability settings to the bare minimum. Expanding this number can adversely affect performance, especially the tail latency.
At YouTube, although the quorum size was big, a single ack from a replica was sufficient for a request to be deemed completed. On the other hand, the leader election process had to chase down all possible nodes that could have acknowledged the last transaction. We did consciously trade off on the number of ackers to avoid going on a total wild goose chase.
In the next blog, we will take a short detour. Shlomi Noach will talk about how some of these approaches work with MySQL and semi-sync replication. Following this, we will continue pushing forward on the implementation details of these algorithms.