Orchestrator is an open source MySQL replication topology management and high availability solution. Vitess has recently integrated orchestrator as a native component of its infrastructure to achieve reliable failover, availability, and topology resolution of its clusters. This post first illustrates the core logic of orchestrator’s failure detection, and proceeds to share how the new integration adds new failure detection and recovery scenarios, making orchestrator’s operation goal-oriented.
Vitess and orchestrator both use MySQL’s asynchronous (async) or semi-synchronous replication. For the purposes of this post, the discussion is limited to async replication. In an async setup, we have one primary server and multiple replicas. The primary is the single writable server and the replicas are all read-only, mainly being used for read scale-out, backups, etc. While MySQL offers a multi-writable primaries setup, it is commonly discouraged, and Vitess does not support it (in fact, a multi-writer setup is considered a failure scenario as described later on).
The most critical and important failure scenario in an async topology is a primary’s outage. Either the primary server has crashed, or is network isolated: the result is that there are no writes on the cluster, and the replicas are left hanging with no server to replicate from.
How does one diagnose that the primary server is healthy? A common practice is to see that port :3306 is open. More reliably, we can send a trivial query, such as `SELECT 1 FROM DUAL`. Or even more reliable is to query for actual information: a status variable, or actual data. All these techniques share a similar problem. What if the primary server doesn’t respond?
A naive conclusion is that the primary is down, kicking off a failover sequence. However, this may well be a false positive since there could be a network glitch. It is not uncommon to miss a communication packet once in a while, so database clients are commonly configured to retry a couple times upon error. The common way to reduce such false positives is to run multiple checks, successively: if the primary fails a health check, try again in, say, 5 seconds, and again, and again, up to n times. If the nth test still fails, we determine the server is indeed down.
This approach yet introduces a few problems:
Consider the last bullet point. Some monitoring solutions run health checks from multiple endpoints, and require a quorum, an agreement of the majority of check endpoints that there is indeed a problem. This kind of setup must be used with care; the placement of the endpoints in different availability zones is critical to achieve sensible quorum results. Once that’s done, though, the triangulation is powerful and useful.
Orchestrator uses a different take on triangulation. It recognizes that there are more players in the field: the replicas. The replicas connect to the primary over MySQL protocol, and request the changelog so as to follow up on the primary’s footsteps. To evaluate a primary failure, orchestrator asks:
If, for example, orchestrator is unable to reach the primary, but can reach the replicas, and the replicas are all happy and confident that they can read from the primary, then orchestrator concludes there’s no failure scenario. Possibly some of the replicas themselves are unreachable: maybe a network partitioning or some power failure took both primary and a few of the replicas. orchestrator can still reach a conclusion by the state of all available replicas. It’s noteworthy that orchestrator itself runs in a highly available setup, cross availability zones, where orchestrator requires quorum leadership so as to be able to run failovers in the first place, mitigating network isolation incidents. But this discussion is outside the scope of this post. Orchestrator doesn’t do check intervals and a number of tests. It needs a single observation to act. Behind the scenes, orchestrator relies on the replicas themselves to run retries in intervals; that’s how MySQL replication works anyhow, and orchestrator utilizes that.
This holistic approach, where orchestrator triangulates its own checks with the servers’ checks, results in a highly reliable detection method. Iterating our example, if orchestrator thinks the primary is down, and all the replicas say the primary is down, then a failover is justified: the replication cluster is effectively not receiving any writes, the data becomes stale, and that much is observable all the way to the users and client apps. The holistic approach further allows orchestrator to treat other scenarios: an intermediate replica (e.g. 2nd level replica in a chained replication tree) failure is detected in exactly the same way. It further offers granularity into the failure severity. orchestrator is able to tell that the primary is seen down, while replicas still disagree. Or that replicas think the primary is down while orchestrator can still see it.
If orchestrator can’t see the primary, but can see the replicas, and they still think the primary is up, should this be the end of the story?
Not quite. We may well have an actual primary outage, it’s just that the replicas haven’t realized it yet. If we wait long enough, they will eventually report the failure; but orchestrator wishes to reduce total outage time by resolving the situation as early as possible.
Orchestrator offers a few emergency detection operations, which are meant to speed up failure detection. Examples:
An important observation is that orchestrator knows what your replication clusters actually look like, but doesn’t have the meta information about how they should look like. It doesn’t know if some standalone server should belong to this or that cluster; if the current primary server is indeed what’s advertised to your application; if you really intended to set up a multi-primary cluster. It is generic in that it allows a variety of topology layouts, as requested and used by the greater community.
For the past few years, orchestrator was an external entity to Vitess. The two would collaborate over a few API calls. orchestrator did not have any Vitess awareness, and much of the integration was done through pre- and post- recovery hooks, shell scripts and API calls. This led to known situations where Vitess and orchestrator would compete over a failover, or make some operations unknown to each other, causing confusion. Clusters would end up in split state, or in co-primary state. The loss of a single event could cause cluster corruption.
We have recently integrated orchestrator into Vitess as an integral part of the vitess infrastructure. This is a specialized fork of orchestrator, that is Vitess-aware. In fact, the integrated orchestrator is able to run Vitess native functions, such as locking shards or fetching tablet information.
The integration makes orchestrator both cluster aware and goal driven.
MySQL itself has no concept of a replication cluster (not to be confused with InnoDB cluster or MySQL Cluster): servers just happen to replicate from each other, and MySQL has no opinion on whether they should replicas from each other, or what’s the overall health and status of the replication tree. orchestrator can share observations and opinions on the replication tree, based on what it can see. Vitess, however, has a firm opinion on what it expects. In Vitess, each MySQL server has its own vttablet, an agent of sorts. The tablet knows the identity of the MySQL server: which schema it contains; part of what shard it is; what role it assumes (primary, replicas, OLAP, ...) etc. The integrated orchestrator now gets all of the MySQL metadata directly from the Vitess topology server. It knows beyond doubt that two servers belong to the same cluster, not because they happen to be connected in a replication chain, but because the metadata provided by Vitess says so. orchestrator can now look at a standalone, detached server, and tell that it is, in fact, supposed to be part of some cluster.
This cluster awareness is a fundamental change in orchestrator’s approach, and allows us to make orchestrator goal-driven. orchestrator’s goal is to ensure a cluster is always in a state compatible with what Vitess expects it to be. This is accomplished by introducing new failure detection modes not possible before, and new recovery methods too opinionated otherwise. Examples:
Thus, Vitess has an opinion of what the cluster should look like, and orchestrator is the operator that makes it so. It is furthermore interesting to note that orchestrator’s operations will either fail or converge to the desired state.
But, what if a primary unexpectedly fails? What server should be promoted?
On an unexpected failure, it is orchestrator’s job to pick and promote the most suitable server, and to advertise its identity to Vitess. The new interaction ensures this is a converging process and that orchestrator and vitess do not conflict with each other over who should be the primary. Orchestrator promotes a server based on multiple limiting factors: is the server configured such that it can be a primary, e.g. has binary logs enabled? Does its version match the other replicas? What are the general recommendation for the specific host (metadata acquired from Vitess). But there are also general, non server-specific rules, that dictates what promotions are possible. Do we strictly have to only failover within the same data center? The same region/availability zone? Or, do we strictly have to only failover outside the data center? Do we only ever failover onto a server configured as semi-sync replica? And how do we reconfigure the cluster after promotion?
Previously, some of these questions were answered by configuration variables, and some by the user’s infrastructure. However, the new integration allows the user to choose a failover and recovery policy, that is described in code. Orchestrator and Vitess already support three pre-configured modes, but will also allow the user to define any arbitrary (within a set of rules) policy they may choose.
More on that in a future post.