|
HighAvailabilityTool
Requirements for mk-cluster, a new tool for managing MySQL replication clusters
See issue 850. IntroductionThis document contains requirements and specification for a new tool to manage clusters of MySQL replication servers. This tool's goal is to achieve high availability and load balancing, with as much consistency as possible. In Josh Berkus's terms, the tool is for online users. The tool will be able to perform automated failovers during a cluster or node failure, and to make it easy and safe to correctly perform manual switchovers. The tool is designed to improve upon MMM and Flipper, overcome their limitations, and meet the needs of Percona's clients. This tool has nothing to do with NDB cluster ("MySQL Cluster"). It is designed to take advantage of stock MySQL installations using normal MySQL replication. SynopsisAssume that you have a master-master pair of servers on 192.168.1.{10,11}, and .10 is the writable server right now. You want to set up a new cluster. All you should have to do is run the following two commands to create a high-availability cluster of Master Master replication between the two nodes. The first IP address tells the monitor how to connect to the first server. The 10 range IP addresses are the writer IP and the two reader IP addresses. mk-cluster h=192.168.1.10,D=maatkit --initialize 10.10.10.1 10.10.10.2 10.10.10.3 mk-cluster h=192.168.1.10,D=maatkit This cluster has a weakness. Only one monitor is watching it for failures and other problems. On any other machine, you run the following command to create a standby monitor in case the first one fails. mk-cluster h=192.168.1.10,D=maatkit This is how simple it should be to set up a high-availability cluster. This is the goal of the tool described in this document. Notice that there is absolutely no configuration file in this example. Basic RequirementsThis tool will support various replication topologies and provide reader, writer, and offline roles by default; users may define their own roles. Major goals include:
The tool's functionality can be broken down into:
These are distinct functions that must be specified and tested in isolation to meet the goal of strong correctness guarantees. (#2 and #6 are quite similar). TerminologyThe following is a basic glossary of the terms used in this document.
Consistency, Availability, and PartitioningIt is important to understand that this tool is designed for high availability, not complete consistency. Users who cannot afford to lose any data should look at a tool that guarantees data will not be lost, such as DRBD with Heartbeat. This tool will be built upon asynchronous replication, which does not offer strong consistency guarantees. We believe there is a great need for a highly available, reasonably consistent cluster built upon asynchronous replication. For the majority of users, replication is not noly "good enough," it is actually the right solution (i.e. it is better than DRBD). Even though replication can lag, and you can see inconsistency from reading a slave that is delayed, there are many use cases for this that are more than good enough for real-world needs. What is needed is simply a decent way to manage such clusters. There is a lot of literature and research around distributed clusters, especially synchronous clusters. In recent years, much of this has focused on three properties of such clusters: consistency, availability, and network partitioning (see this research paper on Brewer's conjecture and the CAP principle for more). The traditional approach values correctness and consistency, at the expense of scalability and/or availability. This is not the solution that we seek to support. The type of cluster that this tool will support could be considered a BASE cluster, as opposed to an ACID cluster. These concepts are best explained in an ACM queue article. In short, you can choose from consistency, high-availability, and partition resistance, but you can only choose two of those properties. This tool supports high-availability and partition resistance, with as much consistency as possible. An example of the type of trade-off that this tool makes is fencing, or STONITH (shoot the other node in the head) functionality. A system such as heartbeat with DRBD would typically take whatever steps are necessary to protect the consistency of the data. This includes forcibly killing a node that cannot be controlled normally. For many of our users, this is absolutely unacceptable. Instead, we will choose to isolate the misbehaving node and allow an administrator to bring it online again later. This gives an administrator a chance to correct the situation, which may be simple to resolve with human intervention. For example, suppose there is a misbehaving replication master. The server is unreachable through the network. The remaining servers will disconnect from the server, instead of killing it. Killing such a server could have an extremely high cost. A server with a large buffer pool could take several days to recover, warm up, and become fully functional again. Why a New Tool?Why not fix MMM? MMM's design seems to be limited in ways that cannot be overcome. It uses agents, and has a single point of failure that cannot be solved. MMM code is demonstrably buggy in critical parts, difficult to understand, and complicated by administrative functions, the use of agents, and reams of "glue" code. MMM also includes many features that this tool will not include, such as tools to help automate backups, synchronize servers with each other, and perform other tasks. These are unnecessary for our purposes, and they should be built as separate tools. This tool will focus on doing one thing well. Why not extend Flipper? Flipper's code is nice and its design is good, but it does not have enough features. And instead of extending Flipper, we think it makes sense to take advantage of all the code that is already written for other Maatkit tools. Unlike Flipper's code, Maatkit's code has a real test suite, which is required for this type of tool. Why not use heartbeat or pacemaker? These are well understood, well tested tools that are trusted by many people and widely deployed. The problem is, they are also quite complicated. Neither Heartbeat nor Pacemaker is capable of building the type of functionality that Percona's clients require. And a purpose-built solution is better than a highly complex generic system. For example, Drupal can be used to build a blog, but Wordpress is a much better blog. In the same way I believe we can build a much better cluster management system that is easier to use and administer than one built on top of pacemaker. Finally, the Heartbeat/Pacemaker model is wrong for replication high availability as envisioned here. The decisions should not be made by disconnected agents communicating amongst themselves and reaching some kind of majority opinion; the database should be the sole source of truth. A transactional database is different from the kinds of systems such as DRBD that Pacemaker is built to work with. Why Store State and Configuration in the Database?Storing the state in the database is actually one of the more important characteristics of the system. Storing the state in the database means that there is no chance for mismatch between the cluster itself and some extra system. Three properties must be true about the system for it to work well:
When properly implemented, these properties mean that the stored state is exactly as authoritative as the true state of the cluster, and a failure in either is made unimportant by the success of the other. There is no weak link inside or outside of the cluster; both the cluster itself, with all of replication's problems, and the stored state can be wrong in the same ways and under the same set of circumstances. The cluster is self-documenting, as it should be. If there is a failure to record the state, then there must have been a failure in the cluster, so the failure to record the state doesn't matter. If there is a failure to change the cluster to match the stored state, then the tool will try again, so the failure to change the cluster doesn't matter. If the cluster isn't available, the configuration doesn't matter. If the configuration isn't available, then the cluster must not be, either, so that also doesn't matter. You can look at it as a proof by contradiction, too. Imagine we store configuration/state separately from the cluster. There are now two things (cluster + state) we must make highly available and keep in sync in a distributed fashion. For the cluster to be any good, we need both of them to be online. Now pretend that one of the two is unavailable -- what good is the other? And conversely, if one of the two is down, what do you lose if the remaining one goes down? The cluster and its state/configuration succeed or fail as a group, not separately. In addition, if you've ever deployed a system across a number of nodes and had to fiddle with keeping the configuration files in sync across the nodes, you'll appreciate the single source of truth that this design creates. I have seen several problems caused by MMM configurations that were not the same (in subtle ways) on all the nodes. The danger of storing state in the database is that, unless we design things carefully, the tool itself could cause a problem for the cluster. To avoid this, we need to avoid auto_increment primary keys or statements that could cause duplicate key errors in situations like a takeover. We'll need to vet all the SQL statements to see if there is a possibility of any of them breaking replication from writing on a slave. RequirementsThe following is a laundry list of requirements, in no particular order.
Assumptions
Adjusting Replication Topology
Fencing and STONITHFencing is isolating a resource from the other resources to protect them. It is very important to protect data integrity. Here is a typical scenario in master-master replication without fencing, which illustrates what can happen: the active master fails, and the writable role is moved to its co-master. However, some application connections are still open to the first server, and writes continue to happen there. These propagate through replication and cause duplicate key errors on the newly promoted second master, and its replication fails. Or, the first master crashes; the writable role is moved to the second master; the first master comes back online, and finishes sending its binary log events to the second master, causing the same trouble. In clusters that insist on protecting data to above all, such as Red Hat's clustered file system or DRBD, fencing is very strict. If a resource cannot be isolated (e.g. it is not responding or not reachable, which can happen for a number of reasons), the typical tactic is called STONITH, or "shoot the other node in the head." This is usually done with a network power switch to forcibly power off the fenced-off node. In MySQL replication, fencing needs to be able to break two types of connections between machines:
In a BASE cluster such as the type mk-cluster supports, fencing is good, but STONITH is not necessarily a good thing. Here's why. The replication connection is the one that can corrupt data the worst, and it's easy to fence off. But the consequences of an IP address refusing to move to a new node are much less dramatic. In the worst case, writes will occur to a node that is suddenly not part of the cluster anymore. This is not nice, but it won't mess up the rest of the cluster in the general case. It just means an administrator has to decide whether that data is worth saving, and if so, what to do about it. Moreover, killing a node can have a very high cost, as mentioned early on in this document. mk-cluster will support fencing as far as possible without actually STONITH-ing a node (and even that will be an option if someone wants it). It will also try all available means to determine whether a node is truly unreachable; it will try to connect to all addresses associated with a node before declaring it unreachable. If a failed node comes back online, the resume_policy configuration variable specifies the resulting behavior. Load BalancingIn the beginning, load balancing will be very simple: nodes with roles of type balanced will have addresses distributed amongst them such that no node has more than one more address than any other. Database StructuresTable names are hardcoded into the tool and are not configurable. The only configurable variable is the database name in which the tables are stored. Tables must use a transactional engine. The naming convention is that the primary key of each table, if it has a single-column primary key, is named id. Foreign keys are named after the table they link to; thus, the following type of query makes sense: select * from child join parent on child.parent = parent.id; The @@server_id and similar @@ variables in sample queries is a placeholder. The value needs to be selected from the server and then inserted as a literal, to avoid different behaviors across MySQL versions. The @@ variables are not replication-safe in some versions of MySQL. The Config TableThis table is a simple name-value structure that defines configuration options for the tool. The table structure is as follows: create table config ( name varchar(64) not null primary key, value text not null ); The possible names and their values are:
The Monitor TableThis table records the presence of monitors, and their order of precedence in takeover attempts. The structure: create table monitor ( id varchar(32) not null primary key, promotion_order int unsigned not null, -- order in which monitors will attempt takeover unique index(promotion_order) ); When a monitor starts up and joins the cluster, it inserts into the table as follows: insert into monitor (id, promotion_order) select <monitor ID>, coalesce(max(promotion_order), 0) + 1 from monitor inner join heartbeat on heartbeat.id = 1 where heartbeat.node = <@@server_id>; TODO: I think this table needs a ts column for a heartbeat, and each monitor needs to pulse it to indicate it is still alive. The controller should delete rows that haven't had a heartbeat in some configurable delay time. Otherwise users can't clean out the table, and the monitors might wait a long time during takeovers. The Heartbeat TableThe heartbeat table is similar in concept to that used by mk-heartbeat. It is safe to read from this table with mk-heartbeat, but not safe to write to it. Only mk-cluster should write to it. This is because the updates must be done very carefully. The heartbeat table has a special purpose in the cluster. It contains a single row, so it is a global mutex. It indicates the ID of the monitor instance that is controlling the cluster, and which node is considered to be writable (there is only one writable node in the cluster). It is used for both automated and manual takeovers of the cluster. create table heartbeat ( id int unsigned not null primary key, node int unsigned not null, -- the @@server_id where it was inserted monitor varchar(32) not null, -- the mk-cluster monitor that inserted it thread int unsigned not null, -- the connection_id() of the monitor ts datetime not null ); There is only one row in the table, with the magical id constant of 1. All monitors try to write to the heartbeat, whether they are actually the controller or not. Writes to the heartbeat table are specially done so only the controller will actually change the data. Here is the query: replace into heartbeat(id, node, monitor, thread, ts) select 1, <@@server_id>, <mk-cluster ID>, connection_id(), now() from heartbeat where id = 1 and monitor = <mk-cluster ID> and node = <@@server_id>; This query won't change anything unless two conditions hold:
The heartbeat table can be read from any node to determine its replication delay, to a tolerance of one second. And it can be read on the writable or origin node to determine which mk-cluster monitor is the controller. It also communicates to every monitor whether the cluster's controller has failed. If the heartbeat is not updated recently on the writable node, the output of SHOW PROCESSLIST will reveal whether the controller is still connected to the writable node. More on this later. The monitor ID is a new concept in mk-cluster. This is something we need to develop for the tool. It needs to be a globally unique identifier, which should probably be some combination of timestamp, IP address, and some random information. We need to investigate existing methods for creating a globally unique identifier. All queries that modify data must select or join from the heartbeat table with the current node and monitor ID in the WHERE clause, to ensure that only the cluster controller updates data and only on the writable node. There is only one exception, and that is a cluster takeover. More on this later. By default, when you initialize a cluster, mk-cluster inserts the following row: insert into heartbeat (id, monitor, node, thread, ts) values (1, <monitor ID>, <@@server_id>, connection_id(), now()); The following query will take control of the cluster and reset the writable node, in case the controller has been inactive for more than 30 seconds. update heartbeat set monitor = <my ID>, node = <@@server_id>, thread = connection_id(), ts = now() where ts < now() - interval 30 second and id = 1; The Node TableThe node table contains one row for each node in the cluster. The table structure is as follows: create table node (
id int unsigned not null primary key, -- the @@server_id
ip_address int unsigned not null, -- the permanent IP address of the node
master int unsigned null, -- the replication master
state enum(
'online',
'offline',
'misconfigured',
'fenced',
'unreachable')
);Node StatesThe node states have the following meanings:
If you're missing REPLICATION_DELAY or REPLICATION_FAILED as in MMM, these are specified per-role as an allowable replication delay. We also really don't care if replication is actually running. If the sysadmin stops replication for a bit and the delay doesn't exceed the configured limit, nothing is actually wrong. We care whether the data is up to date enough to use. The Role Tablemk-cluster requires there to be a writer role that has only one address. By default it creates writer and reader roles. The reader roles are optional. All other roles are up to the user. The relationship between nodes and roles is stored in the node_role table. There are several types of roles:
The structure of the table is as follows: create table role (
id varchar(64) not null primary key,
type enum('writer', 'balanced') not null default 'balanced',
delay_threshold int,
allowed_demotions int,
comment varchar(255),
);The delay_threshold column contains the maximum permissible replication delay for balanced roles. If a node's replication delay exceeds this value, the role's addresses will be moved away from the node. The allowed_demotions column is explained in the Node_Role table. If it is NULL, then an infinite number of demotions are permitted. By default, when you initialize a cluster, mk-cluster will add the following entries: insert into role(id, type, delay_threshold, comment) values
('writer', 'writer', null, 'Automatically added writer role'),
('reader', 'balanced', 60, 'Automatically added reader role'),The Node_Role TableThis table contains relationships between nodes and roles. The presence of a row in this table indicates that a node is eligible for a specified role. The structure of the table is as follows: create table node_role ( node int unsigned not null, role varchar(64) not null, assigned int not null default 0, -- whether the node should have the role promotion_order int null, -- order of preference for failover demotion_count int unsigned not null, -- number of times has been failed away from primary key(node, role), unique index(role, promotion_order) ); The note and role columns should be self-explanatory. The assigned column contains a zero if the role has been assigned to the note. Remember, the presence of the row indicates that the node is eligible for the role, but the node may not actually have that role. If the node is supposed to have the role, the assigned column will be one. It is up to the controller to make sure that this actually happens. The demotion_count column counts how many times the controller has automatically failed away from a node (taken the role away from it). If this count exceeds the allowed_demotions configurable value for the role, then the node is not eligible for the role anymore. The promotion_order column is a unique order in which the controller will attempt to assign roles to nodes. The order is stored in the database when a node is added, and not modified afterwards. This is very important for takeovers, when monitors must agree on the new writable node. Two monitors must not be trying to do a takeover and promotion to writable role on two different nodes unless there is a severe problem. Aside from being useful in takeovers, this also lets the database administrator specify a preference for which nodes should be assigned which roles. For example, perhaps you want to create a data warehouse role, and you prefer the associated address to be assigned to a particular node if it is available, not treating all nodes within the role equally. When mk-cluster initializes a master-slave cluster with one writer and two reader IP addresses, it inserts the following example rows into the table: insert into node_role(node, role, assigned, promotion_order) values (1, "writer", 1, 0), (1, "reader", 1, null), (2, "reader", 1, null); The Address TableThis table contains only addresses that are movable -- the fixed addresses are stored in the node table. Every address belongs to a role. If the role is balanced, the addresses are balanced across nodes, within the role, as defined by Load Balancing. The table structure is as follows: create table address ( id varchar(64) not null primary key, role varchar(64) not null, node int unsigned null -- The node to which the address is assigned ); The Health_Check TableThis table contains information about each type of check the system will perform. This allows each check to be configurable. The structure is as follows: create table health_check ( id varchar(64) not null primary key, comment varchar(64) ); The Health_Check_Config TableThis table contains configuration parameters for each check: create table health_check_config ( health_check varchar(64) not null, name varchar(64) not null, value text, primary key(health_check, name) ); For example, the default configuration for the mysqld health check will be SELECT NOW(). The Role_Health_Check TableThis table specifies which checks are to be performed for each role that belongs to a node: create table role_health_check ( role varchar(64) not null, health_check varchar(64) not null, primary key(role,health_check) ); TODO: does it make sense to specify timeouts and fail-counts per-role? The Cluster_Log TableThe cluster log table stores a log of all actions taken by the controller. This is a brief, terse log that only records changes the controller makes and the reasons for making them. The structure of this table is as follows: create table cluster_log ( -- exact columns TBD ); The table probably shouldn't have a primary key. The log statements are inserted in the usual way to make sure they only get inserted into the writable server. insert into cluster_log(....) select ( .... ) from heartbeat where ...; Research Done On MMMThe following sections are about the MMM code I've read to try to understand it and make sure nothing important is missed. State ChangesMMM's state changes look too complex to me. They are in daemon.pm and are as follows, simplified. One of the problems is that these are all if statements, not if/else, so a state might change to one thing and then another as they are traversed.
In addition to the above, when the failover method is "wait" and both hosts online, it gets switched to "auto". If both hosts aren't online, the behavior depends on wait_for_other_master which is unclear to me now. When MMM changes a slave's stateWe need to move a slave to a different parent when its parent is truly dead. MMM's behavior is not known yet; when a parent fails or replication delays, the daemon notifies the agent; what does it do? TODO. Use CasesInitialize a cluster
Monitor the cluster
Main Success Scenario:
Extensions:
Add a node to a cluster
Main Success Scenario:
Extensions:
TODO: Should it try to start replication on a slave that comes online? (I think it should -- setting a dupe server_id can affect existing servers -- a new node should NOT have replication started yet IMO) Take a node off-line for maintenance
Main Success Scenario:
Move a role to another nodeAdd an IP address to the clusterRemove a node from the clusterMove a slave to a new master
Bring a node back onlineRemove an IP address from the clusterReconfigure a clusterAdd an IP address to a nodeRemove an IP address from a nodeTest whether an IP address is presentTest for my SQL replication delayTest whether replication is runningTest whether a node is aliveMake a node read-only
Make a node writableTest whether a node is writableStart replicationStop replicationTerminate connections to a nodeCheck configuration on a node for read-only and skip slave startSend a replication heartbeatDetect a failed controllerLook at the heartbeat on the writable node. If it's not updated there, look at the heartbeat table on all nodes, to see if the writable node has been fenced off from the cluster. If you find a new writable node, look there, repeat. If you don't find a new writable node, assume the original one is still writable. Look at it and try to see if SHOW PROCESSLIST shows the controller's thread. If not, assume the controller is dead; take over the cluster. Assume control of the clusterRelinquish control of the clusterFence off a failed nodeRead the desired state of the clusterObserve the current state of the clusterCompare the observed state to the desired stateFind differences between the desired state and the observed stateDecide how to reconcile differences in stateRead updated cluster configurationEvery loop we read configuration and sanity-check it. If it's invalid we correct it.
Test that a slave is connected to a MasterTest that the master has a correct slaveSynchronize the stored state with the observed stateWait for sleep to catch upLet a slave finish executing relay logsBalance roles among nodesDetermine whether a node is eligible for ruleChange a node's roleTest whether the database is availablePerform routine sanity checks on an instanceIn the main loop, every node in the cluster is checked for the following:
Perform initial sanity checks on an instanceWhen a node is added to the cluster, the tool should perform checks on it. The tool should warn and refuse to join the node to the cluster if any of the following is not true:
If any of the checks fails, the node's status should be set to "misconfigured". | ||||||||||||||||||||||||||||||||||||||||