Wednesday 22 July 2015

SQL Server Failover Cluster Concept

Basic understanding of the concept of SQL Server Failover Clustering

What is a windows Cluster ?
A cluster is a group of independent computer systems, referred to as nodes, working together as a unified computing resource. When we say unified computing , it does not mean that a single application can access the resources (cpu/memory) of multiple computer at any point of time. A cluster provides a single name for clients to use it services.
There are two type of cluster:
Network Load Balancing Cluster : A Network Load Balancing Cluster (NLB) distribute the load (request from the client)  to the different nodes which are part of the cluster based on predefined rules.The  client application has to communicate to a single cluster IP address (virtual server)  and client does not know which node in the cluster exactly serving its request.Network Load Balancing cluster helps to enhance the the availability and scalability of application.
 
Fail over Cluster:  Failover cluster is a collection of servers that by working together increase the availability of applications and services that run on the cluster. It can be  described as a technology that automatically allows one physical server to take over the tasks and responsibilities of another physical server that has failed.A failover cluster provides high availability for cluster aware application like SQL server ,exchange server etc.

The major difference between NLB cluster and failover cluster is, failover cluster will not help to improve the scalability of the application. It can only support high availability.

A Microsoft SQL Server Cluster is nothing more than a collection of two or more physical servers with identical access to shared storage that provides the disk resources required to store the database files. 
These servers are referred to as "nodes".  Each of the nodes talk to one another via a private network, sending a heartbeat signal between them.  Should one node not communicate its heartbeat to the other node in the cluster the secondary node will take ownership of any dependent services being run by the node that lost communication.  This process is referred to as "failover". 
A failover can occur both automatically (a server's heartbeat stops communicating) or manually.  A manual failover is beneficial in the event that patching or some other form of maintenance is required at the physical server level.

There are many misunderstanding among people about the advantage of having SQL server on a failover cluster environment. Below table will give a clear picture about the capability of SQL server on cluster environment.




To understand the failover cluster it is important to familiar with the terminologies. Let us have a look on the terminologies used in the windows clustering.

Active/Active or Active/Passive Clustering
Clusters are often referred to as either Active/Active or Active/Passive.  Just as you would expect by the name, in an Active/Active cluster there will be two or more nodes, each one owning an instance of Microsoft SQL Server.  If one node fails, the instance it owns would fail over to the other node, running along side (and contending for resources with) the other instance.  An Active/Passive architecture  requires that no matter how many nodes make up the cluster, at least one node is not the owner of an instance of SQL Server.  It is "passive" and only exists to accept a failover of a node hosting a SQL instance in the event of a failover. 

Microsoft licensing policies require only license the active nodes running Microsoft SQL Server.  The passive node need not be licensed.

Server Node: Physical node with operating system that support windows clustering. Each server node should have minimum two network card for public and private network.Minimum one local hard disk is required for OS and other application binaries.
Private Storage: Local disks are referred as private storage. There will be minimum of one private disk for OS and SQL binaries.Server node can have a additional local disk for bigger page file.

Shared Disk Array: Each server needs to be attached to the shared external storage.In non-clustered SQL server instance, databases (system/user/resource) are store on locally attached disk storage but in clustered SQL server instances databases are store data on a shared disk array.That mean all the server nodes in the cluster setup are physically connected to the disk array.This shared storage configuration allow application to failover between server in the cluster.


Quorum: Quorum is a count of voting members— a quorum is a way of taking attendance of cluster members who are present. The cluster uses a quorum to determine who should be online.

Quorum Disk: Quorum is the cluster's configuration file.This file (quorum.log) resides in the the quorum disk (one disk from shared disk array).Quorum is the main interpreter between all nodes. It stores latest cluster configuration and resource data. This helps the other nodes to take ownership when one node goes down.
In real life, it is like emergency contact number and other medical information in your wallet which help others to take action in case of emergency.

Public Network and Private Network: Each server needs at least two network cards. Typically, one is the public network and the other is a private network between the nodes. The private network is used for internal communication of cluster.This is referred as heart beat. Public network is used to connect the external world or intranet.
Heart Beat: Heart beat is health check mechanism in cluster.  A single UDP packet sent between nodes in the cluster via the private network to confirm that  nodes are still online. By default cluster service will wait five seconds(one heart beat sent in every second)  before considering a cluster node to be unreachable.
Looks Alive check:Looks alive check is a basic resource health check to verify that the service(SQL service in our context) is running properly.To perform this , cluster service queries the windows service control manager to check the status of the service.By default looks alive check will happen in every five seconds.
Is Alive check: An exhaustive check to verify that a resource is running properly. If this check fails, the resource is moved offline and the failover process is triggered. During the Is alive check  the cluster service connects to the SQL server instance and execute select @@SERVERNAME.It will check only the SQL server instance availability and does not check the availability of user databases.
Failover: Clusters service monitor the health of the nodes in the cluster setup and the resources in the cluster. In the event of a server/resource failure, the cluster service  re-starts the failed server's workload on one of the remaining servers based on possible owners settings. The process of detecting failures and restarting the application on another server in the cluster is known as failover .
Virtual Server: Virtual server consist of a network name and IP address to which the client connect.This helps the client to connect the service  which hosted in the cluster environment with out knowing the details of the server node on which the service is running.In simple word , virtual server helps the client application to connect the SQL server  instance with out changing the connection string in case of failover of SQL instance.

Possible Owners: Possible owners are the server nodes on which cluster group (in our context SQL instance) can failover

Preferred Owner: Preferred owner is the server node which best suited to running an application or group.


Cluster Setup: Below schema diagram will give a clear picture about the windows cluster setup.


This picture depicts  two  node cluster setup.At this stage Node A and Node B will have only OS and windows cluster service.One of the disk from cluster disk will be designated as the quorum disk. This setup basically done by windows administrators and storage specialists.On top of this,  DBA's install  and configure the SQL instances. We are not going talk about the SQL server installation on cluster environment on this post.
To make it more clear, let us describe our cluster environment as given below.

Example of Active\Active Cluster:



Let us look into the pictorial representation current setup.
Fig 1






















From the Fig-1, we can see that INST1 and INST2 are installed on both physical node , but at any point of time one instance will be online only in one node.At this stage INST1 is running on physical node Pnode-A and INST2 is running on node Pnode-B.The cluster disk D and E will be visible in the node Pnode-A and F and G will be visible in the node Pnode-B.The request from the App1 will be routed through the virtual server A and landed in Pnode-A where the INST1 is running.In the same way the request from the App2 will be routed through the virtual Server B and landed in Pnode-B where the INST2 is running.

Let us see what will happen if there is a hardware/network failure in Pnode-A. Fig 2 depicts the failover of  scenario of INST1 to Pnode-B.


Fig 2





















Now the INST1 went offline in Pnode-A and came online Pnode-B.The connection from the App1 will be routed through the same virtual Server A and landed in Pnode-B.All four cluster disk will be visible in Pnode-B and only C (Local disk) will be accessible from Pnode-A. Note that even after the failover of INST1,App1 will be connecting to the virtual server A.It helps to make the failover transparent to the application.

No comments:

Post a Comment