High Availability (HA) is a common buzz-phrase in the database server world these days. As more businesses depend on computer systems, they also expect those systems to be available at all times. Minimizing downtime means minimizing financial losses: lost productivity, lost sales, and damaged customer satisfaction.
High availability means that a system is set up so that, if a failure occurs, the system fails over to a backup system and business can go on as usual. Once the problem is resolved, the system can be "failed back" to the primary system.

You may be sitting there faced with a decision on implementing HA; you may even be forced into implementing something because a recent "hiccup" on one of your servers highlighted the alarming fact that you have no HA technology in place at all!
High Availability: everyone is doing it, or at least should be doing it. What is clustering? What does it mean to maintain a cluster? What kind of activity, at what frequency, is expected? What's this "Virtual Server" thing everyone keeps talking about? What's the "Active/Passive" terminology? These are all questions that need answering, and hopefully this article will start to answer them for you.

Certain articles and writings on the subject of SQL clustering urge you to hire specialised consultants with a high level of clustering knowledge. They say it's not something you can easily pick up or should do on your own without any experience, regardless of how many books and articles you have read. There are a few things we need to make clear here. Firstly, clustering does nothing to protect your data; it simply ensures better availability of the hardware processing resources.
Clustering is a Windows technology. SQL Server is not a clustering technology; it is just "cluster aware" (like MS Exchange). Windows PerfMon and all the SQL Server tools are also "cluster aware".
So, the cluster service (MSCS) is a Windows OS component; if you are primarily a DBA, you'll need to ensure you have some Windows admin skills available too. And let's be honest here: most SQL Server DBAs these days should have some OS knowledge (should!).
You use Cluster Administrator to administer the cluster. This is available under Administrative Tools on the servers, which is another reason for a little Windows admin knowledge.
Clustering is not a load balancer for SQL Server. Clustering can give you some version of load balancing if you split up your data, but you'll have to be careful that there are resources available to take over from any instance that fails, so that you don't end up with only part of your split data available to your customers!

Clustering is also not some performance-enhancing drug for the athlete that is SQL Server. Its simple task is to provide a safeguard against failure: a plan B if plan A fails.
I suppose clustering can give you some form of load balancing if you license all of the nodes and split the instances amongst them. It isn't true load balancing, though, and it's expensive doing it this way! Clustering is for surviving server failures and for shrinking downtime windows.
Clustering allows you to fail over to another node when the current one fails. It allows you to have very short outages when you carry out procedures like Windows patching, as you can patch one node at a time and move the instance to the other node. It makes installing software on your servers easier, as the downtime on the SQL Server is very minor: roughly 30 seconds, the time it takes to fail over to another node.
There is no longer any need to take the production application down and deny service to your users to upgrade hardware, software, the database, or the application itself. All that is necessary is to switch users from the node to be upgraded to another node, perform the upgrade, test it, and then return the users to the upgraded node. In this way, upgrades can be rolled through all nodes in the application network without ever denying service to a user. This has proved invaluable at my workplace where uptime on mission critical applications is the priority.
Clustering is often seen as the complete solution to high availability. Unfortunately, clustering adds complexity to the system, which can then impact stability. The way we remove that uncertainty is to use high-quality hardware that is tested and approved for clustering. Predictably, "cheap" and "highly available" just don't go together, so be prepared to rob budgets!

Clustering's primary benefit is an immediate hot-standby server to protect us from hardware failure. Remember, clustering will do nothing to stop or recover from a "DROP DATABASE" command. Backup and recovery strategies are still a must.
OK, so why clustering? Why not mirroring? Or both (i.e. cluster at the live site and mirror to another DR site)?
You could use both clustering and database mirroring. Clustering can be used for server failures/maintenance/etc at a primary site and database mirroring can be used for disaster recovery at a secondary site.
Mirroring is a whole different topic in its own right; this article will simply describe what clustering is and how your organisation could benefit from it. You can then make up your own mind as to which way to go.
Applications do not care about clustering; only SQL Server and Windows care about it.
There's a myth out there in the SQL Server world that installing and configuring clustering is a daunting and arduous task. True, this is what I automatically assumed before actually doing it for myself. My more senior colleagues tell me things have drastically changed (for the better, I hasten to add) with every new version of SQL Server.
Clustering allows you to have a set of servers that all share a set of drives. If one server fails, its defined resources, like SQL Server, move to one of the other servers. As mentioned above, the entire process usually takes approximately 30 seconds in a properly configured cluster. The phrase "properly configured" is the part that worries people!
One caveat: you need at least Windows 2000 Advanced Server installed to cluster.
Also, you cannot cluster with your base server equipment alone. You must also purchase a shared drive array that both servers can see. The most common device is a shared SCSI array, but at my workplace we use a storage area network (SAN).
If you do use a SAN, you must also purchase cards to connect to it. When you buy connectivity equipment like network cards and SAN fiber optic cards, make sure you buy two of everything for each server so that there is no single point of failure.
In my SAN environment, we purchased multiple fiber optic cards to connect to the SAN and assigned specific hard drives to each card. That way, you not only gain added performance, but you also create another level of fault tolerance.
Some Clustering terminology
Each server that participates in clustering is referred to as a node. A tool called Cluster Administrator manages the Windows clustered resources including SQL Server. Inside Cluster Administrator, you can specify which server is the preferred owner of a resource (like SQL Server), and you can define who are the possible owners.
This means that, now that 4-node clusters are possible, you can specify that a given service fails over to a particular node. You can also set dependencies between resources: by making SQL Server dependent on the disk resources, you can make sure that SQL Server does not start until the drives are ready.
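To make the preferred/possible owner settings concrete, here is a toy sketch (this is not MSCS code, and the node names are made up for illustration) of how a failover target might be chosen under those rules:

```python
# Toy sketch of MSCS-style owner selection: prefer the preferred owner
# when it is online, otherwise fall back to any possible owner that is up.

def choose_owner(preferred, possible_owners, online_nodes):
    """Return the node that should host the resource group, or None."""
    if preferred in online_nodes:
        return preferred
    for node in possible_owners:
        if node in online_nodes:
            return node
    return None  # no eligible node is up: the group stays offline

# Example: NODE1 is the preferred owner of the SQL Server group.
print(choose_owner("NODE1", ["NODE1", "NODE2"], {"NODE1", "NODE2"}))  # NODE1
print(choose_owner("NODE1", ["NODE1", "NODE2"], {"NODE2"}))          # NODE2
```

The second call shows a failover: with NODE1 down, the group lands on the first possible owner that is still online.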
MSCS allows the creation of virtual servers. A virtual server is not associated with a specific computer and can be failed over to any node in the cluster.
They act much like physical servers: they allow access to network resources, publish a unique server name to network clients, and associate a network name with an IP address.
A virtual server is a group that contains: a Network Name resource, an IP Address resource, usually a physical disk, and the applications to be accessed by clients of the virtual server.
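The group above can be pictured as a set of resources with dependencies, brought online in order so that SQL Server never starts before its disk, IP, and network name. This is only an illustrative sketch, not MSCS internals; the resource names mirror the list above:

```python
# Toy model of a virtual server group: each resource lists the resources
# it depends on, and we bring them online in dependency order.

resources = {
    "IP Address":    [],
    "Network Name":  ["IP Address"],
    "Physical Disk": [],
    "SQL Server":    ["Network Name", "Physical Disk"],
}

def online_order(resources):
    """Return a start order in which every resource comes after its deps."""
    order, started = [], set()
    while len(order) < len(resources):
        for name, deps in resources.items():
            if name not in started and all(d in started for d in deps):
                order.append(name)
                started.add(name)
    return order

print(online_order(resources))  # SQL Server comes online last
```

The point of the dependency chain is exactly what was described earlier: SQL Server only starts once the drives (and the name/IP it publishes to clients) are ready.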
Failover is the process of moving resources to another node in the same cluster; it is the principal activity of MSCS. This feature is what helps reduce application downtime and allows for system recovery with minimal loss of time and revenue.
Failover may occur automatically in the event of some type of failure, or it can be initiated manually by an administrator.
Automatic failover typically occurs as the result of some type of hardware, operating system, or application failure. If any resource in the group encounters failure, the entire group fails over to another node. Manual failover is initiated by a system administrator for almost any system or application upgrade, maintenance, or other support needs.
A group in MSCS may be configured for automatic failback. If configured for automatic failback, MSCS will move groups back to a preferred node. Failback may be set for immediate failback or timed failback for a given time period.
The failback procedure follows the same rules as failover when actually moving resources.
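The immediate-versus-timed failback policy can be sketched as a small decision function. This is a toy illustration, not MSCS code; the window hours are made up (a common choice is an overnight window so failback doesn't disturb users):

```python
# Toy sketch of the failback decision: immediate failback moves the group
# back as soon as the preferred node is up; timed failback waits for an
# allowed window (here expressed as a start/end hour of day).

def should_fail_back(preferred_is_online, immediate, window, hour):
    """Decide whether to move the group back to its preferred node now."""
    if not preferred_is_online:
        return False          # nothing to fail back to yet
    if immediate:
        return True           # move back as soon as the node returns
    start, end = window
    return start <= hour < end  # timed: only inside the allowed window

print(should_fail_back(True, False, (1, 5), 3))   # True: inside 01:00-05:00
print(should_fail_back(True, False, (1, 5), 14))  # False: mid-afternoon
```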
Active/Active & Active/Passive
There are two types of failover clustering in Windows: Active/Passive and Active/Active.
An active/passive configuration is the most basic HA configuration.
In this type of configuration, one node is active running your resources, and the other node is passive, simply waiting for the active side to fail.
The advantages of this configuration are that failing over causes no service degradation, provided the passive machine is similar to the active machine. In addition, services are only restarted when machines fail, which typically results in half as many service interruptions as an Active/Active configuration.
Active/Passive makes the least of your system investment, as the passive nodes are rendered "out of use" until called upon. I guess some organisations would deem this wasted resources? Why spend all that money on a server that just sits dormant?
With Active/Active, the nodes are generally completely symmetric. Any transaction can be routed within the network to any node (server) which can read or update data in the database.
Hence Active/Active offers the most flexibility and makes the most of your system investment, as requests can be load-balanced across all available processing capacity.
Should a node fail, users at the other nodes are unaffected. Furthermore, the users at the failed node can be quickly switched to surviving nodes, thus restoring their service (as previously mentioned, generally in around 30 seconds).
An Active/Active network contains at least two copies of the application database. All database copies are kept synchronized so that any copy can be used for a transaction.
Should a database copy fail, all transactions are routed to a surviving copy.
Note, however, that you need to ensure that both nodes have sufficient capacity (processor, memory, secondary storage) to run the entire cluster workload at a satisfactory level of performance.
The disadvantage of an Active/Active configuration is that quality of service may be degraded when one machine runs both services.
Additionally, resources are moved not only when machines fail, but when they return to service, resulting in twice as many service transitions as active/passive configurations.
For performance, it is said that Active/Passive is the better solution. To quote sql-server-performance.com: "Personally, I prefer an Active/Passive configuration as it is easier to set up and administer, and overall it will provide better performance. Assuming you have the budget, this is what I recommend." It's just a case of deciding whether it's worth the extra £s.
The advantage of an Active/Active configuration, on the other hand, is that both nodes provide useful service, which is generally viewed as a more efficient use of resources than Active/Passive.
Also note that in an Active/Active environment you must still license both nodes, since both nodes are accessible.
I guess what Active/Active allows you to do is let it fail, but fix it fast! It provides some form of disaster recovery.
The "heartbeat" is, as the name suggests, the sign of life of a clustered environment. It is what all the nodes in a cluster use to "talk" to each other.
This signal is used by each node to determine whether the other node is still available. If a node stops responding, the other node assumes it has failed and takes over.
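The heartbeat check boils down to a timeout: if a node hasn't heard from its partner recently enough, it treats the partner as dead and initiates failover. A toy sketch (not MSCS code; the timings are made up for illustration):

```python
# Toy sketch of heartbeat-based failure detection: each node records when
# it last heard from its partner and compares that against a timeout.

def partner_alive(last_heartbeat, now, timeout=5.0):
    """True while the partner's last heartbeat is within the timeout."""
    return (now - last_heartbeat) <= timeout

# NODE2 last heard from NODE1 at t=100s; check at t=103s, then at t=110s.
print(partner_alive(100.0, 103.0))  # True: heartbeat is recent
print(partner_alive(100.0, 110.0))  # False: NODE2 would initiate failover
```

This is also why the private heartbeat network matters: a flaky, congested link can make a healthy partner look dead and trigger an unnecessary failover.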
You would normally have a shared SCSI disk that both servers can access. A public network is what clients use to connect to your server, while a private network is used to check the health of the nodes via the heartbeat.
This "heartbeat connection" can share the public network connection as well. Applications would connect to SQL Server using a virtual server name, which can float to any server that owns the server�s resources.
There is also a cluster name, which can also float based on who owns the main server resources.
So, now that I've described some of the clustering jargon, what will your network need before you can deploy a server cluster?
As well as needing a good rapport with the Windows/OS guys, you'll need to be on speaking terms with your network people too, because in order to configure clustering you'll need at least four more IP addresses: one for each physical node (these are the addresses the heartbeat works on), at least one for the "virtual" server name (two if you opt for Active/Active), and one for the cluster itself. You'll see later, in the "Configuring Clustering" section, how these IP addresses are configured.
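The arithmetic behind that "at least four" is simple enough to write down (this is just the tally implied above, not an official formula: one address per node, one per clustered SQL Server instance, plus one for the cluster):

```python
# Tally of IP addresses for a SQL Server cluster: one per physical node,
# one per clustered instance (its virtual server name), one for the cluster.

def addresses_needed(nodes, sql_instances):
    return nodes + sql_instances + 1

print(addresses_needed(2, 1))  # 4: two-node Active/Passive, one instance
print(addresses_needed(2, 2))  # 5: two nodes running Active/Active
```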
As with most things Microsoft, the hardware configuration for a cluster must be selected from the cluster Hardware Compatibility List (HCL). This includes network interface controllers along with all other components.
All of the adapters used to attach nodes to the cluster network must be of the same speed, media type, duplex mode, and so on.
And for the Windows gurus out there: the nodes of a cluster must belong to a single domain. The domain must have at least two domain controllers, and if you're using DNS you'll need at least two DNS servers too.
There are other networking requirements based on the geographical dispersion of your cluster; see the further reading at the end of this article for more in-depth information.
What about SQL 2008?
I haven't had too much opportunity to play with all the new features that SQL 2008 brings, but one thing is for sure: clustering remains a high-availability failover technology. I cannot foresee the actual workings of the SQL side changing.
Rumour has it that Microsoft is adding iSCSI support for clustering, as well as now supporting up to 64 nodes in a cluster. I guess 8 nodes weren't confusing enough!
So to conclude, why choose clustering?
Downtime will be minimised, and patching and new software installs are made easier. If the primary node does fail, the failover is automatic; it can occur without your DBA touching a server!
Failing back is also quick and easy, and can be done whenever the primary server is fixed and put back online, allowing you flexibility. Clusters are expandable too, and can act as a performance booster by allowing the introduction of more nodes.
I guess the main advantage is that your boss has added peace of mind and your users don't grumble as much when you need to perform server updates. A happy boss and happy users... that's all you wish for, isn't it?
It's not all rosy, though: as previously mentioned, high availability comes with a price (log shipping could be a cheaper solution?). More hardware usually means more maintenance, and more maintenance means more work for your DBAs!
In Part Two of this article, "SQL Server Clustering - Configuration", I will aim to walk through the configuration and implementation of a SQL cluster from a beginner's viewpoint. Hopefully, after reading this, you at least now know the basics.
If you need any more information or assistance on clustering your SQL Server environment, please simply contact us.
Related Documents & Further Reading on Clustering technology
TechCenter - Windows 2003 Clustering Services
Server Clusters - Network Configuration Best Practices for Windows 2000 and Windows Server 2003
Server Clusters - Backup and Recovery Best Practices for Windows Server 2003