Introduction to Veritas Cluster Server (VCS)
I somehow managed to escape having to deal with Veritas products over the years. Now I have been tasked with supporting some of them, so I'm taking this opportunity to familiarise myself with the software. The product I'm looking at more specifically is Veritas Cluster Server, which is a different product from their popular file system and backup products. Veritas is now part of Symantec, but I'm looking specifically at the shape of the product before the Symantec acquisition, because that's the form I have to support. Hopefully the details I cover here will help others who also end up having to support a VCS installation.
Description of VCS
Veritas Cluster Server is software used to facilitate and manage application clusters. Examples of these would be to manage an Oracle database cluster, a web application stack, or handle the clustering of Veritas file system products, or various combinations of all of these in a global operation.
I have been told it's a good product, but also quite expensive. It's more focused on managing a cluster for high availability than for performance. In other words its main purpose is to detect failures and perform failover operations, reliably.
How it works
It runs as a service (or a set of services) on top of the operating system on each server. It has its own heartbeat and synchronization system communicating over the network at layer 2, and it wants several redundant network links for this. It provides service over a virtual IP address on the node that is currently active. In other words, clients only connect to the virtual IP, and VCS makes sure that something is available to provide service over that IP address, doing all the failover magic in the background to achieve that.
It is also configured to know the dependencies between resources, and will bring them up and shut them down in the right order. For example, to start an application it will make sure the file system resource is brought up first, then the database, then the application. Likewise it will make sure the network is up before bringing up the IP address configuration. It does this as efficiently as it can, starting services in parallel where the dependencies allow.
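As a rough illustration, and with resource names that are entirely made up, dependencies like these can be linked from the command line with hares; a parent resource is only brought online once its child is up:

```sh
# Hypothetical resources in a service group: appfs (file system),
# appdb (database) and appsvc (the application itself).
# Link them so VCS onlines appfs first, then appdb, then appsvc,
# and offlines them in the reverse order.
hares -link appdb appfs      # appdb (parent) requires appfs (child)
hares -link appsvc appdb     # appsvc (parent) requires appdb (child)

# Show the resulting dependency tree
hares -dep
```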
Service Groups
Resources can also be grouped, into Service Groups. These collect the resources that together form a particular service. For example, a set of web server and database resources can sit in one group, and when there is a fault on one of the database resources the whole lot is failed over to another system as a unit. Failover can also be done for maintenance purposes. Service Groups can be in active-standby configurations (called 'failover'), active-active (called 'parallel'), or a mix of these (called 'hybrid').
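A minimal sketch of what that looks like from the command line, with a hypothetical group called websg and made-up host names:

```sh
# Show the state of the cluster, its service groups and resources
hastatus -summary

# Switch the service group "websg" to node2, e.g. before
# maintenance on the node it is currently running on
hagrp -switch websg -to node2

# Or take it offline and online explicitly on specific systems
hagrp -offline websg -sys node1
hagrp -online  websg -sys node2
```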
Resource types and agents
To shut services down and bring them up, VCS interacts with the system through commands supplied to it. These are defined in the configuration, with stop, start and monitoring procedures, and are coordinated by agents. Bundled Agents, Enterprise Agents and Custom Agents exist for the various resource types. A resource type could be, for example, a database, web server or file system service, e.g. an Oracle database or an NFS server. On initial startup, VCS determines which agents are needed to manage the configured services, and only those agents are started. Each agent can manage multiple services of the same resource type on a system; for example, the Oracle agent can manage multiple Oracle databases on one server.
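For example, the resource types the cluster knows about, and the resources configured against them, can be listed from the command line (the resource name appdb is again made up):

```sh
# List the known resource types (bundled, enterprise and custom agents)
hatype -list

# List the configured resources, then show the attributes of one of them
hares -list
hares -display appdb
```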
Agents have entry points, which are usually implemented as Perl scripts that are triggered to perform certain functions. They don't have to be Perl; entry points can also be developed in C++ or bolted on using other scripting languages. There are various entry points: online, to bring a service up; offline, to shut a service down; monitor, to check the status of a resource; and others such as clean, action and info.
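To make that concrete, here is a sketch of what a monitor entry point for a custom agent might look like. The service name is invented, and the exit-code convention (100 for offline, 110 for online) is how I remember the agent documentation describing it, so check it against your own VCS version:

```sh
#!/bin/sh
# Monitor entry point sketch for a hypothetical custom agent that
# watches a process called "mydaemon".

if pgrep -x mydaemon >/dev/null 2>&1; then
    exit 110    # report the resource as online
else
    exit 100    # report the resource as offline
fi
```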
Daemons
There are three main daemons, and one module, running on each system, that together make up the VCS service.
High-Availability Daemon (HAD)
This is the main daemon controlling the whole show. It's typically referred to as the VCS engine. It maintains the cluster according to the configuration files, maintains state information, and performs all the monitoring and failover needed. It runs as a replicated state machine, so on each node it contains a synchronized view of what's going on in the whole cluster. The replicated state machine is maintained through the LLT and GAB daemons.
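The engine itself is started and stopped with its own commands. A quick sketch of the ones I've come across:

```sh
# Start the VCS engine (HAD) on this node
hastart

# Stop it on this node only, evacuating its service groups to other nodes
hastop -local -evacuate

# Stop it on all nodes but leave the managed applications running
hastop -all -force
```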
Low Latency Transport Daemon (LLT)
This daemon is a low-latency, high-performance replacement for the IP stack, used for cluster maintenance traffic. It runs over a private network and requires two independent network links between all the cluster nodes, both for redundancy and to be able to tell the difference between a system failure and a network failure. It has two major functions:
- Traffic distribution - it spreads internode traffic between all the private links, for speed and reliability.
- Heartbeat - This is used by the GAB daemon to determine the state of cluster membership.
Group Membership Services/Atomic Broadcast Daemon (GAB)
The GAB daemon does two things:
- Maintains cluster membership, setting nodes as up or down based on heartbeat status.
- Handles cluster communications, providing guaranteed delivery of point-to-point and broadcast messages to all the nodes.
I/O Fencing Module
The fencing module makes sure that only one cluster survives a split of the private network (a 'split brain'). It determines which systems remain in the cluster, and makes sure that systems that are no longer members of the cluster can't write to the shared storage.
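To see what LLT and GAB think is going on, these commands can be run as root (output formats vary between versions):

```sh
# Show LLT node status and the state of each private link
lltstat -nvv

# Show GAB port memberships: port a is GAB itself, port h is HAD
gabconfig -a
```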
Other Processes
Veritas Cluster Server also comes with a few other commands and processes:
- Command Line Interface - to manage and administer VCS.
- Cluster Manager - This comes in two forms. One is a Java based graphical user interface, the other is a web interface.
- hacf - This is a utility that can verify a configuration file, or make a running HAD load a configuration file (see the example after this list).
- hashadow - This watches the health of HAD, and restarts it when needed.
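As a quick example of hacf, it can be pointed at the directory holding the configuration. The path below is the usual default location, but it may differ on a given installation:

```sh
# Check the syntax of the configuration in the default config directory
hacf -verify /etc/VRTSvcs/conf/config
```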
Cluster Topologies
VCS supports a lot of different cluster topologies, and this is where you start to see the value and strength of the product. It covers everything from the most basic topologies up to fairly complex, and genuinely useful, configurations.
The most basic form is the asymmetric, or active-passive, setup. This is where one live server runs an application, and another server stands by, ready to be failed over to when needed. It also supports symmetric, or active-active, setups: one server runs one application and another server runs a second application, and when one of the servers goes down, its application is started on the other server alongside the application already running there.
The possibilities get better from here. For example, you can have multiple servers sharing a few spares, banking on the fact that they won't all fail at the same time. Another option is a bunch of servers each running multiple applications; if one of them fails, its applications are shuffled around onto the remaining servers that have capacity to run them. It can also handle failover between data centers, for example for disaster recovery. Neat.
Configuration
I think it's best that I cover configuration in another article. So, I've done that in the Overview of Veritas Cluster Server Configuration article.