SOFTWARE & DEVELOPMENT TOOLS
High Availability Software
A High-Availability Ecosystem from the Service Availability Forum
A complementary specification is being added to the existing AIS and HPI specifications, which acts as an umbrella to tie them together in the fields of configuration, management and event notification in order to meet high-availability requirements.
BILL SWORTWOOD, MOTOROLA, STEVE DAKE, MONTAVISTA SOFTWARE, ADAM SCHMIDT, FORCE COMPUTERS AND TIMO JOKIAHO, NOKIA–FOR THE SERVICE AVAILABILITY FORUM
The mission of the Service Availability Forum (SA Forum) in the communications and computing marketplace is to foster an ecosystem that enables the use of commercial off-the-shelf building blocks in the creation of high-availability network infrastructure products, systems and services. To achieve this, the SA Forum develops and publishes high-availability and management software interface specifications, and promotes and facilitates their adoption by the industry.
The SA Forum specifications cover three levels. The Application Interface Specification (AIS) defines the interface between applications and the high-availability (HA) middleware, making each independent of the other and thus providing for very robust management of the HA stack. The Hardware Platform Interface (HPI) defines the interface between the hardware and the HA middleware and makes each independent of the other. The Systems Management Specification (SMS) defines the interfaces used to monitor and control the AIS and HPI interfaces, as well as a comprehensive notification interface for HA systems. Complementing the AIS and HPI, the forthcoming Systems Management Specification will provide management for high-availability systems. Figure 1 shows how the three specifications will work together.
Application Interface Specification
Applications today are often developed as one application that runs on a single node. In a typical system, this exposes four possible areas that could fail: the application, the middleware, the operating system and the hardware.
The SA Forum AIS improves service availability by taking advantage of redundant hardware and software components in a distributed system. The AIS defines a C language API for six services that together provide a distributed mechanism for supporting cluster membership, application failover, checkpointing, event distribution, messaging and distributed locking. The approach of the AIS to increase service availability is to mask the four possible failures by redundantly distributing the application and middleware across the same or multiple nodes.
Consider an application that runs on two nodes. If any of the four possible failures occurs on one of the nodes, the remaining node survives to provide service to the user of the application. Providing more than one standby node improves service availability dramatically. For applications to fail over, they must make use of the Availability Management Framework (AMF). The AMF specifies a distributed system model and policy to describe the actions taken when a failure is detected or reported.
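As a rough illustration of why extra standby nodes help, the sketch below estimates the availability of a service that stays up as long as at least one node is up. It assumes that nodes fail independently and share the same availability figure, a simplification the specification does not make; the function name and the 99% figure are purely illustrative.

```python
def service_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # The service is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


# Illustrative figures only: each node is assumed to be up 99% of the time.
for count in (1, 2, 3):
    print(count, "node(s):", service_availability(0.99, count))
```

Under these assumptions each added node multiplies the expected downtime by the per-node unavailability. Correlated failures, such as a shared power supply, would reduce the gain.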
The system model of the AIS is provided by a container called a service group. A service group is a collection of service units. A service unit is a collection of one or more software components, or processes combined to deliver a service. Service units are assigned a component service instance (CSI), which is either active or standby. In Figure 2, the active service unit X contains components A and B. The standby service unit Y contains components C and D. If the active service unit X failed, service unit Y would become active.
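The relationship between service groups, service units and components can be made concrete with a small toy model. The sketch below is not the AIS C API; the class names, the state strings and the `fail` method are hypothetical, chosen only to mirror the example of service units X and Y.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceUnit:
    name: str
    components: List[str]
    state: str = "standby"  # "active", "standby" or "failed" (illustrative values)


@dataclass
class ServiceGroup:
    units: List[ServiceUnit] = field(default_factory=list)

    def active_units(self) -> List[ServiceUnit]:
        return [u for u in self.units if u.state == "active"]

    def fail(self, name: str) -> Optional[ServiceUnit]:
        """Mark a unit as failed; if it was active, promote the first standby unit."""
        unit = next(u for u in self.units if u.name == name)
        was_active = unit.state == "active"
        unit.state = "failed"
        if was_active:
            for candidate in self.units:
                if candidate.state == "standby":
                    candidate.state = "active"
                    return candidate
        return None


# Service unit X (components A, B) is active; Y (components C, D) is standby.
group = ServiceGroup([
    ServiceUnit("X", ["A", "B"], state="active"),
    ServiceUnit("Y", ["C", "D"]),
])
promoted = group.fail("X")  # Y takes over the active role
```

After the call, `promoted` refers to service unit Y and `group.active_units()` contains only Y. A real framework would also drive each component inside the unit through its own state changes, which this model leaves out.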
A number of alternative policies are provided to specify which service units are activated during failover. The AIS explains these policies in great detail. One of these policies, for example, is the N+M policy, which specifies that N service units will be active and M will be standby. If an active CSI fails, a standby CSI will be assigned the active state. Several mechanisms can direct a standby CSI to enter the active state. Failure to respond to a health check will activate the standby unit. In addition, the crash of a component, an error report via the API or the unregistration of a component via the API will bring the standby unit online.
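The N+M policy and its failover triggers can be sketched as a simple state table. This is a toy model under stated assumptions, not the behavior mandated by the AIS: the `Trigger` names, the state strings and both functions are hypothetical, and the first available standby unit is promoted, which a real policy need not do.

```python
from enum import Enum
from typing import Dict, List, Optional


class Trigger(Enum):
    """The four failover triggers described in the text (names are illustrative)."""
    HEALTHCHECK_TIMEOUT = "no response to health check"
    COMPONENT_CRASH = "component crashed"
    ERROR_REPORT = "error reported via the API"
    UNREGISTER = "component unregistered via the API"


def assign_n_plus_m(unit_names: List[str], n: int, m: int) -> Dict[str, str]:
    """Make the first n units active and the next m units standby."""
    if len(unit_names) < n + m:
        raise ValueError("need at least n + m service units")
    return {
        name: ("active" if index < n else "standby")
        for index, name in enumerate(unit_names[: n + m])
    }


def on_trigger(states: Dict[str, str], unit: str, trigger: Trigger) -> Optional[str]:
    """Take an active unit out of service and promote a standby unit, if any."""
    if states.get(unit) != "active":
        return None  # only the failure of an active unit causes a failover
    states[unit] = "out of service"
    for name, state in states.items():
        if state == "standby":
            states[name] = "active"
            return name
    return None  # no standby unit left to promote


# A 2+1 configuration: su1 and su2 are active, su3 is standby.
states = assign_n_plus_m(["su1", "su2", "su3"], n=2, m=1)
first = on_trigger(states, "su1", Trigger.HEALTHCHECK_TIMEOUT)  # su3 is promoted
second = on_trigger(states, "su2", Trigger.COMPONENT_CRASH)     # no standby remains
```

With M = 1 the second failure cannot be covered, which is why the choice of M depends on how many simultaneous failures the system has to tolerate.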
With the AMF, stateful application failover is made possible with the addition of checkpointing. Checkpointing is the action of saving the active component’s state into a checkpoint section. When a standby service unit is activated, it can read the checkpointed state and resume from the last known good state of the active application. The Checkpoint API provides functions that create, delete, read or write a checkpoint. It also allows the automatic cleanup of checkpoints and the iteration of checkpoint sections.
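The way checkpointing enables stateful failover can be illustrated with a minimal in-memory sketch. The `Checkpoint` and `Counter` classes and their methods are hypothetical stand-ins, not the Checkpoint API itself; a real checkpoint would be replicated across nodes rather than shared as a Python object, and automatic cleanup is not modeled here.

```python
import json
from typing import Dict, Iterator, Optional


class Checkpoint:
    """Hypothetical in-memory stand-in for a checkpoint made of named sections."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._sections: Dict[str, bytes] = {}

    def write(self, section_id: str, data: bytes) -> None:
        self._sections[section_id] = data

    def read(self, section_id: str) -> Optional[bytes]:
        return self._sections.get(section_id)

    def delete(self, section_id: str) -> None:
        self._sections.pop(section_id, None)

    def section_ids(self) -> Iterator[str]:
        """Iterate over the section identifiers in sorted order."""
        return iter(sorted(self._sections))


class Counter:
    """Toy stateful component: counts the requests it has served."""

    def __init__(self, checkpoint: Checkpoint) -> None:
        self.checkpoint = checkpoint
        self.value = 0

    def handle_request(self) -> int:
        self.value += 1
        # Save the state after every change so that a standby can resume from it.
        payload = json.dumps({"value": self.value}).encode()
        self.checkpoint.write("counter", payload)
        return self.value

    def activate(self) -> None:
        """Called on the standby when it is promoted: load the last saved state."""
        saved = self.checkpoint.read("counter")
        if saved is not None:
            self.value = json.loads(saved.decode())["value"]


ckpt = Checkpoint("demo")
active, standby = Counter(ckpt), Counter(ckpt)
for _ in range(3):
    active.handle_request()

# The active component fails here; the standby is promoted and resumes.
standby.activate()
next_value = standby.handle_request()
```

Because the standby reads the checkpoint before serving its first request, it continues from the count the failed component had reached instead of starting again from zero.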