SOFTWARE & DEVELOPMENT TOOLS
Managing High Availability
Building a Highly Available Base Station Controller Using COTS Components
High-availability management middleware is key to helping developers build carrier-grade systems from COTS components for deployments where uninterrupted service availability is a fundamental requirement.
ASIF NASEEM, GOAHEAD SOFTWARE
Building a carrier-grade network element that continues to provide service in the presence of failures is a complex undertaking. Historically, telecom equipment manufacturers (TEMs) have designed and built such systems from the ground up, drawing on specialized in-house expertise developed and nurtured over decades. Many TEMs, especially tier 1 vendors, have invested significant time and resources in developing the software services, often referred to as high-availability middleware, that are essential to building network elements with five-nines (99.999 percent) or better service availability.
Recently, however, such middleware has become the focus of independent software vendors (ISVs), who offer products that package a collection of key services that can be acquired off the shelf and used in system design and implementation. Standards efforts are helping this proliferation, notably the application programming interface (API) specifications published by the Service Availability Forum. The Forum has published two sets of API specifications: a hardware abstraction layer termed the Hardware Platform Interface (HPI), and an application abstraction layer called the Application Interface Specification (AIS). Together, when implemented, these specifications make both service availability middleware and the applications that comply with them portable. The case of a Base Station Controller (BSC) illustrates how to build a highly available wireless network element using commercial off-the-shelf (COTS) components.
Anatomy of a Highly Available System
Broadly speaking, building such a system requires six categories of essential services. The centerpiece of any high-availability middleware is its availability management services. Systems management services enable the creation of both external and internal management functionality. Application services are aimed primarily at developers and simplify the development of applications for highly available systems. Platform management services interface with a particular hardware platform’s management capabilities to ensure proper discovery of resources and subsequent population of the system model.
Foundational services provide a variety of functionality that system developers can utilize to build highly available systems. The kernel is a small, reliable, cross-platform foundation for all services, and helps abstract platform-specific capabilities into generic, platform-independent capabilities. The services that comply with the AIS specification are represented by the red blocks in Figure 1.

Assuming all of these services are conveniently available to the system designer as a package, let us dive into how one would build a base station controller that processes voice calls from the base stations it aggregates. Designing such a system typically involves establishing system requirements, determining the deployment configuration and mapping middleware capabilities to the desired solution.
In terms of system requirements, a base station controller must provide three key functional elements. The first, operations, administration and maintenance (OA&M), covers the functional areas its name describes. Effectively, this element is the system manager, responsible for monitoring the state of the system and providing an interface to the outside world, e.g., to an element management system. The second, call control, provides the voice processing services within the system. It communicates with the OA&M element as well as with the third element, media control, which manages the switching configuration for the media on which voice calls are transmitted and received.
For uninterrupted service availability, each of these functional elements needs to be highly available. One common and logical way to achieve this is to provide a redundant pair of each functional element and to ensure that the members of each pair operate in independent fault zones, with each instance running on its own node. We assume the system is built on a standards-based bladed platform such as AdvancedTCA (ATCA) or BladeCenter, so that each functional element instance runs on its own blade within the shelf.

Figure 2 depicts a deployment configuration for the base station controller. The hardware components or resources in this deployment configuration include the following:
• The Shelf – a platform that hosts the hardware components required for the system. Each shelf includes redundant Ethernet base interfaces to eliminate the network as a single point of failure.
• OA&M Node – a blade that hosts the OA&M functional element. The OA&M services are expected to operate with a 2N redundancy model.
• Call Control Node – a blade that hosts the call control functional element. There are two call control nodes in the system running the same set of resources, and the two call processing services operate with a 2N redundancy model.
• Media Control Node – a blade that hosts the media control functional element. There are two media control nodes in the system running the same resources, and the two media control services operate with a 2N redundancy model.
• ShM – redundant shelf managers monitor the hardware through the shelf’s internal management interface, such as IPMI, including detecting the removal or insertion of hardware components.
• Switch – redundant switch blades each providing the switching logic for one of the Ethernet base interfaces.
• PSU/Fan – power supply units and fans, shown simply to represent the non-intelligent field-replaceable units (FRUs) the shelf needs in order to operate.
Having established the system requirements and identified the deployment configuration with an appropriate redundancy policy, the next step is to map the essential middleware services onto the base station controller’s availability requirements. In our base station controller, the middleware services would be available on all of the blades hosting the key functional elements of the system, and each node would deploy only those services required for its domain-specific role. As described earlier, the key functional elements that benefit from high availability and systems management capabilities are the nodes providing the OA&M, call control and media control functions.
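The deployment rule behind this configuration (two instances of each functional element, each on its own blade) can be written down as data and checked mechanically. The following is a minimal sketch, assuming hypothetical blade names such as oam-1 and cc-2; it does not reflect any vendor’s configuration format.

```python
# Hypothetical blade names; each 2N functional element lists the blades
# hosting its two instances.
PLACEMENT = {
    "OA&M": ["oam-1", "oam-2"],
    "call control": ["cc-1", "cc-2"],
    "media control": ["mc-1", "mc-2"],
}


def check_placement(placement):
    """Return a list of violations of the fault-zone rule: every element has
    exactly two instances, and no blade hosts more than one instance."""
    problems = []
    owner = {}  # blade -> element already placed on it
    for element, blades in placement.items():
        if len(blades) != 2:
            problems.append(f"{element}: expected 2 instances, found {len(blades)}")
        for blade in blades:
            if blade in owner:
                problems.append(f"{blade}: already hosts an instance of {owner[blade]}")
            else:
                owner[blade] = element
    return problems


print(check_placement(PLACEMENT))  # []
```

Placing both OA&M instances on one blade, for example, would be reported, because a single blade failure would then take out the whole pair.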
OA&M Nodes
For these nodes, the middleware has been configured to be manager-capable (Figure 3), such that the nodes can be designated either the manager or standby manager node by the management middleware. Because these OA&M nodes will be the manager and standby manager nodes, the information model is present on the manager and is replicated to the standby manager node. This mapping of the node roles is consistent with the domain-specific role these nodes have in the system as well.
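The manager/standby arrangement for the information model can be illustrated with a small sketch: the manager applies each update locally and forwards it to the standby, so the standby holds an identical copy if it must take over. The class names and object names below are hypothetical, and the replication is synchronous and lossless; real middleware must also handle an unreachable standby and later resynchronization.

```python
class InformationModel:
    """Hypothetical stand-in for the information model: object name -> attributes."""

    def __init__(self):
        self.objects = {}

    def apply(self, name, attrs):
        self.objects.setdefault(name, {}).update(attrs)


class ManagerCapableNode:
    def __init__(self, name):
        self.name = name
        self.model = InformationModel()
        self.standby = None  # set only on the node currently acting as manager

    def update(self, name, attrs):
        # Apply the change locally, then replicate it so the standby copy
        # is identical if the standby has to take over as manager.
        self.model.apply(name, attrs)
        if self.standby is not None:
            self.standby.model.apply(name, attrs)


manager = ManagerCapableNode("oam-1")
standby = ManagerCapableNode("oam-2")
manager.standby = standby
manager.update("su/call-control-1", {"adminState": "unlocked"})
manager.update("su/call-control-1", {"operState": "enabled"})
print(standby.model.objects == manager.model.objects)  # True
```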

Because the OA&M services are operating in a 2N redundancy model, logically the OA&M service on each node would be represented as a service unit in the Availability Management Framework (AMF) system model, and those service units would be members of a 2N OA&M service group. The application processes that provide the OA&M functionality would also be represented as AMF components within the OA&M service unit and would affect the high-availability states of the containing OA&M service unit for the node.
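The relationship between components, service units and a 2N service group can be sketched as follows. This is a simplified model under stated assumptions, not the AMF API: a service unit counts as enabled only while all of its components are healthy, and when the active unit becomes disabled the group moves the active assignment to the standby unit. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy: bool = True


@dataclass
class ServiceUnit:
    name: str
    components: list
    ha_state: str = "unassigned"  # simplified: "active", "standby" or "unassigned"

    @property
    def enabled(self):
        # A service unit can take an assignment only if all its components are healthy.
        return all(c.healthy for c in self.components)


class ServiceGroup2N:
    """Simplified 2N group: one active and one standby service unit."""

    def __init__(self, name, units):
        if len(units) != 2:
            raise ValueError("a 2N service group has exactly two service units")
        self.name = name
        self.units = units
        units[0].ha_state, units[1].ha_state = "active", "standby"

    def evaluate(self):
        """Fail over if the active unit is disabled and the standby is enabled;
        return the name of the unit that is active afterward."""
        active = next(u for u in self.units if u.ha_state == "active")
        other = next(u for u in self.units if u is not active)
        if not active.enabled and other.enabled:
            active.ha_state, other.ha_state = "unassigned", "active"
            active = other
        return active.name


su1 = ServiceUnit("oam-su-1", [Component("oam-proc-a"), Component("oam-proc-b")])
su2 = ServiceUnit("oam-su-2", [Component("oam-proc-a"), Component("oam-proc-b")])
group = ServiceGroup2N("oam-sg", [su1, su2])
print(group.evaluate())            # oam-su-1
su1.components[1].healthy = False  # a component in the active unit fails
print(group.evaluate())            # oam-su-2
```

The same structure applies unchanged to the call control and media control service groups; only the unit and component names differ.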
The purpose of each of the services enabled on the OA&M nodes is as follows:
• Cluster Management Service (CMS) – required on all cluster nodes, and in this case, the node is configured to be manager-capable
• Distributed Messaging Service (DMS) – required because it is used by the management middleware services as the communications infrastructure for both intra-node and inter-node communication
• AMF Manager – manages the AMF system model and implements the required AMF policies; also performs replication of AMF system model state updates to the standby manager node
• AMF Client – required to perform the node local AMF operations for managing the lifecycle of AMF components on the node
• IMM Manager – required because this is a manager-capable node; manages the information model within the Information Model Management (IMM) service, including replication of all information model updates to the standby manager node
• Notification (NTF) Service – on a manager-capable node, fulfills the same role as the other management services (IMM, AMF, etc.) and provides reliable notification services to the cluster
• Log (LOG) Service – on a manager-capable node, the Log Service fulfills the same role as the other management services (IMM, AMF, etc.) and provides reliable logging services to the cluster
• Platform Resource Management Service (PRMS) – represents the state of the system’s hardware resources in the AMF system model and propagates HPI events by way of the NTF service; also facilitates the implementation of custom hot swap and alarm management for the system and relies on an HPI client library to access the HPI APIs
• SNMP Agent – supports the MIBs required for the system by processing incoming requests related to the supported MIBs, as well as generating traps and notifications described in the MIBs
It is likely that other services would also be useful on an OA&M node, such as the Management Database Service.
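The way PRMS propagates hardware events through the notification service, with other OA&M services consuming them, can be sketched as a publish/subscribe fan-out. Everything here is a hypothetical in-process stand-in, not the HPI or NTF API: the logging and SNMP consumers, the "hotswap" notification kind and the FRU names are illustrative assumptions.

```python
from collections import defaultdict


class NotificationService:
    """Hypothetical in-process stand-in: publishers send notifications of a
    given kind; every subscriber to that kind receives them in order."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, kind, callback):
        self._subscribers[kind].append(callback)

    def publish(self, kind, payload):
        for callback in self._subscribers[kind]:
            callback(payload)


ntf = NotificationService()
log_records = []  # stands in for records kept by a logging consumer
snmp_traps = []   # stands in for traps generated by an SNMP-agent consumer


def log_hotswap(event):
    log_records.append(f"{event['fru']}: {event['action']}")


def trap_on_removal(event):
    # Illustrative policy: only FRU removals raise an SNMP trap.
    if event["action"] == "removed":
        snmp_traps.append(event["fru"])


def prms_on_hardware_event(fru, action):
    # PRMS side: translate a hardware event (e.g. one reported via HPI)
    # into a cluster-wide notification.
    ntf.publish("hotswap", {"fru": fru, "action": action})


ntf.subscribe("hotswap", log_hotswap)
ntf.subscribe("hotswap", trap_on_removal)
prms_on_hardware_event("fan-tray-1", "removed")
prms_on_hardware_event("cc-2", "inserted")
print(log_records)  # ['fan-tray-1: removed', 'cc-2: inserted']
print(snmp_traps)   # ['fan-tray-1']
```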
Call Control Nodes
These nodes have been configured to be client-only (Figure 4), such that the nodes can never be designated either the manager or standby manager node by the management middleware. This mapping of the node roles is consistent with the domain-specific role these nodes have in the system as well, because these nodes are expected to perform the business logic of the system and not necessarily manage the other system components.
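The distinction between manager-capable and client-only nodes amounts to an eligibility filter on manager selection. The sketch below assumes a hypothetical selection policy (keep the current manager while it is up, otherwise take the first eligible node by name, with the next eligible node as standby); the actual policy belongs to the middleware.

```python
def elect_managers(nodes, current_manager=None):
    """Pick (manager, standby) from manager-capable nodes that are up.

    `nodes` maps node name -> (capability, is_up), where capability is
    "manager-capable" or "client-only". Client-only nodes are never eligible.
    """
    eligible = sorted(
        name for name, (capability, is_up) in nodes.items()
        if capability == "manager-capable" and is_up
    )
    if current_manager in eligible:
        # Prefer the sitting manager to avoid an unnecessary switchover.
        eligible.remove(current_manager)
        eligible.insert(0, current_manager)
    manager = eligible[0] if eligible else None
    standby = eligible[1] if len(eligible) > 1 else None
    return manager, standby


cluster = {
    "oam-1": ("manager-capable", True),
    "oam-2": ("manager-capable", True),
    "cc-1": ("client-only", True),
    "cc-2": ("client-only", True),
}
print(elect_managers(cluster))                 # ('oam-1', 'oam-2')
cluster["oam-1"] = ("manager-capable", False)  # OA&M node 1 fails
print(elect_managers(cluster, "oam-1"))        # ('oam-2', None)
```

Even with one OA&M node down, the call control nodes are never promoted; the cluster runs without a standby manager until the failed OA&M node returns.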

Because the call control nodes use a 2N redundancy model, the call control service on each node would logically be represented as a service unit in the AMF system model, and those service units would be members of a 2N call control service group. The application processes on each call control node that provide the call control functionality would also be represented as AMF components within that node’s call control service unit, and their state affects the high-availability states of the containing service unit.
The purpose of each of the services enabled on these nodes is briefly described below.
• Cluster Management Service (CMS) – required on all cluster nodes, but, in this case the nodes are configured to be client-only
• Distributed Messaging Service (DMS) – required because it is used internally by the management middleware services for intra-service communication
• Event (EVT) Service – potentially used to send and receive management requests/responses from/to the OA&M nodes; can also be used to checkpoint the state of the call control applications between the nodes
• Checkpoint (CKPT) Service – potentially used to checkpoint the state of the in-progress calls between the active and standby call processing applications on the nodes
• AMF Client – required to perform the node local AMF operations as requested by the AMF manager for managing the lifecycle of AMF components on the node
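Checkpointing in-progress calls means the active call control application records each call state change, and the standby rebuilds its call table from those records when it takes over. The sketch below is a hypothetical in-process model: the checkpoint object is simply shared by both applications, whereas the CKPT service would replicate it between nodes. Call identifiers and endpoint names are illustrative.

```python
class CallCheckpoint:
    """Hypothetical checkpoint: call id -> call state, readable by the standby."""

    def __init__(self):
        self.sections = {}

    def write(self, call_id, state):
        self.sections[call_id] = dict(state)

    def delete(self, call_id):
        self.sections.pop(call_id, None)


class CallControlApp:
    def __init__(self, name, checkpoint):
        self.name = name
        self.checkpoint = checkpoint
        self.calls = {}

    def setup_call(self, call_id, caller, callee):
        self.calls[call_id] = {"caller": caller, "callee": callee, "state": "connected"}
        # Checkpoint on every state change so the standby sees the latest state.
        self.checkpoint.write(call_id, self.calls[call_id])

    def release_call(self, call_id):
        self.calls.pop(call_id, None)
        self.checkpoint.delete(call_id)

    def take_over(self):
        # On failover, rebuild the call table from the checkpoint rather than
        # dropping calls that were in progress on the failed node.
        self.calls = {cid: dict(s) for cid, s in self.checkpoint.sections.items()}


ckpt = CallCheckpoint()
active = CallControlApp("cc-1", ckpt)
standby = CallControlApp("cc-2", ckpt)
active.setup_call("call-17", "bts-3/ms-42", "pstn")
active.setup_call("call-18", "bts-1/ms-7", "bts-2/ms-9")
active.release_call("call-18")
standby.take_over()  # cc-1 fails; cc-2 becomes active
print(sorted(standby.calls))  # ['call-17']
```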
Media Control Nodes
For the media control nodes, the management middleware has been configured to be client-only—the same as the call control nodes (Figure 5). This mapping of the node roles is consistent with the domain-specific role these nodes have in the system.

Because the media control nodes use a 2N redundancy model, the media control service on each node would be represented as a service unit in the AMF system model, and those service units would be members of a 2N media control service group. The application processes on each media control node that provide the media control functionality would also be represented as AMF components within that node’s media control service unit, and their state affects the high-availability states of the containing service unit.
Notice that the set of services used on these nodes is very similar to the set used on the call control nodes. The primary difference between the two configurations is the set of AIS application services made available on the media control nodes.
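The per-node service sets can be compared as plain data. The sketch below transcribes the service lists described in this article; the media control set is an assumption (MSG and CLM in place of the EVT and CKPT services used on the call control nodes), and the check that manager-side services appear only on manager-capable node types is an illustrative rule.

```python
# Per-node service sets. The media control set is an assumption:
# MSG and CLM in place of the call control nodes' EVT and CKPT.
NODE_SERVICES = {
    "OA&M": {"CMS", "DMS", "AMF Manager", "AMF Client", "IMM Manager",
             "NTF", "LOG", "PRMS", "SNMP Agent"},
    "call control": {"CMS", "DMS", "EVT", "CKPT", "AMF Client"},
    "media control": {"CMS", "DMS", "MSG", "CLM", "AMF Client"},
}
MANAGER_CAPABLE = {"OA&M"}
MANAGER_ONLY = {"AMF Manager", "IMM Manager"}


def check_roles(node_services, manager_capable):
    """Return node types that run manager-side services without being manager-capable."""
    return [
        node_type
        for node_type, services in node_services.items()
        if (services & MANAGER_ONLY) and node_type not in manager_capable
    ]


common = NODE_SERVICES["call control"] & NODE_SERVICES["media control"]
print(sorted(common))                                   # ['AMF Client', 'CMS', 'DMS']
print(sorted(NODE_SERVICES["media control"] - common))  # ['CLM', 'MSG']
print(check_roles(NODE_SERVICES, MANAGER_CAPABLE))      # []
```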
The message (MSG) service is potentially used to send and receive management requests/responses from/to the OA&M nodes, where incoming management requests would be retained in a designated message queue even if the media control application(s) are currently unavailable to receive the request. The cluster membership (CLM) service is used by the media control application processes to determine when cluster nodes have entered or left the cluster.
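The property that matters here is retention: a request sent while no receiver is attached is kept, then delivered in order once the media control application opens the queue. This is a minimal single-threaded sketch of that behavior under assumed names (MessageQueue and the request fields are hypothetical), not the MSG service API.

```python
from collections import deque


class MessageQueue:
    """Hypothetical retained queue: messages wait while no receiver is
    attached and are delivered in order once one opens the queue."""

    def __init__(self, name):
        self.name = name
        self._pending = deque()
        self._receiver = None

    def send(self, message):
        if self._receiver is None:
            self._pending.append(message)  # retained until a receiver appears
        else:
            self._receiver(message)

    def open(self, receiver):
        self._receiver = receiver
        while self._pending:
            receiver(self._pending.popleft())

    def close(self):
        self._receiver = None


handled = []
queue = MessageQueue("media-control-mgmt")
queue.send({"op": "get-switch-config"})  # media control application not running
queue.send({"op": "reset-port", "port": 4})
queue.open(handled.append)               # application (re)starts, drains backlog
queue.send({"op": "get-stats"})
print([m["op"] for m in handled])  # ['get-switch-config', 'reset-port', 'get-stats']
```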
When a node leaves the cluster, the media control applications could use the “node left” notification as a trigger to clean up any media switching configuration related to the call processing applications that are no longer available due to the node exiting the cluster.
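Using the "node left" notification as a cleanup trigger can be sketched as a membership callback that removes every switching entry owned by the departed node. The callback signature, connection identifiers and table layout are hypothetical; the CLM service would deliver the actual membership change.

```python
class MediaSwitchConfig:
    """Hypothetical media switching table: connection id -> owning node."""

    def __init__(self):
        self.connections = {}

    def connect(self, conn_id, owner_node):
        self.connections[conn_id] = owner_node

    def on_membership_change(self, node, change):
        # Hypothetical membership callback; on "left", drop every connection
        # set up by call processing on that node and report what was removed.
        if change != "left":
            return []
        stale = [cid for cid, owner in self.connections.items() if owner == node]
        for cid in stale:
            del self.connections[cid]
        return stale


switch = MediaSwitchConfig()
switch.connect("ts-1", "cc-1")
switch.connect("ts-2", "cc-2")
switch.connect("ts-3", "cc-1")
print(switch.on_membership_change("cc-1", "left"))  # ['ts-1', 'ts-3']
print(switch.connections)                           # {'ts-2': 'cc-2'}
```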
So there we have it: a highly available Base Station Controller designed and developed using COTS components.
GoAhead Software
Bellevue, WA.
(425) 453-1900.
[www.goahead.com].

