Open Standards Reduce Management Costs in New Server Designs

In the past, server hardware vendors designed high-performance, reliable systems,
but usually at the expense of using proprietary components that offered little
or no interoperability. Server manageability, whether for pedestal, rack or blade
designs, has therefore relied on proprietary tools and components for monitoring
system hardware health, performing preventive maintenance, troubleshooting faults
and recovering from failures. Without standardized management, or the requisite
security infrastructure, IT managers struggle to manage multi-vendor networks
securely and in real time.

Scripts offer some reprieve by automating mundane tasks. Yet vendors use different
commands and interfaces for the same task, further complicating the process.
Finally, management that relies entirely on OS-based agents cannot effectively
handle states outside the OS, or situations where the OS is not responding. By
adding open, standardized platform management components to server designs,
vendors help customers lower operational costs by exposing features at the
command-line level in a consistent manner, complementing existing management
approaches.

Why Open Standards?

Flexible management design standardizes the platform interfaces, not the implementation.
This supports future platform innovations without sacrificing functionality. The
overall philosophy for systems management should therefore be based on open standards
building blocks. This offers the following characteristics:

• Scalable, from low-end to high-end products

• Supports pedestal, rack mount, multi-chassis and modular/blade configurations

• Supports feature integration without requiring hardware modifications

• Supports a common platform management configuration interface

• Isolates management software from platform hardware differences

• Supports cross-platform systems management software

• Supports managed remote services

IPMI to Standardize Hardware Management

The Intelligent Platform Management Interface (IPMI) is a management technology,
initiated by Dell, HP, Intel and NEC in 1998, that standardizes the system hardware
management interfaces and processes and can help address many of these challenges.
The latest revision, IPMI v2.0, announced in February 2004, adds new capabilities
over the previous version, most notably security and standardized console access.
More than 170 vendors worldwide have adopted IPMI in their products (Figure 1).

IPMI defines a common and secure interface for monitoring system hardware and
sensors (temperature, voltage, fan, etc.), controlling system components (power
supplies, blades, etc.), logging important system events (chassis intrusion, system
reset, etc.) and allowing administrators to remotely manage and recover failed systems.

Specialized IPMI firmware is at the heart of any IPMI-enabled server. This code
runs on a dedicated chip/controller and implements the IPMI specification. This
chip is sometimes referred to as a service processor or baseboard management controller
(BMC). It typically resides on the server motherboard or blade. Together, the
controller and its IPMI firmware form a stand-alone “embedded management subsystem.”
This subsystem works regardless of the type or state of the CPU, or Operating
System (OS), allowing for monitoring and recovery even when the rest of the system
is unavailable. This approach is commonly referred to as providing “out-of-band”
access. Administrators can therefore access IPMI information even when the server’s
OS is not loaded, is unstable or is unresponsive.

IPMI functions are performed by sending commands, defined in the IPMI specification,
to IPMI firmware. Typically, the commands are sent by a system manager, usually
a management console, over the LAN using the Remote Management Control Protocol
(RMCP) over UDP.
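As a concrete illustration (the specification does not prescribe any particular tool), the widely used open-source ipmitool utility wraps these RMCP/RMCP+ exchanges. A minimal Python sketch, assuming ipmitool is installed and using a hypothetical BMC address and credentials, might issue one such command as follows:

import subprocess

# Hypothetical BMC address and credentials, for illustration only.
cmd = ["ipmitool", "-I", "lanplus",        # "lanplus" selects an IPMI v2.0 RMCP+ session over UDP
       "-H", "192.168.1.100",
       "-U", "admin", "-P", "password",
       "chassis", "power", "status"]       # a chassis command defined by the IPMI specification
print(subprocess.run(cmd, capture_output=True, text=True).stdout)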

IPMI firmware not only responds to remote commands but also receives and logs
internal server events (system reboot, chassis open, etc.) in the System Event
Log (SEL). It also maintains Sensor Data Records (SDRs) that provide information
about the hardware by communicating (using IPMI) with sensors (temperature, fan
speed, voltage, etc.). A separate area also exists for the hardware inventory
(Field Replaceable Units), which stores information about the system’s hardware
components—useful for service and support as well as for asset/inventory
systems.
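To give a feel for how these record areas might be read in practice, the short sketch below (again assuming ipmitool and a hypothetical BMC address) dumps the SEL, the SDRs and the FRU inventory; the exact fields returned depend on the vendor’s firmware:

import subprocess

BMC = ["ipmitool", "-I", "lanplus", "-H", "192.168.1.100", "-U", "admin", "-P", "password"]

# Dump the three on-board record areas described above (hypothetical BMC address).
for subcmd, label in [(["sel", "list"], "System Event Log"),
                      (["sdr", "list"], "Sensor Data Records"),
                      (["fru", "print"], "FRU inventory")]:
    print(f"--- {label} ---")
    print(subprocess.run(BMC + subcmd, capture_output=True, text=True).stdout)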

A typical chassis may implement IPMI at two levels (Figure 2): a “full” BMC on
the main management module and on selected blades, which can communicate both
locally and remotely and provides local OS management access to the BMC; and a
“partial” BMC (sometimes referred to as a satellite or peripheral management
controller), which offers basic, local-only management services, with no remote
or OS access, for blades and/or for peripherals such as fans, power supplies,
front panels and LEDs.

Standardizing the Command Line Interface

In 2003, a roundtable of financial customers identified the need for a standard
command line interface (CLI) across servers. The Distributed Management Task Force
(DMTF) took on this challenge to deliver an open standard by creating a new working
group—the Server Management Working Group (SMWG). The overall goal was to
implement a model that is consistent across multiple vendor products.

The work was divided into various specifications. However, the focus for the initial
delivery was the command line interface: the System Management Architecture for
Server Hardware (SMASH) Command Line Protocol (CLP). SMASH v1.0 was released in
June 2005 as a preliminary specification for public review and comment. It is
expected to be supported in various products by the end of this year.

In emergency and/or ad hoc situations, system administrators often need to interactively
manage various systems. Servers that support the SMASH CLP allow administrators
to use generic Telnet and/or SSH clients directly to open interactive sessions.
Once logged in (using the server’s native security features), they can use SMASH
CLP “SHOW” commands to list the system resources that can be managed.
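For scripted rather than interactive use, the same kind of session can be driven programmatically. The sketch below is only illustrative: it assumes the MAP accepts SSH logins, uses the third-party paramiko library with hypothetical credentials, and issues a single SHOW command:

import time
import paramiko  # third-party SSH library, used here purely for illustration

# Hypothetical MAP address and account; real values come from the server vendor.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("10.0.0.50", username="admin", password="password")

shell = client.invoke_shell()        # CLP sessions are typically interactive
shell.send(b"show /system1\n")       # list the manageable resources under /system1
time.sleep(2)                        # crude wait for the MAP to respond
print(shell.recv(65535).decode(errors="ignore"))
client.close()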

CLP has a specific architecture. Figure 3 shows the major components. The
administrator at the CLP client uses Telnet or SSH2 to send CLP commands to
the Manageability Access Point (MAP). The MAP can exist in a dedicated or shared
processor or as a process or service elsewhere in the system or network. This
flexibility accommodates many different management designs that we will discuss
later.

The MAP provides authentication and authorization, processes commands and responses,
resolves addresses, performs discovery and manages the sessions between the client
and the managed systems and elements. A MAP typically needs to support multiple
concurrent sessions for managed elements.

Typically, the hardware vendor describes these managed elements, although third-party
vendors can also do so. CLP conforms to the object model described by the Common
Information Model (CIM), v2.9. The object model definitions that describe these
managed elements are stored as profiles. Profiles can also contain sub-profiles,
and associations between objects are also described. As an example, a storage
subsystem has associations with boot options and power state control, which are
described in sub-profiles.

The CLP syntax breaks down as follows: <verb> <options> <target> <properties>

For example: SHOW -l all /system1/nic1 speed porttype

Where:

<verb> is the command verb; in this example, SHOW, which retrieves system information.

<options> affect the action, behavior or output of the verb; here, “-l all” lists all levels.

<target> is the implicitly or explicitly identified managed element the command is
directed to; here, /system1/nic1, the address of the target server’s NIC.

<properties> are attributes of the target relative to the command execution; here,
“speed porttype”, which would show the speed (10/100, etc.) and type of NIC port
(Ethernet, etc.) for nic1.
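A trivial helper, not part of the SMASH specification and shown only to make the four-part structure explicit, could assemble such a command string:

def clp_command(verb, options, target, properties):
    """Assemble a CLP command string from its four syntactic parts (illustrative helper only)."""
    return " ".join([verb, *options, target, *properties])

# Reproduces the example above: SHOW -l all /system1/nic1 speed porttype
print(clp_command("SHOW", ["-l", "all"], "/system1/nic1", ["speed", "porttype"]))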

Bringing IPMI and SMASH Together

Using IPMI and SMASH together enables a comprehensive system management strategy.
The following scenarios cover a large cluster, a distributed branch office environment
and a mixed rack of 1U servers and blades. The flexibility that enables these
scenarios is shown in Figure 4.

In clusters of servers, compute cycles equal money. The more the merrier, until
servers go down. So, how do you diagnose and/or power cycle these servers when
the OS has hung? With so many servers, it’s difficult to spot issues. IPMI-enabled
servers can help spot problems ahead of time. By configuring IPMI thresholds within
the servers, potential heat and power issues can be recorded and alerts sent to
management consoles ahead of meltdowns, allowing time to fix the problem. This
predictive alerting gives cluster managers a chance to keep compute cycles, and
dollars, at maximum. Power cycling can be achieved by executing a “Power Cycle”
SMASH script against the IPMI firmware in the servers over a Telnet session.
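A minimal sketch of such a recovery step, assuming ipmitool and a hypothetical list of node BMC addresses (a CLP session to each node’s MAP could be used instead), might look like this:

import subprocess

# Hypothetical BMC addresses of hung nodes; credentials for illustration only.
hung_nodes = ["10.1.0.11", "10.1.0.12", "10.1.0.13"]

for host in hung_nodes:
    # Power-cycle each node out-of-band; the state of the host OS is irrelevant to the BMC.
    subprocess.run(["ipmitool", "-I", "lanplus", "-H", host,
                    "-U", "admin", "-P", "password",
                    "chassis", "power", "cycle"], check=True)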

Now, suppose a branch office doesn’t have the expertise or personnel to keep an
eye on its systems 24/7. Security in the branch is also an issue. So how do you
fill the gap to “be there” whenever necessary? By placing an appliance out at the
branch, you can aggregate alerts and secure access to a single point. Receiving
an IPMI “security alert” back at the network operations center (NOC) indicates
that someone just popped open the server chassis. If that appliance supports SMASH,
the same SMASH scripts can be run by NOC staff, irrespective of the model or vendor
of the branch servers. Just open a Telnet session to the SMASH appliance and run
the “health check” SMASH script, which performs a health check by pulling IPMI
information. Or run the “Inventory Scan” SMASH script to pull the detailed inventory
information that IPMI maintains in each box, which can help identify whether changes
have been made out in the field.

Today, blades are invading the 1U rack. But don’t they require new management
techniques? How do you support the various consolidation projects that are forcing
“scale up” exercises in the same rack? IPMI and SMASH were designed
with flexibility in mind. They don’t care whether they are on the blade,
the motherboard, a plug-in card or in the blade’s chassis manager. To the
administrator (and his/her scripts) it all looks the same.

So what happens when the rack experiences an “event?” Realizing that
access to the OS consoles can be valuable to see where the problems are, a manager
can take advantage of the IPMI and SMASH capabilities from day one. A “stream
console over LAN” SMASH script will open multiple OS console sessions and
record what the OS consoles were doing right before the failure. It is also possible
to set up thresholds to check on overall system hardware health. By using both
IPMI and SMASH, the problem can immediately be handed to the software team when
the root cause is an OS issue requiring a patch.
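The console capability described here corresponds to IPMI’s Serial Over LAN (SOL) feature, standardized in IPMI v2.0. A hedged sketch, again assuming ipmitool and hypothetical addresses, attaches to one server’s console and preserves its output:

import subprocess

# Attach to a node's console via IPMI Serial Over LAN and preserve the output to a file,
# so the console state leading up to a failure can be reviewed later.
with open("node1_console.log", "w") as log:
    subprocess.run(["ipmitool", "-I", "lanplus", "-H", "10.1.0.11",
                    "-U", "admin", "-P", "password",
                    "sol", "activate"], stdout=log)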

There are big benefits server designers can offer by including IPMI and SMASH
in their designs. They can pass these benefits on to IT because a consistent
command line that changes little over time reduces training requirements as well
as mistakes. The need to buy fewer management tools further lowers purchase and
training costs. In addition, fewer scripts will be needed to perform management
tasks across multiple server vendors. Policies developed with management standards
should make it easier to predict hardware failures and to schedule downtime during
non-peak hours. Diagnosing an issue before dispatching service personnel also makes
it more likely that they will arrive at the site with the right parts, all of which
means a reduction in mean time to repair (MTTR), more uptime and more dollars.

Avocent
San Jose, CA.
(408) 436-6333.
[www.avocent.com].