24 Apr 2016
System management controllers evolve into the control plane
Ron Wilson of Altera/Intel looks at how processing requirements are changing as system management controllers take on more tasks.
System management controllers would be straightforward to implement if it weren’t for decades of feature creep.
What began as simple managers of a single task - think a thermistor controlling a cooling fan in a rack - have become sophisticated embedded systems handling a portfolio of tasks including physical monitoring and control, remote configuration management, workload management, virtualization, reliability, and security.
Each of these tasks has become increasingly complex. For example, as CPU boards added multiple SoCs and DRAM DIMMs developed thermal issues, temperature measurement began to demand multiple sensors and a microcontroller (MCU). Once you have added that MCU, you might as well use its pulse-width modulation outputs to control the fan drivers. Feature creep is underway.
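The thermal-management loop described above can be sketched in a few lines of C. This is illustrative only: the thresholds, duty-cycle range, and function names are assumptions, not taken from any particular controller, and a real MCU would write the result to a PWM compare register rather than return it.

```c
/* Minimal sketch of an MCU fan-control loop: pick the hottest of
 * several sensors and map it linearly to a PWM duty cycle.
 * All thresholds and names are illustrative. */

#define FAN_MIN_DUTY 20   /* percent: keep the fan turning */
#define FAN_MAX_DUTY 100
#define TEMP_LOW_C   40   /* at or below this, run at minimum speed */
#define TEMP_HIGH_C  80   /* at or above this, run flat out */

/* Linear interpolation between the low and high thresholds. */
int fan_duty_for_temp(int hottest_c)
{
    if (hottest_c <= TEMP_LOW_C)  return FAN_MIN_DUTY;
    if (hottest_c >= TEMP_HIGH_C) return FAN_MAX_DUTY;
    return FAN_MIN_DUTY +
           (FAN_MAX_DUTY - FAN_MIN_DUTY) * (hottest_c - TEMP_LOW_C) /
           (TEMP_HIGH_C - TEMP_LOW_C);
}

/* Take the worst case across sensors, as a multi-SoC board must. */
int hottest(const int *temps_c, int n)
{
    int max = temps_c[0];
    for (int i = 1; i < n; i++)
        if (temps_c[i] > max)
            max = temps_c[i];
    return max;
}
```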
There’s been a similar progression in voltage monitoring, which initially meant keeping the CPU reset until VCC was within spec, and asserting a ‘power fail’ if it went out of spec again. But then SoCs began to need multiple supply rails with different tolerances, and sometimes strict power sequencing, too. IC makers developed mixed-signal power-management controllers to handle these tasks, although some system designers loaded them on to the existing MCU.
A further development, dynamic voltage-frequency scaling, meant the controller might have to change the supply voltage and clock frequency of a domain within the SoC in real time, freezing the clock until the new supply level was stable. Again, the task could be handled by a dedicated chip or the system-management MCU. Some systems have become so delicate that sensors must capture voltage waveforms or spectra, not just periodic level measurements, and pass them to a controller over something like an I2C bus.
A battery of issues
Battery management, particularly for mobile devices, introduces further system-management challenges. Modern batteries provide decent energy density and cycle life in exchange for behavioural issues including opacity about their true charge level, load and temperature sensitivity, and the possibility of catastrophic failure if mistreated.
Managing such cells can involve highly accurate voltage and current monitoring, the use of complicated state-estimator algorithms such as Kalman filters, load-balancing algorithms, and current-switching within the battery stack during charging and operation. Again, all this can be handled by a dedicated battery-management controller or within the increasingly burdened system-management controller.
Large systems also need to capture and report other physical measurements, such as fan speeds, cabinet intrusions, and hot-plug events. Other events, such as error flags on DIMMs and SoCs, need to be incorporated into the system-management strategy, too.
In small, autonomous systems, all this monitoring and control can be handled locally. But more complex systems need to be able to log routine data, report exceptions, and accept commands from a remote supervisor. For this reason, many board-management controllers have a communications protocol stack and a remote connection. This can be a simple serial port or, more often today, a sideband connection on the board’s system interface, be this PCI Express (PCIe) or Ethernet. This sideband port must keep working even if the board’s CPU is disabled.
Much of the work on networked board management has been done either by standards organizations, such as the PCI Industrial Computer Manufacturers Group (PICMG), creators of the Advanced Telecommunications Computing Architecture (ATCA) specification, or by data-center server developers such as Dell and HP. They see a network of board-management processors as fundamental to the operation of large switching and computing systems, and are using the network connections to build a control plane over the computing or switching hardware (Figure 1).
Figure 1. Connectivity and redundancy enable what had been a set of isolated MCUs to become a high-reliability network capable of powerful system management functions (Source: Altera)
This network brings opportunities beyond sensor logging. For example, you can send all the logged data into a big-data analysis system to predict failures. You can use the network to enable remote firmware updates, by giving the board-management controller write-access to the board’s flash memory. PICMG provides a standard interface for this - at least for the MCU firmware on the board - through its hardware platform management interface. And you can access the CPU through the board-management port, to virtually attach a CD-ROM drive or keyboard/video/mouse console, or to deliver CPU status updates through a serial output. All the information is packetized and conveyed over the Ethernet sideband. This enables an external device to monitor and control operating-system and even application activity.
A control plane
In network switching equipment, functions are often segregated between two sets of hardware. Functions that must work at wire speed - such as packet buffering, routing, and prioritization - are done in dedicated, configurable hardware in the data plane. Supervisory functions - such as building routing tables and managing queues - are done in software on CPUs in the control plane.
In the server world, applications run on server CPUs in the data plane. Supervisory functions - such as maintenance routines and configuration management - run on other server CPUs in a virtualized control plane. The connection between the two is the network of board-management processors that also manage the cabinet, cooling, and power.
The more capable board-management processors become, the more tempting it is to give them power over fast-paced local configuration and application-allocation decisions. But the more power the devices have, the greater the risk that they will become targets of attacks.
Even in much smaller systems, security is a major issue. For instance, authentication and encryption are necessary - even in a single-board system - to ensure that board-manager firmware updates are safe.
For designers in military and transportation systems, this scenario may sound familiar. We’ve described a physically separate network of processors that can monitor both physical quantities and the execution of application code. This network can help allocate resources, and manage the failure of critical tasks. In communications or computing, we could be talking about system-management hardware. In military or transportation design, we would be talking about a functional-safety subsystem.
Such mission-critical systems sometimes use separate, high-reliability hardware to check the state of the system, to monitor the external environment for risks that the system might do harm - say, exceeding the safe speed on a railway segment - and to intervene.
In larger systems these functional-safety tasks are often grouped with application tasks on the main CPUs, with lots of redundancy to ensure that even when hardware fails the functional-safety tasks still happen on time. But there are advantages to running the functional-safety tasks in a simpler, isolated environment, where it may be possible to prove formal assertions about the execution of the code.
The future of system management
Today’s most capable system-management processors are becoming a sophisticated control plane for large estates of communications and computing servers. But how did we get here?
A little embedded control loop became a multi-input data logger. It gained a network interface, and remote update and console capability. It began to monitor software execution, and acquired an operating system. Perhaps it began to work with the system hypervisor to manage virtual machines. And so we arrive where we are today.
What happens next to system management? As it takes on more capabilities and becomes more widely and deeply networked, it will become increasingly attractive to attackers. To mitigate this, we’re likely to see the increasing use of high-reliability hardware, redundancy strategies, and crypto processing to protect what will become the critical control plane for large amounts of communications and computing resources. Perhaps, in time, the system-management subsystem will also start borrowing from other disciplines, such as automotive design, and begin to assume some responsibility for the functional safety of large parts of our critical infrastructure.
It is all a long way from a thermistor, a fan and a rack.
About the author
Ron Wilson is editor-in-chief of the System Design Journal publication from Altera, which is now part of Intel. He has close to 40 years of experience in the electronics industry. Before joining Altera, he held a variety of editorial positions with EE Times, served as both editorial director and publisher of ISD Magazine, and wrote and edited for EDN Magazine, Computer Design, and Embedded Systems Design. Wilson holds a B.S. in Applied Science from Portland State University.