Safety-critical RTOS demands Verified Correctness



The real-time operating systems used in safety-critical applications must offer more than fast, deterministic real-time response. They must offer time and space partitioning facilities that ensure that critical tasks run to completion and prevent errant and malicious code from corrupting critical functions. They must also undergo rigorous design and development processes that verify correctness.

Greg Rose, vice president of marketing, DDC-I

From avionics and automotive control to medical instrumentation, there is a special class of applications where fast and available are not enough. In these life-or-death applications, the software must also be safety critical. That means the software is verified correct to minimize the chance of failure, and that if it does fail, the design assurance level and system safety assessment minimize the threat to life and limb.

Undergirding this special breed of software is the safety-critical real-time operating system, which ensures that applications have guaranteed access to computer resources, and that if they fail, the damage is contained and the most critical functions continue to run. These RTOSes must not only be fast, secure, and reliable, but must also exhibit verified correctness through a rigid design and development process. A safety-critical RTOS must adhere to the most stringent development, analysis, verification, and test methodologies, and undergo the most extensive certification processes by industry and governmental certification authorities.

Verifying and certifying safety criticality

Nowhere is the demand for safety criticality greater than in avionics, where DO-178C sets the guidance for verification and certification. DO-178C specifies five levels of design assurance (Level A to Level E), with Level A being the most stringent. Software that can result in a catastrophic failure and loss of life must meet Level A, where the probability of failure cannot exceed one in 10^9 per operating hour. At the other end of the design assurance level (DAL) spectrum is Level E, which applies to software that cannot affect safe operation of the aircraft, such as software that passively collects maintenance data over time for later analysis.

DAL A verification ensures that software performs its intended function to an appropriate degree of confidence, detecting and reporting software errors that have been introduced during the software development process. It confirms that all executable code traces to system, software architecture, and source code requirements, and that 100% of the code paths have been tested. This becomes particularly important as the RTOS and application software evolve, making it easier to identify and eliminate dead code that can creep into the design as new features, upgrades and fixes are added.

Verification consists of three activities: review, testing and analysis. Review is a qualitative assessment of compliance with requirements, architecture and verifiability to ensure accuracy, correctness, consistency and completeness.

Testing demonstrates that the software meets requirements, and to a given degree of confidence, that errors that could lead to unacceptable failure conditions (i.e., incorrect software, incorrect requirements, incorrect test case) have been removed. Hardware/software integration tests verify correct operation in the target environment. Software integration tests verify software interfaces and interdependencies such as initialization, control and data coupling.

Analysis is a quantitative assessment of accuracy, correctness, consistency and completeness that utilizes test analysis, coverage analysis, and traceability analysis. The purpose of structural coverage analysis, which encompasses statement coverage, decision coverage, and modified condition decision coverage (MC/DC), is to ensure adequacy of the test set — that sufficient testing has been done for the desired assurance level.
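To make the MC/DC criterion concrete, here is a minimal sketch in Python. The decision `a and (b or c)` and the test vectors are hypothetical and chosen only for illustration. The check uses the unique-cause form of MC/DC: for every condition there must be a pair of tests that differ only in that condition and produce different decision outcomes.

```python
def decision(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical decision under test (not taken from any real system)."""
    return a and (b or c)


def independence_pairs(tests, index):
    """Pairs of test vectors that differ only in condition `index`
    and give different decision outcomes."""
    pairs = []
    for i, t1 in enumerate(tests):
        for t2 in tests[i + 1:]:
            differs = [k for k in range(len(t1)) if t1[k] != t2[k]]
            if differs == [index] and decision(*t1) != decision(*t2):
                pairs.append((t1, t2))
    return pairs


def satisfies_mcdc(tests, n_conditions=3):
    """True if every condition has at least one independence pair."""
    return all(independence_pairs(tests, i) for i in range(n_conditions))


# Four vectors (n + 1 for n = 3 conditions) rather than all 2**3 = 8.
candidate = [
    (True, True, False),   # decision is True
    (False, True, False),  # only a changed, decision is False
    (True, False, False),  # only b changed, decision is False
    (True, False, True),   # only c changed from the previous vector, decision is True
]

# Two vectors that exercise both outcomes but change every condition at once.
weak = [(True, True, True), (False, False, False)]

print(satisfies_mcdc(candidate))  # True
print(satisfies_mcdc(weak))       # False
```

The second test set drives the decision to both outcomes, which is enough for decision coverage, yet it cannot show that any single condition independently affects the result. That gap is what MC/DC is meant to close.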

Traceability

Traceability analysis makes it easier for designers to verify that software developed using techniques like UML and mathematical modeling, object-oriented programming, and formal methods achieves the desired level of safety criticality. It ensures that every requirement is implemented and tested, and that every line of code has a reason to be (all code traces to at least one requirement).

Traceability must be top-down and bottom-up, from models and requirements down to each line of code, and back from the code to the requirements and model, including all intermediate work products and test cases. Traceability also requires that the executable code be intact relative to the source code. Many compilers, for example, add branch points in the executable code that are not present in the original source code. These branch points must be identified and tested. Conversely, some optimizations can remove constructs that are present in the source, particularly data and especially static data.
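The two directions of traceability can be sketched as a simple set comparison. This is an illustration only: the requirement IDs and function names below are hypothetical, and real projects keep this data in a requirements management tool rather than a dictionary.

```python
def trace_gaps(req_to_code, code_units):
    """Check traceability in both directions.

    req_to_code: dict mapping a requirement id to the set of code units
                 that implement it.
    code_units:  set of all code units present in the build.
    Returns (requirements with no code, code with no requirement).
    """
    # Top-down: every requirement must be implemented by something.
    unimplemented = sorted(r for r, units in req_to_code.items() if not units)
    # Bottom-up: every code unit must trace to at least one requirement.
    traced = set().union(*req_to_code.values())
    untraced_code = sorted(code_units - traced)
    return unimplemented, untraced_code


# Hypothetical trace data.
requirements = {
    "REQ-1": {"init_partition"},
    "REQ-2": {"init_partition", "raise_budget_exception"},
    "REQ-3": set(),  # written down but never implemented
}
build = {"init_partition", "raise_budget_exception", "debug_dump"}

unimplemented, untraced_code = trace_gaps(requirements, build)
print(unimplemented)   # ['REQ-3']
print(untraced_code)   # ['debug_dump']
```

Here `debug_dump` is the kind of code with no reason to be that bottom-up traceability is meant to expose, while `REQ-3` is a top-down gap.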

Time and space partitioning

Unlike conventional RTOSes, which utilize purely pre-emptive, interrupt-driven scheduling, safety-critical RTOSes like Deos may utilize a hybrid approach that combines pre-emptive scheduling with time partitioning. In this model, safety-critical tasks are budgeted a fixed period of time that guarantees sufficient time to execute. If they exceed their budget, an exception is raised that can be handled appropriately by the system. Remaining tasks run within their time budgets to completion, or until they are pre-empted by a higher priority task.
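The budget-enforcement idea can be sketched as a small simulation. This is not the Deos API; the task names, tick counts, and the `BudgetExceeded` exception are assumptions made for illustration, and pre-emption by priority is left out to keep the sketch short.

```python
class BudgetExceeded(Exception):
    """Raised when a task asks for more time than its budget allows."""


def run_with_budget(name, demand_ticks, budget_ticks):
    """Return the ticks used, or raise if the task would overrun its budget."""
    if demand_ticks > budget_ticks:
        raise BudgetExceeded(
            f"{name} needed {demand_ticks} ticks, budget is {budget_ticks}"
        )
    return demand_ticks


def run_frame(tasks):
    """tasks: list of (name, demand_ticks, budget_ticks) in run order.

    Returns (ticks used per task, names of tasks that overran).
    """
    used, faults = {}, []
    for name, demand, budget in tasks:
        try:
            used[name] = run_with_budget(name, demand, budget)
        except BudgetExceeded:
            used[name] = budget   # the task is stopped at its budget boundary
            faults.append(name)   # the system decides how to handle the fault
    return used, faults


# Hypothetical frame: "nav" wants 7 ticks but is budgeted only 4.
frame = [("flight_control", 3, 5), ("nav", 7, 4), ("maintenance_log", 2, 2)]
used, faults = run_frame(frame)
print(used)    # {'flight_control': 3, 'nav': 4, 'maintenance_log': 2}
print(faults)  # ['nav']
```

The point of the sketch is that the overrun in `nav` is contained: it is reported as a fault, and `maintenance_log` still receives its full budget.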

Most safety-critical RTOSes utilize a partitioned memory architecture that allows programmers to run the kernel and all critical application tasks each in their own separate memory partitions, where they cannot be corrupted by code running in other partitions. This not only enhances safety by isolating and containing failures, it also enhances security by preventing lower design assurance or malicious user code from accessing the memory allocated to critical tasks.

RTOSes such as Deos afford an extra level of protection that prevents developers of lower design assurance code or less experienced developers from allocating and accessing critical portions of memory. Typically, it is the least skilled developer who poses the greatest risk to system integrity. By defining access privileges in the system registry, the Deos platform integrator prevents novice developers, who would likely be working on tasks with the lowest safety-criticality requirements, from allocating memory reserved for critical tasks or critical resources.

Performance considerations

While the ability to respond quickly to external events is key to any real-time system, it is the ability to minimize worst-case execution time (WCET) for critical tasks while guaranteeing those tasks sufficient time to execute that characterizes safety-critical systems. For developers, managing shared resources is essential to minimizing WCET for critical tasks.

For example, on-chip cache allows processors to run at on-chip memory bus speeds and increases overall compute power. However, task switching and competition for cache resources can degrade cache performance and dramatically increase WCET. Benchmarks show that WCET can be three times higher than average-case execution time (ACET) on single-core processors, and an order of magnitude (or more) higher on multi-core processors due to cache effects.

Figure 1 —  By reducing cache interference, cache partitioning reduces both worst-case execution time (WCET) and the delta between WCET and average case execution time (ACET). This not only speeds cache access for critical tasks, but increases budgeting efficiency and CPU utilization.

 

To help programmers isolate safety-critical tasks from detrimental cache effects, DDC-I’s Deos safety-critical operating system utilizes a technique called cache partitioning (Figure 1). By setting aside dedicated sections of the cache for critical tasks (or groups of tasks), developers can reduce interference and provide timely, deterministic access to cache. This reduces WCET, thereby decreasing the amount of time that must be budgeted for critical tasks, maximizing the “guaranteed” execution time available to safety-critical tasks, and increasing CPU utilization.
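A short worked example shows why a lower WCET improves budgeting efficiency. All numbers are hypothetical: the 3x ratio between WCET and ACET mirrors the single-core figure quoted earlier, while the partitioned WCET of 1,500 microseconds is simply an assumed value and not a measured Deos result.

```python
def budgeted_fraction(wcet_us, period_us):
    """Fraction of each period that must be reserved for one task."""
    return wcet_us / period_us


def tasks_that_fit(wcet_us, period_us):
    """How many identical tasks can be budgeted within one period."""
    return period_us // wcet_us


PERIOD_US = 10_000          # assumed 10 ms period
ACET_US = 1_000             # assumed average-case execution time
WCET_SHARED_US = 3_000      # 3x ACET when the cache is shared
WCET_PARTITIONED_US = 1_500  # assumed WCET with a dedicated cache partition

print(budgeted_fraction(WCET_SHARED_US, PERIOD_US))       # 0.3
print(budgeted_fraction(WCET_PARTITIONED_US, PERIOD_US))  # 0.15
print(tasks_that_fit(WCET_SHARED_US, PERIOD_US))          # 3
print(tasks_that_fit(WCET_PARTITIONED_US, PERIOD_US))     # 6
```

Because budgets are set from WCET rather than ACET, halving the WCET in this example halves the time that must be reserved per task, even though average-case behavior is unchanged.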

Slack scheduling further increases performance by enabling programmers to harvest the unused time budgeted to time-critical tasks (Figure 2). While cache partitioning reduces the delta between WCET and ACET, time-critical tasks will on average still use less time than they are budgeted. Slack scheduling enables that unused time to be recouped on the fly and made available to other threads.

 

Figure 2 — Time-critical tasks typically use less time on average than they are budgeted worst case. Slack scheduling enables that unused time to be harvested in real time and made available to other threads, thereby boosting CPU utilization.
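The accounting behind slack scheduling can be sketched as follows. This is a simplified model rather than the Deos scheduler: the task names and tick counts are hypothetical, and slack is totaled once per frame here, whereas a real scheduler harvests it on the fly as each task finishes early.

```python
def simulate_frame(critical, slack_requests):
    """Distribute unused critical-task budget to slack consumers.

    critical:       list of (name, budget_ticks, actual_ticks).
    slack_requests: list of (name, wanted_ticks) for threads that run
                    only on harvested slack, in priority order.
    Returns (ticks granted per slack consumer, slack left over).
    """
    # Each critical task contributes whatever part of its budget it did not use.
    pool = sum(budget - min(actual, budget) for _, budget, actual in critical)

    granted = {}
    for name, wanted in slack_requests:
        granted[name] = min(wanted, pool)
        pool -= granted[name]
    return granted, pool


# Hypothetical frame: 15 ticks budgeted, 9 actually used, so 6 ticks of slack.
critical = [("flight_control", 5, 3), ("nav", 4, 4), ("display", 6, 2)]
requests = [("health_monitor", 4), ("logger", 5)]

granted, leftover = simulate_frame(critical, requests)
print(granted)   # {'health_monitor': 4, 'logger': 2}
print(leftover)  # 0
```

The critical tasks keep their full guaranteed budgets in every frame. Only time they did not use is handed to other threads, which is why slack scheduling raises CPU utilization without weakening the guarantees.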

 

FACE brings interoperability to safety-critical software

No discussion of safety-critical software development would be complete without a word on cost and time to market, where vendor lock-in and proprietary interfaces have exacerbated the already steep costs and long delays associated with safety-critical verification and certification. Leading the charge for cost containment in the software realm is the Future Airborne Capability Environment (FACE), a collaboration of government and industry charged with enhancing interoperability and portability across DoD avionics applications and platforms.

By establishing standards for software interfaces, interoperability and certification, FACE will reduce vendor lock-in, opening what have historically been sole-sourced software solutions from one vendor to interoperable solutions from multiple suppliers. This increased competition not only lowers per-program cost, but also makes it easier for program managers to take advantage of best-in-class technology and services. The new standards will also enhance portability and reuse, further reducing cost by making it easier to utilize software components across multiple platforms and programs.

As part of the FACE initiative, the consortium has developed a base profile for RTOSes that combines ARINC-653 and POSIX scheduling. The ARINC-653 interface provides the rigid fixed-in-time scheduling required for tasks with high safety criticality, while POSIX enables developers to quickly access third-party code for functions with lower or no safety-criticality requirements, such as maintenance functions.

DDC-I and On-Line Applications Research (OAR), the original developer of the RTEMS real-time operating system, have announced an integrated solution for the FACE Safety Base Profile that incorporates ARINC-653 and POSIX functionality running on DDC-I’s Deos. The integration features RTEMS hosted in a Deos time partition, giving safety-critical developers a DO-178C certifiable RTOS solution that delivers hard real-time response, time- and space-partitioning, and both POSIX and ARINC-653 interfaces.

Reverse engineered or safety critical by design

Most so-called safety-critical RTOSes are actually generic RTOSes reverse engineered to comply with DO-178C. Rather than starting with safety-critical requirements and producing RTOS code that is optimized for those requirements, this band-aid approach starts with RTOS code and generates requirements that are optimized for the code. Deos, by contrast, was developed from the ground up for safety-critical applications using RTCA DO-178B, Level A processes. This not only enhances safety-critical performance and functionality, but provides a streamlined path to DO-178 Level A certification.

Performance, functionality, cost and time to market will always top the list for developers seeking an optimal safety-critical COTS RTOS solution, but what really separates safety-critical RTOSes from their generic counterparts is verified correctness. Where safety is paramount, developers must be confident that their RTOS of choice has been scrutinized at the highest level, has run the gauntlet of analysis, test, and verification, and can meet the most demanding certification requirements.

url: www.ddci.com

About the author

Greg Rose is the vice president of product management and marketing at DDC-I. He has over 30 years of experience in marketing, product management, business development and engineering in embedded software, hardware and intellectual property licensing. Greg is a graduate of Iowa State University, where he earned a bachelor of science in electrical engineering.