Reliability is essential to the functionality of an electric power grid. It ensures that electric power of consistent quality and quantity flows from a provider to businesses, homes and more, and it is what enables electric power to drive life forward in modern society. There is therefore good reason to be concerned about events that threaten the reliability of the power grid.

Those events include misoperations. As explained by the Western Electricity Coordinating Council (WECC), a misoperation of a protection system throws a Bulk Electric System (BES) into a less reliable state. Such an event can then produce transmission outages and/or affect the grid’s broader reliability.

Of course, the scale of disruption and its subsequent impact on the grid vary from one event to another. In its “Electric Reliability Organization Event Analysis Process Version 4.0” report, for instance, the North American Electric Reliability Corporation (NERC) explained that a “Category 1” event could result in the misoperation of a BES or an unintended loss of generation of up to 1,999 megawatts. On the other end of the spectrum, a “Category 5” event could cause an unintended loss of generation of 10,000 megawatts or more. That raises the question: what is causing misoperations in U.S. electric organizations? And what can these entities do to address them?
Understanding the Sources of Misoperations
The North American Electric Reliability Corporation (NERC) helped to illuminate the cause of misoperations in its annual report for 2019. This publication referenced the 2019 State of Reliability, a study which found that protection system misoperations had continued a downward trend that began five years prior. The downward trend had persisted even though the misoperation rate had been slightly higher in 2018 than in 2017, at 8.0% compared to 7.4%.

NERC took its investigation a step further and examined the causes of misoperations. Through this effort, the organization learned that incorrect settings/logic/design errors had been the greatest source of misoperations at 30% of events. This source dwarfed relay failures/malfunctions and communication failures at 19% and 13% of events, respectively. AC systems and as-left personnel errors tied at 10% of misoperation incidents each, followed by unknown/unexplainable causes at 8%. DC systems came in last at 4%, two percentage points behind other/explainable sources.

So, what does NERC mean by “incorrect settings/logic/design errors”? In practice, these are misconfigurations and software bugs. Organizations can take steps to minimize the occurrence of these types of errors, of course. But even a redundant system may not help to mitigate a failure if proper testing has not been carried out first. Redundancy also does not exempt cyber assets from inclusion in standard coverage, per NERC’s CIP-002 requirement.
Figure 5.5: Misoperations by Cause Code (4Q 2013 through 3Q 2018), NERC’s 2019 State of Reliability page 54

NERC conveyed this point in a lesson learned report released in April 2020. According to the report, a facility suffered the loss of energy management system (EMS) functionality in its Automatic Generation Control (AGC) system. This disruption resulted from an untested software update deployed in both control centers in quick succession. The facility was in the process of performing a routine weekly update of its AGC system. This modification included a code change that updated the alarm text character length, which ended up generating alarm text longer than the maximum character limit. The error produced a run-time abort of a critical task as well as a loss of critical functionality.

The issue is that the organization applied the update at the primary control center without prior testing and then, right afterward, at the redundant control center. Because the change had not been tested, the same bug tripped the system in both control centers, causing a loss of control until the issue was identified and a script was applied. That script led the system to ignore the alarm until a permanent fix could be put in place.
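The failure mode in this incident, text that outgrows the limit its consumer expects, is the kind of defect a small pre-deployment test can catch. The sketch below is illustrative only: the report does not publish the EMS code, so the limit value and every name here (`MAX_ALARM_TEXT_LEN`, `format_alarm`, `safe_format_alarm`) are assumptions made for the example.

```python
# Hypothetical limit; the real value is not given in the lesson learned report.
MAX_ALARM_TEXT_LEN = 80


class AlarmTextTooLong(ValueError):
    """Raised when alarm text would exceed the limit the consumer expects."""


def format_alarm(point_name, message, limit=MAX_ALARM_TEXT_LEN):
    """Build alarm text and fail loudly if it would exceed the limit.

    Failing in a test environment is the goal: the overflow is found
    before the change reaches a control center.
    """
    text = f"{point_name}: {message}"
    if len(text) > limit:
        raise AlarmTextTooLong(
            f"alarm text is {len(text)} characters, limit is {limit}"
        )
    return text


def safe_format_alarm(point_name, message, limit=MAX_ALARM_TEXT_LEN):
    """Degrade gracefully: truncate over-long text instead of aborting."""
    text = f"{point_name}: {message}"
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."
```

A test that feeds `format_alarm` a deliberately over-long message would flag the regression in a test environment, while `safe_format_alarm` shows one possible defensive choice for production, where a truncated alarm is usually preferable to an aborted task.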
How Organizations Can Minimize the Occurrence of Misoperations
Fortunately, electric organizations aren’t powerless to reduce the risk of loss of operations and control. There are many things that they can do to ensure their reliability. These include the following guidelines:
- Test any updated piece of code appropriately before release. Likewise, deploy any new configuration in a test environment prior to live deployment. This type of exercise will reveal the kinds of issues that could arise in your production environment, so that you can respond appropriately and ensure that live deployment causes as little disruption as possible.
- Do not deploy an update or new configuration on redundant systems at nearly the same time. Leave enough time between deploying a change on your primary system and deploying it on your secondary system. Doing so will help to ensure that you can recover access to your systems as quickly as possible if an update/new configuration produces a misoperation event.
- Have a change monitoring tool in place that can assist in forensic analysis, so that you can identify the root cause as soon as possible should anything go wrong after deployment on any of your systems.
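The second guideline, spacing out deployments across redundant systems, can be sketched as a staged rollout. This is a minimal illustration under stated assumptions, not a description of any real EMS tooling: the `Site` class, the `staged_rollout` function, and the `apply_update`, `is_healthy` and `soak` callbacks are all hypothetical names, and the rollback shown (restoring a version string) stands in for whatever recovery procedure an organization actually uses.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical control center running some software version."""
    name: str
    version: str


def staged_rollout(sites, new_version, apply_update, is_healthy, soak):
    """Update sites one at a time and stop at the first unhealthy one.

    Returns the names of the sites that were updated and stayed healthy.
    Sites later in the list are never touched after a failure, so at
    least one system keeps running the known-good version.
    """
    updated = []
    for site in sites:
        previous = site.version
        apply_update(site, new_version)
        soak(site)  # observation period before touching the next site
        if not is_healthy(site):
            site.version = previous  # simplistic stand-in for a rollback
            return updated
        updated.append(site.name)
    return updated
```

In practice `soak` might wait hours or days while operators watch the updated system; the point of the design is that the redundant site is only updated after the first one has proven stable.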
Tripwire has file integrity monitoring (FIM) capabilities for both IT and OT assets that can further help organizations to minimize the risk of misoperations. Additionally, the Tripwire Enterprise application can help monitor changes to critical IT-based systems. Finally, Tripwire’s Industrial Visibility appliance can help organizations safeguard their OT assets.
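For readers unfamiliar with the idea behind file integrity monitoring, the general technique is to record a cryptographic hash of each file as a baseline and later compare the current state against it. The sketch below is a generic illustration using Python's standard library; it is not how any Tripwire product is implemented, and the function names `snapshot` and `diff` are chosen for this example only.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff(baseline, current):
    """Report which files were added, removed or modified since the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(k for k in shared if baseline[k] != current[k]),
    }
```

Taking a snapshot before a deployment and another after it gives a list of exactly which files changed, which narrows the search for a root cause when something goes wrong.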