Taming the Wild Alarm System, Part 4

September 9, 2022

Bill Hollifield

Just How Bad is Your Alarm System?

And how can you tell? Alarm analysis has become commonplace. You have to analyze something to improve it. There are many analyses that provide great information for improving your alarm system. The most important one, though? It is all about alarm RATE. How many alarms are being generated, in how much time, and can the operator receiving them deal with that number?

Looking at the alarm rate is independent of the type of process that is being controlled. It doesn’t matter if you are making gasoline or aspirin or megawatts. All processes involve an operator monitoring and controlling factors like flow, temperature, pressure, composition, etc. All alarm rate analyses are normalized by looking at the alarms presented to a single human – the operator responsible for dealing with them.

This operator is “a single human at a time.” For example, looking at alarms per day for a continuously staffed process will involve either two or three humans in sequence, depending on whether there are two shifts or three in the 24-hour period. We call this staffing combination an “operating position.” It is common for a process to have more than one operating position, with different operators controlling different parts of the process and coordinating when necessary. It is a best practice, though, that each operator is sent only the alarms relevant to their span of control responsibility.

All alarms are a source of human-machine interaction. The human must detect the alarm, understand it, examine the process to understand why it is occurring, determine the right response, take that action, and continue to monitor the process to see that the chosen action is successful. This sequence takes time and thought.

Obviously, an operator can accomplish those steps if the alarm rate is one alarm per hour. Just as obviously, an operator cannot deal with one alarm per second (but we commonly see alarm rates far higher than one per second!). Remember, an alarm is about an abnormal condition that requires operator action to avoid a consequence. If an operator misses or cannot “get to” an alarm in time, then the related consequence will occur.

The best overall measure, one that tells the story of your current alarm performance to operators, engineers and managers, is the number of alarms per day for a single operating position. Here is a typical graph of that. At least a month’s worth of data creates a good chart. All data shown are from the analysis of customer alarm systems:

This graph is quite typical for an unimproved system. In fact, it is about a factor of five LESS than most such systems. Rates of 10,000 to 20,000 alarms per day are common, unfortunately.

The target lines at the bottom for 150 and 300 alarms per day are based on long-established guidelines for “acceptable” and “maximum manageable” daily rates of alarms for a single operator managing a typical process. On an hourly basis, this works out to between about six and 12 alarms. Some people feel this is a pretty low rate to shoot for, but think about it. Would you be happy if your control system were operating so poorly that every five minutes the operator had to be interrupted, analyze a situation, and take an action that avoids some sort of significant consequence? Would you be happy for them to be continually doing that rather than performing much more useful tasks like monitoring and adjusting the process to wring out those last few points of efficiency and profitability?
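As a rough illustration of the arithmetic above, the following Python sketch counts alarms per day for one operating position and compares each day against the 150 and 300 guideline lines. The function names and the sample timestamps are hypothetical; a real analysis would read annunciated alarm records from the control system's alarm log.

```python
from collections import Counter
from datetime import datetime

ACCEPTABLE_PER_DAY = 150      # "acceptable" guideline quoted in the text
MAX_MANAGEABLE_PER_DAY = 300  # "maximum manageable" guideline quoted in the text


def alarms_per_day(timestamps):
    """Count annunciated alarms per calendar day for one operating position."""
    return Counter(ts.date() for ts in timestamps)


def classify(daily_count):
    """Label a daily count against the two guideline lines."""
    if daily_count <= ACCEPTABLE_PER_DAY:
        return "acceptable"
    if daily_count <= MAX_MANAGEABLE_PER_DAY:
        return "manageable"
    return "above maximum manageable"


def hourly_equivalent(daily_count):
    """Convert a daily count to an average hourly rate."""
    return daily_count / 24


# Hypothetical sample: three alarms on one day, one on the next.
sample = [
    datetime(2022, 9, 1, 8, 15),
    datetime(2022, 9, 1, 8, 20),
    datetime(2022, 9, 1, 23, 59),
    datetime(2022, 9, 2, 0, 5),
]
for day, count in sorted(alarms_per_day(sample).items()):
    print(day, count, classify(count))
```

Dividing the two guideline values by 24 gives 6.25 and 12.5 alarms per hour, which is where the "about six and 12" figure comes from.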

Unimproved alarm systems are full of nuisance alarms and other junk, which cause these high rates. The nuisance alarms are solvable (see previous blogs with links at the end of this one) and alarm rationalization gets rid of the junk (the subject of a later blog). A good method to justify an alarm improvement project is to take the above graph and discuss it in this way: how many alarms were likely missed by the operator last week because of these high rates? What are the odds that the operator saw and responded to all the “actually important alarms” and only ignored the less important ones? Hoping and wishing for that is not a good strategy for success or safety.

Here’s an example of how some analyses can reveal several bad things at once. Some background: a popular brand of Distributed Control System (DCS) has the ability for the operator to suppress a configured alarm. This can be done in such a way that any occurrences of the alarm are still saved in the log (which is the source for analysis data) but those occurrences do NOT generate an annunciated alarm to the operator. The method for using this suppression is often uncontrolled, and there is not very good tracking or visibility as to which alarms are suppressed, or for how long. So, look at this graph:

The annunciated alarms seen by operators (blue line) are mostly down in the desired ranges of less than 300. If you just charted or looked at that alarm rate, you would think there was no problem at all. The real rate, however, was much higher. A separate analysis showed that 147 tags (points) with almost 500 alarms configured on them had been suppressed but were still generating invisible occurrences in large numbers. The operators, over time, had gotten rid of many nuisance alarms, but in an uncontrolled fashion that also included suppressing some important alarms. This is NOT the way to solve an alarm problem! The situation also revealed poor operating discipline and poor management of change of the control system. The answer is that rationalization must be applied to this system. All needed alarms must be unsuppressed, and engineering controls must be applied to the practice of alarm suppression.
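One way to expose this kind of hidden load is to chart two daily counts side by side: alarms actually annunciated, and all logged occurrences including suppressed ones. The sketch below assumes each log record carries a day, a tag, and a suppressed flag; those field names are hypothetical and will differ from one DCS log format to another.

```python
from collections import Counter, namedtuple

# Hypothetical record layout; real log fields depend on the control system.
AlarmEvent = namedtuple("AlarmEvent", "day tag suppressed")


def daily_rates(events):
    """Return (annunciated, total) alarm counts per day."""
    annunciated = Counter()
    total = Counter()
    for event in events:
        total[event.day] += 1
        if not event.suppressed:
            annunciated[event.day] += 1
    return annunciated, total


def suppressed_occurrences(events):
    """Count logged occurrences per tag that the operator never saw."""
    return Counter(e.tag for e in events if e.suppressed)


# Hypothetical sample data.
events = [
    AlarmEvent("2022-09-01", "FIC34.PVHI", False),
    AlarmEvent("2022-09-01", "LIC27.PVLO", True),
    AlarmEvent("2022-09-01", "LIC27.PVLO", True),
    AlarmEvent("2022-09-02", "TIC384.PVHI", False),
]
annunciated, total = daily_rates(events)
for day in sorted(total):
    print(day, "annunciated:", annunciated[day], "real:", total[day])
print(suppressed_occurrences(events).most_common())
```

A large gap between the two counts, or a long list of tags in the suppressed tally, is the signal to investigate how suppression is being used.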

Alarm rate averages can be misleading, and do not come close to telling the whole story. Alarm floods are periods of high alarm rates, defined here as more than 10 alarms in 10 minutes. During a severe flood, the alarm system becomes worthless: a nuisance distraction that can impede the operator’s ability to handle an upset. Alarm floods have preceded many major accidents. Here is a simple analysis of alarms per 10 minutes.

The green band at the bottom is “10.” You should actually have only a few peaks above that. But this data is from a system that was in alarm flood 96% of the time (which is more common than you think). Imagine you are the operator trying to solve a major process upset. However, the alarm system is going off every few seconds, sometimes in bursts of dozens of alarms. You would want to turn off the whole thing. It is not an effective tool to help you get the process back on track.
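A minimal sketch of an alarms-per-10-minutes analysis follows. It assumes fixed, non-overlapping 10-minute bins and uses the "more than 10 alarms in 10 minutes" threshold given above; the function names and sample data are hypothetical, and a production tool may define flood start and end differently.

```python
import math
from datetime import datetime, timedelta

FLOOD_THRESHOLD = 10             # more than 10 alarms in a bin counts as flood
BIN = timedelta(minutes=10)      # assumed fixed, non-overlapping bins


def alarms_per_bin(timestamps, start, end):
    """Count alarms in each 10-minute bin of [start, end), including empty bins."""
    n_bins = math.ceil((end - start) / BIN)
    counts = [0] * n_bins
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // BIN] += 1
    return counts


def flood_fraction(counts):
    """Fraction of bins whose alarm count exceeds the flood threshold."""
    if not counts:
        return 0.0
    return sum(c > FLOOD_THRESHOLD for c in counts) / len(counts)


# Hypothetical sample: 12 alarms in the first bin, 2 in the second, 0 in the third.
start = datetime(2022, 9, 1, 0, 0)
end = start + timedelta(minutes=30)
sample = [start + timedelta(seconds=30 * i) for i in range(12)]
sample += [start + timedelta(minutes=11), start + timedelta(minutes=15)]

counts = alarms_per_bin(sample, start, end)
print(counts, f"{flood_fraction(counts):.0%} of bins in flood")
```

Empty bins are kept in the denominator on purpose, so quiet periods count toward the time the system was not in flood.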

The Alarm Management Handbook covers these analyses and many more, such as:

Most frequent alarms
Stale alarms (that have been in effect continuously for days or weeks)
Alarm priority distribution (compared to best practices)
Alarm flood analysis (duration and quantity)
Breakdown by type of alarm (process value alarms, instrument malfunction alarms, etc.)
Correlated alarms (that always occur close together)
Alarm configuration changes (that should have been made and documented in the Management of Change (MOC) procedures. Lots of surprises found here!)
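The first item in the list, most frequent alarms, is simple to sketch. The example below assumes the alarm log has already been reduced to a list of alarm identifiers, one entry per occurrence; the identifiers shown are hypothetical.

```python
from collections import Counter


def top_alarms(alarm_ids, n=10):
    """Return (alarm_id, count, percent_of_total) for the n most frequent alarms."""
    counts = Counter(alarm_ids)
    total = sum(counts.values())
    return [
        (alarm_id, count, round(100 * count / total, 1))
        for alarm_id, count in counts.most_common(n)
    ]


# Hypothetical sample: one alarm dominates the load.
occurrences = ["LI100.HI"] * 6 + ["PI200.LO"] * 3 + ["TI300.HI"]
for alarm_id, count, pct in top_alarms(occurrences, n=2):
    print(alarm_id, count, f"{pct}%")
```

Sorting by share of the total load makes it easy to see how few alarms account for most of the occurrences.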

The ISA-18.2 and IEC 62682 alarm management standards contain the following table of recommended metrics. It is mandatory to monitor your alarm system performance, but you can determine your own key performance indicators. The table is preceded by this caveat: the target metrics described below are approximate and depend upon many factors (e.g., process type, operator skill, Human Machine Interface (HMI), degree of automation, operating environment, types and significance of the alarms produced). Maximum acceptable numbers could be significantly lower or perhaps slightly higher depending upon these factors. Alarm rate alone is not an indicator of acceptability.

Let’s consider a couple of lesser-known but useful analyses. Control systems log a lot more than just alarms. They record a variety of operator actions. If you have Hexagon’s alarm management software, there are many other useful things you can analyze and report automatically. These can give you insight into the challenges faced by your operators.

Controller Mode Analysis: process controllers have multiple modes with the most common being “in AUTO” or “in MANUAL”. You spent the money to install a controller because you wanted it to run in AUTO. A mode analysis can show you what percentage of time each controller is in its different modes. You will likely find dozens that are run in MANUAL most of the time! Why? Because they either run better in MANUAL than in AUTO (often true), or the operators THINK that they do! Either way, you have just found a rich source for inexpensive improvements. Make those controllers work!

This table represents controller mode changes for a single operating position for one week.

Point	Normal Mode	Change Count	% of Time in Normal Mode	% of Time in MANUAL Mode
LIC 27	AUTO	111	96.5	3.5
FIC34	CAS	105	23.6	20.9
LIC 117	AUTO	101	74.8	25.2
PIC654	AUTO	78	99.8	0.2
FIC78	AUTO	74	0	1.5
LIC200	AUTO	70	99.3	0.7
FIC77	CAS	60	18.3	19.2
LIC01	AUTO	54	15.2	0.9
FIC288	CAS	54	20.7	3.5
TIC384	AUTO	49	59.7	31.7
LIC2088	AUTO	48	65.6	31.4
TIC309	AUTO	45	89.7	3.6
FIC897	CAS	45	7.2	44.8
FIX12	CAS	40	0	7.3
LIC611A	AUTO	38	95.8	0.3
FOC55	CAS	38	10.8	47
FIC22	CAS	37	0	0.7
FIC980	CAS	37	0	0.2
FIC400	CAS	36	32.7	21.9
FIC1000	CAS	34	26.6	31.1

Several interesting things are shown here. First, there are three controllers that experienced more than 100 mode changes in the week’s data. Is it possible they were designed to be operated in such a way? Unlikely. In fact, imagine asking your best control engineer to design a controller where the correct thing for the operator to do is to change its mode about 100 times a week. You would get a funny look in return. But analysis will usually find many controllers like this.

Operators do not change controller mode without a reason. They do not do it just for fun or to occupy their time; they perceive (rightly or wrongly) a need to do so. Therefore, these controllers are not operating as designed. They need to be investigated and fixed. Otherwise, you have just wasted the investment of installing them in the first place.
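A controller mode analysis of this kind can be sketched from the logged mode-change events for a single controller. The function below is an illustration under stated assumptions: the events are sorted by time, fall inside the reporting period, and the mode at the start of the period is known. The names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def mode_summary(initial_mode, changes, start, end):
    """Return (change_count, percent_of_time_by_mode) for one controller.

    changes: list of (timestamp, new_mode), sorted by time, within [start, end).
    """
    seconds = defaultdict(float)
    mode, since = initial_mode, start
    for ts, new_mode in changes:
        seconds[mode] += (ts - since).total_seconds()  # close out the old mode
        mode, since = new_mode, ts
    seconds[mode] += (end - since).total_seconds()     # time in the final mode
    total = (end - start).total_seconds()
    percent = {m: round(100 * s / total, 1) for m, s in seconds.items()}
    return len(changes), percent


# Hypothetical one-day example: AUTO, then MANUAL for six hours, then AUTO again.
start = datetime(2022, 9, 1)
end = datetime(2022, 9, 2)
changes = [
    (datetime(2022, 9, 1, 6, 0), "MANUAL"),
    (datetime(2022, 9, 1, 12, 0), "AUTO"),
]
count, percent = mode_summary("AUTO", changes, start, end)
print(count, percent)
```

Running this per controller over a week and sorting by change count would produce a table shaped like the one above.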

Operator Action Analysis: DCSs capture all the operator’s interactions with the control system. Some of those directly affect the process and others do not. The ones that always do are:

Adjusting a controller setpoint
Changing a controller mode
Directly controlling the OUTPUT of a controller when placed in MANUAL
Manually initiating an ON-OFF or similar discrete action (using a digital output point)

These represent direct manipulation of the process by the operator. Now, we all know that you are making the most money when the process is running smoothly. Stable and optimally performing processes will have low operator change rates. How many such changes are your operators making every hour? Have you looked? When operator change rates become high, it could be that not enough thought is being given to each change. Here is an example of such a chart, showing 114 days. This operating position is probably a bit overloaded.
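Counting operator changes per hour can be sketched as follows. The action type labels are hypothetical stand-ins for the four categories listed above, and "ACK" is an assumed example of a logged interaction that does not directly affect the process.

```python
from collections import Counter
from datetime import datetime

# Hypothetical labels for the four action types that directly affect the process.
PROCESS_ACTIONS = {"SETPOINT", "MODE", "OUTPUT", "DISCRETE"}


def changes_per_hour(actions):
    """Count process-affecting operator actions in each clock hour.

    actions: iterable of (timestamp, action_type).
    """
    counts = Counter()
    for ts, kind in actions:
        if kind in PROCESS_ACTIONS:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    return counts


# Hypothetical sample log for one operating position.
log = [
    (datetime(2022, 9, 1, 8, 5), "SETPOINT"),
    (datetime(2022, 9, 1, 8, 40), "MODE"),
    (datetime(2022, 9, 1, 8, 41), "ACK"),       # not a process change
    (datetime(2022, 9, 1, 9, 10), "OUTPUT"),
]
for hour, n in sorted(changes_per_hour(log).items()):
    print(hour, n)
```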

Once you have eliminated nuisance alarm behaviors (like chattering), and after rationalization has taken the meaningless junk out, the alarm system then consists only of indications of abnormal situations that require an operator response. At that point, high rates of valid alarms show the control system is incapable of keeping the process within boundaries that do not require operator intervention to avoid consequences. At that point, the solution to high alarm rates is improving the control of the process, not messing with the alarm system.

Alarm analysis is important. It can quickly direct you to the improvements you need the most. And with modern software, all the resulting reports can be easily automated.

Please contact us for more information or if you have questions.

Review other Taming the Wild Alarm System topics in this blog series: