Intrigued by human error in data centres, I came across David Smith’s book Reliability, Maintainability and Risk, in which he describes TESEO (an empirical technique to estimate operator failure) by G.C. Bello and V. Colombari.
The principle is that the probability of failure is the product of five factors, one chosen from each of the following groups.
1) Activity Difficulty
2) Time Stress
3) Operator Experience
4) Task Related Anxiety
5) Ergonomic Design
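As a rough illustration of the product rule above, here is a minimal Python sketch. The function name and the factor values in the example are my own assumptions for illustration, not figures taken from the published TESEO tables. Capping the result at 1.0 is also an assumption, added because a product of factors can exceed 1 while a probability cannot.

```python
import math


def teseo_failure_probability(activity, time_stress, operator, anxiety, ergonomics):
    """Multiply one factor from each of the five groups (hypothetical helper)."""
    factors = (activity, time_stress, operator, anxiety, ergonomics)
    if any(f <= 0 for f in factors):
        raise ValueError("every factor must be positive")
    # Assumption: cap at 1.0 so the result is still a valid probability.
    return min(1.0, math.prod(factors))


# Illustrative values only (assumed, not from the TESEO tables):
# the two calls differ only in the operator experience factor.
untrained = teseo_failure_probability(0.01, 1, 3, 1, 1)
trained = teseo_failure_probability(0.01, 1, 1, 1, 1)
print(f"untrained: {untrained:.3f}, trained: {trained:.3f}")
```

Because the model is a plain product, changing one factor scales the result by the ratio of the old and new values, which is where comparisons such as "reduces failures by a factor of 3" come from.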
In this table I have estimated example probabilities of failure for a UPS upgrade and a mains power restoration, each with untrained and trained operators.
Whilst this tool is quite simplistic, it does provide some interesting conclusions for data centres:
1) Regular site tests reduce failures by a factor of 10
2) Expert operatives (compared to average) reduce failures by half
3) Training reduces failures by a factor of 3
4) Regular site tests reduce anxiety-related failures by a factor of 1.5
5) Good visual display (ergonomics) can reduce failures by a factor of 1.5
More is required. How can this be achieved? How can it be monitored? How can people help each other? What will this look like in the future?