This is the title of an excellent book by Duffey and Saull that analyses the space, nuclear, aviation, chemical and other industries and reports that 80% of all failures are down to human error. This correlates well with the Uptime Institute’s reports that approximately 70% of data centre failures are attributed to human error.
Duffey and Saull construct a bathtub curve for the human failure rate and explain the Universal Learning Curve as an exponential curve that varies with both organisational experience and the operators’ depth of experience. Failure rates decline with experience, falling from an initially higher rate to plateau at a minimum failure rate. They explain that we learn from experience, and therefore that learning from failures allows us to reduce our failure rates.
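To make the shape of that curve concrete, here is a minimal sketch of an exponential learning curve: the failure rate starts at an initial value and decays towards a minimum plateau as experience accumulates. The function name, the parameter names and all of the numbers are my own illustrative assumptions, not figures taken from the book.

```python
import math


def failure_rate(experience, initial_rate, minimum_rate, learning_constant):
    """Exponential learning curve (illustrative sketch).

    The rate equals initial_rate at zero experience and decays
    towards minimum_rate as experience accumulates.
    """
    return minimum_rate + (initial_rate - minimum_rate) * math.exp(
        -learning_constant * experience
    )


# All values below are assumptions chosen only to show the shape of the curve.
INITIAL_RATE = 1e-3        # failures per hour for a novice (assumed)
MINIMUM_RATE = 1 / 200_000  # plateau failure rate per hour (assumed)
LEARNING_CONSTANT = 1e-4   # how quickly failures decline with experience (assumed)

for hours in (0, 10_000, 50_000, 200_000):
    rate = failure_rate(hours, INITIAL_RATE, MINIMUM_RATE, LEARNING_CONSTANT)
    print(f"{hours:>7} h of experience: {rate:.2e} failures per hour")
```

A larger learning constant stands for a better learning environment: the curve reaches the plateau sooner, but in this sketch it never drops below the minimum rate.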
Nevertheless, with time we ALL become complacent, and it is therefore better to plan for an inevitable rare failure (100,000 to 200,000 hours of accumulated experience) than to assume or believe that the failure will never happen.
The learning rate varies; some industries and organisations learn and others do not, depending on their “learning environment”. I wonder where the data centre industry stands in this respect? However, failure rate tends to correlate much more strongly with depth of operator experience, i.e. novice operators will have more failures than more experienced operators. I would describe depth of experience as a combination of relevant working experience, education (cognitive knowledge), skills (psychomotor) and affective learning (teamwork, awareness, knowledge sharing, communication etc.).
Whilst we certainly need to improve education and skills with regard to risk and energy, effective learning is also an important area to develop in the data centre industry.
How is effective learning achieved?
Why is training not learning?
Why effective learning is so difficult in the data centre industry and what to do about it