Operational Intelligence was founded on the understanding that significant risk and energy reduction within the data centre environment could only be achieved through an active engagement with operations teams across all disciplines. Risk and energy reduction may be the responsibility of an individual, but it can only be delivered if there is commitment from all stakeholders.
By identifying risks and increasing stakeholder awareness of those risks it is possible to better manage and minimise risk impacts.
The biggest single barrier to risk reduction is the lack of knowledge sharing and risk awareness. Many sites have documentation on single points of failure (SPOF), failure mode and effect analysis (FMEA) and failure mode, effect and criticality analysis (FMECA), but in many cases these studies are not shared with the staff who have a hands-on interface with the systems. Therefore, the full value of these documents is not realised.
Figure 1 indicates the universal learning curve, applied to organisations and individuals. The accumulated experience of the company and the depth of experience of the individual interact and both are important in reducing risk and addressing energy wastage. Knowledge-sharing becomes more important as the uniqueness and complexity of systems increase, particularly where systems are beyond the existing knowledge base of the individual operator.
Figure 1: Universal Learning Curve – Organisations and Individuals
Kolb learning cycle
Kolb (the educational theorist) says that learning is best achieved when we move through all four quadrants of the Kolb Learning Cycle: reflection, theory, practice and experience, as shown in Figure 2.
Figure 2: Kolb Learning Cycle and the Construction Industry
When we consider how technical information is shared and transferred in the construction industry and compare this with the Kolb Learning Cycle, we see that different roles inhabit different areas and are usually separated by contractual boundaries. The transfer of information across these boundaries is rarely, if ever, perfect. At the point of handover from the installation/commissioning team to the operations team, much of the embedded project knowledge is lost and the operations team is left to look after a live and critical facility with only the benefit of a few hours of training and a set of record documents to support them.
It has been well established that human interface is the single biggest risk in the data centre environment and relying on an individual to do the right thing at the right time without any investment in site-specific training is likely to result in more failures and increased downtime. As an industry we should be considering how to improve the processes for handover of information at completion of a new project.
Data centres are delivered via a construction industry more used to working with offices, schools, hospitals, etc, where the entire project delivery process has remained much the same for more than 30 years.
This project delivery method is different in the process, aeronautical, nautical and space industries, where extensive performance trials and hours of training are required before the asset is accepted by the client. This is then followed by hours of scenario training throughout the life of the asset.
Integrated systems testing (IST) is now common on data centre projects but it is still very much the domain of the project delivery team, with generally only limited involvement of the operations team. The testing is used to satisfy a contractual requirement as opposed to imparting knowledge from the construction phase to the operations phase.
In many cases, particularly with legacy data centres, the operations team has little or no access to the designer or the installation contractor, which results in a shortfall in the transfer of knowledge to the people who need to operate the facility, optimise the performance of the systems and keep it live.
The consequence of this is that the operators don’t feel sufficiently informed to start making changes that may improve the energy performance of the facility, for fear of introducing risk. Another consequence of this is a lack of engagement by the operators, which introduces risk due to lack of awareness. This lack of awareness may not be an issue during times of stable operation, but at times of reduced resilience due to maintenance or failure events, operational errors may arise.
The issue is more evident where designs become complex with multiple operational scenarios and complex automatic control systems. This is further compounded where the monitoring systems are not comprehensive enough to provide real-time feedback.
The Uptime Institute has reviewed failure data for data centres and claims that 70% of failures are down to human error. Duffey and Saul, in their book Managing Risk: The Human Element, report that the figure for similar technology-dependent industries could be 80%. Both suggest that human error is responsible for most failures.
There has been sufficient research undertaken by others for us to be confident in the assertion that any system with a human interface will eventually fail. It is not, therefore, a case of eliminating failure but rather reducing the risk of failure by learning about the system and sharing that knowledge among those who are actively involved in its operation. Developing specific, facility-based knowledge is a fundamental requirement in reducing failure.
Traditionally in the facilities sector we place people in silos based on their discipline, experience and management position. Where a blame culture is adopted, these silos become fortresses and information is retained within them, so as not to give the ‘enemy’ an edge in the battle to avoid blame. This may seem like an old-fashioned and outdated attitude but it is very much evident in our industry.
If we are to reduce risk we must increase knowledge and awareness at all levels and areas of the business, and accept that failure is inevitable. So too are ‘near misses’. It is therefore important to maximise the opportunity to learn from these near misses.
When an incident occurs, it’s important we learn from it by adopting open and frank dialogue with all stakeholders and using that experience to prevent similar incidents occurring in the future. The function of management must be to create an environment where staff feel they have a voice and are recognised for their role in delivering a high-performing environment.
The skill sets of the staff with the most relevant knowledge may not, for example, include the ability to write a 2,000-word technical incident report; however, there should be a forum to facilitate that transfer of knowledge to someone who can. This can only happen in an open environment, free of a blame culture.
Before we consider system complexity it is necessary to consider that for a resilient system with no single points of failure, a failure event must be, by definition, the result of two or more simultaneous events. These can be component failures or incorrect human intervention.
A 2N system could be considered the minimum requirement to achieve a SPOF-free installation. For simplicity, we will assume our 2N system comprises A and B electrical and mechanical systems. Fault tree analysis (FTA) will highlight combinations of events that result in failure; however, it is very difficult to model human error in an FTA. The data used to model human error will always be subjective and the variables are infinite.
If in our 2N system, the systems are diverse throughout and physically separated, then any action on one system should have no impact on the other. However, it is not uncommon for ‘improvements’ to be introduced that take the simple 2N system and add in disaster recovery links, common storage vessels, etc, providing an interconnection between the A and B systems. Furthermore, the controls are enhanced so the A and B systems can’t be interlinked easily. On large-scale projects, this becomes an automatic control system (SCADA, BMS) as opposed to simple mechanical interlocks. The basic principles of 2N have been compromised and the complexity of the system has risen exponentially. So too have the skills required by the operations team.
A desktop review of the design would still show that a 2N design had been achieved; however, the resulting complexity and challenges of operability undermine the fundamental requirement of a high-availability design.
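The effect of an interconnection on a 2N design can be sketched with a brute-force search for minimal cut sets, that is, the smallest combinations of component failures that lose the load. This is only an illustrative sketch: the component names (`ups_a`, `chiller_a`, `common_tank` and so on) are hypothetical, and it assumes each string is lost as soon as any one of its components fails. Human error, which the text notes is very hard to model, is not represented.

```python
from itertools import combinations

# Hypothetical component names, chosen only for illustration.
A_STRING = {"ups_a", "chiller_a"}
B_STRING = {"ups_b", "chiller_b"}
SHARED = {"common_tank"}  # assumed interconnection between A and B


def pure_2n_fails(failed: frozenset) -> bool:
    """Load is lost only if both strings have lost at least one component."""
    return bool(failed & A_STRING) and bool(failed & B_STRING)


def interconnected_fails(failed: frozenset) -> bool:
    """As pure 2N, plus a shared element whose loss takes out both strings."""
    return bool(failed & SHARED) or pure_2n_fails(failed)


def minimal_cut_sets(components, system_fails):
    """Brute-force the minimal sets of failed components that lose the load."""
    found = []
    for size in range(1, len(components) + 1):
        for combo in combinations(sorted(components), size):
            candidate = frozenset(combo)
            # A superset of a known cut set is not minimal, so skip it.
            if any(cut <= candidate for cut in found):
                continue
            if system_fails(candidate):
                found.append(candidate)
    return found


pure = minimal_cut_sets(A_STRING | B_STRING, pure_2n_fails)
linked = minimal_cut_sets(A_STRING | B_STRING | SHARED, interconnected_fails)

print("pure 2N, smallest cut set size:", min(len(c) for c in pure))
print("interconnected, smallest cut set size:", min(len(c) for c in linked))
```

Under these assumptions the pure 2N model needs at least two simultaneous failures, one on each string, whereas the interconnected model contains a cut set of size one: the shared element has reintroduced a single point of failure even though the design still looks like 2N on paper.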
Often, the particular sequence of events that leads to a failure is unforeseen, and until it has occurred there was no knowledge that it could do so. In other words, these event sequences are unknown until they become known. Such sequences would not, therefore, form part of an FTA.
Ludwig Boltzmann developed an equation for entropy that has been applied to statistics and, in particular, to missing information, as shown in Figure 3.
Figure 3: Boltzmann equation, S = k log W (S=Entropy, k=Constant, log=log base 2, W=Number of possible states)
In this example we have eight boxes and one coin. Boltzmann allows us to determine the number of intelligent questions we must ask to locate the coin. In this case, applying the formula gives us an answer of three and we can see that this is correct:
1. Is it in the top or bottom row?
2. Is it in the left or right half of that row?
3. Is it in the left or right box of that half?
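The three questions above can be checked with a short sketch. It assumes the coin is equally likely to be in any box, takes k = 1 and a base-2 logarithm as in the Figure 3 caption, and treats W as the number of boxes. The function names are illustrative only.

```python
import math


def entropy_bits(num_boxes: int) -> float:
    """S = k log W with k = 1 and log base 2, where W is the number of boxes."""
    return math.log2(num_boxes)


def questions_needed(num_boxes: int) -> int:
    """Whole number of yes/no questions needed in the worst case."""
    return (num_boxes - 1).bit_length()


def locate_coin(num_boxes: int, coin_index: int) -> int:
    """Halve the candidate boxes with each question; return questions asked."""
    low, high = 0, num_boxes  # candidate boxes are low .. high - 1
    asked = 0
    while high - low > 1:
        mid = (low + high) // 2
        asked += 1  # "Is the coin in the lower half of the candidates?"
        if coin_index < mid:
            high = mid
        else:
            low = mid
    return asked


print(entropy_bits(8))
print(questions_needed(8))
print([locate_coin(8, position) for position in range(8)])
```

Each question halves the number of boxes that could still hold the coin, so eight boxes need three questions wherever the coin is hidden.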
If we substitute system components for the boxes and unknown failure events for the coins, we can consider how system availability is compromised by complexity.
In any system there are a number of component parts, and within these systems there will be unknown combinations of events that will lead to failure of the primary function. It should be noted that we are not dealing with risk analysis as in an FTA, but with reducing the number of unknowns.
Figure 4 shows a system of 100 variables, of which five unknown events occurring at the same time would result in a loss of business function. If the number of unknown events is reduced from five to four to three, there is a significant reduction in the ways in which the system can fail. So, increasing our detailed knowledge of systems and discovering unknown events will reduce the combinations in which the system can fail, therefore reducing risk. Note that the y axis is a log scale.
Figure 4: Component count, unknowns and combinations of failure events
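The scale of the reduction can be approximated with a short calculation. This sketch assumes the figure counts unordered combinations of simultaneous events drawn from the 100 variables, which the text does not state explicitly; the function name is illustrative.

```python
import math

VARIABLES = 100  # system size quoted in the text


def failure_combinations(variables: int, simultaneous_events: int) -> int:
    """Number of distinct ways to choose the simultaneous events (n choose k)."""
    return math.comb(variables, simultaneous_events)


for events in (5, 4, 3):
    print(events, failure_combinations(VARIABLES, events))
```

On this assumption, five unknown events give 75,287,520 combinations, four give 3,921,225 and three give 161,700, so each unknown that is discovered cuts the count by more than an order of magnitude.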
Figure 5 provides an illustrative comparison between three resilient systems: 2N, 2N with interconnections between the A and B strings, and 2N with multiple interconnections between the A and B strings. The interconnections have the effect of increasing the complexity (component count) of the system and increasing the number of unknown failure events. Note that the y axis is a log scale.
Figure 5: An illustrative comparison between three resilient systems
Human interface remains the single biggest risk in the data centre environment, so as an industry we need a better process for handing over information at the completion of a new project. The BSRIA ‘soft landings’ process is a good starting point for developing a better handover process than that currently adopted in the data centre industry.
Continuous site-specific training of staff will increase knowledge and identify unknown failure combinations. Both reduce the number of unknown failure combinations and resulting downtime. Complex systems increase the need for this training, but it would be far better if designs were simple, with local control and global monitoring. The old adage Keep It Simple, Stupid (KISS) remains an appropriate philosophy for the data centre industry. So, if 70% of data centre failures are the result of human error, making systems less complex seems a logical solution.
Article originally published in Data Centre Dynamics Focus Issue 20, July/August 2013