Human and management errors are the root cause of most failures and energy wastage in data centres. Learning curves for organisations and operators have been developed for several industries, including nuclear power, space travel, chemicals, aviation and medicine. Operator depth of experience can be improved through effective training, thereby decreasing failure rates, optimising energy performance and reducing staff turnover. The main issues are management complacency, inter-team communication, air management ownership and metrics, risk awareness, energy awareness, and over-reliance on procedures and on humans to do the “right thing”. By addressing the human element, significant reductions in total cost of ownership (TCO) have been achieved.
Operator and management errors are the root cause of most data centre failures, yet the operational team also has the power to deliver significant improvements in facility energy performance. Recognising the important role that people play in the industry, this article makes the case for investing in team development to enhance organisational and individual learning and experience, and to optimise data centre total cost of ownership.
Analysis of industrial failures has found that most failures are due to human error and that failure rate reduces with experience. As learning occurs, the failure rate decreases to a minimum level but eventually, after time, complacency leads to another failure; very experienced operators still make mistakes. This behaviour is represented by a bathtub-shaped curve, a similar pattern to that observed for some component failure rates.
Probability of failure vs. time
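The bathtub shape described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the failure rate is the sum of a floor, an exponentially decaying learning term and a slowly growing complacency term. The function name `failure_rate` and every parameter value are hypothetical and are not taken from the cited analysis.

```python
import math


def failure_rate(t, initial=0.20, floor=0.02, learn=1.0, complacency=0.002):
    """Illustrative bathtub-shaped failure rate after t years of experience.

    All parameters are assumed values chosen only to show the shape.
    """
    # Learning: the excess failure rate decays as experience accumulates.
    learning_term = (initial - floor) * math.exp(-learn * t)
    # Complacency: the failure rate creeps back up after a long quiet period.
    complacency_term = complacency * t ** 2
    return floor + learning_term + complacency_term


years = range(0, 11)
# Year of experience at which the modelled failure rate is lowest.
lowest = min(years, key=failure_rate)

for t in years:
    print(f"year {t:2d}: failure rate {failure_rate(t):.3f}")
print(f"lowest modelled failure rate at year {lowest}")
```

With these assumed values the rate starts high, falls as learning occurs, and rises again as the complacency term takes over, which is the qualitative pattern the curve is meant to convey.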
Statistics for the data centre industry also suggest that human error is the root cause of most outages. Data centres are typically designed with a resilient topology to improve availability; however, it is impossible to design out human operators entirely. In some cases, redundant infrastructure design can make the systems and their interactions highly complicated, making them more difficult to operate and potentially increasing the likelihood of mis-operation rather than improving reliability.
Operational team performance improves with experience and training. However, 100% uptime remains an aspiration and failure can never be completely eliminated. Knowing this, we should plan how to recover from the inevitable failure event, for example through scenario training, so that we are ready when it occurs. When there is a failure, what is the management response? Fire those responsible, or review what happened and use it as an opportunity to improve procedures? Ironically, if those responsible are fired, their replacements start at the beginning of the learning curve in terms of site knowledge, where the risk of failure is highest. This reactive response may not have the desired effect.
Response to failure
In recent years, the data centre industry has become increasingly focussed on energy efficiency in response to policy, corporate social responsibility and financial pressures. As with reliability, energy performance can be improved as experience and understanding grow. Often, data centre operators are reluctant to consider making improvements to their energy efficiency due to fears about what this could do to their reliability. This is understandable given the mission-critical nature of most facilities; having confidence that facility risks are well managed takes priority. However, there are plenty of improvements that can be implemented to safely reduce energy consumption, some of which also improve reliability. For example, in data hall air management, removing hot spots reduces the likelihood of hardware failure and allows operating temperatures to be increased, resulting in energy savings.
Experience is relevant to both organisations and individuals. At an organisational level this can be described by the number of data centre operating years and reflected by how effectively the processes in place deal with maintenance, change management etc. Within a team, the depth of experience of individuals is important and may be described by the number of years of relevant experience, their knowledge and attitude. Training has an important role to help enhance team knowledge but an individual’s attitude towards learning is also important. This is reflected in their openness towards learning new things, sharing with others and their ability to work as part of a team. In some cases you may find a human Single Point Of Failure (SPOF) where all the site knowledge rests with one person – the availability of that person presents a risk to the operational continuity of the facility.
Impact of organisational and individual depth of experience on failure rate and energy wastage
The learning process can be modelled as a continuous cycle with a number of elements reinforcing each other: Reflection, Theory, Practice and Experience.
The Learning Cycle
We reflect on a problem, analyse it theoretically, put it into practice, experience the outcome, then reflect on how well this addressed the issue.
It is possible to apply this model to the data centre industry, where the different parts of the cycle generally apply to different roles: the business reflects on its needs, the design consultant applies engineering theory to create a facility to fit those needs, the building contractor has to make this happen in practice and the operational team experiences the results. There are various points of handover where information is transferred: the client creates a design brief for the designer, who in turn writes a specification for the contractor; the contractor hands over the finished site to the operational team, who report back to the client.
The Learning Cycle and Project Roles
A common problem is that these handovers do not go smoothly: contractual or organisational barriers between the different stakeholders mean that knowledge is not effectively transferred. When trying to tackle risk or energy performance within an organisation, these barriers need to be removed; it is important to look at how to involve the right people and get them to interact in a collaborative way, allowing the organisation to benefit from their knowledge.
Data centre total cost of ownership comprises capital cost, operating cost (labour and energy bills) and reliability cost (failures have a financial impact on the business). The investment focus is often on the infrastructure rather than the human element, even though the operational team is central to the successful operation of the facility. By engaging the operational team, the facility reliability and energy performance can both be improved, saving the business money. Where the operational team has ownership of improvements, this also has a positive impact on staff motivation. As it matures, the industry is starting to recognise the importance of the human capital which supports it. It is a fast-moving, high-tech and often high-pressure environment; investing in developing team performance allows the full potential of the facility to be realised.
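The cost breakdown above can be turned into a rough worked example. The sketch below assumes that reliability cost can be approximated as expected failures per year multiplied by an average cost per failure. The class name `AnnualCosts` and all figures are hypothetical and serve only to show how investment in the team might be weighed against energy and reliability savings.

```python
from dataclasses import dataclass


@dataclass
class AnnualCosts:
    """Hypothetical annual cost model for one facility (currency units)."""
    capital: float            # annualised capital cost
    labour: float             # staffing, including any training spend
    energy: float             # energy bills
    failures_per_year: float  # expected number of failures per year
    cost_per_failure: float   # assumed average business impact of a failure

    def total(self) -> float:
        operating = self.labour + self.energy
        reliability = self.failures_per_year * self.cost_per_failure
        return self.capital + operating + reliability


# Illustrative figures only: none of these numbers come from the article.
baseline = AnnualCosts(1_000_000, 400_000, 600_000, 0.5, 500_000)
# Assumed effect of team development: higher labour cost (training),
# lower energy bill and fewer expected failures.
improved = AnnualCosts(1_000_000, 450_000, 540_000, 0.25, 500_000)

saving = baseline.total() - improved.total()
print(f"baseline TCO: {baseline.total():,.0f}")
print(f"improved TCO: {improved.total():,.0f}")
print(f"annual saving: {saving:,.0f}")
```

In this assumed scenario the extra training spend is outweighed by the combined energy and reliability savings; with different inputs the balance could of course go the other way.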
Operational Intelligence work with data centre operational teams to help them reduce their total cost of ownership.
 Duffey & Saull 2008 “Managing Risk: The human element”
 The Uptime Institute
Article originally published in the CIBSE Journal, April 2013.