Data Centre Operations Blog

The Gartner Data centre power and cooling technologies hype cycle

For those thinking of investing in new technologies for data centres, this is a must-read document. It describes most technologies and concludes that mainly those related to the fundamental issues of cooling and UPS are likely to survive the hype cycle.

[Figure: Gartner hype cycle for data centre power and cooling technologies (reproduced by courtesy of Gartner)]
As the applications of these technologies are changing fast, this work should be updated regularly. It would be useful to explain the differences in outcome between ‘on-site generation’, ‘low-carbon technology’, ‘renewable energy’ and ‘energy efficiency’, and how free cooling affects each. That would help clarify the following points:

  • Cogeneration is only renewable if it uses biofuel, and its relative efficiency depends on how well each national grid performs. I would also challenge the statement that it “dramatically increases the energy and carbon efficiency of data centres”.
  • CFD and air management are only energy enablers; they do not save energy in their own right.

One observation is that the rotary heat exchanger (Kyoto wheel) is a semi-indirect air system as there is outdoor air (although less) entering the data hall.

Another observation is that “in row cooling” is only one method of physical air containment and that there are many other methods that use air-side and water-side free cooling.

For a legacy data centre with a PUE of around 2, the UPS losses will typically be significantly smaller (3 to 10 times less) than the energy used for refrigeration. What makes the difference between a data centre with a PUE of around 2 and one of around 1.25 is mainly the intensive use of free cooling.
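
As a back-of-envelope illustration of where those two PUE figures come from, here is a minimal sketch; the split between refrigeration, fan and UPS overheads is purely an assumption for illustration, not measured data.

```python
# Illustrative PUE breakdown; all overhead figures are assumptions, not measured data.
# PUE = total facility energy / IT energy.

it_load_kw = 1000.0  # assumed IT load

# Legacy facility: refrigeration dominates, UPS losses are several times smaller.
legacy = {"refrigeration": 700.0, "ups_losses": 100.0, "fans_and_other": 200.0}

# Free-cooling facility: refrigeration largely displaced by free cooling.
free_cooling = {"refrigeration": 50.0, "ups_losses": 100.0, "fans_and_other": 100.0}

for name, overheads in (("legacy", legacy), ("free cooling", free_cooling)):
    pue = (it_load_kw + sum(overheads.values())) / it_load_kw
    print(f"{name}: PUE = {pue:.2f}")
# legacy: PUE = 2.00, free cooling: PUE = 1.25
```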

The true potential of free cooling (air-side, water-side) is not yet fully appreciated by the market. I will bet that it will move from the trough of disillusionment to the slope of enlightenment once the combination of the following is understood:

  • The true potential of adiabatic cooling
  • The opportunity to use ASHRAE’s recommended and allowable environmental ranges
  • Adequate air containment systems in the data hall

Not only will this enable the PUE to be reduced dramatically, but significant capital cost savings are also possible, by reducing or eliminating chilled-water systems and their associated electrical infrastructure.

This is possible in many cities including Johannesburg, London, Madrid, Mexico DF, Moscow, Munich,
Riyadh and San Francisco. If ASHRAE’s allowed range is used, then this applies to most cities in the world including Barcelona, Beijing, Buenos Aires, Chicago, Houston, Istanbul, Miami, Mumbai, New York, Rome, Sao Paulo, Singapore and Tokyo.

Once we have got PUEs down to below 1.3-1.4, it is likely that UPS losses will account for a larger share of the remaining overhead than before.

Gartner predicts that the surviving technologies will be those related to cooling and UPS. I would suggest that the running order of technologies should be:

  • Air management – physical containment will enable variable air flow supply and fan energy savings, and will also enable
  • Increased set points of air and water to improve refrigeration efficiency, and
  • Maximised free cooling opportunities, followed by
  • UPS improvements

Ref: Phelps J., Hype Cycle for Data Center Power and Cooling Technologies, 2010, Gartner Research ID Number: G00205727

The Age of Contained Air Managed Cooling

At DCD London in November 2010, Neil Rasmussen challenged traditional raised floor cooling, referring to it as “the end of raised floor cooling”.

Whilst I agree that we are seeing the end of an era of “open air management” cooling systems (which mostly use raised floor cooling), I think we will still continue to see raised floor cooling designs, and for valid reasons.

I have been involved in, or am aware of, recent designs that addressed dynamic high-density loads, achieved PUEs below 1.3 using free cooling and used raised floor cooling, BUT with closed air management, i.e. physical containment between the hot and cold air streams (excluding cold aisle containment). Raised floor cooling could have been avoided, but there were reasons not to do so, including: the floor plenum was used for UPS/PDUs and cable trays, there were fewer restrictions on building height than on plan area, and the plenum provided a thermo-syphon air path for fan-less operation.

The theme for the new generation of data centre cooling is contained air management (not necessarily using a raised floor). This is because if you contain the air streams and adequately control the cooling systems you obtain the following:

  • No air recirculation, so the CRAH supply air is the same as the air entering the IT equipment, which can be as high as 27°C and still comply with ASHRAE’s recommended range. There is therefore more free cooling available, and the refrigeration systems will be much more efficient, less utilised and perhaps even removed completely
  • The CRAHs circulate only the amount of air actually required by the IT equipment, saving significantly on fan energy (see the sketch below)
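
The fan energy point follows from the fan affinity laws: fan power scales roughly with the cube of air flow, so circulating only the air the IT equipment actually needs pays off quickly. A minimal sketch, with assumed figures:

```python
# Fan affinity law: fan power scales roughly with the cube of air flow.
# All figures are assumptions for illustration only.

design_power_kw = 10.0        # assumed CRAH fan power at 100% design flow
required_flow_fraction = 0.6  # assume the IT equipment only needs 60% of design flow

power_kw = design_power_kw * required_flow_fraction ** 3
print(f"Fan power at 60% flow: {power_kw:.1f} kW "
      f"({power_kw / design_power_kw:.0%} of full-speed power)")
# ~2.2 kW, i.e. roughly a fifth of the full-speed fan power
```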

With this it is quite difficult not to achieve a PUE of 1.3 in most of the world, and certainly below 1.5 anywhere in the world (if indirect air-side adiabatic free cooling is used).
Ideally, rack exhaust (rack-top) containment is used, although hot aisle and cold aisle containment are also effective at separating the hot and cold air streams. The reason for this preference is that with the first solution all of the data centre space where operatives work is cold, whereas with the last, most of the data hall is hot (except the cold aisles).
Maybe “The end of traditional raised floor cooling” is a better choice of words, or should we give it a positive spin with “A new era of contained air managed cooling”?

Cold air as a renewable source of cooling

Why not? Providing it is at a colder temperature than that needed to supply the IT equipment, it is cheaper than other renewable technologies (tidal, wind, solar, geothermal), it is readily available everywhere in the world, and it only carries the embodied energy of the air handling units. Even if the air is warm, unless it is saturated with water vapour it just requires adiabatic cooling (using water) to bring its temperature down so it can be used for free cooling. With this technology, full and partial free cooling for data centres can be achieved in every city of the world. Therefore, cold air free cooling is even better than standard renewable technologies. All we need now is for legislation to recognise the advantages of using air free cooling alongside the renewable technologies currently promoted.
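
To illustrate the adiabatic cooling point, a common approximation is that a direct evaporative cooler brings the supply air part-way towards the ambient wet-bulb temperature, limited by its saturation effectiveness. A minimal sketch, with assumed conditions and effectiveness:

```python
# Rough estimate of adiabatic (evaporative) cooling potential.
# Assumed relation: supply air approaches the wet-bulb temperature,
# limited by the cooler's saturation effectiveness (value assumed).

def adiabatic_supply_temp(dry_bulb_c: float, wet_bulb_c: float, effectiveness: float = 0.9) -> float:
    """Supply air temperature after adiabatic cooling (effectiveness is an assumption)."""
    return dry_bulb_c - effectiveness * (dry_bulb_c - wet_bulb_c)

# Example: a warm, fairly dry day (assumed conditions)
print(adiabatic_supply_temp(dry_bulb_c=35.0, wet_bulb_c=22.0))  # ~23.3 C
# Comfortably within ASHRAE's allowable range, so free cooling remains possible.
```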

“Human Unawareness” of Energy Saving Potential

While “human error” is responsible for most mission critical facility failures, “human unawareness” is responsible for easily avoidable energy wastage in data centres. For most data centres, 10-30% energy savings can be achieved with low investment. In a typical 1,000 m2 raised floor data centre, savings of hundreds of thousands of pounds, dollars or euros per year can be achieved with a return on investment (ROI) of under a year. Air management is normally the fundamental first step to achieving this.
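
The savings and payback arithmetic implied above can be sketched as follows; every input (facility power, tariff, saving fraction, cost of the measures) is an assumption for illustration only.

```python
# Simple payback sketch for low-cost air management measures.
# All inputs are assumptions for illustration, not measured data.

facility_power_kw = 1500.0  # assumed average total facility power for a 1,000 m2 hall
tariff_per_kwh = 0.12       # assumed energy tariff
saving_fraction = 0.15      # 10-30% is typical; 15% assumed here
investment = 100_000.0      # assumed cost of measures (blanking, grille moves, controls)

annual_saving = facility_power_kw * 8760 * saving_fraction * tariff_per_kwh
payback_years = investment / annual_saving

print(f"Annual saving: {annual_saving:,.0f}, payback: {payback_years:.2f} years")
# roughly 236,000 per year, paying back in well under a year
```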

“Managing Risk: The Human Element”

This is the title of an excellent book by Duffey and Saull that analyses the space, nuclear, aviation, chemical and other industries and reports that 80% of all failures are down to human error. This correlates well with the Uptime Institute’s reports attributing approximately 70% of data centre failures to human error.

Duffey and Saull construct a human failure rate bath-tub curve and describe the Universal Learning Curve as an exponential curve that varies with organisational experience and operators’ depth of experience. Failure rates decline with experience, falling from an initial higher rate to a plateau at a minimum rate. We learn from experience, and therefore learning from failures allows us to reduce our failure rates. Nevertheless, with time we ALL become complacent, so it is better to plan for an inevitable rare failure (at 100,000 to 200,000 hours of accumulated experience) than to assume or believe that the failure will never happen. The learning rate varies; some industries and organisations learn and others do not, depending on their “learning environment”. I wonder where the data centre industry stands in this respect?

Failure rate tends to vary much more with depth of operator experience, i.e. novice operators will have more failures than more experienced operators. I would describe depth of experience as a combination of relevant working experience, education (cognitive knowledge), skills (psychomotor) and affective learning (teamwork, awareness, knowledge sharing, communication, etc.). Whilst we certainly need to improve education and skills with regard to risk and energy, affective learning is also an important area to develop in the data centre industry.
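
As a rough illustration of the Universal Learning Curve idea, the sketch below assumes a simple exponential decay of failure rate with accumulated experience; the functional form and all constants are illustrative assumptions, not Duffey and Saull’s published values.

```python
import math

# Illustrative learning-curve sketch (form and constants assumed): failure rate
# decays exponentially from an initial value towards a minimum as accumulated
# operating experience grows.

def failure_rate(experience_hours: float,
                 initial_rate: float = 1e-3,
                 minimum_rate: float = 5e-6,
                 learning_constant_hours: float = 30_000.0) -> float:
    """Failures per hour after a given accumulated experience (illustrative only)."""
    return minimum_rate + (initial_rate - minimum_rate) * math.exp(
        -experience_hours / learning_constant_hours)

for hours in (0, 10_000, 50_000, 150_000):
    print(f"{hours:>7} h: {failure_rate(hours):.2e} failures/h")
# the rate falls steeply at first, then plateaus near the minimum rate
```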

Air Performance

“I have 1200 kW of CRAC units installed in a data centre designed for 1000 kW of IT load, which is currently loaded to 80%, so I should be OK.” Wrong, or at least you are not looking at the right information. What is really missing is how much of the air cooled by the CRAC units actually makes it to the IT equipment. The following diagram shows that only about 50% of the CRAC air actually reaches the IT equipment (the air flow performance). This typical value is due to bypass air (BP) that comes out of floor grilles in the wrong location (i.e. the hot aisle) or through unsealed cable cut-outs at the back of the racks. If we consider this variable, the amount of cooling actually delivered to the IT equipment is the installed CRAC capacity multiplied by the air flow performance factor, i.e. 1200 kW * 0.5 = 600 kW. The air is only delivering 600 kW of cooling capacity to IT equipment that requires 800 kW, which is clearly not enough. The IT equipment fans therefore draw more air by recirculating (R) what is missing from the back of the racks.

 

[Figure: only about 50% of the CRAC air reaches the IT equipment; the remainder is bypass (BP), made up at the racks by recirculation (R)]
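
The worked arithmetic above can be captured as a quick check; the values are taken from the example, and in practice the air flow performance factor would come from a site survey of bypass and recirculation.

```python
# Check of delivered cooling against IT demand using an air flow performance factor.
# Values are taken from the worked example above.

installed_crac_kw = 1200.0    # installed CRAC capacity
it_load_kw = 800.0            # actual IT load (80% of the 1000 kW design load)
air_flow_performance = 0.5    # fraction of CRAC air that reaches the IT intakes

delivered_kw = installed_crac_kw * air_flow_performance
shortfall_kw = it_load_kw - delivered_kw

print(f"Delivered: {delivered_kw:.0f} kW, shortfall: {shortfall_kw:.0f} kW")
# 600 kW delivered against an 800 kW requirement: the missing 200 kW is made up
# by recirculated hot air drawn in at the back of the racks.
```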

Thermal runaway

Following a mains interruption, the backup generators need to start up and power must be restored to the cooling systems before they can restart. To overcome the high air temperatures entering IT equipment during this event, many new designs put CRAH fans and secondary chilled water pumps on UPS (so they are immune to a mains interruption) and may also include large chilled water storage tanks to provide cooling media inertia. The consensus amongst IT equipment manufacturers is a recommended server inlet temperature of between 18-27°C during normal operation, with up to 32.2°C or higher allowable (ASHRAE, Thermal Guidelines 2011).
If the CRAH fans are not on mechanical UPS then, depending on the load density in the data hall, the thermal runaway can be considerable, e.g. a 10K rise within a few minutes (or less). With CRAH fans on UPS this reduces dramatically, e.g. to less than a 5K rise in 10 minutes; however, the reasons for this are not obvious. Through modelling and testing I have found that the following influence this, in addition to the air thermal inertia of the data hall (a rough numerical sketch follows the list):

  • Thermal inertia of fabric in data centre (high thermal capacity, low conductivity)
  • Thermal inertia of metal (racks, IT equipment, cable trays, raised floor, etc.) in data centre (high thermal capacity and conductivity)
  • Thermal inertia of cooling coils in CRAH units (medium thermal capacity, very high heat convection)
  • More of the data hall air is used (as thermal inertia) as all the air is mixed up and is moving at
    higher velocities (enhances heat convection to metal and fabric). There is less air stratification.
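
As a rough feel for the numbers, the sketch below uses a simple lumped-capacitance estimate of the initial temperature-rise rate after loss of cooling; the heat load, hall volume and the multiplier for the thermal inertia of metal, fabric and coils are all illustrative assumptions.

```python
# Rough lumped-capacitance estimate of temperature rise after loss of cooling.
# All masses, heat capacities and the heat load are assumptions for illustration.

it_heat_kw = 500.0            # assumed heat load in the data hall
hall_air_volume_m3 = 2000.0   # assumed data hall air volume
air_density = 1.2             # kg/m3
air_cp = 1.005                # kJ/(kg.K)

air_heat_capacity = hall_air_volume_m3 * air_density * air_cp   # kJ/K
air_only_rise_k_per_min = it_heat_kw * 60.0 / air_heat_capacity

# Adding the thermal inertia of racks, IT metalwork, fabric and CRAH coils
# (assumed here to be several times the air's capacity) slows the initial rise;
# in practice, with CRAH fans circulating air over coils, the effect is larger still.
effective_heat_capacity = air_heat_capacity * 5.0
with_inertia_k_per_min = it_heat_kw * 60.0 / effective_heat_capacity

print(f"Air only: {air_only_rise_k_per_min:.1f} K/min; "
      f"with added inertia: {with_inertia_k_per_min:.1f} K/min")
```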

Data halls with good air management and contained systems reduce the exposure of their IT equipment to higher temperatures, as the hot air is separated from the inlet air – although air temperatures in the room may rise, it takes longer for this hot air to reach the server inlets. Mechanical UPS may well be necessary, as the CRAH fans and IT equipment fans are now in series. For existing data centres I would recommend a short test as part of the regular shutdown testing, whereby cooling is failed for a couple of minutes and the results monitored to extrapolate the likely thermal runaway conditions.

Compromising Commissioning

There is always pressure at the end of a project to reduce commissioning time. However, the interdependent dimensions of any project are cost, time and quality, and each one affects the others. If you shorten the programme, it will either cost more or the quality (of testing and commissioning) will be compromised. If you try to cut the cost (cutting corners), again the quality will be compromised and you will almost certainly have delays. And finally, if you compromise the quality (why would you?), it would cost less and shorten your programme. But bearing in mind that commissioning proves that what has been designed will actually work in practice, in whose interest is it to compromise commissioning?

 

The true cost of energy is not being charged

Our carbon debt is accruing interest.

The Intergovernmental Panel on Climate Change (IPCC) reports that the atmosphere can take another 1,000 gigatonnes of CO2 before we reach the target limit of 2°C of global warming. If our current rate of emissions remains static, this threshold will be reached in about 30 years, after which no further CO2 emissions can be released.

We are releasing emissions at an unsustainable rate, and there is debate over whether even a 2 degree temperature rise is too great. Depending on the exact ecosystem tipping point, the anticipated consequences include extreme weather, severe droughts, floods, water shortages, food shortages, social unrest, mass emigration, riots and financial meltdown; we will be at war with ourselves, fighting for survival.

US President Barack Obama recently said that this is the first generation to notice the effects of global warming and the last generation able to do something about it. I agree; the consequences will be personal and extreme.

I believe we got to this situation due to an open-loop control problem: we use far too much energy and we are not charged for its environmental consequences. If energy cost, say, 10 times more, we would think twice about flying, driving, heating and cooling set points, insulation, leaving lights and computers on, and so on. We might then have some money for research and development of renewable energy, and for programmes to reduce energy use and the embodied environmental impact of products and systems.

Containment systems: a disappointing outcome!

I have visited too many sites where I am assured that air management is well under control because an air containment system has been installed, only to find disappointing results in many cases. The objective of containment is to segregate the hot and cold air streams, to minimise recirculation (hot IT exhaust air re-entering the intakes) and bypass (cooled CRAH air not reaching the IT equipment and returning directly to the cooling units). There are two fundamental issues: firstly, even with a perfect containment system, the amount of air supplied needs to be controlled to satisfy all IT equipment requirements, and secondly, containment is only part of the segregation solution.

There is a very simple way to check whether enough cold air is being supplied. In the case of cold aisle containment, open the cold aisle door slightly and verify the air flow direction with a sheet of paper. If air is coming out of the cold aisle, there is a slight oversupply of air to the cold aisle (which is fine); however, in many cases hot air is entering the cold aisle, which means that insufficient air is being supplied, making recirculation inside the cold aisle inevitable. This can be due to the wrong type or number of floor grilles, or simply insufficient air volume from the CRAHs. There are some very simple and rapid methods and metrics to diagnose air management based on temperatures and flows, such as Af (Availability of flow), which is the ratio of CRAH air volume to IT equipment air volume. Normally a slight oversupply of air (with a small amount of bypass) is better than an undersupply (which causes recirculation). A large oversupply of air is an energy saving opportunity, whereas a large undersupply will inevitably lead to considerable hot spots.
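
A minimal sketch of the Af check described above; the example air volumes and the diagnostic thresholds are assumptions for illustration.

```python
# Af (Availability of flow): ratio of CRAH air volume to IT equipment air volume.
# Example figures and the diagnostic thresholds are assumptions for illustration.

crah_airflow_m3s = 40.0   # total CRAH supply air volume
it_airflow_m3s = 48.0     # total air volume drawn by the IT equipment fans

af = crah_airflow_m3s / it_airflow_m3s
print(f"Af = {af:.2f}")

if af < 1.0:
    print("Undersupply: recirculation inside the contained aisle is inevitable.")
elif af <= 1.15:
    print("Slight oversupply: acceptable, with a small amount of bypass.")
else:
    print("Large oversupply: an energy saving opportunity (reduce CRAH air volume).")
```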

The next concern is the quality of segregation between the hot and cold air streams. The metric Air Segregation Efficiency is 100% (ideal) when there is zero bypass and zero recirculation. The important concept here is that we are trying to create a physical barrier between the cold and hot air streams, for which we use a containment system. Most containment systems (excluding butchers’ curtains) are reasonably airtight. The issue lies with the other segregation areas which are not part of the containment system, such as the raised floor, where there can be unsealed cable cut-outs and floor grilles in the hot aisles, or the front of the racks, where blanking panels are missing between IT equipment and there are gaps at the sides.
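
The exact formula for Air Segregation Efficiency is not spelled out here, but the two loss mechanisms can each be estimated from a handful of temperature measurements using simple mixing relations, assuming well-mixed air streams; the example temperatures below are assumptions.

```python
# Estimating recirculation and bypass fractions from temperatures, assuming
# well-mixed air streams (all example temperatures are assumptions).

t_supply = 18.0    # CRAH supply air temperature (C)
t_exhaust = 35.0   # IT equipment exhaust temperature (C)
t_it_inlet = 21.0  # measured IT equipment inlet temperature (C)
t_return = 30.0    # measured CRAH return temperature (C)

# Fraction of IT intake air that is recirculated exhaust air:
recirculation = (t_it_inlet - t_supply) / (t_exhaust - t_supply)

# Fraction of CRAH air that bypasses the IT equipment and returns directly:
bypass = (t_exhaust - t_return) / (t_exhaust - t_supply)

print(f"Recirculation ~ {recirculation:.0%}, bypass ~ {bypass:.0%}")
# Zero recirculation and zero bypass would correspond to ideal (100%) segregation.
```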

Whilst I am all for containment, the reasons why it is being installed need to be fully understood and monitored in order for its objectives to be met.

 
