Global Data Center Engineering

DCIM (Data Center Infrastructure Management)

The field of Data Center Infrastructure Management (DCIM) is, by most counts, still a new industry, as the “Data Center” itself has evolved over the past two decades.  While it is commonly held that 70% of data center outages are caused by human error, existing data center standards focus primarily on the other 30% of the equation (mechanical failure and natural disaster mitigation).

Outage is defined as any event that impacts a production service, resulting in loss of function for any length of time.  Major Outage is defined as an event that impacts either an entire data hall within a DC or the entire DC.  A data hall is the area where racks are placed which contain IT hardware, such as servers, switches, routers, SAN storage, etc.
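The two definitions above can be written as a small classification rule. This is only a sketch of how I read them: the names `Scope` and `classify_event` are hypothetical and do not come from the study or from any DCIM product.

```python
from enum import Enum


class Scope(Enum):
    """How much of the facility an event touched (hypothetical categories)."""
    SERVICE = "service"          # one or more production services
    DATA_HALL = "data_hall"      # an entire data hall
    DATA_CENTER = "data_center"  # the entire data center


def classify_event(scope: Scope, minutes_of_lost_function: float) -> str:
    """Apply the article's definitions of Outage and Major Outage."""
    # Any loss of function, for any length of time, counts as an outage.
    if minutes_of_lost_function <= 0:
        return "no outage"
    # An event that takes down a whole data hall or the whole DC is major.
    if scope in (Scope.DATA_HALL, Scope.DATA_CENTER):
        return "major outage"
    return "outage"


print(classify_event(Scope.SERVICE, 3))      # outage
print(classify_event(Scope.DATA_HALL, 45))   # major outage
```

Note that duration plays no part in separating an outage from a major outage; only the scope of the impact does.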

The Problem

Data Center Infrastructure Management is a rapidly expanding service in the data center industry.  Most DCIM providers position their software offerings on elements like speed to market, rapid deployment, cost, performance management and reduction of energy usage.  These are all important aspects, but the greatest cost-avoidance benefit to a data center is in not having an outage.

The scope of DCIM is also controversial.  This research has aligned with a definition provided in a report by Cappuccio, D.J. (2010) for Gartner.  He identifies DCIM as a “juxtaposition of two other markets – System Performance Management (SPM) and Building Management Systems (BMS)” (where Data Center is the core market, and SPM and BMS are applied to it).  He goes on to define it further: “DCIM (tools) integrate facets of system management with building management and energy management, with a focus on IT assets and the physical infrastructure needed to support them.”  It is this fundamental definition of DCIM that is assumed throughout this dissertation.

The initial basis for testing DCIM’s impact on outages grew out of several discussions I had with multiple DCIM solution providers.  During those discussions I discovered that none of them had a metric that would indicate whether the presence of their solution could reduce outages in the data center.

The Promises
DCIM as a solution is expected to deliver several benefits which, depending on the product, may include some or all of the following:

  1. Reduce time to deploy IT equipment in the data center
  2. Manage asset inventory
  3. Automate asset tracking
  4. Provide a “dashboard” view of KPI metrics of the data center
    1. PUE (Power Usage Effectiveness)
    2. Electrical consumption (Reduce power utilization)
    3. Real-time monitoring of capacity (in terms of ability to expand the data center load)
    4. Identify unused equipment or spare capacity
    5. Manage floor loading
  5. Manage workflow (repository for workflow and work orders)
  6. Provide alerts to manage equipment and maintenance schedules
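To make two of the dashboard KPIs above concrete, here is a minimal sketch. PUE is the ratio of total facility energy to IT equipment energy; spare capacity is treated here simply as design load minus current load. The function names and the sample readings are assumptions for illustration, not figures from the study or the API of any DCIM product.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def spare_capacity_kw(design_load_kw: float, current_load_kw: float) -> float:
    """Remaining headroom before the design load is reached (never negative)."""
    return max(design_load_kw - current_load_kw, 0.0)


# Hypothetical readings for one measurement period.
print(pue(total_facility_kwh=1500.0, it_equipment_kwh=1000.0))       # 1.5
print(spare_capacity_kw(design_load_kw=1200.0, current_load_kw=900.0))  # 300.0
```

A PUE of 1.0 would mean every kilowatt-hour entering the facility reaches the IT equipment, so lower is better.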

The Missing Pieces

While these are all solid benefits, it was surprising to find that there was no focus on two critical elements:

  1. Reduction of data center outage
  2. Human Error management

When I discussed proof of reduction of data center outages with several of the large DCIM manufacturers (Schneider Electric, Nlyte, Emerson, Eaton, Raritan, NoLimits), all cited a lack of industry data to support the claim that the presence of DCIM at any level could reduce outages.  This article (and the research it is based on) offers the first such proof.  There will be more to come.

As for Human Error Management, more on this later.

The Pyramid

Data Center Infrastructure Management shows consistent and steady growth as an industry.  In only a few short years it has grown from a $300m USD industry to a $1.8b USD industry.  However, it currently appears to be missing a critical opportunity: alignment with practices that will improve Human Error Management.  Considering that the average cost per minute of a data center outage is $7,900 USD, a two-hour outage would result in a loss of nearly $1m USD (120 minutes × $7,900 = $948,000).

The diagram below depicts the relationship of DCIM and how it integrates.

[Figure: The DCIM Pyramid: Relationships of Management Elements – Payton (2015)]

WHAT? I Was Adopted!

While the data center has been around for many years (earliest reported data center in this study was 1981), DCIM by contrast has not.  The first presence of DCIM noted from respondents was reported in 2006.  Prior to 2010 only 12% of data centers had DCIM.

[Figure: DCIM Adoption Rate – Payton (2015a)]

Adoption has progressed since then: as can be seen, a further 29% adopted DCIM after 2010.  It is also interesting to note that a small number of data centers adopted DCIM at the point they were put into operation, with 24% implementing DCIM during initial operation.  Total DCIM adoption (either prior to or after the data center commenced operation) was 41%.  There were no reported instances where DCIM was removed after implementation.

The Proof in the Pudding

The original aim of our study was to determine whether the presence of DCIM had a measurable, significant effect on part-facility or whole-facility outages.  We were largely certain that we would find one.  We also found an even stronger argument for DCIM’s role in outages: the time to recover.  Our study showed both that DCIM’s presence reduced the likelihood of an outage and, more interestingly, that it drastically reduced the time to recover from an outage.

[Figure: Data Center Outage Experience – Payton (2015b)]

The fundamental question posed by this research study was whether the presence of DCIM in the data center would result in a reduction of outages.  The table above is proof that this supposition is overwhelmingly true.  Of the data centers that reported not having a DCIM system implemented, 80% also reported having experienced a major data center outage.  By comparison, of those sites that did have DCIM, roughly 29% reported having an outage.  This may seem imbalanced, but the 80% counts data centers that may have implemented DCIM after their outage (which is why the two combined do not equal 100%).

The leading columns in this chart reflect the percentage with DCIM and without DCIM.  The two columns at right in the above indicate the percentage of outage experienced with and without DCIM.

Relief for the Outage Hangover

Even the shortest outage can have a serious impact, both on revenue and on customer experience.  No one wants an outage, but when one does happen, recovering from it is like recovering from a hangover… the faster the recovery the better.  (If you’ve never had a hangover, just imagine the pain of your most recent outage.  They are very similar.)

[Figure: DCIM’s Impact on Length of Outage – Payton (2015c)]

With the understanding that the cost of an outage is directly tied to its length, it is also important to understand whether DCIM reduces the length of outages when they occur.  If outage length is reduced even when an outage does occur, this further adds to the benefit of having a DCIM solution.  And, as is evident in the table above, the presence of DCIM reduced the average length of an outage from 342 minutes without DCIM to 36 minutes with DCIM.

This factor is extremely important in the justification of DCIM implementation.  Using the $7,900 per-minute outage cost (Ponemon 2013), that represents a reduction in the cost of impact with DCIM of $2.4 million.  The average cost of an outage with DCIM is roughly $284,000 (36 minutes × $7,900), while the average cost of an outage without DCIM rings in at $2.7 million.
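The arithmetic behind these figures is simple enough to check in a few lines. The sketch below uses only numbers already quoted in this article (the $7,900 per-minute cost and the 342- and 36-minute average outage lengths); the function and variable names are mine, chosen for illustration.

```python
COST_PER_MINUTE_USD = 7_900  # per-minute outage cost cited from Ponemon (2013)


def outage_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated cost of an outage of the given length, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


cost_without_dcim = outage_cost(342)  # average outage length without DCIM
cost_with_dcim = outage_cost(36)      # average outage length with DCIM
reduction = cost_without_dcim - cost_with_dcim

print(f"Without DCIM: ${cost_without_dcim:,.0f}")  # $2,701,800
print(f"With DCIM:    ${cost_with_dcim:,.0f}")     # $284,400
print(f"Reduction:    ${reduction:,.0f}")          # $2,417,400
```

A flat per-minute rate is a simplification; real outage costs rarely scale perfectly linearly with time, so treat these as order-of-magnitude figures.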

So What Else Should We Know?

Well, there are many factors at work.  One other overwhelming influence, however, is the presence of a maintenance program.  We found it very interesting that whether maintenance plans were rigidly followed, or at least sometimes followed, data centers fared far better than when no plans were in place.  In fact, 100% of DC respondents to our study reported they had experienced an outage.  (All DCs had been in operation for at least 2 years.)

But clearly a maintenance plan on its own did not reduce outages as effectively as having a DCIM solution as well.  The two combine to make a significant impact on the availability of a data center.

Still, these focus primarily on the physical aspects of the DC: monitoring systems, providing alerts, and giving “feedback” to the operators.  That addresses primarily the 30% of outages, and does little to address the 70%: Human Error Management.  And that’s where DCIM makers have an opportunity.  They should be focusing on building in more error management strategies, including “feedforward” systems, those that predict system issues before they result in an “out of tolerance” condition.  When we start to focus on these intervention strategies, we can achieve a greater reduction of outages in the data center.
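To illustrate the difference between feedback and feedforward, here is one very simple way a feedforward check could work: fit a straight-line trend to recent sensor readings and estimate how long until the tolerance limit would be crossed. This is only a sketch under my own assumptions (a linear trend, evenly spaced readings, a hypothetical inlet-temperature limit). It is not how any particular DCIM product works, and real systems would need far more robust prediction.

```python
from typing import Optional, Sequence


def minutes_until_limit(
    readings: Sequence[float], interval_min: float, limit: float
) -> Optional[float]:
    """Estimate minutes from the latest reading until `limit` is reached.

    Fits a least-squares line through evenly spaced readings. Returns None
    when there are too few readings or the trend is flat or falling.
    """
    n = len(readings)
    if n < 2:
        return None
    times = [i * interval_min for i in range(n)]
    mean_t = sum(times) / n
    mean_r = sum(readings) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, readings))
    slope = sxy / sxx  # change in reading per minute
    if slope <= 0:
        return None
    fitted_now = mean_r + slope * (times[-1] - mean_t)
    if fitted_now >= limit:
        return 0.0  # trend line is already at or beyond the limit
    return (limit - fitted_now) / slope


# Hypothetical inlet temperatures (°C) sampled every 5 minutes, limit 27 °C.
eta = minutes_until_limit([22.0, 22.5, 23.0, 23.5], interval_min=5, limit=27.0)
if eta is not None:
    print(f"Warning: limit reached in about {eta:.0f} minutes")
```

A feedback alarm would fire only once the reading passes 27 °C; the feedforward check raises the warning while the operator still has time to intervene.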

 

References:

Cappuccio, D.J. (2010) DCIM: Going Beyond IT, p. 2, Gartner ID: G00174769
Payton, S. (2015) Data Center Infrastructure Management: How DCIM Impacts DC Infrastructure Outage Reduction, Figure 2, p. 14
Payton, S. (2015a) Data Center Infrastructure Management: How DCIM Impacts DC Infrastructure Outage Reduction, Table 8, p. 42
Payton, S. (2015b) Data Center Infrastructure Management: How DCIM Impacts DC Infrastructure Outage Reduction, Table 13, p. 48
Payton, S. (2015c) Data Center Infrastructure Management: How DCIM Impacts DC Infrastructure Outage Reduction, Table 19, p. 53
Ponemon Institute (2013) 2013 Cost of Data Center Outage, Ponemon Institute Research Report, p. 11

 
