Downtime, Outages and Failures - Understanding Their True Costs
- 11 Apr 2019
- Written by: Gad Cohen
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes.
When it comes to mission-critical applications or data-center performance quality, enterprises are willing to make huge investments. Unfortunately, these investments don’t always fully deliver.
Confronting system downtime
Despite the efforts invested in infrastructure robustness, many IT organizations continue to deal with database, hardware, and software downtime incidents that last from just a few minutes to several days, completely incapacitating the business and causing tremendous losses.
The world of IT failure can sometimes seem paradoxical.
Despite the variety of advanced solutions and the mounting data collected by major enterprise software vendors and IT departments (from ERP to CRM and more), outages are still a real and terrifying threat to the industry.
On the other hand, IT failures have somehow become an accepted, even expected, part of enterprise life.
This is counterintuitive…
IT downtime revisited
IT professionals confront downtime from time to time, and when they do they are fully focused on getting on top of it. Meanwhile, the business as a whole suffers the ‘financial pain’ of the side effects, which tend to be very significant.
In the past, we took an in-depth look at the multiple ways in which IT downtime can impact enterprises’ bottom line (you can read more about it here - Cost and Scope of Unplanned Outages). We looked at different aspects, from direct loss of revenues through reputation damage to indirect effects such as decrease in productivity.
Now, I wish to revisit the issue and examine how organizations should address and assess threats to their IT operations, including systems, applications and data, by analysing solid (and established) benchmarks that represent the potential costs behind downtime and outages.
Measuring big brand failures
When should the industry start measuring the financial impact of big brand outages, such as the one that recently hit Facebook, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar outage that resulted in hundreds of flight delays?
In other words, at what point is an outage ‘significant enough’ so that a cost analysis becomes valuable to the industry in order to learn from it and predict the impact of future outage incidents?
Well, apparently at some point the outage creates an impact that can’t be ignored, PR-wise. That’s the point of no return, which is followed by financial impact estimations.
Downtime costs vary significantly between industries. The affected business size is obviously a critical factor, but it is not the only major one. The role of the IT systems in the business is also key.
Setting a numerical value behind an IT outage means predefining its implications across multiple business and organizational aspects, so that the whole industry can learn and optimize accordingly.
A failure of a critical application can lead to two distinct types of losses:
- Loss of the application service – the impact of downtime varies according to the application and the business;
- Loss of data – the potential loss of data due to a system outage can have significant legal and financial implications.
Now, I am sure that you would agree that today's data centers should never go down; applications must stay available 24/7, and internal (let alone external) end-users worldwide must be able to rely on data centers’ availability (for critical data and application availability) at all times.
Well, reality bites. In the back office (meaning inside the data center) this is not the case. No organization enjoys 100% uptime. Should you aspire to reach 100%? Sure. But you should also develop a deep understanding of downtime implications and ways to minimize it.
The worst outage nightmare ever? Probably the one that happened to you…
Some past outage incidents turned into PR catastrophes, like the legendary Virgin Blue debacle from 2010, or the recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, the Virgin Blue outage prevented passengers from boarding flights for 11 days (!!) resulting in negative press, damaged reputation, and millions of dollars lost.
To be more accurate: Virgin Blue's reservations management company, Navitaire, ended up compensating Virgin Blue for more than $20 million (Navitaire booking glitch earns Virgin $20M in Compo).
There are many other incidents that still manage to capture the attention of the media. Here’s just one recent article by USA Today about the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in the IT industry would agree that outages or downtimes are VERY bad for business. They are unwanted, very harmful financially, and must be fought against using all available resources.
Misconfigurations are key
The IT Process Institute's Visible Ops Handbook reported in the past that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers" (Visible Ops).
Enterprise Management Associates reported that 60% of availability and performance errors are the result of misconfigurations.
What’s the cost?
Downtime can cost companies $5,600 per minute on average, which adds up to over $300,000 per hour of web application downtime (according to a 2014 Gartner analysis).
[Chart: the average hourly cost of enterprise server downtime, worldwide, 2017-2018]
Application maintenance costs are increasing at an annual rate of 20%, but spending more can’t solve all of your problems. A past industry survey revealed that at least one-quarter of the downtime reported by respondents was caused by configuration errors (How much will you spend on application downtime this year?).
How common are downtimes or outages?
Ok, downtime can be a financial nightmare. That part is clear. But if you wish to properly estimate the risk potential of outages to your business, the immediate question should be “how likely is it to happen?”
Source: Data Center Knowledge
Ok, so outages are way too common to be ignored by thinking “I am not likely to experience a major outage”. Now comes the question of how to calculate their specific risk to your business.
Production and application downtime costs made clear
Unplanned outages are up to IT to resolve. Nevertheless, and as I already mentioned, at the end of the day these outages impact the entire organization.
An important part of a thorough outage risk evaluation process is estimating how much money you will lose per hour (or minute, or any other time increment of your choice) in the event of downtime.
For enterprises that depend solely on data centers' ability to deliver IT and networking services to customers – such as telecommunications service providers or e-commerce companies – downtime can be particularly costly, with the highest cost of a single event topping $1 million (more than $11,000 per minute) according to estimations by experts.
In a USA Today survey of 200 data center managers, over 80% reported that their downtime costs exceeded $50,000 per hour. Over 25% reported downtime costs of over $500,000 per hour (!!).
According to another survey, while companies can't achieve zero downtime, one in every 10 companies said that their availability must be greater than 99.999%.
Source: Searchcio Techtarget
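To make an availability target such as 99.999% more tangible, it helps to translate it into a yearly downtime budget. The snippet below is a minimal sketch, assuming a 365-day year of round-the-clock operation; the function name is hypothetical and not taken from any of the surveys cited here.

```python
# Sketch: convert an availability target into an allowed downtime budget.
# Assumption: the service is expected to run 24/7 for a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.999% target leaves a budget of a little over five minutes per year, which shows how demanding that requirement is.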
To get a firm understanding of the implications of production and release downtime, let's take a look at how the consequences of downtime are manifested.
Downtime cost - per year or per incident?
A 2017 study revealed that out of 400 IT decision makers, 46% experienced more than four hours of IT-related downtime over 12 months; 23% said that they incurred costs ranging from $12,000 up to more than $1 million per hour.
Over 35% admitted that they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to an outage in 2017, the losses of a single outage incident can reach over $150 million.
A couple of years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience a minimum of 1.6 downtime hours per week.
If you take the average Fortune 500 company (or a company that employs at least 10,000 people), assume an average labor cost of $56 per hour, and assume that the whole workforce is affected for those 1.6 hours, then just the labor part of downtime for an organization of this size would reach $896,000 per week (10,000 x $56 x 1.6), translating to more than $46 million per year (Assessing The Financial Impact Of Downtime).
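The labor figure above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that all 10,000 people are counted for the full 1.6 hours every week, which is the only way the $896,000 figure works out.

```python
# Sketch: weekly and yearly labor cost of downtime.
# Assumption: every employee is affected for the whole downtime period.

WEEKS_PER_YEAR = 52


def weekly_labor_cost(employees: int, hourly_cost: float, downtime_hours_per_week: float) -> float:
    """Labor cost of downtime per week: headcount x hourly cost x downtime hours."""
    return employees * hourly_cost * downtime_hours_per_week


weekly = weekly_labor_cost(employees=10_000, hourly_cost=56.0, downtime_hours_per_week=1.6)
yearly = weekly * WEEKS_PER_YEAR

print(f"Weekly labor cost of downtime: ${weekly:,.0f}")
print(f"Yearly labor cost of downtime: ${yearly:,.0f}")
```

Plugging in your own headcount, labor cost and downtime hours gives a first rough estimate before any lost revenue is added.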
Of course, the reality is more complicated, as you need to take into consideration many parameters, such as the timing of the event (mid-week or weekend? day or night?). Still, understanding the costs of outages will significantly help you estimate your risk potential and the ROI of tools that can help minimize the effect of downtime incidents.
Has the industry managed to learn from the past and to minimize the collateral damage during an outage?
How have things changed from the past?
So, we already know that downtime and outage incidents still happen today, and the industry has yet to abolish them. But how has their cost changed over time? Are these incidents less harmful today?
In 2010, research by Coleman Parkes found that IT downtime incidents collectively cost businesses more than 127 million man-hours per year in employee productivity, an average of 545 man-hours per company.
In 2009, it was reported that the average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages (How to quantify downtime).
According to a survey of IT managers conducted during those years, companies are becoming more aware of the direct financial costs of computer downtime. The survey revealed that one in every five businesses loses $12,000 an hour through systems downtime (How to quantify downtime).
As mentioned above, a later analysis performed by Gartner in 2014 reported an average cost of $5,600 per minute and over $300k per hour.
Even as early as 2004, a conservative estimate from Gartner pegged the hourly cost of downtime for computer networks at $42,000. Accordingly, a company that suffers from a worse-than-average downtime of 175 hours per year can lose more than $7 million annually. However, the cost of each outage affects each company differently, so it's important to know how to calculate the precise financial impact (How to quantify downtime).
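The annual figure above is simply the hourly cost multiplied by the yearly downtime hours. A minimal sketch, using the 2004 numbers quoted in this section (the function name is hypothetical):

```python
# Sketch: annual downtime cost from an hourly cost and yearly downtime hours.


def annual_downtime_cost(hourly_cost: int, downtime_hours_per_year: int) -> int:
    """Annual cost of downtime: hourly cost x hours of downtime per year."""
    return hourly_cost * downtime_hours_per_year


# Figures quoted in the text: $42,000 per hour and 175 hours per year.
cost = annual_downtime_cost(hourly_cost=42_000, downtime_hours_per_year=175)
print(f"Estimated annual downtime cost: ${cost:,}")
```

With those inputs the result is $7,350,000, in line with the "more than $7 million" quoted above.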
It makes sense to believe that the cost of outage only gets higher with time (since we all lean more on data systems today). You can therefore understand why past data can be multiplied by a significant number in order to reflect today’s reality…
Every minute counts
Over ten years ago, the average cost of data center downtime across industries was valued at approximately $5,600 per minute (Unplanned IT Outages Cost More than $5,000 per Minute), a figure which, according to Gartner, remained the same until 2014. A past study by the Ponemon Institute calculated the minimum, median, mean and maximum cost per minute of unplanned outages, based on input from 41 data centers. The greatest cost of an unplanned outage was found to exceed $11,000 per minute.
On average, the cost of an unplanned outage is likely to exceed $5,000 per minute.
It only gets more significant
A 2013 study saw an uplift of over 41% from the past averages described above, to an average cost of more than $7,900 per minute.
An ITIC survey from 2015 clearly showed that the hourly cost had increased by 25% to 30% compared to data from 2008.
Downtime impact per year
A past Gartner analysis calculated that downtime can add up to 87 hours per year, on average. Obviously, that's the sum of many outages, each lasting anywhere from a few minutes to several hours (Average large corporation experiences 87 hours of network downtime a year).
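To put 87 hours per year in perspective, it can be expressed as an availability percentage. The snippet below is a rough sketch, assuming the systems are expected to run around the clock for a 365-day year; the function name is hypothetical.

```python
# Sketch: express yearly downtime hours as an availability percentage.
# Assumption: 24/7 operation, i.e. 8,760 expected service hours per year.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return availability (in percent) for a given number of downtime hours per year."""
    uptime_hours = HOURS_PER_YEAR - downtime_hours_per_year
    return 100 * uptime_hours / HOURS_PER_YEAR


print(f"87 hours of downtime per year is about {availability_percent(87):.2f}% availability")
```

Under that assumption, 87 hours of downtime corresponds to roughly 99% availability, far from the "five nines" some companies say they need.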
How have things changed?
Later research from 2011 revealed that although the industry has managed to fight the downtime epidemic and reduce the number of occurrences, we are still seeing significant downtime hours and huge revenue losses. One outage led to over 3 million users (apparently WhatsApp users) migrating to Telegram.
The impact on reputation and loyalty
How much is your business reputation worth? This can be extremely difficult to assess, as can the long-term effect of a damaged reputation and its impact on revenue and profitability.
In this case, downtime costs include lost customers (both short and long term), and other tangible elements that reflect the costs of reputation impairment like stock downturns, marketing hours (crisis and brand recovery management) and media budget required to reboot and polish up an organization's profile.
What parameters should impact your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (such as loss of business during the outage). However, many indirect costs, such as employee overhead or the reputation issues discussed above, should be factored in as well.
Workforce overhead is derived from the cost of burning ‘war-room’ tasks that focus on getting the IT systems back up and running, the cost of being delayed with all other planned tasks, the cost of employee overtime expenses (if applicable), and more. Then there’s the value of data loss, emergency maintenance fees (particularly if the outage occurs during off hours), and additional repair costs that may continue long after service has been restored.
Needless to say, you must calculate these costs when you estimate the implication of downtime, as they are usually very significant; but even a rough guesstimate can prove to be extremely beneficial for understanding the risks and deciding on the required level of technology you should lean on, in order to fight it.
There’s also the impact of lost sales. To have an accurate assessment of the total lost sales, the impact percentage must be increased to reflect the real lifetime value of customers who permanently defect to a competitor (Cost-Unconscious: Denying the True Cost of Network Downtime). For instance, consider the users who moved elsewhere after the Facebook (and WhatsApp) outage that I mentioned earlier: what is the revenue loss derived from the fact that these users will generate fewer billable ad impressions?
Stock dropped by 25%
Although it's hard to put a number on so many parameters, they are still substantial and significant. For instance, when Amazon.com went offline for several hours during its early days, its stock dropped by 25% in a single day (Cost-Unconscious: Denying the True Cost of Network Downtime)!
In a separate incident, the Amazon cloud outage, the company scrambled to get its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon’s communication surrounding the outage. Other customers thought they should be compensated for the downtime as part of their SLA.
I know you are curious: As for the SLA, despite the almost-four-day outage, Amazon's EC2 SLA was not breached (Seven lessons to learn from Amazon's outage).
The cost of downtime: Calculating it yourself
How much are you bound to lose from an unexpected downtime of your servers or business applications?
According to multiple sources, the simplest way to calculate potential revenue losses during an outage is by using this equation:
LOST REVENUE = (GR/TH) x I x H
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact of the outage
H = number of hours of outage
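The equation above translates directly into a few lines of Python. This is a minimal sketch: the function name and the sample figures are hypothetical, and I is treated here as the fraction of revenue affected by the outage, which is an assumption.

```python
# Sketch of LOST REVENUE = (GR / TH) x I x H.
# Assumption: I is the fraction of revenue affected by the outage (0.0 to 1.0).


def lost_revenue(gross_yearly_revenue: float,
                 total_yearly_business_hours: float,
                 impact_fraction: float,
                 outage_hours: float) -> float:
    """Estimate the revenue lost during an outage."""
    revenue_per_hour = gross_yearly_revenue / total_yearly_business_hours
    return revenue_per_hour * impact_fraction * outage_hours


# Made-up example: $100M yearly revenue, 2,000 business hours per year,
# half of the revenue affected, 4-hour outage.
loss = lost_revenue(100_000_000, 2_000, 0.5, 4)
print(f"Estimated lost revenue: ${loss:,.0f}")
```

A business that sells around the clock would use 8,760 for the yearly business hours instead of a working-hours figure.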
How to minimize outage and downtime risk?
Downtime and outages are catastrophic, but they don’t have to be that impactful. By utilizing solutions that focus on getting to the root of the problem, outages can be prevented before they even occur.
Evolven Change Analytics developed a unique AIOps solution that focuses on changes - the true root cause of performance incidents. Evolven helps enterprise IT and Cloud Ops teams prevent and troubleshoot incidents before the trouble starts.
Contact us to see how we help leading enterprises slash the number of incidents and MTTR.
What is the true cost of system downtime?
Quick downtime calculator
To get a quick estimate of your company's probable downtime costs, use the following formula, based on the size of your business and the number of minutes your most recent incident lasted: Downtime cost = minutes of downtime x cost-per-minute. For a small business, use $427 as the cost-per-minute.
Downtime cost is defined as any profit that a company loses when its equipment or network stops functioning. The cost of downtime implies not only direct financial loss; it can also affect your company in at least four other ways.
What is the difference between downtime and outage?
Downtime occurs when a system can't complete its primary function. It can be broken up into two types: IT outages and brownouts. IT brownouts occur when a system is slowed or partially available. This might mean customers can access your site, but pages load slowly or dynamic features like "add to cart" don't function.
What is downtime failure?
In industrial environments, downtime may refer to failures in production equipment. This type of downtime is often measured as downtime per work shift or downtime per a 12- or 24-hour period. Downtime duration is the period of time when a system fails to perform its primary function.
What is true downtime cost analysis?
TDC is a methodology of analyzing all cost factors associated with downtime, and using this information for cost justification and day to day management decisions. Most likely, this data is already being collected in your facility, and need only be consolidated and organized according to the TDC guidelines.
What are the three types of downtime?
Common categories of downtime include excessive tool changeover, excessive job changeover, lack of operator, and unplanned machine maintenance.
What are the main causes of downtime?
This can be due to several reasons including hardware or software failure, human error, malicious attacks or natural disasters. Since unplanned downtime is unexpected and occurs without a warning, preventing it can be a challenge.
What are the two types of downtime?
Downtime falls into two categories: planned and unplanned. Planned downtime is notable because it offers advanced warning and gives users a chance to prepare. Planned downtime is usually done for upgrades or maintenance to the network infrastructure.
How do you explain downtime?
a time during a regular working period when an employee is not actively productive. an interval during which a machine is not productive, as during repair, malfunction, maintenance.
What is the industry standard for downtime?
World Class Standards For Downtime
Aim for unscheduled downtime to be 10% or less.
What is outage in incident management?
Basically, an outage means a period of downtime. If resolving a particular incident requires downtime, you can create an outage record for that incident and enter its start and end time to document the interruption.
How do you calculate maintenance downtime?
1. Divide your total revenue by the planned operating time to get your daily revenue. 2. Assess by how much your daily revenue goes down if the chosen piece of equipment stops working for 1 hour.
What are the different types of downtime?
- Not-Utilizing Talent.
- Motion Waste.
- Excess Processing.
Importance of Reducing Unplanned Downtime
Waiting on parts or the necessary personnel to fix an issue takes time and could mean the machine is going to stay down for longer. Longer downtime is less time making product, directly affecting the bottom line.
- Lost productivity. If staff cannot perform their job function due to applications being offline, then there is an obvious hit on productivity. ...
- Reputational damage. ...
- Lost sales. ...
- Missed business opportunities. ...
- Hidden costs.
Calculating Downtime Cost
The duration of the downtime and the cost incurred per minute you're offline are the two variables that most affect the financial impact of an outage.
How Much Does Downtime Cost a Company?
The average cost of downtime is significant. Each minute costs an average of $9,000, according to the Ponemon Institute, bringing the downtime cost per hour to over $500,000.
What is true cost methodology?
True Cost Accounting (TCA) is a new way of identifying the real costs of a specific product or service. TCA calculates not only the direct costs like raw materials and labour, but also the effects on the natural and social environment in which a company operates.
Is downtime a KPI?
Revenue is directly impacted by downtime because the less equipment is running, the fewer products are made and sold. Therefore, one of your maintenance KPIs is downtime. All sorts of quantifiable actions can influence downtime, such as the mean time to repair (MTTR) or planned maintenance percentage.
What is Level 3 downtime?
Downtime Level 3 - Operations are defined as localized, scheduled or unscheduled problem involving the loss of multiple functions, applications, or systems, not anticipated to exceed 24 hours of unavailability. For a level 3 the problem can be resolved using all available resources.
How do you maximize downtime?
In order to get the most out of your downtime, you need to be strategic about how you spend it. One way to maximize your downtime is by resetting your goals and milestones. This involves taking a step back and re-evaluating where you are and where you want to be.
What is a downtime plan?
Planned downtime is scheduled time when production equipment is limited or shut down to allow for planned maintenance, repairs, upgrades or testing.
What is the difference between downtime and breakdown time?
Breakdown time is downtime that results from the equipment breaking down. You'd start counting from the time the asset fails to the time you manage to get it up and running again. Equipment downtime, on the other hand, is any amount of time in which a piece of equipment is offline.
What is a major outage?
Major Outage means any Power Outage that lasts for at least ten (10) consecutive minutes and/or any Temperature Irregularity, in each case causing inoperability of Customer's Equipment.
The most well-known downtime metric is Mean Time to Repair (MTTR). The MTTR metric reflects the average time it takes to troubleshoot and repair a failed piece of equipment.
What is 5 nines availability downtime?
Availability is normally expressed in 9's. For example, “5 nines uptime” means that a system is fully operational 99.999% of the time, an average of less than 6 minutes downtime per year. The chart shows what impact various availability levels have on your server downtime.
What is acceptable downtime?
Maximum allowable downtime denotes the maximum time a business can tolerate the absence or unavailability of a particular business function. Different business functions are likely to have different answers to the allowable downtime equation.
How can service outages be prevented?
- How Do You Avoid Service Downtime? The best way to avoid service downtime is to: ...
- Use Enterprise-Level Network Infrastructure. ...
- Always Have a Backup Plan. ...
- Keep Things Simple. ...
- Monitor Frequently. ...
- Test and Retest. ...
- Deploy Network Redundancy. ...
- Regularly Update Systems.
Notify the help desk of an outage so they can respond to user concerns appropriately. Check the power supply to make sure your network still has a healthy dose of power. Contact your service providers to ensure the network outage isn't originating on their end. Log in to your equipment and examine the error messages.
What is the difference between downtime and maintenance?
In manufacturing, “downtime” occurs when an unplanned event halts production for a period of time. This event can be a malfunction, repair, or changeover of tools or equipment. Maintenance downtime in particular is when a machine is not operating or being productive due to required maintenance work.
How is downtime KPI calculated?
To measure the KPI, you can track server downtime either as a comprehensive figure (including both planned and unplanned outages), or measure each individually. In the former case, simply add up all the times your servers were offline for the desired measurement period (daily, weekly, monthly, or yearly).
How do you calculate downtime cost per hour?
The cost per hour of downtime is calculated by adding labor costs per hour to the revenue lost per hour.
What is the ITIL definition of downtime?
Downtime: Total period that a service or component is not operational, within agreed service times.
What is downtime in accounting?
Downtime is the period during which equipment is not operational. This situation is caused by such factors as maintenance, setup for a job, broken equipment, or missing inputs, such as raw materials or qualified operators.
How does downtime increase productivity?
“You can't have the high without the low. The better you are at resting, the better you will be at working.” Downtime is essential for increasing attention, boosting mood, unlocking creativity, and solving problems. It's also necessary for improving learning and memory and restoring mental health at work.
What is the advantage of downtime?
Downtime gives us time and space to enjoy our personal lives and get personal tasks done. It grants us time with family, friends, and our hobbies. On a brain level, it allows us to reach homeostasis and is a necessary break from the aroused state, Dr. Hanson says.
What is unpredictable downtime?
Unplanned downtime occurs when there is an unexpected shutdown or failure of equipment or process. Unplanned downtime not only causes costly delays in maintenance, production schedules and order deliveries, but it also increases the chance of personnel injury, environmental incidents and emergency repairs.
What is most tolerable downtime?
Definition(s): The amount of time mission/business process can be disrupted without causing significant harm to the organization's mission.
How much does downtime cost the auto industry?
For example, in the auto industry, downtime can cost up to $50,000 per minute. That's $3 million per hour. The true downtime cost includes a variety of wasted business support costs and lost business opportunity costs because resources were needed to resolve a downtime incident that probably didn't need to happen.
Is database downtime costly?
Database outages can have a significant impact on top line revenue. In fact, according to a survey conducted by ITIC, 98% of organizations say a single hour of downtime costs over $100,000, while 81% report that it costs over $300,000. And that's just for a single hour!
What is the average cost of downtime in a data center?
According to Gartner, downtime costs $5,600 per minute on average. This results in average costs between $140,000 and $540,000 per hour depending on the organization. Some factors that contribute to the costs associated with downtime include: Lost sales.
Is the chip shortage over for the auto industry?
The Auto Chip Shortage Remains, But It May Be Improving
This figure highlights the continued production difficulties that manufacturers face. However, if Fiorani's estimate holds true, it would mark a significant improvement for the industry.
In 2021, hamstrung by the global microchip shortage, the automotive industry lost more than $200 billion. Eleven million fewer vehicles were produced; manufacturing plants idled.
What are the financial impacts of downtime?
The cost of downtime = downtime duration x per-minute cost.
You can use around $400 as a cost-per-minute figure for small enterprises. In the case of large and medium businesses, use $10,000. Many people only associate downtime costs with lost revenue.
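As a rough sketch of the duration x per-minute-cost rule, the snippet below uses the ballpark figures quoted here ($400 per minute for small enterprises, $10,000 for medium and large ones). The function and dictionary names are hypothetical, and the rates should be replaced with your own numbers where you have them.

```python
# Sketch: quick downtime cost estimate = downtime duration x per-minute cost.
# The per-minute rates are the ballpark figures quoted in the text, not measured values.

COST_PER_MINUTE = {
    "small": 400,               # small enterprises
    "medium_or_large": 10_000,  # medium and large businesses
}


def estimate_downtime_cost(minutes_of_downtime: int, business_size: str) -> int:
    """Return a rough downtime cost for the given duration and business size."""
    if business_size not in COST_PER_MINUTE:
        raise ValueError(f"unknown business size: {business_size!r}")
    return minutes_of_downtime * COST_PER_MINUTE[business_size]


for size in COST_PER_MINUTE:
    cost = estimate_downtime_cost(30, size)
    print(f"30 minutes of downtime, {size} business: ${cost:,}")
```

As the text notes, this only captures the direct per-minute cost; indirect costs have to be estimated separately.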
That downtime comes at a cost, and it isn't cheap. For example, the average automotive manufacturer loses $22,000 per minute when the production line stops. That quickly adds up. Overall, unplanned downtime costs industrial manufacturers as much as $50 billion a year.
What is the most expensive part of a data center?
Infrastructure. Without a doubt, the biggest cost in making a data center “rack ready” is the infrastructure.
What is the biggest cost in data center?
Facility construction cost
The most significant variable that impacts data center cost is the money spent on constructing the physical data center facility. Facility costs account for around 45 percent of total data center costs, on average.
- Labor Cost: ([Number of Engineers] X [Annual Salary of Engineer]) X 30%
- Compliance Risk: [4% of Your Revenue in 2019]
- Opportunity Cost: [Revenue you could have generated if you moved faster, releasing X new products, and acquired Y new customers]
- Labor Cost + Compliance Risk + Opportunity Cost = $ Annual Cost of Data Downtime.
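The components listed above can be combined as in the sketch below. All input figures are made up for illustration. The 30% labor share and the 4% compliance figure come from the list, while the function and parameter names are hypothetical, and reading the list as a simple sum is an assumption.

```python
# Sketch: annual cost of data downtime as the sum of three components.
# Assumption: the three list items are meant to be added together.

LABOR_SHARE = 0.30       # share of engineering time attributed to data downtime (from the list)
COMPLIANCE_SHARE = 0.04  # share of revenue treated as compliance risk (from the list)


def annual_data_downtime_cost(engineers: int,
                              annual_salary: float,
                              annual_revenue: float,
                              opportunity_cost: float) -> float:
    """Return labor cost + compliance risk + opportunity cost."""
    labor_cost = engineers * annual_salary * LABOR_SHARE
    compliance_risk = annual_revenue * COMPLIANCE_SHARE
    return labor_cost + compliance_risk + opportunity_cost


# Made-up inputs: 10 engineers at $150,000, $50M revenue, $1M of missed opportunity.
total = annual_data_downtime_cost(10, 150_000, 50_000_000, 1_000_000)
print(f"Estimated annual cost of data downtime: ${total:,.0f}")
```

The opportunity cost is the hardest input to pin down, so it is passed in directly rather than derived.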