Reliability Metrics that Should Be Part of Every Disaster Recovery Strategy

Disaster recovery strategies are at the top of IT leaders’ minds. What happens if a natural disaster strikes or your network becomes the target of hackers? These issues are all too prevalent amidst the other challenges of managing day-to-day operations in mid-size IT departments, often with tight budgets and limited human resources. However, it’s important to consider reliability metrics and KPIs when designing your disaster plan. Here’s a quick overview of key performance indicators to keep in mind when developing your plan.

Reliability: A Key Factor in the Employee and Customer Experience

Today’s audiences have come to expect reliability as part of their digital experience. When a key system goes offline or people are unable to access critical data, it can quickly get expensive. Employees are unproductive, and customers may become unhappy enough to go to the competition. As a result, real-time issues impact the quality of your IT delivery.

Reliability also plays a role in disaster planning. Ensuring that your organization is constantly looking at relevant metrics — and using them as the basis for continuous improvement — can keep key systems functioning through even the most difficult times. It also ensures that a lack of maintenance or other issues won’t compound your troubles when disasters do affect your workplace.

Uptime: The Holy Grail of Reliability

Uptime and downtime are two of the most important reliability metrics. As Techopedia notes, “Uptime is a metric that represents the percentage of time that hardware, an IT system or device is successfully operational. It refers to when a system is working, versus downtime, which refers to when a system is not working.” In other words, what percentage of the time are your key systems or data solutions accessible? Reliability KPIs can look at several components of the organization’s infrastructure:

  • Equipment reliability, including servers and end-point devices like computers and mobile equipment
  • Network reliability, meaning the percentage of time that your organization is connected to the Internet
  • Software reliability, for both self-hosted and Software-as-a-Service solutions
  • Planned versus actual uptime: planned uptime accounts for scheduled downtime for maintenance and other activities, so comparing it with actual uptime shows what role work estimates may be playing in your overall reliability
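
As a rough illustration of the arithmetic behind these metrics, the sketch below computes planned and actual uptime for a single month. The function name and every figure in it are made up for the example; substitute the numbers from your own monitoring tools.

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Share of a period, as a percentage, during which the system was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical 30-day month, measured in minutes.
MONTH_MINUTES = 30 * 24 * 60
planned_maintenance = 120  # assumed: scheduled maintenance windows
unplanned_outage = 96      # assumed: unscheduled downtime

# Planned uptime only subtracts the downtime you scheduled on purpose.
planned = uptime_percent(MONTH_MINUTES, planned_maintenance)
# Actual uptime subtracts everything, scheduled or not.
actual = uptime_percent(MONTH_MINUTES, planned_maintenance + unplanned_outage)

print(f"planned uptime: {planned:.2f}%")
print(f"actual uptime:  {actual:.2f}%")
print(f"gap:            {planned - actual:.2f} percentage points")
```

The gap between the two figures is the part worth watching: it is the downtime nobody planned for.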

Performance KPIs

Uptime and downtime provide an important part of the overall picture. However, they don’t speak to whether assets are performing at expected levels. For example, your network may be online, but is it consistently delivering the bandwidth promised in your SLAs?

Understanding these metrics has two benefits. The first is ensuring that you’re performing at optimum levels for the most time possible. The second is providing a baseline for determining acceptable performance during times of stress. For example, if disaster conditions are occurring, what levels of performance are acceptable, and how do they compare to your usual levels of delivery? Outlining these KPIs provides both a target to aim for and a way to track performance trends over time.
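
One way to put numbers on both ideas is sketched below: how often measured bandwidth meets the promised level, and what a reduced "disaster mode" floor might look like relative to a normal baseline. The sample values, the 500 Mbps promise and the 60 percent floor are all assumptions chosen for illustration, not recommendations.

```python
from statistics import median


def sla_attainment(samples_mbps, promised_mbps):
    """Percentage of samples at or above the promised bandwidth."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    met = sum(1 for s in samples_mbps if s >= promised_mbps)
    return 100.0 * met / len(samples_mbps)


# All figures below are hypothetical.
PROMISED_MBPS = 500
DEGRADED_FLOOR_RATIO = 0.6  # assumed: 60% of baseline is tolerable in a crisis

normal_week = [510, 505, 498, 520, 515, 470, 509, 512]

attainment = sla_attainment(normal_week, PROMISED_MBPS)
baseline = median(normal_week)                    # typical day-to-day delivery
degraded_floor = DEGRADED_FLOOR_RATIO * baseline  # minimum acceptable under stress

print(f"SLA attainment: {attainment:.1f}% of samples")
print(f"baseline:       {baseline:.1f} Mbps")
print(f"degraded floor: {degraded_floor:.1f} Mbps")
```

The median is used for the baseline here so that one unusually bad sample does not drag the target down; a mean or a percentile would be equally reasonable choices.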

Scheduled Maintenance Compliance

Another KPI to consider when evaluating your organization’s systems reliability is related to scheduled maintenance. Scheduled maintenance can refer to installing updates and patches, performing preventative maintenance on hardware, or resolving data issues in software. Complying with scheduled maintenance can help prolong the life of your hardware and keep your software functioning smoothly. However, many organizations struggle with how to prioritize scheduled maintenance. Looking at compliance levels is one way to ensure that you’re keeping a finger on the pulse of your overall system health.
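
A compliance level can be as simple as the share of due tasks that were finished on time. The sketch below assumes one possible definition (completed on or before the due date); the task list, dates and class name are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MaintenanceTask:
    name: str
    due: date
    completed: Optional[date] = None  # None means still outstanding


def compliance_percent(tasks, as_of):
    """Percentage of tasks due on or before `as_of` that were finished by their due date."""
    due_tasks = [t for t in tasks if t.due <= as_of]
    if not due_tasks:
        return 100.0  # nothing was due yet, so nothing was missed
    on_time = sum(
        1 for t in due_tasks
        if t.completed is not None and t.completed <= t.due
    )
    return 100.0 * on_time / len(due_tasks)


# Hypothetical maintenance calendar.
tasks = [
    MaintenanceTask("OS patches", date(2024, 3, 1), date(2024, 2, 27)),
    MaintenanceTask("Firmware update", date(2024, 3, 5), date(2024, 3, 9)),  # late
    MaintenanceTask("UPS battery check", date(2024, 3, 10)),                 # not done
    MaintenanceTask("Database index rebuild", date(2024, 3, 12), date(2024, 3, 12)),
    MaintenanceTask("Quarterly restore test", date(2024, 4, 1)),             # not yet due
]

score = compliance_percent(tasks, as_of=date(2024, 3, 15))
print(f"scheduled maintenance compliance: {score:.0f}%")
```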

Ensuring that your equipment is in the best shape possible and that all updates are current is an asset during a crisis. One company where I worked experienced a weather-related outage. When the IT team worked to bring key systems online, there was a failure due to a critical patch that hadn’t previously been applied. While the team worked through it, the issue added several hours to getting key systems back online and could have been entirely avoided with regularly scheduled maintenance.

Security and Compliance KPIs

Security is a critical part of any organization’s IT strategy. In heavily regulated industries such as health care and finance, systems may be held to compliance indicators. Violating those indicators puts client and company data at risk and can lead to expensive consequences. What security indicators do you need to put in place? Are there compliance-related KPIs that you need to take into account? Some KPIs to consider include:

  • Percentage of applications and systems tested for vulnerabilities
  • Defect remediation window, or how long it takes you to address identified vulnerabilities
  • Rate of defect recurrence, or how successfully you prevent the same threats from happening twice
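
The three indicators above reduce to straightforward arithmetic once findings are tracked somewhere. The sketch below is one possible way to compute them; the record layout, field names and figures are assumptions for illustration rather than the output of any particular scanner or ticketing tool.

```python
from datetime import date


def percent(part, whole):
    """Express part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def mean_remediation_days(findings):
    """Average days from discovery to fix, over findings that have been fixed."""
    closed = [(f["fixed"] - f["found"]).days for f in findings if f["fixed"] is not None]
    return sum(closed) / len(closed) if closed else None


def recurrence_percent(findings):
    """Share of findings that repeat an issue already seen before."""
    return percent(sum(1 for f in findings if f["repeat"]), len(findings))


# Hypothetical inventory and findings log.
systems_total, systems_tested = 24, 18
findings = [
    {"id": "F-1", "found": date(2024, 1, 3), "fixed": date(2024, 1, 10), "repeat": False},
    {"id": "F-2", "found": date(2024, 1, 8), "fixed": date(2024, 1, 29), "repeat": False},
    {"id": "F-3", "found": date(2024, 2, 2), "fixed": date(2024, 2, 4), "repeat": True},
    {"id": "F-4", "found": date(2024, 2, 10), "fixed": None, "repeat": False},  # still open
]

coverage = percent(systems_tested, systems_total)
window = mean_remediation_days(findings)
recurrence = recurrence_percent(findings)

print(f"systems tested:       {coverage:.0f}%")
print(f"remediation window:   {window:.1f} days on average")
print(f"recurring findings:   {recurrence:.0f}%")
```

Note that the open finding is left out of the remediation average, which flatters the number; tracking the age of open findings alongside it would give a fuller picture.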

Recovery Time Objective (RTO)

Your recovery time objective refers to how quickly you have to get back online after a disaster occurs. For example, let’s say that a storm hits and it affects your physical location. How much downtime can your business sustain while key systems and data are brought back online? Certain organizations may be able to tolerate more downtime than others, but raised expectations among users are contributing to an environment with shorter RTOs.

It’s important to define RTOs in relation to overall system performance and individual organizations. For example, certain systems may be essential to your operations and have a very short RTO. Others may be important but not urgent, and fall lower on the priority scale with a longer RTO.
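
To make the per-system idea concrete, the sketch below assigns each system its own RTO, derives a restoration order from those targets, and checks the results of a recovery drill against them. The system names, targets and drill timings are all hypothetical.

```python
from datetime import timedelta

# Assumed RTO targets: the shorter the RTO, the more urgent the system.
RTO_BY_SYSTEM = {
    "payments": timedelta(hours=1),
    "email": timedelta(hours=8),
    "reporting": timedelta(hours=48),
}

# Assumed recovery times measured during a disaster recovery drill.
drill_results = {
    "payments": timedelta(hours=1, minutes=30),
    "email": timedelta(hours=5),
    "reporting": timedelta(hours=20),
}


def rto_report(targets, measured):
    """Map each system to (met, margin), where margin = target - measured time."""
    report = {}
    for system, target in targets.items():
        margin = target - measured[system]
        report[system] = (margin >= timedelta(0), margin)
    return report


# Restore the systems with the shortest RTO first.
restore_order = sorted(RTO_BY_SYSTEM, key=RTO_BY_SYSTEM.get)
report = rto_report(RTO_BY_SYSTEM, drill_results)

print("restore order:", ", ".join(restore_order))
for system in restore_order:
    met, margin = report[system]
    hours = margin.total_seconds() / 3600
    print(f"{system}: {'met' if met else 'MISSED'} (margin {hours:+.1f} h)")
```

In this made-up drill the most urgent system is the one that misses its target, which is exactly the kind of finding a drill is meant to surface before a real disaster does.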

A Final Note on KPIs

KPIs are helpful because they can give you insight at any point into how you’re performing, and they can keep your team focused on where to invest their time. However, in a situation where you’re evaluating reliability, maintenance and disaster preparedness, it’s important that you remember to prioritize the end-user experience and the overall system health over meeting numbers. What’s measured gets managed, and you don’t want chasing the numbers to cost you the benefits of focusing on reliability.

As IndustryWeek notes, “In [some] organizations and with many others, numbers were valued over value-added, outcome-oriented work activities. Then to complicate matters even more, a comprehensive picture of reliability and maintenance effectiveness did not exist due to a narrow focus on a select few KPIs. Both management and employees had become victims of ensuring the numbers looked good versus delivering value-added sustaining maintenance.” KPIs are helpful tools in steering the ship, but real-time needs and common sense will sometimes take priority.

Disaster preparedness is a key component of keeping your information safe and getting back online in the event of a hack or natural disaster. Reliability and maintenance may get overlooked in the larger discussion, as disaster recovery’s less interesting counterpart. However, including these KPIs will not only improve overall performance but also help ensure that your systems are in excellent shape should a disaster occur. That’s one less worry for your team while they work to bring your services, applications and data back online.