Mitigate the Risk of Cloud Downtime and Data Loss



  • Senior leadership is asking difficult questions about the organization’s dependency on third-party cloud services and the risk that poses.
  • IT leaders have limited control over third-party incidents and that includes cloud services. Yet they are on the hot seat when cloud services go down.
  • While vendors have swooped in to provide resilience options for the more-common SaaS solutions, it is not the case for all cloud services.

Our Advice

Critical Insight

  • No control over the software does not mean no recovery options. Solutions range from designing an IT workaround using alternate technologies to pre-defined third-party service continuity options (e.g. see options for O365) to business workarounds.
  • Even where there is limited control, you can at least define an incident response plan to streamline notification, assessment, and implementation of workarounds. Leadership wants more options than simply waiting for the service to come back online.
  • At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA issues and overall resilience gaps.

Impact and Result

  • Follow a structured process to assess cloud resilience risk.
  • Identify opportunities to mitigate risk – at the very least, ensure critical data is protected.
  • Summarize cloud services risk, mitigation options, and incident response for senior leadership.

Mitigate the Risk of Cloud Downtime and Data Loss Research & Tools

Besides the small introduction, subscribers and consulting clients within this management domain have access to:

1. Mitigate the Risk of Cloud Downtime and Data Loss – Step-by-step guide to assess risk, identify risk mitigation options, and create an incident response plan.

Even where there is limited control, you can define an incident response plan to streamline notification, assessment, and implementation of workarounds.

  • Mitigate the Risk of Cloud Downtime and Data Loss Storyboard

2. Cloud Services Incident Risk and Mitigation Review – Review your key cloud vendors’ SLAs, incident preparedness, and data protection strategy.

At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA and overall resilience gaps.

  • Cloud Services Incident Risk and Mitigation Review Tool

3. SaaS Incident Response Workflows – Use these examples to guide your efforts to create cloud incident response workflows.

The examples illustrate different approaches to incident response depending on the criticality of the service and options available.

  • SaaS Incident Response Workflows (Visio)
  • SaaS Incident Response Workflows (PDF)

4. Cloud Services Resilience Summary – Use this template to capture your results.

Summarize cloud services risk, mitigation options, and incident response for senior leadership.

  • Cloud Services Resilience Summary
[infographic]

Further reading

Mitigate the Risk of Cloud Downtime and Data Loss

Resilience and disaster recovery in an increasingly Cloudy and SaaSy world.

Analyst Perspective

If you think cloud means you don’t need a response plan, then get your resume ready.

Frank Trovato

Most organizations are now recognizing that they can’t ignore the risk of a cloud outage or data loss, and the challenge is “what can I do about it?” since there is limited control.

If you still think “it’s in the cloud, so I don’t need to worry about it,” then get your resume ready. When O365 goes down, your executives are calling IT, not Microsoft, for an answer of what’s being done and what can they do in the meantime to get the business up and running again.

The key is to recognize what you can control and what actions you can take to evaluate and mitigate risk. At a minimum, you can ensure senior leadership is aware of the risk and define a plan for how you will respond to an incident, even if that is limited to monitoring and communicating status.

Often you can do more, including defining IT workarounds, backing up your SaaS data for additional protection, and using business process workarounds to bridge the gap, as illustrated in the case studies in this blueprint.

Frank Trovato
Research Director, Infrastructure & Operations

Info-Tech Research Group

Use this blueprint to expand your DRP and BCP to account for cloud services

As more applications are migrated to cloud-based services, disaster recovery (DR) and business continuity plans (BCP) must include an understanding of cloud risks and actions to mitigate those risks. This includes evaluating vendor and service reliability and resilience, security measures, data protection capabilities, and technology and business workarounds if there is a cloud outage or incident.

Use the risk assessments and cloud service incident response plans developed through this blueprint to supplement your DRP and BCP as well as further inform your crisis management plans (e.g. account for cloud risks in your crisis communication planning).

Overall Business Continuity Plan

IT Disaster Recovery Plan

A plan to restore IT application and infrastructure services following a disruption.

Info-Tech’s Disaster Recovery Planning blueprint provides a methodology for creating the IT DRP. Leverage this blueprint to validate and provide inputs for your IT DRP.

BCP for Each Business Unit

A set of plans to resume business processes for each business unit.

Info-Tech’s Develop a Business Continuity Plan blueprint provides a methodology for creating business unit BCPs as part of an overall BCP for the organization.

Crisis Management Plan

A plan to manage a wide range of crises, from health and safety incidents to business disruptions to reputational damage.

Info-Tech’s Implement Crisis Management Best Practices blueprint provides a framework for planning a response to any crisis, from health and safety incidents to reputational damage.

Executive Summary

Your Challenge

Common Obstacles

Info-Tech’s Approach

  • Senior leadership is asking difficult questions about the organization’s dependency on third-party cloud services and the risk that poses.
  • Migrating to cloud services transfers much of the responsibility for day-to-day platform maintenance but not accountability for resilience.
  • IT leaders are often responsible for not just the organization’s IT DRP but also BCP and other elements of overall resilience. Cloud risk adds another element IT leaders need to consider.
  • IT leaders have limited control over third-party incidents and that includes cloud services. With SaaS services in particular, recovery or continuity options may be limited.
  • While vendors have swooped in to provide resilience options for the more common SaaS solutions, that is not the case for all cloud services.
  • Part of the solution is defining business process workarounds and that depends on cooperation from business leaders.
  • At a minimum, IT’s responsibility is to identify and communicate risk to senior leadership. That starts with a vendor review to identify SLA and overall resilience gaps.
  • Adapt how you approach downtime and data loss risk, particularly for SaaS solutions where there is limited or no control over the system.
  • Even where there is limited control, you can define an incident response plan to streamline notification, assessment, and implementation of workarounds. Leadership wants more options than simply waiting for the service to come back online.

Info-Tech Insight

Asking vendors about their DRP, BCP, and overall resilience has become commonplace. Expect your vendors to provide answers so you can assess risk. Furthermore, your vendor may have additional offerings to increase resilience or recommendations for third parties who can further assist your goals of improving cloud service resilience.

Key deliverable

Cloud Services Resilience Summary

Provide leadership with a summary of cloud risk, downtime workarounds implemented, and additional data protection.

The image contains a screenshot of the Cloud Services Resilience Summary.

Additional tools and templates in this blueprint

Cloud Services Incident Risk and Mitigation Review Tool

Use this tool to gather vendor input, evaluate vendor SLAs and overall resilience, and track your own risk mitigation efforts.

The image contains a screenshot of the Cloud Services Incident Risk and Mitigation Review Tool.

SaaS Incident Response Workflows

Use the examples in this document as a model to develop your own incident response workflows for cloud outages or data loss.

The image contains a screenshot of the SaaS Incident Response Workflows.

This blueprint will step you through the following actions to evaluate and mitigate cloud services risk

  1. Assess your cloud risk
  • Review your cloud services to determine potential impact of downtime/data loss, vendor SLA gaps, and vendor’s current resilience.
  • Identify options to mitigate risk
    • Explore your cloud vendor’s resilience offerings, third-party solutions, DIY recovery options, and business workarounds.
  • Create an incident response plan
    • Document your cloud risk mitigation strategy and incident response plan, which might include a failover strategy, data protection, and/or business continuity.

    Cloud Risk Mitigation

    Identify options to mitigate risk

    Create an incident response plan

    Assess risk

    Phase 1: Assess your cloud risk

    Phase 1

    Phase 2

    Phase 3

    Assess your cloud risk

    Identify options to mitigate risk

    Create an incident response plan

    Cloud does not guarantee uptime

    Public cloud services (e.g. Azure, GCP, AWS) and popular SaaS solutions experience downtime every year.

    A few cloud outage examples:

    • Microsoft Azure AD outage, March 15, 2022:
      Many users could not log into O365, Dynamics, or the Azure Portal.
      Cause: software change.
    • Three AWS outages in December 2021: December 7 (Netflix and others impacted), December 15 (Duo, Zoom, Slack, others), December 20 (Slack, Epic Games, others). Cause: network issues, power outage.
    • Salesforce outage, May 12, 2022: Users could not access the Lightning platform. Cause: expired certificate.

    Cloud availability

    • Migrating to cloud services can improve availability, as they typically offer more resilience than most organizations can afford to implement themselves.
    • However, having multiple data centers, zones, and regions doesn’t prevent all outages, as we see every year with even the largest cloud vendors.

    DR challenges for IaaS, PaaS, and cloud-native

    While there are limits to what you control, often traditional “failover” DR strategy can apply.

    High-level challenges and resilience options:

    • IaaS: No control over the hardware, but you can failover to another region. This is fairly similar to traditional DR.
    • PaaS: No control over the software platform (e.g. SQL server as a service), but you can back up your data and explore vendor options to replicate your environment.
    • Cloud-native applications: As with PaaS, you can back up your data and explore vendor options to replicate your environment.

    Plan for resilience

    • Include DR requirements when designing cloud service implementation. For example, for IaaS solutions, identify what data would need to be replicated and what services may need to be “always on” (e.g. database services where high-availability is demanded).
    • Similarly, for PaaS and cloud-native solutions, consult your vendor regarding options to build in resilience options (e.g. ability to failover to another environment).

    DR challenges for SaaS solutions

    SaaS is the biggest challenge because you have no control over any part of the base application stack.

    High-level challenges and resilience options:

    • No control over the hardware (or the facility, maintenance processes, and so on).
    • No control over the base application (control is limited to configuration settings and add-on customizations or integrations).
    • Options to back up your data will depend on the service.

    Note: The rest of this blueprint is focused primarily on SaaS resilience due to the challenges listed here. For other cloud services, leverage traditional DR strategies and vendor management to mitigate risk (as summarized on the previous slides).

    Focus on what you can control

    • For SaaS solutions in particular, you must toss out traditional DR. If Salesforce has an outage, you won’t be involved in recovering the system.
    • Instead, DR for SaaS needs to focus on improving resilience where you do have control and implementing business workarounds to bridge the gap.

    Evaluate your cloud services to clarify your specific risks

    Time and money is limited, so focus first on cloud services that are most critical and evaluate the vendors’ SLA and existing resilience capabilities.

    The activities on the next two slides will evaluate risk through two approaches:

    Activity 1: Estimate potential impact of downtime and data loss to quantify the risk and determine which cloud services are most critical and need to be prioritized. This is done through a business impact analysis that assesses:

    • Impact on revenue or costs (if applicable).
    • Impact on reputation (e.g. customer impact).
    • Impact on regulatory compliance and health and safety (if applicable).

    Activity 2: Review the vendor to identify risks and gaps. Specifically, evaluate the following:

    • Incident Management SLAs (e.g. does the SLA include RTO/RPO commitments? Do they meet your requirements?)
    • Incident Response Preparedness (e.g. does the vendor have a DRP, BCP, and security incident response plan?)
    • Data Protection (e.g. does their backup strategy and data security meet your standards?)

    Activity 1: Quantify potential impact and prioritize cloud services using a business impact analysis (BIA)

    1-3 hours

    1. Download the latest version of our DRP BIA: DRP Business Impact Analysis Tool. The tool includes instructions.
    2. Include the cloud services you want to assess in the list of applications/systems (see the tool excerpt below), and follow the BIA methodology outlined in the Create a Right-Sized Disaster Recovery Plan blueprint.
    3. Use the results to quantify potential impact and prioritize your efforts on the most-critical cloud services.

    The image contains a screenshot of the DRP Business Impact Analysis Tool.

    Materials
    • DRP BIA Tool
    Participants
    • Core group of IT management and staff who can provide a well-rounded perspective on potential impact. They will create the first draft of the BIA.
    • Review the draft BIA with relevant business leaders to refine and validate the results.

    Activity 2: Review your key cloud vendors’ SLAs, incident preparedness, and data protection strategy

    1-3 hours

    Use the Cloud Services Incident Risk and Mitigation Review Tool as follows:

    1. Send the Vendor Questionnaire tab to your cloud vendors to gather input, and review your existing agreements.
    2. Copy the vendor responses into the tool (see the instructions in the tool) and evaluate. See the example excerpt below.
    3. Identify action items to clarify gaps or address risks. Some action items might not be defined yet and will need to wait until you have had a chance to further explore risk mitigation options.

    The image contains a screenshot of the Cloud Services Incident Risk and Mitigation Review Tool.

    Materials
    • Cloud Services Incident Risk and Mitigation Review Tool
    Participants
    • Core group of IT management and staff tasked with evaluating and improving cloud services’ resilience.

    Phase 2: Identify options to mitigate risk

    Phase 1

    Phase 2

    Phase 3

    Assess your cloud risk

    Identify options to mitigate risk

    Create an incident response plan

    Consult your vendor to identify options to improve resilience, as a starting point

    Your vendor might also be able to suggest third parties that offer additional support, backup, or service continuity options.

    • The Vendor Questionnaire tab in the Cloud Services Incident Risk and Mitigation Review Tool includes a section at the bottom where your vendor can name additional options to improve resilience (e.g. premium support packages, potentially their own DR services).
    • If your vendor has not completed that part of the questionnaire, meet with them to discuss this. Asking service vendors about resilience has become commonplace, so they should be prepared to answer questions about their own offerings and potentially can name trusted third-party vendors who can further assist you.
    • Leverage Info-Tech’s advisory services to evaluate options outlined by your vendor and potential third-party options (e.g. enterprise backup solutions that support backing up SaaS data).

    Some SaaS solutions have plenty of resilience options; others not so much

    • The pervasiveness of O365 has led vendors to close the service continuity gap, with options to send and receive email during an outage and back up your data.
    • With many SaaS solutions, there isn’t going to be a third-party service continuity option, but you might still be able to at least back up your data and implement business process workarounds to close the service gap.

    Example SaaS risk and mitigation: O365

    Risk

    • Several outages every year (e.g. MS Teams July 20, 2022).
    • SLA exceptions include “Scheduled Downtime,” which can occur with just five days’ notice.
    • The Recycling Bin is your data backup, depending on your setup.

    Options to mitigate risk (not an exhaustive list):

    • Third-party solutions for email service continuity.
    • Several backup vendors (e.g. Veeam, Rubrik) can protect most of your O365 suite.
    • Business continuity workarounds leveraging synced OneDrive, SharePoint, and Outlook (access to calendar invites).

    Example SaaS risk and mitigation: Salesforce

    Risk

    • Downtime has been infrequent, but Salesforce did have a major outage in May 2021 (DNS issue) and May 2022 (expired certificate).
    • At the time of this writing, the Main Services Agreement does not commit to a specific uptime value and specifies the usual exclusions.
    • Similarly, there are limited commitments regarding data protection.

    Options to mitigate risk (not an exhaustive list):

    • Salesforce provides a backup and restore service offering.
    • In addition, some third-party vendors support backing up Salesforce data for additional protection against data corruption or data loss.
    • Business continuity workarounds can further reduce the impact of downtime (e.g. record updates in MS Word and leverage Outlook for contact info until Salesforce is recovered).

    Establish a baseline standard for risk mitigation, regardless of cloud service

    At a minimum, set a goal to review vendor risk at least annually, define standard processes for monitoring outages, and review options to back up your SaaS data.

    Example baseline standard for cloud risk mitigation

    • Review vendor risk at least annually. This includes reviewing SLAs, vendor’s incident preparedness (e.g. do they have a current DRP, BCP, and Security IRP?), and the vendor’s data protection strategy.
    • Incident response plans must include, at a minimum, steps to monitor vendor outage and communicate status to relevant stakeholders. Where possible, business process workarounds are defined to bridge the service gap.
    • For critical data (based on your BIA and an evaluation of risk), maintain your own backups of SaaS data for additional protection.

    Embed risk mitigation standards into existing IT operations

    • Include specific SLA requirements, including incident management processes, in your RFP process and annual vendor review.
    • Define cloud incident response in your incident management procedures.
    • Include cloud data considerations in your backup strategy reviews.

    Phase 3: Create an incident response plan

    Phase 1

    Phase 2

    Phase 3

    Assess your cloud risk

    Identify options to mitigate risk

    Create an incident response plan

    Activity 1: Review the example incident response workflows and case studies as a starting point

    1-3 hours

    1. Review the SaaS Incident Response Workflows examples. The examples illustrate different approaches to incident response depending on the criticality of the service and options available.
    2. Review the case studies on the next few slides, which further illustrate the resilience and incident response solutions implemented.
    3. Note the key elements:
    • Detection
    • Assessment
    • Monitoring status / contacting the vendor
    • Communication with key stakeholders
    • Invoking workarounds, if applicable

    Example SaaS Incident Response Workflow Excerpt

    The image contains a screenshot of an example of the SaaS Incident Response Workflow Excerpt.
    Materials
    • SaaS Incident Response Workflows examples
    Participants
    • Core group of IT management and staff tasked with evaluating and improving cloud services’ resilience.
    • Relevant business process owners to provide input and define business workarounds, where applicable.

    Case Study 1: Recovery plan for critical fundraising event

    If either critical SaaS dependency fails, the following plan is executed:

    1. Donors are redirected to a predefined alternate donation page hosted by a different service. The alternate page connects to the backup payment processing service (with predefined integrations).
    2. Marketing communications support the redirect.
    3. While the backup solution doesn’t gather as much data, the payment details provide enough information to follow up with donors where necessary.

    Criticality justified a failover option

    The Annual Day of Giving generates over 50% of fundraising for the year. It’s critically dependent on two SaaS solutions that host the donation page and payment processing.

    To mitigate the risk, the organization implemented the ability to failover to an alternate “environment” – much like a traditional DR solution – supported by workarounds to manage data collection.

    Case Study 2: Protecting customer data

    Daily exports from a SaaS-hosted donations site reduce potential data loss:

    1. Daily exports to a CRM support donor profile updates and follow-ups (tax receipts, thank-you letters, etc.).
    2. The exports also mitigate the risk of data loss due to an incident with the SaaS-hosted donation site.
    3. This company is exploring more-frequent exports to further reduce the risk of data loss.

    Protecting your data gives you options

    For critical data, do you want to rely solely on the vendor’s default backup strategy?

    If your SaaS vendor is hit by ransomware or if their backup frequency doesn’t meet your needs, having your own data backup gives you options.

    It can also support business process workarounds that need to access that data while waiting for SaaS recovery.

    Case Study 3: Recovery plan for payroll

    To enable a more accurate payroll workaround, the following is done:

    1. After each payroll run, export the payroll data from the SaaS solution to a secure location.
    2. If there is a SaaS outage when payroll must be submitted, the exported data can be modified and converted to an ACH file.
    3. The ACH file is submitted to the bank, which has preapproved this workaround.

    BCP can bridge the gap

    When leadership looks to IT to mitigate cloud risk, include BCP in the discussion.

    Payroll is a good example where the best recovery option might be a business continuity workaround.

    IT often still has a role in business continuity workarounds, as in this case study: specifically, providing a solution to modify and convert the payroll data to an ACH file.

    Activity 2: Run tabletop planning exercises as a starting point to build your incident response plan

    1-3 hours

    1. Follow the tabletop planning instructions provided in the Create a Right-Sized Disaster Recovery Plan blueprint.
    2. Run the exercise for each cloud service. Keep the scenario generic at first (e.g. cloud service is down with no reported root cause) so you can focus on your response. Capture response steps and gaps.
    3. Add complexity in subsequent exercises (e.g. data loss plus downtime), and use that to expand and refine the workflow as needed.
    4. Use the resulting workflows as the core piece of your incident response plan.
    5. Supplement the workflow with relevant checklists or procedures. At this point you can choose to incorporate this into your DRP or BCP or maintain these documents as supplements to those plans.
      See the DRP Case Study and BCP Case Study for an example of DRP-BCP documentation.

    Example tabletop planning results excerpt with gaps identified

    The image contains an example tabletop planning results excerpt with gaps identified.

    Materials
    • SaaS Incident Response Workflows examples
    Participants
    • Core group of IT management and staff tasked with evaluating and improving cloud services’ resilience.
    • Review results with relevant business process owners to provide input and define business workarounds where applicable.

    Activity 3: Summarize cloud services resilience to inform senior leadership of current risks and mitigation efforts

    1-3 hours

    1. Use the Cloud Services Resilience Summary example as a template to capture the following:
    • The results of your vendor review (i.e. incident management SLAs, incident response preparedness, data protections strategy).
    • The current state of your downtime workarounds and additional data loss protection.
    • Your baseline standard for cloud services risk mitigation.
    • Summary of resilience, risks, workarounds, and data loss protection for each individual cloud service that you have reviewed.
  • Present the results to senior leadership to:
    • Highlight risks to inform business decisions to mitigate or accept those risks.
    • Summarize actions already taken to mitigate risks.
    • Communicate next steps (e.g. action items to address remaining risks).

    Cloud Services Resilience Summary – Table of Contents

    The image contains a screenshot of Cloud Services Resilience Summary – Table of Contents.
    Materials
    • Cloud Services Resilience Summary
    Participants
    • Core group of IT management and staff tasked with evaluating and improving cloud services’ resilience.
    • Review results with relevant business process owners to provide input and define business workarounds where applicable.

    Summary: For cloud services, after evaluating risk, IT must adapt how they approach risk mitigation

    1. Identify failover options where possible
    • A failover strategy is possible for many cloud services (e.g. IaaS replication to another region, or failing over SaaS to an alternate solution as in case study 1).
  • At least protect your data
    • Explore supplementary backup options to protect against ransomware, data corruption, or data loss and support business continuity workarounds (see case study 2).
  • Leverage BCP to close the gap
    • This doesn’t absolve IT of its role in mitigating cloud incident risk, but business process workarounds can bridge the gap where IT options are limited (see case study 3).

    Related Info-Tech Research

    IT DRP Maturity Assessment

    Get an objective assessment of your DRP program and recommendations for improvement.

    Create a Right-Sized Disaster Recovery Plan

    Close the gap between your DR capabilities and service continuity requirements.

    Develop a Business Continuity Plan

    Streamline the traditional approach to make BCP development manageable and repeatable.

    Implement Crisis Management Best Practices

    Don’t be another example of what not to do. Implement an effective crisis response plan to minimize the impact on business continuity, reputation, and profitability.

    Buying Options

    Mitigate the Risk of Cloud Downtime and Data Loss

    €309.50
    (Excl. 21% tax)

     

    IT Risk Management · IT Leadership & Strategy implementation · Operational Management · Service Delivery · Organizational Management · Process Improvements · ITIL, CORM, Agile · Cost Control · Business Process Analysis · Technology Development · Project Implementation · International Coordination · In & Outsourcing · Customer Care · Multilingual: Dutch, English, French, German, Japanese · Entrepreneur
    Tymans Group is a brand by Gert Taeymans BV
    Gert Taeymans bv
    Europe: Koning Albertstraat 136, 2070 Burcht, Belgium — VAT No: BE0685.974.694 — phone: +32 (0) 468.142.754
    USA: 4023 KENNETT PIKE, SUITE 751, GREENVILLE, DE 19807 — Phone: 1-917-473-8669

    Copyright 2017-2022 Gert Taeymans BV