Cloud Operational Resilience: Importance of multi-cloud and runbooks in catastrophe management (Guest blog from KPMG)

This blog was written by Mike Hadgraft, Associate Director - Technology Advisory, KPMG
As enterprises deepen their reliance on public cloud services, many are now considering what their ‘lifeboat’ options would be in the event of a fundamental cloud compromise. In Q2 2024, enterprise spending on cloud computing reached $79 billion, marking a 22% increase from the same quarter in the previous year.
The top three cloud providers—AWS, Microsoft Azure, and Google Cloud—collectively accounted for 66% of this total spending (Data Center Dynamics). But what happens when a breach is so severe that an organisation must rebuild its core operations in a clean cloud environment? How prepared are businesses to execute this recovery? Forward-thinking enterprises are now approaching resilience planning with this mentality—anticipating extreme failures to ultimately enhance overall robustness, even in less catastrophic scenarios.
What if your Cloud Identity Manager was hacked…
Regulators often put scenarios to organisations, and these can vary from a performance issue on a critical application all the way to an external actor gaining global admin access to the identity manager of the cloud tenant. What if, on top of that, you were unable to access native backup and recovery services because your entire cloud estate had been compromised, including all native backups, regardless of availability zone or region?
Given that many modern Cloud engineering patterns encourage 'identity first' and 'zero trust' architectures, the identity manager is often the bedrock of a Cloud estate. Not only does it manage all human user access to internally built applications and enrolled SaaS apps, it also underpins DevOps identity processes, providing the secrets, keys and certificates used for application-to-application processing.
Scenarios such as the hacking of this resource really do give food for thought on the level of dependency involved and the potential impact on business operations.
Consider Multi-Cloud
Where there is regulation there is, of course, opportunity for entrepreneurship, and organisations are already tapping into Cloud resilience offerings in this area of catastrophe management.
Companies now offer solutions that snapshot workloads to a separate Cloud hyperscaler and provide DevOps routines to recover them back to the native Cloud tenant, all as code. This can cover hosted IaaS resources but also PaaS resources across identity, networking, application and database tiers. In other words, it is not just physical hosts such as VMs and databases being backed up, but the IaC and associated storage itself, which can be re-deployed back to the native tenancy.
Suppliers offer SLAs and regional configuration of where back-ups are persisted and recovered from, within the defined requirements of the customer organisation. These services provide a truly 'air-gapped' architecture to meet the extremes of the scenarios now being put to organisations.
Define a Minimum Viable Business
After sourcing a truly air-gapped backup and recovery service, you need to understand the core Cloud dependencies your business requires to run. The real art in dealing with this challenge, without descending into analysis paralysis by trying to consider everything the organisation currently hosts and runs in Cloud, is to be Agile and think in 'minimum' terms for feasible business operation. This determines the critical business processes you need to get up and running, and therefore defines your key platform services and workloads.
Defining what those key services are for your organisation, and the process maps behind them, will ensure you understand the touch points for business teams and, by association, their reliance on the Cloud estate. It is also possible to draw out dependency trees showing how workloads rely on surrounding architecture and underlying infrastructure, revealing single points of failure. Organisations can then calculate KPIs against these services, such as the quantifiable loss of revenue for every hour a critical service is down, to add weight to the argument for investing in its recoverability.
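The dependency-tree and revenue-KPI ideas above can be sketched in a few lines. The following is a minimal illustration, with entirely hypothetical workload names and revenue figures, of how a shared dependency every workload relies on surfaces as a single point of failure, and how an hourly revenue-at-risk figure can be attached to it:

```python
from collections import Counter

# Hypothetical dependency map: each workload lists the platform
# services it depends on (all names are illustrative only).
dependencies = {
    "payments-api": ["identity-manager", "core-network", "payments-db"],
    "order-portal": ["identity-manager", "core-network", "orders-db"],
    "reporting":    ["identity-manager", "orders-db"],
}

# Quantified loss of revenue per hour of outage for each critical
# service (illustrative figures, in GBP).
revenue_per_hour = {"payments-api": 120_000, "order-portal": 45_000}

# A dependency shared by every workload is a single point of failure.
usage = Counter(dep for deps in dependencies.values() for dep in deps)
spofs = [dep for dep, count in usage.items() if count == len(dependencies)]
print("Single points of failure:", spofs)

# Hourly revenue at risk if such a shared dependency (here, the
# identity manager) were lost: every dependent workload goes down.
at_risk = sum(revenue_per_hour.get(w, 0) for w in dependencies)
print(f"Revenue at risk per hour: £{at_risk:,}")
```

Even a toy model like this makes the business case concrete: the identity manager sits under every workload, so its recoverability carries the combined hourly revenue of all of them.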
Everything as code vs runbooks
Quite often, as IT estates evolve, and in particular as organisations migrate to public cloud, documentation varies considerably in how workloads have been designed and implemented. Sometimes a datacentre migration project provides standardised, effective documentation in this area, but it is often the case that research and interviews are required to build a full picture of how, in low-level engineering terms, an application would actually be re-deployed with the configuration, data and access needed to meet BAU requirements.
A first step can be to use technical runbooks to document the sequential steps required to deploy a workload across infrastructure, database, applications and data in traditional N-tier architectures. However, organisations must also consider the cloud-native services likely in place across the estate, utilising PaaS and SaaS, that will be defined as infrastructure as code.
For these more modern services, the runbook approach should be applied to the DevOps structure of the code base, in tools such as Azure ARM templates or Terraform modules. The ordering of dependencies and interactions with other platform architecture can all be presented and exported clearly from the repository, effectively representing a runbook. This way the documentation is living, does not go stale, and empowers the DevOps team to manage the solution as a service aligned to SLAs and resilience requirements driven by the business.
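The "ordering of dependencies" point is essentially a topological sort of the module graph. As a sketch, assuming a hypothetical dependency map of the kind that could be extracted from Terraform module references or an ARM template's dependsOn entries, a deterministic deployment order (which doubles as a living runbook) can be derived directly:

```python
from graphlib import TopologicalSorter

# Hypothetical module dependency graph: each module maps to the
# modules it depends on (names are illustrative, not a real estate).
module_deps = {
    "network":  [],
    "identity": ["network"],
    "database": ["network"],
    "app":      ["identity", "database"],
}

# Topologically sorting the graph yields a valid deployment order:
# every module appears after everything it depends on.
runbook_order = list(TopologicalSorter(module_deps).static_order())
print("Runbook deployment order:", runbook_order)
```

Because the order is derived from the code base itself rather than a separate document, it cannot drift out of date the way a hand-maintained runbook can.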
Recovery environment in Cloud
So you're now aware of the critical services your business needs to run, and you know at a low level how particular applications and services would be rebuilt, but how do you demonstrate that you would recover successfully in the event of disaster? And how do you know what you have documented is accurate and testable?
This is where organisations can implement isolated 'recovery environments' that act as a test bed for the recovery of services should a catastrophe occur. Understanding the core business services that need to be up and running means you can work backwards and identify the top-level platform dependencies required to enable them; usually these are networking, firewalls and identity managers. Then there is a next tier of services such as end user compute, collaboration services and database servers. Together these create the foundations of, in effect, a separate landing zone where mock workloads can be deployed to check they are accessible and meet operational requirements.
There are technical nuances to be aware of, such as syncing of domain controllers and IP address clashes, and a completely isolated environment, not connected to any other part of the estate, is often the quickest way to implement the architecture, as well as to demonstrate to regulators that in the event of a hack of this scale the new environment will still be accessible. These programmes of work require rigorous specification of the steps to deploy services, along with testing through each stage from deployment to UAT. All the steps put together then need to be compared against service management requirements to ensure recovery time objectives (RTOs) are met, which are underpinned by the KPIs of business operation.
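The RTO comparison at the end of a rehearsal can be reduced to a simple check: measured recovery time per service against its agreed target. A minimal sketch, with hypothetical service names and entirely illustrative timings, might look like this:

```python
# Hypothetical results from a recovery rehearsal: minutes taken to
# restore each service in the isolated environment, versus the RTO
# agreed with the business (all figures illustrative).
rto_targets_min = {"identity-manager": 60, "payments-api": 120, "order-portal": 240}
measured_min    = {"identity-manager": 45, "payments-api": 150, "order-portal": 180}

# Flag every service whose measured recovery time exceeded its RTO.
breaches = {
    svc: (measured_min[svc], target)
    for svc, target in rto_targets_min.items()
    if measured_min[svc] > target
}
for svc, (actual, target) in breaches.items():
    print(f"RTO breach: {svc} took {actual} min (target {target} min)")
```

Running this after each rehearsal turns RTO compliance into a repeatable, reportable check rather than a one-off judgement.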
Evolving the estate and automation
Once an organisation has a recovery environment up and running, has proven its isolation, and has shown it can indeed recover services and provide access at an operational level, how do you keep it up to date, as an ongoing asset and a true reflection of Production?
As the organisation matures in its use of infrastructure as code, delta changes to the recovery environment can be automated through releases from DevOps pipelines, with less dependence on manual documentation and maintenance. If all manual documentation is eventually transitioned into code and made part of the BAU DevOps process, this gives organisations real-time, constant reassurance of resilience in the event of such a catastrophe. The organisation should live the Agile mindset, continuously optimising workloads to meet evolving business needs by fine-tuning runbook activities and reporting these metrics back to the business, with agreement that service requirements are being met.
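Keeping the recovery environment a "true reflection of Production" is at heart a drift-detection problem. As a sketch, assuming each pipeline can export the module versions it has deployed (the names and version strings below are hypothetical), the delta to re-release is just a comparison of the two snapshots:

```python
# Hypothetical snapshots of deployed module versions, as might be
# exported by the production and recovery-environment pipelines.
production = {"network": "1.4.0", "identity": "2.1.0", "app": "3.0.2"}
recovery   = {"network": "1.4.0", "identity": "2.0.5"}

# Anything missing or out of date in the recovery environment is
# drift that the next pipeline run should re-release.
drift = {
    module: (production[module], recovery.get(module))
    for module in production
    if recovery.get(module) != production[module]
}
print("Modules to re-release into the recovery environment:", drift)
```

Wired into the BAU pipeline, a check like this turns "is the lifeboat still seaworthy?" from a periodic audit into a continuous signal.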
In a world where cloud dependency is both an opportunity and a risk, multi-cloud resilience strategies are no longer optional—they're a necessity. The greatest benefit of thinking in ‘lifeboat’ terms is not just being prepared for a catastrophic event, but also strengthening resilience against lesser disruptions. By defining MVB processes, leveraging air-gapped architectures, and automating recovery, organisations can build robust defences against even the most severe scenarios, ensuring operational continuity no matter the scale of disruption.
For more information please contact: