In today's ever-changing aviation world, where technology is the backbone of airport operations, a business continuity plan is essential. Airports, much like any other business, depend heavily on various systems and services to ensure resilience, efficiency, and a good overall passenger experience.
Because these systems are interconnected, a service-interrupting event can happen at any time: it's not a matter of if, but when. Even the most reliable cloud service providers have suffered failures, so a resilient and thoroughly tested disaster recovery plan is imperative.
This article delves into a disaster recovery exercise centred around SkyCore AODB, one of the aviation industry's leading solutions. SkyCore AODB plays an important role in driving digitization and shaping the future of airport operations, and it is presented in depth in this blog post: The Role of SkyCore AODB in Driving Digitization and the Future of Airport Operations.
As the name implies, SkyCore AODB holds a central position in airport operations, managing a large volume of dynamically generated real-time data that integrates with various other airport management systems. It acts as a single source of truth for all information crucial to airport functions, so a failure in this central system would cascade to the interconnected systems and could create a crisis within the airport. This is precisely why we chose to test our ability to restore this critical system: its influence over airport functions makes a swift recovery essential.
The Core Principles: Safeguarding Operations in the Face of Disruptions
At AirportLabs, we understand the range of challenges an airport could face. We place emphasis on building resilient, fault-tolerant, and highly available systems. Beyond that, we carefully analyse and craft comprehensive plans aimed at safeguarding operations, minimising disruptions, and ensuring the uninterrupted delivery of services.
Disaster Recovery and Data Backup: In times of crisis, the backup of data is essential for a swift recovery. Whether triggered by a natural disaster, hardware malfunctions, or a deliberate cyberattack, the absence of a robust disaster recovery plan exposes airports to the risk of losing critical data. This, in turn, can lead to disrupted decision-making, compromised future planning, financial losses, and severe damage to their reputation.
IT Infrastructure Resilience with Multi-Cloud Hosting Solution: Central to our approach is the adoption of a multi-cloud hosting solution. This becomes especially crucial in situations involving catastrophic failures or natural disasters that might impact a specific region or data centre. We explore the benefits of this approach in our article, "Why One Cloud is Not Enough for Your Airport: The Power of Multi-Cloud Solutions". A multi-cloud setup ensures that critical data and systems remain accessible or can be restored from a different location, providing a robust safety net for your operations.
Risk and Vulnerability Assessment for Proactive Measures: Understanding that continuous improvement is key, we conduct rigorous risk and vulnerability assessments. Through constant monitoring and evaluation, we identify potential weak points in our systems. This proactive approach allows us to safeguard against emerging threats, ensuring that we are always one step ahead in the game of resilience.
Crisis Navigation: A Disaster Recovery Drill in Action
The essence of a successful rapid recovery exercise lies in addressing both technical and non-technical aspects. Swift recovery means being able to restore the system quickly, ideally in under one hour. On the technical side, this involves leveraging automated procedures to minimise errors and to rebuild the same infrastructure configuration rapidly. On the non-technical side, the team must communicate effectively, both internally and externally, to ensure seamless coordination during the recovery process.
At AirportLabs, SkyCore AODB, like all our systems, is designed with a robust, redundant, and fault-tolerant infrastructure. However, to simulate a real-world disaster and put our disaster recovery plan to the test, we created a dedicated infrastructure mirroring the production environment on a single cloud provider, within a single region.
In a carefully orchestrated exercise, we replicated the failure of that entire region. The selected scenario involved the simulated destruction of the entire SkyCore AODB cluster, emulating an event that renders the hardware infrastructure completely non-functional.
Steps to Resilience: A Behind-the-Scenes Look
The orchestration of chaos begins as the disaster recovery team springs into action. Our goal is to restore SkyCore AODB to full functionality in under 60 minutes. This exercise not only tests the technical skills of our disaster recovery team but also the efficiency and coordination of our response team and all the stakeholders.
Activation of the Disaster Recovery Plan: As soon as the simulated disaster begins, the automated monitoring and alerting system kicks in, generating numerous alerts that notify our on-duty engineer. The response team then activates the disaster recovery plan, initiating the restoration of SkyCore AODB on another cloud service provider.
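One way to picture the activation trigger is a monitor that escalates only after several consecutive failed health checks, so that a single transient blip does not start a full recovery. The sketch below is purely illustrative: the `AlertMonitor` class, the threshold of three, and the sample check results are assumptions, not a description of AirportLabs' actual monitoring setup.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class AlertMonitor:
    """Hypothetical monitor: activates the plan once after N consecutive failures."""

    failure_threshold: int = 3  # assumed value, not taken from the article
    consecutive_failures: int = 0
    plan_activated: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when the plan is first activated."""
        if healthy:
            # Any successful check resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.plan_activated:
            self.plan_activated = True
            return True
        return False


def first_activation(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check that activates the plan, or None if it never does."""
    monitor = AlertMonitor(failure_threshold=threshold)
    for index, healthy in enumerate(results):
        if monitor.record(healthy):
            return index
    return None
```

Requiring a streak of failures trades a slightly later activation for fewer false alarms; the right threshold depends on how often checks run.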
Communication and Coordination: Alongside the technical recovery effort, efficient communication is essential. The technical team uses internal communication channels to exchange information effectively and share timely progress updates. A disaster recovery commander plays a key role in keeping all stakeholders informed, from the product team and customer support team to other relevant parties, including customers. Maintaining transparency and providing swift updates are essential for preserving trust and minimising the impact on airport operations.
Technical Restoration: At AirportLabs, our solutions are implemented through entirely automated processes:
- Automated Infrastructure Provisioning: Utilising infrastructure as code, we can rapidly restore all infrastructure components within minutes, maintaining the exact configuration across various cloud service providers. This includes the deployment of security measures, networking configurations, and load balancing, all achieved through automated scripts.
- Automated Deployment: We employ CI/CD tools to efficiently deploy all application services and databases. This automated approach ensures a streamlined and error-free deployment process.
- Backup and Restore Mechanisms: Backup and restore mechanisms are in place to ensure the integrity of data and allow for the quick recovery of data in case of any unexpected events.
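The article does not say how backup integrity is verified. A common technique, shown here only as a minimal sketch, is to store a SHA-256 digest alongside each backup and compare it against the digest of the restored data. The function names are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest recorded when the backup is taken."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(restored: bytes, expected_digest: str) -> bool:
    """Check that restored data matches the digest stored with the backup."""
    return sha256_of(restored) == expected_digest
```

A mismatch means the restored copy differs from what was backed up, so the restore should be retried from another copy before the system is brought back online.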
Thanks to this automation, every team member can create an identical infrastructure from the ground up on any cloud service provider. Once the infrastructure is in place, we can deploy all of SkyCore AODB's services within minutes using CI/CD tools. The technical restoration requires nothing more than following a well-documented deployment guideline.
Ultimately, the most time-consuming step in this disaster recovery exercise is the restoration of data from offsite backups using automated scripts.
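To find out which step dominates the recovery time, the steps can be run in order with each one timed. The following is a sketch under stated assumptions: `run_runbook`, `slowest_step`, and the step names in the usage note are hypothetical, and the actual provisioning, deployment, and restore tooling is not described in this article.

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the action takes no arguments.
Step = Tuple[str, Callable[[], None]]


def run_runbook(
    steps: List[Step],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, float]:
    """Run the steps in order and return how many seconds each one took."""
    durations: Dict[str, float] = {}
    for name, action in steps:
        started = clock()
        action()  # an exception here stops the runbook, so later steps do not run
        durations[name] = clock() - started
    return durations


def slowest_step(durations: Dict[str, float]) -> str:
    """Return the name of the step that took the longest."""
    return max(durations, key=lambda name: durations[name])
```

For example, `run_runbook([("provision", provision), ("deploy", deploy), ("restore_data", restore_data)])` would return a per-step timing table, where the three callables are placeholders for whatever automation a team actually uses.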
Testing, Validation, and Post-Exercise Evaluation: After all infrastructure components and application services were deployed, the product and QA teams tested the application thoroughly to validate its main functionalities. The lessons learned, insights gained, and improvements identified all feed back into the continual refinement of our disaster recovery plan.
The user perspective: For users, whether at the airport or working remotely, nothing changed in how they used the system or in their procedures. We kept all users informed about the recovery progress, and no action was required on their part.
The true test of any disaster recovery plan lies in its practical application. AirportLabs does not rely on the theoretical strength of its disaster recovery plans alone; it proves their efficacy through realistic simulations. In this exercise, we restored the critical SkyCore AODB in just 45 minutes, including the full data recovery procedure.
We believe that the standard contractual requirement for disaster recovery within 24 hours is outdated: it is aligned neither with the contemporary operational needs of airports nor with the capabilities of systems built on modern architectures.
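The figures quoted in this article can be compared directly. The sketch below uses only numbers stated here (the 60-minute goal, the 45-minute result, and the 24-hour contractual window); the function and constant names are hypothetical.

```python
from datetime import timedelta

TARGET = timedelta(minutes=60)           # recovery goal for this exercise
MEASURED = timedelta(minutes=45)         # recovery time achieved in the exercise
CONTRACTUAL_WINDOW = timedelta(hours=24) # typical contractual requirement


def meets_target(elapsed: timedelta, target: timedelta) -> bool:
    """True when the recovery finished within the target."""
    return elapsed <= target


def margin(elapsed: timedelta, target: timedelta) -> timedelta:
    """Time to spare (negative if the target was missed)."""
    return target - elapsed


def window_ratio(window: timedelta, elapsed: timedelta) -> float:
    """How many times the measured recovery fits into a given window."""
    return window / elapsed
```

With these values, the exercise finished 15 minutes inside its own goal, and the 24-hour window is 32 times longer than the measured recovery.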
These simulations evaluate our capacity to respond and recover from unexpected disruptions. To be ready for such events, we regularly perform disaster recovery drills on all our systems. This practice allows us to fine-tune our disaster recovery plan and increase the resilience of AirportLabs' systems.