Flight Troubles & Business Continuity Planning

January 11, 2023 at 4:15 PM

This morning, the Federal Aviation Administration’s (FAA) system for alerting pilots and airports to real-time hazards, called NOTAM (Notice to Air Missions), went offline around 3:28 AM EST. While flights have slowly returned to normal, this event is a further example of a breakdown in incident response, business continuity, and disaster recovery planning in the airline industry. As of now, Transportation Secretary Pete Buttigieg has not ruled out a cyberattack as the cause of the massive system outage.

This outage occurred just sixteen days after the Southwest Airlines service breakdown. On December 26, 2022, thousands of passengers around the United States learned that Southwest Airlines had canceled almost all flights scheduled for that day. Travelers were stranded far from home, and some were even told by police that they were trespassing while they tried to arrange alternate travel. More than two days later, Southwest CEO Bob Jordan apologized on the company's Twitter account. But what caused Southwest Airlines to cancel 2,293 flights in a few short days while the top nineteen competing airlines canceled a combined total of only 859? The simple answer is that software from the 1990s, combined with inadequate business continuity plans (BCP) and disaster recovery plans (DRP), led to a significant disruption for Southwest Airlines.

Responsibility was originally attributed to poor weather conditions and bitter winter cold, but later reports revealed that the major cause of the cancellations was Southwest Airlines' reliance on outdated scheduling software called SkySolver, which was developed the same year Facebook was launched and is quickly nearing its end of life. Because of the winter storm and the tremendous number of scheduling changes required for passengers and crews, SkySolver could not match crew members with flights.

According to a spokesman for GE Aerospace, the company behind SkySolver, multiple systems are involved in crew scheduling. The company stated that its software is not an end-to-end solution. Rather, it is a so-called backend algorithm that airlines can supplement with other software, and it gathers input from other systems to recommend ways to resolve crew-related disruptions. Improving these systems is essential not only for crew management but also for enhancing overall airport services and ensuring a smoother travel experience for passengers.

Since 2015, Southwest Airlines pilots and crew have been begging leadership for internal technology updates. However, only minor updates to SkySolver have been implemented since its inception. The result is what is referred to as "technical debt": the gap between existing software and the updates required to maintain baseline operations. An IT consultant said the union had been urging the airline for years to update "I.T. and infrastructure from the 1990s".

Modern companies have been moving to cloud-hosted solutions for several years now. Service models such as software-as-a-service (SaaS), platform-as-a-service (PaaS), and infrastructure-as-a-service (IaaS) are offered by many cloud service providers, such as AWS. Depending on the service model implemented, organizations that consume cloud services can gain benefits such as scalability and flexibility, potential cost savings (e.g., paying only for the resources actually used), advanced security, and data loss prevention.

The idea behind cloud services is a shared responsibility model whereby security and compliance are shared between the cloud service provider and the customer. Here is a simplified example of the shared responsibility model:

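[Figure: simplified shared responsibility model; only the "Data" label of the original graphic is recoverable]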
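One way to picture the shared responsibility model is as a stack of layers with a dividing line that moves depending on the service model. The sketch below is illustrative only: the layer names and the exact split for each model are assumptions based on how the model is commonly presented, and the real boundaries vary by provider and by service.

```python
# Illustrative layers, ordered from what the customer touches down to the physical site.
LAYERS = [
    "data",
    "access management",
    "application",
    "runtime",
    "operating system",
    "virtualization",
    "hardware",
    "facilities",
]

# How many layers, counted from the top, stay with the customer (assumed split).
CUSTOMER_LAYERS = {"on-premises": 8, "IaaS": 5, "PaaS": 3, "SaaS": 2}


def split_responsibility(model):
    """Return the layers owned by the customer and by the provider for one model."""
    count = CUSTOMER_LAYERS[model]
    return {"customer": LAYERS[:count], "provider": LAYERS[count:]}


if __name__ == "__main__":
    for model in CUSTOMER_LAYERS:
        owners = split_responsibility(model)
        print(f"{model}: customer handles {', '.join(owners['customer'])}")
```

In this sketch, data and access management sit above the dividing line in every model, so the customer never hands them off.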

Regardless of the service model, access management, monitoring, log analysis, and configuration control remain customer responsibilities. An organization needs to define and implement strong password policies, consider multifactor authentication, and verify that secure configurations are in place.
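Verifying secure configurations can be as simple as comparing the settings actually in force against a written baseline. The following is a minimal sketch of that idea. The setting names and thresholds (for example, a 12-character minimum password length) are hypothetical placeholders, not values taken from any particular provider or standard.

```python
# Hypothetical baseline: setting name -> check the configured value must pass.
BASELINE = {
    "min_password_length": lambda value: isinstance(value, int) and value >= 12,
    "mfa_enforced": lambda value: value is True,
    "audit_logging_enabled": lambda value: value is True,
    "public_storage_access": lambda value: value is False,
}


def audit_configuration(settings):
    """Return the sorted names of settings that are missing or fail the baseline."""
    return sorted(
        name
        for name, passes in BASELINE.items()
        if name not in settings or not passes(settings[name])
    )


if __name__ == "__main__":
    # Example tenant settings (made up for illustration).
    current = {
        "min_password_length": 8,
        "mfa_enforced": True,
        "public_storage_access": False,
    }
    for finding in audit_configuration(current):
        print(f"Needs attention: {finding}")
```

A missing setting is treated as a failure here, on the assumption that an unverified control should not be counted as compliant.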

Whether organizations implement a traditional IT, hybrid, or cloud-based model for managing systems and applications, security and compliance must be considered throughout all system and application lifecycles, and sound security practices must be implemented. At a minimum, organizations need to ensure the following security practices are implemented:

Asset identification – Understand and document the critical assets of the organization. This includes people, processes, and technologies.
Patching – Keep systems and applications up to date. Outdated software increases the chance that threat actors will exploit known vulnerabilities.
Backups – Backups protect against human errors, hardware failure, virus attacks, power failure, and natural disasters. Backups can help save time and money if these failures occur.
Business Continuity, Incident Response, and Disaster Recovery Policies and Procedures – Business continuity keeps your organization running during the lifecycle of an incident. Incident response allows your organization to handle an incident from the start. Disaster recovery supports the recovery process back to normalcy.
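The first three practices above lend themselves to a simple, automatable check: keep an asset register and flag anything that is past vendor support or lacks a recent backup. The sketch below assumes a hypothetical `Asset` record and a one-day backup window. Both are illustrative choices, and a real register would track far more detail.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Asset:
    name: str
    end_of_support: date  # date the vendor stops shipping patches
    last_backup: date     # date of the most recent successful backup


def review_asset(asset, today, max_backup_age=timedelta(days=1)):
    """Return findings for one asset; the one-day backup window is an assumed default."""
    findings = []
    if asset.end_of_support <= today:
        findings.append("past end of support")
    if today - asset.last_backup > max_backup_age:
        findings.append("backup older than allowed")
    return findings


if __name__ == "__main__":
    # Made-up register entry for illustration.
    register = [Asset("example-app", date(2022, 6, 30), date(2022, 12, 20))]
    for asset in register:
        for finding in review_asset(asset, today=date(2022, 12, 26)):
            print(f"{asset.name}: {finding}")
```

Running a review like this on a schedule gives business continuity and disaster recovery planning a current list of the systems most likely to fail or to be hard to restore.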

While the details of today’s event are still emerging, both incidents highlight the catastrophic effects that can follow when incident response, disaster recovery, and business continuity plans are insufficient to keep an organization operating at full or near-full capacity during an incident that halts operations. Testing and updating plans regularly is a crucial step. Leveraging a third-party consultant to assist in evaluating these plans is highly recommended (and at times required by regulators). Compass IT Compliance has worked to assist private and public sector entities in strengthening these plans since 2010. Contact us today to learn more and to discuss your unique challenges.