The 2023 Optus Outage: A Wake-Up Call
Incident Analysis
Nov 08, 2023 · 18 min read

Shubham Singla

The 2023 Optus outage was not just a service disruption; it was a systemic shock that exposed the fragility of Australia's digital backbone. This comprehensive retrospective analyzes the 14-hour blackout that left 10 million people disconnected, dissecting the BGP routing error, the crisis communications failure, and the long-term regulatory fallout.


Executive Summary

On November 8, 2023, Optus, Australia's second-largest telecommunications provider, suffered a catastrophic network failure. For over 12 hours, mobile and broadband services were unavailable. The outage affected:

  • 10.2 million consumer and business customers.
  • 400,000 landline and broadband services.
  • Critical infrastructure including Melbourne's train network (Metro Trains) which relies on Optus for signaling communication.
  • Emergency services contactability for thousands of users.
  • Major banks and payment terminals, halting commerce for small businesses nationwide.

The root cause was identified as a routing information update from an international peering partner (Singtel), which propagated changes via the Border Gateway Protocol (BGP). The volume of changes exceeded a preset safety threshold on Optus's Cisco routers, causing them to self-isolate to protect the core network.

The Golden Update: Anatomy of a BGP Meltdown

To understand the failure, one must understand BGP. The Border Gateway Protocol is the postal service of the internet. It does not move data; it tells data where to go. It works by exchanging "routes" or "prefixes" between autonomous systems (AS).
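To make the analogy concrete, here is a minimal Python sketch (purely illustrative, not anything Optus runs) of the end product of those exchanges: a table that maps advertised prefixes to a next hop and is consulted by longest-prefix match.

# Sketch: a routing table built from BGP-style advertisements (illustrative only;
# real BGP speakers also track AS paths, communities, and best-path selection).
import ipaddress

class RoutingTable:
    def __init__(self):
        self.routes = {}  # prefix -> next hop learned from a peer

    def advertise(self, prefix: str, next_hop: str) -> None:
        """Install a route learned from a peer's advertisement."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, destination: str):
        """Longest-prefix match: the most specific covering route wins."""
        addr = ipaddress.ip_address(destination)
        matches = [p for p in self.routes if addr in p]
        return self.routes[max(matches, key=lambda p: p.prefixlen)] if matches else None

table = RoutingTable()
table.advertise("203.0.113.0/24", "peer-A")  # specific route
table.advertise("203.0.0.0/16", "peer-B")    # covering aggregate
print(table.lookup("203.0.113.7"))           # -> peer-A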

At 4:05 AM AEDT, a routine software upgrade occurred at a Singtel internet exchange in North America. This exchange is a peering point from which Optus receives global routing data. During this upgrade, the peering router began advertising new routes to Optus.

The 900,000 Prefix Limit

Optus's core routers were configured with a safety mechanism known as a "Max-Prefix Limit." This is a security feature designed to prevent memory exhaustion or "route leaks" (where traffic is accidentally routed through a network that cannot handle it).

The limit on the Optus routers was set to accept approximately 900,000 route prefixes. However, the update from Singtel contained routing information that exceeded this limit. It wasn't malicious traffic; it was valid routing data, but there was too much of it.

# Hypothetical Router Log Output
04:05:22.112 %BGP-3-MAXPFXEXCEED: No. of prefix received from 203.0.113.1 (Singtel-Peering) reaches 900001, max 900000
04:05:22.114 %BGP-5-ADJCHANGE: neighbor 203.0.113.1 Down BGP Notification sent
04:05:22.115 %BGP-3-NOTIFICATION: sent to neighbor 203.0.113.1 6/1 (cease: maximum number of prefixes reached) 0 bytes
04:05:25.000 %OSPF-5-ADJCHG: Process 1, Nbr 192.168.1.5 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: Interface down or detached

When the limit was hit, the routers did exactly what they were programmed to do: they severed the connection to the peer. However, because this happened across multiple peering points simultaneously, and the routers attempted to re-establish and re-converge, the internal routing table became unstable. The core routers, overwhelmed by the recalculations, effectively disconnected themselves from the rest of the network to prevent hardware damage.
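At its heart, the behaviour described above is a counting rule. Below is a minimal Python sketch of that rule, assuming a hard teardown on breach with no warning-only mode or automatic restart timer (Optus's actual configuration is not public).

# Sketch: max-prefix enforcement on a BGP session (assumed behaviour, not Optus config).
MAX_PREFIXES = 900_000

class PrefixLimitExceeded(Exception):
    pass

class BgpSession:
    def __init__(self, peer: str, max_prefixes: int = MAX_PREFIXES):
        self.peer = peer
        self.max_prefixes = max_prefixes
        self.received = 0
        self.established = True

    def receive_update(self, new_prefixes: int) -> None:
        """Accept an UPDATE carrying some number of new prefixes."""
        self.received += new_prefixes
        if self.received > self.max_prefixes:
            # Equivalent to sending a Cease notification and dropping the peer.
            self.established = False
            raise PrefixLimitExceeded(
                f"{self.peer}: {self.received:,} prefixes exceeds {self.max_prefixes:,}")

session = BgpSession("203.0.113.1")
session.receive_update(899_990)   # under the limit: session stays up
try:
    session.receive_update(20)    # pushes the count over 900,000 -> teardown
except PrefixLimitExceeded as err:
    print("Session down:", err)

The rule counts; it does not judge whether the routes are valid, which is exactly why legitimate routing data tripped it.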


Timeline of Events

04:05 AM - The Trigger

Singtel exchange update propagates excessive BGP routes. Optus core routers begin to shed peers.

04:45 AM - The Cascade

Mobile towers across the East Coast begin to lose backhaul connectivity. Users wake up to "SOS Only" on their iPhones.

06:00 AM - Critical Mass

Melbourne Metro Trains reports a communications failure. The incident is declared a Severity 1 Critical Incident within the Optus NOC (Network Operations Center).

09:00 AM - The Blind Spot

Optus engineers struggle to access remote management interfaces because the management network itself rides on the failed infrastructure (In-Band Management failure).

12:30 PM - The Identification

Engineers, working with Singtel and Cisco TAC (Technical Assistance Center), identify the BGP prefix limit as the root cause. A plan is formed to manually reboot and reconfigure routers with a higher limit.

05:30 PM - Restoration

Services begin to progressively come back online as rolling restarts are completed.

Technical Deep Dive: Anatomy of an Attack

Although the Optus outage ultimately proved to be a configuration failure rather than an attack, a cyberattack was widely suspected in the early hours, so understanding the mechanics of a genuine attack remains crucial for engineers. Most advanced threats follow the Cyber Kill Chain model (a sketch mapping each phase to an example control follows the list):

RECONNAISSANCE: The attacker gathers information on the target. This can be passive (OSINT) or active (port scanning).

WEAPONIZATION: Creating a deliverable payload (e.g., a malicious PDF or Office macro).

DELIVERY: Transmitting the weapon to the target (e.g., via Phishing or USB).

EXPLOITATION: Triggering the payload to exploit a vulnerability (e.g., CVE-2023-xyz).

INSTALLATION: Establishing a backdoor or persistence mechanism (e.g., a scheduled task or registry key).

COMMAND & CONTROL (C2): The compromised system calls home to the attacker server for instructions.

ACTIONS ON OBJECTIVES: The attacker achieves their goal (encryption, extensive data exfiltration, destruction).
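As flagged above, the following small Python sketch pairs each kill-chain phase with an example defensive control. The pairings are generic illustrations, not a prescribed toolset.

# Sketch: Cyber Kill Chain phases mapped to example controls (illustrative only).
from enum import Enum

class KillChainPhase(Enum):
    RECONNAISSANCE = "Reconnaissance"
    WEAPONIZATION = "Weaponization"
    DELIVERY = "Delivery"
    EXPLOITATION = "Exploitation"
    INSTALLATION = "Installation"
    COMMAND_AND_CONTROL = "Command & Control (C2)"
    ACTIONS_ON_OBJECTIVES = "Actions on Objectives"

EXAMPLE_CONTROLS = {
    KillChainPhase.RECONNAISSANCE: "External attack-surface monitoring",
    KillChainPhase.WEAPONIZATION: "Threat intelligence on current payloads",
    KillChainPhase.DELIVERY: "Email filtering and USB device control",
    KillChainPhase.EXPLOITATION: "Patching and exploit mitigation",
    KillChainPhase.INSTALLATION: "EDR alerts on new persistence mechanisms",
    KillChainPhase.COMMAND_AND_CONTROL: "Egress filtering and DNS monitoring",
    KillChainPhase.ACTIONS_ON_OBJECTIVES: "Data-loss prevention and tested backups",
}

for phase in KillChainPhase:
    print(f"{phase.value:26} -> {EXAMPLE_CONTROLS[phase]}")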

The Failsafe Paradox

The irony of the Optus outage is that it was caused by a safety feature. The "Max-Prefix" limit is standard industry practice. It exists to stop a misconfigured peer from flooding the router's memory (RAM) with millions of junk routes, which would crash the router entirely.

However, in modern networks, the global routing table grows relentlessly. The limit set by Optus (likely years prior) did not account for the sudden spike in valid routes during a major peering topology change. This highlights the concept of "configuration drift": settings that were correct three years ago become liabilities today if they are not reviewed.
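One practical counter to this kind of drift is a periodic headroom audit: compare every configured limit against the size of the table it is meant to bound. The Python sketch below is hypothetical; the router names, limits, and table-size estimate are made up for illustration.

# Sketch: hypothetical headroom audit for BGP max-prefix limits.
GLOBAL_TABLE_ESTIMATE = 950_000   # assumed current full-table size, illustrative
HEADROOM_FACTOR = 1.2             # require limits to sit 20% above the estimate

configured_limits = {             # hypothetical inventory of routers and limits
    "core-router-1": 900_000,
    "core-router-2": 900_000,
    "edge-router-7": 1_200_000,
}

def audit_limits(limits: dict, table_size: int, factor: float) -> list:
    """Return routers whose max-prefix limit no longer has enough headroom."""
    required = int(table_size * factor)
    return [name for name, limit in limits.items() if limit < required]

for router in audit_limits(configured_limits, GLOBAL_TABLE_ESTIMATE, HEADROOM_FACTOR):
    print(f"{router}: limit {configured_limits[router]:,} is below the "
          f"required {int(GLOBAL_TABLE_ESTIMATE * HEADROOM_FACTOR):,}")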

The Communications Failure

While the technical failure was severe, the reputational damage was compounded by silence. For nearly seven hours, Optus provided no meaningful update to the public. The CEO, Kelly Bayer Rosmarin, eventually conducted a radio interview, but the lack of official SMS, email, or clear social media updates left customers in the dark.

In the absence of information, misinformation filled the void. Rumors of a massive cyberattack spread on Twitter (X). This demonstrates that Crisis Communication is a security function. If you cannot communicate, you cannot control the narrative, and panic ensues.

Regulatory and Compliance Context

In the aftermath of such incidents, organizations must navigate a complex web of regulatory obligations. Failure to comply can result in severe fines and reputational damage.

GDPR (General Data Protection Regulation)

For organizations operating in or serving citizens of the EU, GDPR mandates strict breach notification timelines (usually within 72 hours). Article 32 requires the implementation of appropriate technical and organizational measures to ensure a level of security appropriate to the risk.

NIST Cybersecurity Framework

The NIST framework provides a standard for critical infrastructure. It is organized around five core functions: Identify, Protect, Detect, Respond, and Recover. This incident highlights failures primarily in the 'Protect' and 'Detect' functions.

Local Legislation (Privacy Act 1988 - Australia)

Under the Notifiable Data Breaches (NDB) scheme, organizations must notify the OAIC and affected individuals if a data breach is likely to result in serious harm. This includes unauthorized access to personal information.

Financial and Economic Impact

The economic cost of the outage was staggering. Small businesses, cafes using Square readers, and tradies relying on mobile bookings lost a full day of trade. Optus later offered 200GB of free data as compensation—a move widely criticized as tone-deaf for business customers who lost thousands of dollars.

The share price of Singtel (Optus's parent company) dropped nearly 5% in the following days. It is estimated that the outage cost the Australian economy over $400 million AUD in lost productivity.

Standard Incident Response Procedures

A robust Incident Response Plan (IRP) is the best defense against chaos. The SANS Institute outlines a six-step process (a minimal tracking sketch follows the list):

  1. Preparation: Training, tooling, and dry runs (tabletop exercises).
  2. Identification: Detecting the deviation from normal behavior and determining the scope.
  3. Containment: Short-term mitigation (isolating the system) and long-term containment (patching).
  4. Eradication: Removing the root cause (malware, compromised accounts).
  5. Recovery: Restoring systems to normal operation and monitoring for recurrence.
  6. Lessons Learned: Post-incident analysis to improve future response.
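As referenced above the list, here is a minimal Python sketch of how a team might log an incident's progress through these phases; the field names and example values are hypothetical.

# Sketch: hypothetical tracker for the SANS six-step process.
from dataclasses import dataclass, field
from datetime import datetime

PHASES = ["Preparation", "Identification", "Containment",
          "Eradication", "Recovery", "Lessons Learned"]

@dataclass
class Incident:
    title: str
    severity: str
    phase_log: dict = field(default_factory=dict)  # phase -> time it was entered

    def enter_phase(self, phase: str) -> None:
        if phase not in PHASES:
            raise ValueError(f"Unknown phase: {phase}")
        self.phase_log[phase] = datetime.now()

    def current_phase(self) -> str:
        # dicts preserve insertion order, so the last phase entered is current
        return next(reversed(self.phase_log), "None")

incident = Incident(title="Nationwide network outage", severity="SEV-1")
incident.enter_phase("Identification")
incident.enter_phase("Containment")
print(incident.current_phase())   # -> Containment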

The Future: Domestic Roaming?

One of the most significant outcomes of the outage was the federal government's push for mandated Domestic Roaming during emergencies. This would mean that if the Optus network fails, an Optus phone would automatically switch to the Telstra or Vodafone network to make calls (similar to how 000 calls work now).

While technically feasible, it requires massive cooperation between competitors and significant upgrades to core network capacity to handle the sudden influx of millions of "guest" users.

Comprehensive Mitigation Strategies

To prevent recurrence, a defense-in-depth approach is required. This involves layering security controls so that if one fails, another catches the threat (a toy sketch follows the list below).

  • Network Segmentation: Isolate critical assets in separate VLANs with strict firewall rules (East-West traffic inspection).
  • Endpoint Detection and Response (EDR): Deploy agents that can detect behavioral anomalies, not just file signatures.
  • Identity and Access Management (IAM): Enforce Least Privilege and MFA everywhere. Review access logs regularly.
  • Regular Audits: Conduct penetration testing and vulnerability scanning (using tools like Nessus or Burp Suite) at least quarterly.
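As noted above, the value of layering is that a miss at one control is caught at the next. The toy Python sketch below illustrates that idea; the layer names and checks are hypothetical, not a recommended product stack.

# Sketch: toy defense-in-depth evaluation; every layer must clear the request.
def network_segmentation(req: dict) -> bool:
    return req.get("source_vlan") in {"corp", "ops"}

def edr_behaviour_check(req: dict) -> bool:
    return not req.get("anomalous_process", False)

def iam_check(req: dict) -> bool:
    return req.get("mfa_passed", False)

LAYERS = [("Network segmentation", network_segmentation),
          ("EDR behavioural check", edr_behaviour_check),
          ("IAM / least privilege", iam_check)]

def evaluate(req: dict) -> bool:
    for name, check in LAYERS:
        if not check(req):
            print(f"Blocked at layer: {name}")
            return False
    return True

request = {"source_vlan": "corp", "anomalous_process": False, "mfa_passed": False}
print(evaluate(request))   # blocked at the IAM layer despite passing the earlier ones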

Final Thoughts

The Optus outage was a "Black Swan" event—rare, unpredictable, but high impact. It reminded us that the cloud is just someone else's computer, and the internet is held together by BGP, a protocol from the 1980s that relies heavily on trust. As we build more interconnected systems, the complexity increases, and so does the risk of cascading failure. The lesson for every engineer is simple: Review your limits. Test your failsafes. And always, always have an Out-of-Band management plan.