Resilience and Catastrophic Failure | Clarity Software Limited

A strange feature of catastrophic failure is how normal it can look immediately before it happens. The system is operating. The numbers are acceptable. The team has dealt with smaller incidents before. Nothing appears to demand urgent attention.

Everything is fine. Everything is fine. Everything is fine. Then the system crosses a boundary and the failure is no longer incremental. It is fast, visible, expensive, and difficult to contain.

The catastrophe is often not the first sign of trouble. It is the first sign that the hidden margin has been used up.

The Problem You Cannot See From Incidents Alone

If your only data points are fires, then a quiet period can look like safety. No fires this month. Fewer fires this year. The system appears to be improving.

But in a fire-prone landscape, the absence of flame is not the same thing as the absence of risk. Fuel load can grow slowly. Vegetation accumulates. Wet periods can create more growth, then dry periods can turn that growth into combustible material. The risk does not announce itself as a fire every day. It accumulates as latent capacity for a much larger event.

That is one way to understand the Los Angeles wildfires of January 2025. NOAA described the conditions as a combination of back-to-back wet winters that increased vegetation, a record-dry fall, and an extreme Santa Ana wind event. NASA also noted that warm, dry weather left vegetation primed to burn, after a period where herbaceous fuel loadings were above normal in California.

The important lesson for organisations is not that software systems are forests. It is that many systems accumulate risk during ordinary operation. When the only thing being measured is the visible incident, a long quiet period can hide the growth of conditions that make the next incident much worse.

The Shape of Catastrophic Failure

Catastrophic failure often has a recognisable rhythm:

Everything is fine: the system meets its visible targets and prior incidents seem contained.

Everything is fine: workarounds, complexity, and deferred maintenance slowly become normal.

Everything is fine: the system still works, but its margins are thinner than people realise.

Large failure: a trigger arrives and the accumulated brittleness becomes visible all at once.

That last step gets most of the attention because it is dramatic. But the actual work of resilience sits earlier in the chain. It asks what is getting harder, where people are compensating, which controls are assumed to work, and what evidence would show that the system is becoming brittle before it breaks.

Knight Capital: A Software Failure With No Slow Burn On The Clock

Knight Capital is a useful example because the visible failure was almost unbelievably fast. On August 1, 2012, a deployment problem in Knight's trading systems generated erroneous orders in NYSE-listed securities. In 2013 the SEC charged Knight Capital with violating the Market Access Rule and described how technology controls failed to prevent the incident. The company lost roughly $440 million in less than an hour.

On the day, the disaster looked like a sudden software glitch. But sudden visible failure does not mean sudden systemic failure. The conditions that mattered included release management, dormant code, inconsistent deployment across servers, alert handling, and controls that did not stop the system before the damage was done.
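One of those conditions, inconsistent deployment across servers, is the kind of assumption that can be checked mechanically rather than trusted. The sketch below is illustrative only and is not a reconstruction of Knight's systems: the hostnames, version strings, and the function names find_stale_hosts and release_is_consistent are all hypothetical, and it assumes your deployment tooling can report which version each host is actually running.

```python
def find_stale_hosts(deployed_versions, expected_version):
    """Return {host: version} for every host not running expected_version."""
    return {
        host: version
        for host, version in deployed_versions.items()
        if version != expected_version
    }


def release_is_consistent(deployed_versions, expected_version):
    """True only if the fleet is non-empty and every host matches."""
    return bool(deployed_versions) and not find_stale_hosts(
        deployed_versions, expected_version
    )


# Hypothetical inventory: hostnames and version strings are made up.
fleet = {
    "trade-01": "2.4.0",
    "trade-02": "2.4.0",
    "trade-03": "2.3.1",  # missed by the rollout
    "trade-04": "2.4.0",
}

stale = find_stale_hosts(fleet, "2.4.0")
if stale:
    print("Rollout incomplete, do not enable new features:", stale)
```

The point of a check like this is less the code than where it sits: it turns "the release went out" from an assumption into something that is proven before the new behaviour is switched on.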

Again, the pattern matters. Everything is fine, because the system normally trades. Everything is fine, because prior releases did not produce catastrophe. Everything is fine, because the hidden assumptions have not all been tested at the same time. Then a new release meets an old fragility, and the organisation discovers the real state of its resilience at market speed.

Resilience Engineering Starts Before The Incident

There is an entire field concerned with this problem: resilience engineering. Researchers and practitioners such as Dr. David Woods have spent years studying how complex systems succeed and fail under pressure. One of the core ideas is that resilience is not just recovery after failure. It is the ability to adapt, stretch, detect brittleness, and keep functioning as conditions change.

In operational systems, this means looking beyond uptime percentages and incident counts. Those are useful, but they are incomplete. A system can have good uptime while becoming harder to operate. It can have fewer incidents while engineers are absorbing more coordination cost. It can pass disaster recovery tests that do not resemble the actual stress it will face.
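A small sketch can make the gap between "looks fine" and "is getting thinner" concrete. It assumes you already record some measure of spare capacity at regular intervals (the weekly headroom percentages below are invented) and fits a straight line to estimate how long until that margin reaches a floor. The function name weeks_until_exhausted and the linear-trend assumption are illustrative choices, not a recommended forecasting method.

```python
def weeks_until_exhausted(headroom, floor=0.0):
    """Estimate weeks from the latest sample until headroom reaches floor.

    headroom: evenly spaced measurements, oldest first.
    Returns None if there are too few samples or the trend is not declining.
    """
    n = len(headroom)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(headroom) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, headroom))
    slope = sxy / sxx
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (floor - intercept) / slope  # sample index where line hits floor
    return max(0.0, crossing - (n - 1))


# Invented data: every value is comfortably "green", yet the trend is down.
weekly_headroom_pct = [40, 38, 36, 34, 32]
print(weeks_until_exhausted(weekly_headroom_pct))
```

Every reading in that series would pass a simple threshold alert. Only the trend shows that the margin is being consumed.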

Resilience work asks different questions:

Where are teams relying on heroic manual intervention?
Which alerts are ignored because they fire too often or arrive without context?
Which recovery procedures have only been tested in clean, artificial conditions?
Where has architectural complexity outgrown the team's mental model?
Which weak signals appear before incidents, but are not yet being collected or reviewed?

These questions are uncomfortable in the right way. They move attention away from whether the system looked fine yesterday and toward whether it has the capacity to handle tomorrow's surprise.
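Some of these questions can be partly answered from data you may already have. As a minimal sketch of the alert question, the code below assumes an export of alert events as (alert name, whether anyone acted on it) pairs and flags alerts that fire often but are almost never acted on. The event format, the alert names, the thresholds, and the function name find_noisy_alerts are all hypothetical.

```python
from collections import defaultdict


def find_noisy_alerts(events, min_fires=20, max_action_rate=0.1):
    """Return {alert_name: (fire_count, action_rate)} for likely-ignored alerts.

    events: iterable of (alert_name, was_actioned) pairs.
    An alert is flagged when it fired at least min_fires times and the share
    of firings that led to action is at or below max_action_rate.
    """
    fires = defaultdict(int)
    actioned = defaultdict(int)
    for name, was_actioned in events:
        fires[name] += 1
        if was_actioned:
            actioned[name] += 1

    noisy = {}
    for name, count in fires.items():
        rate = actioned[name] / count
        if count >= min_fires and rate <= max_action_rate:
            noisy[name] = (count, rate)
    return noisy


# Invented sample: one chatty alert, one rare alert that is always handled.
events = (
    [("disk-latency-warn", False)] * 30
    + [("disk-latency-warn", True)]
    + [("db-failover", True)] * 2
)
print(find_noisy_alerts(events))
```

A list like this does not say which alerts to delete. It says where to start a conversation about why an alert exists and what the person receiving it is expected to do.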

Resilience Assessment

Clarity's Resilience Assessment examines disaster recovery, monitoring, alert quality, operational documentation, and weak signal detection. The goal is to find the hidden fuel load in your systems before a large failure exposes it for you.

What To Look For Before The Fire

Weak signals rarely arrive labelled as weak signals. They look like small frustrations: a runbook that is slightly wrong, an alert that only one person understands, a recovery drill that depends on someone remembering an undocumented step, a dashboard that measures the easy thing instead of the dangerous thing.

In healthy systems, those signals have somewhere to go. They are noticed, discussed, linked to design decisions, and fed back into operations. In brittle systems, they are normalised. People learn to work around them until the workaround itself becomes part of the system.

The discipline is to treat resilience as an active practice, not a trait the organisation either has or lacks. Test recovery. Review near misses. Watch for operational saturation. Keep documentation alive. Ask what is accumulating while the visible metrics look calm.

Catastrophic failures are frightening because they seem to appear from nowhere. Resilience engineering teaches a more useful lesson: many of them come from somewhere. They come from slowly increasing fuel load, shrinking margins, missed weak signals, and controls that were assumed rather than proven.

The work is to see the system before the system is on fire.
