Large-scale ops

Where bureaucracy meets big data, and nothing moves quickly—except the bills.

What can possibly go wrong?

The bureaucratic black hole

A simple model update requires 17 approvals, a risk assessment, and a sacrificial offering to the compliance team.

  • The competition launches a better product while we’re still in “governance review”.

  • Innovation slows to a crawl, as the safest option is to “just keep the old model running”.

  • A government report praises our “robust oversight” while ignoring our plummeting market share.

The legacy system labyrinth

The new ML system must integrate with a 20-year-old monolith that runs on COBOL and hope.

  • Glacial performance, as every prediction requires a pilgrimage through ancient APIs.

  • Engineers start having recurring nightmares about undocumented edge cases.

  • A museum curator calls to ask if we’ve considered donating our codebase as a “historical artifact”.

The domino disaster

A minor change in one system triggers a cascade of failures across the entire org.

  • A catastrophic outage that trends on Twitter before the team even notices.

  • The post-mortem document becomes a novella.

  • A regulator uses our incident as justification for sweeping new laws (which will, of course, make everything worse).
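The textbook defence against that kind of cascade is to fail fast at each boundary instead of letting timeouts pile up. Purely as illustration (nothing in this scenario prescribes it), here is a minimal circuit-breaker sketch; the thresholds and the `scoring_client` it wraps are made up:

```python
import time


class CircuitBreaker:
    """Stop hammering a dependency once it keeps failing, so one sick
    service does not drag down everything that calls it."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures   # consecutive failures before tripping
        self.reset_after = reset_after     # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        # While the circuit is open, fail fast rather than queue up timeouts.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency presumed down")
            self.opened_at = None          # cool-off elapsed: allow one trial call

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        else:
            self.failures = 0              # dependency looks healthy again
            return result


# Hypothetical usage: wrap the call into a downstream scoring service.
# breaker = CircuitBreaker(max_failures=3, reset_after=10.0)
# score = breaker.call(scoring_client.predict, features)
```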

The cost spiral

The team spends £100k/month on cloud resources, half of which are running idle “just in case”.

  • Sudden price hikes for customers as the company tries to recoup its losses.

  • A grim company-wide email about “cost optimisation initiatives” (read: layoffs).

  • A tech blogger coins the term “ML-induced bankruptcy”.
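Finding the idle half of that bill usually starts with asking the cloud provider which machines have done nothing lately. A minimal sketch, assuming AWS, boto3, and CloudWatch CPU metrics; the threshold, lookback window, and the idea that CPU is a fair proxy for "idle" are all illustrative choices, not anything this scenario prescribes:

```python
# Flag running EC2 instances whose average CPU over the last week is
# below a threshold. Assumes boto3 credentials are already configured.
from datetime import datetime, timedelta, timezone

import boto3

CPU_THRESHOLD = 5.0          # percent; anything below this looks idle
LOOKBACK = timedelta(days=7)

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - LOOKBACK,
            EndTime=now,
            Period=3600,                 # hourly datapoints
            Statistics=["Average"],
        )["Datapoints"]
        if not datapoints:
            continue
        avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
        if avg_cpu < CPU_THRESHOLD:
            print(f"{instance_id}: avg CPU {avg_cpu:.1f}% over 7 days -- idle?")
```

Whether anyone is allowed to switch those instances off is, of course, a matter for the 17-approval process above.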

Hallmarks of large-scale ops

  • The team has more architects than actual builders—every whiteboard is a maze of boxes and arrows.

  • “Multi-cloud” is both a strategy and a cry for help.

  • Meetings about meetings are a legitimate part of the workflow.

  • The phrase “we’re still evaluating the vendor options” has been uttered for 18 months straight.

