Large-scale ops
Where bureaucracy meets big data, and nothing moves quickly—except the bills.
What can possibly go wrong?
The bureaucratic black hole
A simple model update requires 17 approvals, a risk assessment, and a sacrificial offering to the compliance team.
The competition launches a better product while we’re still in “governance review”.
Innovation slows to a crawl, as the safest option is to “just keep the old model running”.
A government report praises our “robust oversight” while ignoring our plummeting market share.
The legacy system labyrinth
The new ML system must integrate with a 20-year-old monolith that runs on COBOL and hope.
Performance turns glacial, as every prediction requires a pilgrimage through ancient APIs.
Engineers start having recurring nightmares about undocumented edge cases.
A museum curator calls to ask if we’ve considered donating our codebase as a “historical artifact”.
The domino disaster
A minor change in one system triggers a cascade of failures across the entire org.
A catastrophic outage that trends on Twitter before the team even notices.
The post-mortem document becomes a novella.
A regulator uses our incident as justification for sweeping new laws (which will, of course, make everything worse).
The cost spiral
The team spends £100k/month on cloud resources, half of which are running idle “just in case”.
Sudden price hikes follow as the company tries to recoup its losses.
A grim company-wide email about “cost optimisation initiatives” (read: layoffs).
A tech blogger coins the term “ML-induced bankruptcy”.
Hallmarks of large-scale ops
The team has more architects than actual builders—every whiteboard is a maze of boxes and arrows.
“Multi-cloud” is both a strategy and a cry for help.
Meetings about meetings are a legitimate part of the workflow.
The phrase “we’re still evaluating the vendor options” has been uttered for 18 months straight.