The infrastructure nobody’s building

The Patrician has observed that Ankh-Morpork excels at building infrastructure that generates immediate returns or that addresses current crises, while consistently failing to build infrastructure that will be desperately needed in ten years but that provides no obvious benefit today. The city’s investment in decorative fountains vastly exceeds its investment in sewage treatment capacity, which is perfectly rational until the sewage situation becomes a crisis, at which point everyone expresses surprise that insufficient preparation was made despite the entirely predictable growth in population and corresponding sewage production.

Technology infrastructure follows similar patterns. Enormous investment flows into data centres, AI accelerators, and networking equipment that serve current demand, while investment in unglamorous infrastructure that will be critical in coming years remains inadequate. This is not because anyone disagrees about what will be needed, but because the incentives favour solving today’s problems rather than preventing tomorrow’s crises, and because infrastructure that prevents disasters is invisible when successful and an obvious target for budget cuts when financial pressure increases.

The gaps are not secrets. Industry analyses, technical committees, and concerned engineers regularly identify what infrastructure will be needed and what’s not being built. The reports gather dust while investment continues flowing toward whatever generates immediate returns or captures current enthusiasm. Eventually the missing infrastructure becomes urgently necessary, at which point it’s built frantically at enormous expense with everyone wondering why adequate preparation wasn’t made when there was time to build properly.

The Patrician notes that this pattern has repeated throughout history with tedious consistency, that predicting which infrastructure gaps will become crises requires no special insight beyond paying attention to obvious trends and elementary arithmetic, and that the fact that we know what’s coming makes our collective failure to prepare either more forgivable because everyone is doing it or less forgivable because we have no excuse for being surprised.

The power infrastructure that isn’t arriving

Data centre power consumption is growing faster than electrical grid capacity in regions with data centre concentrations. The arithmetic is straightforward and the timeline is short, but the infrastructure investments required to address the gap are not being made at necessary scales.

AI infrastructure requires substantially more power per rack than conventional computing. A GPU-dense rack can consume 40 to 80 kilowatts compared to 5 to 10 kilowatts for conventional servers. Data centres built for traditional workloads lack the electrical distribution capacity to support AI-density equipment. Upgrading requires substantial electrical infrastructure investment that takes years to implement and that must happen before the equipment arrives rather than after.
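The arithmetic in question can be sketched in a few lines. The rack figures below are simply the midpoints of the ranges quoted above; the 200-rack hall is a hypothetical chosen for illustration, and the sketch ignores cooling, floor loading, and every other constraint a real facility would face.

```python
# Sketch: how many GPU-dense racks fit inside an electrical budget
# that was sized for conventional racks.
# Rack figures are midpoints of the ranges quoted in the text (assumption);
# the hall size is hypothetical.

CONVENTIONAL_RACK_KW = 7.5  # midpoint of 5 to 10 kW
GPU_RACK_KW = 60.0          # midpoint of 40 to 80 kW


def gpu_racks_supported(conventional_racks: int) -> int:
    """Return how many GPU-dense racks the same power budget can feed."""
    budget_kw = conventional_racks * CONVENTIONAL_RACK_KW
    return int(budget_kw // GPU_RACK_KW)


hall_racks = 200  # hypothetical hall designed for conventional servers
print(gpu_racks_supported(hall_racks))  # 1500 kW budget / 60 kW per rack = 25
```

Under these assumptions, a hall wired for 200 conventional racks can power 25 GPU-dense ones, which is roughly the scale of the distribution upgrade being described.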

The utility capacity constraints are emerging as data centres request connections for tens to hundreds of megawatts in regions where utilities planned for gradual growth rather than sudden doubling of demand. Utilities must upgrade substations, distribution lines, and potentially generation capacity, which requires planning approvals, construction time, and capital investment measured in hundreds of millions of euros. The timelines for utility infrastructure typically exceed the timelines for data centre construction, which creates situations where data centres are built but cannot operate at full capacity because power supply is inadequate.
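The timeline mismatch is equally easy to sketch. The function name and both durations below are hypothetical illustrations, not figures from any utility or operator; the point is only that when the longer project starts no earlier than the shorter one, the difference is time spent with a building and no power.

```python
# Sketch: months a completed data centre waits for grid capacity when
# construction and the utility upgrade begin at the same time.
# Both durations are hypothetical illustrations.


def months_waiting_for_power(build_months: int, grid_upgrade_months: int) -> int:
    """Return the gap between facility completion and power availability."""
    return max(0, grid_upgrade_months - build_months)


print(months_waiting_for_power(build_months=24, grid_upgrade_months=60))  # 36
```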

The renewable energy integration challenges are substantial when data centres claim to operate on renewable energy but draw from grids that are substantially fossil-fuel powered. The temporal mismatch between constant data centre demand and variable renewable generation requires either storage systems that don’t exist at necessary scales or acceptance that renewable energy claims are accounting fictions rather than physical realities. Building the storage infrastructure to enable genuine renewable operation requires investment that’s not currently happening.

The backup power systems that data centres depend on for reliability during grid outages are typically diesel generators designed for brief outages rather than extended operation. As grid reliability declines in some regions due to aging infrastructure and extreme weather, the assumptions about outage duration that backup systems were designed for may no longer hold. Upgrading to support longer outages or alternative backup power sources requires investment in infrastructure that’s only needed during emergencies and that provides no return during normal operation.

The Patrician observes that power infrastructure has lead times measured in years while enthusiasm for projects requiring that power has lead times measured in quarters, and that the gap between these timelines creates predictable infrastructure crises that everyone will claim were unpredictable despite being entirely predictable to anyone capable of arithmetic.

The cooling systems we’ll desperately need

Power consumption becomes waste heat that must be removed continuously or equipment fails. The cooling infrastructure required for AI workloads substantially exceeds what conventional data centres provide, but the investment in advanced cooling is inadequate for the equipment being deployed.

Liquid cooling infrastructure for high-density computing is necessary but expensive to retrofit into existing facilities designed for air cooling. The facilities require plumbing throughout, leak detection systems, heat exchangers, and expertise that air-cooled facilities don’t need. New construction can integrate liquid cooling from design, but retrofitting existing facilities is expensive and disruptive. The gap between what’s needed and what exists is growing as AI equipment density increases faster than cooling infrastructure is upgraded.

Water availability for cooling is constrained in many regions with data centre concentrations. Facilities requiring millions of litres daily for evaporative cooling face conflicts with other water users, particularly during droughts. The alternatives to water cooling like air cooling or closed-loop liquid cooling are less efficient or more expensive, which means they’re avoided until water availability forces the issue. Planning for water-constrained cooling infrastructure should be happening now but largely isn’t because water has been cheap and available historically.

The heat rejection infrastructure that removes waste heat from cooling systems to the environment is often undersized for current heat loads and will be inadequate for projected growth. Cooling towers, dry coolers, and heat exchangers sized for conventional computing cannot handle the heat loads from AI-dense facilities. Upgrading requires space, capital investment, and often planning approvals for large industrial equipment, all of which take time that is in short supply once needs have become urgent.

Waste heat utilisation where data centre heat is captured for district heating, industrial processes, or other uses could offset some environmental impact but requires infrastructure connecting data centres to heat consumers. This infrastructure is expensive, requires cooperation between parties with different interests, and provides returns over decades rather than quarters. The economic and organisational challenges mean waste heat utilisation remains rare despite making environmental sense.

The Patrician notes that cooling infrastructure is invisible when adequate and glaringly obvious when inadequate, which means investment naturally flows toward it only after failures occur rather than before, and that this creates periodic crises when equipment deployments race ahead of cooling capacity to support them.

The network capacity we’re assuming into existence

Networking infrastructure is being upgraded continuously but the upgrades may not keep pace with bandwidth demands from AI training, large model distribution, and increasing data movement between facilities and to users.

The interconnect speeds within data centres must increase to support distributed AI training across thousands of accelerators. Current high-speed networking provides impressive bandwidth but training at scale requires even higher speeds with lower latency. The switch fabrics, cabling infrastructure, and network interface cards required for next-generation speeds are expensive and require coordination across vendors who are developing the technology concurrently. Deployment is likely to lag behind the point at which AI training clusters need the bandwidth.

The wide area networking between data centres and to users faces pressure from increasing traffic volumes. Model weights for large AI models are tens to hundreds of gigabytes, which must be distributed to inference infrastructure globally. Training data must move to training clusters. User traffic to AI services is growing. The cumulative bandwidth requirements are substantial and increasing faster than wide area capacity is being added. Congestion and performance degradation are plausible outcomes when demand growth exceeds capacity investment.
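A rough sense of the volumes involved can be had from an idealised transfer-time calculation. The sketch below assumes decimal gigabytes, a fully dedicated link, and no protocol overhead, all of which flatter the result; the 100 GB model and 10 Gbit/s link are hypothetical values within the range mentioned above.

```python
# Sketch: idealised time to move model weights over a single link.
# Assumes decimal units, a dedicated link, and zero protocol overhead.
# The example size and link speed are hypothetical.


def transfer_seconds(size_gigabytes: float, link_gigabits_per_s: float) -> float:
    """Return the ideal transfer time in seconds (8 bits per byte)."""
    return size_gigabytes * 8 / link_gigabits_per_s


print(transfer_seconds(100, 10))  # 80.0
```

Eighty seconds for one copy is unremarkable. Multiplied across every inference site and every model revision, sharing links with training data and user traffic, it is where the cumulative requirement comes from.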

The last-mile connectivity to users remains a bottleneck for bandwidth-intensive applications. While data centre networking improves and backbone capacity increases, the connections to residential and business users often lag. AI applications requiring substantial bandwidth to deliver results face constraints from user connectivity that’s adequate for current applications but marginal for future ones. The investment in last-mile infrastructure is politically fraught, expensive, and proceeding slowly relative to bandwidth growth needs.

The content delivery networks and edge infrastructure that reduce latency by placing content near users require substantial investment to expand. AI inference at the edge to reduce latency requires deploying computing infrastructure broadly rather than concentrating it in large data centres. This distribution is expensive and operationally complex, which means it happens gradually while applications that would benefit from it are constrained by latency from centralised infrastructure.

The Patrician observes that networking infrastructure is like road infrastructure in that everyone wants more capacity but nobody wants to pay for it until congestion becomes intolerable, and that by the time congestion is intolerable, the lead times for adding capacity mean suffering through years of inadequate infrastructure before improvements arrive.

The skills and training pipeline that doesn’t exist

Technology infrastructure includes human expertise, which has even longer lead times than physical infrastructure because training people requires years and because expertise is lost more quickly than it’s gained when skilled people leave the field.

The AI engineering skills gap is acknowledged by everyone while training programs remain inadequate to meet demand. Universities are expanding AI programs but cannot quickly scale from hundreds to thousands of annual graduates with deep expertise. Bootcamps and online courses produce credentials but variable actual competence. The gap between demand for AI expertise and supply of genuinely skilled practitioners will persist for years because educational infrastructure takes time to scale and because superficial training produces superficial skills that don’t meet actual needs.

The infrastructure operations expertise for managing large-scale AI systems requires understanding both AI/ML and infrastructure operations, which is a rare combination. Training programs generally focus on one or the other but not both, which means the people who can effectively operate AI infrastructure are scarce. This gap matters more as AI systems scale and as operational complexity increases beyond what general operations teams can handle.

The security expertise for AI systems requires understanding both security principles and AI/ML specifics. The supply of people with both is minimal, which means AI security is often handled by security people who don’t understand ML or ML people who don’t understand security. Neither is ideal and both create vulnerabilities that will be exploited. Training programs that adequately cover both are rare.

The regulatory compliance expertise for navigating AI regulation requires understanding both the technology and the regulatory frameworks, which is another rare combination. As regulation increases, companies need people who can implement compliance without crippling the technology. The supply is inadequate and educational programs have barely begun addressing this need.

The Patrician observes that human expertise has the longest lead times of any infrastructure because growing people is slower than growing anything else, and that the current gaps in expertise will persist for a decade even if training programs improve immediately because you cannot create ten years of experience in less than ten years.

The boring infrastructure that enables everything else

The least glamorous infrastructure gaps are often the most consequential because they’re prerequisites for everything else but provide no direct value that would attract investment.

The electrical grid reliability in many regions is declining due to aging infrastructure, increased extreme weather, and growing demand. Data centres require extremely reliable power, typically targeting 99.99% uptime or better. Achieving this requires either grid reliability that many regions don’t provide or extensive on-site backup systems that are expensive and environmentally problematic. Investment in grid reliability is necessary but politically difficult and slow because it’s utility infrastructure requiring regulatory approval and public acceptance of costs.
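What an availability target permits in practice is a short calculation. The sketch below converts a percentage into a yearly downtime allowance, assuming a 365-day year; the function name is illustrative.

```python
# Sketch: convert an availability target into permitted downtime per year.
# Assumes a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given target."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


print(f"{allowed_downtime_minutes(99.99):.2f}")  # 52.56
```

A 99.99% target leaves under an hour of outage per year, which is less than a single extended grid failure in regions where reliability is declining.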

The water infrastructure in regions with data centre growth may be inadequate for projected cooling water demand, particularly during droughts when water availability is constrained. Planning for adequate water supply requires coordination between utilities, regulators, and data centre operators across timeframes of decades. This coordination is happening sporadically rather than systematically, which creates a risk of water shortages constraining data centre operations when demand exceeds supply.

The transportation infrastructure for delivering equipment, materials, and components to construction sites and operating facilities is often overlooked until it becomes a bottleneck. Large data centre construction requires enormous amounts of material delivered on tight schedules. Operating facilities require regular deliveries of replacement parts and supplies. Road capacity, port capacity, and logistics infrastructure all matter when volumes are substantial. The investment in this mundane infrastructure happens only when bottlenecks become obvious.

The physical security infrastructure for protecting facilities from various threats requires walls, sensors, guards, and monitoring systems that are expensive and that provide no value unless threats materialise. The investment is insurance that seems wasteful until needed, which means it’s often deferred or inadequate. As facilities become more critical and potential threats increase, the security infrastructure needs to improve, but the investment competes with more immediate needs.

The telecommunications infrastructure connecting facilities to communications networks requires reliability that standard commercial connections don’t always provide. Redundant diverse routing, backup systems, and coordination with telecommunications providers are necessary but expensive. The investment is justified by reliability requirements, but its returns are difficult to quantify, which makes it vulnerable to budget cuts.

The Patrician notes that boring infrastructure is boring until its absence creates a crisis, at which point it becomes extremely interesting while remaining expensive and time-consuming to address, and that planning ahead for boring infrastructure requires discipline that is rare in organisations optimised for immediate returns.

The Patrician’s assessment

Looking at the infrastructure gaps with appropriate concern for what’s not being built, The Patrician concludes that we’re systematically under-investing in unglamorous prerequisites while over-investing in glamorous applications, that this will create predictable crises when the prerequisites become bottlenecks, and that everyone will express surprise despite the inevitability being obvious to anyone paying attention.

The power infrastructure gap is entirely predictable through elementary arithmetic about growth rates and lead times. The cooling infrastructure gap follows directly from the power gap because waste heat must be removed. The networking gap results from bandwidth demand growing faster than capacity additions. The skills gap reflects educational pipelines that cannot scale instantly. The boring infrastructure gap exists because investment flows toward exciting projects rather than mundane necessities.

None of these gaps are mysteries. Reports identify them regularly. Technical people explain the problems clearly. The arithmetic is straightforward. The gaps persist because fixing them requires investment now to prevent problems later, and humans are reliably bad at this type of decision-making. The organisations capable of making necessary investments are optimised for quarterly returns rather than decade-long infrastructure planning.

The consequences will be infrastructure crises that constrain growth, increase costs, and create bottlenecks that everyone will wish had been addressed proactively. Power shortages will limit data centre expansion. Cooling inadequacy will constrain equipment density. Network congestion will degrade performance. Skills shortages will slow deployments. The boring infrastructure failures will create expensive disruptions.

The resolution will be frantic infrastructure building during crises at costs substantially higher than proactive investment would have required. This is the normal pattern where infrastructure gaps are addressed reactively rather than proactively because reactive spending is easier to justify than proactive spending even when proactive spending would be cheaper. The technology industry will follow this pattern because industries generally do.

The Patrician suggests that the sensible approach would be identifying obvious infrastructure gaps, calculating the lead times required to address them, and beginning investments immediately to ensure infrastructure is available when needed. This approach is sensible but rare because it requires long-term thinking that conflicts with short-term incentives that dominate most decision-making.

His prediction is that we’ll continue under-investing in necessary infrastructure until crises force investment, that the crises will be blamed on unforeseen circumstances despite being entirely foreseeable, and that the expensive reactive response will be accompanied by earnest promises to plan better next time that will be forgotten once the immediate crisis passes and attention returns to whatever is exciting rather than necessary.

The infrastructure nobody’s building will eventually be built because it will become impossible to avoid building it once its absence creates sufficient pain. The question is whether it’s built proactively at reasonable cost or reactively at inflated cost after the pain has become intolerable. The historical pattern suggests the latter, and The Patrician sees no evidence that the technology industry will break from this pattern despite the obvious benefits of doing so.

In the meantime, investment will continue flowing toward whatever generates immediate returns or captures current enthusiasm while unglamorous infrastructure needs accumulate until they become crises. This is not optimal but it’s predictable, and predicting it is the best available consolation for those who will be explaining in a few years why infrastructure that everyone knew would be needed wasn’t built when there was adequate time to build it properly. The answer, as always, will be that building it would have required sacrificing immediate returns for long-term stability, and that this sacrifice is something humans discuss approvingly in the abstract while reliably declining to make in practice.