Memory bottlenecks¶
The semiconductor industry has a tendency to fixate on the glamorous components while ignoring unglamorous dependencies that turn out to be equally critical. Everyone discusses GPU capabilities, fabrication process nodes, and computational performance while memory subsystems receive attention only when they become constraint points, which they increasingly are. This is rather like obsessing over the quality of a carriage’s horses while ignoring that the wheels are about to fall off, which is fine until you’re stranded halfway to your destination wondering why nobody mentioned the wheel situation.
High Bandwidth Memory, or HBM, has graduated from obscure technical specification to supply chain bottleneck with remarkable speed. AI accelerators require enormous memory bandwidth to keep their computational units fed with data. HBM provides this bandwidth through clever 3D stacking and wide interfaces, which is excellent for performance and terrible for manufacturing because HBM is considerably more difficult and expensive to produce than conventional memory. Only three manufacturers produce HBM at scale, their capacity is limited, and demand has surged beyond what anyone predicted when the factories were being planned years ago.
The result is a situation where GPU manufacturers can produce computational engines faster than memory manufacturers can supply the memory to accompany them. Having a state-of-the-art GPU without adequate memory is like having a kitchen full of talented chefs but no ingredients, which is to say technically impressive but not actually useful for producing the intended outputs. The memory bottleneck is constraining AI hardware deployment in ways that are less visible than GPU shortages but equally consequential for anyone attempting to build AI infrastructure.
What makes HBM special and difficult¶
High Bandwidth Memory achieves its performance through vertical integration in the literal sense. Multiple memory dies are stacked on top of each other and connected through thousands of tiny vertical connections called through-silicon vias. This stack sits on a logic die that manages the memory interface. The entire assembly is then integrated with the processor, creating a memory system that can move data at rates measured in terabytes per second rather than the hundreds of gigabytes per second that conventional memory achieves.
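As a rough sense of scale, the sketch below simply multiplies interface width by per-pin data rate. The figures are representative of HBM3-class and DDR5-class parts, not any particular product's datasheet.

```python
# Back-of-the-envelope bandwidth arithmetic: interface width times per-pin
# data rate. Figures are representative, not taken from a specific datasheet.
def peak_bandwidth_gbs(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s for a memory interface."""
    return interface_bits * pin_rate_gbps / 8  # bits per second -> bytes per second

hbm_stack = peak_bandwidth_gbs(1024, 6.4)    # ~819 GB/s for one HBM3-class stack
accelerator = 6 * hbm_stack                  # six stacks on one package
ddr5_channel = peak_bandwidth_gbs(64, 6.4)   # ~51 GB/s for one DDR5-class channel

print(f"HBM stack:    {hbm_stack:.0f} GB/s")
print(f"Six stacks:   {accelerator / 1000:.1f} TB/s")
print(f"DDR5 channel: {ddr5_channel:.0f} GB/s")
```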
The manufacturing process is extraordinarily complex. Each memory die must be thinned to around 50 micrometres, which is thin enough that handling them without breakage requires extreme care. Through-silicon vias must be etched through the dies with precision measured in micrometres. The dies must be aligned and bonded with tolerances that make conventional semiconductor manufacturing look forgiving. Any defect in any layer potentially ruins the entire stack, which affects manufacturing yields and costs substantially.
Testing is complicated by the 3D structure. You cannot easily probe individual dies within a stack to diagnose problems. Testing must occur after assembly, at which point discovering defects means scrapping an assembly that’s already consumed substantial manufacturing resources. This drives conservative approaches where only known-good dies are used for stacking, which further reduces effective manufacturing capacity because marginal dies that would be acceptable for conventional memory cannot be risked in HBM stacks.
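The arithmetic behind the known-good-die approach is easy to sketch. The percentages below are invented purely for illustration, but they show how quickly per-layer losses compound across a stack.

```python
# Illustrative stacked-memory yield arithmetic. All percentages are invented;
# the point is how per-layer losses compound across an 8-high stack.
def stack_yield(die_yield: float, bond_yield: float, layers: int) -> float:
    """Probability that every die and every bonding step in the stack is good."""
    return (die_yield ** layers) * (bond_yield ** layers)

# Stacking unscreened dies: even 95% per die collapses quickly.
print(f"unscreened dies: {stack_yield(0.95, 0.99, 8):.0%}")   # ~61%

# Known-good-die screening pushes the die term towards 1, leaving the
# bonding steps as the dominant loss.
print(f"known-good dies: {stack_yield(0.999, 0.99, 8):.0%}")  # ~92%
```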
Thermal management is challenging because stacked dies generate heat that must be conducted away through the stack. The middle dies in a stack are effectively insulated by the dies above and below them, which limits how much power the memory can consume without overheating. This constrains memory clock speeds and requires careful thermal design to prevent hot spots that would degrade reliability or performance.
The economics are unfavourable compared to conventional memory. HBM costs several times more per gigabyte than DDR memory used in conventional servers. The price premium reflects both the manufacturing difficulty and the limited number of suppliers. For applications requiring the bandwidth HBM provides, the cost is justified by performance gains. For applications that could work with conventional memory, HBM is an expensive indulgence. AI accelerators need HBM because their computational throughput would be throttled by anything slower.
The supply chain concentration problem¶
HBM is manufactured by three companies at meaningful scale. SK Hynix leads in HBM production and supplies the majority of Nvidia’s requirements. Samsung manufactures HBM but has had quality challenges that affected their market share. Micron entered HBM production more recently and is ramping capacity. This three-supplier situation creates familiar concentration risks where any disruption to a major supplier immediately affects the entire AI hardware supply chain.
SK Hynix’s dominance means that Nvidia’s GPU supply depends not just on TSMC’s fabrication capacity but also on SK Hynix’s HBM production. Having the fastest GPU architecture is useful only if you can attach adequate memory, and if SK Hynix cannot supply sufficient HBM, Nvidia cannot ship complete products. This dependency isn’t publicly discussed as much as GPU fabrication but is equally important for understanding AI hardware availability.
Manufacturing capacity for HBM has not scaled as quickly as demand. Building HBM production capacity requires specialised equipment, cleanroom space, and expertise in 3D integration that cannot be rapidly acquired. The memory manufacturers have been expanding capacity but from relatively small bases because HBM was previously a niche product for high-performance computing rather than a mainstream requirement. Ramping to meet AI demand requires years of capacity expansion that’s ongoing but not yet sufficient.
The capital requirements are substantial. An HBM production line costs several billion euros to establish and requires ongoing investment in advanced packaging equipment as specifications evolve. Memory manufacturers must balance investment in HBM against other memory products where demand and margins are more predictable. They’re investing heavily in HBM, but cautiously, because overbuilding capacity only for demand to soften would be an expensive mistake to correct later.
Alternative suppliers exist in principle but not in practice. Several additional memory manufacturers could develop HBM capabilities, but doing so requires years of development and capital investment with uncertain return. New entrants face the challenge that SK Hynix, Samsung, and Micron have accumulated years of experience and presumably have significant yield advantages from learning curve effects. Competing against established suppliers who are already ramping capacity is unattractive unless margins remain high enough to justify the investment.
The bandwidth imperative¶
AI workloads are particularly memory-intensive because training and running large neural networks involves moving enormous amounts of data between memory and computational units. A large language model might have hundreds of billions of parameters, each requiring multiple bytes of storage. Training involves repeatedly reading and updating these parameters, which means memory bandwidth becomes the limiting factor for computational throughput.
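A quick worked example makes the scale concrete. The model sizes below are hypothetical, and the figures cover parameters alone; optimiser state and activations during training multiply the footprint several times over.

```python
# Rough parameter-memory footprint for hypothetical model sizes. Optimiser
# state and activations during training add several times more on top.
def parameter_gigabytes(params_billions: float, bytes_per_param: int) -> float:
    """Gigabytes required just to hold the model weights."""
    return params_billions * bytes_per_param  # (1e9 params * bytes) / 1e9 bytes-per-GB

for size_b in (7, 70, 400):
    print(f"{size_b:>3}B parameters: "
          f"{parameter_gigabytes(size_b, 2):>4.0f} GB at 16-bit, "
          f"{parameter_gigabytes(size_b, 4):>4.0f} GB at 32-bit")
```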
GPUs with exceptional computational capacity are throttled by memory bandwidth when running AI workloads. A GPU capable of performing petaflops of computation achieves this performance only if memory can supply data fast enough. If memory bandwidth is insufficient, computational units sit idle waiting for data, which wastes expensive hardware capability. This is why AI accelerators increasingly use HBM despite its cost and supply constraints.
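This is the familiar roofline argument, and a minimal version of it fits in a few lines. The peak compute and bandwidth figures below are assumed placeholders rather than any specific accelerator's specification.

```python
# Minimal roofline-style check: does compute or memory bandwidth set the pace?
# Peak figures are illustrative placeholders, not a specific GPU's spec sheet.
def limiting_factor(flops: float, bytes_moved: float,
                    peak_flops: float, peak_bandwidth: float) -> str:
    """Compare time spent computing against time spent waiting on memory."""
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bandwidth
    return "compute-bound" if t_compute >= t_memory else "memory-bound"

PEAK_FLOPS = 1e15        # 1 Pflop/s of usable compute (assumed)
PEAK_BANDWIDTH = 3e12    # 3 TB/s of memory bandwidth (assumed)

# Low-batch inference: roughly two operations per byte of weights streamed,
# so 100 GB of weights keeps the memory busy while the compute units idle.
print(limiting_factor(flops=2 * 100e9, bytes_moved=100e9,
                      peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH))
```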
The bandwidth requirements grow with model size. Larger models have more parameters, which means more data to move between memory and processors. The trend in AI has been toward progressively larger models because they generally perform better. This trend drives exponentially growing memory bandwidth requirements that conventional memory technologies cannot satisfy without using impractically large numbers of memory modules.
Memory capacity requirements are substantial but secondary to bandwidth. A system might need hundreds of gigabytes or even terabytes of memory capacity to hold model parameters, but this capacity is useless if bandwidth is insufficient to keep processors fed with data. This is why AI accelerators prioritise memory bandwidth over capacity, using HBM despite its relatively modest capacity per package compared to conventional memory.
The economic calculus is straightforward. An expensive GPU sitting idle waiting for memory data wastes money continuously. Paying premium prices for HBM that keeps the GPU fully utilised is economically rational because the computational capability is the expensive resource that must be maximised. This shifts memory from cost optimisation toward performance optimisation, which explains why AI hardware uses HBM despite its substantial cost premium over conventional alternatives.
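To put rough numbers on that calculus: the figures below are assumed round numbers, but they show why a memory premium that removes stalls pays for itself quickly.

```python
# Illustrative utilisation economics. All figures are assumed round numbers,
# not actual hardware or memory pricing.
accelerator_cost_eur = 30_000     # assumed purchase price of the accelerator
idle_fraction = 0.4               # fraction of its life spent stalled on memory
hbm_premium_eur = 3_000           # assumed extra spend on faster memory

wasted_capital = accelerator_cost_eur * idle_fraction
print(f"capital effectively wasted on stalls: {wasted_capital:,.0f} EUR "
      f"vs memory premium: {hbm_premium_eur:,.0f} EUR")
```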
Packaging and integration challenges¶
Integrating HBM with processors requires advanced packaging technology that adds complexity, cost, and additional manufacturing dependencies. The processor die and HBM stacks must be mounted on a common substrate or interposer with high-speed connections between them. This packaging is as technically challenging as manufacturing the components themselves and introduces additional failure modes.
Silicon interposers are common for HBM integration. These are thin silicon substrates with fine-pitch wiring that connects the processor to multiple HBM stacks. The interposer itself requires semiconductor fabrication, though at less demanding process nodes than cutting-edge logic chips. Manufacturing interposers large enough to accommodate a GPU and multiple HBM stacks pushes the limits of what’s manufacturable, particularly with respect to yield and cost.
Organic substrates provide an alternative to silicon interposers with lower cost but more challenging electrical characteristics. Achieving the signal integrity required for HBM interfaces on organic substrates requires sophisticated design and manufacturing. Both approaches work but involve trade-offs between cost, electrical performance, thermal management, and manufacturing complexity.
The packaging process has its own yield losses. Even if the GPU die and HBM stacks are individually perfect, the packaging process can introduce defects through misalignment, bonding failures, or contamination. These yield losses compound with the component-level yields, reducing the number of working assemblies from each batch. Advanced packaging yield is improving but remains lower than desirable, particularly for large assemblies with multiple HBM stacks.
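The compounding works the same way as it does within a stack. The percentages below are invented, but they illustrate how respectable individual yields still multiply down to a noticeably lower assembly yield.

```python
# Compounding of component and packaging yields for one assembled accelerator.
# Percentages are invented for illustration only.
def assembly_yield(gpu_die_yield: float, hbm_stack_yield: float,
                   stacks: int, packaging_yield: float) -> float:
    """Probability that a packaged GPU with several HBM stacks works."""
    return gpu_die_yield * (hbm_stack_yield ** stacks) * packaging_yield

# Respectable individual yields still multiply down noticeably.
print(f"{assembly_yield(0.90, 0.97, stacks=6, packaging_yield=0.95):.0%}")  # ~71%
```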
Packaging capacity is another potential bottleneck. Only certain facilities have the equipment and expertise to perform advanced packaging for HBM integration. TSMC provides advanced packaging services and has invested heavily in capacity, but demand growth may exceed their expansion rate. Alternative packaging providers exist but with varying capabilities and capacities. Packaging constraints could limit GPU production even when GPU fabrication and HBM supply are adequate.
The upgrade treadmill¶
Each generation of AI accelerators requires more memory bandwidth than the previous generation, which drives HBM specifications to progressively higher performance levels. HBM2 gave way to HBM2e, then HBM3, and HBM3e is now emerging with HBM4 in development. Each generation increases bandwidth and capacity but also introduces new manufacturing challenges that temporarily reduce yields until processes mature.
Memory manufacturers must continually invest in new HBM generations to remain competitive while maintaining production of previous generations for ongoing product lines. This divides their engineering resources and capital investment across multiple product generations simultaneously. Ramping new generations while sustaining old ones is operationally complex and expensive but necessary to meet the market’s diverse requirements.
The transition periods between HBM generations create temporary capacity constraints as manufacturers shift production from mature processes to newer ones. Early in a new generation’s lifecycle, yields are lower and capacity is limited. This means new GPU designs using the latest HBM might face tighter supply constraints than older designs using mature HBM generations. Planning product launches around HBM availability becomes critical for GPU manufacturers.
The cost structure varies across HBM generations. Newer generations command premium pricing while yields are low and demand exceeds supply. As processes mature and capacity increases, prices gradually decline until the next generation emerges and the cycle repeats. This creates pressure for GPU manufacturers to design products using the latest HBM to achieve best performance but also incentives to use previous-generation HBM for cost-sensitive applications.
Standards development for each HBM generation involves the JEDEC standards organisation and major memory and processor manufacturers. This coordination is necessary for ensuring compatibility but adds time to the development cycle. New HBM generations take years from initial specification to volume production, which means manufacturers must predict future requirements years in advance and hope their predictions align with actual market needs when products launch.
The cost implications¶
HBM represents a substantial portion of AI accelerator costs, sometimes exceeding the cost of the GPU die itself. For a high-end AI accelerator with multiple HBM stacks, memory might account for 30 to 50 percent of total bill of materials cost. This makes memory cost optimisation important for product economics but difficult given the limited supplier base and tight supply.
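As a sanity check on that range, here is an invented bill-of-materials split with assumed round-number prices; it is a sketch of the arithmetic, not actual pricing.

```python
# Invented bill-of-materials split for a hypothetical high-end accelerator.
# Every price here is an assumed round number, not actual contract pricing.
hbm_stacks = 6
gb_per_stack = 24
hbm_price_per_gb = 12       # euros per GB, assumed
gpu_die_cost = 1200         # euros, assumed
packaging_and_other = 600   # euros, assumed

hbm_cost = hbm_stacks * gb_per_stack * hbm_price_per_gb
total_bom = hbm_cost + gpu_die_cost + packaging_and_other
print(f"HBM cost: {hbm_cost} EUR ({hbm_cost / total_bom:.0%} of the bill of materials)")
```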
Volume pricing and long-term supply agreements matter enormously. Large purchasers negotiating directly with memory manufacturers receive better pricing than smaller buyers acquiring through distribution. GPU manufacturers like Nvidia negotiate complex supply agreements with SK Hynix that guarantee capacity and pricing in exchange for volume commitments. These agreements shift risk between parties but don’t eliminate the fundamental constraint of limited manufacturing capacity.
The premium for latest-generation HBM can be substantial. HBM3 commanded significant premiums when it first became available. These premiums gradually decline as capacity increases and processes mature, but early adopters pay substantially more for being first. This creates tension between wanting cutting-edge performance and managing product costs, particularly for companies without the pricing power to pass HBM premiums directly to customers.
Memory costs affect AI economics at multiple levels. Training costs increase with memory prices because GPU clusters require more expensive hardware. Inference costs increase because serving models efficiently requires GPUs with adequate memory bandwidth. The cumulative effect is that HBM supply constraints and pricing impact both initial AI development costs and ongoing operational expenses.
Alternative approaches like using multiple GPUs with distributed memory can partially mitigate memory constraints but introduce communication overhead and programming complexity. These trade-offs might be acceptable for large-scale training where parallelism is necessary anyway but are less attractive for inference where latency matters. The memory bottleneck cannot be entirely designed around; it can only be managed through careful system architecture and acceptance that memory bandwidth limits what’s achievable on given hardware.
The path forward (gradually)¶
HBM supply constraints will gradually ease as manufacturers expand capacity, yields improve, and additional suppliers enter the market. This improvement will be measured in years rather than months because capacity expansion requires sustained capital investment and manufacturing learning curves that cannot be rushed. The constraint will persist through the mid-2020s at minimum and possibly longer if AI hardware demand continues growing rapidly.
Technology improvements provide incremental relief. Each HBM generation offers higher bandwidth per stack, which means fewer stacks are needed for given bandwidth requirements. This reduces packaging complexity, improves yields, and stretches HBM supply further. However, these improvements are partially offset by AI accelerators demanding ever-higher total bandwidth, so technology improvements mitigate rather than eliminate the constraint.
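A small calculation shows the effect. The per-stack figures below are approximate, and the 5 terabytes-per-second target is arbitrary.

```python
# Stacks required to reach an arbitrary 5 TB/s aggregate bandwidth, using
# approximate per-stack figures for recent HBM generations.
import math

TARGET_TBPS = 5.0
per_stack_tbps = {"HBM2e": 0.46, "HBM3": 0.82, "HBM3e": 1.2}

for generation, bandwidth in per_stack_tbps.items():
    stacks_needed = math.ceil(TARGET_TBPS / bandwidth)
    print(f"{generation}: {stacks_needed} stacks")
```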
Alternative memory technologies are explored but face significant barriers to displacing HBM. Various research projects investigate new memory architectures, materials, or integration approaches that might provide comparable bandwidth with easier manufacturing or lower cost. These alternatives are years from commercial viability and must overcome both technical challenges and the advantage that HBM manufacturers have from established production and improving economies of scale.
Architectural innovations in AI accelerators can reduce memory bandwidth requirements through better caching, compression, or computational techniques that minimise data movement. These optimisations are valuable and actively pursued but have limits determined by fundamental requirements of the algorithms being executed. Some reduction in memory bandwidth requirements is possible, but eliminating the need for high-bandwidth memory through software alone is implausible.
Industry consolidation in memory manufacturing could occur if HBM remains highly profitable and if additional manufacturers decide the investment is justified. This would increase capacity and competition but takes years to materialise and faces the barrier that existing manufacturers have substantial experience advantages. New entrants would face a challenging path to competitive yields and costs even with substantial capital investment.
The realistic outlook is that HBM supply remains tight relative to demand for several years, gradually easing as capacity expansion and yield improvements increase supply. Memory will remain a significant cost component and potential constraint for AI hardware deployment. GPU manufacturers will continue managing complex relationships with memory suppliers while attempting to design architectures that use memory efficiently. Memory manufacturers will continue ramping HBM production while carefully managing investment to avoid overbuilding capacity if demand softens.
The memory bottleneck is less visible than processor constraints but equally consequential for AI infrastructure deployment. It represents another dependency in the complex supply chain required for AI hardware, another concentration risk among limited suppliers, and another case where years of capacity expansion are required to meet demands that emerged faster than anyone predicted. Understanding the AI hardware landscape requires recognising that the impressive computational capabilities discussed in specifications depend equally on memory subsystems that rarely receive equivalent attention until they become the limiting factor, at which point everyone suddenly remembers that computation without adequate memory bandwidth is merely expensive frustration rather than useful infrastructure.