FEATURE

How Much RAM Do You Need for AI in 2026?

RAM is the bottleneck for local AI. Here is how much you actually need in 2026, why bandwidth matters more than size, and where Apple’s M5 Ultra changes everything.

Ryan Lipton

28 April 2026 · 13 min read

If you are asking how much RAM you need for AI in 2026, the honest answer is: it depends less on the number of gigabytes and more on how fast those gigabytes can move. Memory bandwidth, not capacity, is the bottleneck that determines whether a local language model feels conversational or unusable. That distinction explains why Apple’s unified memory architecture has quietly become the most important development in consumer AI hardware, and why the upcoming M5 Ultra Mac Studio could be the first desktop to make a 405-billion-parameter model feel interactive. This is not a spec-sheet comparison. It is a practical guide to what runs, what crawls, and where the AI hardware conversation is heading next.

The Bottleneck Nobody Talks About

Most guides frame local AI as a storage problem: does the model fit in memory? That is half the story. The other half is speed.

Every time a large language model generates a token, it reads the entire set of model weights from memory. Once. Per token. For a 70-billion-parameter model quantised to 4-bit precision, that is roughly 40 GB of data read for every single token the model produces. The speed at which your hardware can read that data, measured in GB/s of memory bandwidth, directly determines how fast the model responds.

The formula is simple: tokens per second ≈ memory bandwidth (GB/s) ÷ model size in memory (GB).

This is why a £1,999 graphics card with 32 GB of VRAM can feel slower than a Mac with 64 GB of unified memory when running a large model. It is not about raw compute power. It is about how fast the hardware can feed data to the processor. Get this wrong, and a model that should feel conversational becomes a three-second wait between every sentence.

Here is the practical reality across platforms in 2026:

Platform	Memory bandwidth	70B model speed (Q4, ~40 GB)	Verdict
Typical PC (DDR5-5600, dual channel)	~80 GB/s	~2 tokens/sec	Unusable for conversation
NVIDIA RTX 5090 (32 GB GDDR7)	1,790 GB/s	Model does not fit	32 GB VRAM ceiling
Apple M5 Max (128 GB unified)	614 GB/s	~15 tokens/sec	Comfortable; conversational
Apple M5 Ultra (256 GB unified, projected)	~1,100 GB/s	~28 tokens/sec	Fast; genuinely interactive
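
The estimates in the table follow directly from the bandwidth formula. Here is a minimal Python sketch, assuming the bandwidth figures quoted in this article; the function name `tokens_per_second` is illustrative, and the result is best read as an upper bound because it ignores compute time, cache reads, and runtime overhead.

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound decode rate: each token needs one full read of the weights."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures as quoted in this article; the M5 Ultra value is a projection.
PLATFORMS = {
    "Typical PC (DDR5-5600, dual channel)": 80.0,
    "Apple M5 Max": 614.0,
    "Apple M5 Ultra (projected)": 1100.0,
}

MODEL_SIZE_GB = 40.0  # 70B model at 4-bit quantisation, per the article

for name, bandwidth in PLATFORMS.items():
    rate = tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: ~{rate:.1f} tokens/sec")
```

Real-world throughput usually lands somewhat below these figures, but the ordering between platforms tends to hold.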

The RTX 5090 has extraordinary bandwidth. It also has only 32 GB of VRAM, and a 70B model at 4-bit quantisation occupies roughly 40 GB. The model simply does not fit. When it spills to system RAM, the spilled portion is served over the PCIe bus at roughly 32 GB/s (the one-way rate of a PCIe 4.0 x16 link) instead of 1,790 GB/s, and generation speed collapses with it. The £1,999 GPU becomes a bottleneck, not a solution.
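
To see why a small spill hurts so much, here is a rough Python sketch of the time per token when part of the model sits outside VRAM. It is a simplified model, not a benchmark: it assumes all 32 GB of VRAM hold weights, and that every spilled byte is read at the slower rate once per token. Real runtimes may instead run the offloaded layers on the CPU, which changes the numbers but not the shape of the result. The function name is hypothetical.

```python
def spill_tokens_per_second(
    model_gb: float,
    vram_gb: float,
    vram_bw_gb_s: float,
    spill_bw_gb_s: float,
) -> float:
    """Estimate decode rate when part of the model lives outside VRAM.

    Time per token = time to read the in-VRAM part at VRAM bandwidth
    plus time to read the spilled part at the slower bandwidth.
    """
    in_vram = min(model_gb, vram_gb)
    spilled = model_gb - in_vram
    seconds_per_token = in_vram / vram_bw_gb_s + spilled / spill_bw_gb_s
    return 1.0 / seconds_per_token


# Figures from the article: 40 GB model, 32 GB VRAM at 1,790 GB/s,
# spilled weights served at roughly 32 GB/s.
rate = spill_tokens_per_second(40.0, 32.0, 1790.0, 32.0)
print(f"~{rate:.1f} tokens/sec with 8 GB spilled")
```

Under these assumptions the 8 GB that spills accounts for more than 90% of the time per token, which is why the fast VRAM stops mattering once the model no longer fits.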

How Much RAM for AI at Every Tier

Not every use case requires a frontier model. Here is what is practical at each level in 2026, assuming you are running locally with tools like Ollama or LM Studio.

8-16 GB: Entry Level

Small models up to 9 billion parameters at 4-bit quantisation. Think Llama 3.1 8B, Mistral 7B, or Gemma 3 1B. Useful for code completion, simple Q&A, and lightweight summarisation. A £599 MacBook Air with 16 GB handles this tier comfortably. On a Windows PC with 16 GB of system RAM and no discrete GPU, you are limited to CPU inference, which is functional but slow.

24-32 GB: The New Baseline

Mid-range models of roughly 14-27 billion parameters. Qwen 2.5 14B, Gemma 3 27B, and Mistral Small 24B all fit here. This is where local AI starts to feel genuinely useful: models at this tier produce coherent long-form writing, handle complex reasoning, and follow multi-step instructions reliably. An RTX 5090 with 32 GB VRAM runs these models at 60+ tokens per second. A 32 GB Mac Mini runs them at 30-45 tokens per second via MLX. Both are more than fast enough for interactive use.

64 GB: The Crossover Point

This is where the RAM question for AI in 2026 stops being about capacity and starts being about architecture. Apple’s unified memory pulls ahead of discrete GPU setups here. A 64 GB Mac with an M5 Pro chip (307 GB/s bandwidth) can run Llama 3.3 70B at 4-bit quantisation at roughly 8 tokens per second. Not fast, but usable. The same model on an equivalent PC build with 64 GB DDR5 and an RTX 5090 requires layer offloading: some of the model sits in the GPU’s 32 GB VRAM (fast) and the rest spills to system RAM (slow). Independent benchmarks show a Mac M4 Max at 28 tokens per second versus a PC with an RTX 4090 at 10 tokens per second on large models where spilling occurs. The Mac is not faster because it has more compute. It is faster because it has no bus to cross.

128 GB: Serious Local AI

The current ceiling for Apple’s M5 Max chip. At 614 GB/s of unified memory bandwidth, a 128 GB M5 Max configuration runs 70B models with headroom for extended context windows and can experiment with quantised versions of larger models. On the PC side, 128 GB of system RAM is possible but the GPU bottleneck remains: the RTX 5090 still only has 32 GB of VRAM. The rest sits behind the PCIe wall. More RAM does not help.
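
The "headroom for extended context windows" point can be made concrete by estimating the key-value cache that grows with context length. The sketch below is illustrative only: it assumes the publicly documented Llama 3 70B shape (80 layers, 8 key-value heads, head dimension 128) and a 16-bit cache, neither of which is stated in this article, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of cache added per token: one key and one value vector
    for every KV head in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(
    context_tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> float:
    """Total cache size in decimal GB for a given context length."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 1e9


# Assumed model shape (Llama 3 70B) and the article's ~40 GB weight figure.
LLAMA_70B = {"layers": 80, "kv_heads": 8, "head_dim": 128}
WEIGHTS_GB = 40.0

for context in (8_192, 32_768, 131_072):
    cache = kv_cache_gb(context, **LLAMA_70B)
    print(f"{context:>7} tokens: cache ~{cache:.1f} GB, total ~{WEIGHTS_GB + cache:.1f} GB")
```

Under these assumptions a 131,072-token context adds about 43 GB on top of the weights, which is comfortable in 128 GB and out of reach in 64 GB. Runtimes that quantise the cache would need less.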

256 GB and Beyond: Frontier Territory

This is where the Mac Studio M5 Ultra enters the conversation.

Apple’s Structural Advantage

Apple did not design its chips for AI. The original goal was efficient silicon with large, fast memory pools for creative professionals. Video editors, musicians, and 3D artists needed unified memory for different reasons: scrubbing 8K timelines, loading massive sample libraries, rendering complex scenes without swapping to disk. The architecture Apple built for those workflows, refined further at each annual platform update, turns out to be almost perfectly suited for large language model inference.

The reason is straightforward. Unified memory means the CPU, GPU, and Neural Engine all share the same physical memory pool. There is no copying between pools, no bus transfer, no bandwidth cliff when a model exceeds one component’s allocation. A 70 GB model on a 128 GB Mac sits in one contiguous pool accessible at full bandwidth by every processor on the chip.

On a PC, the same model is split: some layers on the GPU (fast, bandwidth-rich), the remainder in system RAM (slow, bandwidth-poor), connected by PCIe (the bottleneck). This split is not a software limitation. It is a constraint of the physical architecture: no driver update will fix it and no optimisation will bridge it.

Apple’s software stack reinforces the hardware advantage. MLX, Apple’s machine learning framework built specifically for Apple Silicon, now achieves the highest sustained generation throughput on the platform: roughly 230 tokens per second on supported configurations, according to a November 2025 arXiv study. In March 2026, Ollama integrated MLX as its default Apple Silicon backend, delivering 57% faster prefill and 93% faster decode without any user configuration changes. The practical result: a Mac Mini M4 Pro running Qwen3-Coder-30B via MLX produces roughly 130 tokens per second, versus 43 tokens per second on the older llama.cpp backend. Same hardware, same model, triple the output.

Andrej Karpathy, former Tesla AI director and OpenAI co-founder, bought a Mac Mini and noted publicly that large language models are ‘limited by memory bandwidth, not raw compute’. That observation is the entire thesis of Apple’s advantage in one sentence.

The M5 Ultra: Why It Matters

Apple skipped the M4 Ultra entirely. The M4 Max was designed without the UltraFusion interconnect needed to fuse two dies into a single chip. That gap means the M5 Ultra, expected at WWDC in mid-2026, will represent a two-generation leap from the M3 Ultra for buyers who have been waiting. The M5 Pro and M5 Max announcement already confirmed the architecture. The Ultra is the logical next step.

If Apple follows its established Ultra pattern (doubling the Max die), the M5 Ultra would deliver:

36-core CPU (12 performance, 24 efficiency)
Up to 80-core GPU with integrated Neural Accelerators
Up to 256 GB unified memory (some sources suggest this figure; it has not been confirmed by Apple)
Projected memory bandwidth of ~1,100 GB/s, slightly below a straight doubling of the 40-core M5 Max’s 614 GB/s (which would give 1,228 GB/s)
Expected pricing around £3,499-£3,999, based on current Mac Studio M3 Ultra pricing

Here is what those numbers mean in practice:

Model	Size (Q4)	M5 Ultra projected speed	Experience
Llama 3.3 70B	~40 GB	~28 tokens/sec	Fast. Genuinely conversational.
DeepSeek R1 Distill 70B	~40 GB	~28 tokens/sec	Complex reasoning at interactive speed
Llama 3.1 405B	~200 GB	~5-6 tokens/sec	Usable. Fits entirely in memory with room for context.
Multiple 70B agents	~80-120 GB	~10-15 tokens/sec each	Parallel agent workflows on a single desktop

The 405B row is the story. No consumer NVIDIA GPU can run Llama 3.1 405B without multi-GPU setups costing £15,000 or more. The M5 Ultra Mac Studio would load it into a single memory pool on a machine that fits on a desk and draws under 100 watts.
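
The multi-agent row in the table can be sanity-checked with the same bandwidth arithmetic. This is a minimal sketch under two loud assumptions that the article does not state: each agent keeps its own ~40 GB copy of a 70B model, and all agents generate at once, sharing the projected 1,100 GB/s evenly. The function name is illustrative.

```python
def per_agent_tokens_per_second(
    bandwidth_gb_s: float, model_gb: float, agents: int
) -> float:
    """Per-agent decode rate when several models share one memory bus.

    Each generated token needs a full read of that agent's weights, so
    N agents running at once need N full reads per round of tokens.
    """
    if agents < 1:
        raise ValueError("agents must be at least 1")
    return bandwidth_gb_s / (model_gb * agents)


PROJECTED_BANDWIDTH = 1100.0  # GB/s, the article's M5 Ultra projection
MODEL_GB = 40.0               # one 70B model at 4-bit quantisation

for agents in (1, 2, 3):
    rate = per_agent_tokens_per_second(PROJECTED_BANDWIDTH, MODEL_GB, agents)
    print(f"{agents} agent(s), {MODEL_GB * agents:.0f} GB loaded: ~{rate:.1f} tokens/sec each")
```

Two or three agents land in the 80-120 GB and roughly 10-15 tokens per second range shown in the table. Agents that are idle do not consume bandwidth, so a workflow where they take turns would see rates closer to the single-model figure.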

The Caveat Worth Naming

If the rumoured 256 GB maximum holds, that is a reduction from the M3 Ultra’s 512 GB ceiling. For the small community of users who bought into the M3 Ultra specifically to run frontier models at higher precision, this would be a meaningful regression. At 256 GB, Llama 405B fits at 4-bit quantisation with limited headroom for context. At 512 GB, it fits with room to spare. Apple has not confirmed the maximum memory configuration. This is worth watching when the announcement arrives.

NVIDIA’s Counter-Argument

This is not a one-sided story. For models that sit comfortably inside 32 GB of VRAM, the RTX 5090 is substantially faster than any Apple Silicon chip. On an 8-billion-parameter model, the RTX 5090 produces 213 tokens per second. An M5 Max manages roughly 80-100. For batch inference, multi-user serving, and any workflow involving vLLM or TensorRT-LLM, NVIDIA’s CUDA ecosystem has no Mac equivalent. If you are running a local AI server for a five-person dev team using vLLM, or fine-tuning LoRA adapters on a 13B model, or if your models comfortably fit in 32 GB, a high-end NVIDIA card remains the faster and often cheaper option. The distinction matters because AI hardware most often disappoints when it is bought without regard to the workload; getting the right tool for your specific workload is the entire point.

The crossover happens at roughly 35 billion parameters. Below that line, NVIDIA wins on raw speed. Above it, Apple wins on the ability to run the model at all without a catastrophic bandwidth penalty. Both positions are legitimate. Know which side of that line you sit on.

Who Should Wait for the M5 Ultra

Wait if:
– You want to run 70B+ models at genuinely interactive speeds on a single machine
– You are building agent workflows that require multiple large models loaded simultaneously
– You need Llama 405B or equivalent frontier models accessible locally, even at quantised precision
– You already own an M1/M2 Mac and the M5 Ultra represents a generational leap in both bandwidth and capacity

Buy now if:
– Your models fit within 32 GB and raw speed matters more than capacity: get an RTX 5090
– You need 64-128 GB of unified memory today: the M5 Max Mac Studio or MacBook Pro is available now and handles 70B models capably
– You cannot justify £3,500+ for a desktop: the Mac Mini with M5 (£599, 32 GB) runs 14B models comfortably, and that is more than enough for most practical local AI tasks

Where to Buy

If you are ready to invest in local AI hardware today, these are the machines discussed in this guide:

Apple Mac Studio (M5 Max, up to 128 GB unified memory) – Mac Studio on Amazon

Apple MacBook Pro 16 (M5 Max, up to 128 GB unified memory) – MacBook Pro 16 on Amazon

Apple Mac Mini (M5, 32 GB unified memory) – Mac Mini on Amazon

NVIDIA RTX 5090 (32 GB GDDR7) – RTX 5090 on Amazon

FAQ

How much RAM do I need to run AI locally?

For small models (7-9B parameters), 16 GB is the minimum. For mid-range models (14-32B), 32 GB is comfortable. For large models like Llama 70B, you need 64 GB as a minimum with 128 GB recommended. Crucially, bandwidth determines real-world speed: unified memory on Apple Silicon avoids the throughput cliff that cripples PC setups when models exceed GPU VRAM capacity.

What is the best computer for running LLMs at home?

For models up to 32B parameters, a PC with an RTX 5090 (32 GB VRAM) offers the fastest raw speed at around £2,000 for the GPU alone. For larger models above 35B parameters, a Mac with Apple Silicon and sufficient unified memory delivers better real-world performance because nothing spills to a slower bus. The Mac Studio with M5 Max (128 GB) is the current sweet spot for serious local AI work.

Is 32 GB RAM enough for AI?

Yes, for models up to roughly 27 billion parameters at 4-bit quantisation. This covers popular mid-range models like Qwen 2.5 14B and Gemma 3 27B, both of which produce high-quality output for writing, coding, and reasoning tasks. For larger models like Llama 70B, 32 GB is insufficient. On Apple Silicon, 32 GB unified memory performs significantly better than 32 GB of DDR5 system RAM on a PC, because the GPU can access it at full bandwidth.

Can you run ChatGPT locally?

Not ChatGPT specifically, as that is a proprietary service. However, open-source models like Llama 3.3 70B, DeepSeek R1, and Qwen 3 deliver comparable quality for many tasks and run entirely on local hardware. Tools like Ollama and LM Studio make setup straightforward. A 64 GB Mac or a PC with an RTX 5090 can run mid-range models at conversational speeds without any internet connection or subscription.

Is Apple Silicon good for AI?

Yes, for local inference on large models, it is currently the strongest consumer option. Unified memory eliminates the VRAM ceiling that forces PC setups into slow layer offloading, and Apple’s MLX framework now triples the speed of older backends on identical hardware. The limitation is scope: for AI training and batch serving, NVIDIA’s CUDA ecosystem remains dominant. Apple’s lead is specific to single-user inference on models that exceed 32 GB.

What is unified memory and why does it matter for AI?

On Apple Silicon, every processor on the chip, including the CPU, GPU, and Neural Engine, shares one physical memory pool. A PC separates GPU VRAM from system RAM, connected by a PCIe bus. When a model outgrows the GPU’s allocation, it splits across both pools and the bus becomes a severe throughput bottleneck. Apple’s architecture avoids the split entirely: the full model sits in a single high-bandwidth pool with no data copying between components. That design choice eliminates the largest constraint facing consumer AI hardware today.

Is the Mac Studio worth it for AI?

For local LLM inference on models above 35B parameters, the Mac Studio offers the best combination of memory capacity, bandwidth, and form factor available to consumers. The current M5 Max configuration with 128 GB unified memory runs 70B models at interactive speeds. The upcoming M5 Ultra variant is expected to push this to 256 GB at approximately 1,100 GB/s bandwidth, which would make it the first consumer desktop capable of running Llama 405B in a single memory pool.

RTX 5090 vs Mac Studio for AI: which is better?

Neither is universally better. The RTX 5090 is faster for smaller models that stay inside its 32 GB VRAM: 213 tokens per second on 8B models versus roughly 80-100 on Apple Silicon. The Mac Studio wins once models exceed the 32 GB VRAM ceiling, where the RTX 5090 is forced into layer offloading to system RAM and bandwidth collapses. One benchmark showed a Mac M4 Max at 28 tokens per second versus a PC with RTX 4090 at 10 tokens per second on a model that required spilling. Choose based on model size, not brand loyalty.

How much VRAM do you need for Llama 3?

Llama 3.1 8B needs roughly 6-8 GB at 4-bit quantisation, fitting comfortably on most modern GPUs. Llama 3.3 70B needs approximately 40 GB at 4-bit quantisation, exceeding the RTX 5090’s 32 GB VRAM. Llama 3.1 405B needs roughly 200 GB at 4-bit quantisation, far beyond any single consumer GPU. For the 70B and 405B variants, Apple Silicon with sufficient unified memory currently offers the most practical consumer path to full-speed inference.
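
These figures can be approximated from the parameter count alone. The Python sketch below computes only the raw weight size at a given bit width; the function name is illustrative. Real quantised files are typically somewhat larger than this raw figure because quantisation formats store scale factors and keep some tensors at higher precision, and the runtime needs working memory on top.

```python
def raw_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal GB: parameters x bits, divided by 8 bits per byte."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    return params_billions * bits_per_weight / 8


# Parameter counts for the models named above, at exactly 4 bits per weight.
for name, params in (("8B", 8.0), ("70B", 70.0), ("405B", 405.0)):
    size = raw_weight_size_gb(params, 4.0)
    print(f"{name}: ~{size:.1f} GB of raw 4-bit weights")
```

At exactly 4 bits the 70B model comes to 35 GB of raw weights, so the roughly 40 GB quoted here already includes format and runtime overhead.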

What does memory bandwidth mean for AI performance?

Memory bandwidth measures how fast your hardware can read data from memory, expressed in GB/s. During AI text generation, the full set of model weights is loaded from memory once per token. A 70B model at 4-bit quantisation reads roughly 40 GB per token. If your memory bandwidth is 80 GB/s (typical DDR5), you get roughly 2 tokens per second. At 614 GB/s (Apple M5 Max), you get roughly 15 tokens per second. Bandwidth is the primary determinant of generation speed for large models, more important than processor clock speed or core count.

Summary

RAM is the defining constraint for local AI in 2026, but bandwidth separates usable hardware from expensive paperweights. Token generation is bound by memory bandwidth, not compute: the traditional PC architecture of separate CPU and GPU memory pools creates a hard ceiling at 32 GB of usable high-bandwidth memory. Apple’s unified memory eliminates that ceiling. The M5 Max already runs 70-billion-parameter models at interactive speeds with 128 GB at 614 GB/s. The M5 Ultra, expected mid-2026, projects to roughly 1,100 GB/s with up to 256 GB, enough to load Llama 405B from a compact desktop drawing under 100 watts. For models under 35 billion parameters, NVIDIA’s RTX 5090 is faster and often cheaper. For multi-user serving, fine-tuning, and batch inference, CUDA has no Mac rival. Above the 32 GB VRAM line for single-user inference, Apple Silicon has no consumer equivalent.