REVIEW
8.6· Great

AMD Strix Halo Review 2026, Is the Ryzen AI Max+ 395 the AI Inference Answer?

The AI inference question used to be whether a home machine could run the model at all. Strix Halo changes the question to which quantisation you want.

Ryan Lipton
28 February 2026 · 15 min read

The AI inference question used to be whether a home machine could run the model at all. Strix Halo changes the question to which quantisation you want. AMD's Ryzen AI Max+ 395 is the architecture that made that shift possible: 16 Zen 5 cores, 40 RDNA 3.5 compute units forming the Radeon 8060S iGPU, and an XDNA 2 NPU rated at 50 TOPS, all built on TSMC's 4 nm-class process and fed by up to 128 GB of LPDDR5X-8000 through a 256-bit bus delivering 200 to 256 GB/s of memory bandwidth, the upper figure being the theoretical peak at that width and speed. That number is the point. A unified memory pool of that size, with that bandwidth, is what turns a Llama 70B Q8 weight file from an enterprise-server problem into a configuration decision. The platforms shipping around this silicon, from the Framework Desktop to the GMKtec EVO-X2 to the HP ZBook Ultra G1a, are the product of three years of AMD's AI-PC push. What follows is an account of how that bet paid off.

AMD Ryzen AI MAX die package — the Strix Halo silicon under the Ryzen AI Max+ 395

What Strix Halo Actually Is: The CPU-GPU-NPU Collapse

The Ryzen AI Max+ 395 is a single-package APU, which means the CPU, GPU, NPU, and memory interface all live in the same package, sharing the same pool of physical RAM. There is no discrete VRAM partition, no PCIe handoff between host memory and GPU memory. The Radeon 8060S, which forms the graphics tier here, draws from the same 128 GB that the Zen 5 cores use for system tasks. That unification is the architectural fact that the rest of this piece is built on.

The Zen 5 core array brings 16 cores and 32 threads with the microarchitectural refinements AMD introduced in 2024: improved branch prediction, wider execution windows, and better IPC over Zen 4 in the single-threaded workloads that dominate inference preprocessing. The RDNA 3.5 GPU at 40 compute units is not a gaming flagship, but it is capable enough for rasterised workloads at 1080p to 1440p and, critically, it supports ROCm 7.1 and Vulkan compute for inference acceleration. The XDNA 2 NPU at 50 TOPS handles Windows Studio Effects and light on-device AI tasks at power envelopes the GPU cannot justify; for heavy inference the GPU is the right lane.

The manufacturing constraint is worth naming: Strix Halo is built on a TSMC 4 nm-class process, the same node family AMD used for its Zen 4 parts, not the 3nm-class node Intel deployed in Lunar Lake. At 4 nm, AMD made the deliberate decision to scale memory rather than node. The 256-bit LPDDR5X bus is wider than anything Intel's current mobile generation carries, and the bandwidth that bus delivers is what makes the 128 GB pool actually usable for inference. More memory behind a narrower bus would be academic. Here the pipeline from model weights to compute is wide enough that the memory wall moves.
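The bandwidth a bus of this width can deliver is simple arithmetic, and it is worth seeing once. The sketch below assumes the standard peak-bandwidth formula (bytes per transfer multiplied by transfers per second); the function name is illustrative, and the 128-bit comparison is a hypothetical narrower bus at the same speed, not a measurement of any shipping part. Real sustained figures land below the theoretical peak.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8          # bus width in bytes
    transfers_per_second = transfer_rate_mts * 1e6   # MT/s -> transfers/s
    return bytes_per_transfer * transfers_per_second / 1e9


# Strix Halo as described above: 256-bit bus, LPDDR5X-8000.
wide = peak_bandwidth_gbs(256, 8000)
# Hypothetical comparison point: half the bus width at the same speed.
narrow = peak_bandwidth_gbs(128, 8000)

print(f"256-bit @ 8000 MT/s: {wide:.0f} GB/s peak")
print(f"128-bit @ 8000 MT/s: {narrow:.0f} GB/s peak")
```

Halving the bus halves the ceiling regardless of how much memory sits behind it, which is the sense in which capacity without width would be academic.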

The decision that matters to buyers is this: buying into Strix Halo buys into the memory architecture. Soldered LPDDR5X is not a compromise; it is what makes the 256-bit bus possible. Socketed DDR5 at narrower widths would lose the bandwidth that the whole inference story depends on.

The AI Inference Question, Settled

Llama 70B Q8 requires approximately 70 GB of model weight memory. On a system with 16 GB of discrete VRAM and 64 GB of system RAM, that model either fails to load entirely or runs across a PCIe bridge with the inference latency that implies. On a 128 GB Strix Halo system, Llama 70B Q8 loads cleanly into the unified pool with 58 GB remaining for system processes, a context window, and secondary models running in parallel.
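The headroom figure above can be reproduced with a back-of-envelope calculation. This is a minimal sketch that assumes weights cost exactly `bits_per_weight / 8` bytes per parameter; real quantised files carry extra scale metadata and run somewhat larger, so treat the output as a floor. The function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB, ignoring quantisation metadata."""
    return params_billion * bits_per_weight / 8


def headroom_gb(pool_gb: float, params_billion: float, bits_per_weight: float) -> float:
    """Memory left in the pool after weights load (negative means no fit)."""
    return pool_gb - weights_gb(params_billion, bits_per_weight)


# 70B parameters at 8 bits per weight in a 128 GB unified pool.
print(f"Q8 weights:  {weights_gb(70, 8):.0f} GB")
print(f"Headroom:    {headroom_gb(128, 70, 8):.0f} GB")
# The same model against 16 GB of discrete VRAM does not fit.
print(f"16 GB VRAM:  {headroom_gb(16, 70, 8):.0f} GB")
```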

On the GMKtec EVO-X2 at 128 GB, running llama-server with Vulkan0 (the Radeon 8060S iGPU) and `--ctx-size 51200 --flash-attn on --parallel 1 --no-warmup`, the practical throughput for Llama 70B Q4 reaches approximately 12 to 18 tokens per second. That is interactive inference: you can hold a conversation with it, ask it to draft code, run it through a multi-step reasoning chain, and receive responses at a pace that does not require patience in the way that GPU-starved inference does. Q8 reduces throughput to roughly half that figure, which remains usable for batch tasks and long-horizon reasoning where quality matters more than speed.
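For anyone scripting the launch, the invocation above can be assembled programmatically. This is a sketch rather than the exact command used for testing: the model path is a hypothetical placeholder, and passing the device via `--device` and the model via `--model` is an assumption about how the Vulkan0 selection was made.

```python
import shlex


def llama_server_command(model_path: str, device: str = "Vulkan0",
                         ctx_size: int = 51200) -> list[str]:
    """Build a llama-server argument list matching the flags quoted above."""
    return [
        "llama-server",
        "--model", model_path,       # hypothetical path, supply your own GGUF
        "--device", device,          # assumption: iGPU selected via --device
        "--ctx-size", str(ctx_size),
        "--flash-attn", "on",
        "--parallel", "1",
        "--no-warmup",
    ]


cmd = llama_server_command("models/llama-70b-q4.gguf")
print(shlex.join(cmd))  # shell-safe string for logging or copy-paste
```

Keeping the arguments as a list means the same value can go straight into `subprocess.run` without shell quoting issues.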

Gemma 4 Q8_K_XL, at approximately 33.9 GB on disk and 18.5 GB working set when loaded, leaves comfortable headroom even at the 64 GB SKU tier. Qwen 3 Q8_K_XL loads without negotiation at 128 GB. DeepSeek-V3 is the exception: at 671B parameters, its weights exceed the 128 GB pool even at Q4. The context window is the next constraint: at 51,200 tokens, the Strix Halo system handles document-length reasoning and multi-turn inference chains without the mid-session truncation that constrains smaller memory pools.
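How much memory a 51,200-token context actually claims can be estimated from the model's attention layout. The sketch below assumes the published Llama 70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and an unquantised 16-bit cache; those architecture numbers are not from this review, and a quantised KV cache would come in smaller.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Size of the key/value cache for a given context length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * n_tokens


# Assumed Llama 70B attention layout (not stated in the review).
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

size = kv_cache_bytes(51_200, **LLAMA_70B)
print(f"KV cache at 51,200 tokens: {size / 1e9:.1f} GB")
```

Under those assumptions the cache lands in the mid-teens of gigabytes, which a large unified pool absorbs alongside the weights and a 16 GB discrete card cannot.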

One configuration note belongs here: `--mlock` is harmful on unified memory architectures because the memory controller handles pinning natively. `--n-gpu-layers 0` disables Vulkan entirely and routes inference back to the CPU. Neither flag should appear in a Strix Halo llama-server invocation. The right posture is to set the device to the iGPU compute lane and trust the unified memory controller.
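A launch script can guard against both flags before the server starts. This is a minimal sketch of such a check, following the advice above; the function name is hypothetical, it also accepts `-ngl` as the short alias for `--n-gpu-layers`, and it only handles the space-separated form of the option.

```python
def find_discouraged_flags(argv: list[str]) -> list[str]:
    """Return warnings for flags the review advises against on Strix Halo."""
    problems = []
    for i, arg in enumerate(argv):
        if arg == "--mlock":
            problems.append("--mlock: not recommended on unified memory")
        elif arg in ("--n-gpu-layers", "-ngl"):
            # The value follows as the next argument, if present.
            value = argv[i + 1] if i + 1 < len(argv) else None
            if value == "0":
                problems.append(f"{arg} 0: keeps every layer on the CPU")
    return problems


bad = ["llama-server", "--mlock", "--n-gpu-layers", "0"]
for warning in find_discouraged_flags(bad):
    print("warning:", warning)
```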

Framework Desktop and GMKtec EVO-X2 both reach these figures with comparable BIOS maturity as of mid-2025. HP ZBook Ultra G1a trades some thermal headroom for workstation certification; sustained throughput under long inference runs is marginally lower due to the thermal envelope the chassis enforces. ROG Flow Z13 2025 in performance mode matches the desktop class on shorter runs but throttles earlier under sustained load.

The Framework Desktop

Framework launched its Desktop in May 2025 with the Ryzen AI Max+ 395 as the anchor processor, priced from $1,099 to $2,099 depending on memory and storage configuration. It is the first Strix Halo platform from a company whose entire identity is built on repairability and upgradability, which creates a tension worth naming directly: the memory is soldered.

The soldered LPDDR5X is not a Framework oversight. It is what the 256-bit bus requires. A socketed DIMM arrangement at the bandwidth this platform delivers does not exist at current manufacturing tolerances in this thermal envelope. Framework made the right call for the inference use case even if it disappoints the upgrade narrative. The expansion story lives elsewhere: Framework's modular bay ecosystem accommodates storage, connectivity, and power supply swaps, and the PCIe slot allows discrete GPU addition for graphics-bound workloads that the 8060S cannot address at 4K.

The chassis at this scale is compact relative to traditional tower workstations, comparable in volume to a NUC Extreme. Thermals are well-managed in the Framework Desktop's open airflow design; the Strix Halo APU runs cooler here than in the mobile form factors, which means the processor can sustain its full memory bandwidth for longer inference runs without the thermal throttling that narrows throughput in the HP ZBook Ultra's chassis under sustained load.

For an AI-primary workstation at the $1,099 entry tier with the 32 GB SKU, the Framework Desktop is the entry point for developers who want the inference capability and intend to stay at 32 GB. For inference at 70B class models, the 128 GB SKU at $2,099 is the configuration where the platform's argument becomes complete. The 64 GB middle tier covers most 34B class models at Q8 and everything below without negotiation.

The GMKtec EVO-X2

The GMKtec EVO-X2 is the reference system for this piece and represents a class of device that did not exist before Strix Halo made it viable: a mini-PC the size of a large paperback that runs local inference at 70B class without apology. Starting at approximately $1,499 for the 32 GB SKU with 128 GB configurations available, the EVO-X2 ships in a chassis smaller than a Framework Desktop at roughly 40 percent of the volume.

BIOS maturity through 2025 has been GMKtec's operational achievement here. Early Strix Halo mini-PC deployments from several manufacturers shipped with BIOS configurations that did not correctly expose the full 256-bit memory bus to the operating system, producing bandwidth figures well below the architecture's ceiling. GMKtec's cadence of BIOS updates through the second half of 2025 resolved the most significant of those issues; the EVO-X2 at 128 GB reaches the bandwidth figures the silicon is rated for under llama-server workloads.

The tradeoff with the EVO-X2 is the form factor. Passive and semi-passive cooling in a chassis this small means the APU runs closer to its thermal ceiling than the Framework Desktop does. For inference workloads that run in bursts, this does not surface as a problem. For sustained runs of hours, the thermal management matters: the EVO-X2 will throttle meaningfully before the Framework Desktop does. The inference architecture is identical; the chassis is what shapes the sustained throughput ceiling.

At $1,499 for the entry configuration, the EVO-X2 is priced within reach of developers who want the inference capability in a form factor that fits on a desk corner or in a travel bag. What the platform does for those buyers is eliminate the dedicated inference server as a prerequisite. The entire AI development loop runs on the same machine that handles everything else.

The Wider Strix Halo Lineup

The HP ZBook Ultra G1a brings Strix Halo into the professional mobile workstation category with ISV certification and the chassis engineering HP applies to thermal management under sustained workloads. It is the option for engineers who need a Strix Halo system that is certified for DaVinci Resolve or SolidWorks and also handles LLM inference from the same memory pool. The thermal envelope is tighter than the desktop class; sustained inference throughput is lower, but the baseline inference capability at 128 GB is the same architecture.

The Asus ROG Flow Z13 2025 is a different argument: a gaming tablet with Strix Halo silicon, priced at the premium end of the lineup and aimed at users who want a portable that covers gaming, creative work, and inference in a single device. In performance mode the Flow Z13 matches the desktop class in short bursts, but its thermal envelope is the tightest of the four systems reviewed here. It is not the right chassis for a developer who wants to run eight-hour inference batches; it is the right chassis for someone who wants to run Llama 70B Q4 for an hour on a flight and then use the same machine to play games in the hotel.

What each chassis does to the thermal envelope matters because it is the primary differentiator within the Strix Halo class. The silicon is the same. The memory bandwidth is the same. The question is how long any given chassis can sustain the workload before the thermal limit shapes the throughput, and that question has a different answer at every form factor.

Strix Halo vs Apple M4 Max

The comparison that matters at the 128 GB tier is the Apple M4 Max Mac Studio. M4 Max at its top configuration delivers 546 GB/s of memory bandwidth against Strix Halo's 200 to 256 GB/s, and the bandwidth difference shows in raw token throughput: M4 Max at 128 GB is faster at Llama 70B inference under equivalent quantisation. That is a genuine advantage.
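The size of that advantage can be roughed out from the bandwidth figures alone. The sketch below assumes token generation on a dense model is memory-bandwidth-bound, so relative speed tracks relative bandwidth; that is a first-order approximation rather than a benchmark. It uses 200 GB/s as the low end for Strix Halo and 256 GB/s as the theoretical peak of a 256-bit LPDDR5X-8000 interface.

```python
def relative_decode_speed(bandwidth_a_gbs: float, bandwidth_b_gbs: float) -> float:
    """First-order speed ratio of A over B, assuming bandwidth-bound decoding."""
    return bandwidth_a_gbs / bandwidth_b_gbs


M4_MAX_GBS = 546  # top configuration, as quoted above

for strix_gbs in (200, 256):
    ratio = relative_decode_speed(M4_MAX_GBS, strix_gbs)
    print(f"M4 Max vs Strix Halo at {strix_gbs} GB/s: about {ratio:.1f}x")
```

Compute limits, software stack differences, and prompt processing all move the real figure, so this bounds the expectation rather than predicting a benchmark.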

The software stack difference is what shapes the platform decision for most developers. Apple's MLX framework is mature, deeply integrated with macOS, and the path of least resistance for developers already in the Apple ecosystem. AMD's ROCm 7.1 on Linux is a serious inference platform; ROCm on Windows trails CUDA maturity and has historically required more configuration work to reach stable inference, though this position has improved through 2025. The HIP SDK 7.1 and llama.cpp's ROCm backend make the Windows path functional for the developer who is willing to configure it. The Mac remains simpler.

The price-per-GB argument is where Strix Halo is meaningfully competitive: Strix Halo mini-PCs start at approximately $1,499, and the 128 GB Framework Desktop is $2,099. A 128 GB M4 Max Mac Studio is priced significantly higher. For developers for whom open ecosystem, Windows or Linux operation, and price-per-capability at the 128 GB tier are the deciding variables, Strix Halo delivers that capability at a price Apple does not match.

Gaming, Just to Settle the Question

The Radeon 8060S at 40 RDNA 3.5 compute units performs at approximately the level of a discrete RX 7700 XT in rasterised workloads, adjusted for the memory bandwidth the unified pool provides rather than the dedicated GDDR6 a discrete card carries. At 1080p and 1440p medium to high settings, the iGPU covers most modern titles without needing to qualify "for an iGPU." At 4K the performance ceiling is real: 30 to 60 frames per second in current AAA at medium-high is achievable, and the memory pool means games that stage large asset sets do not hit the VRAM wall discrete budget GPUs encounter.

For a Strix Halo system purchased primarily for inference, gaming capability matters because it means the system does not require a secondary machine for leisure. The ROG Flow Z13 2025 is the form factor where this matters most, and it delivers. The GMKtec EVO-X2 and Framework Desktop both handle 1440p gaming well; the question is whether the user wants to sit at a desk for gaming on a mini-PC or workstation.

Where Strix Halo Falls Short

The soldered memory is the fact buyers need to accept before committing. At 32 GB, a Strix Halo system can run 34B class models at Q4 comfortably, but 70B class models are outside comfortable reach. At 64 GB the picture improves meaningfully: 70B class models fit at Q4 with context window constraints. At 128 GB the inference question is settled. There is no upgrade path between those tiers: the configuration chosen at purchase is the configuration for the life of the machine. Buyers who expect to need 128 GB in two years should not buy a 32 GB SKU today.
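Because the tier is fixed at purchase, it is worth checking the target model against each SKU before buying. The sketch below picks the smallest tier that holds a given weight file plus a reserve for the operating system and context; the 8 GB reserve is a hypothetical placeholder, not a measured requirement, and the weight sizes use the same bits-per-weight approximation as a rough guide.

```python
TIERS_GB = (32, 64, 128)  # the three memory configurations discussed above


def smallest_tier(weights_gb: float, reserve_gb: float = 8.0):
    """Smallest tier that fits the weights plus a reserve, or None if none do."""
    for tier in TIERS_GB:
        if weights_gb + reserve_gb <= tier:
            return tier
    return None


# Approximate weight sizes: parameters (billions) * bits per weight / 8.
models = {"34B Q4": 17.0, "34B Q8": 34.0, "70B Q4": 35.0, "70B Q8": 70.0}
for name, size in models.items():
    print(f"{name}: {smallest_tier(size)} GB tier")
```

A longer context window raises the reserve, which is why a model that nominally fits a tier can still be uncomfortable there.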

Discrete GPU workloads remain faster on a dedicated GPU. The 8060S is a capable iGPU; it is not a competition for a discrete card at the 4K rasterisation ceiling or for CUDA-dependent workloads that depend on NVIDIA's ecosystem specifically. Developers whose pipeline is built around CUDA and who need both CUDA and 128 GB of inference memory are in a configuration that does not yet exist in this form factor.

ROCm on Windows in 2025 is functional and improved, but it requires more configuration than CUDA on Windows or MLX on macOS. Developers who want to arrive at a working inference configuration in an afternoon will find the Mac path shorter. The Strix Halo path is rewarding but requires comfort with the toolchain.

The Lineage Call: AMD's Three-Year AI-PC Push

Strix Point, which shipped in mid-2024 as the Ryzen AI 300 series in thin-and-light laptops, was AMD's proof-of-concept: XDNA 2, RDNA 3.5, Zen 5, in a thermal envelope designed for 15W to 28W sustained operation. It demonstrated the integration but capped memory at 32 GB, which placed it outside the 70B inference conversation.

Strix Halo is the platform decision that answered the capability question directly. By scaling the memory bus to 256-bit and the pool to 128 GB, AMD produced the silicon that moves inference from a server-class requirement to a desk-class one. Strix Halo availability opened in early 2025 with laptop integrations in HP's ZBook line and Asus's ROG ecosystem; the mini-PC wave followed in summer 2025 with the GMKtec EVO-X2 and the Framework Desktop.

The 2026 refresh, which AMD has positioned as an iterative improvement on the Strix Halo base, is likely to arrive with node improvements and further XDNA NPU capacity, but the fundamental architecture of unified memory at 128 GB and a 256-bit bus is established now. The bet that paid off is not the individual SKU; it is the three-year commitment to this memory architecture at a time when NVIDIA's discrete GPU dominance in inference looked unassailable. AMD chose to target local inference rather than the centre of the training conversation, and that decision is what buyers are getting in 2025 and 2026.

Final Word

The Ryzen AI Max+ 395 is the first silicon in the mini-PC category where the inference question resolves cleanly. Not "can this run the model with caveats" but "which quantisation level do you want and how wide a context window." That resolution is what the 128 GB unified memory architecture with a 256-bit bus buys, and it is the argument the Framework Desktop, the GMKtec EVO-X2, the HP ZBook Ultra G1a, and the ROG Flow Z13 2025 are all making from different chassis shapes.

The audience this platform speaks to directly is the developer or researcher who wants 70B class inference running locally, a gaming machine on the same hardware, and an open ecosystem that does not require Apple's stack. For that audience, Strix Halo is the result of a platform decision that took three years to pay off. The software stack on Windows still requires configuration comfort; the soldered memory means the tier decision is final. Those constraints are real. Within them, this is the class of machine that changes what is possible at a desk.

FAQ

Can a Strix Halo mini PC run Llama 70B?

Yes. At the 128 GB configuration, Llama 70B Q8 loads completely into the unified memory pool and leaves approximately 58 GB available for system processes and context. At Q4 under Vulkan compute, the Radeon 8060S iGPU delivers interactive-speed throughput of approximately 12 to 18 tokens per second; Q8 runs at roughly half that. The 64 GB tier supports Llama 70B at Q4 with tighter context window budgets. The 32 GB tier places 70B outside comfortable reach and is better matched to 34B class models.

What is the difference between Strix Halo and Strix Point?

Strix Point, which shipped as the Ryzen AI 300 series in mid-2024, is the mobile predecessor: the same Zen 5, RDNA 3.5, and XDNA 2 integration but limited to a 128-bit memory bus and a maximum of 32 GB LPDDR5X. That memory ceiling places Strix Point outside the 70B inference conversation. Strix Halo scales the memory bus to 256-bit and the pool to 128 GB, which is the architectural change that moves inference capability from demonstration to deployment. The silicon lineage is shared; the platform decision is different.

Framework Desktop or GMKtec EVO-X2 for AI?

Both reach the same inference capability at equivalent memory configurations because the silicon and the unified memory architecture are identical. The Framework Desktop offers repairability and open-chassis expansion via its modular bay system and runs cooler under sustained long inference jobs, which matters for workloads that run for hours. The EVO-X2 is considerably more compact. If sustained throughput over long runs is the priority, the Framework Desktop's thermal headroom is the differentiator. If form factor and portability matter more than sustained ceiling, the EVO-X2 is the right chassis.

Strix Halo vs Mac Studio M4 Max for AI?

M4 Max at 128 GB delivers higher memory bandwidth (up to 546 GB/s against Strix Halo's 200 to 256 GB/s), which translates to faster raw inference throughput at equivalent quantisation, and Apple's MLX framework is the simpler path for developers already on macOS. Strix Halo at 128 GB arrives at a meaningfully lower price point, runs an open ecosystem on Linux or Windows, and supports the full ROCm and Vulkan inference stack. For price-per-GB at 128 GB, AMD is the better value. For bandwidth ceiling and software simplicity, M4 Max holds the advantage.

Does Strix Halo support ROCm on Windows?

Yes. ROCm 7.1 with the HIP SDK supports the Radeon 8060S iGPU on Windows, and llama.cpp's ROCm backend delivers full GPU-accelerated inference on the platform. The ROCm on Windows path requires more configuration than CUDA on Windows or MLX on macOS: developers should expect to spend time on driver validation, llama.cpp build flags, and device enumeration (Vulkan0 enumerates the 8060S iGPU on this hardware). The Linux ROCm path is more mature. For developers comfortable with the toolchain, the Windows path is fully functional as of the HIP SDK 7.1 and llama.cpp HEAD builds from mid-2025.

Support SpawningPoint
Please note that some links in this article are affiliate links. If you found the coverage helpful and decide to pick up the game, or anything else for your collection, through one of those links, we may earn a commission at no extra cost to you. We use this approach instead of filling SpawningPoint with intrusive display ads, and rely on this support to keep the site online and fund future reviews, guides, comparisons and other in-depth gaming coverage. Thank you for supporting the site.