AI Workstation & GPU Performance Report (2026)

A comparative analysis of pre-built and DIY systems for local AI development.

Featured Recommendations by OS & Budget

macOS
Top Performance
Mac Studio M3 Ultra — 512 GB
~$11,000
405B Q4 capable · ~135 t/s on 7B · max context ~800K tokens (7B Q4)
High End
Mac Studio M3 Ultra — 128 GB
~$8,500
120B Q4 capable · ~130 t/s on 7B · max context ~400K tokens (7B Q4)
Mid Range
MacBook Pro M4 Max — 128 GB
~$4,200
70B Q4 capable · ~75 t/s on 7B · max context ~200K tokens (7B Q4)
Budget
Mac mini M4 Pro — 24 GB
~$1,400
13B Q4 capable · ~50 t/s on 7B · max context ~14K tokens (7B Q4)
Windows
Top Performance
HP Z8 G4 — Quad NVIDIA A100 40 GB
~$30,000
70B unquantized · ~540 t/s on 7B · max context ~128K tokens (70B FP16)
High End
Lenovo ThinkStation P620 — 2× RTX A6000
~$15,000
70B Q4 capable (96 GB VRAM) · ~300 t/s on 7B · max context ~72K tokens (70B Q4)
Mid Range
HP OMEN 45L — NVIDIA RTX 5090
~$4,500
26B Q4 in VRAM · ~250 t/s on 7B · max context ~24K tokens (7B Q4)
Budget
DIY Build — RTX 4070 Ti Super 16 GB
~$1,200
13B Q4 capable · ~90 t/s on 7B · max context ~9K tokens (7B Q4)
Linux
Top Performance
Dell Precision 7960 — Quad RTX 5000 Ada
~$20,000
120B+ Q4 capable · ~450 t/s on 7B · max context ~200K tokens (70B Q4)
High End
DIY — NVIDIA A100 80 GB PCIe
~$6,000
70B Q4 (80 GB HBM2e) · ~220 t/s on 7B · max context ~80K tokens (70B Q4)
Mid Range
DIY — RTX 5090 + Ryzen 9 9950X
~$3,500
26B Q4 in VRAM · ~250 t/s on 7B · max context ~24K tokens (7B Q4)
Budget
DIY — AMD RX 7900 XTX 24 GB
~$1,400
20B Q4 via ROCm/Vulkan · ~135 t/s on 7B · max context ~14K tokens (7B Q4)

All systems are evaluated for AI/ML workloads. TG = Tokens/sec Generation (batch size 1).

Rank Model Type GPU VRAM Best Model (Size) Tokens/sec (TG) Price Notes & Evidence
1 HP Z8 G4 (Custom) Desktop Quad NVIDIA A100 40GB 160 GB 70B+ (Unquantized) ~540 $30,000
  • Data Center Power: Can run unquantized 70B models entirely in VRAM.
  • NVLink Advantage: Full peer-to-peer communication between GPUs is critical for massive model inference.
  • Secondary Market: Often available on the used market, making it a "budget" entry into high-end AI.
  • Power/Cooling: Requires significant power and cooling infrastructure not suitable for a typical office.
2 Dell Precision 7960 Tower Desktop Quad NVIDIA RTX 5000 Ada 128 GB 120B+ (Quantized) ~450 $20,000
  • Enterprise Grade: Built for high-concurrency professional AI, with certified drivers and support.
  • RTX 6000 Ada Recommended: For max performance, users often upgrade to RTX 6000 Ada GPUs for more VRAM and CUDA cores.
  • Multi-GPU pooling: Ada-generation RTX cards do not support NVLink, so the four GPUs are pooled over PCIe with tensor parallelism (e.g. vLLM).
3 Apple Mac Studio (M3 Ultra) Desktop Apple 80-core GPU 128 GB (Unified) 120B+ (Quantized) ~97 $8,000
  • Memory Bandwidth King: Unmatched unified memory is ideal for huge model inference.
  • Underwhelming TG?: Some users report tokens/sec (TG) is lower than expected vs. high-end NVIDIA cards (e.g., 97 t/s vs 136 t/s on an RTX 3090 for some models).
  • Silent & Efficient: Consumes far less power and is virtually silent compared to NVIDIA workstations.
  • MLX Framework: Apple's MLX framework can be 20-30% faster than other libraries like Ollama on Apple Silicon.
4 HP OMEN 45L (2025) Desktop NVIDIA RTX 5090 32 GB 70B+ (Quantized) ~140 $4,500
  • New Flagship: The RTX 5090 offers ~1,792 GB/s bandwidth, a significant leap for memory-bound LLM tasks.
  • MoE Performance: Excels at Mixture-of-Experts models, with benchmarks showing 100-140 t/s generation.
  • "Space Heater": Early reports suggest very high power draw and thermal output, requiring excellent case airflow.
5 Alienware Aurora R16 Desktop NVIDIA RTX 4090 24 GB 30B-70B ~95 $4,000
  • Community Favorite: Widely regarded as the best value for running 7B to 30B models locally.
  • 30% Slower than 5090: A significant performance gap, but at a much lower price point.
  • Stable & Supported: Mature drivers and extensive community support for AI/ML workloads.
6 Lenovo ThinkStation P620 Desktop Dual NVIDIA RTX A6000 96 GB 120B+ (Quantized) ~300 $15,000
  • CPU Powerhouse: The Threadripper Pro 5995WX CPU provides massive PCIe bandwidth for multiple GPUs.
  • PSU/Fitment Issues: Real-world user threads mention PSU limitations and tight physical space when fitting two high-end GPUs.
  • Dual A6000 Recommended: Community advice points to using two RTX A6000s as a potent, albeit expensive, combination.
7 Asus ROG Strix Scar 18 Laptop NVIDIA RTX 5090 (Laptop) 24 GB 30B-70B ~80 $4,500
  • Desktop-Class Laptop GPU: Benchmarks from similar MSI models show impressive performance but with caveats.
  • Thermal Throttling: The main challenge is keeping the GPU cool enough to sustain peak performance.
  • VRAM is 24GB: Note: Leaked specs showing 32GB were incorrect for the laptop version.
8 Apple MacBook Pro 16 (M4 Max) Laptop Apple 40-core GPU 128 GB (Unified) 30B-70B ~75 $4,200
  • Popular with ML Engineers: Excellent for development and testing on the go due to its large unified memory.
  • "Not for Heavy Training": Community consensus is that it's an inference machine, not for training large models from scratch.
  • Performance: Can achieve 70-80 t/s on 7B models (Q4 quantized) via MLX; 70B Q4 fits at ~6 t/s.
9 Asus ROG Flow Z13 (2025) Laptop AMD Ryzen AI Max+ 395 (iGPU) 128 GB (Unified) 7B-30B ~90 $2,100
  • The "Unicorn" APU: The Ryzen AI Max+ 395 supports up to 128 GB unified memory — the cheapest x86 path to 70B.
  • ROCm vs. Vulkan: Performance depends heavily on the backend used; Vulkan in llama.cpp often outperforms ROCm for pure inference.
  • BIOS Gotcha: Users report needing to manually allocate 96–128 GB of system RAM to the iGPU in BIOS for best performance.
10 ASUS Zenbook A14 (2025) Laptop Qualcomm Snapdragon X (Hexagon NPU) 32 GB (Shared) 7B-30B ~6 $1,800
  • On-Device AI Focus: Designed for efficient, low-power AI tasks.
  • NPU Not Yet Supported: The powerful NPU is not currently usable by most open-source LLM tools (e.g., llama.cpp).
  • Community Verdict: Currently not recommended for serious local LLM work until software support matures. Performance is low (~6 t/s on 32B models).
11 Apple MacBook Air 15" (M4) Laptop Apple 10-core GPU (M4) 32 GB (Unified) 7B-20B ~38 $1,299
  • Best value macOS laptop: 32 GB unified memory at $1,299 is the cheapest Apple Silicon path to running 20B Q4 models.
  • Bandwidth limitation: LPDDR5X at ~120 GB/s is roughly 3x slower than the M4 Max, so tokens/sec on 7B is roughly half that of a MacBook Pro.
  • Fanless: The M4 Air is fanless, so sustained AI inference will throttle after several minutes; burst speed is good, sustained speed drops.
  • Best for: Developers who need portability and occasionally run 7-13B models. Not suited for production or long inference sessions.
12 Apple MacBook Air 13" (M3) Laptop Apple 10-core GPU (M3) 24 GB (Unified) 7B-13B ~30 $1,099
  • Cheapest modern macOS option: 24 GB unified memory runs 7B and 13B Q4 models reliably at modest speed.
  • Fanless throttle warning: Sustained inference causes thermal throttling within a few minutes on the base config.
  • Good upgrade path: Maxing RAM to 24 GB at purchase is highly recommended; memory cannot be upgraded later.
  • Community verdict: Solid for trying out local LLMs casually, but the M4 Air is worth the extra $200 for higher bandwidth.
13 Apple Mac mini (M4 Pro) Desktop Apple 20-core GPU (M4 Pro) 48 GB (Unified) 7B-40B ~65 $1,999
  • Best compact macOS desktop: 48 GB unified memory at $1,999 comfortably runs 30B-40B Q4 models.
  • M4 Pro bandwidth: ~273 GB/s is significantly higher than the base M4, giving much better sustained inference speed.
  • Small footprint: Fits on a crowded desk; runs silently under typical LLM loads.
  • Recommended config: 48 GB version over 24 GB; the extra $600 opens up a much wider range of models.
14 Apple iMac 24" (M4) Desktop Apple 10-core GPU (M4) 32 GB (Unified) 7B-20B ~40 $1,699
  • All-in-one convenience: The iMac integrates display, speakers, and webcam; no separate monitor needed.
  • Bandwidth same as MacBook Air: M4 base chip has ~120 GB/s of LPDDR5X bandwidth, limiting token speed vs M4 Pro/Max.
  • 32 GB ceiling: 32 GB is the maximum RAM option. Limits models to 20B Q4 or smaller.
  • Best for: General-purpose users who want macOS + LLM capability without building a multi-device setup.
15 Apple MacBook Pro 14" (M4 Pro) Laptop Apple 20-core GPU (M4 Pro) 48 GB (Unified) 7B-40B ~58 $2,599
  • Best portable value for 30B: 48 GB and M4 Pro bandwidth make it the cheapest Apple laptop that comfortably runs 30B Q4.
  • Active cooling: Unlike the MacBook Air, the Pro has a fan, so sustained inference does not throttle.
  • vs 16" M4 Max: The 14" M4 Pro at $2,599 is roughly half the cost and 25% slower per token; a reasonable tradeoff for portability.
  • MLX compatible: Full support for Apple MLX framework for faster quantized inference.
16 Minisforum AI Max+ 395 (128 GB) Desktop AMD Ryzen AI Max+ 395 (RDNA 3.5 iGPU) 128 GB (Unified) 7B-70B ~90 $1,600
  • Cheapest x86 path to 70B: 128 GB unified at $1,600 is unmatched by any discrete GPU system anywhere near this price.
  • BIOS setup required: Must manually allocate 96-128 GB to the iGPU in BIOS; factory default is much lower.
  • llama.cpp Vulkan: Use the Vulkan backend for best inference speed. ROCm support exists but adds setup complexity.
  • Bandwidth limitation: 256 GB/s LPDDR5X is 3x slower than M3 Ultra; 70B runs at ~6 t/s vs ~18 t/s on Apple Silicon.
17 Razer Blade 16 (RTX 5090) Laptop NVIDIA RTX 5090 (Laptop) 24 GB 7B-20B ~120 $4,499
  • Fastest Windows LLM laptop: The RTX 5090 Laptop GPU (roughly 150-175 W TGP) uses GDDR7 and is the highest-bandwidth laptop GPU available.
  • 24 GB VRAM limit: Still capped at 24 GB despite the new generation; limits to ~20B Q4 fully in VRAM.
  • Battery life: Expect under 2 hours of battery during inference at full GPU load. Best used plugged in.
  • Thin chassis: Razer's cooling is above average for a thin laptop but still thermally constrained vs the ROG Scar 18.
18 Acer Predator Helios 18 (RTX 4090) Laptop NVIDIA RTX 4090 (Laptop) 16 GB 7B-13B ~75 $2,799
  • Large cooling advantage: The 18" chassis allows higher sustained TGP than thinner 16" laptops, giving more consistent performance.
  • Only 16 GB VRAM: Despite the flagship GPU brand, laptop RTX 4090s ship with 16 GB, not 24 GB as in the desktop version.
  • Value vs desktop: A DIY RTX 4090 desktop at the same price delivers 24 GB VRAM and significantly more sustained performance.
  • Good for travel: If you need a powerful portable Windows machine for on-site or travel inference work, this is a strong option.
19 Lenovo Legion Pro 7i Gen 9 (RTX 4090) Laptop NVIDIA RTX 4090 (Laptop) 16 GB 7B-13B ~72 $2,799
  • Popular with developers: Strong CUDA ecosystem, good display, and widely supported driver stack make this a practical choice.
  • 16 GB VRAM ceiling: Laptop RTX 4090 is capped at 16 GB. 13B Q4 fits; 20B Q4 requires partial CPU offloading (slower).
  • MUX switch: Has a MUX switch for direct GPU display output, reducing latency and improving GPU utilisation during inference.
  • Better value at $2,200: Lenovo frequently discounts this model by $500-600; watch for sales before buying.
20 NZXT Player:Two Max (RTX 4090) Desktop NVIDIA RTX 4090 24 GB 7B-20B ~140 $2,800
  • Pre-built value pick: Full-fat RTX 4090 in a proper ATX tower at $2,800; cheaper than Alienware Aurora R16 with similar specs.
  • Standard form factor: Uses a proper ATX mid-tower with a standard PSU, so the GPU can be swapped to an RTX 5090 later.
  • RAM limited: Ships with 32 GB DDR5; upgrade to 64 GB recommended for large CPU offload buffers when running 70B.
  • No bloatware: NZXT ships a cleaner Windows install than most OEM gaming brands.
21 CyberPowerPC Gamer Supreme (RTX 5080) Desktop NVIDIA RTX 5080 16 GB 7B-13B ~170 $2,400
  • Strong 7B speed at this price: RTX 5080 GDDR7 bandwidth (~960 GB/s) delivers ~170 t/s on 7B Q4, competitive with older 4090 systems in the same workload.
  • 16 GB VRAM limit: Only 16 GB VRAM limits it to 13B Q4 fully in VRAM. 20B and larger require CPU offloading.
  • Good upgrade path: Standard ATX build means the GPU can be replaced with an RTX 5090 or future card without replacing the whole system.
  • Community note: CyberPower and iBUYPOWER quality has improved significantly; capacitor and PSU quality are now comparable to most branded systems.
22 HP Victus 16 (RTX 4060) Laptop NVIDIA RTX 4060 (Laptop) 8 GB 7B only ~45 $849
  • Entry-level LLM laptop: The cheapest new Windows laptop that can run a 7B Q4 model reasonably in VRAM (8 GB is the absolute minimum).
  • 8 GB is the hard floor: Only 7B Q4 fits in VRAM. Anything larger requires CPU RAM offloading, dropping speed to 3-5 t/s.
  • Great for learning: If you are new to local LLMs and want to experiment without a large budget, this is a practical starting point.
  • Upgrade ceiling is low: The 8 GB VRAM cannot be upgraded; if you plan to scale up within a year, budgeting for 16 GB VRAM (RTX 4060 Ti or better) is worth it.

DIY Build Guides by Model Target & Budget

Builds are grouped by the model size you want to run and by budget range. Prices are estimates as of early 2026; GPU prices shift frequently.

The 7B Daily Driver
Best 7-13B performance under $1,000
GPU: NVIDIA RTX 4060 Ti 16 GB
CPU: Intel Core i5-13400 (10-core)
RAM: 32 GB DDR5-5200
Storage: 1 TB NVMe Gen4
PSU: 650W 80+ Gold
Est. Cost: ~$850
Max model (VRAM): 13B Q4 in VRAM
7B TG speed: ~75 t/s
Save ~29% vs HP Victus 16 RTX 4060 Ti (~$1,200 pre-built)
Budget

16 GB GDDR6 - best value VRAM for price. 7B and 13B Q4 models run fully in VRAM. Use ExLlamaV2 for maximum speed on CUDA.

The AMD Value Build
Best ROCm/Vulkan inference under $1,100
GPU: AMD RX 7900 GRE 16 GB
CPU: AMD Ryzen 5 7600 (6-core)
RAM: 32 GB DDR5-6000
Storage: 1 TB NVMe Gen4
PSU: 650W 80+ Gold
Est. Cost: ~$950
Max model (VRAM): 13B Q4 in VRAM
7B TG speed: ~80 t/s (Vulkan)
Save ~32% vs HP Victus 16 RX 7800 XT (~$1,400 pre-built)
Budget

RX 7900 GRE has 576 GB/s bandwidth and 16 GB VRAM. Use llama.cpp Vulkan backend for best inference speed. ROCm optional for PyTorch.

The 20B Budget Build
Run 20B Q4 for under $1,200
GPU: NVIDIA RTX 3090 24 GB (used)
CPU: AMD Ryzen 7 5700X (8-core)
RAM: 64 GB DDR4-3600
Storage: 2 TB NVMe Gen3
PSU: 850W 80+ Gold
Est. Cost: ~$1,100
Max model (VRAM): 20B Q4 in VRAM
7B TG speed: ~130 t/s
Save ~39% vs pre-built RTX 3090 builds (~$1,800 pre-built)
Budget

Used RTX 3090 (24 GB / 936 GB/s) performs near-identically to an RTX 4090 for inference. Best value path to 20B Q4. Use a single GPU only - no SLI.

The RTX 5090 Flagship Build
Maximum consumer GPU inference speed
GPU: NVIDIA RTX 5090 32 GB
CPU: Intel Core i9-14900K (24-core)
RAM: 128 GB DDR5-6400
Storage: 4 TB NVMe Gen4
PSU: 1200W 80+ Platinum
Est. Cost: ~$3,500
Max model (VRAM): 26B Q4 in VRAM
7B TG speed: ~250 t/s
Save ~22% vs HP OMEN 45L RTX 5090 (~$4,500 pre-built)
Mid-range

1,792 GB/s GDDR7 bandwidth - fastest consumer card. 32 GB VRAM limits models to ~26B Q4. Ensure 360mm AIO or large tower for cooling.

The Creator Mid-Range
RTX 4090 all-rounder for 20B+ workloads
GPU: NVIDIA RTX 4090 24 GB
CPU: AMD Ryzen 9 7950X (16-core)
RAM: 128 GB DDR5-6000
Storage: 4 TB NVMe Gen4
PSU: 1000W 80+ Gold
Est. Cost: ~$3,000
Max model (VRAM): 20B Q4 in VRAM; 70B offloads to RAM
7B TG speed: ~140 t/s
Save ~25% vs Alienware Aurora R16 RTX 4090 (~$4,000 pre-built)
Mid-range

Comparable to Alienware Aurora R16 at ~25% lower cost. 128 GB system RAM allows 70B CPU offload (slow, ~3 t/s) but 20B in VRAM runs at full speed.

The Ryzen AI Max 70B Build
Cheapest x86 path to 70B at real speed
System: Minisforum AI Max+ 395 Mini PC
CPU/iGPU: Ryzen AI Max+ 395 (16-core, RDNA 3.5)
Unified RAM: 128 GB LPDDR5X
Storage: 2 TB NVMe (user-upgradeable)
Est. Cost: ~$1,600
Max model (RAM): 70B Q4 (~40 GB used)
7B TG speed: ~90 t/s (Vulkan)
Save ~24% vs Asus ROG Flow Z13 128 GB (~$2,100 pre-built)
Mid-range

Not strictly DIY, but highly upgradeable mini PC. Set iGPU VRAM allocation to 96-128 GB in BIOS. 70B Q4 runs at ~6 t/s; 7B at ~90 t/s. Best value for 70B access on Windows/Linux.

The A100 80 GB Single-GPU Build
Single pro GPU that fits 70B Q4 natively
GPU: NVIDIA A100 80 GB PCIe (used)
CPU: AMD Threadripper PRO 5955WX (16-core)
RAM: 256 GB ECC DDR4
Storage: 8 TB NVMe Gen4
PSU: 1600W 80+ Platinum
Est. Cost: ~$6,000
Max model (VRAM): 70B Q4 fully in VRAM (80 GB HBM2e)
7B TG speed: ~220 t/s
Save ~60% vs Dell Precision 7960 with A100 (~$15,000 pre-built)
High-end

Used A100 80 GB PCIe cards available for $4,000-6,000 on secondary market. Workstation chassis required. 1,935 GB/s HBM2e bandwidth. Best Linux vLLM platform for a single card.

The Dual A6000 Workstation
96 GB VRAM for 70B Q4 plus headroom
GPU: 2x NVIDIA RTX A6000 48 GB
CPU: AMD Threadripper PRO 5975WX (32-core)
RAM: 256 GB DDR4 ECC
Storage: 8 TB NVMe RAID
PSU: 1600W 80+ Platinum
Est. Cost: ~$9,000
Max model (VRAM): 70B Q4 with context headroom
7B TG speed: ~360 t/s (tensor parallel)
Save ~40% vs Lenovo ThinkStation P620 dual A6000 (~$15,000 pre-built)
High-end

Requires WRX80 motherboard. Both A6000s share 96 GB via vLLM tensor parallelism. NVLink bridge optional. Linux + vLLM recommended.

The RTX 6000 Ada Pro Build
Single 48 GB Ada GPU with a second-card upgrade path
GPU: NVIDIA RTX 6000 Ada 48 GB
CPU: Intel Core i9-14900K (24-core)
RAM: 128 GB DDR5-6400
Storage: 4 TB NVMe Gen4
PSU: 1200W 80+ Platinum
Est. Cost: ~$7,500
Max model (VRAM): 40B Q4 easily; 70B Q3 marginal
7B TG speed: ~200 t/s
Save ~38% vs HP Z8 G4 single RTX 6000 Ada config (~$12,000 pre-built)
High-end

960 GB/s bandwidth with ECC memory. A second card can be added later for 96 GB total; Ada-generation cards have no NVLink, so a pair communicates over PCIe via tensor parallelism. ATX-compatible in a standard full-tower. Windows and Linux supported.

The 120B+ Extreme Workstation
Quad A6000 - 192 GB VRAM for massive models
GPU: 4x NVIDIA RTX A6000 48 GB
CPU: AMD Threadripper PRO 5995WX (64-core)
RAM: 512 GB DDR4 ECC
Storage: 16 TB NVMe RAID
PSU: 2x 1600W redundant
Est. Cost: ~$18,000
Max model (VRAM): 120B Q4 in 192 GB VRAM
7B TG speed: ~720 t/s combined
Save ~40% vs HP Z8 G4 quad A100 config (~$30,000 pre-built)
Pro / Extreme

WRX80 EATX board required. Quad-slot GPU spacing mandatory - verify physical fitment. Linux + vLLM with tp=4. Requires a 30A+ dedicated circuit.

The Dual H100 Server Build
160 GB HBM3 - production inference at home
GPU: 2x NVIDIA H100 80 GB PCIe (used/refurb)
CPU: AMD EPYC 7543 (32-core)
RAM: 512 GB DDR4 ECC RDIMM
Storage: 16 TB NVMe enterprise
Chassis: 4U server chassis
Est. Cost: ~$25,000
Max model (VRAM): 120B Q4 natively; 70B FP16
7B TG speed: ~520 t/s combined
Custom build only: there is no direct pre-built equivalent for this server-grade spec.
Pro / Extreme

Used H100 PCIe cards appear on secondary market for $12,000-18,000 each. Requires server-class motherboard with IOMMU support. vLLM with Flash Attention 2 and tp=2 is the recommended production stack.

DIY vs. Pre-built: When Each Makes Sense

Why Build DIY

Cost savings: An RTX 4090 DIY build runs ~$3,000 vs $4,000+ for the Alienware Aurora R16 - 20-30% less for identical performance.

Component choice: Select your own PSU quality, motherboard stability, and case airflow. Pre-builds often cut corners on PSU brand and motherboard tier.

Upgrade path: Replace the GPU in 2 years without buying a new system. Pre-built OEM cases often use non-standard form factors.

Exact VRAM targeting: A used RTX 3090 at 24 GB costs $400-600 and performs nearly identically to an RTX 4090 for LLM inference.

Why Buy Pre-built

Warranty & support: A single vendor for all troubleshooting. For production AI work, downtime costs more than savings.

Certified driver stacks: HP and Dell workstations ship with validated configurations for GPU-intensive workloads.

Specialized hardware: Quad-GPU systems with NVLink bridges and server-class PSUs are impractical to DIY. The HP Z8 G4 and Dell Precision 7960 are only practical as pre-built or refurb.

Apple Silicon: There is no DIY path to Apple unified memory. Mac Studio is the only option for 128 GB+ high-bandwidth unified memory.

Best Value Summary

Under $1,500 - DIY wins: RTX 3090 (used) + budget CPU + 64 GB DDR4 outperforms any pre-built at this price.

$3,000-5,000 - DIY saves 20-30%: Self-built RTX 4090 or 5090 system matches Alienware/ROG for less money.

$8,000-20,000 - Pre-built pulls ahead: Workstation-class multi-GPU configs from HP/Dell/Lenovo come with support and certified setups hard to replicate DIY.

Apple Silicon - no DIY option: Mac Studio M3 Ultra is unique in offering 192-512 GB high-bandwidth unified memory in a consumer form factor.


Manufacturer Tradeoffs

NVIDIA

  • Speed king: Highest tokens/sec for models that fit in VRAM — RTX 5090 at 1,792 GB/s is unmatched on the consumer market.
  • CUDA ecosystem: Every AI framework (PyTorch, vLLM, TensorRT, llama.cpp) works out of the box. Zero configuration headaches.
  • VRAM ceiling: Consumer cards top out at 32GB (5090). Professional cards (RTX 6000 Ada: 48GB, A100: 80GB) cost significantly more.
  • Training: Best-in-class for fine-tuning and training with full CUDA support.
  • Power draw: RTX 5090 ≈ 575W — a space heater. Dedicated airflow/cooling needed.
  • P2P blocked: Consumer multi-GPU peer-to-peer is disabled by default in drivers (workaround requires patched kernel modules).
Choose NVIDIA if: you run models ≤30B and want maximum speed, or need fine-tuning & training.

AMD

  • Ryzen AI Max+ 395: The standout — 128GB unified memory at ~$1,500–2,500 is the cheapest way to run 70B models on x86.
  • Bandwidth: Radeon RX 7900 XTX has 960 GB/s — competitive with NVIDIA RTX 4090 (1,008 GB/s), at a lower price.
  • ROCm ecosystem: Improving rapidly (vLLM support added 2024), but still lags CUDA. Occasional bugs, fewer tutorials.
  • Vulkan backend: llama.cpp Vulkan is often faster than ROCm for pure inference; ~20% advantage in recent tests.
  • Discrete VRAM limit: RX 7900 XTX tops at 24GB — same ceiling as RTX 4090.
  • BIOS gotcha: Ryzen AI Max laptops default to 32GB GPU allocation; must manually set to 96–128GB for proper performance.
Choose AMD if: you need 70B+ on a budget (Ryzen AI Max+ 395), or want competitive discrete GPU performance at lower cost than NVIDIA.

Apple Silicon

  • Unified memory: M3 Ultra at 192–512GB is the only consumer path to running 200B+ models locally without a multi-GPU rig.
  • MLX framework: Apple-optimized; 20–30% faster than Ollama for inference. Speculative decoding can roughly double token generation.
  • Token generation speed: M3 Ultra (~800 GB/s) loses to RTX 5090 (1,792 GB/s) for models that fit in VRAM — ~97 t/s vs ~250 t/s on 7B.
  • Power efficiency: Mac Studio at ~150W vs RTX 5090 system at ~800W+. Silent under moderate load.
  • Concurrency: Handles concurrent requests worse than NVIDIA; not ideal for multi-user serving.
  • No GPU-only option: Apple Silicon is integrated — you buy the whole computer, not a card.
  • macOS only: No Linux, no CUDA. Only MLX, llama.cpp, and Ollama with Metal.
Choose Apple if: you need to run very large models (70B+) in a quiet, efficient package, or you're already in the Apple ecosystem.


Optimization Guide — Getting Maximum Performance from Local LLMs

Inference speed for locally self-hosted LLMs is almost entirely memory-bandwidth-bound, not compute-bound. Each token generated requires reading the full model weights from memory once. The faster your memory bus, the faster your tokens. GPU VRAM bandwidth (>1,000 GB/s) dwarfs CPU DRAM bandwidth (30–100 GB/s), which is why a GPU can be 10–30× faster than CPU-only. Knowing which bottleneck you have lets you choose the right optimization.
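
Memory-bandwidth ceilings can be estimated on the back of an envelope: divide bandwidth by the bytes read per token, which is roughly the quantised model size. The sketch below does exactly that; the bandwidth and model-size figures are illustrative, and real systems land below the ceiling because of compute, sampling, and KV-cache traffic.

```bash
# Back-of-envelope ceiling: tokens/sec ~ memory bandwidth / bytes read per token.
# Each generated token streams the full set of quantised weights once, so the
# model's on-disk size (KV-cache traffic ignored) sets the upper bound.
estimate_tps() {
  local bandwidth_gbs=$1   # memory bandwidth in GB/s
  local model_gb=$2        # quantised model size in GB
  awk -v bw="$bandwidth_gbs" -v sz="$model_gb" \
    'BEGIN { printf "ceiling: ~%.0f t/s\n", bw / sz }'
}

estimate_tps 1792 4.4   # RTX 5090 + 7B Q4 (~4.4 GB)  -> ~407 t/s ceiling
estimate_tps 800  40    # M3 Ultra + 70B Q4 (~40 GB)  -> ~20 t/s ceiling
estimate_tps 64   4.4   # dual-channel DDR5 + 7B Q4   -> ~15 t/s ceiling
```

The measured figures in the table below sit under these ceilings but track them closely, which is why memory bandwidth is the first number to check on any spec sheet.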

What Actually Limits Performance — by Hardware Path

Hardware Path | Primary Bottleneck | Secondary Bottleneck | Max RAM Available | Typical 7B t/s | Typical 70B t/s
NVIDIA GPU (e.g. RTX 5090) | VRAM capacity (32 GB ceiling on consumer cards) | VRAM bandwidth when model fits | 32 GB VRAM + system RAM for overflow | ~250 | N/A (model too large for VRAM)
NVIDIA Pro GPU (e.g. A100 80GB) | VRAM bandwidth (2,000 GB/s) | PCIe bandwidth for multi-GPU | 80 GB per card | ~260 | ~22 (full VRAM)
Apple Silicon (e.g. M3 Ultra) | Unified memory bandwidth (~800 GB/s) | Metal compute shader efficiency | 192–512 GB (model & OS share same pool) | ~120 | ~18–20
AMD RX 7900 XTX | VRAM capacity (24 GB ceiling) | ROCm/HIP software overhead | 24 GB VRAM | ~135 | N/A
AMD Ryzen AI Max+ 395 | Unified memory bandwidth (~256 GB/s) | ROCm vs. Vulkan backend selection | 128 GB unified | ~90 | ~6
CPU-only (e.g. i9-13900K, DDR5) | System RAM bandwidth (40–80 GB/s) | Thread count, NUMA topology | Addressable system RAM | ~15–25 | ~2–4

Token speeds are batch-size-1 (single-user) generation using Q4_K_M quantisation. Multi-user batch inference favours higher-compute GPUs more.

Quantisation — The Most Impactful Software Setting

Quantisation reduces model precision so that larger models fit in memory and each generated token requires reading fewer bytes. Choosing the right level is the single biggest lever available regardless of hardware.

Format | Bits/Weight | RAM vs FP16 | Quality vs FP16 | Best Use | Speedup vs FP16
FP32 | 32 | 2× more | Identical | Training only | 0.5×
FP16 / BF16 | 16 | 1× (baseline) | Near-lossless | High-VRAM GPU inference, fine-tuning | 1× (baseline)
Q8_0 | 8 | ~0.5× | Effectively lossless (<0.1% perplexity) | When you have VRAM to spare | 1.8×
Q6_K | 6 | ~0.37× | Minimal quality loss | Large models on capacity-limited systems | 2.2×
Q5_K_M | 5 | ~0.31× | Slight loss, noticeable only on coding | Best quality when RAM is just enough | 2.8×
Q4_K_M | 4 | ~0.26× | Good — recommended default | Default recommendation for all hardware | 3.2×
Q3_K_M | 3 | ~0.19× | Noticeable degradation | Only if Q4 doesn't fit in RAM | 3.8×
Q2_K | 2 | ~0.13× | Significant quality loss — avoid | Emergency fallback | 4.5×
IQ4_XS (imatrix) | ~4.25 | ~0.27× | Better than Q4_K_M at same size | Best Q4-tier quality | 3.0×

Rule of thumb: Q4_K_M uses roughly 0.6 GB per 1B parameters. A 70B model ≈ 40 GB Q4; a 13B model ≈ 8 GB Q4; a 7B model ≈ 4.4 GB Q4. Importance-matrix (imatrix) quants (IQ series) give better quality at the same bit depth.
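
As a sketch of how those multipliers translate into gigabytes, the snippet below applies the "RAM vs FP16" column to a parameter count (FP16 baseline of 2 bytes per weight). The multipliers are approximate, and real GGUF files add some metadata and runtime overhead on top.

```bash
# Rough weight-file size from parameter count and quant level, using the
# approximate "RAM vs FP16" multipliers from the table above.
model_size_gb() {
  local params_b=$1  # parameters, in billions
  local quant=$2     # FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K
  awk -v p="$params_b" -v q="$quant" 'BEGIN {
    f["FP16"]=1.0; f["Q8_0"]=0.5; f["Q6_K"]=0.37; f["Q5_K_M"]=0.31
    f["Q4_K_M"]=0.26; f["Q3_K_M"]=0.19; f["Q2_K"]=0.13
    printf "%sB %s ~ %.1f GB\n", p, q, p * 2 * f[q]   # FP16 = 2 bytes/weight
  }'
}

model_size_gb 7  Q4_K_M   # ~3.6 GB of weights; ~4.4 GB in practice with overhead
model_size_gb 70 Q4_K_M   # ~36 GB of weights; the guide quotes ~40 GB with overhead
model_size_gb 70 FP16     # ~140 GB, which is why 70B FP16 needs 160 GB of VRAM
```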

Platform-Specific Optimization Paths

NVIDIA — CUDA Ecosystem
  • CUDA + cuBLAS (default): All major frameworks use this. Best throughput for models that fully fit in VRAM. Zero additional configuration in Ollama or LM Studio.
  • TensorRT-LLM: NVIDIA's production inference engine. Requires model compilation; delivers up to 2× the throughput of vanilla llama.cpp for continuous batching workloads.
  • FlashAttention-2: Reduces attention memory complexity from O(n²) to O(n). Enabled by default in vLLM and llama.cpp — provides 1.5–2× speedup on long contexts.
  • Multi-GPU NVLink: Data-center cards (A100, H100) and the Ampere-generation RTX A6000 support NVLink for full-bandwidth pooling. Consumer and Ada-generation RTX cards do not, so PCIe bandwidth becomes the bottleneck with 2× GPUs.
  • GPU offload layers (-ngl): In llama.cpp, set -ngl 99 to send all transformer layers to GPU. If VRAM is too small, reduce layers incrementally; each layer in VRAM improves speed (see the sketch after this list).
  • FP8 quantisation (Hopper GPUs): H100/H200 support FP8 native precision — 2× throughput over BF16 with <1% quality loss. Not available on consumer RTX cards.
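
A minimal llama.cpp run tying the CUDA build, layer offload, and Flash Attention notes above together. The repository URL is the public llama.cpp project; the model path is a placeholder, and build and flag names have shifted between releases (older trees used LLAMA_CUBLAS), so treat this as a starting sketch rather than a canonical recipe.

```bash
# Build llama.cpp with the CUDA backend, then run a local GGUF fully on the GPU.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# -ngl 99 offloads every transformer layer; step it down if VRAM runs out.
./build/bin/llama-cli \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -ngl 99 --flash-attn --ctx-size 8192 \
  -p "Explain in two sentences why LLM inference is memory-bandwidth-bound."
```

The load log reports how many layers were offloaded and roughly how much VRAM each buffer takes, which makes it easy to find the highest -ngl value that still fits.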
Apple Silicon — Metal / MLX
  • MLX framework (Apple-native): 20–30% faster than llama.cpp with Ollama for inference on Apple Silicon. Install via pip install mlx-lm. Best choice for Apple hardware in 2025.
  • Metal Performance Shaders (MPS): Used by PyTorch on Apple Silicon. Not as optimised as MLX for LLM inference but works for fine-tuning and training experiments.
  • Speculative decoding (MLX): Use a small draft model (e.g. 1B) to propose tokens, verified by the main model. Can nearly double tokens/sec on large models with no quality change.
  • llama.cpp Metal backend: Used by Ollama. Slower than native MLX but has broader model format support. Set LLAMA_METAL=1 at compile time.
  • Unified memory — use it all: Unlike a GPU, macOS can dynamically share RAM between GPU and CPU. Run models that are 70–80% of total RAM; leave ~20% for macOS and applications.
  • Context length and memory: Increasing context window (e.g. 8K → 128K) consumes significant KV-cache RAM. At 128K context a 70B model can use >60 GB extra RAM. Set --ctx-size conservatively.
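
A minimal MLX session, assuming the mlx-lm package and a quantised community model from Hugging Face. The model repository below is illustrative, and CLI flag names can change between mlx-lm releases.

```bash
# Install Apple's MLX LLM tooling and run a 4-bit model with a bounded KV cache.
pip install mlx-lm

# --max-kv-size caps the KV cache so long sessions don't exhaust unified memory.
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain unified memory in one paragraph." \
  --max-tokens 256 \
  --max-kv-size 4096
```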
AMD — ROCm / Vulkan / HIP
  • ROCm (HIP): AMD's CUDA-equivalent. Now supported in llama.cpp, vLLM, and PyTorch. Requires ROCm 6.0+ on compatible RX 6000/7000 series GPUs. Performance is within ~10% of CUDA on inference.
  • Vulkan backend (llama.cpp): Cross-platform GPU acceleration using Vulkan compute shaders. Often 10–20% faster than ROCm on discrete GPUs for pure inference. Use -DGGML_VULKAN=ON at build time (see the sketch after this list).
  • Ryzen AI Max+ 395 — BIOS RAM allocation: By default, the NPU/GPU shares 32 GB. Open BIOS, set iGPU VRAM allocation to 96 or 128 GB for full performance. Without this, large models page to slow system RAM.
  • hipBLAS (ROCm BLAS): Replaces cuBLAS for AMD systems. Install via ROCm SDK and set HIP_VISIBLE_DEVICES=0. Used automatically by torch and llama.cpp ROCm builds.
  • AMDGPU target: Set HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA3 cards (RX 7000 series) if ROCm doesn't detect the GPU correctly, a common issue on consumer RDNA setups.
  • Concurrent requests: AMD's ROCm handles batching less efficiently than CUDA for high concurrency. For single-user inference, the gap is small (<10%). For serving multiple users, NVIDIA holds a clear advantage.
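
The Vulkan build flag and the RDNA3 override mentioned above, combined into one sketch. Model paths are placeholders, and the HIP build option has been renamed across llama.cpp releases (GGML_HIPBLAS in older trees), so check the current build docs.

```bash
# Vulkan backend: works on discrete Radeon cards and the Ryzen AI Max iGPU alike.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli -m models/qwen2.5-7b-instruct-Q4_K_M.gguf -ngl 99 -p "Hello"

# ROCm/HIP alternative for RX 7000 (RDNA3) cards; the override helps when ROCm
# misidentifies the GPU architecture.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
cmake -B build-rocm -DGGML_HIP=ON
cmake --build build-rocm --config Release -j
```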
CPU-Only Inference (x86 / ARM)
  • AVX-512 (Intel Icelake+, AMD Zen4): llama.cpp auto-detects and uses AVX-512 VNNI instructions for 4-bit matrix multiply. ~1.4× faster than AVX2. Use a compiled binary or set CMAKE_CXX_FLAGS=-march=native.
  • Thread count: Set threads to equal physical cores (not hyperthreads). Hyperthreading adds latency on memory-bound tasks. E.g. for an 8P + 16E core i9-13900K, use -t 8 (P-cores only).
  • NUMA pinning: On multi-socket systems, pin to one NUMA domain. Cross-socket memory access halves effective bandwidth. Use numactl --cpunodebind=0 --membind=0.
  • Dual-channel / quad-channel RAM: Adding memory channels scales bandwidth roughly linearly. Quad-channel DDR5-6000 gives ~190 GB/s vs ~50 GB/s for dual-channel DDR4-3200, roughly a 4× uplift for CPU-only inference.
  • mmap and mlock: llama.cpp by default mmaps model files. Use --mlock to pin the model in RAM and prevent page faults during inference. Requires sufficient free RAM and root/CAP_IPC_LOCK.
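
A CPU-only invocation combining the thread, NUMA, and mlock advice above. The core count and model path are placeholders; set -t to your machine's physical core count.

```bash
# Pin the process and its memory to one NUMA node, use only physical cores, and
# lock the model in RAM so pages are never evicted mid-generation (needs enough free RAM).
numactl --cpunodebind=0 --membind=0 \
  ./build/bin/llama-cli \
    -m models/mistral-7b-instruct-Q4_K_M.gguf \
    -t 8 --mlock --ctx-size 4096 \
    -p "Write a haiku about memory bandwidth."
```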
Inference Framework Selection
  • Ollama: Easiest setup; wraps llama.cpp with automatic hardware detection (CUDA, Metal, ROCm). Good for development and single-user. Not optimised for high-throughput multi-user serving.
  • LM Studio: GUI-based; good for non-technical users. Uses llama.cpp backend. GPU offload via "GPU Layers" slider. Match layers to VRAM — rule of thumb: each 7B layer ~= 100 MB VRAM.
  • llama.cpp (direct): Maximum control. Supports every acceleration path. Use -ngl 99 -t 8 --ctx-size 4096 --mlock as a good starting configuration.
  • vLLM: Best for production multi-user serving on NVIDIA. PagedAttention enables efficient KV-cache sharing. Not supported on Apple Silicon. Requires CUDA 11.8+.
  • ExLlamaV2: Fastest CUDA inference for quantised models. Uses EXL2 format (superior to GGUF on NVIDIA). Can exceed TensorRT-LLM speed for chat workloads at Q4–Q5.
Context Length and KV-Cache
  • KV-cache size scales with context length: A 7B model at 4K context uses ~500 MB of KV-cache; at 128K context it uses ~16 GB. Reduce --ctx-size to free memory for the model itself.
  • KV quantisation (--cache-type-k q8_0): Quantising the KV cache from FP16 to Q8 or Q4 halves its memory footprint with minor quality impact. Supported in llama.cpp ≥ b2300.
  • Grouped-query attention (GQA) models: Llama 3, Mistral, and Qwen2+ use GQA — their KV cache is inherently smaller than older models. Prefer GQA models when context length matters.
  • Flash Attention 2: Required for efficient long-context inference. Enabled automatically in vLLM and supported in llama.cpp with --flash-attn.
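
A llama-server sketch applying the notes above: quantised KV cache, bounded context, Flash Attention on. The model path is a placeholder, and the cache-type flags need a reasonably recent llama.cpp build.

```bash
# Serve a model with an 8-bit KV cache and a bounded context window.
# Quantising the cache roughly halves KV memory, which matters most at long contexts.
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -ngl 99 --flash-attn \
  --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 127.0.0.1 --port 8080
```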

RAM / VRAM Impact on Model Access — Reference Table

This table shows what changes when you increase or reduce available RAM/VRAM. The key insight: more memory expands which models fit; more bandwidth increases how fast they run. Adding RAM without adding bandwidth does not improve token speed — it only enables larger models.

Available RAM / VRAM | Largest model (Q4_K_M) | Token speed change vs. prior tier | 70B Q4 accessible? | Notes
4 GB | ~3B Q4 | n/a | No | OS overhead leaves ~2.5 GB usable. 3B models run well.
8 GB | ~7B Q4 | +0% speed (same BW), +model tier | No | Most popular tier. 7B Q4 = ~4.4 GB. Small context headroom.
16 GB | ~13B Q4 | +0% speed, +model tier | No | 13B Q4 ≈ 7.9 GB. 8K context adds ~1 GB. Comfortable headroom.
24 GB | ~20B Q4 | +0% speed, +model tier | No | Maximum for consumer discrete GPUs (NVIDIA RTX 4090, AMD RX 7900 XTX). 20B Q4 ≈ 12 GB.
32 GB | ~26B Q4, 70B Q2 | +0% speed, marginal 70B Q2 | Marginal (Q2 only) | RTX 5090 ceiling. Q2 70B runs but quality suffers. Strong for 13–26B models.
48 GB | ~40B Q4, 70B Q3 | +0% speed, +model tier | Q3 only (~3.8 t/s) | RTX A6000. 70B Q3 just fits; expect lower quality than Q4.
64 GB | 70B Q4 (full) | +0% speed, full 70B access | Yes (Q4_K_M) | First tier enabling 70B Q4 fully. Apple M-series, Ryzen AI Max. 70B Q4 ≈ 40 GB; leaves 20+ GB for context.
96 GB | ~80B Q4, 120B Q3 | +0% speed, 120B tier | Yes, with headroom | Dual 3090/A6000, M2 Ultra 96 GB. Llama 3.1 405B in Q2 just fits.
128 GB | ~105B Q4, 120B Q3 | +0% speed (CPU BW limited) | Yes, comfortably | Ryzen AI Max+ 395, M-series 128 GB. 70B + 32K context fits easily.
256 GB | ~200B Q4, 400B Q3 | +0% speed, 200B tier (Apple BW) | Yes, fast (~18 t/s) | M3 Ultra 256 GB. Llama 3.1 405B Q3 fits comfortably.
512 GB | 405B Q4 full | +0% speed, 400B tier | Yes | M3 Ultra 512 GB. 405B Q4 ≈ 230 GB. Full unquantised 70B fits.

Key Takeaways by Use Case

Goal | Recommended Path | Key Setting
Fastest possible 7B inference | NVIDIA RTX 5090 + ExLlamaV2 | EXL2 Q4 format, enable Flash Attention
Run 70B on a budget | AMD Ryzen AI Max+ 395 mini PC (~$1,500) | Set BIOS iGPU VRAM to 128 GB, use Vulkan backend
Quiet, large-model desktop | Apple Mac Studio M3 Ultra 128 GB + MLX | Use mlx_lm.generate, enable speculative decoding
Multi-user production serving | NVIDIA A100/H100 + vLLM | Enable PagedAttention, continuous batching, tensor parallelism
Maximum context length (>64K tokens) | Apple M3 Ultra 256 GB or 512 GB | Set --ctx-size 65536, quantise KV cache to Q8
Fine-tuning / LoRA training | NVIDIA GPU with 24+ GB VRAM | Use QLoRA (4-bit base + FP16 adapters), gradient checkpointing
Best CPU-only on x86 | Dual-channel DDR5 + Zen4/Raptor Lake | Compile llama.cpp with -march=native, use -t <physical cores>

Operating Systems for Local AI Development

Your operating system determines which GPU acceleration paths, quantization formats, and developer tools are available. This page compares Windows, macOS, and Linux across the dimensions that matter most for running, developing, and fine-tuning local AI models.

Feature Compatibility Matrix

Feature / Tool | Windows 11 | macOS (Apple Silicon) | Linux (Ubuntu 22.04+)
NVIDIA CUDA | Full | Not available | Full — best driver control
AMD ROCm (discrete GPU) | Not available — ROCm for discrete is Linux-only | Not available | Full — ROCm 6.0+ on RX 6000/7000
Apple MLX / Metal | Not available | Native — best inference path | Not available
DirectML (Microsoft) | Native — ONNX on NVIDIA, AMD, Intel GPU | Not available | Not available
Vulkan inference (llama.cpp) | Full | Not available | Full
Ollama | Full | Full | Full
LM Studio | Full | Full | Full
vLLM (production serving) | Via WSL2 only | Not supported | Full — native, best performance
ExLlamaV2 | Full (CUDA) | Not available | Full (CUDA)
Claude Code | WSL2 required | Native — recommended | Native — recommended
Aider coding agent | Works; WSL2 preferred | Full | Full
Docker + GPU passthrough | Via Docker Desktop + WSL2 | No GPU passthrough to containers | Full — NVIDIA Container Toolkit
Axolotl / Unsloth fine-tuning | Via WSL2 + CUDA | Not supported | Full — recommended platform
GGUF quantization | Full | Full | Full
EXL2 format | Full (CUDA) | Not available | Full (CUDA)
GPTQ / AWQ formats | Full (CUDA) | Not available | Full (CUDA)
MLX native format | Not available | Native — Apple Silicon only | Not available
BitsAndBytes / QLoRA | Via WSL2 (unreliable natively) | Not supported | Full — recommended for QLoRA
Hugging Face transformers | Works; WSL2 for training | Inference via MPS; no CUDA | Full
GPU driver management | Easy — NVIDIA App or AMD Adrenaline | Automatic — macOS system updates | Manual — DKMS, kernel headers

Platform Detail

Windows 11

Best for: NVIDIA CUDA users wanting a dual-purpose gaming and AI workstation. Widest consumer GPU support, easiest driver management, and the best GUI tool ecosystem.

  • CUDA works natively. Ollama, LM Studio, ExLlamaV2, and llama.cpp all run without WSL2 on NVIDIA GPUs. CUDA driver installation is handled automatically by the NVIDIA App.
  • WSL2 is near-mandatory for serious Python-based AI work. Tools like vLLM, Axolotl, Unsloth, and BitsAndBytes are Linux-first. Install WSL2 (Ubuntu 22.04 LTS) and do Python AI development inside it. NVIDIA GPU passthrough to WSL2 delivers near-native CUDA performance with minimal overhead.
  • Claude Code requires WSL2. It is not natively supported on PowerShell or Command Prompt. After enabling WSL2, install Node.js inside Ubuntu and run npm install -g @anthropic-ai/claude-code (see the command sketch after this list). This one step also resolves most Python ML compatibility issues.
  • AMD discrete GPU ROCm is not available on Windows. AMD users should use the Vulkan backend in llama.cpp/Ollama (which works well for inference) as an alternative. For PyTorch training with AMD discrete GPUs, Linux is required.
  • DirectML accelerates ONNX Runtime models (Phi-3, Mistral, Llama via ONNX) on any Windows GPU including Intel Arc. Slower than CUDA but hardware-agnostic — useful when CUDA is not an option.
  • Recommended setup: NVIDIA drivers natively → WSL2 + Ubuntu 22.04 → Ollama and LM Studio on native Windows for daily chat → Claude Code and Python tools (vLLM, transformers, axolotl) inside WSL2.
  • AI coding tool overhead. Windows NTFS filesystem operations are measurably slower than macOS APFS or Linux ext4. AI coding agents (Claude Code, Aider, Cursor) perform frequent file reads, diffs, and writes during each interaction. On Windows, this translates to longer response latencies and higher token consumption per task — the agent spends more time waiting on I/O and generates more retries. In practice, the same coding task can cost 10–30% more tokens on Windows compared to macOS or Linux. For heavy agentic coding use, WSL2 with files stored inside the Linux filesystem (not /mnt/c/) mitigates this, but native macOS or Linux remains faster.
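
A condensed version of that recommended setup. The wsl command runs once from an elevated PowerShell prompt; everything else runs inside Ubuntu. The NodeSource route shown for Node.js is one of several that work.

```bash
# One-time, from an elevated PowerShell prompt (shown as a comment because the
# rest of this block runs inside the Ubuntu shell):
#   wsl --install -d Ubuntu-22.04

# Inside WSL2 Ubuntu: install Node.js, then Claude Code.
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
npm install -g @anthropic-ai/claude-code

# Keep projects inside the Linux filesystem (~/projects), not /mnt/c/, to avoid
# the NTFS I/O penalty described above.
mkdir -p ~/projects && cd ~/projects
claude
```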
macOS (Apple Silicon)

Best for: Developers who want a polished Unix environment, large unified-memory model capacity, low power draw, and the best Claude Code experience without Linux complexity.

  • MLX is the optimal inference path. Apple's open-source MLX framework is 20–30% faster than llama.cpp Metal for inference on Apple Silicon. Install via pip install mlx-lm, then run models directly with mlx_lm.generate. Ollama uses Metal/llama.cpp internally and is slightly slower but simpler to set up.
  • Unified memory is the key architectural advantage. CPU and GPU share one high-bandwidth pool — up to 512 GB on Mac Studio M3 Ultra. A 70B model (40 GB Q4) loads without any VRAM/host memory boundary or PCIe transfer overhead.
  • Claude Code works natively. macOS is a first-class Claude Code platform. zsh, Homebrew, and all standard Unix utilities work out of the box. No WSL2 or compatibility layer needed.
  • No CUDA, EXL2, GPTQ/AWQ, vLLM, or BitsAndBytes. Axolotl and Unsloth for fine-tuning are not supported. MLX includes its own LoRA fine-tuning path (mlx_lm.lora) for Apple Silicon.
  • Supported quantization formats: GGUF (all variants via Ollama/llama.cpp) and MLX native formats (int4, int8, fp16, bf16). EXL2, GPTQ, AWQ, and BitsAndBytes are not available.
  • RAM is soldered and non-upgradeable. Choose your memory configuration at purchase. The 128 GB step (M4 Max MacBook Pro or Mac Studio) is the threshold for 70B Q4 models; 192–512 GB (Mac Studio M3 Ultra) enables 120B+ models.
  • Power efficiency. An M3 Ultra running 70B at 18 t/s draws ~150 W. An equivalent-speed NVIDIA A100 draws 300–400 W. For always-on home servers this is a meaningful difference over time.
Linux (Ubuntu 22.04 / 24.04)

Best for: Fine-tuning, production multi-user serving, AMD discrete GPU acceleration, multi-GPU setups, and developers who need the full toolchain without compatibility workarounds.

  • The reference platform for all AI/ML work. vLLM, Axolotl, Unsloth, TensorRT-LLM, and BitsAndBytes are developed and tested on Linux first. Some capabilities are Linux-only or substantially faster on Linux.
  • ROCm works properly only on Linux. AMD ROCm 6.0+ enables full GPU acceleration for llama.cpp, vLLM, and PyTorch on RX 6000/7000 series GPUs. On Windows, ROCm support for discrete GPUs is not available — use Vulkan there instead.
  • Claude Code and Aider are native. The full agentic experience — file editing, bash commands, multi-file reasoning, git integration — works without any wrapper or workaround.
  • Fine-tuning requires Linux. Axolotl (most widely used LoRA/QLoRA pipeline), Unsloth (2× faster LoRA with less VRAM), and BitsAndBytes (required for QLoRA) all require Linux + CUDA. Attempting native Windows fine-tuning consistently produces environment and CUDA library compatibility errors.
  • Docker + NVIDIA Container Toolkit enables fully reproducible GPU-accelerated environments. Deploy vLLM in one command: docker run --gpus all vllm/vllm-openai:latest --model meta-llama/Llama-3.1-70B-Instruct — the standard production pattern (toolkit setup sketch after this list).
  • Driver management is the main friction point. NVIDIA drivers require matching kernel headers and DKMS modules. Before upgrading the kernel, confirm the driver package is compatible. Use apt-mark hold linux-image-generic to pin a stable kernel once the driver is working.
  • Full quantization access: GGUF, EXL2, GPTQ, AWQ, BitsAndBytes (4-bit/8-bit), and all llama.cpp-supported formats. Linux + CUDA gives access to every inference and training optimization path simultaneously.
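
The Container Toolkit path referenced above, sketched for Ubuntu. It assumes NVIDIA's apt repository for the toolkit is already configured per their install docs, and the CUDA image tag is illustrative.

```bash
# Install the NVIDIA Container Toolkit and wire it into Docker.
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: the container should see the GPU.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Once the driver works, pin the kernel so an automatic upgrade can't break it.
sudo apt-mark hold linux-image-generic
```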

Recommended Setup by Goal

Goal | Recommended OS | Reasoning
Run local chat models (beginner) | Windows or macOS | Ollama and LM Studio work on all platforms. Windows has the widest consumer GPU support; macOS has the simplest environment setup.
Fine-tune models (LoRA / QLoRA) | Linux (Ubuntu 22.04) | Axolotl, Unsloth, and BitsAndBytes require Linux + CUDA. Most tutorials, Docker images, and Hugging Face guides assume Ubuntu.
Production multi-user model serving | Linux | vLLM with PagedAttention, Docker + NVIDIA Container Toolkit, and production monitoring (Prometheus, Grafana) are all Linux-native.
Daily Claude Code use | macOS or Linux | Native first-class support on both. On Windows, WSL2 setup is required and file I/O across the WSL2/Windows boundary is slower.
AMD discrete GPU inference | Linux | ROCm is Linux-only for discrete GPUs. On Windows, use Vulkan in llama.cpp/Ollama — functional but ~10–15% slower than ROCm on Linux for the same hardware.
Run 70B+ parameter models | macOS (Apple Silicon) or Linux (multi-GPU) | Apple unified memory and NVIDIA NVLink multi-GPU are the two viable paths to 70B+ at usable speed without an extreme budget.
Gaming + AI on one machine | Windows | Windows remains the primary gaming platform. NVIDIA CUDA AI tools work natively on Windows alongside games without issue.
Power-efficient always-on server | macOS (Apple Silicon) | Apple Silicon offers the best performance-per-watt ratio for LLM inference. A Mac mini M4 Pro can idle at under 10 W while ready to serve requests.
Maximum quantization format access | Linux | Linux + CUDA is the only platform where GGUF, EXL2, GPTQ, AWQ, BitsAndBytes, and ONNX with CUDA EP are all simultaneously available.

Software Guide — Inference, Coding Assistants, Fine-tuning & Cloud Tools

A structured catalog of the tools available across each stage of AI work: running local models, developing with AI assistance, fine-tuning models, and accessing cloud AI services. All tools listed are open source or have a meaningful free tier unless noted.

Local Inference Engines

These tools load model files from disk and run them on your CPU or GPU. They are the foundation of local AI use.

Ollama
The easiest way to run local models. Wraps llama.cpp with automatic GPU detection and exposes a REST API on localhost:11434 compatible with the OpenAI API spec.
Open source Windows / macOS / Linux
  • Model library: ollama run llama3.1:70b downloads a pre-quantised build and starts it in one command. 200+ curated models available.
  • GPU: Auto-detects CUDA (NVIDIA), Metal (Apple), and ROCm/Vulkan (AMD). No configuration needed for most setups.
  • API compatibility: Exposes a /v1/chat/completions endpoint, so any OpenAI-compatible client (Open WebUI, Continue.dev, Cursor) connects immediately (see the curl example after this list).
  • Limitation: Not optimised for high-throughput multi-user serving. Single-user development and local apps are the sweet spot.
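
A quick check of that OpenAI-compatible endpoint, assuming Ollama is running locally; the model tag is whichever model you have pulled.

```bash
# Pull a model, then hit the OpenAI-compatible chat endpoint with curl.
ollama pull llama3.1:8b

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```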
LM Studio
GUI application for discovering, downloading, and running GGUF models. Built-in model browser, GPU layer slider, and a local server with OpenAI-compatible API.
Free (closed source) Windows / macOS / Linux
  • Best for beginners: No command line required. Browse models from Hugging Face directly in the app, download with one click, adjust GPU layers via slider.
  • GPU offload: Drag the "GPU Layers" slider to send as many transformer layers as fit into VRAM. Rule of thumb: each 7B layer ≈ 100 MB VRAM.
  • Local server mode: Enables a persistent server that other apps (Open WebUI, Continue.dev) can connect to.
  • Limitation: Closed source; slightly behind llama.cpp direct on performance. Does not support vLLM-style batching.
Jan.ai
Open-source desktop app providing a full ChatGPT-style local interface with model management, extension support, and a local API server.
Open source Windows / macOS / Linux
  • Full offline operation: Designed to work with no internet connection. Model files are stored locally; no data leaves the machine.
  • Extension ecosystem: Supports plugins for additional functionality such as web search, document retrieval, and remote model endpoints.
  • API server: Exposes an OpenAI-compatible API on localhost:1337, allowing integration with any client that speaks the OpenAI format.
  • Good alternative to LM Studio for users who prefer open-source software with the same ease of use.
GPT4All
Fully offline AI assistant desktop app. No internet access required at any point, including model downloads (which can be done offline via direct GGUF import).
Open source Windows / macOS / Linux
  • Air-gap friendly: The strongest choice for environments where internet connectivity must be fully severed after setup.
  • LocalDocs feature: Index local documents (PDF, Word, text) and ask questions about them with retrieval-augmented generation (RAG).
  • CPU-first design: Works well on CPU-only systems. GPU acceleration via Vulkan is available but less optimised than Ollama/LM Studio.
KoboldCPP
A single-binary fork of llama.cpp with a built-in web UI, advanced sampling controls, and strong support for long-form creative writing and roleplay use cases.
Open source Windows / macOS / Linux
  • Single executable: Download one file and run it — no Python environment or package manager required.
  • Advanced sampling: Includes mirostat, tail-free sampling, and Typical P sampling not found in all frontends — particularly useful for creative/story generation.
  • SillyTavern/Agnai integration: The community-standard backend for character-based AI chat interfaces.
llama.cpp (direct)
The foundational C++ inference engine that powers Ollama, LM Studio, and KoboldCPP. Run it directly for maximum control over every parameter.
Open source Windows / macOS / Linux
  • Maximum control: Exposes every optimization flag: -ngl (GPU layers), --ctx-size, --cache-type-k, --flash-attn, -t (threads), --mlock, and more.
  • Supports all GPU backends in one codebase: CUDA, Metal, ROCm (HIP), Vulkan, OpenCL, and CPU-only (AVX2, AVX-512).
  • GGUF quantization tooling: Includes llama-quantize to convert and quantize models locally without uploading to a cloud service (see the sketch after this list).
  • Best choice when you are benchmarking, scripting, or need a parameter not exposed by higher-level tools.
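
A local quantisation pass using the tooling mentioned above. Model names and paths are placeholders, and the Hugging Face conversion script has been renamed across releases (convert_hf_to_gguf.py in current trees).

```bash
# 1. Convert a Hugging Face checkpoint to an FP16 GGUF with the bundled script.
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct --outfile llama-3.1-8b-f16.gguf

# 2. Quantise down to Q4_K_M locally; nothing leaves the machine.
./build/bin/llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M

# 3. Sanity-check the result.
./build/bin/llama-cli -m llama-3.1-8b-Q4_K_M.gguf -ngl 99 -p "Test."
```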
vLLM
Production-grade inference server for NVIDIA GPUs. Uses PagedAttention for efficient KV-cache memory management and continuous batching for high-throughput multi-user serving.
Open source Linux (CUDA required)
  • PagedAttention: Manages KV-cache in variable-size pages, dramatically reducing memory fragmentation and enabling 2–5× higher throughput than llama.cpp for concurrent users.
  • Continuous batching: Processes multiple requests simultaneously without waiting for a full batch, reducing latency under load.
  • OpenAI-compatible API: Drop-in replacement for OpenAI endpoints — connect any client without code changes (launch sketch after this list).
  • Requires Linux + CUDA. Not supported on macOS or Windows without WSL2. The standard choice for production NVIDIA deployments.
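
A native launch sketch on Linux + CUDA: the model name is an example, and --tensor-parallel-size is only needed when splitting across multiple GPUs.

```bash
pip install vllm

# Start an OpenAI-compatible server; omit --tensor-parallel-size on a single card.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000

# Any OpenAI-compatible client can now point at http://localhost:8000/v1
curl http://localhost:8000/v1/models
```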
ExLlamaV2
The fastest CUDA inference engine for quantized models. Uses the EXL2 quantization format, which achieves higher quality than GGUF Q4 at the same file size via per-row quantization.
Open source Windows Linux
  • EXL2 format: Per-row mixed-precision quantization. A 4-bit EXL2 model typically outperforms a GGUF Q4_K_M on the same hardware in quality benchmarks.
  • Speed: Consistently outperforms TensorRT-LLM on chat workloads at Q4–Q5. Best choice for maximum CUDA inference speed with quantized models.
  • Speculative decoding support: Use a small draft model to propose tokens — can yield near 2× speed on certain workloads.
  • Tabby API server: ExLlamaV2 is the backend for TabbyAPI, a high-performance OpenAI-compatible server used by many frontends.

AI Coding Assistants

Tools that integrate AI into the code-writing and editing workflow, from inline completions to full agentic coding agents that can edit multiple files and run commands.

Claude Code (Anthropic)
Terminal-based agentic coding assistant. Reads and edits files, runs shell commands, searches the codebase, and writes and executes code autonomously within a directory. Uses Claude models via API.
Pay-per-token macOS native Linux native WSL2 on Windows
  • Full agentic loop: Can read project files, propose and apply edits, run tests, and iterate on errors without constant user prompting. Most effective on large codebases.
  • Git integration: Understands git history, creates commits, and works across branches. Best used in a git-tracked repository.
  • Windows note: Requires WSL2 (Ubuntu 22.04+). Running inside WSL2 gives the same experience as Linux. File edits across the /mnt/c mount work but are slower than editing files inside the WSL2 filesystem.
  • Model selection: Defaults to Claude Sonnet; can be configured to use Claude Opus for complex reasoning tasks at higher cost.
GitHub Copilot
The most widely used AI coding assistant. Provides inline code completion and a chat panel within VS Code, JetBrains IDEs, Neovim, and others. Uses GPT-4o and Claude models depending on task.
$10/month (individual) All platforms
  • Inline completion: Suggests single lines or entire function bodies as you type with low latency. Works in virtually all languages.
  • Copilot Chat: Ask questions about the codebase, get explanations, generate tests, and fix errors directly in the IDE sidebar.
  • Copilot Workspace / Agent: Multi-step task completion — describe a feature and Copilot proposes a plan, creates files, and opens PRs (enterprise tier).
  • Model choice: GitHub now lets users select the underlying model (GPT-4o, Claude Sonnet, Gemini 2.0) per session.
Cursor
AI-first code editor forked from VS Code. Adds a multi-file edit "Composer" mode, codebase-wide context awareness, and tight model integration beyond what VS Code extensions provide.
Free tier / $20/month Pro Windows / macOS / Linux
  • Composer mode: Describe a change in natural language; Cursor proposes and applies diffs across multiple files simultaneously — the main differentiator from GitHub Copilot.
  • Codebase indexing: Indexes your entire repository for semantic search, giving the model accurate context even in large codebases.
  • Model selection: Supports Claude 3.5/3.7 Sonnet, GPT-4o, and o3 — swappable per-task. Can also point to a local Ollama endpoint for completions.
  • Tab / Ghost text: A distinctive predictive editing feature that anticipates your next edit based on recent changes — faster than standard autocomplete.
Continue.dev
Open-source AI coding extension for VS Code and JetBrains. Unique in that it supports local Ollama models as well as cloud APIs, giving you AI coding assistance with no data leaving your machine.
Open source Windows / macOS / Linux
  • Local model support: Connect to Ollama, LM Studio, or any OpenAI-compatible API. Run Qwen2.5-Coder 32B locally for code completion with full privacy.
  • Configurable via JSON: config.json specifies models for chat, completion, and embedding independently — mix local and cloud models (a minimal example follows this list).
  • Tab autocomplete: Uses a separate fast local model (e.g. Qwen2.5-Coder 1.5B) for low-latency line completions while using a larger model for chat.
  • Best choice for developers who want GitHub Copilot-style features but with local models and no subscription.
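
A minimal local-only configuration as a sketch, assuming Ollama is serving the models named below. The JSON layout follows the config format Continue has used historically; newer releases also accept a YAML config, so check the current docs.

```bash
# Point Continue at local Ollama models: a large model for chat, a small one for
# tab autocomplete. Model tags are examples; use whatever you have pulled.
mkdir -p ~/.continue
cat > ~/.continue/config.json <<'EOF'
{
  "models": [
    { "title": "Qwen2.5-Coder 32B (local)", "provider": "ollama", "model": "qwen2.5-coder:32b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B (autocomplete)", "provider": "ollama", "model": "qwen2.5-coder:1.5b"
  }
}
EOF
```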
Windsurf (Codeium)
AI-first code editor from Codeium, similar to Cursor. Offers a "Cascade" agent mode for multi-step task completion and deep IDE integration. Strong free tier.
Free tier / $15/month Pro Windows / macOS / Linux
  • Cascade agent: Handles multi-file edits, runs terminal commands, reads error output, and iterates — similar to Claude Code but embedded in the IDE.
  • Generous free tier: Unlimited code completions and a meaningful number of agent interactions per month at no cost — competitive with Cursor.
  • Codeium extension also available as a VS Code/JetBrains plugin if you don't want to switch editors, providing completions without the full Windsurf agent.
Aider
Terminal-based AI coding agent emphasizing git-awareness. Applies changes as atomic git commits, supports multiple files per session, and works with any LLM including local Ollama models.
Open source Windows / macOS / Linux
  • Git-first workflow: Every change is committed with a descriptive message, making it easy to review, revert, or cherry-pick AI edits.
  • Local model support: Connect to an Ollama endpoint and use Qwen2.5-Coder 32B or DeepSeek-Coder V2 for fully offline coding assistance.
  • Benchmark leading: Consistently at the top of the SWE-Bench leaderboard for open-source coding agents, particularly in architect + editor mode (uses two models cooperatively).
  • Works well alongside Claude Code — use Aider for git-structured file edits and Claude Code for broader exploration and command execution.

Local Chat Frontends & Self-Hosted UIs

Web or desktop interfaces that connect to Ollama, LM Studio, or vLLM and provide a chat experience similar to ChatGPT but running entirely on your hardware.

Open WebUI
The most popular self-hosted ChatGPT-style web interface. Connects to Ollama and OpenAI-compatible APIs. Runs as a Docker container or Python package. Feature parity with ChatGPT's web UI including document upload, image analysis, and multi-model switching.
Open source All platforms (browser-based)
  • One-command install: docker run -d -p 3000:80 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main — connects to a running Ollama instance automatically.
  • RAG support: Upload PDFs, web URLs, or YouTube transcripts; the model retrieves relevant chunks during conversation.
  • Multi-model conversations: Chat with multiple models side by side and compare responses.
  • User management: Supports multiple user accounts with role-based access — suitable for sharing a local model server within a team.
LibreChat
Self-hosted multi-provider chat interface supporting OpenAI, Anthropic, Google, Azure, and local models simultaneously in one UI. Strong feature set with branching conversations, code execution, and plugin support.
Open source All platforms (browser-based, Docker)
  • Multi-provider in one UI: Switch between ChatGPT, Claude, Gemini, and a local Ollama model within the same conversation history.
  • Branching conversations: Fork any message to explore alternative responses without losing the original thread.
  • Best choice when you want to compare cloud and local model outputs or need team access to multiple API providers from a single shared interface.
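A typical self-hosted deployment, sketched from LibreChat's Docker quickstart (steps and the default port may change; check the project README):

  git clone https://github.com/danny-avila/LibreChat.git
  cd LibreChat
  cp .env.example .env      # add API keys and/or a local OpenAI-compatible endpoint here
  docker compose up -d      # UI is served on http://localhost:3080 by default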
Chatbox
Desktop application for AI chat supporting OpenAI, Anthropic, Google, Azure, and local Ollama endpoints. Stores conversations locally; no cloud sync unless configured. Good low-friction option for non-technical users.
Free (basic) / Pro subscription Windows / macOS / Linux
  • Desktop-native: Installs like a normal application; no Docker or server setup needed. Connect to Ollama by pointing it to http://localhost:11434.
  • Privacy-first defaults: Conversation history stored locally by default. Suitable for sensitive work where cloud storage of chat history is undesirable.
Msty
Desktop AI assistant that aggregates local Ollama models and cloud APIs in a polished UI. Notable for its "Knowledge" feature — attach local documents to conversations — and side-by-side model comparison.
Free tier available Windows / macOS / Linux
  • Model comparison view: Query multiple models simultaneously and view responses side by side in the same window.
  • Knowledge library: Attach PDF, Word, or text files to a conversation with built-in RAG. No separate setup required.
  • Good choice for users who want a more polished interface than Open WebUI without the Docker setup complexity.

Cloud AI Services & API Access

Services that provide access to large frontier models via web interface or API. Useful when a task exceeds local hardware capacity or when the latest proprietary model is needed.

Service Models Available Pricing Privacy / Notes
Claude.ai (Anthropic) Claude 3.5 Haiku, Claude Sonnet 4, Claude Opus 4 Free tier / $20–$25/month Pro / API pay-per-token Strong system prompt adherence. Pro tier includes extended context (200K tokens), Projects (persistent memory), and document upload. API keys required for Claude Code.
ChatGPT (OpenAI) GPT-4o, o3, o3-mini, GPT-4.5 Free (GPT-4o limited) / $20/month Plus / API pay-per-token Widest tool-use ecosystem. GPT-4o includes image generation (DALL-E 3), web browsing, and code execution in one model. o3 is the strongest reasoning model for complex multi-step problems.
DuckDuckGo AI Chat Claude 3 Haiku, GPT-4o mini, Llama 3, Mistral Free (no account required) Privacy-focused: DuckDuckGo does not store conversations and removes identifying metadata before forwarding to model providers. Best for quick queries where privacy is a priority.
Perplexity Sonar (proprietary), Claude, GPT-4o Free / $20/month Pro AI search engine that cites sources with inline references. Excellent for research queries, real-time information, and technical documentation lookups. Pro tier adds file upload and image generation.
Phind Phind-70B (code-focused), GPT-4o Free (limited) / $20/month Pro Developer-focused search and chat. Specialises in code questions, pulls from Stack Overflow, GitHub, and documentation. Strong at answering "how do I do X in Y framework" questions with working code.
Google Gemini Gemini 2.0 Flash, Gemini 2.5 Pro Free / $20/month Advanced / API pay-per-token Gemini 2.5 Pro has the longest context window of any frontier model (1M tokens). Excellent for large document analysis. Deep Google Workspace integration. Gemini 2.0 Flash is extremely cost-effective via API.

Fine-tuning & Training Frameworks

Tools for customizing pre-trained models on your own data using LoRA, QLoRA, or full fine-tuning. Requires NVIDIA GPU (Linux) or Apple Silicon (MLX path only).

Axolotl
The most widely used open-source LoRA/QLoRA fine-tuning framework. YAML-based configuration means no Python scripting needed for most use cases. Supports virtually all major model architectures.
Open source Linux + CUDA
  • YAML configuration: Specify model, dataset, LoRA rank, batch size, learning rate, and all training parameters in a single config.yml. Run with axolotl train config.yml (see the sketch after this list).
  • Broad format support: Accepts ShareGPT, Alpaca, instruction, and completion formats out of the box. Custom tokenization templates for any chat format.
  • QLoRA support: Load the base model in 4-bit BitsAndBytes precision while training LoRA adapters in FP16 — fine-tune a 70B model on a single A100 80GB.
  • DeepSpeed integration: Enables multi-GPU and CPU memory offloading for training models that don't fit in VRAM alone.
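A minimal QLoRA config sketch for the workflow above (hyperparameters, the model ID, and the dataset path are illustrative; the Axolotl examples directory has complete, tested configs):

  # config.yml (sketch)
  base_model: meta-llama/Llama-3.1-8B
  load_in_4bit: true
  adapter: qlora
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05
  lora_target_linear: true
  datasets:
    - path: ./data/instructions.jsonl
      type: alpaca
  sequence_len: 2048
  micro_batch_size: 2
  gradient_accumulation_steps: 4
  num_epochs: 1
  learning_rate: 0.0002
  output_dir: ./outputs/llama31-8b-qlora

  # then run: axolotl train config.yml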
Unsloth
Accelerated LoRA fine-tuning library that rewrites attention kernels in Triton for 2× speed and up to 70% less VRAM compared to standard HuggingFace PEFT. The go-to tool for consumer GPU fine-tuning.
Open source Linux + CUDA
  • 2× faster than HF PEFT at the same quality. Fine-tune Llama 3.1 8B on a 24 GB GPU in under 2 hours on a typical instruction dataset.
  • 70% less VRAM. Fine-tune a 7B model on 6 GB VRAM; a 13B on 12 GB. Enables fine-tuning on consumer RTX 3060/4060 GPUs that previously could not run training at all.
  • Colab-native: Unsloth provides free Google Colab notebooks for every major model — fine-tune without owning any GPU hardware.
  • Export to GGUF: Built-in export to GGUF quantization formats (Q4_K_M, Q5_K_M, Q8_0) immediately after training.
LLaMA-Factory
Web UI and CLI for fine-tuning over 100 model architectures with LoRA, QLoRA, DoRA, ORPO, DPO, and RLHF. Provides a visual training dashboard, dataset management, and evaluation tools in one interface.
Open source Linux + CUDA (primary)
  • Web UI (LlamaBoard): Configure and launch fine-tuning jobs from a browser without any command line or Python knowledge (launch commands are sketched after this list).
  • Advanced alignment methods: Supports RLHF, DPO, ORPO, and reward model training, not just supervised fine-tuning. One of the most comprehensive fine-tuning toolkits available.
  • Multi-GPU and distributed training via DeepSpeed and FSDP with minimal configuration.
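The two common entry points, sketched (the llamafactory-cli commands follow the project's documented usage; the YAML file name is a placeholder):

  pip install llamafactory                      # or install from source per the README
  llamafactory-cli webui                        # launches the LlamaBoard web UI in a browser
  llamafactory-cli train my_lora_config.yaml    # headless training from a YAML config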
MLX Fine-tuning (Apple)
Apple's MLX framework includes a built-in LoRA fine-tuning path via mlx_lm.lora. The main practical fine-tuning route on Apple Silicon, with no CUDA, Axolotl, or Unsloth needed.
Open source macOS (Apple Silicon only)
  • Native Apple Silicon acceleration: Uses Metal for all matrix operations — no overhead from cross-platform abstraction layers.
  • LoRA and QLoRA supported on most Llama, Mistral, Phi, and Qwen architecture models. Uses MLX native int4 quantization for the base model during QLoRA.
  • Limitation: Dataset format support is narrower than Axolotl. Works best for instruction fine-tuning on simple JSONL datasets. Complex alignment training (DPO, RLHF) is not yet fully supported.
  • Run with: mlx_lm.lora --model <path> --train --data ./data --iters 1000 --batch-size 4
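The --data directory is expected to contain train.jsonl (and optionally valid.jsonl). A common minimal format is one JSON object per line with a single text field; after training, adapters can be fused into standalone weights with mlx_lm.fuse (a sketch; exact flag names vary slightly between mlx_lm releases):

  # data/train.jsonl, one example per line
  {"text": "### Instruction: Summarize the quarterly report.\n### Response: ..."}

  # merge the trained LoRA adapters into the base model
  mlx_lm.fuse --model <path> --adapter-path adapters --save-path ./fused-model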

Model Discovery, Evaluation & Experiment Tracking

Tool Purpose Cost Key Features
Hugging Face Hub Model repository & discovery Free (storage limits) / Pro 800,000+ models, datasets, and Spaces. Filter by task, license, and quantization format. The standard place to find GGUF, GPTQ, and AWQ quantized model variants. Direct download links for Ollama and llama.cpp.
Ollama Model Library Curated models for Ollama Free Curated, verified GGUF models runnable with one command. Smaller and more opinionated than HF Hub. Includes all popular models (Llama 3, Mistral, Qwen, Phi, Gemma) pre-tested with Ollama.
LMSYS Chatbot Arena Blind model evaluation leaderboard Free Community-ranked model leaderboard based on blind A/B comparisons. The most reliable quality ranking because it reflects human preferences, not just academic benchmarks. Good for choosing between similar model sizes.
Weights & Biases (W&B) Experiment tracking & ML monitoring Free (personal) / $50/month Teams Log training loss curves, hyperparameters, GPU metrics, and model checkpoints. Integrates with Axolotl, Unsloth, and transformers natively. The standard tool for tracking fine-tuning runs.
MLflow Open-source experiment tracking Free (self-hosted) Local alternative to W&B. Track experiments, compare model versions, and serve models via its built-in REST API. No cloud dependency — runs on your own machine or server.
lm-evaluation-harness Standardized model benchmarking Free (open source) Eleuther AI's framework for running standard benchmarks (MMLU, HellaSwag, ARC, TruthfulQA) against local models. The tool used to generate most public benchmark numbers. Run with lm_eval --model hf --model_args pretrained=<model> --tasks mmlu.
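As a worked example tying the Hugging Face Hub and Ollama Model Library rows together, a quantized GGUF can be downloaded from the Hub and registered as a local Ollama model (repository and file names are placeholders):

  huggingface-cli download <user>/<model>-GGUF <model>-Q4_K_M.gguf --local-dir .
  printf 'FROM ./<model>-Q4_K_M.gguf\n' > Modelfile
  ollama create my-local-model -f Modelfile
  ollama run my-local-model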