AI Workstation & GPU Performance Report (2026)

A comparative analysis of pre-built and DIY systems for local AI development.

Featured Recommendations by OS & Budget

macOS
Top Performance
Mac Studio M3 Ultra — 512 GB
~$11,000
405B Q4 capable · ~135 t/s on 7B · max context ~800K tokens (7B Q4)
High End
Mac Studio M3 Ultra — 128 GB
~$8,500
120B Q4 capable · ~130 t/s on 7B · max context ~400K tokens (7B Q4)
Mid Range
MacBook Pro M4 Max — 128 GB
~$4,200
70B Q4 capable · ~75 t/s on 7B · max context ~200K tokens (7B Q4)
Budget
Mac mini M4 Pro — 24 GB
~$1,400
13B Q4 capable · ~50 t/s on 7B · max context ~14K tokens (7B Q4)
Windows
Top Performance
HP Z8 G4 — Quad NVIDIA A100 40 GB
~$30,000
70B unquantized · ~540 t/s on 7B · max context ~128K tokens (70B FP16)
High End
Lenovo ThinkStation P620 — 2× RTX A6000
~$15,000
70B Q4 capable (96 GB VRAM) · ~300 t/s on 7B · max context ~72K tokens (70B Q4)
Mid Range
HP OMEN 45L — NVIDIA RTX 5090
~$4,500
26B Q4 in VRAM · ~250 t/s on 7B · max context ~24K tokens (7B Q4)
Budget
DIY Build — RTX 4070 Ti Super 16 GB
~$1,200
13B Q4 capable · ~90 t/s on 7B · max context ~9K tokens (7B Q4)
Linux
Top Performance
Dell Precision 7960 — Quad RTX 5000 Ada
~$20,000
120B+ Q4 capable · ~450 t/s on 7B · max context ~200K tokens (70B Q4)
High End
DIY — NVIDIA A100 80 GB PCIe
~$6,000
70B Q4 (80 GB HBM2e) · ~220 t/s on 7B · max context ~80K tokens (70B Q4)
Mid Range
DIY — RTX 5090 + Ryzen 9 9950X
~$3,500
26B Q4 in VRAM · ~250 t/s on 7B · max context ~24K tokens (7B Q4)
Budget
DIY — AMD RX 7900 XTX 24 GB
~$1,400
20B Q4 via ROCm/Vulkan · ~135 t/s on 7B · max context ~14K tokens (7B Q4)

All systems are evaluated for AI/ML workloads. TG = Tokens/sec Generation (batch size 1).

Rank Model Type GPU VRAM Best Model (Size) Tokens/sec (TG) Price Notes & Evidence
1 HP Z8 G4 (Custom) Desktop Quad NVIDIA A100 40GB 160 GB 70B+ (Unquantized) ~540 $30,000
  • Data Center Power: Can run unquantized 70B models entirely in VRAM.
  • NVLink Advantage: Full peer-to-peer communication between GPUs is critical for massive model inference.
  • Secondary Market: Often available on the used market, making it a "budget" entry into high-end AI.
  • Power/Cooling: Requires significant power and cooling infrastructure not suitable for a typical office.
2 Dell Precision 7960 Tower Desktop Quad NVIDIA RTX 5000 Ada 128 GB 120B+ (Quantized) ~450 $20,000
  • Enterprise Grade: Built for high-concurrency professional AI, with certified drivers and support.
  • RTX 6000 Ada Recommended: For max performance, users often upgrade to RTX 6000 Ada GPUs for more VRAM and CUDA cores.
  • Multi-GPU pooling: Ada-generation RTX cards do not support NVLink, so the four GPUs are pooled over PCIe with tensor parallelism (e.g. vLLM).
3 Apple Mac Studio (M3 Ultra) Desktop Apple 80-core GPU 128 GB (Unified) 120B+ (Quantized) ~97 $8,000
  • Memory Bandwidth King: Unmatched unified memory is ideal for huge model inference.
  • Underwhelming TG?: Some users report tokens/sec (TG) is lower than expected vs. high-end NVIDIA cards (e.g., 97 t/s vs 136 t/s on an RTX 3090 for some models).
  • Silent & Efficient: Consumes far less power and is virtually silent compared to NVIDIA workstations.
  • MLX Framework: Apple's MLX framework can be 20-30% faster than other libraries like Ollama on Apple Silicon.
4 HP OMEN 45L (2025) Desktop NVIDIA RTX 5090 32 GB 70B+ (Quantized) ~140 $4,500
  • New Flagship: The RTX 5090 offers ~1,792 GB/s bandwidth, a significant leap for memory-bound LLM tasks.
  • MoE Performance: Excels at Mixture-of-Experts models, with benchmarks showing 100-140 t/s generation.
  • "Space Heater": Early reports suggest very high power draw and thermal output, requiring excellent case airflow.
5 Alienware Aurora R16 Desktop NVIDIA RTX 4090 24 GB 30B-70B ~95 $4,000
  • Community Favorite: Widely regarded as the best value for running 7B to 30B models locally.
  • 30% Slower than 5090: A significant performance gap, but at a much lower price point.
  • Stable & Supported: Mature drivers and extensive community support for AI/ML workloads.
6 Lenovo ThinkStation P620 Desktop Dual NVIDIA RTX A6000 96 GB 120B+ (Quantized) ~300 $15,000
  • CPU Powerhouse: The Threadripper Pro 5995WX CPU provides massive PCIe bandwidth for multiple GPUs.
  • PSU/Fitment Issues: Real-world user threads mention PSU limitations and tight physical space when fitting two high-end GPUs.
  • Dual A6000 Recommended: Community advice points to using two RTX A6000s as a potent, albeit expensive, combination.
7 Asus ROG Strix Scar 18 Laptop NVIDIA RTX 5090 (Laptop) 24 GB 30B-70B ~80 $4,500
  • Desktop-Class Laptop GPU: Benchmarks from similar MSI models show impressive performance but with caveats.
  • Thermal Throttling: The main challenge is keeping the GPU cool enough to sustain peak performance.
  • VRAM is 24GB: Note: Leaked specs showing 32GB were incorrect for the laptop version.
8 Apple MacBook Pro 16 (M4 Max) Laptop Apple 40-core GPU 128 GB (Unified) 30B-70B ~75 $4,200
  • Popular with ML Engineers: Excellent for development and testing on the go due to its large unified memory.
  • "Not for Heavy Training": Community consensus is that it's an inference machine, not for training large models from scratch.
  • Performance: Can achieve 70-80 t/s on 7B models (Q4 quantized) via MLX; 70B Q4 fits at ~6 t/s.
9 Asus ROG Flow Z13 (2025) Laptop AMD Ryzen AI Max+ 395 (iGPU) 128 GB (Unified) 7B-30B ~90 $2,100
  • The "Unicorn" APU: The Ryzen AI Max+ 395 supports up to 128 GB unified memory — the cheapest x86 path to 70B.
  • ROCm vs. Vulkan: Performance depends heavily on the backend used; Vulkan in llama.cpp often outperforms ROCm for pure inference.
  • BIOS Gotcha: Users report needing to manually allocate 96–128 GB of system RAM to the iGPU in BIOS for best performance.
10 ASUS Zenbook A14 (2025) Laptop Qualcomm Snapdragon X (Hexagon NPU) 32 GB (Shared) 7B-30B ~6 $1,800
  • On-Device AI Focus: Designed for efficient, low-power AI tasks.
  • NPU Not Yet Supported: The powerful NPU is not currently usable by most open-source LLM tools (e.g., llama.cpp).
  • Community Verdict: Currently not recommended for serious local LLM work until software support matures. Performance is low (~6 t/s on 32B models).
11 Apple MacBook Air 15" (M4) Laptop Apple 10-core GPU (M4) 32 GB (Unified) 7B-20B ~38 $1,299
  • Best value macOS laptop: 32 GB unified memory at $1,299 is the cheapest Apple Silicon path to running 20B Q4 models.
  • Bandwidth limitation: LPDDR5X at ~120 GB/s is roughly 3x slower than the M4 Max, so tokens/sec on 7B is roughly half that of a MacBook Pro.
  • Fanless: The M4 Air is fanless, so sustained AI inference will throttle after several minutes; burst speed is good, sustained speed drops.
  • Best for: Developers who need portability and occasionally run 7-13B models. Not suited for production or long inference sessions.
12 Apple MacBook Air 13" (M3) Laptop Apple 10-core GPU (M3) 24 GB (Unified) 7B-13B ~30 $1,099
  • Cheapest modern macOS option: 24 GB unified memory runs 7B and 13B Q4 models reliably at modest speed.
  • Fanless throttle warning: Sustained inference causes thermal throttling within a few minutes on the base config.
  • Good upgrade path: Maxing RAM to 24 GB at purchase is highly recommended; memory cannot be upgraded later.
  • Community verdict: Solid for trying out local LLMs casually, but the M4 Air is worth the extra $200 for higher bandwidth.
13 Apple Mac mini (M4 Pro) Desktop Apple 20-core GPU (M4 Pro) 48 GB (Unified) 7B-40B ~65 $1,999
  • Best compact macOS desktop: 48 GB unified memory at $1,999 comfortably runs 30B-40B Q4 models.
  • M4 Pro bandwidth: ~273 GB/s is significantly higher than the base M4, giving much better sustained inference speed.
  • Small footprint: Fits on a crowded desk; runs silently under typical LLM loads.
  • Recommended config: 48 GB version over 24 GB; the extra $600 opens up a much wider range of models.
14 Apple iMac 24" (M4) Desktop Apple 10-core GPU (M4) 32 GB (Unified) 7B-20B ~40 $1,699
  • All-in-one convenience: The iMac integrates display, speakers, and webcam; no separate monitor needed.
  • Bandwidth same as MacBook Air: M4 base chip has ~120 GB/s of LPDDR5X bandwidth, limiting token speed vs M4 Pro/Max.
  • 32 GB ceiling: 32 GB is the maximum RAM option. Limits models to 20B Q4 or smaller.
  • Best for: General-purpose users who want macOS + LLM capability without building a multi-device setup.
15 Apple MacBook Pro 14" (M4 Pro) Laptop Apple 20-core GPU (M4 Pro) 48 GB (Unified) 7B-40B ~58 $2,599
  • Best portable value for 30B: 48 GB and M4 Pro bandwidth make it the cheapest Apple laptop that comfortably runs 30B Q4.
  • Active cooling: Unlike the MacBook Air, the Pro has a fan, so sustained inference does not throttle.
  • vs 16" M4 Max: The 14" M4 Pro at $2,599 is roughly half the cost and 25% slower per token; a reasonable tradeoff for portability.
  • MLX compatible: Full support for Apple MLX framework for faster quantized inference.
16 Minisforum AI Max+ 395 (128 GB) Desktop AMD Ryzen AI Max+ 395 (RDNA 3.5 iGPU) 128 GB (Unified) 7B-70B ~90 $1,600
  • Cheapest x86 path to 70B: 128 GB unified at $1,600 is unmatched by any discrete GPU system anywhere near this price.
  • BIOS setup required: Must manually allocate 96-128 GB to the iGPU in BIOS; factory default is much lower.
  • llama.cpp Vulkan: Use the Vulkan backend for best inference speed. ROCm support exists but adds setup complexity.
  • Bandwidth limitation: 256 GB/s LPDDR5X is 3x slower than M3 Ultra; 70B runs at ~6 t/s vs ~18 t/s on Apple Silicon.
17 Razer Blade 16 (RTX 5090) Laptop NVIDIA RTX 5090 (Laptop) 24 GB 7B-20B ~120 $4,499
  • Fastest Windows LLM laptop: The RTX 5090 Laptop GPU (roughly 150-175 W TGP) uses GDDR7 and is the highest-bandwidth laptop GPU available.
  • 24 GB VRAM limit: Still capped at 24 GB despite the new generation; limits to ~20B Q4 fully in VRAM.
  • Battery life: Expect under 2 hours of battery during inference at full GPU load. Best used plugged in.
  • Thin chassis: Razer's cooling is above average for a thin laptop but still thermally constrained vs the ROG Scar 18.
18 Acer Predator Helios 18 (RTX 4090) Laptop NVIDIA RTX 4090 (Laptop) 16 GB 7B-13B ~75 $2,799
  • Large cooling advantage: The 18" chassis allows higher sustained TGP than thinner 16" laptops, giving more consistent performance.
  • Only 16 GB VRAM: Despite the flagship GPU brand, laptop RTX 4090s ship with 16 GB, not 24 GB as in the desktop version.
  • Value vs desktop: A DIY RTX 4090 desktop at the same price delivers 24 GB VRAM and significantly more sustained performance.
  • Good for travel: If you need a powerful portable Windows machine for on-site or travel inference work, this is a strong option.
19 Lenovo Legion Pro 7i Gen 9 (RTX 4090) Laptop NVIDIA RTX 4090 (Laptop) 16 GB 7B-13B ~72 $2,799
  • Popular with developers: Strong CUDA ecosystem, good display, and widely supported driver stack make this a practical choice.
  • 16 GB VRAM ceiling: Laptop RTX 4090 is capped at 16 GB. 13B Q4 fits; 20B Q4 requires partial CPU offloading (slower).
  • MUX switch: Has a MUX switch for direct GPU display output, reducing latency and improving GPU utilisation during inference.
  • Better value at $2,200: Lenovo frequently discounts this model by $500-600; watch for sales before buying.
20 NZXT Player:Two Max (RTX 4090) Desktop NVIDIA RTX 4090 24 GB 7B-20B ~140 $2,800
  • Pre-built value pick: Full-fat RTX 4090 in a proper ATX tower at $2,800; cheaper than Alienware Aurora R16 with similar specs.
  • Standard form factor: Uses a proper ATX mid-tower with a standard PSU, so the GPU can be swapped to an RTX 5090 later.
  • RAM limited: Ships with 32 GB DDR5; upgrade to 64 GB recommended for large CPU offload buffers when running 70B.
  • No bloatware: NZXT ships a cleaner Windows install than most OEM gaming brands.
21 CyberPowerPC Gamer Supreme (RTX 5080) Desktop NVIDIA RTX 5080 16 GB 7B-13B ~170 $2,400
  • Strong 7B speed at this price: RTX 5080 GDDR7 bandwidth (~960 GB/s) delivers ~170 t/s on 7B Q4, competitive with older 4090 systems in the same workload.
  • 16 GB VRAM limit: Only 16 GB VRAM limits it to 13B Q4 fully in VRAM. 20B and larger require CPU offloading.
  • Good upgrade path: Standard ATX build means the GPU can be replaced with an RTX 5090 or future card without replacing the whole system.
  • Community note: CyberPower and iBUYPOWER quality has improved significantly; capacitor and PSU quality are now comparable to most branded systems.
22 HP Victus 16 (RTX 4060) Laptop NVIDIA RTX 4060 (Laptop) 8 GB 7B only ~45 $849
  • Entry-level LLM laptop: The cheapest new Windows laptop that can run a 7B Q4 model reasonably in VRAM (8 GB is the absolute minimum).
  • 8 GB is the hard floor: Only 7B Q4 fits in VRAM. Anything larger requires CPU RAM offloading, dropping speed to 3-5 t/s.
  • Great for learning: If you are new to local LLMs and want to experiment without a large budget, this is a practical starting point.
  • Upgrade ceiling is low: The 8 GB VRAM cannot be upgraded; if you plan to scale up within a year, budgeting for 16 GB VRAM (RTX 4060 Ti or better) is worth it.

DIY Build Guides by Model Target & Budget

Builds are grouped by the model size you want to run and by budget range. Prices are estimates as of early 2026; GPU prices shift frequently.

The 7B Daily Driver
Best 7-13B performance under $1,000
GPU: NVIDIA RTX 4060 Ti 16 GB
CPU: Intel Core i5-13400 (10-core)
RAM: 32 GB DDR5-5200
Storage: 1 TB NVMe Gen4
PSU: 650W 80+ Gold
Est. Cost: ~$850
Max model (VRAM): 13B Q4 in VRAM
7B TG speed: ~75 t/s
Save ~29% vs HP Victus 16 RTX 4060 Ti (~$1,200 pre-built)
Budget

16 GB GDDR6 - best value VRAM for price. 7B and 13B Q4 models run fully in VRAM. Use ExLlamaV2 for maximum speed on CUDA.

The AMD Value Build
Best ROCm/Vulkan inference under $1,100
GPU: AMD RX 7900 GRE 16 GB
CPU: AMD Ryzen 5 7600 (6-core)
RAM: 32 GB DDR5-6000
Storage: 1 TB NVMe Gen4
PSU: 650W 80+ Gold
Est. Cost: ~$950
Max model (VRAM): 13B Q4 in VRAM
7B TG speed: ~80 t/s (Vulkan)
Save ~32% vs HP Victus 16 RX 7800 XT (~$1,400 pre-built)
Budget

RX 7900 GRE has 576 GB/s bandwidth and 16 GB VRAM. Use llama.cpp Vulkan backend for best inference speed. ROCm optional for PyTorch.

The 20B Budget Build
Run 20B Q4 for under $1,200
GPU: NVIDIA RTX 3090 24 GB (used)
CPU: AMD Ryzen 7 5700X (8-core)
RAM: 64 GB DDR4-3600
Storage: 2 TB NVMe Gen3
PSU: 850W 80+ Gold
Est. Cost: ~$1,100
Max model (VRAM): 20B Q4 in VRAM
7B TG speed: ~130 t/s
Save ~39% vs pre-built RTX 3090 builds (~$1,800 pre-built)
Budget

Used RTX 3090 (24 GB / 936 GB/s) performs near-identically to an RTX 4090 for inference. Best value path to 20B Q4. Use a single GPU only - no SLI.

The RTX 5090 Flagship Build
Maximum consumer GPU inference speed
GPU: NVIDIA RTX 5090 32 GB
CPU: Intel Core i9-14900K (24-core)
RAM: 128 GB DDR5-6400
Storage: 4 TB NVMe Gen4
PSU: 1200W 80+ Platinum
Est. Cost: ~$3,500
Max model (VRAM): 26B Q4 in VRAM
7B TG speed: ~250 t/s
Save ~22% vs HP OMEN 45L RTX 5090 (~$4,500 pre-built)
Mid-range

1,792 GB/s GDDR7 bandwidth - fastest consumer card. 32 GB VRAM limits models to ~26B Q4. Ensure 360mm AIO or large tower for cooling.

The Creator Mid-Range
RTX 4090 all-rounder for 20B+ workloads
GPU: NVIDIA RTX 4090 24 GB
CPU: AMD Ryzen 9 7950X (16-core)
RAM: 128 GB DDR5-6000
Storage: 4 TB NVMe Gen4
PSU: 1000W 80+ Gold
Est. Cost: ~$3,000
Max model (VRAM): 20B Q4 in VRAM; 70B offloads to RAM
7B TG speed: ~140 t/s
Save ~25% vs Alienware Aurora R16 RTX 4090 (~$4,000 pre-built)
Mid-range

Comparable to Alienware Aurora R16 at ~25% lower cost. 128 GB system RAM allows 70B CPU offload (slow, ~3 t/s) but 20B in VRAM runs at full speed.

The Ryzen AI Max 70B Build
Cheapest x86 path to 70B at real speed
System: Minisforum AI Max+ 395 Mini PC
CPU/iGPU: Ryzen AI Max+ 395 (16-core, RDNA 3.5)
Unified RAM: 128 GB LPDDR5X
Storage: 2 TB NVMe (user-upgradeable)
Est. Cost: ~$1,600
Max model (RAM): 70B Q4 (~40 GB used)
7B TG speed: ~90 t/s (Vulkan)
Save ~24% vs Asus ROG Flow Z13 128 GB (~$2,100 pre-built)
Mid-range

Not strictly DIY, but highly upgradeable mini PC. Set iGPU VRAM allocation to 96-128 GB in BIOS. 70B Q4 runs at ~6 t/s; 7B at ~90 t/s. Best value for 70B access on Windows/Linux.

The A100 80 GB Single-GPU Build
Single pro GPU that fits 70B Q4 natively
GPU: NVIDIA A100 80 GB PCIe (used)
CPU: AMD Threadripper PRO 5955WX (16-core)
RAM: 256 GB ECC DDR4
Storage: 8 TB NVMe Gen4
PSU: 1600W 80+ Platinum
Est. Cost: ~$6,000
Max model (VRAM): 70B Q4 fully in VRAM (80 GB HBM2e)
7B TG speed: ~220 t/s
Save ~60% vs Dell Precision 7960 with A100 (~$15,000 pre-built)
High-end

Used A100 80 GB PCIe cards available for $4,000-6,000 on secondary market. Workstation chassis required. 1,935 GB/s HBM2e bandwidth. Best Linux vLLM platform for a single card.

The Dual A6000 Workstation
96 GB VRAM for 70B Q4 plus headroom
GPU: 2x NVIDIA RTX A6000 48 GB
CPU: AMD Threadripper PRO 5975WX (32-core)
RAM: 256 GB DDR4 ECC
Storage: 8 TB NVMe RAID
PSU: 1600W 80+ Platinum
Est. Cost: ~$9,000
Max model (VRAM): 70B Q4 with context headroom
7B TG speed: ~360 t/s (tensor parallel)
Save ~40% vs Lenovo ThinkStation P620 dual A6000 (~$15,000 pre-built)
High-end

Requires WRX80 motherboard. Both A6000s share 96 GB via vLLM tensor parallelism. NVLink bridge optional. Linux + vLLM recommended.

The RTX 6000 Ada Pro Build
Single 48 GB Ada GPU with a second-card upgrade path
GPU: NVIDIA RTX 6000 Ada 48 GB
CPU: Intel Core i9-14900K (24-core)
RAM: 128 GB DDR5-6400
Storage: 4 TB NVMe Gen4
PSU: 1200W 80+ Platinum
Est. Cost: ~$7,500
Max model (VRAM): 40B Q4 easily; 70B Q3 marginal
7B TG speed: ~200 t/s
Save ~38% vs HP Z8 G4 single RTX 6000 Ada config (~$12,000 pre-built)
High-end

960 GB/s bandwidth with ECC memory. A second card can be added later for 96 GB total; Ada-generation cards have no NVLink, so a pair communicates over PCIe via tensor parallelism. ATX-compatible in a standard full-tower. Windows and Linux supported.

The 120B+ Extreme Workstation
Quad A6000 - 192 GB VRAM for massive models
GPU: 4x NVIDIA RTX A6000 48 GB
CPU: AMD Threadripper PRO 5995WX (64-core)
RAM: 512 GB DDR4 ECC
Storage: 16 TB NVMe RAID
PSU: 2x 1600W redundant
Est. Cost: ~$18,000
Max model (VRAM): 120B Q4 in 192 GB VRAM
7B TG speed: ~720 t/s combined
Save ~40% vs HP Z8 G4 quad A100 config (~$30,000 pre-built)
Pro / Extreme

WRX80 EATX board required. Quad-slot GPU spacing mandatory - verify physical fitment. Linux + vLLM with tp=4. Requires a 30A+ dedicated circuit.

The Dual H100 Server Build
160 GB HBM3 - production inference at home
GPU: 2x NVIDIA H100 80 GB PCIe (used/refurb)
CPU: AMD EPYC 7543 (32-core)
RAM: 512 GB DDR4 ECC RDIMM
Storage: 16 TB NVMe enterprise
Chassis: 4U server chassis
Est. Cost: ~$25,000
Max model (VRAM): 120B Q4 natively; 70B FP16
7B TG speed: ~520 t/s combined
Custom build only: there is no direct pre-built equivalent for this server-grade spec.
Pro / Extreme

Used H100 PCIe cards appear on secondary market for $12,000-18,000 each. Requires server-class motherboard with IOMMU support. vLLM with Flash Attention 2 and tp=2 is the recommended production stack.

DIY vs. Pre-built: When Each Makes Sense

Why Build DIY

Cost savings: An RTX 4090 DIY build runs ~$3,000 vs $4,000+ for the Alienware Aurora R16 - 20-30% less for identical performance.

Component choice: Select your own PSU quality, motherboard stability, and case airflow. Pre-builds often cut corners on PSU brand and motherboard tier.

Upgrade path: Replace the GPU in 2 years without buying a new system. Pre-built OEM cases often use non-standard form factors.

Exact VRAM targeting: A used RTX 3090 at 24 GB costs $400-600 and performs nearly identically to an RTX 4090 for LLM inference.

Why Buy Pre-built

Warranty & support: A single vendor for all troubleshooting. For production AI work, downtime costs more than savings.

Certified driver stacks: HP and Dell workstations ship with validated configurations for GPU-intensive workloads.

Specialized hardware: Quad-GPU systems with NVLink bridges and server-class PSUs are impractical to DIY. The HP Z8 G4 and Dell Precision 7960 are only practical as pre-built or refurb.

Apple Silicon: There is no DIY path to Apple unified memory. Mac Studio is the only option for 128 GB+ high-bandwidth unified memory.

Best Value Summary

Under $1,500 - DIY wins: RTX 3090 (used) + budget CPU + 64 GB DDR4 outperforms any pre-built at this price.

$3,000-5,000 - DIY saves 20-30%: Self-built RTX 4090 or 5090 system matches Alienware/ROG for less money.

$8,000-20,000 - Pre-built pulls ahead: Workstation-class multi-GPU configs from HP/Dell/Lenovo come with support and certified setups hard to replicate DIY.

Apple Silicon - no DIY option: Mac Studio M3 Ultra is unique in offering 192-512 GB high-bandwidth unified memory in a consumer form factor.


Manufacturer Tradeoffs

NVIDIA

  • Speed king: Highest tokens/sec for models that fit in VRAM — RTX 5090 at 1,792 GB/s is unmatched on the consumer market.
  • CUDA ecosystem: Every AI framework (PyTorch, vLLM, TensorRT, llama.cpp) works out of the box. Zero configuration headaches.
  • VRAM ceiling: Consumer cards top out at 32GB (5090). Professional cards (RTX 6000 Ada: 48GB, A100: 80GB) cost significantly more.
  • Training: Best-in-class for fine-tuning and training with full CUDA support.
  • Power draw: RTX 5090 ≈ 575W — a space heater. Dedicated airflow/cooling needed.
  • P2P blocked: Consumer multi-GPU peer-to-peer is disabled by default in drivers (workaround requires patched kernel modules).
Choose NVIDIA if: you run models ≤30B and want maximum speed, or need fine-tuning & training.

AMD

  • Ryzen AI Max+ 395: The standout — 128GB unified memory at ~$1,500–2,500 is the cheapest way to run 70B models on x86.
  • Bandwidth: Radeon RX 7900 XTX has 960 GB/s — competitive with NVIDIA RTX 4090 (1,008 GB/s), at a lower price.
  • ROCm ecosystem: Improving rapidly (vLLM support added 2024), but still lags CUDA. Occasional bugs, fewer tutorials.
  • Vulkan backend: llama.cpp Vulkan is often faster than ROCm for pure inference; ~20% advantage in recent tests.
  • Discrete VRAM limit: RX 7900 XTX tops at 24GB — same ceiling as RTX 4090.
  • BIOS gotcha: Ryzen AI Max laptops default to 32GB GPU allocation; must manually set to 96–128GB for proper performance.
Choose AMD if: you need 70B+ on a budget (Ryzen AI Max+ 395), or want competitive discrete GPU performance at lower cost than NVIDIA.

Apple Silicon

  • Unified memory: M3 Ultra at 192–512GB is the only consumer path to running 200B+ models locally without a multi-GPU rig.
  • MLX framework: Apple-optimized; 20–30% faster than Ollama for inference. Speculative decoding can roughly double token generation.
  • Token generation speed: M3 Ultra (~800 GB/s) loses to RTX 5090 (1,792 GB/s) for models that fit in VRAM — ~97 t/s vs ~250 t/s on 7B.
  • Power efficiency: Mac Studio at ~150W vs RTX 5090 system at ~800W+. Silent under moderate load.
  • Concurrency: Handles concurrent requests worse than NVIDIA; not ideal for multi-user serving.
  • No GPU-only option: Apple Silicon is integrated — you buy the whole computer, not a card.
  • macOS only: No Linux, no CUDA. Only MLX, llama.cpp, and Ollama with Metal.
Choose Apple if: you need to run very large models (70B+) in a quiet, efficient package, or you're already in the Apple ecosystem.


Optimization Guide — Getting Maximum Performance from Local LLMs

Inference speed for locally self-hosted LLMs is almost entirely memory-bandwidth-bound, not compute-bound. Each token generated requires reading the full model weights from memory once. The faster your memory bus, the faster your tokens. GPU VRAM bandwidth (>1,000 GB/s) dwarfs CPU DRAM bandwidth (30–100 GB/s), which is why a GPU can be 10–30× faster than CPU-only. Knowing which bottleneck you have lets you choose the right optimization.
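
Memory-bandwidth ceilings can be estimated on the back of an envelope: divide bandwidth by the bytes read per token, which is roughly the quantised model size. The sketch below does exactly that; the bandwidth and model-size figures are illustrative, and real systems land below the ceiling because of compute, sampling, and KV-cache traffic.

```bash
# Back-of-envelope ceiling: tokens/sec ~ memory bandwidth / bytes read per token.
# Each generated token streams the full set of quantised weights once, so the
# model's on-disk size (KV-cache traffic ignored) sets the upper bound.
estimate_tps() {
  local bandwidth_gbs=$1   # memory bandwidth in GB/s
  local model_gb=$2        # quantised model size in GB
  awk -v bw="$bandwidth_gbs" -v sz="$model_gb" \
    'BEGIN { printf "ceiling: ~%.0f t/s\n", bw / sz }'
}

estimate_tps 1792 4.4   # RTX 5090 + 7B Q4 (~4.4 GB)  -> ~407 t/s ceiling
estimate_tps 800  40    # M3 Ultra + 70B Q4 (~40 GB)  -> ~20 t/s ceiling
estimate_tps 64   4.4   # dual-channel DDR5 + 7B Q4   -> ~15 t/s ceiling
```

The measured figures in the table below sit under these ceilings but track them closely, which is why memory bandwidth is the first number to check on any spec sheet.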

What Actually Limits Performance — by Hardware Path

Hardware Path | Primary Bottleneck | Secondary Bottleneck | Max RAM Available | Typical 7B t/s | Typical 70B t/s
NVIDIA GPU (e.g. RTX 5090) | VRAM capacity (32 GB ceiling on consumer cards) | VRAM bandwidth when model fits | 32 GB VRAM + system RAM for overflow | ~250 | N/A (model too large for VRAM)
NVIDIA Pro GPU (e.g. A100 80GB) | VRAM bandwidth (2,000 GB/s) | PCIe bandwidth for multi-GPU | 80 GB per card | ~260 | ~22 (full VRAM)
Apple Silicon (e.g. M3 Ultra) | Unified memory bandwidth (~800 GB/s) | Metal compute shader efficiency | 192–512 GB (model & OS share same pool) | ~120 | ~18–20
AMD RX 7900 XTX | VRAM capacity (24 GB ceiling) | ROCm/HIP software overhead | 24 GB VRAM | ~135 | N/A
AMD Ryzen AI Max+ 395 | Unified memory bandwidth (~256 GB/s) | ROCm vs. Vulkan backend selection | 128 GB unified | ~90 | ~6
CPU-only (e.g. i9-13900K, DDR5) | System RAM bandwidth (40–80 GB/s) | Thread count, NUMA topology | Addressable system RAM | ~15–25 | ~2–4

Token speeds are batch-size-1 (single-user) generation using Q4_K_M quantisation. Multi-user batch inference favours higher-compute GPUs more.

Quantisation — The Most Impactful Software Setting

Quantisation reduces model precision so that larger models fit in memory and each generated token requires reading fewer bytes. Choosing the right level is the single biggest lever available regardless of hardware.

Format | Bits/Weight | RAM vs FP16 | Quality vs FP16 | Best Use | Speedup vs FP16
FP32 | 32 | 2× more | Identical | Training only | 0.5×
FP16 / BF16 | 16 | 1× (baseline) | Near-lossless | High-VRAM GPU inference, fine-tuning | 1× (baseline)
Q8_0 | 8 | ~0.5× | Effectively lossless (<0.1% perplexity) | When you have VRAM to spare | 1.8×
Q6_K | 6 | ~0.37× | Minimal quality loss | Large models on capacity-limited systems | 2.2×
Q5_K_M | 5 | ~0.31× | Slight loss, noticeable only on coding | Best quality when RAM is just enough | 2.8×
Q4_K_M | 4 | ~0.26× | Good — recommended default | Default recommendation for all hardware | 3.2×
Q3_K_M | 3 | ~0.19× | Noticeable degradation | Only if Q4 doesn't fit in RAM | 3.8×
Q2_K | 2 | ~0.13× | Significant quality loss — avoid | Emergency fallback | 4.5×
IQ4_XS (imatrix) | ~4.25 | ~0.27× | Better than Q4_K_M at same size | Best Q4-tier quality | 3.0×

Rule of thumb: Q4_K_M uses roughly 0.6 GB per 1B parameters. A 70B model ≈ 40 GB Q4; a 13B model ≈ 8 GB Q4; a 7B model ≈ 4.4 GB Q4. Importance-matrix (imatrix) quants (IQ series) give better quality at the same bit depth.
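
As a sketch of how those multipliers translate into gigabytes, the snippet below applies the "RAM vs FP16" column to a parameter count (FP16 baseline of 2 bytes per weight). The multipliers are approximate, and real GGUF files add some metadata and runtime overhead on top.

```bash
# Rough weight-file size from parameter count and quant level, using the
# approximate "RAM vs FP16" multipliers from the table above.
model_size_gb() {
  local params_b=$1  # parameters, in billions
  local quant=$2     # FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K
  awk -v p="$params_b" -v q="$quant" 'BEGIN {
    f["FP16"]=1.0; f["Q8_0"]=0.5; f["Q6_K"]=0.37; f["Q5_K_M"]=0.31
    f["Q4_K_M"]=0.26; f["Q3_K_M"]=0.19; f["Q2_K"]=0.13
    printf "%sB %s ~ %.1f GB\n", p, q, p * 2 * f[q]   # FP16 = 2 bytes/weight
  }'
}

model_size_gb 7  Q4_K_M   # ~3.6 GB of weights; ~4.4 GB in practice with overhead
model_size_gb 70 Q4_K_M   # ~36 GB of weights; the guide quotes ~40 GB with overhead
model_size_gb 70 FP16     # ~140 GB, which is why 70B FP16 needs 160 GB of VRAM
```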

Platform-Specific Optimization Paths

NVIDIA — CUDA Ecosystem
  • CUDA + cuBLAS (default): All major frameworks use this. Best throughput for models that fully fit in VRAM. Zero additional configuration in Ollama or LM Studio.
  • TensorRT-LLM: NVIDIA's production inference engine. Requires model compilation; delivers up to 2× the throughput of vanilla llama.cpp for continuous batching workloads.
  • FlashAttention-2: Reduces attention memory complexity from O(n²) to O(n). Enabled by default in vLLM and llama.cpp — provides 1.5–2× speedup on long contexts.
  • Multi-GPU NVLink: Data-center cards (A100, H100) and the Ampere-generation RTX A6000 support NVLink for full-bandwidth pooling. Consumer and Ada-generation RTX cards do not, so PCIe bandwidth becomes the bottleneck with 2× GPUs.
  • GPU offload layers (-ngl): In llama.cpp, set -ngl 99 to send all transformer layers to GPU. If VRAM is too small, reduce layers incrementally; each layer in VRAM improves speed (see the sketch after this list).
  • FP8 quantisation (Hopper GPUs): H100/H200 support FP8 native precision — 2× throughput over BF16 with <1% quality loss. Not available on consumer RTX cards.
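
A minimal llama.cpp run tying the CUDA build, layer offload, and Flash Attention notes above together. The repository URL is the public llama.cpp project; the model path is a placeholder, and build and flag names have shifted between releases (older trees used LLAMA_CUBLAS), so treat this as a starting sketch rather than a canonical recipe.

```bash
# Build llama.cpp with the CUDA backend, then run a local GGUF fully on the GPU.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# -ngl 99 offloads every transformer layer; step it down if VRAM runs out.
./build/bin/llama-cli \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -ngl 99 --flash-attn --ctx-size 8192 \
  -p "Explain in two sentences why LLM inference is memory-bandwidth-bound."
```

The load log reports how many layers were offloaded and roughly how much VRAM each buffer takes, which makes it easy to find the highest -ngl value that still fits.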
Apple Silicon — Metal / MLX
  • MLX framework (Apple-native): 20–30% faster than llama.cpp with Ollama for inference on Apple Silicon. Install via pip install mlx-lm. Best choice for Apple hardware in 2025.
  • Metal Performance Shaders (MPS): Used by PyTorch on Apple Silicon. Not as optimised as MLX for LLM inference but works for fine-tuning and training experiments.
  • Speculative decoding (MLX): Use a small draft model (e.g. 1B) to propose tokens, verified by the main model. Can nearly double tokens/sec on large models with no quality change.
  • llama.cpp Metal backend: Used by Ollama. Slower than native MLX but has broader model format support. Set LLAMA_METAL=1 at compile time.
  • Unified memory — use it all: Unlike a GPU, macOS can dynamically share RAM between GPU and CPU. Run models that are 70–80% of total RAM; leave ~20% for macOS and applications.
  • Context length and memory: Increasing context window (e.g. 8K → 128K) consumes significant KV-cache RAM. At 128K context a 70B model can use >60 GB extra RAM. Set --ctx-size conservatively.
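
A minimal MLX session, assuming the mlx-lm package and a quantised community model from Hugging Face. The model repository below is illustrative, and CLI flag names can change between mlx-lm releases.

```bash
# Install Apple's MLX LLM tooling and run a 4-bit model with a bounded KV cache.
pip install mlx-lm

# --max-kv-size caps the KV cache so long sessions don't exhaust unified memory.
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain unified memory in one paragraph." \
  --max-tokens 256 \
  --max-kv-size 4096
```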
AMD — ROCm / Vulkan / HIP
  • ROCm (HIP): AMD's CUDA-equivalent. Now supported in llama.cpp, vLLM, and PyTorch. Requires ROCm 6.0+ on compatible RX 6000/7000 series GPUs. Performance is within ~10% of CUDA on inference.
  • Vulkan backend (llama.cpp): Cross-platform GPU acceleration using Vulkan compute shaders. Often 10–20% faster than ROCm on discrete GPUs for pure inference. Use -DGGML_VULKAN=ON at build time (see the sketch after this list).
  • Ryzen AI Max+ 395 — BIOS RAM allocation: By default, the NPU/GPU shares 32 GB. Open BIOS, set iGPU VRAM allocation to 96 or 128 GB for full performance. Without this, large models page to slow system RAM.
  • hipBLAS (ROCm BLAS): Replaces cuBLAS for AMD systems. Install via ROCm SDK and set HIP_VISIBLE_DEVICES=0. Used automatically by torch and llama.cpp ROCm builds.
  • AMDGPU target: Set HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA3 cards (RX 7000 series) if ROCm doesn't detect the GPU correctly, a common issue on consumer RDNA setups.
  • Concurrent requests: AMD's ROCm handles batching less efficiently than CUDA for high concurrency. For single-user inference, the gap is small (<10%). For serving multiple users, NVIDIA holds a clear advantage.
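
The Vulkan build flag and the RDNA3 override mentioned above, combined into one sketch. Model paths are placeholders, and the HIP build option has been renamed across llama.cpp releases (GGML_HIPBLAS in older trees), so check the current build docs.

```bash
# Vulkan backend: works on discrete Radeon cards and the Ryzen AI Max iGPU alike.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli -m models/qwen2.5-7b-instruct-Q4_K_M.gguf -ngl 99 -p "Hello"

# ROCm/HIP alternative for RX 7000 (RDNA3) cards; the override helps when ROCm
# misidentifies the GPU architecture.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
cmake -B build-rocm -DGGML_HIP=ON
cmake --build build-rocm --config Release -j
```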
CPU-Only Inference (x86 / ARM)
  • AVX-512 (Intel Icelake+, AMD Zen4): llama.cpp auto-detects and uses AVX-512 VNNI instructions for 4-bit matrix multiply. ~1.4× faster than AVX2. Use a compiled binary or set CMAKE_CXX_FLAGS=-march=native.
  • Thread count: Set threads to equal physical cores (not hyperthreads). Hyperthreading adds latency on memory-bound tasks. E.g. for an 8P + 16E core i9-13900K, use -t 8 (P-cores only).
  • NUMA pinning: On multi-socket systems, pin to one NUMA domain. Cross-socket memory access halves effective bandwidth. Use numactl --cpunodebind=0 --membind=0.
  • Dual-channel / quad-channel RAM: Adding memory channels scales bandwidth roughly linearly. Quad-channel DDR5-6000 gives ~190 GB/s vs ~50 GB/s for dual-channel DDR4-3200, roughly a 4× uplift for CPU-only inference.
  • mmap and mlock: llama.cpp by default mmaps model files. Use --mlock to pin the model in RAM and prevent page faults during inference. Requires sufficient free RAM and root/CAP_IPC_LOCK.
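
A CPU-only invocation combining the thread, NUMA, and mlock advice above. The core count and model path are placeholders; set -t to your machine's physical core count.

```bash
# Pin the process and its memory to one NUMA node, use only physical cores, and
# lock the model in RAM so pages are never evicted mid-generation (needs enough free RAM).
numactl --cpunodebind=0 --membind=0 \
  ./build/bin/llama-cli \
    -m models/mistral-7b-instruct-Q4_K_M.gguf \
    -t 8 --mlock --ctx-size 4096 \
    -p "Write a haiku about memory bandwidth."
```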
Inference Framework Selection
  • Ollama: Easiest setup; wraps llama.cpp with automatic hardware detection (CUDA, Metal, ROCm). Good for development and single-user. Not optimised for high-throughput multi-user serving.
  • LM Studio: GUI-based; good for non-technical users. Uses llama.cpp backend. GPU offload via "GPU Layers" slider. Match layers to VRAM — rule of thumb: each 7B layer ~= 100 MB VRAM.
  • llama.cpp (direct): Maximum control. Supports every acceleration path. Use -ngl 99 -t 8 --ctx-size 4096 --mlock as a good starting configuration.
  • vLLM: Best for production multi-user serving on NVIDIA. PagedAttention enables efficient KV-cache sharing. Not supported on Apple Silicon. Requires CUDA 11.8+.
  • ExLlamaV2: Fastest CUDA inference for quantised models. Uses EXL2 format (superior to GGUF on NVIDIA). Can exceed TensorRT-LLM speed for chat workloads at Q4–Q5.
Context Length and KV-Cache
  • KV-cache size scales with context length: A 7B model at 4K context uses ~500 MB of KV-cache; at 128K context it uses ~16 GB. Reduce --ctx-size to free memory for the model itself.
  • KV quantisation (--cache-type-k q8_0): Quantising the KV cache from FP16 to Q8 or Q4 halves its memory footprint with minor quality impact. Supported in llama.cpp ≥ b2300.
  • Grouped-query attention (GQA) models: Llama 3, Mistral, and Qwen2+ use GQA — their KV cache is inherently smaller than older models. Prefer GQA models when context length matters.
  • Flash Attention 2: Required for efficient long-context inference. Enabled automatically in vLLM and supported in llama.cpp with --flash-attn.
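
A llama-server sketch applying the notes above: quantised KV cache, bounded context, Flash Attention on. The model path is a placeholder, and the cache-type flags need a reasonably recent llama.cpp build.

```bash
# Serve a model with an 8-bit KV cache and a bounded context window.
# Quantising the cache roughly halves KV memory, which matters most at long contexts.
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  -ngl 99 --flash-attn \
  --ctx-size 16384 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 127.0.0.1 --port 8080
```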

RAM / VRAM Impact on Model Access — Reference Table

This table shows what changes when you increase or reduce available RAM/VRAM. The key insight: more memory expands which models fit; more bandwidth increases how fast they run. Adding RAM without adding bandwidth does not improve token speed — it only enables larger models.

Available RAM / VRAM | Largest model (Q4_K_M) | Token speed change vs. prior tier | 70B Q4 accessible? | Notes
4 GB | ~3B Q4 | n/a | No | OS overhead leaves ~2.5 GB usable. 3B models run well.
8 GB | ~7B Q4 | +0% speed (same BW), +model tier | No | Most popular tier. 7B Q4 = ~4.4 GB. Small context headroom.
16 GB | ~13B Q4 | +0% speed, +model tier | No | 13B Q4 ≈ 7.9 GB. 8K context adds ~1 GB. Comfortable headroom.
24 GB | ~20B Q4 | +0% speed, +model tier | No | Maximum for consumer discrete GPUs (NVIDIA RTX 4090, AMD RX 7900 XTX). 20B Q4 ≈ 12 GB.
32 GB | ~26B Q4, 70B Q2 | +0% speed, marginal 70B Q2 | Marginal (Q2 only) | RTX 5090 ceiling. Q2 70B runs but quality suffers. Strong for 13–26B models.
48 GB | ~40B Q4, 70B Q3 | +0% speed, +model tier | Q3 only (~3.8 t/s) | RTX A6000. 70B Q3 just fits; expect lower quality than Q4.
64 GB | 70B Q4 (full) | +0% speed, full 70B access | Yes (Q4_K_M) | First tier enabling 70B Q4 fully. Apple M-series, Ryzen AI Max. 70B Q4 ≈ 40 GB; leaves 20+ GB for context.
96 GB | ~80B Q4, 120B Q3 | +0% speed, 120B tier | Yes, with headroom | Dual 3090/A6000, M2 Ultra 96 GB. Llama 3.1 405B in Q2 just fits.
128 GB | ~105B Q4, 120B Q3 | +0% speed (CPU BW limited) | Yes, comfortably | Ryzen AI Max+ 395, M-series 128 GB. 70B + 32K context fits easily.
256 GB | ~200B Q4, 400B Q3 | +0% speed, 200B tier (Apple BW) | Yes, fast (~18 t/s) | M3 Ultra 256 GB. Llama 3.1 405B Q3 fits comfortably.
512 GB | 405B Q4 full | +0% speed, 400B tier | Yes | M3 Ultra 512 GB. 405B Q4 ≈ 230 GB. Full unquantised 70B fits.

Key Takeaways by Use Case

Goal | Recommended Path | Key Setting
Fastest possible 7B inference | NVIDIA RTX 5090 + ExLlamaV2 | EXL2 Q4 format, enable Flash Attention
Run 70B on a budget | AMD Ryzen AI Max+ 395 mini PC (~$1,500) | Set BIOS iGPU VRAM to 128 GB, use Vulkan backend
Quiet, large-model desktop | Apple Mac Studio M3 Ultra 128 GB + MLX | Use mlx_lm.generate, enable speculative decoding
Multi-user production serving | NVIDIA A100/H100 + vLLM | Enable PagedAttention, continuous batching, tensor parallelism
Maximum context length (>64K tokens) | Apple M3 Ultra 256 GB or 512 GB | Set --ctx-size 65536, quantise KV cache to Q8
Fine-tuning / LoRA training | NVIDIA GPU with 24+ GB VRAM | Use QLoRA (4-bit base + FP16 adapters), gradient checkpointing
Best CPU-only on x86 | Dual-channel DDR5 + Zen4/Raptor Lake | Compile llama.cpp with -march=native, use -t <physical cores>

Operating Systems for Local AI Development

Your operating system determines which GPU acceleration paths, quantization formats, and developer tools are available. This page compares Windows, macOS, and Linux across the dimensions that matter most for running, developing, and fine-tuning local AI models.

Feature Compatibility Matrix

Feature / Tool | Windows 11 | macOS (Apple Silicon) | Linux (Ubuntu 22.04+)
NVIDIA CUDA | Full | Not available | Full — best driver control
AMD ROCm (discrete GPU) | Not available — ROCm for discrete is Linux-only | Not available | Full — ROCm 6.0+ on RX 6000/7000
Apple MLX / Metal | Not available | Native — best inference path | Not available
DirectML (Microsoft) | Native — ONNX on NVIDIA, AMD, Intel GPU | Not available | Not available
Vulkan inference (llama.cpp) | Full | Not available | Full
Ollama | Full | Full | Full
LM Studio | Full | Full | Full
vLLM (production serving) | Via WSL2 only | Not supported | Full — native, best performance
ExLlamaV2 | Full (CUDA) | Not available | Full (CUDA)
Claude Code | WSL2 required | Native — recommended | Native — recommended
Aider coding agent | Works; WSL2 preferred | Full | Full
Docker + GPU passthrough | Via Docker Desktop + WSL2 | No GPU passthrough to containers | Full — NVIDIA Container Toolkit
Axolotl / Unsloth fine-tuning | Via WSL2 + CUDA | Not supported | Full — recommended platform
GGUF quantization | Full | Full | Full
EXL2 format | Full (CUDA) | Not available | Full (CUDA)
GPTQ / AWQ formats | Full (CUDA) | Not available | Full (CUDA)
MLX native format | Not available | Native — Apple Silicon only | Not available
BitsAndBytes / QLoRA | Via WSL2 (unreliable natively) | Not supported | Full — recommended for QLoRA
Hugging Face transformers | Works; WSL2 for training | Inference via MPS; no CUDA | Full
GPU driver management | Easy — NVIDIA App or AMD Adrenaline | Automatic — macOS system updates | Manual — DKMS, kernel headers

Platform Detail

Windows 11

Best for: NVIDIA CUDA users wanting a dual-purpose gaming and AI workstation. Widest consumer GPU support, easiest driver management, and the best GUI tool ecosystem.

  • CUDA works natively. Ollama, LM Studio, ExLlamaV2, and llama.cpp all run without WSL2 on NVIDIA GPUs. CUDA driver installation is handled automatically by the NVIDIA App.
  • WSL2 is near-mandatory for serious Python-based AI work. Tools like vLLM, Axolotl, Unsloth, and BitsAndBytes are Linux-first. Install WSL2 (Ubuntu 22.04 LTS) and do Python AI development inside it. NVIDIA GPU passthrough to WSL2 delivers near-native CUDA performance with minimal overhead.
  • Claude Code requires WSL2. It is not natively supported on PowerShell or Command Prompt. After enabling WSL2, install Node.js inside Ubuntu and run npm install -g @anthropic-ai/claude-code (see the command sketch after this list). This one step also resolves most Python ML compatibility issues.
  • AMD discrete GPU ROCm is not available on Windows. AMD users should use the Vulkan backend in llama.cpp/Ollama (which works well for inference) as an alternative. For PyTorch training with AMD discrete GPUs, Linux is required.
  • DirectML accelerates ONNX Runtime models (Phi-3, Mistral, Llama via ONNX) on any Windows GPU including Intel Arc. Slower than CUDA but hardware-agnostic — useful when CUDA is not an option.
  • Recommended setup: NVIDIA drivers natively → WSL2 + Ubuntu 22.04 → Ollama and LM Studio on native Windows for daily chat → Claude Code and Python tools (vLLM, transformers, axolotl) inside WSL2.
  • AI coding tool overhead. Windows NTFS filesystem operations are measurably slower than macOS APFS or Linux ext4. AI coding agents (Claude Code, Aider, Cursor) perform frequent file reads, diffs, and writes during each interaction. On Windows, this translates to longer response latencies and higher token consumption per task — the agent spends more time waiting on I/O and generates more retries. In practice, the same coding task can cost 10–30% more tokens on Windows compared to macOS or Linux. For heavy agentic coding use, WSL2 with files stored inside the Linux filesystem (not /mnt/c/) mitigates this, but native macOS or Linux remains faster.
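
A condensed version of that recommended setup. The wsl command runs once from an elevated PowerShell prompt; everything else runs inside Ubuntu. The NodeSource route shown for Node.js is one of several that work.

```bash
# One-time, from an elevated PowerShell prompt (shown as a comment because the
# rest of this block runs inside the Ubuntu shell):
#   wsl --install -d Ubuntu-22.04

# Inside WSL2 Ubuntu: install Node.js, then Claude Code.
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
npm install -g @anthropic-ai/claude-code

# Keep projects inside the Linux filesystem (~/projects), not /mnt/c/, to avoid
# the NTFS I/O penalty described above.
mkdir -p ~/projects && cd ~/projects
claude
```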
macOS (Apple Silicon)

Best for: Developers who want a polished Unix environment, large unified-memory model capacity, low power draw, and the best Claude Code experience without Linux complexity.

  • MLX is the optimal inference path. Apple's open-source MLX framework is 20–30% faster than llama.cpp Metal for inference on Apple Silicon. Install via pip install mlx-lm, then run models directly with mlx_lm.generate. Ollama uses Metal/llama.cpp internally and is slightly slower but simpler to set up.
  • Unified memory is the key architectural advantage. CPU and GPU share one high-bandwidth pool — up to 512 GB on Mac Studio M3 Ultra. A 70B model (40 GB Q4) loads without any VRAM/host memory boundary or PCIe transfer overhead.
  • Claude Code works natively. macOS is a first-class Claude Code platform. zsh, Homebrew, and all standard Unix utilities work out of the box. No WSL2 or compatibility layer needed.
  • No CUDA, EXL2, GPTQ/AWQ, vLLM, or BitsAndBytes. Axolotl and Unsloth for fine-tuning are not supported. MLX includes its own LoRA fine-tuning path (mlx_lm.lora) for Apple Silicon.
  • Supported quantization formats: GGUF (all variants via Ollama/llama.cpp) and MLX native formats (int4, int8, fp16, bf16). EXL2, GPTQ, AWQ, and BitsAndBytes are not available.
  • RAM is soldered and non-upgradeable. Choose your memory configuration at purchase. The 128 GB step (M4 Max MacBook Pro or Mac Studio) is the threshold for 70B Q4 models; 192–512 GB (Mac Studio M3 Ultra) enables 120B+ models.
  • Power efficiency. An M3 Ultra running 70B at 18 t/s draws ~150 W. An equivalent-speed NVIDIA A100 draws 300–400 W. For always-on home servers this is a meaningful difference over time.
Linux (Ubuntu 22.04 / 24.04)

Best for: Fine-tuning, production multi-user serving, AMD discrete GPU acceleration, multi-GPU setups, and developers who need the full toolchain without compatibility workarounds.

  • The reference platform for all AI/ML work. vLLM, Axolotl, Unsloth, TensorRT-LLM, and BitsAndBytes are developed and tested on Linux first. Some capabilities are Linux-only or substantially faster on Linux.
  • ROCm works properly only on Linux. AMD ROCm 6.0+ enables full GPU acceleration for llama.cpp, vLLM, and PyTorch on RX 6000/7000 series GPUs. On Windows, ROCm support for discrete GPUs is not available — use Vulkan there instead.
  • Claude Code and Aider are native. The full agentic experience — file editing, bash commands, multi-file reasoning, git integration — works without any wrapper or workaround.
  • Fine-tuning requires Linux. Axolotl (most widely used LoRA/QLoRA pipeline), Unsloth (2× faster LoRA with less VRAM), and BitsAndBytes (required for QLoRA) all require Linux + CUDA. Attempting native Windows fine-tuning consistently produces environment and CUDA library compatibility errors.
  • Docker + NVIDIA Container Toolkit enables fully reproducible GPU-accelerated environments. Deploy vLLM in one command: docker run --gpus all vllm/vllm-openai:latest --model meta-llama/Llama-3.1-70B-Instruct — the standard production pattern (toolkit setup sketch after this list).
  • Driver management is the main friction point. NVIDIA drivers require matching kernel headers and DKMS modules. Before upgrading the kernel, confirm the driver package is compatible. Use apt-mark hold linux-image-generic to pin a stable kernel once the driver is working.
  • Full quantization access: GGUF, EXL2, GPTQ, AWQ, BitsAndBytes (4-bit/8-bit), and all llama.cpp-supported formats. Linux + CUDA gives access to every inference and training optimization path simultaneously.
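
The Container Toolkit path referenced above, sketched for Ubuntu. It assumes NVIDIA's apt repository for the toolkit is already configured per their install docs, and the CUDA image tag is illustrative.

```bash
# Install the NVIDIA Container Toolkit and wire it into Docker.
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: the container should see the GPU.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Once the driver works, pin the kernel so an automatic upgrade can't break it.
sudo apt-mark hold linux-image-generic
```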

Recommended Setup by Goal

Goal | Recommended OS | Reasoning
Run local chat models (beginner) | Windows or macOS | Ollama and LM Studio work on all platforms. Windows has the widest consumer GPU support; macOS has the simplest environment setup.
Fine-tune models (LoRA / QLoRA) | Linux (Ubuntu 22.04) | Axolotl, Unsloth, and BitsAndBytes require Linux + CUDA. Most tutorials, Docker images, and Hugging Face guides assume Ubuntu.
Production multi-user model serving | Linux | vLLM with PagedAttention, Docker + NVIDIA Container Toolkit, and production monitoring (Prometheus, Grafana) are all Linux-native.
Daily Claude Code use | macOS or Linux | Native first-class support on both. On Windows, WSL2 setup is required and file I/O across the WSL2/Windows boundary is slower.
AMD discrete GPU inference | Linux | ROCm is Linux-only for discrete GPUs. On Windows, use Vulkan in llama.cpp/Ollama — functional but ~10–15% slower than ROCm on Linux for the same hardware.
Run 70B+ parameter models | macOS (Apple Silicon) or Linux (multi-GPU) | Apple unified memory and NVIDIA NVLink multi-GPU are the two viable paths to 70B+ at usable speed without an extreme budget.
Gaming + AI on one machine | Windows | Windows remains the primary gaming platform. NVIDIA CUDA AI tools work natively on Windows alongside games without issue.
Power-efficient always-on server | macOS (Apple Silicon) | Apple Silicon offers the best performance-per-watt ratio for LLM inference. A Mac mini M4 Pro can idle at under 10 W while ready to serve requests.
Maximum quantization format access | Linux | Linux + CUDA is the only platform where GGUF, EXL2, GPTQ, AWQ, BitsAndBytes, and ONNX with CUDA EP are all simultaneously available.

Software Guide — Inference, Coding Assistants, Fine-tuning & Cloud Tools

A structured catalog of the tools available across each stage of AI work: running local models, developing with AI assistance, fine-tuning models, and accessing cloud AI services. All tools listed are open source or have a meaningful free tier unless noted.

Local Inference Engines

These tools load model files from disk and run them on your CPU or GPU. They are the foundation of local AI use.

Ollama
The easiest way to run local models. Wraps llama.cpp with automatic GPU detection and exposes a REST API on localhost:11434 compatible with the OpenAI API spec.
Open source Windows / macOS / Linux
  • Model library: ollama run llama3.1:70b downloads a pre-quantised build and starts it in one command. 200+ curated models available.
  • GPU: Auto-detects CUDA (NVIDIA), Metal (Apple), and ROCm/Vulkan (AMD). No configuration needed for most setups.
  • API compatibility: Exposes a /v1/chat/completions endpoint, so any OpenAI-compatible client (Open WebUI, Continue.dev, Cursor) connects immediately (see the curl example after this list).
  • Limitation: Not optimised for high-throughput multi-user serving. Single-user development and local apps are the sweet spot.
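
A quick check of that OpenAI-compatible endpoint, assuming Ollama is running locally; the model tag is whichever model you have pulled.

```bash
# Pull a model, then hit the OpenAI-compatible chat endpoint with curl.
ollama pull llama3.1:8b

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```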
LM Studio
GUI application for discovering, downloading, and running GGUF models. Built-in model browser, GPU layer slider, and a local server with OpenAI-compatible API.
Free (closed source) Windows / macOS / Linux
  • Best for beginners: No command line required. Browse models from Hugging Face directly in the app, download with one click, adjust GPU layers via slider.
  • GPU offload: Drag the "GPU Layers" slider to send as many transformer layers as fit into VRAM. Rule of thumb: each 7B layer ≈ 100 MB VRAM.
  • Local server mode: Enables a persistent server that other apps (Open WebUI, Continue.dev) can connect to.
  • Limitation: Closed source; slightly behind llama.cpp direct on performance. Does not support vLLM-style batching.
Jan.ai
Open-source desktop app providing a full ChatGPT-style local interface with model management, extension support, and a local API server.
Open source Windows / macOS / Linux
  • Full offline operation: Designed to work with no internet connection. Model files are stored locally; no data leaves the machine.
  • Extension ecosystem: Supports plugins for additional functionality such as web search, document retrieval, and remote model endpoints.
  • API server: Exposes an OpenAI-compatible API on localhost:1337, allowing integration with any client that speaks the OpenAI format.
  • Good alternative to LM Studio for users who prefer open-source software with the same ease of use.
GPT4All
Fully offline AI assistant desktop app. No internet access required at any point, including model downloads (which can be done offline via direct GGUF import).
Open source Windows / macOS / Linux
  • Air-gap friendly: The strongest choice for environments where internet connectivity must be fully severed after setup.
  • LocalDocs feature: Index local documents (PDF, Word, text) and ask questions about them with retrieval-augmented generation (RAG).
  • CPU-first design: Works well on CPU-only systems. GPU acceleration via Vulkan is available but less optimised than Ollama/LM Studio.
KoboldCPP
A single-binary fork of llama.cpp with a built-in web UI, advanced sampling controls, and strong support for long-form creative writing and roleplay use cases.
Open source Windows / macOS / Linux
  • Single executable: Download one file and run it — no Python environment or package manager required.
  • Advanced sampling: Includes mirostat, tail-free sampling, and Typical P sampling not found in all frontends — particularly useful for creative/story generation.
  • SillyTavern/Agnai integration: The community-standard backend for character-based AI chat interfaces.
llama.cpp (direct)
The foundational C++ inference engine that powers Ollama, LM Studio, and KoboldCPP. Run it directly for maximum control over every parameter.
Open source Windows / macOS / Linux
  • Maximum control: Exposes every optimization flag: -ngl (GPU layers), --ctx-size, --cache-type-k, --flash-attn, -t (threads), --mlock, and more.
  • Supports all GPU backends in one codebase: CUDA, Metal, ROCm (HIP), Vulkan, OpenCL, and CPU-only (AVX2, AVX-512).
  • GGUF quantization tooling: Includes llama-quantize to convert and quantize models locally without uploading to a cloud service (see the sketch after this list).
  • Best choice when you are benchmarking, scripting, or need a parameter not exposed by higher-level tools.
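
A local quantisation pass using the tooling mentioned above. Model names and paths are placeholders, and the Hugging Face conversion script has been renamed across releases (convert_hf_to_gguf.py in current trees).

```bash
# 1. Convert a Hugging Face checkpoint to an FP16 GGUF with the bundled script.
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct --outfile llama-3.1-8b-f16.gguf

# 2. Quantise down to Q4_K_M locally; nothing leaves the machine.
./build/bin/llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-Q4_K_M.gguf Q4_K_M

# 3. Sanity-check the result.
./build/bin/llama-cli -m llama-3.1-8b-Q4_K_M.gguf -ngl 99 -p "Test."
```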
vLLM
Production-grade inference server for NVIDIA GPUs. Uses PagedAttention for efficient KV-cache memory management and continuous batching for high-throughput multi-user serving.
Open source Linux (CUDA required)
  • PagedAttention: Manages KV-cache in variable-size pages, dramatically reducing memory fragmentation and enabling 2–5× higher throughput than llama.cpp for concurrent users.
  • Continuous batching: Processes multiple requests simultaneously without waiting for a full batch, reducing latency under load.
  • OpenAI-compatible API: Drop-in replacement for OpenAI endpoints — connect any client without code changes (launch sketch after this list).
  • Requires Linux + CUDA. Not supported on macOS or Windows without WSL2. The standard choice for production NVIDIA deployments.
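
A native launch sketch on Linux + CUDA: the model name is an example, and --tensor-parallel-size is only needed when splitting across multiple GPUs.

```bash
pip install vllm

# Start an OpenAI-compatible server; omit --tensor-parallel-size on a single card.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000

# Any OpenAI-compatible client can now point at http://localhost:8000/v1
curl http://localhost:8000/v1/models
```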
ExLlamaV2
The fastest CUDA inference engine for quantized models. Uses the EXL2 quantization format, which achieves higher quality than GGUF Q4 at the same file size via per-row quantization.
Open source Windows Linux
  • EXL2 format: Per-row mixed-precision quantization. A 4-bit EXL2 model typically outperforms a GGUF Q4_K_M on the same hardware in quality benchmarks.
  • Speed: Consistently outperforms TensorRT-LLM on chat workloads at Q4–Q5. Best choice for maximum CUDA inference speed with quantized models.
  • Speculative decoding support: Use a small draft model to propose tokens — can yield near 2× speed on certain workloads.
  • Tabby API server: ExLlamaV2 is the backend for TabbyAPI, a high-performance OpenAI-compatible server used by many frontends.

AI Coding Assistants

Tools that integrate AI into the code-writing and editing workflow, from inline completions to full agentic coding agents that can edit multiple files and run commands.

Claude Code (Anthropic)
Terminal-based agentic coding assistant. Reads and edits files, runs shell commands, searches the codebase, and writes and executes code autonomously within a directory. Uses Claude models via API.
Pay-per-token macOS native Linux native WSL2 on Windows
  • Full agentic loop: Can read project files, propose and apply edits, run tests, and iterate on errors without constant user prompting. Most effective on large codebases.
  • Git integration: Understands git history, creates commits, and works across branches. Best used in a git-tracked repository.
  • Windows note: Requires WSL2 (Ubuntu 22.04+). Running inside WSL2 gives the same experience as Linux. File edits across the /mnt/c mount work but are slower than editing files inside the WSL2 filesystem.
  • Model selection: Defaults to Claude Sonnet; can be configured to use Claude Opus for complex reasoning tasks at higher cost.
GitHub Copilot
The most widely used AI coding assistant. Provides inline code completion and a chat panel within VS Code, JetBrains IDEs, Neovim, and others. Uses GPT-4o and Claude models depending on task.
$10/month (individual) All platforms
  • Inline completion: Suggests single lines or entire function bodies as you type with low latency. Works in virtually all languages.
  • Copilot Chat: Ask questions about the codebase, get explanations, generate tests, and fix errors directly in the IDE sidebar.
  • Copilot Workspace / Agent: Multi-step task completion — describe a feature and Copilot proposes a plan, creates files, and opens PRs (enterprise tier).
  • Model choice: GitHub now lets users select the underlying model (GPT-4o, Claude Sonnet, Gemini 2.0) per session.
Cursor
AI-first code editor forked from VS Code. Adds a multi-file edit "Composer" mode, codebase-wide context awareness, and tight model integration beyond what VS Code extensions provide.
Free tier / $20/month Pro Windows / macOS / Linux
  • Composer mode: Describe a change in natural language; Cursor proposes and applies diffs across multiple files simultaneously — the main differentiator from GitHub Copilot.
  • Codebase indexing: Indexes your entire repository for semantic search, giving the model accurate context even in large codebases.
  • Model selection: Supports Claude 3.5/3.7 Sonnet, GPT-4o, and o3 — swappable per-task. Can also point to a local Ollama endpoint for completions.
  • Tab / Ghost text: A distinctive predictive editing feature that anticipates your next edit based on recent changes — faster than standard autocomplete.
Continue.dev
Open-source AI coding extension for VS Code and JetBrains. Unique in that it supports local Ollama models as well as cloud APIs, giving you AI coding assistance with no data leaving your machine.
Open source Windows / macOS / Linux
  • Local model support: Connect to Ollama, LM Studio, or any OpenAI-compatible API. Run Qwen2.5-Coder 32B locally for code completion with full privacy.
  • Configurable via JSON: config.json specifies models for chat, completion, and embedding independently — mix local and cloud models (a minimal example follows this list).
  • Tab autocomplete: Uses a separate fast local model (e.g. Qwen2.5-Coder 1.5B) for low-latency line completions while using a larger model for chat.
  • Best choice for developers who want GitHub Copilot-style features but with local models and no subscription.
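
A minimal local-only configuration as a sketch, assuming Ollama is serving the models named below. The JSON layout follows the config format Continue has used historically; newer releases also accept a YAML config, so check the current docs.

```bash
# Point Continue at local Ollama models: a large model for chat, a small one for
# tab autocomplete. Model tags are examples; use whatever you have pulled.
mkdir -p ~/.continue
cat > ~/.continue/config.json <<'EOF'
{
  "models": [
    { "title": "Qwen2.5-Coder 32B (local)", "provider": "ollama", "model": "qwen2.5-coder:32b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B (autocomplete)", "provider": "ollama", "model": "qwen2.5-coder:1.5b"
  }
}
EOF
```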
Windsurf (Codeium)
AI-first code editor from Codeium, similar to Cursor. Offers a "Cascade" agent mode for multi-step task completion and deep IDE integration. Strong free tier.
Free tier / $15/month Pro Windows / macOS / Linux
  • Cascade agent: Handles multi-file edits, runs terminal commands, reads error output, and iterates — similar to Claude Code but embedded in the IDE.
  • Generous free tier: Unlimited code completions and a meaningful number of agent interactions per month at no cost — competitive with Cursor.
  • Codeium extension also available as a VS Code/JetBrains plugin if you don't want to switch editors, providing completions without the full Windsurf agent.
Aider
Terminal-based AI coding agent emphasizing git-awareness. Applies changes as atomic git commits, supports multiple files per session, and works with any LLM including local Ollama models.
Open source Windows / macOS / Linux
  • Git-first workflow: Every change is committed with a descriptive message, making it easy to review, revert, or cherry-pick AI edits.
  • Local model support: Connect to an Ollama endpoint and use Qwen2.5-Coder 32B or DeepSeek-Coder V2 for fully offline coding assistance.
  • Benchmark leading: Consistently at the top of the SWE-Bench leaderboard for open-source coding agents, particularly in architect + editor mode (uses two models cooperatively).
  • Works well alongside Claude Code — use Aider for git-structured file edits and Claude Code for broader exploration and command execution.

Local Chat Frontends & Self-Hosted UIs

Web or desktop interfaces that connect to Ollama, LM Studio, or vLLM and provide a chat experience similar to ChatGPT but running entirely on your hardware.

Open WebUI
The most popular self-hosted ChatGPT-style web interface. Connects to Ollama and OpenAI-compatible APIs. Runs as a Docker container or Python package. Feature parity with ChatGPT's web UI including document upload, image analysis, and multi-model switching.
Open source All platforms (browser-based)
  • One-command install: docker run -d -p 3000:80 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main — connects to a running Ollama instance automatically.
  • RAG support: Upload PDFs, web URLs, or YouTube transcripts; the model retrieves relevant chunks during conversation.
  • Multi-model conversations: Chat with multiple models side by side and compare responses.
  • User management: Supports multiple user accounts with role-based access — suitable for sharing a local model server within a team.
LibreChat
Self-hosted multi-provider chat interface supporting OpenAI, Anthropic, Google, Azure, and local models simultaneously in one UI. Strong feature set with branching conversations, code execution, and plugin support.
Open source All platforms (browser-based, Docker)
  • Multi-provider in one UI: Switch between ChatGPT, Claude, Gemini, and a local Ollama model within the same conversation history.
  • Branching conversations: Fork any message to explore alternative responses without losing the original thread.
  • Best choice when you want to compare cloud and local model outputs or need team access to multiple API providers from a single shared interface.
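A typical self-hosted deployment, sketched from LibreChat's Docker quickstart (steps and the default port may change; check the project README):

  git clone https://github.com/danny-avila/LibreChat.git
  cd LibreChat
  cp .env.example .env      # add API keys and/or a local OpenAI-compatible endpoint here
  docker compose up -d      # UI is served on http://localhost:3080 by default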
Chatbox
Desktop application for AI chat supporting OpenAI, Anthropic, Google, Azure, and local Ollama endpoints. Stores conversations locally; no cloud sync unless configured. Good low-friction option for non-technical users.
Free (basic) / Pro subscription Windows / macOS / Linux
  • Desktop-native: Installs like a normal application; no Docker or server setup needed. Connect to Ollama by pointing it to http://localhost:11434.
  • Privacy-first defaults: Conversation history stored locally by default. Suitable for sensitive work where cloud storage of chat history is undesirable.
Msty
Desktop AI assistant that aggregates local Ollama models and cloud APIs in a polished UI. Notable for its "Knowledge" feature — attach local documents to conversations — and side-by-side model comparison.
Free tier available Windows / macOS / Linux
  • Model comparison view: Query multiple models simultaneously and view responses side by side in the same window.
  • Knowledge library: Attach PDF, Word, or text files to a conversation with built-in RAG. No separate setup required.
  • Good choice for users who want a more polished interface than Open WebUI without the Docker setup complexity.

Cloud AI Services & API Access

Services that provide access to large frontier models via web interface or API. Useful when a task exceeds local hardware capacity or when the latest proprietary model is needed.

Service Models Available Pricing Privacy / Notes
Claude.ai (Anthropic) Claude 3.5 Haiku, Claude Sonnet 4, Claude Opus 4 Free tier / $20–$25/month Pro / API pay-per-token Strong system prompt adherence. Pro tier includes extended context (200K tokens), Projects (persistent memory), and document upload. API keys required for Claude Code.
ChatGPT (OpenAI) GPT-4o, o3, o3-mini, GPT-4.5 Free (GPT-4o limited) / $20/month Plus / API pay-per-token Widest tool-use ecosystem. GPT-4o includes image generation (DALL-E 3), web browsing, and code execution in one model. o3 is the strongest reasoning model for complex multi-step problems.
DuckDuckGo AI Chat Claude 3 Haiku, GPT-4o mini, Llama 3, Mistral Free (no account required) Privacy-focused: DuckDuckGo does not store conversations and removes identifying metadata before forwarding to model providers. Best for quick queries where privacy is a priority.
Perplexity Sonar (proprietary), Claude, GPT-4o Free / $20/month Pro AI search engine that cites sources with inline references. Excellent for research queries, real-time information, and technical documentation lookups. Pro tier adds file upload and image generation.
Phind Phind-70B (code-focused), GPT-4o Free (limited) / $20/month Pro Developer-focused search and chat. Specialises in code questions, pulls from Stack Overflow, GitHub, and documentation. Strong at answering "how do I do X in Y framework" questions with working code.
Google Gemini Gemini 2.0 Flash, Gemini 2.5 Pro Free / $20/month Advanced / API pay-per-token Gemini 2.5 Pro has the longest context window of any frontier model (1M tokens). Excellent for large document analysis. Deep Google Workspace integration. Gemini 2.0 Flash is extremely cost-effective via API.

Fine-tuning & Training Frameworks

Tools for customizing pre-trained models on your own data using LoRA, QLoRA, or full fine-tuning. Requires NVIDIA GPU (Linux) or Apple Silicon (MLX path only).

Axolotl
The most widely used open-source LoRA/QLoRA fine-tuning framework. YAML-based configuration means no Python scripting needed for most use cases. Supports virtually all major model architectures.
Open source Linux + CUDA
  • YAML configuration: Specify model, dataset, LoRA rank, batch size, learning rate, and all training parameters in a single config.yml. Run with axolotl train config.yml (see the sketch after this list).
  • Broad format support: Accepts ShareGPT, Alpaca, instruction, and completion formats out of the box. Custom tokenization templates for any chat format.
  • QLoRA support: Load the base model in 4-bit BitsAndBytes precision while training LoRA adapters in FP16 — fine-tune a 70B model on a single A100 80GB.
  • DeepSpeed integration: Enables multi-GPU and CPU memory offloading for training models that don't fit in VRAM alone.
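A minimal QLoRA config sketch for the workflow above (hyperparameters, the model ID, and the dataset path are illustrative; the Axolotl examples directory has complete, tested configs):

  # config.yml (sketch)
  base_model: meta-llama/Llama-3.1-8B
  load_in_4bit: true
  adapter: qlora
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05
  lora_target_linear: true
  datasets:
    - path: ./data/instructions.jsonl
      type: alpaca
  sequence_len: 2048
  micro_batch_size: 2
  gradient_accumulation_steps: 4
  num_epochs: 1
  learning_rate: 0.0002
  output_dir: ./outputs/llama31-8b-qlora

  # then run: axolotl train config.yml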
Unsloth
Accelerated LoRA fine-tuning library that rewrites attention kernels in Triton for 2× speed and up to 70% less VRAM compared to standard HuggingFace PEFT. The go-to tool for consumer GPU fine-tuning.
Open source Linux + CUDA
  • 2× faster than HF PEFT at the same quality. Fine-tune Llama 3.1 8B on a 24 GB GPU in under 2 hours on a typical instruction dataset.
  • 70% less VRAM. Fine-tune a 7B model on 6 GB VRAM; a 13B on 12 GB. Enables fine-tuning on consumer RTX 3060/4060 GPUs that previously could not run training at all.
  • Colab-native: Unsloth provides free Google Colab notebooks for every major model — fine-tune without owning any GPU hardware.
  • Export to GGUF: Built-in export to GGUF quantization formats (Q4_K_M, Q5_K_M, Q8_0) immediately after training.
LLaMA-Factory
Web UI and CLI for fine-tuning over 100 model architectures with LoRA, QLoRA, DoRA, ORPO, DPO, and RLHF. Provides a visual training dashboard, dataset management, and evaluation tools in one interface.
Open source Linux + CUDA (primary)
  • Web UI (LlamaBoard): Configure and launch fine-tuning jobs from a browser without any command line or Python knowledge (launch commands are sketched after this list).
  • Advanced alignment methods: Supports RLHF, DPO, ORPO, and reward model training, not just supervised fine-tuning. One of the most comprehensive fine-tuning toolkits available.
  • Multi-GPU and distributed training via DeepSpeed and FSDP with minimal configuration.
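The two common entry points, sketched (the llamafactory-cli commands follow the project's documented usage; the YAML file name is a placeholder):

  pip install llamafactory                      # or install from source per the README
  llamafactory-cli webui                        # launches the LlamaBoard web UI in a browser
  llamafactory-cli train my_lora_config.yaml    # headless training from a YAML config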
MLX Fine-tuning (Apple)
Apple's MLX framework includes a built-in LoRA fine-tuning path via mlx_lm.lora. The main practical fine-tuning route on Apple Silicon, with no CUDA, Axolotl, or Unsloth needed.
Open source macOS (Apple Silicon only)
  • Native Apple Silicon acceleration: Uses Metal for all matrix operations — no overhead from cross-platform abstraction layers.
  • LoRA and QLoRA supported on most Llama, Mistral, Phi, and Qwen architecture models. Uses MLX native int4 quantization for the base model during QLoRA.
  • Limitation: Dataset format support is narrower than Axolotl. Works best for instruction fine-tuning on simple JSONL datasets. Complex alignment training (DPO, RLHF) is not yet fully supported.
  • Run with: mlx_lm.lora --model <path> --train --data ./data --iters 1000 --batch-size 4
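The --data directory is expected to contain train.jsonl (and optionally valid.jsonl). A common minimal format is one JSON object per line with a single text field; after training, adapters can be fused into standalone weights with mlx_lm.fuse (a sketch; exact flag names vary slightly between mlx_lm releases):

  # data/train.jsonl, one example per line
  {"text": "### Instruction: Summarize the quarterly report.\n### Response: ..."}

  # merge the trained LoRA adapters into the base model
  mlx_lm.fuse --model <path> --adapter-path adapters --save-path ./fused-model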

Model Discovery, Evaluation & Experiment Tracking

Tool Purpose Cost Key Features
Hugging Face Hub Model repository & discovery Free (storage limits) / Pro 800,000+ models, datasets, and Spaces. Filter by task, license, and quantization format. The standard place to find GGUF, GPTQ, and AWQ quantized model variants. Direct download links for Ollama and llama.cpp.
Ollama Model Library Curated models for Ollama Free Curated, verified GGUF models runnable with one command. Smaller and more opinionated than HF Hub. Includes all popular models (Llama 3, Mistral, Qwen, Phi, Gemma) pre-tested with Ollama.
LMSYS Chatbot Arena Blind model evaluation leaderboard Free Community-ranked model leaderboard based on blind A/B comparisons. The most reliable quality ranking because it reflects human preferences, not just academic benchmarks. Good for choosing between similar model sizes.
Weights & Biases (W&B) Experiment tracking & ML monitoring Free (personal) / $50/month Teams Log training loss curves, hyperparameters, GPU metrics, and model checkpoints. Integrates with Axolotl, Unsloth, and transformers natively. The standard tool for tracking fine-tuning runs.
MLflow Open-source experiment tracking Free (self-hosted) Local alternative to W&B. Track experiments, compare model versions, and serve models via its built-in REST API. No cloud dependency — runs on your own machine or server.
lm-evaluation-harness Standardized model benchmarking Free (open source) Eleuther AI's framework for running standard benchmarks (MMLU, HellaSwag, ARC, TruthfulQA) against local models. The tool used to generate most public benchmark numbers. Run with lm_eval --model hf --model_args pretrained=<model> --tasks mmlu.
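As a worked example tying the Hugging Face Hub and Ollama Model Library rows together, a quantized GGUF can be downloaded from the Hub and registered as a local Ollama model (repository and file names are placeholders):

  huggingface-cli download <user>/<model>-GGUF <model>-Q4_K_M.gguf --local-dir .
  printf 'FROM ./<model>-Q4_K_M.gguf\n' > Modelfile
  ollama create my-local-model -f Modelfile
  ollama run my-local-model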