A comparative analysis of pre-built and DIY systems for local AI development.
All systems are evaluated for AI/ML workloads. TG = Tokens/sec Generation (batch size 1).
| Rank | Model | Type | GPU | VRAM | Best Model (Size) | Tokens/sec (TG) | Price |
|---|---|---|---|---|---|---|---|
| 1 | HP Z8 G4 (Custom) | Desktop | Quad NVIDIA A100 40GB | 160 GB | 70B+ (Unquantized) | ~540 | $30,000 |
| 2 | Dell Precision 7960 Tower | Desktop | Quad NVIDIA RTX 5000 Ada | 128 GB | 120B+ (Quantized) | ~450 | $20,000 |
| 3 | Apple Mac Studio (M3 Ultra) | Desktop | Apple 80-core GPU | 128 GB (Unified) | 120B+ (Quantized) | ~97 | $8,000 |
| 4 | HP OMEN 45L (2025) | Desktop | NVIDIA RTX 5090 | 32 GB | 70B+ (Quantized) | ~140 | $4,500 |
| 5 | Alienware Aurora R16 | Desktop | NVIDIA RTX 4090 | 24 GB | 30B-70B | ~95 | $4,000 |
| 6 | Lenovo ThinkStation P620 | Desktop | Dual NVIDIA RTX A6000 | 96 GB | 120B+ (Quantized) | ~300 | $15,000 |
| 7 | Asus ROG Strix Scar 18 | Laptop | NVIDIA RTX 5090 (Laptop) | 24 GB | 30B-70B | ~80 | $4,500 |
| 8 | Apple MacBook Pro 16 (M4 Max) | Laptop | Apple 40-core GPU | 128 GB (Unified) | 30B-70B | ~75 | $4,200 |
| 9 | Asus ROG Flow Z13 (2025) | Laptop | AMD Ryzen AI Max+ 395 (iGPU) | 128 GB (Unified) | 7B-30B | ~90 | $2,100 |
| 10 | ASUS Zenbook A14 (2025) | Laptop | Qualcomm Adreno iGPU + Hexagon NPU | 32 GB (Shared) | 7B-30B | ~6 | $1,800 |
| 11 | Apple MacBook Air 15" (M4) | Laptop | Apple 10-core GPU (M4) | 32 GB (Unified) | 7B-20B | ~38 | $1,299 |
| 12 | Apple MacBook Air 13" (M3) | Laptop | Apple 10-core GPU (M3) | 24 GB (Unified) | 7B-13B | ~30 | $1,099 |
| 13 | Apple Mac mini (M4 Pro) | Desktop | Apple 20-core GPU (M4 Pro) | 48 GB (Unified) | 7B-40B | ~65 | $1,999 |
| 14 | Apple iMac 24" (M4) | Desktop | Apple 10-core GPU (M4) | 32 GB (Unified) | 7B-20B | ~40 | $1,699 |
| 15 | Apple MacBook Pro 14" (M4 Pro) | Laptop | Apple 20-core GPU (M4 Pro) | 48 GB (Unified) | 7B-40B | ~58 | $2,599 |
| 16 | Minisforum AI Max+ 395 (128 GB) | Desktop | AMD Ryzen AI Max+ 395 (RDNA 3.5 iGPU) | 128 GB (Unified) | 7B-70B | ~90 | $1,600 |
| 17 | Razer Blade 16 (RTX 5090) | Laptop | NVIDIA RTX 5090 (Laptop) | 24 GB | 7B-20B | ~120 | $4,499 |
| 18 | Acer Predator Helios 18 (RTX 4090) | Laptop | NVIDIA RTX 4090 (Laptop) | 16 GB | 7B-13B | ~75 | $2,799 |
| 19 | Lenovo Legion Pro 7i Gen 9 (RTX 4090) | Laptop | NVIDIA RTX 4090 (Laptop) | 16 GB | 7B-13B | ~72 | $2,799 |
| 20 | NZXT Player:Two Max (RTX 4090) | Desktop | NVIDIA RTX 4090 | 24 GB | 7B-20B | ~140 | $2,800 |
| 21 | CyberPowerPC Gamer Supreme (RTX 5080) | Desktop | NVIDIA RTX 5080 | 16 GB | 7B-13B | ~170 | $2,400 |
| 22 | HP Victus 16 (RTX 4060) | Laptop | NVIDIA RTX 4060 (Laptop) | 8 GB | 7B only | ~45 | $849 |
Prices are estimates as of early 2026; GPU prices shift frequently.
Notes on popular DIY GPU choices and builds:

- **16 GB GDDR6 cards** — best value VRAM for the price. 7B and 13B Q4 models run fully in VRAM. Use ExLlamaV2 for maximum speed on CUDA.
- **AMD RX 7900 GRE** — 576 GB/s bandwidth and 16 GB VRAM. Use the llama.cpp Vulkan backend for best inference speed; ROCm is optional for PyTorch.
- **Used NVIDIA RTX 3090** — 24 GB / 936 GB/s; performs near-identically to an RTX 4090 for inference. Best value path to 20B Q4. Use a single GPU only — no SLI.
- **NVIDIA RTX 5090** — 1,792 GB/s GDDR7 bandwidth, the fastest consumer card. 32 GB VRAM limits models to ~26B Q4. Ensure a 360mm AIO or large tower for cooling.
- **DIY RTX 4090 build** — comparable to the Alienware Aurora R16 at ~20% lower cost. 128 GB of system RAM allows 70B CPU offload (slow, ~3 t/s), but 20B in VRAM runs at full speed.
- **Minisforum AI Max+ 395** — not strictly DIY, but a highly upgradeable mini PC. Set the iGPU VRAM allocation to 96-128 GB in BIOS. 70B Q4 runs at ~6 t/s; 7B at ~90 t/s. Best value for 70B access on Windows/Linux.
- **Used NVIDIA A100 80 GB** — PCIe cards available for $4,000-6,000 on the secondary market. Workstation chassis required. 1,935 GB/s HBM2e bandwidth. Best Linux vLLM platform for a single card.
- **Dual NVIDIA RTX A6000** — requires a WRX80 motherboard. Both A6000s share 96 GB via vLLM tensor parallelism (see the sketch after this list). NVLink bridge optional. Linux + vLLM recommended.
- **Single RTX A6000** — 960 GB/s bandwidth with ECC memory. Supports NVLink for a second card later (96 GB / 1,920 GB/s combined). ATX-compatible in a standard full tower. Windows and Linux supported.
- **Quad-GPU workstation build** — WRX80 EATX board required. Quad-slot GPU spacing is mandatory; verify physical fitment. Linux + vLLM with tp=4. Requires a 30A+ dedicated circuit.
- **Dual used H100 PCIe** — cards appear on the secondary market for $12,000-18,000 each. Requires a server-class motherboard with IOMMU support. vLLM with Flash Attention 2 and tp=2 is the recommended production stack.
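Several notes above call for vLLM tensor parallelism (tp=2 / tp=4). A minimal sketch of that pattern, assuming a Linux host with two identical NVIDIA GPUs and `pip install vllm`; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# Shard the model's weights across both GPUs (tensor parallelism).
# tensor_parallel_size corresponds to the "tp=2" shorthand above.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model name
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Why does tensor parallelism help?"], params)
print(outputs[0].outputs[0].text)
```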
**Why DIY:**

- **Cost savings:** An RTX 4090 DIY build runs ~$3,000 vs. $4,000+ for the Alienware Aurora R16: 20-30% less for identical performance.
- **Component choice:** Select your own PSU quality, motherboard stability, and case airflow. Pre-builds often cut corners on PSU brand and motherboard tier.
- **Upgrade path:** Replace the GPU in 2 years without buying a new system. Pre-built OEM cases often use non-standard form factors.
- **Exact VRAM targeting:** A used RTX 3090 at 24 GB costs $400-600 and performs nearly identically to an RTX 4090 for LLM inference.

**Why pre-built:**

- **Warranty & support:** A single vendor for all troubleshooting. For production AI work, downtime costs more than savings.
- **Certified driver stacks:** HP and Dell workstations ship with validated configurations for GPU-intensive workloads.
- **Specialized hardware:** Quad-GPU systems with NVLink bridges and server-class PSUs are impractical to DIY. The HP Z8 G4 and Dell Precision 7960 are only practical as pre-built or refurb.
- **Apple Silicon:** There is no DIY path to Apple unified memory. Mac Studio is the only option for 128 GB+ high-bandwidth unified memory.

**Verdict by budget:**

- **Under $1,500 (DIY wins):** RTX 3090 (used) + budget CPU + 64 GB DDR4 outperforms any pre-built at this price.
- **$3,000-5,000 (DIY saves 20-30%):** A self-built RTX 4090 or 5090 system matches Alienware/ROG for less money.
- **$8,000-20,000 (pre-built pulls ahead):** Workstation-class multi-GPU configs from HP/Dell/Lenovo come with support and certified setups that are hard to replicate DIY.
- **Apple Silicon (no DIY option):** The Mac Studio M3 Ultra is unique in offering 192-512 GB of high-bandwidth unified memory in a consumer form factor.
When comparing GPUs for DIY builds, the Effectiveness Score blends raw token-generation speed, memory bandwidth, and VRAM capacity into a single 0–100 index, and the Value Score measures tokens/sec per $100 spent.
RAM and VRAM capacity directly determine which model sizes fit; memory bandwidth is the primary driver of token generation speed.
Inference speed for self-hosted LLMs is almost entirely memory-bandwidth-bound, not compute-bound: each generated token requires reading the full model weights from memory once, so the faster your memory bus, the faster your tokens. GPU VRAM bandwidth (>1,000 GB/s) dwarfs CPU DRAM bandwidth (30–100 GB/s), which is why a GPU can be 10–30× faster than CPU-only. Knowing which bottleneck you have lets you choose the right optimization.
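The bandwidth bound gives a useful back-of-envelope ceiling: tokens/sec ≈ memory bandwidth ÷ bytes read per token (roughly the model's size in memory). A minimal sketch using figures from the tables in this guide; the results are theoretical ceilings, not benchmarks:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: each generated token streams the full set of
    weights through memory once, so speed <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative estimates (7B Q4 ~ 4.4 GB, 70B Q4 ~ 40 GB):
print(max_tokens_per_sec(1792, 4.4))  # RTX 5090, 7B Q4   -> ~407 t/s ceiling
print(max_tokens_per_sec(800, 40))    # M3 Ultra, 70B Q4  -> ~20 t/s ceiling
print(max_tokens_per_sec(60, 4.4))    # dual-channel DDR5 -> ~14 t/s ceiling
```

Real-world numbers land below these ceilings (compute overhead, context processing), but the ranking they predict matches the measured table below.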
| Hardware Path | Primary Bottleneck | Secondary Bottleneck | Max RAM Available | Typical 7B t/s | Typical 70B t/s |
|---|---|---|---|---|---|
| NVIDIA GPU (e.g. RTX 5090) | VRAM capacity (32 GB ceiling on consumer cards) | VRAM bandwidth when model fits | 32 GB VRAM + system RAM for overflow | ~250 | N/A (model too large for VRAM) |
| NVIDIA Pro GPU (e.g. A100 80GB) | VRAM bandwidth (2,000 GB/s) | PCIe bandwidth for multi-GPU | 80 GB per card | ~260 | ~22 (full VRAM) |
| Apple Silicon (e.g. M3 Ultra) | Unified memory bandwidth (~800 GB/s) | Metal compute shader efficiency | 192–512 GB (model & OS share same pool) | ~120 | ~18–20 |
| AMD RX 7900 XTX | VRAM capacity (24 GB ceiling) | ROCm/HIP software overhead | 24 GB VRAM | ~135 | N/A |
| AMD Ryzen AI Max+ 395 | Unified memory bandwidth (~256 GB/s) | ROCm vs. Vulkan backend selection | 128 GB unified | ~90 | ~6 |
| CPU-only (e.g. i9-13900K, DDR5) | System RAM bandwidth (40–80 GB/s) | Thread count, NUMA topology | Addressable system RAM | ~15–25 | ~2–4 |
Token speeds are batch-size-1 (single-user) generation using Q4_K_M quantization. Multi-user batched inference shifts the advantage toward higher-compute GPUs.
Quantization reduces model precision so that more model fits in memory and fewer bytes move per token, which speeds up generation. Choosing the right level is the single biggest lever available, regardless of hardware.
| Format | Bits/Weight | RAM vs FP16 | Quality vs FP16 | Best Use | Speedup vs FP16 |
|---|---|---|---|---|---|
| FP32 | 32 | 2× more | Identical | Training only | 0.5× |
| FP16 / BF16 | 16 | 1× (baseline) | Near-lossless | High-VRAM GPU inference, fine-tuning | 1× |
| Q8_0 | 8 | ~0.5× | Effectively lossless (<0.1% perplexity) | When you have VRAM to spare | 1.8× |
| Q6_K | 6 | ~0.37× | Minimal quality loss | Large models on capacity-limited systems | 2.2× |
| Q5_K_M | 5 | ~0.31× | Slight loss, noticeable only on coding | Best quality when RAM is just enough | 2.8× |
| Q4_K_M | 4 | ~0.26× | Good — recommended default | Default recommendation for all hardware | 3.2× |
| Q3_K_M | 3 | ~0.19× | Noticeable degradation | Only if Q4 doesn't fit in RAM | 3.8× |
| Q2_K | 2 | ~0.13× | Significant quality loss — avoid | Emergency fallback | 4.5× |
| IQ4_XS (imatrix) | ~4.25 | ~0.27× | Better than Q4_K_M at same size | Best Q4-tier quality | 3.0× |
Rule of thumb: Q4_K_M uses approximately 0.6 GB per 1B parameters. A 70B model ≈ 40 GB Q4; a 13B model ≈ 8 GB Q4; a 7B model ≈ 4.4 GB Q4. Importance-matrix (imatrix) quants (IQ series) give better quality at the same bit depth.
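The rule of thumb is easy to turn into a quick estimator. A minimal sketch, weights only (KV cache and OS overhead come on top):

```python
GB_PER_B_PARAMS_Q4 = 0.6  # Q4_K_M rule of thumb from the text above

def q4_footprint_gb(params_billion: float) -> float:
    """Rough Q4_K_M memory footprint, weights only (no KV cache)."""
    return params_billion * GB_PER_B_PARAMS_Q4

for p in (7, 13, 70):
    print(f"{p:>3}B Q4_K_M ~ {q4_footprint_gb(p):.1f} GB")
# 7B ~ 4.2 GB, 13B ~ 7.8 GB, 70B ~ 42.0 GB -- close to the 4.4/8/40 figures above
```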
- GPU layer offload (`-ngl`): in llama.cpp, set `-ngl 99` to send all transformer layers to the GPU. If VRAM is too small, reduce the layer count incrementally; each layer kept in VRAM improves speed.
- MLX (Apple Silicon): install with `pip install mlx-lm`. Best choice for Apple hardware in 2025.
- Metal backend: llama.cpp enables it with `LLAMA_METAL=1` at compile time.
- Vulkan backend: enable with `-DGGML_VULKAN=ON` at build time.
- ROCm device selection: set `HIP_VISIBLE_DEVICES=0`. Used automatically by torch and llama.cpp ROCm builds.
- ROCm detection workaround: set `HSA_OVERRIDE_GFX_VERSION=11.0.0` for RDNA3 cards (RX 7000 series) if ROCm doesn't detect the GPU correctly — a common issue on consumer RDNA setups.
- Native CPU optimization: compile with `CMAKE_CXX_FLAGS=-march=native`.
- Thread count: match physical cores, e.g. `-t 16` (P-cores only).
- NUMA pinning: `numactl --cpunodebind=0 --membind=0`.
- Memory locking: use `--mlock` to pin the model in RAM and prevent page faults during inference. Requires sufficient free RAM and root/CAP_IPC_LOCK.
- Starting configuration: `-ngl 99 -t 8 --ctx-size 4096 --mlock` is a good baseline.
- Context size: set `--ctx-size` conservatively; reducing it frees memory for the model itself.
- KV cache quantization (`--cache-type-k q8_0`): quantizing the KV cache from FP16 to Q8 or Q4 halves its memory footprint with minor quality impact. Supported in llama.cpp ≥ b2300.
- Flash Attention: enable with `--flash-attn`.

This table shows what changes when you increase or reduce available RAM/VRAM. The key insight: more memory expands which models fit; more bandwidth increases how fast they run. Adding RAM without adding bandwidth does not improve token speed — it only enables larger models.
| Available RAM / VRAM | Largest model (Q4_K_M) | Token speed change vs. prior tier | 70B Q4 accessible? | Notes |
|---|---|---|---|---|
| 4 GB | ~3B Q4 | — | No | OS overhead leaves ~2.5 GB usable. 3B models run well. |
| 8 GB | ~7B Q4 | +0% speed (same BW), +model tier | No | Most popular tier. 7B Q4 = ~4.4 GB. Small context headroom. |
| 16 GB | ~13B Q4 | +0% speed, +model tier | No | 13B Q4 ≈ 7.9 GB. 8K context adds ~1 GB. Comfortable headroom. |
| 24 GB | ~20B Q4 | +0% speed, +model tier | No | Maximum for consumer discrete GPUs (NVIDIA RTX 4090, AMD RX 7900 XTX). 20B Q4 ≈ 12 GB. |
| 32 GB | ~26B Q4, 70B Q2 | +0% speed, marginal 70B Q2 | Marginal (Q2 only) | RTX 5090 ceiling. Q2 70B runs but quality suffers. Strong for 13–26B models. |
| 48 GB | ~40B Q4, 70B Q3 | +0% speed, +model tier | Q3 only (~3.8 t/s) | RTX A6000. 70B Q3 just fits; expect lower quality than Q4. |
| 64 GB+ | 70B Q4 (full) | +0% speed, full 70B access | Yes — Q4_K_M | First tier enabling 70B Q4 fully. Apple M-series, Ryzen AI Max. 70B Q4 ≈ 40 GB; leaves 20+ GB for context. |
| 96 GB | ~80B Q4, 120B Q3 | +0% speed, 120B tier | Yes — with headroom | Dual 3090/A6000, M2 Ultra 96 GB. 120B Q3 ≈ 53 GB fits with room for context. |
| 128 GB | ~105B Q4, 120B Q3 | +0% speed (CPU BW limited) | Yes — comfortably | Ryzen AI Max+ 395, M-series 128 GB. 70B + 32K context fits easily. |
| 256 GB | ~200B Q4, 400B Q3 | +0% speed, 200B tier (Apple BW) | Yes — fast (~18 t/s) | M3 Ultra 256 GB. Llama 3.1 405B Q3 fits comfortably. |
| 512 GB | 405B Q4 full | +0% speed, 400B tier | Yes | M3 Ultra 512 GB. 405B Q4 ≈ 230 GB. Full unquantized 70B (FP16, ~140 GB) fits. |
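Context memory comes on top of the weights, which is where notes like "8K context adds ~1 GB" originate. A sketch of the standard KV-cache size formula; the layer/head figures are those of a Llama-3-8B-like architecture (32 layers, 8 KV heads, head dim 128) and are stated here as assumptions:

```python
def kv_cache_gb(tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

print(kv_cache_gb(8_192))                     # FP16, 8K context -> ~1.07 GB
print(kv_cache_gb(8_192, bytes_per_elem=1))   # q8_0 KV cache    -> ~0.54 GB
print(kv_cache_gb(65_536))                    # FP16, 64K context -> ~8.6 GB
```

The last line shows why the long-context recommendation below pairs a 64K window with a Q8-quantized KV cache: halving bytes-per-element halves the cache.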
| Goal | Recommended Path | Key Setting |
|---|---|---|
| Fastest possible 7B inference | NVIDIA RTX 5090 + ExLlamaV2 | EXL2 Q4 format, enable Flash Attention |
| Run 70B on a budget | AMD Ryzen AI Max+ 395 mini PC (~$1,500) | Set BIOS iGPU VRAM to 128 GB, use Vulkan backend |
| Quiet, large-model desktop | Apple Mac Studio M3 Ultra 128 GB + MLX | Use mlx_lm.generate, enable speculative decoding |
| Multi-user production serving | NVIDIA A100/H100 + vLLM | Enable PagedAttention, continuous batching, tensor parallelism |
| Maximum context length (>64K tokens) | Apple M3 Ultra 256 GB or 512 GB | Set --ctx-size 65536, quantize KV cache to Q8 |
| Fine-tuning / LoRA training | NVIDIA GPU with 24+ GB VRAM | Use QLoRA (4-bit base + FP16 adapters), gradient checkpointing |
| Best CPU-only on x86 | Dual-channel DDR5 + Zen4/Raptor Lake | Compile llama.cpp with -march=native, use -t <physical cores> |
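To make the MLX rows in this table concrete, here is a minimal sketch of the mlx-lm Python API (`pip install mlx-lm`) on Apple Silicon; the model repo name is illustrative, and any MLX-format model works:

```python
from mlx_lm import load, generate

# Load a quantized MLX-format model (illustrative repo name).
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

text = generate(model, tokenizer,
                prompt="Summarize why LLM inference is bandwidth-bound.",
                max_tokens=128)
print(text)
```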
Your operating system determines which GPU acceleration paths, quantization formats, and developer tools are available. This page compares Windows, macOS, and Linux across the dimensions that matter most for running, developing, and fine-tuning local AI models.
| Feature / Tool | Windows 11 | macOS (Apple Silicon) | Linux (Ubuntu 22.04+) |
|---|---|---|---|
| NVIDIA CUDA | Full | Not available | Full — best driver control |
| AMD ROCm (discrete GPU) | Not available — ROCm for discrete is Linux-only | Not available | Full — ROCm 6.0+ on RX 6000/7000 |
| Apple MLX / Metal | Not available | Native — best inference path | Not available |
| DirectML (Microsoft) | Native — ONNX on NVIDIA, AMD, Intel GPU | Not available | Not available |
| Vulkan inference (llama.cpp) | Full | Not available | Full |
| Ollama | Full | Full | Full |
| LM Studio | Full | Full | Full |
| vLLM (production serving) | Via WSL2 only | Not supported | Full — native, best performance |
| ExLlamaV2 | Full (CUDA) | Not available | Full (CUDA) |
| Claude Code | WSL2 required | Native — recommended | Native — recommended |
| Aider coding agent | Works; WSL2 preferred | Full | Full |
| Docker + GPU passthrough | Via Docker Desktop + WSL2 | No GPU passthrough to containers | Full — NVIDIA Container Toolkit |
| Axolotl / Unsloth fine-tuning | Via WSL2 + CUDA | Not supported | Full — recommended platform |
| GGUF quantization | Full | Full | Full |
| EXL2 format | Full (CUDA) | Not available | Full (CUDA) |
| GPTQ / AWQ formats | Full (CUDA) | Not available | Full (CUDA) |
| MLX native format | Not available | Native — Apple Silicon only | Not available |
| BitsAndBytes / QLoRA | Via WSL2 (unreliable natively) | Not supported | Full — recommended for QLoRA |
| Hugging Face transformers | Works; WSL2 for training | Inference via MPS; no CUDA | Full |
| GPU driver management | Easy — NVIDIA App or AMD Adrenaline | Automatic — macOS system updates | Manual — DKMS, kernel headers |
**Windows 11.** Best for: NVIDIA CUDA users wanting a dual-purpose gaming and AI workstation. Widest consumer GPU support, easiest driver management, and the best GUI tool ecosystem.

- Claude Code requires WSL2 on Windows; inside WSL2, install it with `npm install -g @anthropic-ai/claude-code`. Setting up WSL2 also resolves most Python ML compatibility issues.

**macOS (Apple Silicon).** Best for: Developers who want a polished Unix environment, large unified-memory model capacity, low power draw, and the best Claude Code experience without Linux complexity.

- Install MLX with `pip install mlx-lm`, then run models directly with `mlx_lm.generate`. Ollama uses Metal/llama.cpp internally and is slightly slower but simpler to set up.
- LoRA fine-tuning is available natively (`mlx_lm.lora`) for Apple Silicon.

**Linux (Ubuntu 22.04+).** Best for: Fine-tuning, production multi-user serving, AMD discrete GPU acceleration, multi-GPU setups, and developers who need the full toolchain without compatibility workarounds.

- Serve models with `docker run --gpus all vllm/vllm-openai:latest --model meta-llama/Llama-3.1-70B-Instruct` — the standard production pattern.
- Run `apt-mark hold linux-image-generic` to pin a stable kernel once the driver is working.

| Goal | Recommended OS | Reasoning |
|---|---|---|
| Run local chat models (beginner) | Windows or macOS | Ollama and LM Studio work on all platforms. Windows has the widest consumer GPU support; macOS has the simplest environment setup. |
| Fine-tune models (LoRA / QLoRA) | Linux (Ubuntu 22.04) | Axolotl, Unsloth, and BitsAndBytes require Linux + CUDA. Most tutorials, Docker images, and Hugging Face guides assume Ubuntu. |
| Production multi-user model serving | Linux | vLLM with PagedAttention, Docker + NVIDIA Container Toolkit, and production monitoring (Prometheus, Grafana) are all Linux-native. |
| Daily Claude Code use | macOS or Linux | Native first-class support on both. On Windows, WSL2 setup is required and file I/O across the WSL2/Windows boundary is slower. |
| AMD discrete GPU inference | Linux | ROCm is Linux-only for discrete GPUs. On Windows, use Vulkan in llama.cpp/Ollama — functional but ~10–15% slower than ROCm on Linux for the same hardware. |
| Run 70B+ parameter models | macOS (Apple Silicon) or Linux (multi-GPU) | Apple unified memory and NVIDIA NVLink multi-GPU are the two viable paths to 70B+ at usable speed without an extreme budget. |
| Gaming + AI on one machine | Windows | Windows remains the primary gaming platform. NVIDIA CUDA AI tools work natively on Windows alongside games without issue. |
| Power-efficient always-on server | macOS (Apple Silicon) | Apple Silicon offers the best performance-per-watt ratio for LLM inference. A Mac mini M4 Pro can idle at under 10 W while ready to serve requests. |
| Maximum quantization format access | Linux | Linux + CUDA is the only platform where GGUF, EXL2, GPTQ, AWQ, BitsAndBytes, and ONNX with CUDA EP are all simultaneously available. |
A structured catalog of the tools available across each stage of AI work: running local models, developing with AI assistance, fine-tuning models, and accessing cloud AI services. All tools listed are open source or have a meaningful free tier unless noted.
These tools load model files from disk and run them on your CPU or GPU. They are the foundation of local AI use.
- Ollama: `ollama pull llama3.1:70b` fetches a pre-quantized build in one command, ready to run. 200+ curated models available.
- Local runtimes expose an OpenAI-compatible `/v1/chat/completions` endpoint, so any OpenAI-compatible client (Open WebUI, Continue.dev, Cursor) connects immediately.
- llama.cpp exposes every performance knob: `-ngl` (GPU layers), `--ctx-size`, `--cache-type-k`, `--flash-attn`, `-t` (threads), `--mlock`, and more.
- llama.cpp ships `llama-quantize` to convert and quantize models locally without uploading to a cloud service.
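To illustrate the OpenAI-compatible endpoint: a minimal Python sketch, assuming Ollama is serving on its default port with a `llama3.1` model already pulled (model name illustrative):

```python
import json
import urllib.request

# Query a local model through the OpenAI-compatible endpoint.
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# The response follows the OpenAI chat-completions schema.
print(body["choices"][0]["message"]["content"])
```

Because the schema matches OpenAI's, swapping between a local model and a cloud API is usually just a base-URL change in the client.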
Tools that integrate AI into the code-writing and editing workflow, from inline completions to full agentic coding agents that can edit multiple files and run commands.

- On Windows, files on the `/mnt/c` mount work but are slower than editing files inside the WSL2 filesystem.
- In Continue.dev, `config.json` specifies models for chat, completion, and embedding independently — mix local and cloud models.

Web or desktop interfaces that connect to Ollama, LM Studio, or vLLM and provide a chat experience similar to ChatGPT but running entirely on your hardware.
- Open WebUI: launch with `docker run -d -p 3000:80 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main` — connects to a running Ollama instance automatically.
- Ollama's API listens on `http://localhost:11434` by default.

Services that provide access to large frontier models via web interface or API. Useful when a task exceeds local hardware capacity or when the latest proprietary model is needed.
| Service | Models Available | Pricing | Privacy / Notes |
|---|---|---|---|
| Claude.ai (Anthropic) | Claude 3.5 Haiku, Claude Sonnet 4, Claude Opus 4 | Free tier / $20–$25/month Pro / API pay-per-token | Strong system prompt adherence. Pro tier includes extended context (200K tokens), Projects (persistent memory), and document upload. API keys required for Claude Code. |
| ChatGPT (OpenAI) | GPT-4o, o3, o3-mini, GPT-4.5 | Free (GPT-4o limited) / $20/month Plus / API pay-per-token | Widest tool-use ecosystem. GPT-4o includes image generation (DALL-E 3), web browsing, and code execution in one model. o3 is the strongest reasoning model for complex multi-step problems. |
| DuckDuckGo AI Chat | Claude 3 Haiku, GPT-4o mini, Llama 3, Mistral | Free (no account required) | Privacy-focused: DuckDuckGo does not store conversations and removes identifying metadata before forwarding to model providers. Best for quick queries where privacy is a priority. |
| Perplexity | Sonar (proprietary), Claude, GPT-4o | Free / $20/month Pro | AI search engine that cites sources with inline references. Excellent for research queries, real-time information, and technical documentation lookups. Pro tier adds file upload and image generation. |
| Phind | Phind-70B (code-focused), GPT-4o | Free (limited) / $20/month Pro | Developer-focused search and chat. Specializes in code questions, pulls from Stack Overflow, GitHub, and documentation. Strong at answering "how do I do X in Y framework" questions with working code. |
| Google Gemini | Gemini 2.0 Flash, Gemini 2.5 Pro | Free / $20/month Advanced / API pay-per-token | Gemini 2.5 Pro has the longest context window of any frontier model (1M tokens). Excellent for large document analysis. Deep Google Workspace integration. Gemini 2.0 Flash is extremely cost-effective via API. |
Tools for customizing pre-trained models on your own data using LoRA, QLoRA, or full fine-tuning. Requires NVIDIA GPU (Linux) or Apple Silicon (MLX path only).
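As a concrete illustration of the QLoRA pattern these tools implement: a minimal sketch with transformers + peft + bitsandbytes on Linux/CUDA. The model name and hyperparameters are illustrative, not a tuned recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 -- the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",      # illustrative; any causal LM works
    quantization_config=bnb,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach small trainable LoRA adapters on the attention projections;
# only these adapter weights are updated during training.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of total weights
```

Axolotl and Unsloth wrap this same pattern behind config files and optimized kernels.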
- Axolotl: training runs are defined in `config.yml`. Run with `axolotl train config.yml`.
- MLX LoRA (`mlx_lm.lora`): the only fine-tuning option available on Apple Silicon — no CUDA, Axolotl, or Unsloth needed. Example invocation: `mlx_lm.lora --model <path> --train --data ./data --iters 1000 --batch-size 4`

| Tool | Purpose | Cost | Key Features |
|---|---|---|---|
| Hugging Face Hub | Model repository & discovery | Free (storage limits) / Pro | 800,000+ models, datasets, and Spaces. Filter by task, license, and quantization format. The standard place to find GGUF, GPTQ, and AWQ quantized model variants. Direct download links for Ollama and llama.cpp. |
| Ollama Model Library | Curated models for Ollama | Free | Curated, verified GGUF models runnable with one command. Smaller and more opinionated than HF Hub. Includes all popular models (Llama 3, Mistral, Qwen, Phi, Gemma) pre-tested with Ollama. |
| LMSYS Chatbot Arena | Blind model evaluation leaderboard | Free | Community-ranked model leaderboard based on blind A/B comparisons. The most reliable quality ranking because it reflects human preferences, not just academic benchmarks. Good for choosing between similar model sizes. |
| Weights & Biases (W&B) | Experiment tracking & ML monitoring | Free (personal) / $50/month Teams | Log training loss curves, hyperparameters, GPU metrics, and model checkpoints. Integrates with Axolotl, Unsloth, and transformers natively. The standard tool for tracking fine-tuning runs. |
| MLflow | Open-source experiment tracking | Free (self-hosted) | Local alternative to W&B. Track experiments, compare model versions, and serve models via its built-in REST API. No cloud dependency — runs on your own machine or server. |
| lm-evaluation-harness | Standardized model benchmarking | Free (open source) | Eleuther AI's framework for running standard benchmarks (MMLU, HellaSwag, ARC, TruthfulQA) against local models. The tool used to generate most public benchmark numbers. Run with lm_eval --model hf --model_args pretrained=<model> --tasks mmlu. |