AI Agent Evaluation Pipeline

Load Models

Compile llama-cpp-python (first time only) and download GGUF models.

LLM Judge — Qwen3-8B-Q4_K_M (auto-downloaded)

Dataset Generator — Nemotron3-Nano-4B-Q4_K_M (auto-downloaded)

Both models run locally via llama.cpp. Click Save & Load Models below to compile llama-cpp-python (first run only, ~2–3 min) and download the GGUF model files.
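For readers who want to reproduce the load step outside the UI, the following is a minimal sketch, not the pipeline's actual code. It assumes llama-cpp-python and huggingface-hub are installed (for example via pip, which is where the one-time compile happens). The `ModelSpec` class, the `MODELS` table, the `load_model` helper, and both repo ids are hypothetical placeholders introduced for illustration. Only the model names and the Q4_K_M quantization come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    role: str
    repo_id: str   # Hugging Face repo holding the GGUF (placeholder, not from the source)
    filename: str  # glob pattern matched against the files in that repo


# Placeholder repo ids: substitute the repos the pipeline actually downloads from.
MODELS = {
    "judge": ModelSpec("LLM Judge", "example-org/Qwen3-8B-GGUF", "*Q4_K_M.gguf"),
    "generator": ModelSpec(
        "Dataset Generator", "example-org/Nemotron3-Nano-4B-GGUF", "*Q4_K_M.gguf"
    ),
}


def load_model(spec: ModelSpec, n_ctx: int = 4096):
    """Download the GGUF on first use (cached afterwards) and load it locally."""
    # Imported lazily so the specs can be inspected without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=spec.repo_id,
        filename=spec.filename,
        n_ctx=n_ctx,      # context window size; 4096 is an arbitrary default here
        verbose=False,
    )
```

With real repo ids filled in, `judge = load_model(MODELS["judge"])` would return a `Llama` instance that can be queried with `judge.create_chat_completion(messages=[...])`. Keeping the specs in one table means the judge and the generator can be swapped independently.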

Evaluation Levels