Load Models
Compile llama-cpp-python (first time only) and download GGUF models.
LLM Judge: Qwen3-8B-Q4_K_M (auto-downloaded)
Dataset Generator: Nemotron3-Nano-4B-Q4_K_M (auto-downloaded)
Models are loaded locally via llama.cpp. Click Save & Load Models below to compile llama-cpp-python (first time only, ~2-3 min) and download the GGUFs.
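For readers who want to reproduce the load step outside the UI, the following is a minimal sketch, assuming llama-cpp-python and huggingface_hub are installed. The repo ids are hypothetical placeholders (the page does not name the repositories it downloads from), and `ModelSpec`, `MODEL_SPECS`, and `load_model` are illustrative names, not part of this app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    role: str       # what the model is used for in the app
    repo_id: str    # HYPOTHETICAL Hugging Face repo id; replace with the real one
    filename: str   # glob pattern matched against GGUF files in the repo


# The two roles listed above; the repo ids are placeholders, not the real sources.
MODEL_SPECS = [
    ModelSpec("LLM Judge", "your-org/Qwen3-8B-GGUF", "*Q4_K_M.gguf"),
    ModelSpec("Dataset Generator", "your-org/Nemotron3-Nano-4B-GGUF", "*Q4_K_M.gguf"),
]


def load_model(spec: ModelSpec, n_ctx: int = 4096):
    """Download the GGUF on first use (cached afterwards) and load it via llama.cpp."""
    # Imported lazily so the specs can be inspected without the compiled package.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=spec.repo_id,
        filename=spec.filename,
        n_ctx=n_ctx,
        verbose=False,
    )
```

Calling `load_model(MODEL_SPECS[0])` would fetch and load the judge model once the placeholder repo id is replaced with a real one. The context size of 4096 is an assumed default, not a value taken from this page.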
Evaluation Levels
Session Evaluators (once per session)
Trace Evaluators (once per conversation turn)
Span Evaluators (once per tool call)
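The three levels differ only in how often an evaluator runs. A small sketch of that counting, assuming a session is a list of turns and each turn holds a list of tool calls; `Session`, `Turn`, and `evaluator_run_counts` are hypothetical names used for illustration, not the app's own data model.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    # Names of the tools called during this conversation turn (may be empty).
    tool_calls: list = field(default_factory=list)


@dataclass
class Session:
    turns: list = field(default_factory=list)


def evaluator_run_counts(session: Session) -> dict:
    """Return how many times an evaluator at each level runs for one session."""
    return {
        "session": 1,                                             # once per session
        "trace": len(session.turns),                              # once per turn
        "span": sum(len(t.tool_calls) for t in session.turns),    # once per tool call
    }


example = Session(turns=[Turn(["search", "calculator"]), Turn([]), Turn(["search"])])
counts = evaluator_run_counts(example)
print(counts)  # {'session': 1, 'trace': 3, 'span': 3}
```

A turn with no tool calls still triggers the trace evaluators but contributes nothing at the span level.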
Settings