Load Models

Compile llama-cpp-python (first time only) and download GGUF models.

LLM Judge: Qwen3-8B-Q4_K_M (auto-downloaded)

Dataset Generator: Nemotron3-Nano-4B-Q4_K_M (auto-downloaded)

Models are loaded locally via llama.cpp. Click Save & Load Models below to compile llama-cpp-python (first time only, ~2–3 min) and download the GGUF files.
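For readers who want to reproduce the loading step outside the UI, here is a minimal sketch of how two GGUF models could be fetched and loaded with llama-cpp-python. The repository IDs are hypothetical placeholders (this page does not say where the files are hosted), and `n_ctx=4096` is an assumed value; only the quantization suffix `Q4_K_M` comes from the text above. `Llama.from_pretrained` needs the `huggingface_hub` package installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GGUFModel:
    role: str       # what the app uses the model for
    repo_id: str    # HYPOTHETICAL Hugging Face repo; replace with the real one
    filename: str   # glob pattern selecting the quantized file in that repo


# Placeholder repo IDs: assumptions for illustration, not the app's actual sources.
MODELS = [
    GGUFModel("judge", "example-org/Qwen3-8B-GGUF", "*Q4_K_M.gguf"),
    GGUFModel("generator", "example-org/Nemotron3-Nano-4B-GGUF", "*Q4_K_M.gguf"),
]


def load(model: GGUFModel, n_ctx: int = 4096):
    """Download the GGUF file if it is not cached, then load it with llama.cpp."""
    # Imported lazily so this module can be imported before llama-cpp-python
    # has been compiled and installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=model.repo_id,
        filename=model.filename,
        n_ctx=n_ctx,       # assumed context window
        verbose=False,
    )
```

Calling `load(MODELS[0])` would download on first use and read from the local cache afterwards, which matches the "auto-downloaded" behaviour described above.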

Evaluation Levels

📦 Session Evaluators (once per session)

🔄 Trace Evaluators (once per conversation turn)

🔧 Span Evaluators (once per tool call)
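
The three levels differ only in how often an evaluator fires. The sketch below counts evaluator runs for one session under that reading; the `Turn` and `Session` types and the evaluator names are hypothetical and are not taken from the app.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    # One conversation turn (a trace) and the tool calls (spans) made in it.
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class Session:
    turns: list[Turn] = field(default_factory=list)


def evaluator_runs(session, session_evals, trace_evals, span_evals):
    """Return how many evaluator invocations each level produces."""
    n_turns = len(session.turns)
    n_spans = sum(len(t.tool_calls) for t in session.turns)
    return {
        "session": len(session_evals),          # once per session
        "trace": len(trace_evals) * n_turns,    # once per conversation turn
        "span": len(span_evals) * n_spans,      # once per tool call
    }


# Two turns with three tool calls in total; evaluator names are made up.
demo = Session([Turn(["search", "fetch"]), Turn(["calculator"])])
counts = evaluator_runs(
    demo,
    session_evals=["goal_success"],
    trace_evals=["helpfulness", "faithfulness"],
    span_evals=["tool_accuracy"],
)
print(counts)  # {'session': 1, 'trace': 4, 'span': 3}
```

Because span evaluators scale with the number of tool calls, they are likely the first thing to disable when a local judge model is slow.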

Settings
