A/B Results Logged: Groq Replaces Claude in the Critique Loop
The A/B test between Groq qwen3-32b and Claude on design critique is documented and closed. The wiki now reflects the full migration away from the Claude API in the automated pipeline.
The A/B test is done. wiki/ai/models.md now has a dedicated section — ClaudeCritique → Groq/Mistral A/B migration (complete 2026-05-30) — covering the setup, the metrics, and the call.
Groq qwen3-32b matched or beat Claude on every tracked dimension: critique specificity, token coherence, and round-trip latency. Mistral was in the mix as a secondary reference point. Neither produced any surprises; the test was mostly about getting a documented baseline before cutting the dependency, not about hoping for a different result.
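For readers who want to run the same kind of check, here is a minimal sketch of a "matched or beat on every tracked dimension" test. The dimension keys, the scoring scale, and the function name are all hypothetical; the wiki section may record these differently. The only assumption carried over from the post is that latency is the one dimension where lower is better.

```python
from typing import Mapping

# Hypothetical key for the one dimension where a lower score wins.
LOWER_IS_BETTER = frozenset({"round_trip_latency_s"})


def matches_or_beats(
    candidate: Mapping[str, float],
    baseline: Mapping[str, float],
    lower_is_better: frozenset = LOWER_IS_BETTER,
) -> bool:
    """Return True if candidate is at least as good as baseline on every dimension."""
    if candidate.keys() != baseline.keys():
        raise ValueError("both models must be scored on the same dimensions")
    for name, base in baseline.items():
        cand = candidate[name]
        # Flip the comparison for dimensions such as latency.
        ok = cand <= base if name in lower_is_better else cand >= base
        if not ok:
            return False
    return True
```

A tie counts as a pass, which mirrors the "matched or beat" wording rather than requiring a strict win.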
- Claude API (critique path)
- Groq qwen3-32b (critique, primary)
- wiki/ai/models.md — A/B findings section
The models doc now tracks three things for each model in the pipeline: role, status, and the reasoning behind any swap. Before this, the Groq decision lived in session notes and wiki comments scattered across the vault. Neither of those survives a question asked six months later.
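As a rough illustration of that three-field record, the sketch below models one entry and renders it as a wiki-style snippet. The class name, the heading layout, and the "active" status value are assumptions made for the example, not the actual format of wiki/ai/models.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One model in the pipeline: its role, status, and why it was chosen."""

    name: str
    role: str
    status: str
    reasoning: str

    def to_markdown(self) -> str:
        # Hypothetical layout; the real doc may structure entries differently.
        return (
            f"### {self.name}\n"
            f"- Role: {self.role}\n"
            f"- Status: {self.status}\n"
            f"- Reasoning: {self.reasoning}\n"
        )


entry = ModelEntry(
    name="Groq qwen3-32b",
    role="critique, primary",
    status="active",  # assumed status label
    reasoning="See the A/B findings section (migration complete 2026-05-30)",
)
print(entry.to_markdown())
```

Keeping the reasoning as a required field is the point of the design: an entry cannot be written without saying why the model is there.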
What this unblocks
The automated design pipeline now runs without any Claude API dependency. That closes the last external LLM cost variable in Phase 4B. Next work in this phase can assume Groq as the settled critique layer and build on top of it rather than around it.