Posts about evaluating large language models: techniques, tools, and practices for measuring the quality, consistency, reliability, and safety of LLM outputs.
Discover how LLM evals with Promptfoo and fast testing with Vitest helped turn podcast-it from a prototype into a production-ready generative AI app that produces consistent, high-quality audio episodes.