Benchmarks

Curated benchmarks for high-stakes domains

Latest Release

Video Evaluation Benchmark: Janus vs. Baseline Models

We ran 120 evaluations: 30 real-world video generation tasks, each scored by 4 evaluation models. The Janus Judge achieved 47.4% accuracy, outperforming GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash.

We're developing curated benchmarks for high-stakes domains like healthcare, finance, and customer support, as well as complex enterprise workflows where reliability matters.

If you are interested in early access, reach out at team@withjanus.com.