AI-generated video is rapidly scaling across production workflows, but evaluating quality remains a bottleneck. Human review is too slow, while off-the-shelf foundation models miss critical issues that matter to production teams.
To validate our approach, we benchmarked the Janus multimodal judge against popular off-the-shelf models to see how closely each one scores AI-generated video the way human reviewers do.
We ran 120 evaluations across 30 real-world video generation tasks, testing 4 evaluation models on 6 quality dimensions: brief adherence, brand compliance, message effectiveness, visual quality, temporal consistency, and safety/brand risk. The Janus multimodal judge leads in accuracy (48.33%), recall (47.44%), and correlation (Spearman ρ 0.2872), outperforming GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash.
All models were evaluated under a single, shared rubric, using the same prompts and the same 30 videos to ensure consistency.
We evaluated four models: the Janus multimodal judge, GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash.
Our rubric design, evaluation metrics, and benchmark setup follow the VideoScore2 [1] setup, extended from its 3 dimensions to our 6.
Our benchmark consists of 30 short-form AI-generated videos with human annotations across 6 production-relevant quality dimensions. This represents the first release from a larger dataset we're actively developing. All videos include audio tracks and represent real-world ecommerce and brand video generation scenarios. We used Fal to generate the videos for this benchmark.
| Category | Count | Description |
|---|---|---|
| Total Videos | 30 | All videos include audio and human annotations |
| Video Generation Models | 3 | Veo 3.1, Wan 2.6, Kling 2.6 (10 videos each) |
| Task Types | 2 | Baseline (12 videos) and Adversarial (18 videos) |
| Adversarial Categories | 6 | Each category has 3 videos testing specific failure modes |
The dataset includes videos generated by three leading video generation models: Veo 3.1, Wan 2.6, and Kling 2.6, with 10 videos from each. All videos are ecommerce-focused, featuring product demonstrations, brand showcases, and promotional content typical of short-form advertising.
Videos are split into two categories designed to test different aspects of evaluation capability:
Baseline Tasks (12 videos): Standard ecommerce video generation prompts testing core evaluation capabilities. These include typical scenarios like product unboxing and brand showcases.
Adversarial Tasks (18 videos): Deliberately challenging prompts designed to test specific failure modes that commonly occur in production, like product logos drifting or disappearing mid-video and value props placed incorrectly.
All videos were manually annotated by human evaluators using the same rubric applied to model evaluations, serving as ground truth for evaluating model performance. Each video received a human score on the 1-5 scale for each of the 6 quality dimensions.
We report multiple metrics to capture different aspects of evaluation performance. All metrics are computed per dimension and then macro-averaged across the 6 dimensions so each contributes equally, regardless of sample counts or score distribution.
The percentage of evaluations where the model's score exactly matches the human ground truth score. This is our primary metric because production workflows require precise score matching: a model that scores 3 when humans score 4 may incorrectly approve or reject content.
Calculation: (Number of exact matches where model_score == human_score) / (Total evaluations) × 100
The percentage of evaluations where the model's score is within 1 point of the human score. This metric captures near-misses and is useful for understanding when models are "close enough" for practical use.
Calculation: (Number of scores where abs(model_score - human_score) <= 1.0) / (Total evaluations) × 100
The average absolute difference between model scores and human scores across all evaluations. Lower MAE indicates better alignment with human evaluator assessments.
Calculation: Mean(|model_score - human_score|) across all evaluations.
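To make the scoring concrete, here is a minimal Python sketch of how these three metrics (exact accuracy, relaxed accuracy, and MAE) can be computed per dimension and then macro-averaged. The record layout and field names (`dimension`, `model_score`, `human_score`) are illustrative assumptions, not the benchmark's actual schema.

```python
import numpy as np

# Hypothetical record layout: one entry per (video, dimension) evaluation.
evals = [
    {"dimension": "brief_adherence", "model_score": 4, "human_score": 4},
    {"dimension": "visual_quality", "model_score": 3, "human_score": 5},
    # ... one record per video per dimension
]

def per_dimension_metrics(records):
    """Exact accuracy, relaxed (+/-1) accuracy, and MAE for one dimension."""
    model = np.array([r["model_score"] for r in records], dtype=float)
    human = np.array([r["human_score"] for r in records], dtype=float)
    return {
        "exact_accuracy": np.mean(model == human) * 100,
        "relaxed_accuracy": np.mean(np.abs(model - human) <= 1.0) * 100,
        "mae": np.mean(np.abs(model - human)),
    }

def macro_average(evals):
    """Average each metric over dimensions so every dimension counts equally."""
    dims = sorted({r["dimension"] for r in evals})
    per_dim = [per_dimension_metrics([r for r in evals if r["dimension"] == d])
               for d in dims]
    return {k: float(np.mean([m[k] for m in per_dim])) for k in per_dim[0]}
```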
Measures the rank-order correlation between model and human scores using Spearman's rank correlation coefficient. A high Spearman ρ indicates the model correctly identifies which videos are better or worse, even if exact scores don't match. Values range from -1 (perfect inverse correlation) to +1 (perfect correlation).
Measures linear correlation between model and human scores using Pearson's correlation coefficient. Unlike Spearman, PLCC captures whether score magnitudes scale proportionally. Values range from -1 to +1.
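Both coefficients map directly onto standard `scipy.stats` calls. The sketch below is illustrative: the toy scores are made up, and in practice the coefficients would be computed per dimension and macro-averaged like the other metrics.

```python
from scipy.stats import spearmanr, pearsonr

def correlation_metrics(model_scores, human_scores):
    """Rank correlation (Spearman rho) and linear correlation (Pearson) for one dimension."""
    rho, _ = spearmanr(model_scores, human_scores)
    plcc, _ = pearsonr(model_scores, human_scores)
    return {"spearman": rho, "pearson": plcc}

# Toy example on the 1-5 scale.
model = [3, 4, 2, 5, 4]
human = [4, 4, 2, 5, 3]
print(correlation_metrics(model, human))
```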
The percentage of videos where the model flags issues (scores < 3.0 on the 1-5 scale). A higher strictness indicates the model is more conservative, flagging more content as problematic. This metric helps understand a model's default behavior independent of accuracy.
Calculation: (Number of model scores < 3.0) / (Total evaluations) × 100
Of all videos the model flags (scores < 3.0), what percentage did humans also flag? High precision means the model rarely flags acceptable content. This is critical for operational use where false positives waste reviewer time.
Calculation: (True Positives) / (True Positives + False Positives) × 100
Where: True Positives are videos flagged by both the model and humans (both scores < 3.0), and False Positives are videos flagged by the model but not by humans.
Of all videos humans flagged (scores < 3.0), what percentage did the model catch? High recall means the model rarely misses problematic content. This is critical for production workflows where missing issues can cause brand damage or compliance violations.
Calculation: (True Positives) / (True Positives + False Negatives) × 100
Where: True Positives are videos flagged by both the model and humans (both scores < 3.0), and False Negatives are videos flagged by humans but not by the model.
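Putting the flagging metrics together, the following sketch computes strictness, precision, and recall from the < 3.0 threshold described above. The function name and input format are hypothetical, but the logic mirrors the definitions.

```python
FLAG_THRESHOLD = 3.0  # scores below this count as "flagged"

def flag_metrics(model_scores, human_scores):
    """Strictness, precision, and recall for the < 3.0 flagging rule."""
    model_flags = [m < FLAG_THRESHOLD for m in model_scores]
    human_flags = [h < FLAG_THRESHOLD for h in human_scores]

    tp = sum(m and h for m, h in zip(model_flags, human_flags))      # both flagged
    fp = sum(m and not h for m, h in zip(model_flags, human_flags))  # model-only flag
    fn = sum(h and not m for m, h in zip(model_flags, human_flags))  # human-only flag

    return {
        "strictness": 100 * sum(model_flags) / len(model_flags),
        "precision": 100 * tp / (tp + fp) if (tp + fp) else None,
        "recall": 100 * tp / (tp + fn) if (tp + fn) else None,
    }
```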
The Janus Judge leads on nearly every metric, outperforming GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash in accuracy, recall, and correlation with human evaluator scores, while also achieving the lowest mean absolute error.
| Model | Accuracy | Recall | Precision | MAE | Spearman ρ |
|---|---|---|---|---|---|
| Janus Judge | 48.33% | 47.44% | 43.24% | 1.01 | 0.2872 |
| Claude Sonnet 4.5 | 39.44% | 42.33% | 53.79% | 1.06 | 0.2867 |
| Gemini 2.5 Flash | 38.89% | 30.13% | 54.86% | 1.22 | 0.2278 |
| GPT-5.2 | 32.22% | 30.02% | 53.29% | 1.06 | 0.2439 |
Exact-match accuracy by dimension:

| Dimension | Janus | Claude | GPT | Gemini |
|---|---|---|---|---|
| Brief Adherence | 46.67% | 43.33% | 20.00% | 30.00% |
| Brand Compliance | 26.67% | 30.00% | 33.33% | 26.67% |
| Message Effectiveness | 40.00% | 33.33% | 23.33% | 26.67% |
| Visual Technical Quality | 43.33% | 20.00% | 16.67% | 30.00% |
| Temporal Consistency | 40.00% | 43.33% | 20.00% | 36.67% |
| Safety/Brand Risk | 93.33% | 66.67% | 80.00% | 83.33% |
Janus achieves the highest recall, meaning it catches more human-flagged issues than any baseline; Claude is second, while Gemini and GPT miss far more problems. If missing issues is costly, this coverage posture is the safer default.
Gemini, Claude, and GPT are more selective than Janus, producing fewer false alarms but missing more issues. This is the precision-versus-coverage trade-off in action, with distinct commercial implications depending on whether false positives or missed defects carry higher cost.
Janus tops rank alignment with humans, edging out Claude and ahead of GPT and Gemini. Better correlation means closer agreement with human evaluator assessments on which videos are better or worse, underpinning trust when decisions hinge on relative quality.
Janus achieves the highest exact-match accuracy, with Claude, Gemini, and GPT behind. Combined with its top recall, this higher accuracy explains why Janus catches more real-world issues despite lower precision; in operational settings, it is the safer verifier when the cost of misses outweighs the cost of extra reviews.
Taken together, the results show Janus favors coverage and human alignment (top recall, correlation, relaxed accuracy, and exact match), which makes it the safer default in production settings where missed issues are costly and speed to confidence matters. The baselines favor selectivity (higher precision), trading coverage for fewer false alarms, a reasonable choice when reviewer bandwidth is limited, but risky when problems slip through. The results reveal a clear pattern: our purpose-built approach to video evaluation aligns better with human evaluator assessments across multiple dimensions, catching more issues while maintaining stronger correlation with how humans actually assess quality. Yet even with these advantages, all models struggle on certain dimensions, exposing fundamental limitations in current video evaluation capabilities that we explore next.
We analyzed cases where humans flagged issues that models missed in order to identify failure patterns. Using Gemini 2.5 Flash for taxonomy creation and categorization, we produced the per-model breakdowns of missed-issue categories shown in the tables below.
| Category | Count | % |
|---|---|---|
| Brand Element Fidelity & Placement | 6 | 24.00% |
| Object Temporal Inconsistency | 6 | 24.00% |
| Technical Artifacts | 3 | 12.00% |
| Visual Clarity Issues | 3 | 12.00% |
| Value Proposition & Hook Issues | 2 | 8.00% |
| Category | Count | % |
|---|---|---|
| Object Temporal Inconsistency | 6 | 17.14% |
| Call-to-Action (CTA) Issues | 5 | 14.29% |
| Brand Element Fidelity & Placement | 5 | 14.29% |
| Value Proposition & Hook Issues | 5 | 14.29% |
| Object Color & Material Mismatch | 4 | 11.43% |
| Category | Count | % |
|---|---|---|
| Visual Clarity Issues | 5 | 16.67% |
| Object Temporal Inconsistency | 5 | 16.67% |
| Brand Element Fidelity & Placement | 4 | 13.33% |
| Object Identity & Shape Mismatch | 3 | 10.00% |
| Call-to-Action (CTA) Issues | 3 | 10.00% |
| Category | Count | % |
|---|---|---|
| Technical Artifacts | 5 | 23.81% |
| Object Temporal Inconsistency | 4 | 19.05% |
| Brand Element Fidelity & Placement | 3 | 14.29% |
| Object Identity & Shape Mismatch | 2 | 9.52% |
| Lack of Expected Motion/Dynamism | 2 | 9.52% |
*We've capped each table to show only the top 5 rows for readability. The full data is available on request from our team.
Across baselines, two themes dominate human-flagged misses: temporal/object stability and brand element fidelity. Claude and GPT frequently overlook small logo/serial placement errors and slow identity drift, which translates into brand risk and continuity breaks that reviewers must manually catch. Gemini more often misses key messaging elements or weakens the value proposition while also missing brand fidelity details. Janus shows similar weaknesses, with technical artifacts accounting for nearly a quarter of its misses, indicating challenges in detecting subtle visual glitches and tracking object identity across frames.
These failure patterns have clear production implications: they increase compliance risk, require manual reviewer intervention, and can invalidate claims about product appearance or behavior. The clustered evidence indicates the next improvements needed: strengthening temporal consistency detection, enhancing brand element fidelity verification, and improving object identity tracking to help catch the subtle failure modes that currently slip through automated evaluation.
This benchmark represents an initial exploration into video evaluation capabilities, and several limitations should be acknowledged:
We were unable to test as many evaluators as originally planned due to time constraints. We plan to add more evaluators in our expanded version.
With 30 videos across 6 dimensions, the dataset is relatively small. While sufficient to demonstrate clear performance differences between different verifiers, a larger and more diverse dataset would provide stronger statistical confidence and better edge case coverage.
Finding suitable existing datasets proved difficult: public options rarely combine AI-generated video with audio, expert annotations, and the production-relevant quality dimensions (such as brief adherence and brand compliance) that our use case requires.
These limitations directly motivated the creation of this benchmark. The lack of industry-relevant, expert-annotated datasets for AI-generated video evaluation represents a significant gap in the field. Due to these constraints, we built our own dataset from scratch, selecting a subset of 30 videos for this benchmark. We plan to release an expanded version once our full dataset is generated in the coming weeks.
This benchmark establishes a foundation for evaluating video quality in production contexts, where expert annotations are essential but scarce. The results demonstrate that Janus leads in accuracy, recall, and correlation with human evaluator assessments, outperforming general-purpose models. However, all models had difficulty with temporal consistency and tracking objects across frames, indicating clear next steps for improvement. As AI-generated video scales, these evaluation capabilities will become increasingly critical for maintaining quality, compliance, and brand integrity in automated workflows.
1. VideoScore2: Think before You Score in Generative Video Evaluation. Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen. arXiv:2509.22799, 2025. https://arxiv.org/abs/2509.22799
Interested in our benchmark?
If you're a lab interested in our verifier, this dataset, or adding your model to this benchmark, we'd love to hear from you. Reach out to discuss how we can support your evaluation capabilities.