Video Evaluation Benchmark: Janus vs. Baseline Models

AI-generated video is rapidly scaling across production workflows, but evaluating quality remains a bottleneck. Human review is too slow, while off-the-shelf foundation models miss critical issues that matter to production teams.

To validate our approach, we evaluated the Janus multimodal judge against popular off-the-shelf models to see how well they could evaluate AI-generated video the way humans would.

We ran 120 evaluations across 30 real-world video generation tasks, testing 4 evaluation models on 6 quality dimensions: brief adherence, brand compliance, message effectiveness, visual quality, temporal consistency, and safety/brand risk. The Janus multimodal judge leads in accuracy (48.33%), recall (47.44%), and correlation (Spearman ρ 0.2872), outperforming GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash.

Methodology

All models were evaluated under a single, shared rubric, using the same prompts and the same 30 videos to ensure consistency.

We evaluated four models:

  • Janus Judge: Our proprietary verifier designed for multimodal video evaluation.
  • GPT-5.2: OpenAI's flagship model, evaluated on sampled video frames.
  • Claude Sonnet 4.5: Anthropic's frontier multimodal model, evaluated on sampled video frames.
  • Gemini 2.5 Flash: Google's fast, video-native multimodal model, evaluated on full video input.

Our rubric design, evaluation metrics, and benchmark setup follow VideoScore2 [1], extended from its 3 dimensions to our 6.

Evaluation setup

  • Janus (Judge + Mapper): Full video and prompt sent to the Janus API for multimodal analysis. Because our API's native output format differs from the standardized 6-dimension rubric, we use Gemini 2.5 Flash to map the results to the same 6 rubric dimensions (1–5) with reasoning.
  • Baselines (Direct): Each model receives the video, prompt, and rubric text directly.

Controls for consistency and fairness

  • Same 6-dimension rubric text for every model.
  • Same generation prompt and the same set of 30 videos for all models.
  • Uniform 1–5 Likert scale: 5 (Gold Standard), 4 (Production Ready), 3 (Acceptable), 2 (Concept Only), 1 (Critical Failure).

At a glance:

  • 30 videos evaluated: real-world ecommerce and brand video generation tasks.
  • 120 total evaluations: 4 models × 30 videos across 6 dimensions.
  • 6 quality dimensions: brief adherence, brand compliance, message effectiveness, visual quality, temporal consistency, safety/brand risk.

Dataset

Our benchmark consists of 30 short-form AI-generated videos with human annotations across 6 production-relevant quality dimensions. This represents the first release from a larger dataset we're actively developing. All videos include audio tracks and represent real-world ecommerce and brand video generation scenarios. We used Fal to generate the videos for this benchmark.

Dataset Composition

Category | Count | Description
Total Videos | 30 | All videos include audio and human annotations
Video Generation Models | 3 | Veo 3.1, Wan 2.6, Kling 2.6 (10 videos each)
Task Types | 2 | Baseline (12 videos) and Adversarial (18 videos)
Adversarial Categories | 6 | Each category has 3 videos testing specific failure modes

Video Generation Models

The dataset includes videos generated by three leading video generation models. All videos are ecommerce-focused, featuring product demonstrations, brand showcases, and promotional content typical of short-form advertising:

  • Google Veo 3.1 (10 videos)
  • Alibaba Wan 2.6 (10 videos)
  • Kuaishou Kling 2.6 (10 videos)

Task Design: Baseline vs. Adversarial

Videos are split into two categories designed to test different aspects of evaluation capability:

Baseline Tasks (12 videos): Standard ecommerce video generation prompts testing core evaluation capabilities. These include typical scenarios like product unboxing and brand showcases.

Adversarial Tasks (18 videos): Deliberately challenging prompts designed to probe specific failure modes that commonly occur in production, such as product logos drifting or disappearing mid-video and value propositions placed incorrectly.

Human Annotation Process

All videos were manually annotated by human evaluators using the same rubric applied to model evaluations, serving as ground truth for evaluating model performance. Each video received:

  • Scores (1-5 Likert scale) for all 6 dimensions
  • Detailed rationales explaining each score

Metrics

We report multiple metrics to capture different aspects of evaluation performance. All metrics are computed per dimension and then macro-averaged across the 6 dimensions so each contributes equally, regardless of sample counts or score distribution.

Exact-Match Accuracy (Primary Metric)

The percentage of evaluations where the model's score exactly matches the human ground truth score. This is our primary metric because production workflows require precise score matching: a model that scores 3 when humans score 4 may incorrectly approve or reject content.

Calculation: (Number of evaluations where model_score == human_score) / (Total evaluations) × 100

Relaxed Accuracy

The percentage of evaluations where the model's score is within 1 point of the human score. This metric captures near-misses and is useful for understanding when models are "close enough" for practical use.

Calculation: (Number of scores where abs(model_score - human_score) <= 1.0) / (Total evaluations) × 100

Mean Absolute Error (MAE)

The average absolute difference between model scores and human scores across all evaluations. Lower MAE indicates better alignment with human evaluator assessments.

Calculation: Mean(|model_score - human_score|) across all evaluations.
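
To make these first three metrics concrete, here is a minimal Python sketch of computing them per dimension and then macro-averaging across dimensions, as described above. The data layout and names (a dict of (model_score, human_score) pairs per dimension) are illustrative, not our production pipeline.

```python
from statistics import mean

# Illustrative layout: for each rubric dimension, a list of
# (model_score, human_score) pairs on the 1-5 Likert scale.
scores = {
    "brief_adherence": [(4, 4), (2, 3), (5, 5)],
    "visual_quality": [(3, 4), (4, 4), (1, 2)],
    # ... remaining dimensions
}

def exact_match_accuracy(pairs):
    """Percentage of evaluations where model_score == human_score."""
    return 100 * mean(1 if m == h else 0 for m, h in pairs)

def relaxed_accuracy(pairs):
    """Percentage of evaluations within 1 point of the human score."""
    return 100 * mean(1 if abs(m - h) <= 1 else 0 for m, h in pairs)

def mae(pairs):
    """Mean absolute error between model and human scores."""
    return mean(abs(m - h) for m, h in pairs)

def macro_average(metric, per_dimension_scores):
    """Compute a metric per dimension, then average the per-dimension
    values so each dimension contributes equally."""
    return mean(metric(pairs) for pairs in per_dimension_scores.values())

print(macro_average(exact_match_accuracy, scores))
print(macro_average(relaxed_accuracy, scores))
print(macro_average(mae, scores))
```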

Spearman's Rank Correlation (ρ)

Measures the rank-order correlation between model and human scores using Spearman's rank correlation coefficient. A high Spearman ρ indicates the model correctly identifies which videos are better or worse, even if exact scores don't match. Values range from -1 (perfect inverse correlation) to +1 (perfect correlation).

Pearson Linear Correlation Coefficient (PLCC)

Measures linear correlation between model and human scores using Pearson's correlation coefficient. Unlike Spearman, PLCC captures whether score magnitudes scale proportionally. Values range from -1 to +1.
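
Both correlations can be computed directly with SciPy; the sketch below assumes flat per-dimension score arrays, and the example values are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative scores for one dimension (1-5 Likert scale).
model_scores = np.array([4, 2, 5, 3, 4, 1])
human_scores = np.array([4, 3, 5, 4, 4, 2])

rho, rho_p = spearmanr(model_scores, human_scores)    # rank-order agreement
plcc, plcc_p = pearsonr(model_scores, human_scores)   # linear agreement

print(f"Spearman rho = {rho:.4f} (p = {rho_p:.3f})")
print(f"PLCC         = {plcc:.4f} (p = {plcc_p:.3f})")
```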

Strictness

The percentage of videos where the model flags issues (scores < 3.0 on the 1-5 scale). A higher strictness indicates the model is more conservative, flagging more content as problematic. This metric helps understand a model's default behavior independent of accuracy.

Calculation: (Number of model scores < 3.0) / (Total evaluations) × 100

Precision

Of all videos the model flags (scores < 3.0), what percentage did humans also flag? High precision means the model rarely flags acceptable content. This is critical for operational use where false positives waste reviewer time.

Calculation: (True Positives) / (True Positives + False Positives) × 100

Where:

  • True Positive (TP): Model flags (score < 3.0) AND human flags (score < 3.0)
  • False Positive (FP): Model flags (score < 3.0) BUT human doesn't (score ≥ 3.0)

Recall

Of all videos humans flagged (scores < 3.0), what percentage did the model catch? High recall means the model rarely misses problematic content. This is critical for production workflows where missing issues can cause brand damage or compliance violations.

Calculation: (True Positives) / (True Positives + False Negatives) × 100

Where:

  • True Positive (TP): Model flags (score < 3.0) AND human flags (score < 3.0)
  • False Negative (FN): Model doesn't flag (score ≥ 3.0) BUT human flags (score < 3.0)
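
Strictness, precision, and recall all hinge on the same flag threshold (a score below 3.0 counts as a flag), so one helper can compute all three. A minimal sketch over illustrative (model_score, human_score) pairs for a single dimension:

```python
FLAG_THRESHOLD = 3.0  # scores below this count as "flagged"

def flag_metrics(pairs):
    """Strictness, precision, and recall for one dimension, given
    (model_score, human_score) pairs on the 1-5 scale."""
    tp = sum(1 for m, h in pairs if m < FLAG_THRESHOLD and h < FLAG_THRESHOLD)
    fp = sum(1 for m, h in pairs if m < FLAG_THRESHOLD and h >= FLAG_THRESHOLD)
    fn = sum(1 for m, h in pairs if m >= FLAG_THRESHOLD and h < FLAG_THRESHOLD)
    flagged = sum(1 for m, _ in pairs if m < FLAG_THRESHOLD)

    strictness = 100 * flagged / len(pairs)
    precision = 100 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100 * tp / (tp + fn) if (tp + fn) else 0.0
    return strictness, precision, recall

# Example: the model flags 2 of 5 videos, humans flag 3.
pairs = [(2, 2), (2, 4), (4, 2), (5, 5), (3, 1)]
print(flag_metrics(pairs))  # (40.0, 50.0, 33.33...)
```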

Results

The Janus Judge leads across nearly all evaluation dimensions, outperforming GPT-5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash in accuracy, recall, and correlation with human evaluator scores, while achieving the lowest mean absolute error.

Overall Performance

Model | Accuracy | Recall | Precision | MAE | Spearman ρ
Janus Judge | 48.33% | 47.44% | 43.24% | 1.01 | 0.2872
Claude Sonnet 4.5 | 39.44% | 42.33% | 53.79% | 1.06 | 0.2867
Gemini 2.5 Flash | 38.89% | 30.13% | 54.86% | 1.22 | 0.2278
GPT-5.2 | 32.22% | 30.02% | 53.29% | 1.06 | 0.2439

Performance Across All 6 Dimensions (Accuracy)

Dimension | Janus | Claude | GPT | Gemini
Brief Adherence | 46.67% | 43.33% | 20.00% | 30.00%
Brand Compliance | 26.67% | 30.00% | 33.33% | 26.67%
Message Effectiveness | 40.00% | 33.33% | 23.33% | 26.67%
Visual Technical Quality | 43.33% | 20.00% | 16.67% | 30.00%
Temporal Consistency | 40.00% | 43.33% | 20.00% | 36.67%
Safety/Brand Risk | 93.33% | 66.67% | 80.00% | 83.33%

Findings

Recall by Model

[Bar chart, recall by model: Janus Judge 47.44%, Claude Sonnet 4.5 42.33%, Gemini 2.5 Flash 30.13%, GPT-5.2 30.02%]

Janus achieves the highest recall, meaning it catches more human-flagged issues than any baseline; Claude is second, while Gemini and GPT miss far more problems. If missing issues is costly, this coverage posture is the safer default.

Precision by Model

[Bar chart, precision by model: Janus Judge 43.24%, Claude Sonnet 4.5 53.79%, Gemini 2.5 Flash 54.86%, GPT-5.2 53.29%]

Gemini, Claude, and GPT are more selective than Janus, producing fewer false alarms but missing more issues. This is the precision-versus-coverage trade-off in action, with distinct commercial implications depending on whether false positives or missed defects carry higher cost.

Spearman Correlation by Model

[Bar chart, Spearman correlation by model: Janus Judge 0.2872, Claude Sonnet 4.5 0.2867, Gemini 2.5 Flash 0.2278, GPT-5.2 0.2439]

Janus tops rank alignment with humans, edging out Claude and ahead of GPT and Gemini. Better correlation means closer agreement with human evaluator assessments on which videos are better or worse, underpinning trust when decisions hinge on relative quality.

Exact Match Accuracy by Model

[Bar chart, exact-match accuracy by model: Janus Judge 48.33%, Claude Sonnet 4.5 39.44%, Gemini 2.5 Flash 38.89%, GPT-5.2 32.22%]

Janus achieves the highest exact-match accuracy, with Claude, Gemini, and GPT behind. Higher accuracy combined with the top recall explains why Janus catches more real-world issues despite its lower precision; in operational settings, it is the safer verifier when the cost of a missed issue outweighs the cost of an extra review.

Taken together, the charts show that Janus favors coverage and human alignment (top recall, correlation, relaxed accuracy, and exact match), which makes it the safer default in production settings where missed issues are costly and speed to confidence matters. The baselines favor selectivity (higher precision), trading coverage for fewer false alarms; that is a reasonable choice when reviewer bandwidth is limited, but risky when problems slip through.

The results reveal a clear pattern: our purpose-built approach to video evaluation aligns better with human evaluator assessments across multiple dimensions, catching more issues while maintaining stronger correlation with how humans actually assess quality. Even with these advantages, however, all models struggle on certain dimensions, exposing fundamental limitations in current video evaluation capabilities that we explore next.

Where Models Fail

To identify failure patterns, we analyzed the cases where humans flagged issues that a model missed. Using Gemini 2.5 Flash to build the failure taxonomy and categorize each miss, we produced the per-model breakdowns below.

Human-Flagged Misses: Claude Sonnet 4.5

Category | Count | %
Brand Element Fidelity & Placement | 6 | 24.00%
Object Temporal Inconsistency | 6 | 24.00%
Technical Artifacts | 3 | 12.00%
Visual Clarity Issues | 3 | 12.00%
Value Proposition & Hook Issues | 2 | 8.00%

Human-Flagged Misses: Gemini 2.5 Flash

Category | Count | %
Object Temporal Inconsistency | 6 | 17.14%
Call-to-Action (CTA) Issues | 5 | 14.29%
Brand Element Fidelity & Placement | 5 | 14.29%
Value Proposition & Hook Issues | 5 | 14.29%
Object Color & Material Mismatch | 4 | 11.43%

Human-Flagged Misses: GPT-5.2

Category | Count | %
Visual Clarity Issues | 5 | 16.67%
Object Temporal Inconsistency | 5 | 16.67%
Brand Element Fidelity & Placement | 4 | 13.33%
Object Identity & Shape Mismatch | 3 | 10.00%
Call-to-Action (CTA) Issues | 3 | 10.00%

Human-Flagged Misses: Janus Judge

Category | Count | %
Technical Artifacts | 5 | 23.81%
Object Temporal Inconsistency | 4 | 19.05%
Brand Element Fidelity & Placement | 3 | 14.29%
Object Identity & Shape Mismatch | 2 | 9.52%
Lack of Expected Motion/Dynamism | 2 | 9.52%

*We've capped each table to show only the top 5 rows for readability. The full data is available on request from our team.

Across baselines, two themes dominate human-flagged misses: temporal/object stability and brand element fidelity. Claude and GPT frequently overlook small logo/serial placement errors and slow identity drift, which translates into brand risk and continuity breaks that reviewers must manually catch. Gemini more often misses key messaging elements or weakens the value proposition while also missing brand fidelity details. Janus shows similar weaknesses, with technical artifacts accounting for nearly a quarter of its misses, indicating challenges in detecting subtle visual glitches and tracking object identity across frames.

These failure patterns have clear production implications: they increase compliance risk, require manual reviewer intervention, and can invalidate claims about product appearance or behavior. The clustered evidence indicates the next improvements needed: strengthening temporal consistency detection, enhancing brand element fidelity verification, and improving object identity tracking to help catch the subtle failure modes that currently slip through automated evaluation.

Limitations

This benchmark represents an initial exploration into video evaluation capabilities, and several limitations should be acknowledged:

Evaluator Coverage

We were unable to test as many evaluators as originally planned due to time constraints. We plan to add more evaluators in our expanded version.

Dataset Size and Diversity

With 30 videos across 6 dimensions, the dataset is relatively small. While sufficient to demonstrate clear performance differences between different verifiers, a larger and more diverse dataset would provide stronger statistical confidence and better edge case coverage.

Finding suitable existing datasets proved difficult due to several constraints:

  • Most video datasets contain only video with no audio, making them unsuitable for evaluating multimodal content.
  • Many datasets lacked expert annotations, making them unsuitable as ground truth for judging evaluator accuracy.
  • Available datasets evaluated technical metrics (frame-level segmentation, optical flow) that don't align with the success metrics that production teams care about.

Why We Built This Benchmark

These limitations directly motivated the creation of this benchmark. The lack of industry-relevant, expert-annotated datasets for AI-generated video evaluation represents a significant gap in the field. Due to these constraints, we built our own dataset from scratch, selecting a subset of 30 videos for this benchmark. We plan to release an expanded version once our full dataset is generated in the coming weeks.

Summary

This benchmark establishes a foundation for evaluating video quality in production contexts, where expert annotations are essential but scarce. The results demonstrate that Janus leads in accuracy, recall, and correlation with human evaluator assessments, outperforming general-purpose models. However, all models had difficulty with temporal consistency and tracking objects across frames, indicating clear next steps for improvement. As AI-generated video scales, these evaluation capabilities will become increasingly critical for maintaining quality, compliance, and brand integrity in automated workflows.

References

1. VideoScore2: Think before You Score in Generative Video Evaluation. Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen. arXiv:2509.22799, 2025. https://arxiv.org/abs/2509.22799

Interested in our benchmark?

If you're a lab interested in our verifier, this dataset, or adding your model to this benchmark, we'd love to hear from you. Reach out to discuss how we can support your evaluation capabilities.