Case Study

How a Leading Video Startup Automated Evals with Janus

December 14, 2025 · 5 min read

A venture-backed stealth startup working with clients such as Dolby was building an end-to-end ad-generation workflow. Their system pulled brand content from company websites and automatically produced short-form video ads designed for platforms like LinkedIn and X. As adoption grew, one constraint became unavoidable: quality assurance did not scale. Facing this bottleneck, the team reached out to Janus for our automated evaluation layer that could keep pace with their production demands.

Why Long-Horizon Workflows Break Traditional Evaluation

It was no longer enough for the pipeline to simply "complete successfully." In advertising, even a single faulty video can immediately erode client trust and lead to churn. Users' expectations of video are high, and a glitch or error can leave a negative perception of the brand. So the real challenge wasn't generation, but ensuring that every output met the qualitative standards required to ship. Video realism, brand consistency, and audio–visual cohesion all mattered, but manual review became a bottleneck: engineers were spending hours each day watching videos, annotating failures, and debating subjective quality calls, which severely slowed iteration.

Their system was a long-horizon, agentic workflow, which we define as one that requires coherent decision-making across many interdependent steps over time, with delayed or distributed reward signals such that no single step can be evaluated in isolation.

Each new action introduces uncertainty, and minor variances can compound over time. As a result, existing solutions fell short for structural reasons:

  • Manual review was the most accurate option, as human ground truth is the gold standard, but it did not scale. Engineers spent hours watching videos, scouring traces, and debugging runs without producing reusable or systematic feedback.
  • Off-the-shelf LLM judges performed poorly on multimodal artifacts. Most rely on static frame sampling, missing critical temporal failures like audio-sync issues or motion artifacts that only emerge over the full sequence.
  • Creating golden datasets consumed weeks of engineering time diverted from building, and any changes to prompts, models, or workflow logic meant starting over. Coming up with edge cases and tasks for the agent was also a non-trivial problem.
  • Component-level tools evaluated individual steps in isolation, while the most damaging failures were workflow-level. Scripts could be correct while visuals drifted, logos could appear but feel pasted on, or audio could be clean yet semantically disconnected from what was shown.

What the team needed was not another framework that addressed only part of the problem, but an end-to-end reliability system that could score the entire pipeline as a single unit and localize where long-horizon breakdowns occur.

Integration of our End-to-End Reliability Layer

We worked with the team to integrate our self-serve reliability platform for their generation workflow. Rather than evaluating isolated prompts or intermediate steps, our platform was used to test the entire workflow. Each completed run was scored against their relevant KPIs, with structured outputs explaining what failed, why it failed, and how severe the issue was.

While we allow users to define and tune judges independently via our platform, we worked closely with the team for this rollout to calibrate our proprietary verifiers to the startup’s production requirements. The KPI set focused on failure modes that blocked shipping: logo placement and integration, subtitle accuracy and synchronization, script-to-visual alignment, audio–visual cohesion, and overall visual quality and motion realism. These issues could not be tracked by generic LLM judge prompts, but required purpose-built verifiers tuned to the visual and semantic standards expected of modern video advertising.
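
To make this concrete, the sketch below shows what a KPI configuration along these lines could look like. The structure, names, and thresholds are illustrative assumptions for this write-up, not the platform's actual schema.

```python
# Hypothetical sketch of a KPI configuration for this ad-generation workflow.
# Names and thresholds are illustrative, not Janus's actual schema.
AD_QUALITY_KPIS = {
    "logo_integration":        {"min_score": 0.80, "hard_fail_below": 0.50},
    "subtitle_accuracy_sync":  {"min_score": 0.90, "hard_fail_below": 0.70},
    "script_visual_alignment": {"min_score": 0.80, "hard_fail_below": 0.60},
    "audio_visual_cohesion":   {"min_score": 0.80, "hard_fail_below": 0.60},
    "visual_quality_motion":   {"min_score": 0.75, "hard_fail_below": 0.50},
}
```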

Janus was integrated directly into the company's architecture via SDK. Our infrastructure could synthetically generate inputs, trigger the workflow, trace every tool call, and evaluate the final composited video with our judges. Results were shown immediately in the Janus UI, allowing the team to independently inspect each run's trace, KPI scores, violation explanations, and improvement suggestions, as well as export data for downstream analysis. Engineers could also annotate individual test runs and verifier outputs in the UI, creating a continuous feedback loop that progressively aligned automated judgments with human ground truth.
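
As a rough sketch of this integration pattern, the snippet below shows the shape of the loop: run the workflow, capture the trace, and score the final video. The client class, method names, and result fields are hypothetical stand-ins, since the actual SDK surface is specific to each deployment.

```python
# Illustrative evaluation loop: trigger the workflow, capture its trace,
# and score the final composited video against the configured KPIs.
# `JanusClient`, its methods, and the result fields are hypothetical
# stand-ins for the real SDK, shown only to convey the pattern.
from ad_pipeline import generate_ad      # the startup's workflow entry point (hypothetical name)
from janus_client import JanusClient     # hypothetical SDK wrapper

client = JanusClient(api_key="...")

KPIS = [
    "logo_integration",
    "subtitle_accuracy_sync",
    "script_visual_alignment",
    "audio_visual_cohesion",
    "visual_quality_motion",
]

def evaluate_run(brand_brief: dict):
    # Run the full ad-generation workflow while tracing every tool call.
    with client.trace(workflow="ad_generation") as trace:
        video_path = generate_ad(brand_brief)

    # Score the final video end-to-end against the KPI set.
    result = client.evaluate(artifact=video_path, trace=trace, kpis=KPIS)

    # Structured output: per-KPI scores plus violations with explanations.
    for violation in result.violations:
        print(violation.kpi, violation.severity, violation.explanation)
    return result
```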

What the Results Revealed at Scale

To validate integration, we executed 100 full end-to-end workflow runs across five distinct brands. Each run was treated as a complete production test, from initial input through final ad output, and evaluated using the same KPI rules and thresholds. We were provided details on the brands to be tested as well as their corresponding assets, so no synthetic inputs had to be generated. This allowed us to capture the most realistic distribution of tasks that the system would encounter in production, which let us calibrate our verifiers to the most common failure modes.

The results were immediately actionable. Instead of a single "quality score," Janus produced KPI-level signals that mapped directly to engineering work. Script generation proved consistently strong, indicating that the generation stack could produce coherent and brand-relevant narration. The primary issues appeared in visual quality and audio–visual cohesion: common instances of unrealistic animations or sudden shifts between frames, and audio that didn't match what was shown on screen, which made ads look AI-generated.

Branding failures were also nuanced. There were near-zero cases of logo absence, but integration quality was inconsistent: logos sometimes appeared as high-contrast overlays whose backgrounds didn't blend with the video. Subtitle failures primarily involved grammatical errors in overlaid text or cases where on-screen text deviated from the spoken narration. This showed that while foundation models have mastered creative range, they still struggle with the finer points of realism and coherence.

The most important breakthrough was automated filtering. The team needed a way to prevent faulty videos from reaching customers without human review. We enabled this by providing a custom API to directly call our verifiers. This way, the team could define hard requirements, such as no logo clipping or no critical subtitle errors, and automatically block faulty outputs in production. Subjective human review was replaced with a consistent layer that could be enforced on every production run.
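
A minimal sketch of such a gate, assuming a result object shaped like the one above (the rule names and severity labels are hypothetical):

```python
# Illustrative production gate: block any video that violates a hard requirement.
# Rule names, severity labels, and the result shape are hypothetical.
HARD_REQUIREMENTS = {"logo_integration", "subtitle_accuracy_sync"}

def should_ship(result) -> bool:
    """Ship only if no hard requirement has a critical violation."""
    critical = {v.kpi for v in result.violations if v.severity == "critical"}
    return not (critical & HARD_REQUIREMENTS)

# In the delivery path:
#   if should_ship(result): publish(video_path)
#   else: flag_for_regeneration(brand_brief, result)
```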

Impact and Looking Forward

Over the course of three weeks, Janus was integrated, validated with 100 tests, and extended with evaluation pipelines configured to the team's production needs. What previously required multiple hours of daily manual testing could now be done in 10–15 minutes. This wasn't just faster; it fundamentally changed how the team operated. For each test run, they could see exactly which steps failed to meet their KPIs and why, along with targeted improvement suggestions. This data also gave them concrete evidence to show prospective clients that their system could reliably produce high-quality videos.

By the end of the integration, the startup had a reusable evaluation system, or what we refer to as 'one-click evals'. Production generations could now be safely run, scored automatically, blocked when they failed quality checks, and used to feed violation patterns back into prompts, model selection, and post-processing. Quality improvement became a structured process instead of subjective review, leading to a more reliable product.

This case reveals a fundamental shift in how AI systems must be evaluated. As workflows become more complex and outputs more multimodal, traditional approaches break down. Enterprise buyers have higher expectations, yet adoption stalls because current systems fail to meet them, leading to decreased trust in vendor capabilities and doubts over performance. The industry is moving from simple pass/fail checks to systems that can validate entire pipelines end-to-end. This is the evaluation layer that makes production-grade AI possible.

What’s next

In the coming weeks, our team will publish a multimodal benchmark evaluating leading foundation video models against real enterprise KPIs, as well as an evaluator accuracy study comparing Janus verifiers to other off-the-shelf judges. Together, these releases will help teams choose both the right generator and a scoring layer they can trust.

Interested in scaling your evaluations?

Reach out to our team at team@withjanus.com to get set up with our platform or access our benchmarks.