Wan Technical Report: Alibaba’s Open-Source Powerhouse for AI Video Generation

By CTOL Editors - Ken


In early 2024, OpenAI’s Sora lit up the AI world by generating videos with a level of realism once reserved for Hollywood. While awe-inspiring, models like Sora are locked behind closed doors—leaving the open-source community scrambling to catch up. That changes now.

Wan, developed by Alibaba Group, is a breakthrough open-source suite of video foundation models. Designed to bridge the gap between commercial-grade video generators and the open-source world, Wan isn’t just a technical achievement—it’s a statement of intent. With competitive performance, a broad application range, and surprising efficiency (even on consumer GPUs), Wan redefines what’s possible with open generative models.


Breaking the Bottleneck: Why Wan Had to Be Built

Video generation has been rapidly evolving, but major challenges still limit widespread use and innovation. Most open-source models are still stuck in narrow tasks like basic text-to-video and struggle with high-fidelity motion, multilingual support, or efficient deployment. Meanwhile, commercial models are leaping ahead, backed by immense private compute and data.

Wan was created to solve this imbalance. It's designed to be open, scalable, and—perhaps most importantly—capable of generating videos that feel dynamic, grounded, and nuanced. Think swirling snow, readable signage in both Chinese and English, and camera movements that make sense in physical space. All of this is backed by a model suite that’s reproducible, modular, and engineered for scale.


Engineering the Core: Inside Wan’s Next-Gen Architecture

At the heart of Wan lies a highly optimized architecture composed of three major components: a spatio-temporal VAE, a diffusion transformer, and a multilingual text encoder. Each part has been designed not only for performance but for usability across real-world tasks.

The Wan-VAE is responsible for compressing videos in both time and space. It’s a 3D causal variational autoencoder that reduces the video data volume by over 250× while maintaining fine-grained motion detail. Using causal convolutions and a clever feature cache mechanism, it enables efficient long-form video processing—a pain point for most video models.
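
To make the chunk-wise encoding idea concrete, here is a minimal PyTorch sketch of a causal 3D convolution with a feature cache. The layer names, shapes, and chunk size are illustrative assumptions, not Wan's actual implementation.

```python
# Minimal sketch of causal temporal convolution + a feature cache, so a long clip
# can be encoded chunk by chunk. Names and sizes are illustrative only.
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that only looks at past frames along the time axis."""
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # causal padding: past frames only
        self.conv = nn.Conv3d(in_ch, out_ch,
                              kernel_size=(kernel_t, kernel_s, kernel_s),
                              padding=(0, kernel_s // 2, kernel_s // 2))

    def forward(self, x, cache=None):
        # x: (batch, channels, time, height, width)
        if cache is None:
            cache = x.new_zeros(x.shape[0], x.shape[1], self.pad_t, *x.shape[3:])
        x = torch.cat([cache, x], dim=2)      # prepend cached past frames
        new_cache = x[:, :, -self.pad_t:]     # keep the tail for the next chunk
        return self.conv(x), new_cache

# Encode a long clip chunk by chunk; the cache carries temporal context across
# chunk boundaries, so the result matches processing the whole clip at once.
enc = CausalConv3d(3, 16)
video = torch.randn(1, 3, 32, 64, 64)
cache, outs = None, []
for chunk in video.split(8, dim=2):
    out, cache = enc(chunk, cache)
    outs.append(out)
latent = torch.cat(outs, dim=2)
print(latent.shape)  # (1, 16, 32, 64, 64)
```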

Complementing this is the Diffusion Transformer, a pure transformer model that operates on these compressed latent features. It uses full spatio-temporal attention to reason jointly about the spatial layout of each frame and how the scene evolves over time. Notably, it is trained with Flow Matching, a newer objective that replaces step-by-step noise prediction with regressing the velocity field of an ODE that carries noise to data, which tends to make training more stable and mathematically grounded.
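
As a rough illustration of the flow-matching objective (not Wan's training code; the tiny MLP and latent dimensions below are placeholders), a single training step looks like this:

```python
# Toy flow-matching training step (rectified-flow style), for intuition only.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model, x1):
    """x1: clean latents, shape (batch, dim)."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1)            # uniform timestep in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # straight-line interpolant
    target_v = x1 - x0                        # ground-truth velocity along the ODE path
    pred_v = model(x_t, t)
    return nn.functional.mse_loss(pred_v, target_v)

model = TinyVelocityNet(dim=64)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
latents = torch.randn(8, 64)                  # pretend VAE latents
loss = flow_matching_loss(model, latents)
loss.backward()
opt.step()
print(float(loss))
```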

To interpret user prompts and guide the generation, Wan uses umT5, a multilingual text encoder. It’s capable of handling complex, descriptive instructions in both English and Chinese, ensuring the model doesn't just generate video—it follows direction.
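
A quick sketch of what prompt conditioning with a public umT5 checkpoint might look like via Hugging Face Transformers; the specific checkpoint (google/umt5-small here, chosen for size), the sequence length, and how the embeddings are consumed downstream are assumptions for illustration.

```python
# Encode a bilingual prompt into per-token embeddings with a public umT5 encoder.
from transformers import AutoTokenizer, UMT5EncoderModel
import torch

tokenizer = AutoTokenizer.from_pretrained("google/umt5-small")
encoder = UMT5EncoderModel.from_pretrained("google/umt5-small")

prompt = "A street sign reading '欢迎 / Welcome' sways gently as snow swirls past."
tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=128)

with torch.no_grad():
    text_states = encoder(**tokens).last_hidden_state   # (1, 128, hidden_dim)

# These per-token embeddings would feed the diffusion transformer's
# cross-attention layers as the conditioning signal.
print(text_states.shape)
```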

Wan Architecture


The Data Backbone: How Wan Was Trained on Trillions of Tokens

A model is only as good as the data it's trained on, and Wan’s data pipeline is a masterclass in modern dataset engineering. Billions of images and videos were curated, cleaned, and enriched to train the model.

The process began with large-scale filtering—removing watermarked content, NSFW material, overly blurry footage, and low-resolution clips. But Wan went further. It introduced a motion quality classifier to prioritize videos with smooth, expressive movements and a balanced motion-to-static ratio. Meanwhile, a visual text pipeline processed both synthetic and real-world text-in-image samples, boosting Wan’s ability to render on-screen text legibly and accurately.
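
Conceptually, the curation stage behaves like a chain of quality gates. The sketch below is a hypothetical reconstruction: the scoring fields and thresholds are invented placeholders standing in for the classifiers described above.

```python
# Illustrative staged filtering pipeline; scores and thresholds are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    path: str
    width: int
    height: int
    blur_score: float       # e.g. variance of Laplacian
    nsfw_score: float       # from a hypothetical safety classifier
    watermark_score: float  # from a hypothetical watermark detector
    motion_score: float     # from a hypothetical motion-quality classifier

def passes_quality_gates(clip: Clip) -> bool:
    return (
        min(clip.width, clip.height) >= 480      # drop low-resolution clips
        and clip.blur_score >= 100.0             # drop overly blurry footage
        and clip.nsfw_score < 0.2                # drop NSFW material
        and clip.watermark_score < 0.3           # drop watermarked content
        and 0.3 <= clip.motion_score <= 0.9      # prefer smooth, balanced motion
    )

def filter_corpus(clips: List[Clip]) -> List[Clip]:
    return [c for c in clips if passes_quality_gates(c)]

sample = [Clip("a.mp4", 1280, 720, 240.0, 0.01, 0.05, 0.6),
          Clip("b.mp4", 320, 240, 30.0, 0.01, 0.05, 0.6)]
print([c.path for c in filter_corpus(sample)])   # only "a.mp4" survives
```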

To give the model a deeper understanding of what’s happening in each frame, Alibaba built its own dense captioning system, trained to rival even Google’s Gemini 1.5 Pro. This system labels elements like camera angle, object count, motion types, scene categories, and more—creating a richly annotated training set for downstream tasks like editing and personalization.
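
The structured annotations might be represented along these lines; the schema below is hypothetical and only mirrors the kinds of attributes mentioned above (camera angle, object counts, motion types, scene categories).

```python
# Hypothetical schema for dense, structured video captions.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DenseCaption:
    summary: str
    camera_angle: str                 # e.g. "low-angle", "overhead"
    camera_motion: str                # e.g. "slow pan left"
    scene_category: str               # e.g. "urban street, night"
    object_counts: Dict[str, int]     # e.g. {"person": 2, "bicycle": 1}
    motion_types: List[str] = field(default_factory=list)

caption = DenseCaption(
    summary="Two cyclists ride past a neon-lit noodle shop in light rain.",
    camera_angle="eye-level",
    camera_motion="slow pan right",
    scene_category="urban street, night",
    object_counts={"person": 2, "bicycle": 2, "sign": 3},
    motion_types=["cycling", "rain falling", "sign flickering"],
)
print(caption.summary)
```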


Big Models, Small Footprints: Meet Wan 1.3B and 14B

Wan comes in two versions: a 1.3B-parameter model and a more powerful 14B-parameter flagship. Both share the same robust architecture; the 1.3B model generates video at up to 480p, while the 14B flagship also supports higher-resolution 720p output.

The real surprise? The 1.3B model is designed to run on consumer-grade GPUs with just 8.19 GB of VRAM. That’s a game-changer. It means artists, developers, and small studios can access high-quality video generation without needing a rack of A100s.
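
A back-of-envelope check (the breakdown is an assumption, not a measurement) helps explain why the reported 8.19 GB peak is plausible:

```python
# Rough parameter-memory estimate; the remaining-budget breakdown is an assumption.
params = 1.3e9                                # 1.3B parameters
weights_fp16_gb = params * 2 / 1e9            # 2 bytes per parameter in FP16
print(f"FP16 DiT weights: ~{weights_fp16_gb:.1f} GB")
# That leaves roughly 5-6 GB of the reported 8.19 GB peak for the VAE, text-encoder
# activations (which can be offloaded to CPU), latents, and attention workspace.
```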

The 14B model, by contrast, is designed to push the boundaries. Trained on trillions of tokens, it excels in long-form video consistency, realistic motion, and following intricate textual prompts. Whether generating natural scenes or stylized animations, the 14B model proves that open-source can be competitive at the frontier.


Going Head-to-Head: How Wan Performs Against the Competition

In both benchmark evaluations and head-to-head human preference tests, Wan consistently comes out on top. It not only beats open-source models like Mochi and HunyuanVideo, but also competes favorably with commercial heavyweights like Runway Gen-3.

This isn’t just about quality—it’s about control. Wan allows for fine-grained camera movement, visual text rendering, prompt-following, and style diversity—all areas where previous models struggled or required hand-tuning.

Moreover, in ablation studies, the Wan team showed that its flow-matching loss function and dense-captioning strategy were pivotal in achieving such strong alignment and coherence. This makes Wan not just good, but principled—a model suite where every design choice is validated and optimized.

Model performance scores on VBench:

Model Name                                                     | Quality Score | Semantic Score | Total Score
MiniMax-Video-01 (MiniMax, 2024.09)                            | 84.85%        | 77.65%         | 83.41%
Hunyuan (Open-Source Version) (Kong et al., 2024)              | 85.09%        | 75.82%         | 83.24%
Gen-3 (2024-07) (Runway, 2024.06)                              | 84.11%        | 75.17%         | 82.32%
CogVideoX1.5-5B (5s SAT prompt-optimized) (Yang et al., 2025b) | 82.78%        | 79.76%         | 82.17%
Kling (2024-07 high-performance mode) (Kuaishou, 2024.06)      | 83.39%        | 75.68%         | 81.85%
Sora (OpenAI, 2024)                                            | 85.51%        | 79.35%         | 84.28%
Wan 1.3B                                                       | 84.92%        | 80.10%         | 83.96%
Wan 14B (2025-02-24)                                           | 86.67%        | 84.44%         | 86.22%

Speed, Scale, and Efficiency: A Model You Can Actually Use

Training and inference efficiency are where Wan shines even more. During training, Alibaba uses a sophisticated 2D context parallelism scheme (Ulysses + Ring Attention), reducing communication overhead across GPUs. During inference, they introduced diffusion caching, exploiting the similarities between sampling steps to speed things up.
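
To picture the 2D layout, the toy script below only prints how tokens and attention heads could be partitioned across a Ulysses-by-ring grid; the group sizes and sequence length are assumed figures, and no actual communication is performed.

```python
# Conceptual sketch of a 2D context-parallel layout (Ulysses groups x ring groups).
ulysses_size = 4          # GPUs per Ulysses group: an all-to-all gives each GPU all
                          # tokens of its ring shard but only a slice of the heads
ring_size = 2             # ring groups: the sequence is sharded and K/V blocks rotate
num_heads = 32
seq_len = 32_768          # assumed number of latent tokens for one clip

heads_per_gpu = num_heads // ulysses_size
tokens_per_shard = seq_len // ring_size

for ring_rank in range(ring_size):
    for ulysses_rank in range(ulysses_size):
        gpu = ring_rank * ulysses_size + ulysses_rank
        t0 = ring_rank * tokens_per_shard
        h0 = ulysses_rank * heads_per_gpu
        print(f"GPU {gpu}: query tokens [{t0}, {t0 + tokens_per_shard}) "
              f"x heads [{h0}, {h0 + heads_per_gpu})")
```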

Combined with FP8 quantization and activation offloading, Wan achieves real-time or near-real-time generation speeds. The result is a reported 1.62× speedup over an unaccelerated baseline, with no perceptible loss in video quality.
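
The diffusion-caching idea mentioned above can be sketched as follows: if the input to an expensive block has barely changed since the last denoising step, return the cached output instead of recomputing. The threshold and the stand-in block below are illustrative, not Wan's actual caching criterion.

```python
# Minimal sketch of step-to-step caching during diffusion sampling.
import torch

def expensive_block(x):
    """Stand-in for a heavy transformer block."""
    return torch.tanh(x)

class StepCache:
    """Reuse the last output while the input has drifted less than rel_tol."""
    def __init__(self, rel_tol=1e-2):
        self.rel_tol = rel_tol
        self.last_in = None
        self.last_out = None

    def __call__(self, x):
        if self.last_in is not None:
            drift = (x - self.last_in).norm() / self.last_in.norm()
            if drift < self.rel_tol:
                return self.last_out          # cache hit: skip recomputation
        self.last_in, self.last_out = x, expensive_block(x)
        return self.last_out

cache = StepCache(rel_tol=1e-2)
x = torch.randn(4, 16)
for step in range(50):                        # pretend denoising loop
    x = x - 0.01 * cache(x)                   # small per-step updates -> frequent hits
```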

Wan Latency Improvements


More Than Just Text-to-Video: Real Applications, Right Now

Wan isn’t limited to one task—it’s a platform. It supports a full range of multimodal video tasks, including:

  • Image-to-video: Turn a single image into a dynamic scene.
  • Instructional video editing: Modify clips using natural language commands.
  • Personalized generation: Zero-shot customization for avatars or branded content.
  • Camera control: Adjust zoom, panning, or viewpoint using text.
  • Real-time video generation: Thanks to smart caching and lightweight models.
  • Audio generation: Synchronized sound to accompany generated visuals.

Whether you're a filmmaker, educator, advertiser, or game developer, Wan can adapt to your needs.


The Big Picture: What Wan Means for Research and Industry

From an academic standpoint, Wan is a treasure trove. With open code, open weights, and transparent training methodologies, it sets a new standard for reproducibility in the video generation community. Researchers can build on its modules, run evaluations, and fine-tune the system for novel domains.

On the business side, Wan opens the door for cost-effective, high-quality content generation. Marketing videos, educational explainers, social media clips—these can now be created at scale without paying per-frame fees to black-box APIs. It gives creators, startups, and enterprises a serious competitive edge.


What’s Next: The Road Ahead for Wan

Wan is already one of the most capable video generation models available, but its roadmap is just getting started. The team plans to push toward 1080p and 4K generation, integrate 3D awareness, and expand multi-language support for more global accessibility.

They’re also working on interactive storytelling, where models generate video based on user feedback in real time, and plug-and-play adapters for verticals like healthcare, education, and gaming.


Where to Try It

Everything is available right now: the code, model weights, and technical report are openly released on GitHub, Hugging Face, and ModelScope.

Whether you’re a researcher, artist, startup, or just curious—Wan is open and ready.


TL;DR

Wan is the most powerful open-source video generation suite to date. With cutting-edge architecture, rigorous training, and wide accessibility, it doesn’t just compete with closed models—it sets a new benchmark for what open AI can achieve.
