Wan 2.5 Deep Dive: Evolution, Capabilities, and a Head-to-Head with Google Veo 3

The Evolution of the Wan Series

Alibaba's open-source "Wan" lineup has always balanced research ambition with real-world deployment. From early compositing tools to today's multi-modal generators, each release has pushed Wan 2.5 closer to the creative workflows of filmmakers, marketers, and game studios.

Wan 2.1: The multilingual starting point

Launched in 2024, Wan 2.1 specialised in text-to-video teasers and experimented with multilingual prompts. Hardware constraints capped its resolution and clip length, yet the version laid the open-source, Chinese-language foundation that Wan 2.5 still benefits from.

Wan 2.2: MoE delivers a quality leap

The 2025 upgrade to Wan 2.2 introduced a Mixture-of-Experts (MoE) architecture. High-noise experts preserved global structure, while low-noise experts refined details, enabling Wan 2.5's later fidelity gains. The model shipped with T2V, TI2V, I2V, and S2V pipelines, outputting 720p footage at 24 fps under an Apache 2.0 licence. Wan 2.2 outperformed OpenAI Sora, Pika 2.2, and Runway 3 on several public benchmarks - clear proof that open video generation was production-ready.

Wan 2.5: A milestone for native audio-video fusion

1. Launch background

At Alibaba Cloud's 2025 Yunqi Conference, Wan 2.5 debuted as a unified text, audio, and video model. Wan 2.5's training objective couples all three modalities, extending clip length from 5s to 10s and upgrading to 1080p resolution while maintaining an open licence.

2. Core technical highlights

Unified multi-modal backbone: Wan 2.5 extends the MoE design so text, image, video, and audio signals train inside one network, aligning semantics, visuals, and sound in a single forward pass.
10-second 1080p output: Wan 2.5 consistently renders 10-second, 1080p sequences with stable motion, paving the way for future 4K variants.
Native audio synthesis: Unlike Wan 2.2, Wan 2.5 generates dialogue, ambience, and score in sync with the footage. Fal.ai showcases Wan 2.5 clips featuring coffee-shop chatter, wind, and running water.
Better prompt adherence: Wan 2.5 interprets long, multi-sentence prompts and camera directions more precisely, with smoother style transfer.
RLHF alignment: Wan 2.5 incorporates human preference optimisation, improving aesthetic alignment versus Wan 2.2.
Overall performance gains: Alibaba reports Wan 2.5 boosts generation speed by 25%, visual quality by 30%, semantic accuracy by 40%, and motion fidelity by 35%, all while keeping the Apache 2.0 licence.

Comparing Wan 2.5 with Google Veo 3

1. Veo 3 at a glance

Google introduced Veo 3 in May 2025. Cybernews highlights that Veo 3 outputs up to 8 seconds of 1080p video with rich camera moves, detailed faces, and built-in audio generation. Access, however, is limited to Vertex AI, the Gemini web app, or Gemini API - there is no open-source release.

2. Key differences

Veo 3 follows a closed, subscription-based model. Pricing tiers such as Plus (around $25 per month) and Pro (around $99 per month) gate throughput and features. In contrast, Wan 2.5 remains fully open-source, deployable on-prem or in the cloud, and already localised for Chinese and global creators. Analyst reports note that Wan 2.5 operates without geo-blocks, whereas Veo 3 may require a VPN in certain regions.

3. Comparison table

Feature or Metric	Wan 2.5	Veo 3
Open-source availability	Yes (Apache 2.0 licence)	No (closed commercial stack)
Audio-video sync	Native in Wan 2.5	Supported, closed ecosystem
Max resolution	1080p	720p-1080p depending on tier
Clip duration	10 seconds	8 seconds
Chinese prompt support	First-class in Wan 2.5	Limited
Regional access	Global, no VPN required	VPN needed in some territories
Commercial usage	Free to self-host Wan 2.5	Subscription required
API integrations	Yes, multiple providers	Yes, via Google-managed endpoints

Five representative prompts for head-to-head testing

Use identical scenarios to compare Wan 2.5 and Veo 3 output. Each prompt stresses Wan 2.5's multi-modal strengths:

Coffee shop conversation: Morning sunlight through a window, a barista saying "I'd like a latte," full ambient audio. Wan 2.5 should deliver synced speech and background noise.
High-speed chase: F1 car weaving through neon-lit streets with sweeping camera moves and pulsing electronic soundtrack. Wan 2.5's motion fidelity and audio mix shine here.
Beach singer: Sunset guitarist on the shore, Wan 2.5 blending vocals, waves, and gulls into a single render.
Sci-fi planet: Deep-blue planet rotating with red satellites and an atmospheric synth score. Wan 2.5's style flexibility keeps the scene cohesive.
Extreme rock climbing: Wide-angle summit push with howling wind and dramatic percussion - Wan 2.5 keeps climber poses stable while driving the soundtrack.

Evaluate Wan 2.5 versus Veo 3 using the same prompts, then compare audio alignment, visual stability, and stylistic adherence.

How to experience Wan 2.5 today

Tongyi Wanxiang official portal: Alibaba Cloud's UI for Wan 2.5, offering free quotas and guided workflows at wan.video
Fal.ai Playground: Experiment with Wan 2.5 preview endpoints and APIs inside Fal.ai's sandbox at fal.ai
Alibaba Cloud Bailian Platform: Enterprise-grade Wan 2.5 access with managed infrastructure and SLAs at bailian.console.aliyun.com
Aggregator platforms: Services such as wananimate.org let subscribers trigger Wan 2.5, Veo 3, and other engines under one account, perfect for A/B testing.

Strategy and future roadmap

At Yunqi 2025, Alibaba committed roughly $38 billion over three years to AI and cloud infrastructure. Eddie Wu, CEO of Alibaba Digital Media & Entertainment, described a future in which foundation models "embed across devices with long-term memory," underscoring that open-source remains central to the group's plan. Since open-sourcing Qwen in 2023, Alibaba has released 300+ AI models, amassing over 600 million downloads and inspiring 1,700 derivative projects. Wan 2.5 extends that open ecosystem and signals what's next: 4K pipelines, longer clips, tighter audio alignment, and deeper RLHF-driven scene understanding.

Summary

Wan 2.5 marks a decisive shift for open video generation. By unifying text, imagery, and sound inside one Wan 2.5 architecture, creators gain richer storytelling tools, faster iteration, and higher quality than previous Wan releases. Against Google Veo 3, Wan 2.5 stands out for openness, Chinese localisation, and enterprise-friendly licensing. As the Wan 2.5 roadmap moves toward 4K and extended runtimes, expect the model to anchor creative pipelines worldwide - and to showcase Alibaba's sustained investment in AI video innovation.