
Wan 2.5 Deep Dive: Evolution, Capabilities, and a Head-to-Head with Google Veo 3
Wan 2.5 versus Veo 3 - which video generation model delivers more?
The Evolution of the Wan Series
Alibaba's open-source "Wan" lineup has always balanced research ambition with real-world deployment. From early compositing tools to today's multi-modal generators, each release has pushed Wan 2.5 closer to the creative workflows of filmmakers, marketers, and game studios.
Wan 2.1: The multilingual starting point
Launched in 2024, Wan 2.1 specialised in text-to-video teasers and experimented with multilingual prompts. Hardware constraints capped its resolution and clip length, yet the version laid the open-source, Chinese-language foundation that Wan 2.5 still benefits from.
Wan 2.2: MoE delivers a quality leap
The 2025 upgrade to Wan 2.2 introduced a Mixture-of-Experts (MoE) architecture. High-noise experts preserved global structure, while low-noise experts refined details, enabling Wan 2.5's later fidelity gains. The model shipped with T2V, TI2V, I2V, and S2V pipelines, outputting 720p footage at 24 fps under an Apache 2.0 licence. Wan 2.2 outperformed OpenAI Sora, Pika 2.2, and Runway 3 on several public benchmarks - clear proof that open video generation was production-ready.
Wan 2.5: A milestone for native audio-video fusion
1. Launch background
At Alibaba Cloud's 2025 Yunqi Conference, Wan 2.5 debuted as a unified text, audio, and video model. Wan 2.5's training objective couples all three modalities, extending clip length from 5s to 10s and upgrading to 1080p resolution while maintaining an open licence.
2. Core technical highlights
- Unified multi-modal backbone: Wan 2.5 extends the MoE design so text, image, video, and audio signals train inside one network, aligning semantics, visuals, and sound in a single forward pass.
- 10-second 1080p output: Wan 2.5 consistently renders 10-second, 1080p sequences with stable motion, paving the way for future 4K variants.
- Native audio synthesis: Unlike Wan 2.2, Wan 2.5 generates dialogue, ambience, and score in sync with the footage. Fal.ai showcases Wan 2.5 clips featuring coffee-shop chatter, wind, and running water.
- Better prompt adherence: Wan 2.5 interprets long, multi-sentence prompts and camera directions more precisely, with smoother style transfer.
- RLHF alignment: Wan 2.5 incorporates human preference optimisation, improving aesthetic alignment versus Wan 2.2.
- Overall performance gains: Alibaba reports Wan 2.5 boosts generation speed by 25%, visual quality by 30%, semantic accuracy by 40%, and motion fidelity by 35%, all while keeping the Apache 2.0 licence.
Comparing Wan 2.5 with Google Veo 3
1. Veo 3 at a glance
Google introduced Veo 3 in May 2025. Cybernews highlights that Veo 3 outputs up to 8 seconds of 1080p video with rich camera moves, detailed faces, and built-in audio generation. Access, however, is limited to Vertex AI, the Gemini web app, or Gemini API - there is no open-source release.
2. Key differences
Veo 3 follows a closed, subscription-based model. Pricing tiers such as Plus (around $25 per month) and Pro (around $99 per month) gate throughput and features. In contrast, Wan 2.5 remains fully open-source, deployable on-prem or in the cloud, and already localised for Chinese and global creators. Analyst reports note that Wan 2.5 operates without geo-blocks, whereas Veo 3 may require a VPN in certain regions.
3. Comparison table
Feature or Metric | Wan 2.5 | Veo 3 |
---|---|---|
Open-source availability | Yes (Apache 2.0 licence) | No (closed commercial stack) |
Audio-video sync | Native in Wan 2.5 | Supported, closed ecosystem |
Max resolution | 1080p | 720p-1080p depending on tier |
Clip duration | 10 seconds | 8 seconds |
Chinese prompt support | First-class in Wan 2.5 | Limited |
Regional access | Global, no VPN required | VPN needed in some territories |
Commercial usage | Free to self-host Wan 2.5 | Subscription required |
API integrations | Yes, multiple providers | Yes, via Google-managed endpoints |
Five representative prompts for head-to-head testing
Use identical scenarios to compare Wan 2.5 and Veo 3 output. Each prompt stresses Wan 2.5's multi-modal strengths:
- Coffee shop conversation: Morning sunlight through a window, a barista saying "I'd like a latte," full ambient audio. Wan 2.5 should deliver synced speech and background noise.
- High-speed chase: F1 car weaving through neon-lit streets with sweeping camera moves and pulsing electronic soundtrack. Wan 2.5's motion fidelity and audio mix shine here.
- Beach singer: Sunset guitarist on the shore, Wan 2.5 blending vocals, waves, and gulls into a single render.
- Sci-fi planet: Deep-blue planet rotating with red satellites and an atmospheric synth score. Wan 2.5's style flexibility keeps the scene cohesive.
- Extreme rock climbing: Wide-angle summit push with howling wind and dramatic percussion - Wan 2.5 keeps climber poses stable while driving the soundtrack.
Evaluate Wan 2.5 versus Veo 3 using the same prompts, then compare audio alignment, visual stability, and stylistic adherence.
How to experience Wan 2.5 today
- Tongyi Wanxiang official portal: Alibaba Cloud's UI for Wan 2.5, offering free quotas and guided workflows at wan.video
- Fal.ai Playground: Experiment with Wan 2.5 preview endpoints and APIs inside Fal.ai's sandbox at fal.ai
- Alibaba Cloud Bailian Platform: Enterprise-grade Wan 2.5 access with managed infrastructure and SLAs at bailian.console.aliyun.com
- Aggregator platforms: Services such as wananimate.org let subscribers trigger Wan 2.5, Veo 3, and other engines under one account, perfect for A/B testing.
Strategy and future roadmap
At Yunqi 2025, Alibaba committed roughly $38 billion over three years to AI and cloud infrastructure. Eddie Wu, CEO of Alibaba Digital Media & Entertainment, described a future in which foundation models "embed across devices with long-term memory," underscoring that open-source remains central to the group's plan. Since open-sourcing Qwen in 2023, Alibaba has released 300+ AI models, amassing over 600 million downloads and inspiring 1,700 derivative projects. Wan 2.5 extends that open ecosystem and signals what's next: 4K pipelines, longer clips, tighter audio alignment, and deeper RLHF-driven scene understanding.
Summary
Wan 2.5 marks a decisive shift for open video generation. By unifying text, imagery, and sound inside one Wan 2.5 architecture, creators gain richer storytelling tools, faster iteration, and higher quality than previous Wan releases. Against Google Veo 3, Wan 2.5 stands out for openness, Chinese localisation, and enterprise-friendly licensing. As the Wan 2.5 roadmap moves toward 4K and extended runtimes, expect the model to anchor creative pipelines worldwide - and to showcase Alibaba's sustained investment in AI video innovation.
Categories
Newsletter
Join the community
Subscribe to our newsletter for the latest news and updates