Alibaba Group has announced its latest open-source AI model, Wan2.2-S2V (Speech-to-Video). According to the company, the model turns portrait photos into expressive “film-quality” avatars that can speak, sing, and perform.
Wan2.2-S2V is part of the Wan2.2 video generation series. It takes a single image and an audio clip and produces a fully animated video, with framing options that include portrait, bust, and full-body perspectives. The model can also generate a wide range of character actions and environmental conditions.
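For readers curious what that single-image-plus-audio workflow looks like in practice, here is a minimal sketch of invoking the model from Python. The entry-point script, task name, flags, and file paths are assumptions modelled on the conventions of the open-source Wan repositories; they may differ from the actual release and should be checked against the project's documentation:

```python
import subprocess

# Hypothetical invocation of the Wan2.2 repo's generation entry point.
# The script name, task label, flags, and paths below are assumptions.
subprocess.run(
    [
        "python", "generate.py",
        "--task", "s2v-14B",                 # speech-to-video variant (assumed name)
        "--ckpt_dir", "./Wan2.2-S2V-14B",    # local checkpoint directory (assumed)
        "--image", "examples/portrait.png",  # the single reference photo
        "--audio", "examples/speech.wav",    # the driving audio clip
        "--prompt", "a person speaking to camera",
    ],
    check=True,
)
```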

It supports various types of audio, including natural dialogue and musical performances, and its output is not limited to human avatars: the company states that Wan2.2-S2V handles a diverse range of figures, including cartoons, animals, and other stylised characters.
Alibaba also claims that the model’s frame processing technique considerably reduces computational overhead, enabling stable long-video generation. In addition, the model is tailored to the needs of different content creators, with support for formats ranging from vertical short-form clips to traditional horizontal film, and a choice of 480p or 720p output resolutions.

The model is available for download via Hugging Face and GitHub, as well as on ModelScope, Alibaba Cloud’s open-source community.
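As a hedged sketch, the Hugging Face weights can typically be fetched with the `huggingface_hub` client. The repository ID below is an assumption and should be verified against the actual model card before use:

```python
from huggingface_hub import snapshot_download

# Download all model files to a local directory.
# The repo_id is an assumption; confirm it on Hugging Face first.
local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",
    local_dir="./Wan2.2-S2V-14B",
)
print(f"Model downloaded to {local_dir}")
```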
(Source: Alibaba press release)