Alibaba launched a collection of artificial intelligence (AI) video generation models on Wednesday. Dubbed Wan 2.1, these are open-source models that can be used for both academic and commercial purposes. The Chinese e-commerce giant released the models in several parameter-based variants. Developed by the company's Wan team, these models were first introduced in January, and the company claims that Wan 2.1 can generate highly realistic videos. The models are currently hosted on Hugging Face, the AI and machine learning (ML) hub.
Alibaba Introduces Wan 2.1 Video Generation Models
The new Alibaba video AI models are hosted on the Wan team's Hugging Face page, where the model pages also detail the full Wan 2.1 suite. There are four models in total: T2V-1.3B, T2V-14B, I2V-14B-720P, and I2V-14B-480P. T2V is short for text-to-video, while I2V stands for image-to-video.
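For readers who want to fetch the weights themselves, a minimal sketch using the huggingface_hub library is shown below. The repo id is an assumption based on the naming on the Wan team's Hugging Face page and may need to be adjusted for the variant you want.

```python
# Sketch: downloading one of the Wan 2.1 checkpoints from Hugging Face.
# "Wan-AI/Wan2.1-T2V-1.3B" is an assumed repo id; the 14B and I2V variants
# are published under similar names on the same page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    local_dir="./Wan2.1-T2V-1.3B",
)
print(f"Model files downloaded to {local_dir}")
```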
The researchers claim that the smallest variant, Wan 2.1 T2V-1.3B, can run on a consumer-grade GPU with as little as 8.19GB of VRAM. According to the post, the model can generate a five-second 480p video on an Nvidia RTX 4090 in about four minutes.
While the Wan 2.1 suite is aimed at video generation, the models can also perform other functions such as image generation, video-to-audio generation, and video editing. However, the currently open-sourced models are not capable of these advanced tasks. For video generation, they accept text prompts in Chinese and English as well as image inputs.
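As a rough illustration of how the smaller text-to-video model could be run on a consumer GPU, the sketch below uses the diffusers library's Wan integration with CPU offloading enabled. The pipeline class, repo id, and frame settings are assumptions based on recent diffusers releases, not details confirmed in the announcement.

```python
# Sketch: text-to-video generation with the 1.3B model via diffusers.
# Assumes a recent diffusers release with Wan 2.1 support and the
# "Wan-AI/Wan2.1-T2V-1.3B-Diffusers" repo id (both are assumptions).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use on consumer GPUs

# Prompts can be written in English or Chinese, per the model card.
prompt = "A cat walking along a rainy Tokyo street at night, cinematic lighting"
frames = pipe(
    prompt=prompt,
    height=480,
    width=832,       # 480p output, matching the benchmark figure above
    num_frames=81,   # roughly five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "wan_t2v_sample.mp4", fps=16)
```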
Coming to the architecture, the researchers revealed that the Wan 2.1 models are built on a diffusion transformer architecture. However, the company reworked the base design with a new variational autoencoder (VAE), new training strategies, and more.
Most notably, the AI models use a new 3D causal VAE architecture dubbed Wan-VAE. It improves spatiotemporal compression and reduces memory usage. The autoencoder can encode and decode unlimited-length 1080p videos without losing historical temporal information, which enables consistent video generation.
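To illustrate what "causal" means along the time axis, the toy sketch below pads a 3D convolution only on the past side, so each output frame depends on the current and earlier frames but never on future ones. It is an illustrative example of the general idea, not the actual Wan-VAE code.

```python
# Illustrative sketch of a causal 3D convolution: padding is applied only on
# the "past" side of the time axis, so no output frame looks at future frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.time_pad = kernel_size - 1          # pad only with past frames
        space_pad = kernel_size // 2             # symmetric padding in H and W
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=(0, space_pad, space_pad))

    def forward(self, x):                        # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # left-pad the time axis
        return self.conv(x)

video = torch.randn(1, 3, 16, 64, 64)            # 16-frame toy clip
print(CausalConv3d(3, 8)(video).shape)           # -> torch.Size([1, 8, 16, 64, 64])
```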
Based on internal testing, the company claimed that the Wan 2.1 models outperform OpenAI's Sora AI model in consistency, scene generation quality, single-object accuracy, and spatial positioning.
These models are available under the Apache 2.0 licence, which permits unrestricted usage for academic and research purposes as well as commercial use.