皮皮鲁


Alibaba Finally Open-Sources Its Digital Human Technology!

Challenges in Digital Human Technology

Digital human technology has already been deployed in many scenarios, such as virtual digital human live streaming, where it greatly improves streaming efficiency and reduces costs. In building industry solutions, however, the key technical challenges are ensuring that a character model's movements are smooth and natural, and pairing them with flexible, varied spoken delivery to achieve a lifelike virtual anchor. Although many commercial platforms offer such solutions, Ant Group's Alipay has now open-sourced its own digital human technology: EchoMimic. It produces vivid, varied facial expressions and smooth, natural movement, making it well suited to generating virtual anchors and character videos. Users can now build their own digital human system at no cost!

Ant Group Open-Sources EchoMimic Digital Human Technology
EchoMimic can generate portrait videos driven by audio alone, by facial landmarks alone, or by a combination of audio and selected facial landmarks, as shown in the technical architecture below.

In the field of audio-driven animated portraits, significant progress has been made toward generating realistic dynamic portrait videos. Traditional methods animate an image using either audio or facial landmarks alone. Both can produce satisfactory results, but each has drawbacks: audio-only driving can be unstable because the audio signal is relatively weak, while landmark-only driving, though more stable, can look unnatural because the landmarks over-constrain the motion.

EchoMimic addresses these challenges by training on audio and facial landmarks simultaneously. Through a novel training strategy, it can generate portrait videos from audio alone, from facial landmarks alone, or from audio combined with selected facial landmarks. The authors compared EchoMimic comprehensively against other algorithms on several public datasets and their own collected datasets, showing strong performance in both quantitative and qualitative evaluations. Additional visual results and source code can be found on the EchoMimic project page.
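To make the training idea concrete, here is a minimal, hypothetical sketch of how joint conditioning on two modalities might look. The function name, drop probabilities, and fallback rule are illustrative assumptions, not EchoMimic's actual implementation; the point is only that randomly dropping one modality during training lets a single model later run in audio-only, landmark-only, or combined mode.

```python
import random

# Hypothetical sketch of multimodal condition dropout (not the actual
# EchoMimic code). Each training step randomly keeps or drops the audio
# and landmark conditions, so the model learns to generate from audio
# alone, landmarks alone, or both together.

AUDIO, LANDMARKS = "audio", "landmarks"

def sample_conditions(p_drop_audio=0.25, p_drop_landmarks=0.25, rng=random):
    """Return the list of modalities to condition on for one step."""
    conditions = []
    if rng.random() > p_drop_audio:
        conditions.append(AUDIO)
    if rng.random() > p_drop_landmarks:
        conditions.append(LANDMARKS)
    # Assumed fallback: never drop both, so every step stays conditioned.
    return conditions or [AUDIO]
```

At inference time the same model can then be invoked with whichever subset of conditions the user supplies, which is the flexibility the paragraph above describes.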

Github Link: https://github.com/BadToBest/EchoMimic
Official Website: https://badtobest.github.io/echomimic.html
