SingingBot: An Avatar-Driven System for Robotic Face Singing Performance

Zhuoxiong Xu, Xuanchen Li, Yuhao Cheng, Fei Xu, Yichao Yan, Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Teaser

Example results of SingingBot, our proposed robotic singing framework. Our method generates actuation signals for humanoid robots from arbitrary songs, conveying rich emotions while maintaining accurate lip-audio synchronization.

Abstract

Equipping robotic faces with singing capabilities is crucial for empathetic Human-Robot Interaction. However, existing research on driving robotic faces focuses primarily on conversation or mimicking static expressions, and struggles to meet the high demands of singing for continuous emotional expression and coherence. To address this, we propose a novel avatar-driven framework for appealing robotic singing. We first leverage portrait video generation models, which embed extensive human priors, to synthesize vivid singing avatars that provide reliable expression and emotion guidance. These facial features are then transferred to the robot via semantic-oriented mapping functions that span a wide expression space. Furthermore, to quantitatively evaluate the emotional richness of robotic singing, we propose the Emotion Dynamic Range metric, which measures emotional breadth within the Valence-Arousal space and reveals that a broad emotional spectrum is crucial for appealing performances. Comprehensive experiments demonstrate that our method achieves rich emotional expression while maintaining lip-audio synchronization, significantly outperforming existing approaches.
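The abstract does not give the formula for the Emotion Dynamic Range, so the sketch below shows only one plausible instantiation: assuming per-frame (valence, arousal) estimates from an off-the-shelf facial emotion predictor, it takes the convex-hull area of the trajectory as the measure of emotional breadth. The function name `emotion_dynamic_range` and this hull-based definition are illustrative assumptions, not the paper's metric.

```python
# Hypothetical sketch of an Emotion Dynamic Range (EDR) style metric.
# Assumption: emotional breadth is measured as the area covered by the
# per-frame (valence, arousal) points of a performance; the paper's
# exact definition may differ.
import numpy as np
from scipy.spatial import ConvexHull

def emotion_dynamic_range(va_trajectory: np.ndarray) -> float:
    """va_trajectory: (T, 2) array of per-frame (valence, arousal)
    values, e.g. predicted by a facial emotion estimator.
    Assumes the points are not all collinear."""
    if len(va_trajectory) < 3:
        return 0.0  # a 2-D hull needs at least 3 points
    hull = ConvexHull(va_trajectory)
    return hull.volume  # in 2-D, ConvexHull.volume is the enclosed area
```

Under this reading, a performance whose affect stays pinned near one point scores near zero, while one that sweeps across several Valence-Arousal quadrants scores high, matching the claim that a broad emotional spectrum makes performances appealing.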

Video

Pipeline

Overall pipeline of SingingBot for robotic singing performance. Given vocal audio and a reference portrait, our method first synthesizes a vivid singing-avatar animation using a pretrained video diffusion model. Benefiting from extensive embedded expression and emotion priors, the avatar animation serves as a reliable driving source for the subsequent robotic performance. The avatar's facial features are then mapped to the physical robot's motion space through semantic-oriented piecewise functions.
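The caption describes the avatar-to-robot mapping only at a high level. Below is a minimal sketch of one way such a semantic-oriented piecewise function could look, assuming the avatar's expressions arrive as normalized blendshape coefficients in [0, 1] and each robot actuator takes an angle in degrees; the breakpoint values and names (`JAW_BREAKPOINTS`, `map_feature_to_servo`) are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a semantic-oriented piecewise mapping from an
# avatar facial feature to a robot actuator command.
import numpy as np

# Semantic breakpoints as (blendshape value, servo angle) pairs,
# e.g. closed / half-open / fully open for a jaw actuator.
# These numbers are placeholders, not calibrated values.
JAW_BREAKPOINTS = np.array([[0.0, 0.0],
                            [0.5, 18.0],
                            [1.0, 35.0]])

def map_feature_to_servo(coeff: float, breakpoints: np.ndarray) -> float:
    """Piecewise-linear interpolation from a facial coefficient to a
    servo command; clipping keeps the actuator in its safe range."""
    coeff = float(np.clip(coeff, breakpoints[0, 0], breakpoints[-1, 0]))
    return float(np.interp(coeff, breakpoints[:, 0], breakpoints[:, 1]))

# Example: a half-open avatar jaw maps to an 18-degree servo command.
print(map_feature_to_servo(0.5, JAW_BREAKPOINTS))  # -> 18.0
```

Anchoring the segments at semantically meaningful poses (neutral, mid, extreme) rather than fitting a single linear gain is one way a mapping could span a wide expression space while respecting the actuator's physical limits.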