Prof. Wei XUE, Assistant Professor of the Division of Emerging Interdisciplinary Areas (EMIA), shared his insights on building and improving singing voice models at the 18th National Conference on Man-Machine Speech Communication on Dec 8, 2023 and at the 3rd SpeechHome Conference on Speech Technology on Nov 18, 2023.
The National Conference on Man-Machine Speech Communication (NCMMSC) is an important platform for domestic experts, scholars, and researchers in the speech field to exchange their latest research results and promote continued progress in research and development in this area.
Titled "Building the Singing Voice Foundation Model", Prof. XUE shared the research results by his team of constructing a large model of singing foundation to realize cross-gender, language, range, zero-resource, and fast-generation song synthesis. Unlike traditional AI singers that require hours of training data and a fixed repertoire, this model can support lyrics and tune modification, and can achieve the effect of singing any new song using only a few tens of seconds of data to achieve song synthesis instead of simple conversion.
The SpeechHome Conference on Speech Technology aims to promote exchanges on voice technology among industry, academia, and research institutes, provide insight into future trends in technological innovation, and advance the development of intelligent voice technology in cutting-edge and open-source fields.
At this conference, Prof. XUE identified existing problems in vocal synthesis, including the extreme scarcity of labelled data, the high cost of fine-grained labelling, and limited timbre variety. He then introduced an approach to "high-speed, high-quality, zero-resource vocal synthesis". Supported by technologies including CoMoSpeech and ZSinger, this diffusion model-based vocal synthesis method can be deployed in real time for industrial-grade applications and allows the modelling and lyric/melody control of arbitrary human timbres without labelled data.