User profiles for Rongjie Huang

Rongjie Huang

Zhejiang University
Verified email at zju.edu.cn
Cited by 1059

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the lack …

AudioGPT: Understanding and generating speech, music, sound, and talking head

R Huang, M Li, D Yang, J Shi, X Chang, Z Ye… - Proceedings of the …, 2024 - ojs.aaai.org
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …

ProDiff: Progressive fast diffusion model for high-quality text-to-speech

R Huang, Z Zhao, H Liu, J Liu, C Cui… - Proceedings of the 30th …, 2022 - dl.acm.org
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …

FastDiff: A fast conditional diffusion model for high-quality speech synthesis

R Huang, MWY Lam, J Wang, D Su, D Yu… - arXiv preprint arXiv …, 2022 - arxiv.org
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …

GenerSpeech: Towards style transfer for generalizable out-of-domain text-to-speech

R Huang, Y Ren, J Liu, C Cui… - Advances in Neural …, 2022 - proceedings.neurips.cc
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples
with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic …

M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus

…, L Deng, J Liu, Y Ren, J He, R Huang… - Advances in …, 2022 - proceedings.neurips.cc
The lack of publicly available high-quality and accurately labeled datasets has long been a
major bottleneck for singing voice synthesis (SVS). To tackle this problem, we present …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …

Multi-Singer: Fast multi-singer singing voice vocoder with a large-scale corpus

R Huang, F Chen, Y Ren, J Liu, C Cui… - Proceedings of the 29th …, 2021 - dl.acm.org
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to
the singing voice data shortage, limited singer generalization, and large computational cost. …

InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt

D Yang, S Liu, R Huang, C Weng, H Meng - arXiv preprint arXiv …, 2023 - arxiv.org
Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according
to human demands. Nowadays, there are two common ways to control speaking styles: (1…

SingGAN: Generative adversarial network for high-fidelity singing voice generation

R Huang, C Cui, F Chen, Y Ren, J Liu, Z Zhao… - Proceedings of the 30th …, 2022 - dl.acm.org
Deep generative models have achieved significant progress in speech synthesis to date,
while high-fidelity singing voice synthesis is still an open problem for its long continuous …