Google Scholar

User profiles for Rongjie Huang

Rongjie Huang

Zhejiang University

Verified email at zju.edu.cn

Cited by 1059

[PDF] mlr.press

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the lack …

Cited by 132

[PDF] aaai.org

AudioGPT: Understanding and generating speech, music, sound, and talking head

R Huang, M Li, D Yang, J Shi, X Chang, Z Ye… - Proceedings of the …, 2024 - ojs.aaai.org

Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …

Cited by 89

[PDF] arxiv.org

ProDiff: Progressive fast diffusion model for high-quality text-to-speech

R Huang, Z Zhao, H Liu, J Liu, C Cui… - Proceedings of the 30th …, 2022 - dl.acm.org

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …

Cited by 109

[PDF] arxiv.org

FastDiff: A fast conditional diffusion model for high-quality speech synthesis

R Huang, MWY Lam, J Wang, D Su, D Yu… - arXiv preprint arXiv …, 2022 - arxiv.org

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …

Cited by 121

[PDF] neurips.cc

GenerSpeech: Towards style transfer for generalizable out-of-domain text-to-speech

R Huang, Y Ren, J Liu, C Cui… - Advances in Neural …, 2022 - proceedings.neurips.cc

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples
with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic …

Cited by 60

[PDF] neurips.cc

M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus

…, L Deng, J Liu, Y Ren, J He, R Huang… - Advances in …, 2022 - proceedings.neurips.cc

The lack of publicly available high-quality and accurately labeled datasets has long been a
major bottleneck for singing voice synthesis (SVS). To tackle this problem, we present …

Cited by 40

[PDF] thecvf.com

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com

Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …

Cited by 12

[PDF] arxiv.org

Multi-Singer: Fast multi-singer singing voice vocoder with a large-scale corpus

R Huang, F Chen, Y Ren, J Liu, C Cui… - Proceedings of the 29th …, 2021 - dl.acm.org

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to
the singing voice data shortage, limited singer generalization, and large computational cost. …

Cited by 70

[PDF] arxiv.org

InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt

D Yang, S Liu, R Huang, C Weng, H Meng - arXiv preprint arXiv …, 2023 - arxiv.org

Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according
to human's demands. Nowadays, there are two common ways to control speaking styles: (1…

Cited by 37

[PDF] arxiv.org

SingGAN: Generative adversarial network for high-fidelity singing voice generation

R Huang, C Cui, F Chen, Y Ren, J Liu, Z Zhao… - Proceedings of the 30th …, 2022 - dl.acm.org

Deep generative models have achieved significant progress in speech synthesis to date,
while high-fidelity singing voice synthesis is still an open problem for its long continuous …

Cited by 48
