User profiles for Rongjie Huang
Rongjie Huang | Zhejiang University | Verified email at zju.edu.cn | Cited by 1059
Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the lack …
Audiogpt: Understanding and generating speech, music, sound, and talking head
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …
Prodiff: Progressive fast diffusion model for high-quality text-to-speech
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …
Fastdiff: A fast conditional diffusion model for high-quality speech synthesis
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading
performances in many generative tasks. However, the inherited iterative sampling process costs …
Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples
with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic …
M4singer: A multi-style, multi-singer and musical score provided mandarin singing corpus
The lack of publicly available high-quality and accurately labeled datasets has long been a
major bottleneck for singing voice synthesis (SVS). To tackle this problem, we present …
Mixspeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition
Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …
Multi-singer: Fast multi-singer singing voice vocoder with a large-scale corpus
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to
the singing voice data shortage, limited singer generalization, and large computational cost. …
Instructtts: Modelling expressive TTS in discrete latent space with natural language style prompt
Expressive text-to-speech (TTS) aims to synthesize different speaking style speech according
to human's demands. Nowadays, there are two common ways to control speaking styles: (1…
Singgan: Generative adversarial network for high-fidelity singing voice generation
Deep generative models have achieved significant progress in speech synthesis to date,
while high-fidelity singing voice synthesis is still an open problem for its long continuous …