Real-Time Voice Cloning Using Deep Learning
DOI: https://doi.org/10.52783/jns.v14.3981

Keywords: Voice Cloning, Real-Time Speech Synthesis, Deep Learning, Speaker Embedding, Tacotron 2, WaveRNN, Zero-Shot Learning, Neural Vocoder

Abstract
Voice cloning—the ability to synthesize natural-sounding speech in a target speaker’s voice—has emerged as a powerful tool with applications in accessibility, virtual assistants, entertainment, and human-computer interaction. Traditional voice synthesis systems are often constrained by the need for extensive speaker-specific data and prolonged training cycles, limiting their scalability and adaptability. This paper presents a real-time deep learning-based voice cloning framework capable of synthesizing speech in any speaker’s voice using only a few seconds of reference audio. The architecture integrates a speaker encoder for extracting vocal identity, a text-to-spectrogram synthesizer based on Tacotron 2, and a WaveRNN vocoder for high-fidelity waveform generation. Advanced preprocessing, such as silence trimming and normalization, is employed to enhance speaker embedding quality. The system operates in a zero-shot setting without the need for speaker-specific retraining. Objective evaluation metrics including PESQ, STOI, and Mel Cepstral Distortion (MCD) demonstrate the effectiveness of the proposed model, achieving notable improvements in speech quality, intelligibility, and speaker similarity compared to baseline approaches. This work contributes to advancing real-time, data-efficient, and scalable voice synthesis systems and highlights their potential across a range of real-world applications.
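To make two of the steps named in the abstract concrete—the silence-trimming/normalization preprocessing applied to reference audio, and the Mel Cepstral Distortion metric—the sketch below shows one possible implementation. This is not the paper's code: the file names, 16 kHz sample rate, and 30 dB trim threshold are assumptions, and librosa MFCCs are used as a stand-in for the mel cepstra that MCD is usually computed from (with frames aligned by truncation rather than dynamic time warping).

import numpy as np
import librosa


def preprocess_reference(path, sr=16000, top_db=30):
    """Load a reference clip, trim leading/trailing silence, and peak-normalize.

    Mirrors the preprocessing described in the abstract; the sample rate and
    trim threshold are illustrative choices, not the paper's settings.
    """
    wav, _ = librosa.load(path, sr=sr, mono=True)
    wav, _ = librosa.effects.trim(wav, top_db=top_db)
    return librosa.util.normalize(wav)


def mel_cepstral_distortion(ref_wav, syn_wav, sr=16000, n_mfcc=13):
    """Approximate MCD (in dB) between reference and synthesized waveforms.

    Drops the 0th (energy) coefficient and applies the standard formula
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames.
    """
    ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    n_frames = min(ref.shape[1], syn.shape[1])       # crude alignment by truncation
    diff = ref[:, :n_frames] - syn[:, :n_frames]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    # "reference.wav" and "synthesized.wav" are hypothetical example files.
    reference = preprocess_reference("reference.wav")
    synthesized = preprocess_reference("synthesized.wav")
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.2f} dB")

Lower MCD values indicate closer spectral similarity between the cloned and reference speech; published systems typically pair this with PESQ and STOI scores, which require dedicated toolkits.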
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.