Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data
This research presents a neural network based voice conversion model. While voiced sounds and prosody are known to be the most important components of the voice conversion framework, their objective contributions, particularly in a noisy and uncontrolled environment, are not known. The model uses a three-layer feedforward neural network to map the linear prediction analysis coefficients of a source speaker to the acoustic vector space of the target speaker, with a view to objectively determining the contributions of the voiced, unvoiced and supra-segmental components of sounds to the voice conversion model. Results showed that the vowels “a”, “i” and “o” make the most significant contribution to conversion success. The voiceless sounds were also found to be the most affected by the noisy training data. An average noise level of 40 dB above the noise floor was found to degrade the voice conversion success by 55.14 percent relative to the voiced sounds. The results also show that for cross-gender voice conversion, prosody conversion is more significant in scenarios where a female is the target speaker.
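The mapping described above, from source-speaker linear prediction coefficients to target-speaker coefficients through a three-layer feedforward network, can be illustrated with a minimal sketch. This is not the authors' implementation: the prediction order (12), hidden layer size, tanh activation, learning rate, and the synthetic placeholder data standing in for time-aligned source and target frames are all assumptions chosen for illustration.

```python
import numpy as np


def train_mapping(src, tgt, hidden=16, epochs=500, lr=0.1, seed=0):
    """Fit an input / tanh-hidden / linear-output network mapping src to tgt.

    src, tgt: arrays of shape (frames, coefficients), assumed time-aligned.
    Returns the learned parameters and the per-epoch mean squared error.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = src.shape[1], tgt.shape[1]
    W1 = rng.normal(0.0, 0.1, (d_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = np.tanh(src @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - tgt
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared error
        g_pred = 2.0 * err / err.size
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = (g_pred @ W2.T) * (1.0 - h ** 2)
        g_W1 = src.T @ g_h
        g_b1 = g_h.sum(axis=0)
        # Plain gradient descent update
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


def convert(params, src):
    """Map source coefficient frames into the target speaker's space."""
    W1, b1, W2, b2 = params
    return np.tanh(src @ W1 + b1) @ W2 + b2


# Synthetic placeholder data (assumption): 200 frames of order-12 coefficients.
# Real use would substitute aligned LPC frames from the two speakers.
rng = np.random.default_rng(1)
src = rng.normal(size=(200, 12))
tgt = np.tanh(src @ (0.3 * rng.normal(size=(12, 12))))

params, losses = train_mapping(src, tgt)
converted = convert(params, src)
```

In practice the converted coefficients would drive a synthesis filter, and the source and target frames would need alignment (for example by dynamic time warping) before training; both steps are outside this sketch.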