Emotional Intelligence in Text-To-Speech Synthesis in Pali Language Using Fuzzy Logic
DOI: https://doi.org/10.46947/joaasr632024938
Keywords: Emotional Speech Synthesis, Text-To-Speech, Fuzzy Logic, Unit Selection, Speech Synthesis
Abstract
The field of emotional text-to-speech (TTS) synthesis is making swift progress within the realm of artificial intelligence and holds immense promise to transform how we interact with technology. By using advanced algorithms to analyze and understand the emotional content of text, these systems can produce spoken language that accurately conveys the intended emotional tone of the message. Although Text-To-Speech systems exist for many languages, Pali does not yet have one. We have therefore taken the initiative to create a Text-To-Speech synthesizer exclusively for Pali. Our system offers an end-to-end solution for emotional speech synthesis via Text-To-Speech. We address the problem by combining disentangled, fine-grained prosody features with a global, sentence-level emotion embedding. These fine-grained features learn to represent local prosodic variation disentangled from the speaker, the tone, and the global emotion label. Because prosody is usually modeled by rules, we implemented a fuzzy logic system to develop a controller for the prosody of Pali speech. The fuzzy controller handles different linguistic parameters in three types of sentences: interrogative, exclamatory, and declarative. The final system produces intelligible speech that mimics the appropriate intonation for each type of sentence.
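As an illustration only, and not the authors' actual implementation, the following Python sketch shows how a fuzzy controller of the kind described above could map a sentence-type score and a sentence length to a sentence-final pitch (F0) movement. The input encodings, membership functions, rule base, and output values are assumptions made for the example.

# Minimal sketch (not the paper's implementation) of a fuzzy prosody controller
# that maps sentence type and length to a sentence-final F0 rise or fall.

def tri(x, a, b, c):
    """Triangular membership: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical encoding: sentence type as a crisp "questionness" score
# (0 = declarative, 0.5 = exclamatory, 1 = interrogative), length in words.
def f0_rise_percent(question_score, n_words):
    # Fuzzify the inputs.
    declarative   = tri(question_score, -0.5, 0.0, 0.5)
    exclamatory   = tri(question_score,  0.0, 0.5, 1.0)
    interrogative = tri(question_score,  0.5, 1.0, 1.5)
    short_sent    = tri(n_words, -1, 0, 12)
    long_sent     = tri(n_words,  6, 20, 60)

    # Illustrative rule base: each rule pairs a firing strength with a crisp
    # consequent value for the sentence-final F0 movement (in percent).
    rules = [
        (min(interrogative, short_sent), 40.0),   # short question: steep rise
        (min(interrogative, long_sent),  25.0),   # long question: milder rise
        (exclamatory,                    15.0),   # exclamation: moderate rise
        (declarative,                   -10.0),   # statement: final fall
    ]

    # Weighted-average (zero-order Sugeno) defuzzification.
    total_w = sum(w for w, _ in rules)
    return sum(w * v for w, v in rules) / total_w if total_w else 0.0

if __name__ == "__main__":
    print(f0_rise_percent(1.0, 5))    # short interrogative -> strong rise
    print(f0_rise_percent(0.0, 15))   # declarative -> final fall

The weighted-average step corresponds to zero-order Sugeno defuzzification; a Mamdani controller with centroid defuzzification over output membership functions would serve the same purpose.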
In this paper, we introduce and outline the application of a fuzzy paradigm to implement a Text-To-Speech system for the Pali language while retaining a rule-based concatenative synthesizer. Within the framework of classical concatenative TTS systems, we propose a new method to improve the unit-selection computation, aimed at raising the perceptual quality of the synthetic speech. To tackle the problem of phonemes that admit multiple descriptions in rule-based speech synthesis, the proposed solution employs a fuzzy system.
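The sketch below is again purely illustrative, not the paper's algorithm: it shows one way a fuzzy degree of fit for an ambiguously described phoneme could be folded into the target cost of concatenative unit selection. The Unit fields, cost weights, and trapezoidal membership range are hypothetical.

# Illustrative sketch (assumed design) of fuzzy-weighted unit-selection cost.
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str
    duration_ms: float
    f0_hz: float

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership on [a, d] with plateau [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def target_cost(unit, spec):
    """Distance between a candidate unit and the target specification,
    penalised by a poor fuzzy fit to the phoneme's (ambiguous) description."""
    dist = abs(unit.duration_ms - spec["duration_ms"]) / 100.0 \
         + abs(unit.f0_hz - spec["f0_hz"]) / 50.0
    # Fuzzy degree of match: how canonical the unit's duration is here.
    fit = trapezoid(unit.duration_ms, 40, 60, 120, 180)
    return dist + (1.0 - fit)

def join_cost(prev, unit):
    """Prosodic mismatch at the concatenation point (F0 jump only)."""
    return abs(prev.f0_hz - unit.f0_hz) / 50.0 if prev else 0.0

def select(candidates_per_slot, specs):
    """Greedy left-to-right selection over candidate units per target slot."""
    chosen, prev = [], None
    for candidates, spec in zip(candidates_per_slot, specs):
        best = min(candidates,
                   key=lambda u: target_cost(u, spec) + join_cost(prev, u))
        chosen.append(best)
        prev = best
    return chosen

In a full unit-selection system the greedy pass would be replaced by a Viterbi search over the candidate lattice, which is the standard approach; the point of the sketch is only where the fuzzy fit enters the cost.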
In the introductory section, we offer a concise description of the current context surrounding the challenge of emotional speech synthesis. The second section outlines the notable advancements made in emotional speech synthesis, acknowledging the contributions of various researchers in this field. The third section delves into the technical details of implementing a fuzzy system. The last section presents the main conclusions and the scope for future research.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.