International Journal of Scientific & Technology Research



IJSTR >> Volume 8 - Issue 11, November 2019 Edition


Website: http://www.ijstr.org

ISSN 2277-8616

Improving Estimation Of Valence And Arousal Emotion Dimensions Based On Emotion Unit




Reda Elbarougy



Acoustic feature extraction, emotion dimension estimation, voiced segment, support vector regression (SVR), improving the challenging valence dimension, speech segmentation, proper time scale for emotion recognition.



The objective of this research is to improve the estimation accuracy of two emotion dimensions, valence and arousal. Earlier studies of speech emotion recognition (SER) mostly assumed that the affective content is stable throughout the entire utterance, and therefore treated the whole utterance as a single unit when estimating these dimensions. However, this assumption does not hold, especially for long utterances, because emotion is dynamic and may fluctuate within an utterance. Consequently, low-level descriptors extracted from such utterances are less effective for SER systems, since they mix different affective states. Most of these studies also did not investigate the proper time scale to use when extracting features. Therefore, a novel emotion unit based on voiced segments is proposed to improve the estimation accuracy. The proposed method is evaluated with an SER system based on the dimensional approach using support vector regression (SVR), and validated on the EMO-DB database. Accuracy is measured by the mean absolute error (MAE) of the estimated valence and arousal values. The results show that emotion units containing three and four voiced segments give the lowest MAE for valence and arousal, respectively. The proposed voiced-segment emotion unit outperforms the conventional utterance unit for both dimensions: MAE decreases from 0.68 to 0.51 for valence and from 0.34 to 0.21 for arousal.
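The two ideas in the abstract that are easiest to make concrete are the emotion unit (a group of consecutive voiced segments) and the MAE metric. The following is a minimal sketch of both, not the author's implementation. The function names, the segment times, and the rating values are hypothetical, and keeping a shorter trailing unit when the segments do not divide evenly is an assumption, since the abstract does not say how leftovers are handled. Voiced-segment detection, feature extraction, and the SVR model itself are omitted.

```python
from typing import List, Sequence, Tuple

# One voiced segment as (start, end) in seconds. Hypothetical representation.
Segment = Tuple[float, float]


def emotion_units(voiced: Sequence[Segment], k: int) -> List[Segment]:
    """Group consecutive voiced segments into units of k segments each.

    Each unit spans from the start of its first segment to the end of its
    last segment. A trailing group with fewer than k segments is kept as a
    shorter unit (an assumption, not stated in the abstract).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    units = []
    for i in range(0, len(voiced), k):
        group = voiced[i:i + k]
        units.append((group[0][0], group[-1][1]))
    return units


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |true - predicted| over all items."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error value dropped, relative to its old value."""
    return (before - after) / before


# Hypothetical voiced segments of one utterance, grouped three at a time.
voiced = [(0.10, 0.35), (0.50, 0.80), (0.95, 1.20), (1.40, 1.75), (1.90, 2.10)]
units = emotion_units(voiced, k=3)

# Hypothetical valence ratings and estimates for three units.
mae = mean_absolute_error([1.0, -0.5, 0.0], [0.5, -0.5, 1.0])

# Relative MAE reduction computed from the figures reported in the abstract.
valence_gain = relative_reduction(0.68, 0.51)
arousal_gain = relative_reduction(0.34, 0.21)

print(units)
print(mae)
print(round(valence_gain, 2), round(arousal_gain, 2))
```

With the MAE values reported in the abstract, the relative reduction works out to roughly 25% for valence and roughly 38% for arousal. These percentages are derived here by simple arithmetic and are not stated in the source.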


