
Speech Emotion Recognition: Humans vs Machines

https://doi.org/10.32603/2412-8562-2019-5-5-136-152

Abstract

Introduction. The study focuses on the perception of emotional speech and on speech emotion recognition from prosodic cues alone. Theoretical problems of defining prosody, intonation and emotion, along with the challenges of emotion classification, are discussed. An overview of acoustic and perceptual correlates of emotions found in speech is provided. Technical approaches to speech emotion recognition are also considered in light of recent automatic emotional speech classification experiments.
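
For readers unfamiliar with the technical side, the sketch below illustrates the kind of prosody-based classification pipeline such experiments typically rely on. It is a generic Python example, not a system discussed in the paper: the feature matrix and labels are random placeholders, and the scikit-learn support-vector classifier is merely one common choice of model.

    # Illustrative only: a generic prosody-based SER pipeline, not the systems
    # discussed in the paper.  The feature matrix and labels are random
    # placeholders standing in for per-utterance prosodic features
    # (e.g. F0 mean and range, intensity, speech rate).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))                                  # placeholder features
    y = rng.choice(["anger", "joy", "sadness", "fear"], size=200)  # placeholder labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))       # SVMs are a common SER baseline
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))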

Methodology and sources. The “big six” classification commonly used in technical applications is chosen and modified to include such emotions as disgust and shame.
A database of emotional speech in Russian is created under sound laboratory conditions. A perception experiment is run using Praat software’s experimental environment.
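
The abstract states only that Praat was used for the perception experiment; as a minimal sketch of how basic prosodic measures could be taken from one of the recorded samples, the fragment below uses the praat-parselmouth Python bindings. The file name and the particular measures (F0 and intensity) are assumptions made for illustration.

    # Minimal sketch, assuming the praat-parselmouth Python bindings are installed
    # (pip install praat-parselmouth); "anger_01.wav" is a hypothetical file name
    # standing in for one of the recorded Russian samples.
    import parselmouth

    snd = parselmouth.Sound("anger_01.wav")
    pitch = snd.to_pitch()                            # F0 track in Hz, 0 where unvoiced
    f0 = pitch.selected_array["frequency"]
    voiced = f0[f0 > 0]
    intensity = snd.to_intensity()                    # intensity contour in dB

    print(f"mean F0: {voiced.mean():.1f} Hz, "
          f"F0 range: {voiced.min():.1f}-{voiced.max():.1f} Hz")
    print(f"mean intensity: {intensity.values.mean():.1f} dB")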

Results and discussion. Cross-cultural emotion recognition proved possible, as the Finnish and international participants recognised about half of the samples correctly. Nonetheless, native speakers of Russian identified a larger proportion of emotions correctly. The effects of foreign language knowledge, musical training and gender on performance in the experiment were not pronounced. The most commonly confused pairs of emotions, such as shame and sadness, surprise and fear, and anger and disgust, as well as confusions with the neutral state, were also given due attention.
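
Confusion patterns of this kind can be tallied directly from pairs of intended and perceived emotions. The sketch below shows one way to do so in Python; the response pairs are placeholders for illustration, not data from the experiment.

    # Illustrative tally of (intended, perceived) listener responses; the pairs
    # below are placeholders, not data from the experiment.
    from collections import Counter

    responses = [("shame", "sadness"), ("shame", "shame"), ("surprise", "fear"),
                 ("anger", "disgust"), ("anger", "anger"), ("fear", "neutral")]

    confusions = Counter((intended, perceived)
                         for intended, perceived in responses
                         if intended != perceived)

    for (intended, perceived), n in confusions.most_common():
        print(f"{intended} heard as {perceived}: {n}")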

Conclusion. The work can contribute to psychological studies, by clarifying emotion classification and the gender aspect of emotionality; to linguistic research, by providing new evidence for prosodic and comparative language studies; and to language technology, by deepening the understanding of possible challenges facing SER systems.

About the Authors

S. Werner
University of Eastern Finland
Finland

Stefan Werner – PhD in Linguistics (2000), Professor, University of Eastern Finland

FI-80100 Joensuu; FI-70210 Kuopio, Finland



G. N. Petrenko
Saint Petersburg Electrotechnical University
Russian Federation

Georgii N. Petrenko – Assistant Lecturer at the Department of Foreign Languages 

5 Professora Popova str., St Petersburg 197376





For citations:


Werner S., Petrenko G.N. Speech Emotion Recognition: Humans vs Machines. Discourse. 2019;5(5):136-152. https://doi.org/10.32603/2412-8562-2019-5-5-136-152



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2412-8562 (Print)
ISSN 2658-7777 (Online)