Introduction

In recent years, with the rapid development of human-computer interaction technology, speech emotion recognition has attracted extensive attention from researchers. In particular, a great deal of work has been devoted to speech emotion feature extraction, with fruitful results (Adelman, p12).

Language is an essential tool for human communication. Human speech carries not only lexical information but also the speaker's emotions and other paralinguistic information. For example, depending on the speaker's feelings, the same sentence can convey different meanings and leave different impressions on the listener (Hamada, p16). The traditional information-science view treats speech as a "neutral" carrier of lexical content: conventional speech processing systems focus only on the accuracy of lexical communication and completely ignore the emotional factors contained in the speech signal. Such systems therefore reflect only one aspect of the information.

The world of perceptual information, which complements this lexical knowledge and is equally important, is also an essential part of information processing. Therefore, the automatic processing of emotional features in signals is of great significance to signal processing and artificial intelligence. Extracting emotional characteristics from a speech signal and judging the speaker's emotion has emerged as a new research topic in recent years. Because this line of research has only just started (Hamada, p5), there are still few results in the field. At the present stage, work is generally limited to finding physical parameters that reflect emotional characteristics by analyzing components of the emotional speech signal such as duration, speech rate, amplitude, fundamental frequency, and frequency spectrum.

This paper investigates the time structure, amplitude structure, fundamental frequency structure, and formant structure of speech signals for four emotions: joy, anger, surprise, and sadness. The characteristics of the different emotional signals were identified by comparing them with non-emotional, calm speech signals. As a preliminary study of emotion signal processing, this work provides theoretical data of practical value for emotional speech signal processing and emotion recognition (Hamada, p8). Selecting appropriate speech data for emotional and psychological analysis is itself an important problem; however, neither domestic nor international research (Lee, p14) has yet proposed conditions or standards for voice data used in emotional analysis.

In our emotional analysis experiment, two criteria guided the selection of experimental sentences: first, the sentences chosen must allow a high degree of emotional freedom; second, the same sentence should be usable with a variety of emotions for analysis and comparison. Based on these two principles, the two sentences shown in Figure 1 were selected as the speech data for emotional analysis. Five male performers were asked to pronounce each sentence three times with each of five emotions: calm, joy, anger, surprise, and sadness. A total of 300 utterance samples were collected and analyzed. This paper also introduces a model of a speech emotion recognition system, then summarizes the application of speech emotion features in emotion recognition and discusses the problems faced by emotion feature extraction.

Results & Discussion

Sound consists of waves produced by the vibration of an object. It is a wave phenomenon, transmitted through a medium (air, solid, or liquid), that can be sensed by the auditory organs of humans or animals. The object that initially vibrates is called a sound source. The basic analogue form of sound (voice information) is a sound wave called a voice signal. Voice signals can be converted into electrical signals by a microphone and displayed as speech waveforms, for example the waveform of the utterance "should we catch up?" (Huang, p12). In Figure 1, the abscissa represents the mean valence-activation rating and the ordinate represents the average jitter of each sentence.
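To make the jitter measure in Figure 1 concrete, the sketch below (assuming Python with the librosa library and a hypothetical file name) estimates the average jitter of one utterance as the mean absolute difference between consecutive pitch periods divided by the mean period. This is one common definition of local jitter and is not necessarily the exact formula used for the figure.

```python
import numpy as np
import librosa

# Load one utterance; "sentence.wav" is a hypothetical file name.
y, sr = librosa.load("sentence.wav", sr=None)

# Frame-wise F0 track via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Pitch periods (seconds) of the voiced frames only.
periods = 1.0 / f0[voiced_flag]

# Average (local) jitter: mean absolute difference of consecutive periods
# divided by the mean period.
jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
print(f"average jitter: {jitter:.4f}")
```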

Figure 1. Correlation between mean ratings and average jitter of each sentence

As can be seen from the analysis of emotional vocalization time structure in Figure 1, the durations of happy, angry, and surprised pronunciations are compressed, while the duration of sad pronunciations is slightly lengthened. Among the compressed utterances of joy, anger, and surprise, anger was the shortest, followed by surprise, then joy. Regarding the relationship between speech rate and emotion, happy, angry, surprised, and calm pronunciations were faster, while sad pronunciations were slower.

Closer observation shows that these phenomena arise because, compared with calm speech, some phonemes in emotional speech are pronounced ambiguously, lengthened, or omitted. Based on the above analysis, the temporal structure of emotional speech can be used to distinguish sadness from the other emotional signals. It is also possible to separate the emotional signals of joy, anger, and surprise by setting a specific duration threshold, as sketched below. For anger and surprise signals, however, the temporal structure alone is clearly insufficient to produce reliable results.
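As an illustration of the duration-threshold idea above, the following minimal sketch labels an utterance as sad-like when it is noticeably longer than a calm reference reading of the same sentence. The threshold ratio is a hypothetical placeholder, not a value measured in this study.

```python
def classify_by_duration(duration_s: float, calm_duration_s: float,
                         lengthening_threshold: float = 1.1) -> str:
    """Label an utterance 'sad-like' if it is noticeably longer than the calm
    reference reading of the same sentence; otherwise treat it as non-sad."""
    ratio = duration_s / calm_duration_s
    if ratio > lengthening_threshold:
        return "sad-like (lengthened)"
    return "non-sad (compressed or comparable)"

# Example: a 2.6 s emotional reading versus a 2.3 s calm reading of the sentence.
print(classify_by_duration(2.6, 2.3))   # -> sad-like (lengthened)
```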

Figure 2.

The results showed that the amplitude range of happiness, anger, and surprise was wider than that of the calm emotion. In contrast, the amplitude range of sadness was narrower than that of calm speech. According to the listening experiments, the emotional effect also tended to be more obvious the larger the average amplitude of happiness, anger, and surprise, and the smaller the average amplitude of sadness. Using amplitude characteristics, we can therefore clearly distinguish joy, anger, and surprise from sadness. Amplitude characteristics can also help identify the emotional signals of joy, anger, and surprise.
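The amplitude measures discussed above can be computed, for example, from a frame-wise RMS envelope. The sketch below (Python with librosa; the file name is a hypothetical example) estimates the mean amplitude and the dynamic range of one utterance.

```python
import numpy as np
import librosa

# "angry_take1.wav" is a hypothetical file name for one emotional utterance.
y, sr = librosa.load("angry_take1.wav", sr=None)

# Frame-wise RMS energy envelope of the signal.
rms = librosa.feature.rms(y=y)[0]

mean_amplitude = float(np.mean(rms))
# Dynamic range in dB between the loudest and the quietest frame.
dynamic_range_db = 20.0 * np.log10(np.max(rms) / (np.min(rms) + 1e-10))

print(f"mean RMS amplitude: {mean_amplitude:.4f}")
print(f"dynamic range:      {dynamic_range_db:.1f} dB")
```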

Analysis of the fundamental frequency structure shows that the fundamental frequency is also one of the critical characteristics reflecting emotional information (Ben-David, p4). To analyze the fundamental frequency structure of the emotional speech signal, we first obtain a smoothed fundamental frequency trajectory curve for each emotional utterance (Bänziger, p6), then analyze how this trajectory changes across emotions and identify the structural characteristics of the emotional signals with different fundamental frequency patterns.
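One possible realization of this step is sketched below: the fundamental frequency trajectory is extracted with a probabilistic YIN tracker and smoothed with a short median filter. Both the tracker and the filter are illustrative choices, not necessarily the method used in this study, and the file name is hypothetical.

```python
import numpy as np
import librosa
from scipy.signal import medfilt

# "joy_take1.wav" is a hypothetical file name for one emotional utterance.
y, sr = librosa.load("joy_take1.wav", sr=None)

# Probabilistic YIN F0 tracking; unvoiced frames are returned as NaN.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Keep voiced frames only and smooth the trajectory with a median filter.
voiced_f0 = f0[voiced_flag]
smooth_f0 = medfilt(voiced_f0, kernel_size=5)

# Simple structural descriptors of the smoothed trajectory.
mean_f0 = float(np.mean(smooth_f0))
f0_range = float(np.max(smooth_f0) - np.min(smooth_f0))
print(f"mean F0: {mean_f0:.1f} Hz, F0 range: {f0_range:.1f} Hz")
```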

Compared with calm speech, the frequency of the first formant increased slightly for joy and anger, while it decreased significantly for sadness. On closer inspection, we found that this is because people open their mouths wider when expressing joy and anger than when they speak calmly. The expression of sadness, in contrast, is accompanied by a smaller mouth opening and a vague nasal quality.

The dynamic range of the first formant frequency is higher for all four emotions than for the calm state, with surprise showing the largest range. The rate of change of the first formant frequency for all four emotions was lower than in the calm state, and was lowest for sadness. In the above, we analyzed and compared speech signals expressing joy, anger, sadness, and surprise in terms of time structure, amplitude structure, fundamental frequency structure, and formant structure.
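For reference, the first-formant frequency discussed above can be estimated from the roots of a linear-prediction (LPC) polynomial, as in the sketch below. The segment boundaries, the LPC order, and the file name are illustrative assumptions, not parameters from this study.

```python
import numpy as np
import librosa

# "surprise_take1.wav" is a hypothetical file name; the 50 ms slice is assumed
# to lie inside a vowel.
y, sr = librosa.load("surprise_take1.wav", sr=None)
segment = y[int(0.20 * sr):int(0.25 * sr)]
segment = segment * np.hamming(len(segment))   # window before LPC analysis

# Rule-of-thumb LPC order of 2 + sr/1000 for formant analysis.
a = librosa.lpc(segment, order=int(2 + sr / 1000))

# Formant candidates are the angles of the LPC roots in the upper half-plane.
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]
freqs = np.sort(np.angle(roots) * sr / (2.0 * np.pi))

# Take the lowest candidate above ~90 Hz as the first formant (F1).
f1 = freqs[freqs > 90][0]
print(f"estimated F1: {f1:.0f} Hz")
```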

Owing to individual differences, the above results vary to some degree across speakers, but the overall trends are consistent. As a subject of future research, we will further analyze the relationship between emotional speech, its spectrum, and formant features, in order to find the best characteristic parameters for emotional speech signal processing.

References

Adelman, James S., Zachary Estes, and Martina Cossu. “Emotional sound symbolism: Languages rapidly signal valence via phonemes.” Cognition 175 (2018): 122-130.

Bänziger, Tanja, and Klaus R. Scherer. “The role of intonation in emotional expressions.” Speech communication 46.3-4 (2005): 252-267.

Ben-David, Boaz M., et al. “Prosody and semantics are separate but not separable channels in the perception of emotional speech: Test for the rating of emotions in speech.” Journal of Speech, Language, and Hearing Research 59.1 (2016): 72-89.

Hamada, Yasuhiro, Reda Elbarougy, and Masato Akagi. “A method for emotional speech synthesis based on the position of the emotional state in Valence-Activation space.” Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014.

Huang, Zhaocheng, and Julien Epps. “An Investigation of Partition-based and Phonetically-aware Acoustic Features for Continuous Emotion Prediction from Speech.” IEEE Transactions on Affective Computing 1 (2018): 1-1.

Lee, Chi-Chun, Daniel Bone, and Shrikanth S. Narayanan. “An analysis of the relationship between signal-derived vocal arousal score and human emotion production and perception.” Sixteenth Annual Conference of the International Speech Communication Association. 2015.