Introduction

In recent years, with the rapid development of human-computer interaction technology, speech emotion recognition has attracted extensive attention from researchers. In particular, a great deal of work has been devoted to speech emotion feature extraction, and it has produced fruitful results (Adelma, p12).

Language is an essential tool for human communication. Human speech carries not only lexical information but also the speaker's emotions and other paralinguistic information. For example, depending on the speaker's feelings, the same sentence can convey different meanings and impressions to the listener (Hamada, p16). The traditional information-science view treats speech as a "neutral" carrier of lexical content: conventional speech processing systems focus only on the accuracy of lexical communication and completely ignore the emotional factors contained in the speech signal. They therefore reflect only one aspect of the information.

The perceptual world, which complements this knowledge world and is equally important, is also an essential part of information processing. Processing the emotional features of speech signals is therefore of great significance for signal processing and artificial intelligence. Extracting emotion characteristics from a speech signal and judging the speaker's emotion is a relatively new research topic; as Hamada (p5) notes, work in this field has only just begun and results are still scarce. At the current level of research, studies are generally limited to finding physical parameters that reflect emotional characteristics by analyzing components of the emotional speech signal such as duration, speech rate, amplitude, fundamental frequency, and frequency spectrum.
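As an illustration, the following minimal Python sketch shows how such physical parameters (duration, overall amplitude, a rough fundamental frequency estimate, and the magnitude spectrum) could be extracted from a recorded utterance. The file name, normalization, and pitch search range are illustrative assumptions, not settings from this study.

    import numpy as np
    from scipy.io import wavfile

    def basic_emotion_features(path):
        # Load a mono recording; wavfile.read returns (sample_rate, samples).
        sr, x = wavfile.read(path)
        x = x.astype(np.float64)
        x /= np.max(np.abs(x)) + 1e-12        # normalize peak amplitude

        duration = len(x) / sr                # utterance duration in seconds
        rms = np.sqrt(np.mean(x ** 2))        # overall amplitude (RMS energy)

        # Crude whole-utterance F0 estimate via autocorrelation, searching a
        # typical speech pitch range of 50-400 Hz.
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]
        lo, hi = int(sr / 400), int(sr / 50)
        f0 = sr / (lo + np.argmax(ac[lo:hi]))

        spectrum = np.abs(np.fft.rfft(x))     # magnitude spectrum

        return {"duration_s": duration, "rms": rms, "f0_hz": f0,
                "spectrum": spectrum}

    features = basic_emotion_features("calm_utterance.wav")  # hypothetical file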

This paper investigates the time structure, amplitude structure, fundamental frequency structure, and formant structure of four kinds of emotional speech signals: joy, anger, surprise, and sadness. The characteristics of the different emotional signals were identified by comparing them with non-emotional calm speech. As a preliminary study of emotion signal processing, it provides theoretical data of practical value for emotional speech signal processing and emotion recognition (Hamada, p8). Selecting appropriate speech data for emotional and psychological analysis is of great significance; however, neither domestic nor international studies (Lee, p14) have yet proposed conditions and standards for such voice data.

In our emotional analysis experiment, we mainly considered two aspects when selecting experimental sentences: first, the sentences chosen must allow a high degree of emotional freedom; second, the same sentence should be usable with a variety of emotions for analysis and comparison. Based on these two principles, the two sentences shown in Figure 1 were selected as speech data for emotional analysis. Five male performers were asked to pronounce each sentence three times in each of five emotions: calm, joy, anger, surprise, and sadness. A total of 300 utterances were collected and analyzed. This paper also introduces a model of a speech emotion recognition system, then summarizes the application of speech emotion features in emotion recognition and discusses the problems faced by emotion feature extraction.

Results & Discussion

Sound consists of waves produced by the vibration of an object. It is a wave phenomenon, transmitted through a medium (air, solid, or liquid), that can be sensed by the auditory organs of humans or animals. The object that initially vibrates is called the sound source. The basic analog form of sound (voice information) is a sound wave, the voice signal. Voice signals can be converted into electrical signals by a microphone and rendered as speech waveforms. Figure 1 presents the results for the sentence "should we catch up?" (Huang, p12); the abscissa represents the mean valence-activation rating, and the ordinate represents the average jitter.

Figure 1. Correlation between mean ratings and average jitter of each sentence
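For concreteness, a minimal sketch of how the average jitter plotted in Figure 1 could be computed is given below, assuming the glottal cycle (pitch period) lengths of an utterance have already been extracted; the period values shown are hypothetical placeholders.

    import numpy as np

    def average_jitter(periods_s):
        # Mean absolute difference between consecutive pitch periods,
        # normalized by the mean period (relative jitter).
        p = np.asarray(periods_s, dtype=float)
        return np.mean(np.abs(np.diff(p))) / np.mean(p)

    periods = [0.0100, 0.0103, 0.0098, 0.0101, 0.0099]  # seconds, illustrative
    print(average_jitter(periods))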

As can be seen from the analysis of the temporal structure of emotional vocalization shown in Figure 1, the durations of happy, angry, and surprised utterances are compressed relative to calm speech, while sad utterances are slightly lengthened. Among the compressed emotions, anger was the shortest, followed by surprise, then joy. Regarding the relationship between speed and emotion, happy, angry, and surprised utterances were faster than calm speech, while sad utterances were slower.

Further observation shows that these phenomena arise because, compared with calm speech, some phonemes in emotional speech are pronounced ambiguously, lengthened, or omitted. Based on the above analysis, the temporal structure of emotional speech can be used to distinguish sadness from the other emotional signals, as sketched below. It is also possible to differentiate among the emotional signals of joy, anger, and surprise by setting suitable duration thresholds, though in the case of anger and surprise signals the temporal structure alone is clearly insufficient to produce reliable results.
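The duration-threshold idea described above can be sketched as a simple rule that compares an utterance's duration with a calm reference; the threshold ratios below are illustrative assumptions, not values reported in this study.

    def coarse_emotion_from_duration(duration_s, calm_duration_s,
                                     slow_ratio=1.1, fast_ratio=0.9):
        # Sadness lengthens the utterance; joy/anger/surprise compress it.
        ratio = duration_s / calm_duration_s
        if ratio >= slow_ratio:
            return "sadness"
        if ratio <= fast_ratio:
            return "joy/anger/surprise"   # needs further cues to separate
        return "calm"

    print(coarse_emotion_from_duration(2.6, 2.2))  # ratio 1.18 -> "sadness"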

Figure 2.

The results showed that the amplitude range of happy, angry, and surprised speech is wider than that of calm speech, while, on the contrary, the range of sad speech is narrower than calm. Hearing experiments also indicated that the larger the average amplitude of happy, angry, and surprised signals, and the smaller the average amplitude of sad signals, the more pronounced the emotional effect. Using amplitude characteristics, we can therefore clearly distinguish sadness from joy, anger, and surprise, and amplitude characteristics can also help identify the joy, anger, and surprise signals themselves.
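A minimal sketch of the amplitude measures used in this comparison (peak-to-peak range and mean absolute amplitude) is given below; the synthetic signals are illustrative stand-ins for calm and angry utterances, not data from this study.

    import numpy as np

    def amplitude_measures(x):
        x = np.asarray(x, dtype=float)
        return {
            "range": float(np.max(x) - np.min(x)),   # peak-to-peak amplitude
            "mean_abs": float(np.mean(np.abs(x))),   # average amplitude
        }

    t = np.linspace(0, 1, 8000)
    calm = 0.3 * np.sin(2 * np.pi * 150 * t)      # lower-amplitude stand-in
    angry = 0.9 * np.sin(2 * np.pi * 150 * t)     # higher-amplitude stand-in
    print(amplitude_measures(calm), amplitude_measures(angry))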

Analysis of the fundamental frequency structure shows that the fundamental frequency is also one of the critical characteristics reflecting emotional information (Ben-David, p4). In order to analyze the fundamental frequency structure of the emotional speech signal, we first obtain a smoothed fundamental frequency trajectory curve of the emotional speech signal (Bänziger, p6), then analyze the signal with different
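A hedged sketch of the first step just described, estimating a frame-by-frame fundamental frequency trajectory and median-smoothing it, is shown below; the frame length, hop size, and pitch search range are illustrative assumptions rather than parameters from this study.

    import numpy as np
    from scipy.signal import medfilt

    def f0_trajectory(x, sr, frame_s=0.03, hop_s=0.01, fmin=50, fmax=400):
        frame, hop = int(frame_s * sr), int(hop_s * sr)
        lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for 50-400 Hz
        f0 = []
        for start in range(0, len(x) - frame, hop):
            w = x[start:start + frame]
            # Autocorrelation pitch estimate for this frame.
            ac = np.correlate(w, w, mode="full")[frame - 1:]
            f0.append(sr / (lo + np.argmax(ac[lo:hi])))
        return medfilt(np.array(f0), kernel_size=5)   # smooth the raw track

    sr = 16000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 200 * t)      # synthetic 200 Hz "voice"
    print(f0_trajectory(x, sr)[:5])      # approximately 200 Hz per frame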