1 Introduction

Digital data is constantly growing and becoming more accessible, making it challenging for software tools and technologies to display, store, manage, and analyze it [1]. On real-world social media platforms such as Twitter and Facebook, polyglot users are much more likely to post code-switched content that combines two different natural languages [2]. In the data mining (DM) and natural language processing (NLP) sectors, these code-switched texts have generated a plethora of new research areas, including speech recognition, information extraction, language modeling, and lexicon analysis, to mention a few [3]. Emotion identification or sentiment analysis for code-switched texts, which aims to find emotions or sentiments in a piece of mixed-language text, is one of the most popular research topics [4].

In the past ten years, numerous neural network models have been investigated with the aim of code-switched emotion detection. The current approaches are primarily concerned with building robust neural models with intricate features or architectures [5]. To enhance code-switched detection models, CNNs and LSTMs with attention mechanisms are applied. These techniques extract characteristics directly from the code-switched text itself, which may convey different emotions in either one language or both [6]. The goal of speech emotion recognition (SER) is to identify emotion in speech, regardless of the semantic content. Figure 1 shows the block representation of the SER system. The speech is first given to the ML-based training branch, where it is pre-processed and then subjected to feature extraction to specify the features. In parallel, another sample undergoes feature extraction based on Mel frequency cepstral coefficients (MFCC) and Mel energy spectrum dynamic coefficients (MEDC) and is passed to the classifier. The classifier compares the two sets of extracted features and then sends its decision to the emotion recognition stage, which finally detects the emotions in the given speech.

Fig. 1 Block diagram of SER
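
As a minimal illustration of this pipeline, the sketch below extracts MFCC features with librosa and feeds fixed-length utterance vectors to an SVM classifier. It is a generic example under stated assumptions rather than the exact system of Fig. 1: the random waveforms and the two emotion labels are placeholders for a real labeled corpus, and MEDC features are omitted for brevity.

```python
# Minimal sketch of the Fig. 1 pipeline: pre-processing -> feature
# extraction (MFCC) -> classifier.  The signals and labels are toy
# placeholders; any labeled emotional-speech corpus could be used instead.
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(y, sr):
    """Return a fixed-length feature vector (mean MFCCs) for one utterance."""
    y = librosa.effects.preemphasis(y)                  # simple pre-processing
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 x n_frames
    return mfcc.mean(axis=1)                            # frame-level -> utterance-level

sr = 16000
X = np.array([extract_features(np.random.randn(sr), sr) for _ in range(20)])
labels = np.array([0, 1] * 10)          # two hypothetical emotion labels

clf = SVC(kernel="rbf").fit(X, labels)  # classifier block of Fig. 1
print(clf.predict(X[:2]))               # emotion decision for two samples
```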

Languages usually express emotions in different ways, which limits the progress of these techniques. A successful model should therefore be able to mine both monolingual and bilingual information more efficiently and effectively. A bilingual-view parallel translation translates code-mixed texts into both languages, which minimizes information loss and preserves the original contexts in each language as far as possible [7, 8]. An adversarial dual-channel encoder then handles feature extraction from the parallel translated texts in the two languages [9], and attention-based bidirectional LSTMs can serve as the shared encoder under adversarial learning to dynamically and selectively exploit both the monolingual private features and the bilingual shared features of code-mixed texts [5, 10].
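
The sketch below shows what such an attention-based bidirectional LSTM encoder can look like in PyTorch. It is a generic illustration of the shared-encoder idea, not the architecture of the cited works; the vocabulary size, hidden size, and number of emotion classes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class AttnBiLSTMEncoder(nn.Module):
    """Bidirectional LSTM with additive attention pooling over time steps."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_classes=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # scores each time step
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))   # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        pooled = (w * h).sum(dim=1)             # weighted sum of hidden states
        return self.out(pooled)                 # emotion logits

logits = AttnBiLSTMEncoder(vocab_size=5000)(torch.randint(0, 5000, (4, 20)))
print(logits.shape)                             # torch.Size([4, 6])
```

In the adversarial setting described above, one such encoder would be shared across both language views while language-specific (private) encoders keep the monolingual features separate; that training loop is not shown here.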

In recent years, the connection between human and machine communication has grown in significance. Many studies were conducted in the 1950s to teach machines to recognize human voices [11, 12]. To enhance the effectiveness of human-to-machine communication, the statistics of the human voice must be recognized [13]. A range of circumstances, such as educational and therapeutic settings, as well as entertainment and the arts, can benefit from the use of emotional content in speech [14]. Speech signals may now be used to communicate between humans and machines thanks to several technological developments [15]. Speech recognition and speech-to-text (STT) technology have made mobile phones an increasingly popular means of communication [16]. Speech recognition is one of the fastest-growing fields of signal recognition research. SER is a new area of study that has the potential to advance a variety of industries, including automatic translation systems and human–machine interfaces [17]. This study therefore concentrates on a variety of speech feature extraction methods, emotion databases, and classification strategies. Figure 2 shows the application of CNN methods in SER.

Fig. 2 Application of CNN methodology in SER [158]

Speech contains signals that reveal the speaker, their language, and their emotions in addition to the message. The majority of the speech processing algorithms currently in use work admirably with neutral studio recordings, but they struggle with emotional speech [18]. This is because it is challenging to represent and define the emotions that are expressed through spoken language [19]. Communication becomes more natural when it contains emotional content. By using the appropriate semantics, emotions can be used to describe the same idea in a different way [20]. The project's primary goal is to analyze and examine the potential use of ML and DL for emotion-based hazardous speech identification [21]. Three ML methods, Naive Bayes, Support Vector Machine, and K-Nearest Neighbor, are used to identify the emotion label. The Naive Bayes model demonstrates strong performance, with an accuracy of 82.3% and an F1-score of 0.89 in emotion detection. This achievement is particularly noteworthy in the context of analyzing emotions and identifying them within Tamil song lyrics.
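
A hedged sketch of this three-classifier comparison is given below using scikit-learn. The toy English sentences stand in for the Tamil song-lyric corpus, which is not reproduced here, so the scores will not match those of the original study.

```python
# Compare Naive Bayes, SVM, and KNN on TF-IDF features of a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["so happy today", "this is sad news",
         "what a joyful song", "feeling very low"] * 5   # toy lyrics
labels = ["joy", "sadness", "joy", "sadness"] * 5        # toy emotion labels

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("KNN", KNeighborsClassifier(n_neighbors=3))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)
    f1 = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro").mean()
    print(f"{name}: macro-F1 = {f1:.2f}")
```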

This review was formulated by following the steps of the PRISMA methodology, and it is organized into different sections. Section 2 explains the overview and research gap. Section 3 discusses traditional emotional extraction in speech. Section 4 describes the PRISMA-based methodology used to select the papers. Section 5 reviews the previous works of recent years. Section 6 reports the outcomes of applying the PRISMA approach, together with the responses to the research questions and a discussion of the acquired results. In Section 7, conclusions and closing remarks are offered.

2 Overview and research gap

In past years a lot of research has been conducted on SER. The major focus is on extracting emotions such as anger, happiness, sadness, fear, surprise, and disgust from speech, but more than 300 emotions can appear in speech [22]. It is very difficult to identify all emotions from speech and decide whether a given utterance is harmful to society. DL techniques are a subset of ML widely used for extracting information from voice, especially through data models designed for pattern recognition and decision making. One of the frequently used approaches is multi-model learning with deeper architectures such as RNNs, Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), CNNs, and autoencoders [23]. In recent years, emotional extraction in speech has gained much attention, especially after the rise of social media, and the number of peer-reviewed publications has increased significantly. Science Direct shows 5,006 results on the topic, of which 649 are review articles, 3,475 are research articles, 67 are encyclopedias, 433 are book chapters, 72 are conference abstracts, 4 are book reviews, and the rest are other types of articles. MDPI shows 62 search results on this topic. Springer shows 19,614 search results, of which 6,499 are chapters, 3,956 are articles, 3,788 are books, 5,040 are conference papers and proceedings, and the rest are other documents. Finally, IEEE Xplore shows 1,964 search results, comprising 1,689 conference papers, 246 journal articles, and other documents. Figure 3 shows the year-wise development of research work from 1999 to 2022. The graph shows a sharp increase in research output in the past three years, suggesting that the topic is gaining considerable attention, although much research work still needs to be done in this field. When real-time data is used, the system gets complicated and emotion recognition is challenging.

Fig. 3 Advancement of the SER research work in recent years

Feature extraction and selection are also a major focus of current research, which aims to increase performance accuracy by selecting the best characteristics. According to data analysis, classifier selection is a difficult step in improving system performance and recognizing the appropriate emotions [24]. Although many classifiers have been applied to the speech emotion identification task, no clear winner has emerged. SER remains a complex problem. In future sentiment analysis work, emotions may also be recognized from facial expressions in addition to speech [25]. Certain difficulties have been noted for future studies of emotion recognition in general and of hazardous speech detection in particular.

In the past ten years, contributions to SER systems have focused on novel approaches based on statistical features, working in unimodal, gender-independent, speaker-independent, and real-time settings [26]. Feature-learning approaches extract statistical features from speech data, for example in terms of the standard deviation [27]. A real-time challenge in human–computer interaction is to capture the human voice and reply with human-like accuracy; one design addressing this problem used an automatic speech recognition system that can distinguish different emotional classes and select the major features extracted from the speech signals [28]. To improve SER accuracy, a novel feature-learning method based on adaptive time–frequency coefficients was evaluated on the Persian Drama Radio Emotional Corpus (PDREC), the Surrey Audio-Visual Expressed Emotion database (SAVEE), and the Berlin Emotional Speech Database (EMO-DB) [29]. The experimental results show that the adaptive time–frequency methodology based on the FFT with cepstral features works effectively, reaching 80% accuracy on SAVEE, 97.57% on EMO-DB, and 91.46% on PDREC [30]. SER is a complicated problem because the natural emotional features must be extracted from real audio data, and it plays a significant role in human–computer interaction [31]. One study's main goal was to improve classification accuracy and extract eight emotions from human speech; emotion prediction from speech used MFF-Aug, which augments the data through white noise injection, pitch tuning, and noise removal [32]. The feature extraction techniques MFCC, zero-crossing rate, and root mean square energy were applied to the pre-processed speech signals [33]. A CNN approach was used to analyze emotional voice classification and speech representation and was then compared with an LSTM method; the TESS, CREMA, RAVDESS, and SAVEE datasets were used to evaluate the experimental methodology, with accuracies of 92.6%, 89.94%, 84.9%, and 99.6%, respectively [34]. SER has a broad range of smart applications in medical science, human–robot interaction, and online gaming [35]. Smart SER applications face two main problems, computational cost and processing time; to address this issue, preprocessing steps were applied to six databases, i.e., EmoDB, RAVDESS, IEMOCAP, ShEMO, DEMoS, and MSP-Improv, selecting speech segments with similar format characteristics [36].
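
The augmentation steps mentioned above (white-noise injection and pitch tuning) can be sketched as follows; the SNR and semitone values are illustrative choices, not parameters taken from the MFF-Aug work, and a synthetic tone stands in for real recordings.

```python
import numpy as np
import librosa

def add_white_noise(y, snr_db=20):
    """Inject white noise at a target signal-to-noise ratio (in dB)."""
    rms_signal = np.sqrt(np.mean(y ** 2))
    rms_noise = rms_signal / (10 ** (snr_db / 20))
    return y + np.random.randn(len(y)) * rms_noise

def pitch_tune(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)     # 1 s synthetic tone
augmented = [add_white_noise(y), pitch_tune(y, sr)]  # augmented training copies
```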

3 Traditional emotional extraction in speech

Signal pre-processing, feature extraction, and classification are the three core elements of emotion identification systems based on digitized speech [37]. To establish meaningful units of the signal, acoustic pre-processing techniques like denoising and segmentation are used. To find the pertinent features present in the signal, feature extraction is used. Speech signal processing, feature extraction, and classification are all covered in-depth in this section [38]. Due to their importance to the subject, the distinctions between spontaneous and performed speech are also examined. Speech enhancement is carried out in the initial step of speech-based signal processing, where the noisy components are eliminated [39]. Feature extraction and feature selection make up the second stage. The pre-processed speech signal is used to extract the necessary features, and the extracted features are then used to make the selection. The study of speech signals in the temporal and frequency domains is typically the foundation for such feature extraction and selection [40]. In the third stage, different classifiers, including Gaussian Mixture Model and Hidden Markov Model, are used to categorize these features. Last but not least, several emotions are identified based on feature classification [41].
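
A minimal sketch of the GMM-based classification stage is shown below: one Gaussian mixture is trained per emotion, and a test utterance is assigned to the emotion whose model yields the highest log-likelihood. The random feature matrices are stand-ins for real frame-level features such as MFCCs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {"anger":   rng.normal(0, 1, (200, 13)),   # frame-level feature vectors
         "sadness": rng.normal(2, 1, (200, 13))}   # one matrix per emotion

# Fit one GMM per emotion on that emotion's training frames.
models = {emo: GaussianMixture(n_components=4, random_state=0).fit(X)
          for emo, X in train.items()}

test_frames = rng.normal(2, 1, (50, 13))            # frames of one test utterance
scores = {emo: m.score(test_frames) for emo, m in models.items()}
print(max(scores, key=scores.get))                  # -> "sadness"
```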

3.1 Improving speech input data for speech emotion recognition

During the data collection phase, noise frequently taints the input data used for emotion recognition. These flaws make the feature extraction and classification less precise. This means that for emotion detection and recognition algorithms to function properly, the input data must be improved. The speaker and recording variance is removed during this pre-processing stage while the emotional discrimination is retained [42].

3.2 Extraction and selection of features in speech emotion recognition

Following enhancement, the speech stream is characterized in terms of segments. Based on the information gathered, pertinent features are extracted and divided into several groups. One type is short-term classification, based on properties that last only a short while, such as energy, formants, and pitch. The other is long-term classification; two often-employed long-term features are the mean and the standard deviation [43]. The intensity, pitch, pace, and variance of uttered words are among the prosodic qualities that are typically significant in identifying different types of emotions from the input speech signal [44].
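
The distinction can be illustrated as follows: frame-level energy, zero-crossing rate, and pitch are computed first (short-term features), and their mean and standard deviation over the whole utterance are then taken as long-term features. The synthetic tone is only a stand-in for a real utterance.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 150 * np.arange(2 * sr) / sr)   # 2 s stand-in utterance

energy = librosa.feature.rms(y=y)[0]                    # short-term energy per frame
zcr = librosa.feature.zero_crossing_rate(y)[0]          # short-term ZCR per frame
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)           # frame-level pitch estimate

long_term = {"energy_mean": energy.mean(), "energy_std": energy.std(),
             "zcr_mean": zcr.mean(),
             "f0_mean": float(np.nanmean(f0)), "f0_std": float(np.nanstd(f0))}
print(long_term)                                        # utterance-level statistics
```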

3.3 Acoustical measures in speech emotion recognition

The information available for each emotion is encoded in the speech signal. Among the most studied subjects in this area are vocal parameters and how they relate to emotion identification. Many factors are typically taken into account, including spoken word characteristics such as intensity, pitch, pace, and voice quality [45]. A common assumption in the simple view of emotion is that emotions are separate categories with independent existence. In many cases, intensity and pitch are related to activation in such a way that the intensity value rises with a high pitch and falls with a low pitch [45]. Whether the speaker is acting, whether there are many different speakers, and the person's mood or personality all have an impact on how acoustic factors map to emotion. Emotions in HCI are typically not the conventional discrete emotions; rather, they are frequently weakly expressed, jumbled, and difficult to distinguish from one another [46]. Based on the feelings exhibited by a person, emotional remarks in the literature are classified as either positive or negative. Some research suggests that actors exaggerate their emotional expressions, because acted emotions judged by listeners are significantly stronger and more accurately recognized than real emotions. According to such studies, regions within a dimensional space can describe the basic emotions: while valence depicts the positivity or negativity of an emotion, arousal shows the intensity from serenity to excitement [47].

3.4 Classification of speech emotion recognition features

Different classifiers have been researched in the literature to create systems like SER, speech recognition, and speaker verification [48]. On the other hand, the reasons for selecting a specific classifier for a given speech task are frequently left out of most applications. Typically, classifiers are chosen based on an empirical evaluation of some indicators or a rule of thumb, as was previously discussed. Ordinarily, the pattern recognition classifiers used for SER can be broadly divided into two primary types: linear classifiers and non-linear classifiers [49]. Linear classifiers typically conduct classification based on a linear combination of object attributes. Most of the time, these objects are represented as an array known as a feature vector [50]. Non-linear classifiers, on the other hand, characterize objects through a non-linear weighted combination of their attributes.
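
As a small example of the two families, the sketch below fits a linear SVM and a kernelized (non-linear) SVM to the same feature vectors; the synthetic data is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

linear = LinearSVC(max_iter=5000).fit(X, y)   # linear combination of the features
nonlinear = SVC(kernel="rbf").fit(X, y)       # non-linear (kernelized) decision surface
print(linear.score(X, y), nonlinear.score(X, y))
```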

3.5 Databases for recognition of speech emotion

Many academics use speech emotional databases for several research projects. The most crucial aspects of evaluation for emotion recognition are the quality of the databases used and the performance attained. Depending on the reason for developing speech systems, different techniques and goals are used to collect voice databases [51]. Three basic categories of speech databases are used to construct emotional speech systems. In the first, acted databases, the speech data is captured by skilled and seasoned actors. This category is regarded as the one that makes it easiest to collect a speech-based dataset of different emotions [52]. It is estimated that this method is used to compile over 60% of speech datasets. The second is a different kind of database in which the emotional set is gathered by fabricating an artificial emotional circumstance. This is done without the artist or speaker's awareness. This type of database is more lifelike than actor-based databases [53].

Since the speaker should be aware that they have been recorded for research purposes, an ethical question might arise. Although these natural databases are the most realistic, they are tough to collect owing to the difficulties in recognition. Conversations from contact centers, the general public, and other situations are typically recorded for natural emotional speech databases [54]. When research on speech-based emotion identification began to take off in the early 1990s, researchers frequently started with acted databases before switching to realistic databases [55]. The most often utilized acted databases are the Berlin emotional speech database, which contains the recorded voices of 10 actors, and the Danish emotional speech database, in which four test subjects were asked to speak a variety of words in five distinct emotional states [56]. The German Aibo emotion data and SmartKom data, where the speakers' voices are captured in a lab, are also included. Additionally, real-world call center interactions captured during live recordings have been utilized [57]. According to the literature, there are significant differences across the databases in terms of the number of performers, the number of emotions recognized, and the methodology [58]. Speech-emotional databases are used both in psychological investigations to understand the patient's behavior and in circumstances where it is desirable to automate emotion recognition [59].

4 Methodology

This literature review uses the PRISMA methodology. We screened 46,131 articles from the WoS and Scopus websites. To find relevant articles, we used an advanced filtering system on different scientific websites, with logical operators to retrieve the most relevant documents. Figure 4 shows the flowchart of the PRISMA methodology used to select the papers. Table 1 shows the methodology used for shortlisting the relevant papers for this review.

Fig. 4 Flowchart of the used PRISMA methodology

Table 1 Keyword filtering system used for shortlisting the papers

5 Literature review

As a developing field of ML research, DL has attracted more attention in recent years [60]. DL techniques for SER have several advantages over conventional techniques, including the ability to recognize composite systems and structures without the need for manual tuning and hand-crafted feature extraction, the capacity to work with unlabeled data [61], and a propensity to extract low-level features from raw data. For a machine, identifying human emotions is a challenging task, but for us it is straightforward [62]. To improve communication between humans and robots, an emotion recognition system makes use of knowledge about emotions [63]. The fundamental frequency, Linear Prediction Cepstral Coefficients, and Mel Frequency Cepstral Coefficients are a few of the speech features that have been researched [64]. Emotional information recognition can be either speaker-dependent or speaker-independent. Classifiers include K-Nearest Neighbors (KNN), SVM, CNN, and others [65]. Various strategies for identifying emotional states in speech, drawn from selected papers published between 2005 and 2018, have helped to create a model that can recognize and classify six different emotions using a Deep Neural Network (DNN) [66]. That research reports the accuracy averaged over the two databases used in the study; both sets of data are utilized to extract features using MFCC, and SVM and Gaussian mixture models use the retrieved features to categorize the speaker's age [67]. The emotion-oriented approach utilized in the study, which focuses on transformation, is then used on the training data. Real-time CNN models have been suggested as a way to recognize emotions [68]. This model includes subcategories for anger, joy, and depression, and its accuracy is 66.1% on average. Only a few different emotions were identified, so it is difficult for the model to predict any other feelings [69]. Text categorization was employed to examine the content of speech; text mining is the main method for translating emotions from audio to text. Huang and his coworkers created a novel method for identifying emotions in which four kinds of emotions were identified using a nonlinear SVM classifier [70]. The flaw in this approach was that the Deep Belief Network model took too long to extract features compared to other feature extraction techniques [71].

This work explored research published from 2010 through 2022. In the inquiry, all ML techniques were combined with the query "hate speech identification", and the procedure was categorized using an ML approach. The state-of-the-art review methodology was adapted from [77,78,79,80,81,82,83,84,85,86,87,88,89]. Tables 2 and 3 provide a quick summary of some current research in this area.

Table 2 Review of DL in SER
Table 3 Recent ML methods and their applications in hate speech detection

A 1D CNN performs better in classification tasks than traditional machine learning algorithms. SER technology categorizes emotions by learning low-level or spectral information. A CNN-based method for identifying emotions employs a feature space composed of low-level data such as pitch and energy as well as spectral information such as the log-Mel spectrogram and the STFT. The spectral flux, which assesses the spectral change between two frames, is calculated by squaring the difference between the normalized magnitudes of the spectra of two successive short-term windows [90]:

$$Fl_{\left(i,i-1\right)}=\sum_{k=1}^{Wf_L}{\left(EN_{i}\left(k\right)-EN_{i-1}\left(k\right)\right)}^{2}$$

where EN_i(k) is the kth normalized discrete Fourier transform coefficient of frame i and Wf_L is the number of frequency bins in the short-term window (Table 4).
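
This definition can be implemented directly as below, assuming each frame's magnitude spectrum is normalized to unit sum; the frame length and the two test frames are arbitrary.

```python
import numpy as np

def spectral_flux(frame_prev, frame_cur, n_fft=512):
    """Sum of squared differences of the normalized magnitude spectra of two frames."""
    def norm_spec(frame):
        mag = np.abs(np.fft.rfft(frame, n_fft))
        return mag / (mag.sum() + 1e-12)        # EN_i(k): normalized DFT magnitudes
    en_prev, en_cur = norm_spec(frame_prev), norm_spec(frame_cur)
    return np.sum((en_cur - en_prev) ** 2)

sr, frame_len = 16000, 512
t = np.arange(frame_len) / sr
print(spectral_flux(np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 400 * t)))
```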

Table 4 Notable preliminary research using ML

Subjectivity analysis can separate objective sentences from subjective statements and remove the latter from the corpus. A rule-based classifier for identifying hate speech was constructed using a vocabulary built on semantic, anti-, and theme-based components. Two machine learning methods that can be used directly to raise precision and recall scores are SVM and maximum entropy [99]. MFCC and Gaussian Mixture Models are widely combined to identify or forecast the presence or severity of depression in speakers [100]. The context of the discussions, which may include the speakers' emotions or mood, is inferred by the BigEAR architecture using a psychological audio processing chain (PAPC). The advantage of the BigEAR framework is that psychologists are no longer required to evaluate the expanding body of acoustic big data by carefully listening to each audio recording and classifying the emotions [101]. Bi-grams, SentiWordNet, and stop word removal have all been demonstrated to improve accuracy for Twitter feature selection [102]. The most popular machine learning algorithms for sentiment analysis, emotion analysis, and hate speech identification on social media platforms are shown in a block diagram in Fig. 5, which demonstrates that LSTM and SVM algorithms are frequently used to produce the most accurate outcomes (Table 5).
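
As a hedged sketch of this kind of text pipeline, the example below combines stop-word removal and uni/bi-gram TF-IDF features with an SVM and a maximum-entropy (logistic regression) classifier; the toy tweets and labels are invented and do not come from any of the cited datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["I really like this community", "you people are awful",
          "have a great day everyone", "get out, you are worthless"] * 5
labels = [0, 1, 0, 1] * 5     # 0 = benign, 1 = hateful (toy labels)

for clf in (LinearSVC(), LogisticRegression(max_iter=1000)):
    pipe = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),  # bi-grams + stop-word removal
        clf).fit(tweets, labels)
    print(type(clf).__name__, pipe.score(tweets, labels))
```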

Fig. 5 Accuracy of most suitable ML algorithms application in hate speech detection

Table 5 Advanced ML application for hate speech detection

6 Results and discussion

6.1 Methods for identifying emotions in speech

The fundamental speech-based emotion recognition (ER) system is made up of the processes depicted in Fig. 6.

Fig. 6 Emotion recognition system [157]

The speech samples are utilized as input in the initial stage. If a standard database is not used, the samples are preprocessed to eliminate noise using a variety of noise-reduction tools, such as Audacity, WavePad, Sony Creative software, and Praat [110]. A classification is then applied to the samples to obtain the final result. The main reason for doing this is to obtain signals with high-frequency characteristics.

6.2 Standard SER systems

Traditional SER Systems follow the steps of speech normalization, feature extraction, feature selection, and classification, as indicated in Fig. 7 which illustrates the fundamental process for identifying emotions in incoming speech. Following the separation of the noise components, feature extraction and selection are carried out in the process of normalizing speech [111]. The first step in the analysis of speech signals for emotion detection is the extraction and counting of speech features. The majority of the time, a time- and frequency-domain analysis of the spoken data produces the speech features [112]. The creation of a database of speech features generated from input voice signals follows. The classifiers can identify emotions in the final stage. In order to recognize emotions, classifiers use a variety of pattern-matching algorithms [113].

Fig. 7 Standard SER

6.3 Speech normalization

The emotional data that is recorded is usually degraded by outside noise (such as the "hiss" of the recording device), which leads to inaccurate feature extraction and categorization. Speech normalization is therefore an important step in the identification of emotions [114]. This pre-processing stage removes speaker and recording variation while preserving the emotional distinctions. The two most popular methods of normalization are energy normalization and pitch normalization [115].
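
A minimal sketch of the two normalization schemes is given below; the target RMS level is an arbitrary choice, and the pitch contour is assumed to have been produced already by some pitch tracker.

```python
import numpy as np

def energy_normalize(y, target_rms=0.1):
    """Scale the waveform so its RMS energy matches a fixed target level."""
    rms = np.sqrt(np.mean(y ** 2)) + 1e-12
    return y * (target_rms / rms)

def pitch_normalize(f0_track):
    """Z-score a frame-level pitch contour, removing the speaker's baseline."""
    f0 = np.asarray(f0_track, dtype=float)
    return (f0 - np.nanmean(f0)) / (np.nanstd(f0) + 1e-12)

y = np.random.randn(16000) * 0.5
print(np.sqrt(np.mean(energy_normalize(y) ** 2)))   # ~0.1 after normalization
```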

6.4 Selection and extraction of emotional speech features

After being normalized, the emotional speech signal is divided into segments and then decomposed into meaningful units. These components often express the speaker's emotional state through speech signals [116]. The following step is the feature set extraction as shown in Table 6. These emotional speech characteristics can be categorized in a variety of ways.

Table 6 Characteristics of emotions

Features can be separated into two categories: short-term and long-term. Examples of short-term features are formants, pitch, and energy, because they last only for a short period. Long-term characteristics are statistical measures computed over a digital audio stream; the mean and standard deviation are two of the most often applied long-term measures. If more features are employed, the categorization process will be more accurate [117].

6.5 Training data

Numerous databases have been created by the voice-processing community [118]. The databases contain training and test data sets; an English example is the Emotional Prosody Speech and Transcripts database from 2002. Three different types of databases are used in SER, and the review in Table 7 shows examples of research works that use such training datasets to extract emotions from speech, together with the type of database used and the kinds of emotions that could be identified from the available sources.

Table 7 Types of databases with emotions

Type 1 is professionally acted or simulated emotional speech with manually assigned labels: actors are requested to speak with a specific emotion, as in DES or EMO-DB [122]. Type 2 is realistic, human-like expressive speech; natural speech is simply unplanned speech that conveys an individual's actual feelings, and these databases are based on actual applications from the real world, such as contact centers. Type 3 is elicited (prompted) emotional speech, where the speaker's self-reporting is employed instead of external labeling; such elicited speech is neither fully acted nor completely neutral [119].

6.6 Classifiers for emotion recognition

Only a few systems have had classifiers explored in the literature: SER, voice recognition, and speaker recognition [120]. In contrast, most implementations hardly ever explain why a particular classifier was chosen for a specific speech task. Selecting classifiers often involves either a general rule or empirical analyses of some indicators.

6.7 MFCC

The MFCCs represent some aspects of human speech perception and production. For instance, MFCC reflects the logarithmic loudness and pitch perception of the human auditory system [121]. The MFCC cepstral coefficients are produced using a warped frequency scale based on human auditory perception. For MFCC computation, the voice signal is first divided into frames by windowing [122]. Because their amplitude is smaller than that of low-frequency formants, high-frequency formants are emphasized, which ensures that the amplitudes of the formants become comparable. After windowing, the Fast Fourier Transform is used to obtain the power spectrum of each frame [123]. Mel-scale filter banks are then applied to the power spectrum, the result is transformed into the logarithmic domain, and finally the discrete cosine transform is applied to derive the MFCC coefficients [124].
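
These stages can be reproduced step by step as in the sketch below (pre-emphasis, framing and windowing, FFT power spectrum, mel filter bank, logarithm, DCT). The frame, hop, and filter-bank sizes are common defaults, not values prescribed by the cited works.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_scratch(y, sr, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                   # pre-emphasis
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop).T
    frames = frames * np.hamming(n_fft)                          # windowing
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft    # power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                   # log mel energies
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # DCT -> MFCCs

sr = 16000
print(mfcc_from_scratch(np.random.randn(sr), sr).shape)          # (n_frames, 13)
```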

6.8 Feature extraction and feature set classification

A crucial step in emotion identification is selecting and extracting relevant characteristics, since they determine the overall performance of the system. The features can be classified into two primary groups: spectral features and prosodic features. The features are selected using a variety of techniques before being processed. Using LPA [125], PLPCS, PLP, FT, RASTA, MFCC [126], and FFT to extract emotions such as boredom, anxiety, surprise, neutral, grief, and happiness, an average recognition rate of 87.5% is obtained on the CASIA and EMO-DB datasets [127].

6.9 The ML and DL methods for researching emotions

Speech emotion recognition is a field of study that seeks to infer the speaker's emotional state from speech data. Progress in emotion identification, according to various surveys, would simplify many systems and consequently raise the standard of living [123]. SER's applications are discussed in more detail in the section that follows. For instance, it is not possible to reliably deduce an emotion from the surroundings, culture, a person's facial expression, or a speech corpus alone. One of the last significant challenges that an operational real-world system must overcome is dealing with bilingual inputs [128].

Due to the ubiquity of mixed-language speech in everyday situations, cross-language recognition demands greater performance. A survey was carried out to better understand speech emotion identification. The method of feature extraction is used to identify the most crucial components of a signal, and in the final stage the extracted feature vectors are mapped to the appropriate emotions using classifiers. In-depth discussions of feature extraction, classification, and speech signal processing can be found in [129, 133]. The differences between spontaneous and performed speech are also examined because they are important to the topic [130]. A noisy component is removed in the first stage of speech-based signal processing. The second stage is divided into two components: feature extraction and feature selection. The desired features are extracted from the preprocessed voice input and used to make a selection [94]. Speech is recorded via a microphone and utilized by the system; the computer's sound card is then used to build a digital representation of the received sounds. During feature extraction and selection, the returned speech features are selected based on their emotional relevance among the more than 300 different emotional states [131]. Figure 8 shows the increasing development of SER topics in the scientific literature, which makes it one of the most active research areas.

Fig. 8 Recent development of SER using ML and DL techniques

The main objective of speech emotion recognition systems is classification. It is challenging to classify emotions because the average set of emotions includes more than 300 different emotional states. Since some of the most frequent human emotions are fear, surprise, anger, joy, contempt, and sadness, it is on these that the naturalness of a speech emotion recognition system is usually evaluated [132]. ML algorithms can be used to recognize emotions in speech; this has been done using a variety of methods, including RF, SVM [133], GMMs, HMMs, CNNs, KNN, and MLP, all of which have routinely been used to identify emotions in the past.

6.10 Emotions and database type

The two basic methods for categorizing emotions are the dimensional approach and the use of categories. The category method breaks emotions down into more manageable classes; the main emotions are anger, joy, happiness, sadness, fear, surprise, and disgust [134]. The second, dimensional approach is axis-based and has several dimensions [135]. Tao recognized 89.6% of the emotions in the CASIA Chinese emotion corpus using a decision tree. In the work of [136], a GMM was employed as a classifier to categorize emotions based on MFCCs. Emotions in the Berlin emotional dataset were identified using a three-stage classical SVM. To categorize emotions in the Marathi voice dataset, MFCC features were extracted from the Berlin EmoDB database [137]. The KNN algorithm was applied to determine the emotional content of a person's speech, performing correctly 90 to 99.5 percent of the time on the Berlin emotional speech database. Hossain and Shamim presented cooperative media systems in 2014 that make use of MFCCs and standard characteristics such as emotions from voice signals. To identify emotions in speech, Alonso et al. exploited paralinguistic and prosodic characteristics; they used an SVM, a radial basis function neural network, and an auto-associative neural network after integrating two characteristics from a music library, the residual phase and MFCCs [138]. Researchers used a database of scholarly publications from China to investigate SVMs and DBNs; DBNs had an accuracy of 94.5%, whereas SVMs were about 85% accurate. High-order statistical traits and characteristics based on particle swarm optimization were used in this work. Following the extraction of spectral information from voice recordings, [139] categorized speech emotions using an HMM and an SVM. Performance analysis for different languages uses a variety of ML techniques, and the comparison shows that different ML methods have been used to identify speech emotions for several languages; on this basis, the best-case accuracy has been determined. Although the emotions vary, the research selects the most accurate example utilizing a variety of feature extraction techniques and ML techniques [140].

In a range of research projects, several academics use emotional speech databases. The performance and quality of the databases employed are the most important factors in assessing emotion recognition systems [141]. Depending on why speech systems are being developed, different data collection techniques and objectives may be used. Table 8 provides a summary of several publicly accessible datasets of emotional speech [142]. The creation of emotional speech systems uses three different types of speech databases. A continuum that can be used to illustrate database classification is shown in Fig. 9. The intricacy of several emotion recognition databases is represented in the image.

Table 8 ML comparison of speech features
Fig. 9 Emotions and their complexities [159]

Actors with a high level of training and experience recorded the voice information in these databases [143]. Of all the database types, this is the simplest way to obtain a speech-based dataset of various moods. It is estimated that this method is used to collect around 60% of speech datasets.

Because they collect emotional data by creating an artificial emotional state, these databases are also known as induced databases [144]. The speaker or performer is unaware that this is taking place. Compared to actor-based databases, this type of database is more naturalistic. There can be an ethical issue, because the speaker should be aware that they are being recorded for research [145]. Natural databases are the most realistic, but they are the hardest to recognize and also the hardest to obtain. Typically, natural emotional speech databases are compiled from conversations in contact centers, the general public, and other sources [146]. Emotion recognition is used by contact centers to classify incoming calls according to their emotional content, and as a performance criterion for conversational analysis it can be used to identify satisfied and dissatisfied clients [147]. An in-car SER system can intervene to keep the driver safe and prevent accidents when it recognizes the driver's mental state.

The performance evaluation provided in [148] investigates a variety of acoustic properties of speech and classifier algorithms, which helps explore modern ways of emotion recognition. The design of DL makes it possible to utilize it for modalities other than NLP, such as SER and voice recognition. It is possible to use an RNN for natural language phrase classification and natural image processing [149]. As a final point, DL is replacing traditional SER techniques as the favored approach, and unsupervised and multimodal SER, as well as NLP and speech recognition, are all on the rise [150]. It is effective to identify emotions while simultaneously employing both aural and visual information. Which features to incorporate is a crucial decision in the development of any vocal system [151]. The features chosen should represent the information being delivered through them. The representation of speech information by various speech components (speaker, emotion, speech content, etc.) substantially overlaps [152]. As a result, many features in speech research are chosen experimentally, while others are picked with the use of Principal Component Analysis [153]. The ML comparison of the speech features for various techniques is shown in Table 8.
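
As a small illustration of the PCA route, the sketch below keeps the components explaining roughly 95% of the variance of a toy feature matrix; the matrix itself is random and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 40)             # 100 utterances x 40 raw speech features
pca = PCA(n_components=0.95).fit(X)      # keep components covering 95% of variance
X_reduced = pca.transform(X)
print(X.shape, "->", X_reduced.shape)
```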

The impact of emotional expression is also influenced by the speech's linguistic substance [154]. To increase the precision of emotion recognition, emotional speech can be utilized to identify prominent words, and traits recovered from these words can be used in addition to more conventional aspects [155]. A real-time application where it is crucial to authenticate requests is call monitoring for the ambulance and fire brigade; under the umbrella of emotion verification, pertinent aspects and models may be researched in this regard [156]. Figure 10 shows the most used keywords for emotion extraction; in this research, LSTM, MFCC, RNN, and CNN are found to be the most frequently used keywords.

Fig. 10 Most used keywords for the emotion extraction

7 Conclusion and future work

A new taxonomy was introduced, and the main ML techniques for hate speech identification were illustrated. According to the study, among the various DL and ML techniques, RF, CNN, SVM, and LSTM have the greatest practical uses. These algorithms work well for sentiment and emotion analysis as well as the detection of hate speech. Emotional analysis provides incontrovertible analytical data from a range of sources, such as common documents, business reports, social media monitoring, and customer support tickets. DL, in turn, enables the employment of more powerful tools and algorithms for data analysis. The classifier and database to use to accurately assess emotions can be chosen using the data presented in the earlier articles. The most often searched-for emotions are neutral, disgust, happiness, and sadness, along with other characteristics such as boredom, joy, surprise, and fear. The classifier used has an impact on the extraction rate. Due to the drawbacks of using subpar sample recordings in the databases, the accuracy of DBN networks is between 56 and 57 percent, and the recognition ratio has decreased. In this paper, both deep learning and machine learning for SER have been carefully analyzed. The paper includes a block diagram of the voice emotion detection system and a brief introduction to SER. To classify speech recognition systems, one must be able to distinguish between isolated, connected, spontaneous, and continuous words. There are several different techniques to research and assess approaches for recognizing emotions, including emotion recognition, DL, and ML. Researchers have recently paid a lot of attention to the topic of speech-based emotion recognition. This study examined a huge number of research publications in terms of databases, feature extraction, and classifiers. The research on emotion recognition systems conducted between 1994 and 2022 is summarized in this document. In order to improve performance accuracy, current research has placed a lot of emphasis on feature extraction and feature selection. Data analysis shows that classifier selection is a challenging task for enhancing system performance and recognizing the proper emotions. No obvious winner has emerged despite the selection of several classifiers for the speech emotion identification system.

This work gives an in-depth analysis of all the properties, databases, classifiers, and methods utilized to address the complicated challenge of SER. From the reviewed studies, it can be inferred that SVM performs better than the other models. In the future, sentiment analysis may be used to identify emotions through facial expressions in addition to speech. We hope to be able to recognize offensive speech in the future from a variety of monitoring data, and we want to consider visual information in addition to comment text to distinguish dangerous emotions. Online texts can also be handled using the adaptive bagging approach, which processes the texts as streams and thereby enriches the dynamic level of processing. Future research on this subject might examine how to improve the performance of our model by using BERT to build the embedding layer.