Evaluation of emotional arousal level and depression severity using the centripetal force derived from voice

In this research, we propose a new voice feature called centripetal force (CF) and use it to investigate the relationship between emotional arousal level and depression severity. First, CFs were calculated from the speech recordings in the interactive emotional dyadic motion capture database, and their correlation with the arousal level of each utterance was examined. The resulting correlation coefficient was 0.52. Next, we collected a total of 178 datasets, each comprising recordings of 10 speech phrases and the Hamilton Rating Scale for Depression (HAM-D) score, from outpatients with major depression at the Ginza Taimei Clinic (GTC) and the National Defense Medical College (NDMC) Hospital. The correlation coefficients between CF and HAM-D scores were -0.33 and -0.43 at the GTC and NDMC, respectively. The dataset was then divided into a no depression group (HAM-D < 8) and a depression group (HAM-D ≥ 8) according to the HAM-D score. There was a significant difference in the mean CF values between the two groups in both the GTC and NDMC data (p = 0.0089 and p = 0.0016, respectively). The AUC when discriminating the two groups by CF was 0.76 in the GTC data and 0.72 in the NDMC data. Thus, CF indirectly linked emotional arousal level and depression severity.

With the explosive increase in smartphone usage, research on voice emotion recognition and on the measurement of emotional arousal level using voice has been encouraged. For example, the relationship between arousal level and voice intensity or pitch has been documented [4,5].
The voice of a depressed person has been described as dull, monotonous, and lifeless [6], and listeners can perceive patients' distinctive prosody [7,8]. The Hamilton Rating Scale for Depression (HAM-D) [9] and self-administered questionnaires such as the General Health Questionnaire [10] and the Beck Depression Inventory [11] are powerful tools for measuring depression. However, if depression severity could be measured from voice, daily and remote monitoring would become possible using the voice call feature of a smartphone. From this perspective, several studies have been conducted to measure the severity of depression using voice [12]; these have shown that speech characteristics are effective predictors of the signs and severity of depression [13]. Similarly, Cannizzaro et al. [14] examined the relationship between the HAM-D score and voice, and found a strong correlation between the HAM-D score and speaking rate or pitch variation. Yang et al. [8] demonstrated that changes in the severity of depression measured by the HAM-D can be captured by the switching pause, that is, the pause duration between the end of one speaker's utterance and the start of an utterance by the other. The mel-frequency cepstral coefficient (MFCC) is often used for voice recognition. Taguchi et al. [15] showed that MFCC2 (the second dimension of MFCC) is effective in classifying patients with depression and individuals without depression.
medRxiv preprint doi: https://doi.org/10.1101/2020.08.19.20177048; this version posted August 22, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
In a previous study, we found that the higher the arousal level, the higher both the Hurst exponent and the zero-crossing rate of the waveform [16]. Specifically, we showed that arousal level can be approximated by a weighted average of the Hurst exponent and the zero-crossing rate. The zero-crossing rate is the rate at which the signal crosses the reference line; it is small for a smooth curve such as a sine curve and large for a rough waveform such as white noise. The Hurst exponent, on the other hand, is expressed as 2 - D, where D is the fractal dimension that represents the complexity of the waveform. In other words, the Hurst exponent is a measure of smoothness, which is the opposite of the fractal dimension, and theoretically it is 0 for white noise and 0.5 for brown noise.
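The contrast between a smooth and a rough waveform can be illustrated with a short Python sketch. It assumes that the reference line is the mean of the signal and that the rate is counted per adjacent sample pair; the function name `zero_crossing_rate` and the synthetic test signals are illustrative choices, not part of the original analysis.

```python
import math
import random


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs that lie on opposite sides of the mean."""
    center = sum(signal) / len(signal)
    sides = [s >= center for s in signal]
    crossings = sum(1 for a, b in zip(sides, sides[1:]) if a != b)
    return crossings / (len(signal) - 1)


n = 1000
# Smooth waveform: five cycles of a sine curve (the phase offset keeps samples off the line).
sine = [math.sin(2 * math.pi * 5 * i / n + 0.3) for i in range(n)]
# Rough waveform: Gaussian white noise from a seeded generator, so the run is repeatable.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

zcr_sine = zero_crossing_rate(sine)
zcr_noise = zero_crossing_rate(noise)
print(f"sine: {zcr_sine:.3f}, white noise: {zcr_noise:.3f}")
```

The sine curve crosses its mean only twice per cycle, whereas consecutive white-noise samples fall on opposite sides of the mean about half of the time.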
In this paper, we propose a new speech feature called centripetal force (CF) that combines both the roughness and smoothness of the waveform. First, CF was calculated from the emotional speech recordings stored in the interactive emotional dyadic motion capture (IEMOCAP) database [17] and compared with the arousal level evaluated by the annotators. Next, CF was calculated from the voices of patients with depression and compared with the HAM-D score.

Data on emotional arousal level
We used the IEMOCAP database [17] to investigate the relationship between the proposed voice index, CF, and emotional arousal. The database contains audio recordings of dyadic mixed-gender pairs of voice-over artists. It contains voices from five sessions in total, that is, five male voices and five female voices. The voices were manually divided into utterances (i.e., the sounds/words made from one breath to the next). There were 10,039 utterances in total. The arousal level of each utterance was evaluated by at least two different annotators on a five-point scale. The arousal level of each utterance was calculated as the average of the evaluation values given by the annotators.
Furthermore, emotion categories were evaluated by at least three annotators.
There were nine emotion categories: "angry," "happy," "sad," "neutral," "frustrated," "excited," "fearful," "surprised," and "disgusted." Utterances that did not seem to fit into any of these categories were classified as "other." A simple majority voting method was used to assign an emotion category if there was disagreement among annotators regarding classification. The annotators were also allowed to tag more than one emotion category. If no majority category could be assigned, the category was labeled "xxx." The total number of utterances assigned to the nine emotion categories by the above procedure was 7,527.
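The labeling rules described above can be sketched in Python as follows. This is a simplified illustration under two assumptions that the text does not confirm: each annotator contributes exactly one category, and "majority" means more than half of the annotators. The function names are hypothetical.

```python
from collections import Counter


def majority_category(labels):
    """Return the category chosen by more than half of the annotators, else "xxx"."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else "xxx"


def mean_arousal(ratings):
    """Average the five-point arousal ratings given by the annotators."""
    return sum(ratings) / len(ratings)


# Made-up annotations for illustration only.
print(majority_category(["angry", "angry", "frustrated"]))
print(majority_category(["sad", "neutral", "happy"]))
print(mean_arousal([4, 5]))
```

With multi-category tagging, as the annotators were allowed to do, the vote counting would need an additional rule that is not specified here.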

Data on severity of depression
This study collected data from outpatients with major depressive disorder at the Ginza Taimei Clinic (GTC) and the National Defense Medical College (NDMC) Hospital, after obtaining written informed consent from all participants. At each health care facility, the recruited patients were instructed to utter 17 Japanese phrases.
However, the 17 phrases collected at the two facilities were not exactly the same. Of the 17 phrases, 10 were common to both facilities, and these 10 phrases were used for the analysis in this study. Table 1 shows the contents of these 10 phrases.
[Insert Table 1 here.]
Voice was recorded using a pin microphone (ME52W, Olympus, Tokyo, Japan) attached to the patient's chest, approximately 15 cm from the mouth. The recording equipment was a portable recorder (R-26, Roland, Shizuoka, Japan). The recording format was linear pulse-code modulation (PCM). The sampling frequency and the number of quantization bits were 11,025 Hz and 16, respectively.
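As a rough illustration of the recording format, the following Python sketch writes and reads back a short 16-bit linear PCM signal at 11,025 Hz using the standard `wave` module. The channel count (mono) and the synthetic tone are assumptions made only for this example; the study's actual files and tooling are not described in the text.

```python
import io
import math
import struct
import wave

SAMPLE_RATE_HZ = 11025   # sampling frequency stated in the text
SAMPLE_WIDTH_BYTES = 2   # 16 quantization bits
N_CHANNELS = 1           # assumption: mono; the text does not state the channel count

# Synthesize 0.1 s of a 220 Hz tone as stand-in audio (not study data).
n_frames = SAMPLE_RATE_HZ // 10
samples = [int(10000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE_HZ))
           for i in range(n_frames)]

# Write the samples as little-endian signed 16-bit PCM into an in-memory WAV file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(N_CHANNELS)
    writer.setsampwidth(SAMPLE_WIDTH_BYTES)
    writer.setframerate(SAMPLE_RATE_HZ)
    writer.writeframes(struct.pack(f"<{n_frames}h", *samples))

# Read the file back and decode the frames into integers.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    rate = reader.getframerate()
    bits = reader.getsampwidth() * 8
    raw = reader.readframes(reader.getnframes())
    decoded = struct.unpack(f"<{len(raw) // 2}h", raw)

print(rate, bits, len(decoded))
```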
To evaluate the severity of depression, a doctor interviewed each patient and scored the HAM-D in the same session in which the voice recordings were made. In this way, pairs of recorded voices of 10 phrases and HAM-D scores were collected from 178 patients. Table 2 … organic brain disease. They were diagnosed by a psychiatrist using the Mini-
[Insert Table 2 here.]
The protocol of this study was designed in accordance with the Declaration of Helsinki and relevant domestic guidelines issued by the concerned authorities in Japan.

Proposed Method
Consider a signal x(t), 0 ≤ t ≤ T. The velocity v(t) of the signal at time t is defined as follows:
[Insert Equation (1) here.]
Next, the force F(t) that acts virtually is defined as follows:
[Insert Equation (2) here.]
where mean(x, y) represents the mean of x and y and is used for normalization. In this study, the harmonic mean is used as the mean. For simplicity, we set Δt = 1.
At this time, equation (2) can be rewritten as follows:
[Insert Equation (3) here.]
Next, we focus on the directionality of F(t). The center line is defined as the time average x̄ of the signal as follows:
[Insert equation here.]
In the case of F(t) < 0, F(t) represents a downward force. Therefore, when x(t) is above the center line, i.e., when x(t) > x̄, … (Figure 1a). It is defined as follows:
[Insert equation here.]
The CF of a given voice signal is defined as the time average of CF(t) as follows:
[Insert equation here.]
As is clear from the definition, CF increases as the proportion of force toward the center increases. In addition, as can be seen by comparing Figures 1a and 1b, the higher the rate at which the waveform crosses the center line, that is, the higher the zero-crossing rate, the larger the proportion of the force toward the center. Thus, CF can be said to be a measure that reflects the roughness of the waveform, like the zero-crossing rate.
Next, we focus on the first term on the right side of equation (3). This term is the reciprocal of the harmonic mean of v(t) and v(t + 1). The harmonic mean of x and y is expressed as 2xy/(x + y), and the more skewed the values of x and y, the smaller the harmonic mean. Because this term is the reciprocal, the greater the disparity between the values of v(t) and v(t + 1), the greater the CF. As can be seen by comparing Figures 1b and 1c, the larger the deviation between the values of v(t) and v(t + 1), the smoother the entire waveform tends to be. In other words, the first term on the right side of equation (3) can be considered as a measure of smoothness. Thus, CF can be regarded as a measure that reflects both roughness and smoothness.
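The behavior of the harmonic mean described above can be checked numerically. The following Python sketch uses made-up pairs of positive values that share the same arithmetic mean; it does not compute CF itself, and it does not address how velocities of opposite sign are handled, which this text does not specify.

```python
from statistics import harmonic_mean

# Pairs with the same arithmetic mean (5.0) but increasing skew.
pairs = [(5.0, 5.0), (2.0, 8.0), (1.0, 9.0), (0.1, 9.9)]

results = []
for x, y in pairs:
    hm = harmonic_mean([x, y])      # equals 2xy / (x + y) for two positive values
    results.append((hm, 1.0 / hm))  # the reciprocal grows as the pair becomes more skewed
    print(f"x={x}, y={y}: harmonic mean={hm:.3f}, reciprocal={1.0 / hm:.3f}")
```

The harmonic mean falls from 5.0 to roughly 0.2 across these pairs, so its reciprocal rises accordingly.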
[Insert Figure 1a here.] [Insert Figure 1b here.] [Insert Figure 1c here.]

CF as an arousal level index
CFs were calculated from 10,039 utterances stored in the IEMOCAP database.
The correlation coefficient between CF and the arousal level of each utterance given by the annotators was 0.52 (n = 10,039, p < 2.2 × 10^-16). We extracted low-arousal voices with a level of 2 or less (n = 1,112, mean ± SD = 1.92 ± 0.19) and high-arousal voices with a level of 4 or more (n = 1,692, mean ± SD = 4.19 ± 0.28) from the database. When these two groups were discriminated by CF, the area under the curve (AUC) of the receiver operating characteristic curve was 0.93, and when the cutoff value was 0.90, both the sensitivity and specificity were 0.86.
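For readers unfamiliar with these metrics, the following Python sketch shows one way to compute the AUC and the sensitivity and specificity at a fixed cutoff. It assumes that a higher CF indicates the high-arousal class, in line with the positive correlation reported above; the scores are made-up values rather than study data, and the function names are illustrative.

```python
def auc(negatives, positives):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(negatives, positives, cutoff):
    """Classify a score as positive when it is greater than or equal to the cutoff."""
    sensitivity = sum(p >= cutoff for p in positives) / len(positives)
    specificity = sum(q < cutoff for q in negatives) / len(negatives)
    return sensitivity, specificity


# Made-up CF-like scores for illustration only (not values from the study).
low_arousal = [0.70, 0.78, 0.85, 0.88, 0.93]
high_arousal = [0.84, 0.91, 0.95, 0.99, 1.04]

area = auc(low_arousal, high_arousal)
sens, spec = sensitivity_specificity(low_arousal, high_arousal, cutoff=0.90)
print(f"AUC={area:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The pairwise definition used here is equivalent to the area under the receiver operating characteristic curve, although it is slower than rank-based implementations for large samples.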
The average values of arousal level and CF for each emotion category were compared. There were nine emotion categories: "angry," "happy," "sad," "neutral," "frustrated," "excited," "fearful," "surprised," and "disgusted." Figure 2 shows the number of utterances in each category. In addition, Figure 3 shows the average values of arousal level and CF for each category. Here, the categories on the horizontal axis are arranged in descending order of average arousal level. Except for the three categories of "fearful," "surprised," and "disgusted," the order relationship between arousal level and CF was the same. As can be seen from the figure, both arousal level and CF tended to be high in the categories of "angry" and "excited," and low in the categories of "neutral" and "sad," which is consistent with everyday intuition. The number of utterances included in the three categories in which the order of arousal level and CF did not match was extremely small compared to the other categories. The numbers of utterances (percentages) in the categories "fearful," "surprised," and "disgusted" were 40 (0.53%), 103 (1.42%), and 2 (0.03%), respectively. All analyses were performed using the statistical software R [20], unless otherwise specified.

HAM-D score
There are various discussions on the classification of depression severity using the HAM-D score [21]. In this study, the dataset was divided into two groups, a "no depression" group with a HAM-D score of less than 8 and a "depression" group with a HAM-D score of 8 or more, using Hashim's method [22]. Table 3 shows the mean HAM-D score of each group according to the facility. According to the Wilcoxon rank sum test, a significant difference was found between the HAM-D scores of the two groups at both the GTC and the NDMC Hospital (p = 1.22 × 10^-8 and p = 1.56 × 10^-13, respectively).
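The grouping rule and the group comparison can be sketched in Python as follows. The rank sum test below uses the normal approximation with average ranks for ties and no continuity or tie correction, so its p-values may differ from those of R's `wilcox.test`, which the authors presumably used. The scores are made-up values, and the function names are hypothetical.

```python
import math


def ham_d_group(score):
    """Assign a group from the HAM-D score using the cutoff of 8 described in the text."""
    return "no depression" if score < 8 else "depression"


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test via the normal approximation (no corrections)."""
    pooled = sorted(a + b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        i = j
    n1, n2 = len(a), len(b)
    u = sum(rank_of[v] for v in a) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the normal approximation
    return u, p


# Made-up HAM-D scores for illustration only (not study data).
ham_d_scores = [2, 5, 7, 3, 12, 18, 9, 25, 15, 21]
groups = {"no depression": [], "depression": []}
for score in ham_d_scores:
    groups[ham_d_group(score)].append(score)

u, p = rank_sum_test(groups["no depression"], groups["depression"])
print(f"U={u}, p={p:.4f}")
```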

CF as an index of depression severity
Data of patients with major depression were collected from two health care facilities. As shown in Table 2, the age distributions of patients at the two facilities were quite different. In addition, the acoustic environments may have differed between the two facilities.
Therefore, a separate analysis was performed for each facility. Figure 4 shows the mean CF values of each group for each facility. Here, the CF value of each participant was obtained by calculating the mean value of the CFs for the 10 phrases. At the GTC, the mean CF values of the "no depression" and "depression" groups were 0.67 ± 0.041 (n = 10) and 0.54 ± 0.015 (n = 78), respectively. At the NDMC Hospital, the mean CF values of the "no depression" and "depression" groups were 0.81 ± 0.020 (n = 65) and 0.69 ± 0.031 (n = 25), respectively.
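The aggregation described above can be sketched in Python as follows: each participant's CF is the mean over that participant's phrases, and a group is then summarized by its mean and standard error. Treating the reported "±" values as standard errors is an assumption based on the note to Figure 4; the data and names are made up for illustration, with three phrases per participant instead of 10 for brevity.

```python
import math
from statistics import mean, stdev


def participant_cf(phrase_cfs):
    """CF of one participant: the mean of the CFs of that participant's phrases."""
    return mean(phrase_cfs)


def mean_and_standard_error(values):
    """Group mean and standard error (sample standard deviation divided by sqrt(n))."""
    return mean(values), stdev(values) / math.sqrt(len(values))


# Made-up per-phrase CF values for illustration only (not study data).
phrase_cfs_by_participant = {
    "P01": [0.62, 0.58, 0.66],
    "P02": [0.71, 0.69, 0.73],
    "P03": [0.80, 0.78, 0.82],
}

per_participant = [participant_cf(v) for v in phrase_cfs_by_participant.values()]
group_mean, group_se = mean_and_standard_error(per_participant)
print(f"{group_mean:.2f} ± {group_se:.3f} (n = {len(per_participant)})")
```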
[Insert Figure 4 here.]
As a result of the Wilcoxon rank sum test, a significant difference was found between the mean CF values of the two groups at both the GTC and the NDMC Hospital (p = 0.0089 and p = 0.0016, respectively). … At the NDMC Hospital, there were significant differences in both the group and phrase factors (F(1,880) = 74.24, p < 2.00 × 10^-16; F(9,860) = 16.11, p < 2.00 × 10^-16).
However, there was no interaction between group and phrase (F(9,860) = 0.15, p = 1.00). Figures 5a and 5b show the mean CF for each phrase in the "no depression" and "depression" groups. For all phrases, the CF values of the "no depression" group were higher than those of the "depression" group at both facilities.
[Insert Figures 5a and 5b here.]
Table 4 shows a summary of the classification performance between the "no depression" and "depression" groups by CF. The table shows the p-values obtained by the Wilcoxon rank sum test, the AUC, and the correlation coefficient between HAM-D and CF for each phrase at both facilities. In the table, the minimum p-value, the maximum AUC, and the maximum correlation for each facility are shown in bold font. The AUC tended to be higher for the GTC. On the other hand, the correlation coefficient with HAM-D tended to be higher for the NDMC Hospital.

Discussion
The CF, a voice index proposed in this paper, showed a significant correlation with the arousal level evaluated by annotators. In the comparison of CF and arousal level by emotion category, the order relation among "fearful," "surprised," "frustrated," and "happy" was reversed. However, as can be seen from Figure 3, the values of the arousal level are almost the same for these four emotions, and it can be said that it is difficult even for human beings to distinguish among them.
Regarding the emotion category "disgusted," the values of the arousal level and CF were significantly different, but this may be because the sample size was too small (n = 2). With regard to depression severity, there was also a significant correlation between the HAM-D score and the CF value. The AUC was over 0.7 at both facilities. Although no simple comparison is possible, the AUC shown here was about the same value as that of the CF. However, the CF may be advantageous because it is unlikely to be overfitted, as it consists of only one feature.
As shown above, the CF was associated with both arousal and depression severity. In other words, though indirectly, arousal and depression severity were linked through the CF. However, it should be noted that the arousal level was an evaluation value given by the annotators and did not reflect the participant's own evaluation. In the future, it is necessary to investigate the relationship between CF and physiological indicators.
In addition, we need to clarify the qualitative meaning of CF. The CF correlated with the arousal level, and was high for the voices expressing anger and excitement and low for those expressing sadness and neutral qualities. It is necessary to consider why a voice with a high arousal level increases the force toward the center.
This study has some limitations. First, all participants had to use the same phrases for an accurate evaluation. Our future task would be to analyze the CF of spontaneous speech. The two-way ANOVA revealed that CF was affected by the utterance content. Future studies should explore the reasons for such differences among phrases. Second, the sample size of each group was small. We collected voices from two health care facilities, the GTC and the NDMC Hospital, but the age distribution and the distribution of HAM-D scores were quite different between the two facilities. The mean age of the participants was higher at the NDMC Hospital. Conversely, the HAM-D score was higher at the GTC. The reason for this may be that the GTC is located in central Tokyo and many outpatients are young office workers who commute to the city center; while …

Data Availability
According to Japanese law, the sensitivity of audio files is similar to that of any other personal information and cannot be published without consent. In this research protocol, we did not obtain consent from the participants to publish the raw audio files as a corpus. The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Figure 4. The mean CF for the "depression" and "no depression" groups, by facility.
Note: Error bars represent standard error. ** (p < 0.01)