Workshop on AI Evaluation Beyond Metrics, July

Item Response Theory to Evaluate Speech Synthesis: Beyond Synthetic Speech Difficulty

Chaina Oliveira

Ricardo Prudêncio

Universidade Federal de Pernambuco, 1235 Prof. Moraes Rego, Recife, Brazil

2022

Artificial Intelligence (AI) systems have been increasingly developed and improved, and one of the main challenges is to evaluate and compare them. However, traditional assessment methods do not consider some hidden factors that may influence the quality of these systems and that can help discriminate between them (e.g., between poor and good techniques). Previously, we used Item Response Theory (IRT) to simultaneously evaluate speech synthesis and recognition. IRT is a paradigm from psychometrics to estimate the cognitive ability of human respondents based on their responses to items with different levels of difficulty. One of the measures we estimated in that previous work was the difficulty of each synthesized speech, but the factors that influence that measure were not deeply explored. So, in this paper, we go further into this topic and investigate what explains the difficulty of a synthesized speech. We found that some of the factors that may influence it are the sentence, the locale and the service used to generate the speech. We also performed a preliminary study to investigate the viability of predicting the synthesized speech difficulty using machine learning models: we trained regression models using the speech synthesis parameters as features and the difficulty as the label. The best result was achieved using a Random Forest, with a normalized R2 score of 0.31.

Keywords: Item Response Theory, Speech Synthesis Evaluation, Synthesized Speech Difficulty, Speech Quality Measurement

1. Introduction

Progress in speech synthesis and recognition research has changed the way we communicate and interact with machines. These techniques can be used as a means of communication in diverse applications. It is common to see mobile users who opt for voice commands instead of the device's keyboard to execute some task (e.g., call someone, do a Google search, write an e-mail). These kinds of systems have been developed and improved more and more, but we have not seen many advances in how to evaluate them. In a previous paper, we proposed using Item Response Theory (IRT) from psychometrics to evaluate speech synthesis [1], and in another, we assessed speech synthesis and recognition [2].

IRT is commonly used in educational testing to estimate the latent ability of respondents and the difficulty of items. Recently, this evaluation methodology has been adopted in other contexts, including the evaluation of AI systems. In supervised learning, IRT was explored by [3], [4] and [5] to evaluate the ability of classifiers based on their answers to a set of instances (i.e., what class each instance belongs to). [4] and [5] investigated the importance of analyzing the particular problems in which good techniques fail (e.g., a classifier with good performance misclassifies an instance that a poor one classifies correctly). For instance, they argued that it is unfair to evaluate classifiers using just the number of instances they classify correctly; it is also important to analyze the difficulty of the instances classified by the models under test. Furthermore, IRT was also adopted to evaluate the abilities of regression models in [6]. A more recent way of estimating IRT difficulties was proposed by [7]. The authors suggested that we could predict the difficulty of new items using a regression model trained with the problem features, using the difficulty as the target. They trained a regression model for a set of domains (e.g., Supervised Learning, Audio Processing, Computer Vision and so on) and the results showed that using this methodology in that context is promising.

Recently, we adopted IRT to evaluate speech synthesis and speech recognition [2]. The main goal of that work was to estimate the latent ability of Automatic Speech Recognition (ASR) systems, the quality of speakers and the difficulty of synthesized speeches and sentences. First, we extracted 100 benchmark sentences from VoxForge [8] and synthesized them using English voices from four services with different variations of pitch and rate. This resulted in a set of synthesized audios that were given as input to four ASR systems to be transcribed. After this, we calculated the accuracy of all transcriptions using the word accuracy rate. These accuracy values became the input to our IRT model (i.e., the responses). To estimate the IRT parameters (e.g., synthesized speech difficulties), we adopted the β³-IRT model proposed by [3].

In this paper, we present a deeper analysis of the synthesized speech difficulties estimated in [2] in order to understand whether they can be explained by the sentences or by the synthesis parameters used to generate the speeches. We analyzed the data produced by this previous work and found that the synthesized speech difficulty can be affected by the sentence and by some speech synthesis parameters (e.g., speaker, locale, pitch, rate and gender). We also aimed to know whether a regression model could predict the IRT difficulty in this context. So, we trained MLP, Linear Regression and Random Forest models using the synthesis parameters as features and the difficulty as the label. The Random Forest outperformed the others, achieving a normalized MAE of 0.50 and a normalized R2 of 0.31.

The proposal of this paper fits the goal of the AI Evaluation Beyond Metrics workshop, since both aim to investigate and give visibility to new robust approaches to the assessment of AI systems. In line with the workshop's goal, we explore new assessment methods that try to cover some limitations of the traditional ones. The evaluation approach used in this work (i.e., IRT) has already been adopted to evaluate other kinds of AI systems, such as classifiers and NLP systems. Here, we explore the use of IRT in a new context: the evaluation of speech synthesis and recognition.

2. Item Response Theory

IRT is a methodology from psychometrics that aims to estimate the latent abilities of respondents in tests [9]. It models the responses to testing items based on their difficulties and the skills of the respondents who answered them. This section presents a classical IRT model (i.e., the binary model) and a more recent model (i.e., β³-IRT). The latter is the one we adopted in this work.

2.1. Binary Model

The binary model, also known as dichotomous, is usually used when a response to an item is positive or negative. In this category, we have the 3-parameter (3PL) IRT model and the 2-parameter (2PL) IRT model. In 3PL, the probability of a correct response is defined by a logistic function of the respondent's ability and the item's difficulty, discrimination and guessing. This model returns the Item Characteristic Curve (ICC), which is modeled according to the function below:

$$P(U_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}} \qquad (1)$$

in which:

• $U_{ij}$ is the response of respondent j to item i;
• $b_i$ is the item difficulty (the location parameter of the ICC);
• $a_i$ is the item discrimination (the slope of the ICC);
• $c_i$ is the guessing parameter (the asymptotic minimum of the ICC);
• $\theta_j$ is the ability of respondent j.

It is important to emphasize that, unlike in traditional evaluation methods, the respondent's ability in IRT is not necessarily estimated only from the number of questions answered correctly. It also depends on the number of difficult items the respondent answers correctly. Similarly, the difficulty of an item is estimated from the respondents who answer it correctly. In other words, to estimate these parameters, we consider the sets of items and respondents under analysis.

2.2. β³-IRT Model

The binary IRT model is applied when the response can be correct or incorrect. In turn, the more recent β³-IRT model [3] deals with continuous responses. The authors of β³-IRT applied it in two contexts. The first one was to estimate the responses given by students to items, a typical application of IRT. The second application was in supervised machine learning, in which classifiers and instances were respondents and items, respectively. In turn, the responses were the probabilities of the classifiers assigning the correct class to an instance. The expectation of the responses can be calculated by:

$$E[p_{ij} \mid \theta_j, \delta_i, a_i] = \frac{1}{1 + \left(\frac{\delta_i}{1 - \delta_i}\right)^{a_i} \left(\frac{\theta_j}{1 - \theta_j}\right)^{-a_i}} \qquad (2)$$

in which:

• $p_{ij} \in [0, 1]$ is the response of respondent j to item i;
• $\delta_i$ is the difficulty of item i;
• $\theta_j$ is the ability of respondent j;
• $a_i$ is the discrimination of item i.

Some ICCs that can be modeled by Eq. 2 are shown in Figure 1. Each plot shows the curve with different values of difficulty and discrimination. When the discrimination is 2, the curve assumes a sigmoidal shape. If the discrimination is 1, the curve is parabolic, but if that parameter is between 0 and 1, the ICC assumes an anti-sigmoidal behavior.
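As a minimal sketch of the two response functions in this section, the Python snippet below evaluates the 3PL curve of Eq. 1 and the β³-IRT expected response of Eq. 2. It assumes the standard 3PL logistic form and the β³-IRT form given in [3], with β³-IRT abilities and difficulties restricted to the open interval (0, 1). The function and argument names are illustrative and are not taken from [2] or [3].

```python
import math


def icc_3pl(theta, a, b, c):
    """3PL logistic ICC (Eq. 1): probability of a correct response.

    theta: respondent ability, a: discrimination, b: difficulty, c: guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def icc_beta3(theta, delta, a):
    """Expected response under the beta^3-IRT form (Eq. 2).

    theta (ability) and delta (difficulty) must lie strictly between 0 and 1.
    """
    item_ratio = delta / (1.0 - delta)
    respondent_ratio = theta / (1.0 - theta)
    return 1.0 / (1.0 + (item_ratio ** a) * (respondent_ratio ** (-a)))


# Print one curve per discrimination value for an item of difficulty 0.5.
for a in (0.5, 1.0, 2.0):
    curve = [round(icc_beta3(t / 10, 0.5, a), 3) for t in range(1, 10)]
    print(f"a={a}: {curve}")
```

Evaluating `icc_beta3` on a grid of abilities for discriminations below, equal to and above 1 is one way to inspect the curve shapes described for Figure 1.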

3. IRT to Evaluate Speech Synthesis

In a previous work [2], we developed a two-level IRT model to evaluate speech synthesis and recognition. This model is illustrated in Figure 2. In the first level, an item is a synthesized speech produced from a given sentence and a speaker. In turn, the respondent is an ASR system. Each response is the transcription accuracy observed when a synthesized speech is given as input to the ASR system. An IRT model identifies latent patterns of responses to estimate the difficulty of each synthesized speech and the ability of each ASR system. In the second level, the synthesized speech's difficulty is decomposed into two latent factors: the sentence's difficulty and the speaker's quality. In this work, we focus on the first level: our main goal is to find characteristics that may influence the estimated difficulty of a synthesized speech, so we analyze and use the data generated and estimated at Level 1 in [2].

Figure 3 shows two ICCs of synthesized speeches with low and high difficulty, respectively. For the first one (item 6613), all Automatic Speech Recognition (ASR) systems obtained a high response value. However, almost all ASR systems (3 of 4) obtained a low response value for the most difficult item (item 2829).

A variety of sentences, speakers and automatic speech recognizers were used by [2], as presented below:

• Sentences: The sentences were extracted from VoxForge [8], an open speech dataset. A total of 100 English sentences of different sizes were adopted. Figure 4 shows the distribution of the sentence sizes (number of characters), with a median of 51.5. The shortest sentence has 12 characters, and the longest has 134.
• Speakers: The speakers are from four different services: Amazon Polly [10], Google Text to Speech API [11], IBM Watson Text to Speech [12] and Microsoft Azure Text to Speech [13]. Each service has speakers with different English accents, genders, pitches and rates.
• Automatic Speech Recognizers: The recognizers adopted in this work were Google Speech to Text [14], Microsoft Azure Speech to Text [15], IBM Watson Speech to Text [16] and Wit [17]. They were responsible for receiving a synthesized speech and generating a transcription (the sentence the recognizer understood) of that audio.
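To make the notion of a response concrete, the sketch below scores one transcription against its reference sentence with a word-level accuracy. The exact definition used in [2] is not restated here, so this is an assumption: accuracy is taken as one minus the word-level edit distance divided by the number of reference words, clipped at zero. The function names and example sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed definition: max(0, 1 - word errors / reference length)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    errors = edit_distance(ref_words, hyp_words)
    return max(0.0, 1.0 - errors / len(ref_words))


# Hypothetical example: a recognizer that drops the end of a sentence.
print(word_accuracy("call my mother now", "call my brother"))
```

Under this assumed definition, a transcription that covers only part of a long sentence receives a low score, because every missing word counts as an error.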

In [2], a total of 15,000 synthesized speeches were produced. Each one was generated from a single sentence and a speaker setting. The IRT model estimated the difficulty of each speech, with the distribution presented in Figure 5. The difficulty lies between 0 and 0.9. The majority of the speeches have a difficulty between 0.2 and 0.6. Also, we do not see a representative peak, which means that there is no specific difficulty value shared by a large part of the synthesized speeches.

4. Experiments and Results

The IRT model provided in [2] estimated the difficulty value of each speech, but the aspects that impacted the difficulty across speeches were not deeply investigated. In this paper, we explore the inferred difficulty of the synthesized speeches in depth, aiming to observe its relation to speech synthesis parameters and sentence features. For instance, does the length of a sentence influence the difficulty? Are longer sentences easier or more difficult to synthesize than short ones? Is gender somehow related to difficulty? So, in Section 4.1, we explore the relationship between specific synthesis parameters and the difficulty. We show the difficulty distribution among the groups of each feature and also perform statistical tests to check for significant differences between them. For instance, we present the difficulty distribution of each gender and perform the statistical test on the difficulty values of male and female speeches. We also aimed to know whether we could predict the synthesized speech difficulty. Thus, in Section 4.2, we present insights and results of a preliminary predictive model we developed to predict the difficulty, using the synthesis parameters as predictor attributes and the difficulty as the target attribute.

4.1. How Synthesis Parameters Influence the Difficulty of a Synthetic Speech?

Initially, we aimed to understand whether the size of the sentences has any relation to the difficulty. We noticed that the larger the sentence's size or number of words, the more difficult it tends to be, as seen in Figure 6. We inspected some cases and found that, depending on the parameters used to synthesize the speeches, they are not fully transcribed by some recognizers. This directly affects the word accuracy rate, the response used as input to the IRT model. Table 1 shows examples of transcriptions of two of the longest sentences of our dataset. Note that just a part of each was transcribed, which impacts the mean difficulty of those sentences.

Two of the speech parameters we explored were pitch and rate. We generated speeches with three different pitch values (i.e., low, medium and high). Figure 7 shows the distribution of the synthesized speech difficulty for each pitch group. Each box represents 50% of the difficulty values of the respective group. In turn, the lower and upper whiskers represent the difficulties outside the box and indicate the variability of the data outside the lower and upper quartiles (i.e., the lower and upper box lines). The line that divides each box into two parts is the median: half of the difficulty values are greater than or equal to that value, and half are less. For instance, Figure 7 shows that speeches with low or medium pitch tend to be easier than the ones with high pitch, since the median difficulty of the synthesized speeches with high, medium and low pitch is 0.42, 0.38 and 0.37, respectively. It is also possible to see that speeches with high pitch are the ones that tend to be more difficult, whilst the ones with a low pitch are the easiest. Regarding the rate (Figure 8), we noticed that speeches with a fast rate tend to be more difficult. In turn, the ones with a medium rate are the easiest.

As we used four services to synthesize the speeches, we aimed to investigate whether speeches from a specific synthesizer are more difficult than the ones generated by the others, and we confirmed that this is the case, as shown in Figure 9. The speeches from Azure are the most difficult. In turn, the ones from Watson are the easiest. In the middle, we have Google and Polly, with the latter tending to generate easier synthesized speeches than the service from Google.

The relation of gender and locale (i.e., type of English) with difficulty was also analyzed. Figures 10 and 11 show the synthesized speech difficulty distribution by locale and by gender, respectively. Figure 12 then shows the mean difficulty of each gender by locale. We see that female voices are more difficult than male ones. Regarding the type of English, synthesized speeches in United States English are the easiest ones. In turn, speeches in Australian English are the most difficult, followed by British English and Indian English, respectively. Furthermore, we can see that female voices are more difficult than male voices in all locales (except for Australian English, for which there are no male voices in our database to compare).
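The group comparisons in this section amount to splitting the difficulty values by one or two synthesis parameters and summarizing each group. A minimal sketch of that bookkeeping is shown below. The rows, the locale codes and the difficulty values are made up for illustration and are not data from [2].

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows: (locale, gender, difficulty). Values are illustrative only.
rows = [
    ("en-US", "female", 0.35), ("en-US", "male", 0.30),
    ("en-GB", "female", 0.45), ("en-GB", "male", 0.40),
    ("en-AU", "female", 0.50), ("en-US", "female", 0.37),
]


def summarize(rows, key):
    """Group difficulties by key(row) and return {group: (mean, median, count)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {group: (mean(values), median(values), len(values))
            for group, values in groups.items()}


by_gender = summarize(rows, key=lambda row: row[1])
by_locale_gender = summarize(rows, key=lambda row: (row[0], row[1]))

for group, (avg, med, count) in sorted(by_locale_gender.items()):
    print(group, round(avg, 3), round(med, 3), count)
```

Grouping by a `(locale, gender)` pair also makes missing combinations visible, such as a locale that has no male voices.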

We performed an ANOVA statistical test among the groups of each feature shown in this section's plots (Figures 7 to 11) to check for significant differences between them. The p-value obtained from the analysis was significant in all cases (p < 0.01). So, we conclude that there are significant differences among the groups.
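For readers unfamiliar with the test, the sketch below computes the one-way ANOVA F statistic and its degrees of freedom for a list of groups, using only the Python standard library. The group values are made up. In practice, the p-value is obtained from the F distribution with these degrees of freedom, for example through a statistics package such as SciPy (`scipy.stats.f_oneway`). The tooling actually used in this study is not stated, so this is only an illustration.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA: return (F statistic, df between groups, df within groups).

    Assumes at least two groups and some variation inside the groups.
    """
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = mean([x for group in groups for x in group])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(group)) ** 2 for group in groups for x in group)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical difficulty values for low, medium and high pitch.
low = [0.35, 0.37, 0.36, 0.38]
medium = [0.38, 0.39, 0.37, 0.40]
high = [0.42, 0.44, 0.41, 0.43]
print(anova_f([low, medium, high]))
```

A large F value relative to the F distribution with `(df_between, df_within)` degrees of freedom corresponds to a small p-value, which is the criterion reported above.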

4.2. Predicting the Difficulty of a Synthetic Speech

This section presents the experiments we performed to evaluate the predictability of the synthesized speech difficulty. As we have the sentences and speaker parameters used to generate the speeches (e.g., pitch, rate, speaker, locale), we investigated whether difficulty can be predicted using these parameters as predictor attributes (Table 2). Thus, we trained some regression models by assuming difficulty as the target attribute.

The regression models we trained were MLP, Linear Regression and Random Forest from scikit-learn, a machine learning Python library. We encoded the categorical features (e.g., sentence, speaker, and so on) using the label encoding method, also from scikit-learn. We also ran each model using four different combinations of features (Table 3):

• Combination 1: all features (Table 2).
• Combination 2: all features except the sentence.
• Combination 3: all features except the speaker.
• Combination 4: all features except the sentence and the speaker.

The Random Forest trained with all features outperformed all other models. It had a normalized MAE and R2 of 0.50 and 0.31, respectively (Table 3). Figure 13 shows the feature importances, that is, the scores of the features we used to train the Random Forest model with all features (i.e., combination 1). Higher values mean that a feature has more effect on the prediction process. The feature with the most effect is the sentence, followed by the size of the sentence (i.e., len_sentence), service, speaker, number of words, pitch, rate, locale and gender. For instance, the feature service is more useful for predicting the synthesized speech difficulty than the rate. In fact, in Section 4.1 we could see that the tendency of some services to generate more difficult speeches is more explicit than the tendency of some rates. In other words, the difficulty distributions of the services differ more from each other than the difficulty distributions of the rates.
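The pipeline described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic stand-in data, not the experiment reported in Table 3. The feature subset, the made-up target, the train/test split and the hyperparameters are all assumptions, and the metrics are the plain MAE and R2 rather than the normalized versions reported in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; column names mirror some features named in the text.
raw = {
    "sentence": rng.integers(0, 20, n).astype(str),
    "service": rng.choice(["polly", "google", "watson", "azure"], n),
    "pitch": rng.choice(["low", "medium", "high"], n),
    "rate": rng.choice(["slow", "medium", "fast"], n),
}
# Made-up target that depends on service and pitch, plus noise.
difficulty = (
    0.3
    + 0.15 * (raw["service"] == "azure")
    + 0.05 * (raw["pitch"] == "high")
    + rng.normal(0, 0.02, n)
)

# Label-encode each categorical column, as described in the text.
feature_names = list(raw)
X = np.column_stack(
    [LabelEncoder().fit_transform(raw[name]) for name in feature_names]
)

# Hold out 25% of the rows for evaluation (an assumed protocol).
X_train, X_test, y_train, y_test = train_test_split(
    X, difficulty, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
importances = dict(zip(feature_names, model.feature_importances_))
print(f"MAE={mae:.3f} R2={r2:.3f}")
print(sorted(importances.items(), key=lambda item: -item[1]))
```

Dropping a column from `feature_names` before building `X` is one way to reproduce the idea of the feature combinations listed above. Note that label encoding imposes an arbitrary order on categories, which tree-based models tolerate better than linear ones.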

This was a preliminary study to analyze the viability of using three different types of models to predict the synthesized speech difficulty. The experiments suggest that, given a sentence and the synthesis parameters of a new speech we want to synthesize, we can estimate its difficulty without having to run an IRT model again, by training a model on the dataset we already constructed. In the near future, we aim to delve deeper into this with more experiments and further analysis. For instance, we can explore adding more features related to phonemes.

Also we can test our models with a newly .

5. Conclusion and Future Work

In this paper, we investigated what explains the synthesized speech difficulty. We analyzed in depth the data from an experiment we performed in [2], focusing on finding out whether the difficulty of a synthesized speech can be explained by the sentence or by any other parameter used in the synthesis process (e.g., pitch, rate, speaker).

The results of our descriptive analysis showed that longer sentences tend to be more difficult. Also, some services or locales generate easier speeches than others, and female voices are more difficult than male ones. We also trained regression models in order to see whether we can predict the synthesized speech difficulty. Our preliminary experiment showed that this approach may be useful in this context. So, in the future, we aim to investigate this topic further, training more robust models and adding more features to gain more insights and obtain better results.

Acknowledgments

This work was supported by CAPES, CNPq and FACEPE (Brazilian funding agencies) and Motorola Mobility.

[1] C. S. Oliveira, C. C. Tenório, R. B. Prudêncio, Item response theory to estimate the latent ability of speech synthesizers, in: ECAI, 2020, pp. 1874-1880.

[2] C. S. Oliveira, J. V. Moraes, Silva Filho, R. B. Prudêncio, A two-level item response theory model to evaluate speech synthesis and recognition, Speech Communication 137 (2022) 19-34.

[3] Chen, T. S. Filho, R. B. C. Prudêncio, Diethe, P. Flach, β³-IRT: A new item response model and its applications, in: Proceedings of Machine Learning Research, volume 89, 2019, pp. 1013-1021.

[4] Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Making sense of item response theory in machine learning, in: Proceedings of the Twenty-second European Conference on Artificial Intelligence, IOS Press, 2016, pp. 1140-1148.

[5] Martínez-Plumed, R. B. Prudêncio, A. Martínez-Usó, J. Hernández-Orallo, Item response theory in AI: Analysing machine learning classifiers at the instance level, Artificial Intelligence 271 (2019) 18-42.

[6] J. V. Moraes, J. T. Reinaldo, Ferreira-Junior, Silva Filho, R. B. Prudêncio, Evaluating regression algorithms at the instance level using item response theory, Knowledge-Based Systems (2022) 108076.

[7] Martínez-Plumed, Castellano-Falcón, Monserrat, Hernández-Orallo, When AI difficulty is easy: The explanatory power of predicting IRT difficulty, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2022.

[8] VoxForge, Voxforge, 2022. URL: http://www.voxforge.org/, accessed: 08/05/2022.

[9] R. J. De Ayala, The theory and practice of item response theory, Guilford Publications, 2013.

[10] A. W. S. AWS, Amazon Polly: Turn text into lifelike speech using deep learning, 2022. URL: https://aws.amazon.com/polly/, accessed: 08/05/2022.

[11] Cloud, Cloud text-to-speech: Text-to-speech conversion powered by machine learning, 2022. URL: https://cloud.google.com/text-to-speech/, accessed: 08/05/2022.

[12] I. Watson, Text to speech, 2022. URL: https://text-to-speech-demo.ng.bluemix.net/, accessed: 08/05/2022.

[13] Azure, Text to speech: Convert text to lifelike speech for more natural interfaces, 2022. URL: https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/, accessed: 08/05/2022.

[14] Cloud, Speech-to-text: Speech-to-text conversion powered by machine learning, 2022. URL: https://cloud.google.com/speech-to-text, accessed: 08/05/2022.

[15] Azure, Speech to text: Convert spoken audio to text for more natural interactions, 2022. URL: https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/, accessed: 08/05/2022.

[16] I. Watson, Speech to text, 2022. URL: https://speech-to-text-demo.ng.bluemix.net/, accessed: 08/05/2022.

[17] W. AI, Natural language for developers, 2022. URL: https://wit.ai/, accessed: 08/05/2022.