<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on AI Evaluation Beyond Metrics, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Item Response Theory to Evaluate Speech Synthesis: Beyond Synthetic Speech Difficulty</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chaina Oliveira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Prudêncio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade Federal de Pernambuco</institution>
          ,
          <addr-line>1235 Prof. Moraes Rego, Recife</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Artificial Intelligence (AI) systems are being developed and improved at an increasing pace, and one of the main challenges is how to evaluate and compare them. Traditional assessment methods do not consider some hidden factors that may influence the quality of these systems and that can help discriminate between them (e.g., between poor and good techniques). In previous work, we used Item Response Theory (IRT) to simultaneously evaluate speech synthesis and recognition. IRT is a paradigm from psychometrics for estimating the cognitive ability of human respondents based on their responses to items with different levels of difficulty. One of the measures we estimated in that previous work was the difficulty of each synthesized speech, but the factors that influence that measure were not deeply explored. In this paper, we go further on this topic and investigate what explains the difficulty of a synthesized speech. We found that the factors that may influence it include the sentence, the locale and the service used to generate the speech. We also performed a preliminary study to investigate the viability of predicting the synthesized speech difficulty using machine learning models: we trained regression models using the speech synthesis parameters as features and the difficulty as the label. The best result was achieved with a Random Forest, which obtained a normalized R2 score of 0.31.</p>
      </abstract>
      <kwd-group>
        <kwd>Item Response Theory</kwd>
        <kwd>Speech Synthesis Evaluation</kwd>
        <kwd>Synthesized Speech Difficulty</kwd>
        <kwd>Speech Quality Measurement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Progress in speech synthesis and recognition research has changed the way we communicate and interact with machines. These techniques can serve as a means of communication in diverse applications. It is common to see mobile users who opt for voice commands instead of the device’s keyboard to execute some task (e.g., call someone, do a Google search, write an e-mail). Such systems have been developed and improved more and more, but we have not seen many advances in how to evaluate them. In a previous paper, we proposed using Item Response Theory (IRT) from psychometrics to evaluate speech synthesis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and in another, we assessed speech synthesis and recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        IRT is commonly used in educational testing to estimate the latent ability of respondents and the difficulty of items. Recently, this evaluation methodology has been adopted in other contexts, including the evaluation of AI systems. In supervised learning, IRT was explored by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to evaluate the ability of classifiers based on their answers to a set of instances (what class each instance belongs to). [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] investigated the importance of analyzing the particular problems in which good techniques fail (e.g., a classifier with good performance does not hit an instance class that a poor one does). For instance, they clarified that it is unfair to evaluate classifiers using just the number of instances they hit; it is also important to analyze the difficulty of the instances classified by the models under test. Furthermore, IRT was also adopted to evaluate the abilities of regression models in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        A more recent way of estimating IRT difficulties was proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The authors suggested that the difficulty of new items can be predicted using a regression model trained with the problem features, using the difficulty as the target. They trained a regression model for a set of domains (e.g., Supervised Learning, Audio Processing, Computer Vision and so on), and the results showed that using this methodology in that context is promising.
      </p>
      <p>
        Recently, we developed a work that adopted IRT to evaluate speech synthesis and speech recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], whose main goal was to estimate the latent ability of Automatic Speech Recognition (ASR) systems, the quality of speakers and the difficulty of synthesized speeches and sentences. First, we extracted 100 benchmark sentences from VoxForge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and synthesized them using English voices from four services with different variations of pitch and rate. This resulted in a set of synthesized audios that were given as input to four ASR systems to be transcribed. After this, we calculated the accuracy of all transcriptions using the word accuracy rate. These accuracy values became the input to our IRT model (i.e., the responses). To estimate the IRT parameters (e.g., synthesized speech difficulties), we adopted the β³-IRT model proposed by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this paper, we present a deeper analysis of the synthesized speech difficulties estimated in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
        in order to understand whether they can be explained by the sentences or the synthesis parameters used to generate the speeches. We analysed the data produced by that previous work and found that the synthesized speech difficulty can be affected by the sentence and by some speech synthesis parameters (e.g., speaker, locale, pitch, rate and gender). We also aimed to know whether a regression model could predict the IRT difficulty in this context. So, we trained MLP, Linear Regression and Random Forest models using the synthesis parameters as features and the difficulty as the label. The Random Forest outperformed the others, obtaining a normalized MAE of 0.50 and a normalized R2 of 0.31.
      </p>
      <p>The proposal of this paper fits the goal of the AI Evaluation Beyond Metrics workshop, since both aim to investigate and give visibility to new, robust approaches to the assessment of AI systems. Like the workshop, we want to explore new assessment methods that cover some limitations of the traditional ones. The evaluation approach used in this work (i.e., IRT) has already been adopted to evaluate other kinds of AI systems, such as classifiers, NLP systems, and so on. Here, we explore the use of IRT in a new context: the evaluation of speech synthesis and recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Item Response Theory</title>
      <p>
        IRT is a methodology from psychometrics that aims to estimate the latent abilities of respondents in tests [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It models the responses to test items based on their difficulties and the skills of the respondents who answered them. This section presents a classical IRT model (i.e., the binary model) and a more recent one (i.e., β³-IRT), which is the model we adopted in this work.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Binary Model</title>
        <p>The binary model, also known as dichotomous, is usually used when a response to an item is positive or negative. In this category, we have the 3-parameter (3PL) IRT model and the 2-parameter (2PL) IRT model. In 3PL, the probability of a correct response is defined by a logistic function of the respondent’s ability and the item’s difficulty, discrimination and guessing. This model returns the Item Characteristic Curve (ICC), which is modeled according to the function below:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[P(U_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - \delta_i)}}]]></tex-math>
        </disp-formula>
        <p>in which:</p>
        <p>• U<sub>ij</sub> is the response of respondent j to item i;</p>
        <p>• δ<sub>i</sub> is the item difficulty (the location parameter of the ICC);</p>
        <p>• a<sub>i</sub> is the item discrimination (the slope of the ICC);</p>
        <p>• c<sub>i</sub> is the guessing parameter (the asymptotic minimum of the ICC);</p>
        <p>• θ<sub>j</sub> is the ability of respondent j.</p>
        <p>It is important to emphasize that when using IRT, unlike traditional evaluation methods, the respondent’s ability is not necessarily estimated only by the number of questions they answer correctly. It also depends on how many difficult items they answer correctly. Similarly, the difficulty of an item is estimated from the respondents who answer it correctly. In other words, to estimate these parameters, we consider the sets of items and respondents under analysis.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. β³-IRT Model</title>
        <p>
          The binary IRT model is applied when the response can be correct or incorrect. In turn, the more recent β³-IRT model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] deals with continuous responses. The authors of β³-IRT applied it in two contexts. The first one was to estimate the responses given by students to items, a typical application of IRT. The second application was in supervised machine learning, in which classifiers and instances were respondents and items, respectively. In turn, the responses were the probabilities of the classifiers assigning the correct class to an instance. The expected response can be calculated by:
        </p>
        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[E[p_{ij} \mid \theta_j, \delta_i, a_i] = \frac{1}{1 + \left(\frac{\delta_i}{1 - \delta_i}\right)^{a_i} \left(\frac{\theta_j}{1 - \theta_j}\right)^{-a_i}}]]></tex-math>
        </disp-formula>
        <p>in which:</p>
        <p>• p<sub>ij</sub> ∈ [0, 1] is the response of respondent j to item i;</p>
        <p>• δ<sub>i</sub> is the difficulty of item i;</p>
        <p>• θ<sub>j</sub> is the ability of respondent j;</p>
        <p>• a<sub>i</sub> is the discrimination of item i.</p>
        <p>Some ICCs that can be modeled by Eq. 2 are shown in Figure 1. Each plot shows the curve with different values of difficulty and discrimination. When a<sub>i</sub> = 2, the curve assumes a sigmoidal shape. If the discrimination is 1, the curve is parabolic, but if that parameter is between 0 and 1, the ICC assumes an anti-sigmoidal behavior.</p>
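        <p>The two response functions discussed in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function and argument names are hypothetical, the first function assumes the standard 3PL logistic form, and the second assumes the expected response of the β³-IRT model as published by its authors, with ability and difficulty restricted to the open interval (0, 1).</p>
        <preformat>
```python
import math


def icc_3pl(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under the 3PL logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    # The guessing parameter is the lower asymptote of the curve.
    return guessing + (1.0 - guessing) * logistic


def expected_response_beta3(ability, difficulty, discrimination=1.0):
    """Expected continuous response under beta3-IRT.

    Assumes ability and difficulty both lie strictly between 0 and 1.
    """
    item_odds = difficulty / (1.0 - difficulty)
    ability_odds = ability / (1.0 - ability)
    return 1.0 / (1.0 + item_odds ** discrimination * ability_odds ** (-discrimination))
```
        </preformat>
        <p>In both functions, a respondent whose ability equals the item difficulty (with no guessing) gets an expected response of 0.5, which is a quick sanity check on any implementation.</p>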
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. IRT to Evaluate Speech Synthesis</title>
    </sec>
    <sec id="sec-4">
      <sec id="sec-4-1">
        <p>
          In previous work [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we developed a two-level IRT model to evaluate speech synthesis and recognition. This model is illustrated in Figure 2. In the first level, an item is a synthesized speech produced from a given sentence and a speaker, and the respondent is an ASR system. Each response is the transcription accuracy observed when a synthesized speech is given as input to the ASR system (i.e., the word accuracy rate). An IRT model identifies latent patterns of responses to estimate the difficulty of each synthesized speech and the ability of each ASR system. In the second level, the synthesized speech’s difficulty is decomposed into two latent factors: the sentence’s difficulty and the speaker’s quality. In this work, we focus on the first level: our main goal is to find characteristics that may influence the estimated difficulty of a synthesized speech, so we analyze and use the data generated and estimated at Level 1 of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
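        <p>To make the notion of a response concrete, the sketch below computes a word accuracy value for one transcription. It is only an illustration under stated assumptions: word accuracy is taken to be one minus the word error rate, computed with a word-level edit distance and clipped to the interval [0, 1]; the exact definition and text normalization used in [2] may differ, and the function names are hypothetical.</p>
        <preformat>
```python
def edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, transcription):
    """Word accuracy as 1 - WER, clipped to [0, 1] (assumed definition)."""
    reference = reference_text.lower().split()
    hypothesis = transcription.lower().split()
    if not reference:
        raise ValueError("reference sentence must contain at least one word")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```
        </preformat>
        <p>A truncated transcription, such as an ASR system returning only the first half of a long sentence, lowers this value sharply, because every missing word counts as a deletion.</p>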
        <p>Figure 3 shows two ICCs of synthesized speeches with low and high difficulty, respectively. In the first one (i.e., 6613), all Automatic Speech Recognition (ASR) systems obtained a high response value for that item. However, almost all (3 of 4) ASR systems obtained a low response value for the most difficult item (i.e., 2829).</p>
        <p>
          A variety of sentences, speakers and automatic speech recognizers were used by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as presented below:
        </p>
        <p>
          • Sentences: The sentences were extracted from VoxForge [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], an open speech dataset. A total of 100 English sentences of different lengths were adopted. Figure 4 shows the distribution of sentence length (number of characters), with a median of 51.5. The shortest sentence has 12 characters, and the longest has 134.
        </p>
        <p>
          • Speakers: The speakers are from four different services: Amazon Polly [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], Google Text to Speech API [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], IBM Watson Text to Speech [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and Microsoft Azure Text to Speech [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Each service has speakers with different English accents, genders, pitches and rates.
        </p>
        <p>
          • Automatic Speech Recognizers: The recognizers adopted in this work were Google Speech to Text [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], Microsoft Azure Speech to Text [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], IBM Watson Speech to Text [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and Wit [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. They were responsible for receiving a synthesized speech and generating a transcription of that audio (the sentence the recognizer understood).
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a total of 15,000 synthesized speeches were produced. Each one was generated from a single sentence and a speaker setting. The IRT model estimated the difficulty of each speech, with the distribution presented in Figure 5. The difficulty lies between 0 and 0.9, and most speeches have a difficulty between 0.2 and 0.6. We also do not see a representative peak, which means that no specific difficulty value is shared by a large part of the synthesized speeches.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4. Experiments and Results</title>
        <p>
          The IRT model provided in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] estimated the difficulty value of each speech, but the aspects that impacted the difficulty across speeches were not deeply investigated. In this paper, we explore the inferred difficulty of the synthesized speeches in depth, aiming to observe its relation to speech synthesis parameters and sentence features. For instance, does the length of a sentence influence the difficulty? Are longer sentences easier or more difficult to synthesize than short ones? Is gender somehow related to difficulty? So, in Section 4.1, we explore the relationship between specific synthesis parameters and the difficulty. We show the difficulty distribution among the groups of each feature and perform statistical tests to check for significant differences between them. For instance, we present the difficulty distribution of each gender and perform the statistical test between the difficulty values of male and female speeches. We also aimed to know whether we could predict the synthesized speech difficulty. Thus, in Section 4.2, we present insights and results of a preliminary predictive model we developed to predict the difficulty, using the synthesis parameters as predictor attributes and the difficulty as the target attribute.
        </p>
        <sec id="sec-4-2-1">
          <title>4.1. How Synthesis Parameters Influence the Difficulty of a Synthetic Speech?</title>
          <p>Initially, we aimed to understand whether the size of the sentences has any relation to the difficulty. We noticed that the longer the sentence (in characters or in number of words), the more difficult it tends to be, as seen in Figure 6. We inspected some cases and found that, depending on the parameters used to synthesize the speeches, they are not fully transcribed by some recognizers. This directly affects the word accuracy rate, the response used as input to the IRT model. Table 1 shows examples of transcriptions of two of the longest sentences of our dataset. Note that just a part of them was transcribed, which impacts the mean difficulty of those sentences.</p>
          <p>Two of the speech parameters we explored were pitch and rate. We generated speeches with three different pitch values (i.e., low, medium and high). Figure 7 shows the distribution of the synthesized speech difficulty for each pitch group. Each box represents 50% of the difficulty values of the respective group. In turn, the lower and upper whiskers represent the difficulties outside the box and indicate the variability of the data outside the lower and upper quartiles (i.e., the lower and upper box lines). The line that divides each box into two parts is the median: half of the difficulty values are greater than or equal to that value, and half are less. For instance, Figure 7 shows that speeches with low or medium pitch tend to be easier than the ones with high pitch, since the median difficulty of the synthesized speeches with high, medium and low pitch is 0.42, 0.38 and 0.37, respectively. That is, speeches with high pitch tend to be the most difficult, whilst the ones with low pitch are the easiest. Regarding the rate (Figure 8), we noticed that speeches with a fast rate tend to be more difficult, whereas the ones with a medium rate are the easiest.</p>
          <p>As we used four services to synthesize the speeches, we aimed to investigate whether speeches from a specific synthesizer are more difficult than the ones generated by the others, and we confirmed that, as shown in Figure 9. The speeches from Azure are the most difficult. In turn, the ones from Watson are the easiest. In the middle, we have Google and Polly, with the latter tending to generate easier synthesized speeches than the service from Google.</p>
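          <p>The per-group comparison used in this section (the median and quartiles of the difficulty for each value of a synthesis parameter) can be sketched as follows. This is a minimal illustration using only the Python standard library; the record layout, the function name and the difficulty values are hypothetical and are not the paper’s data.</p>
          <preformat>
```python
from collections import defaultdict
from statistics import median, quantiles


def summarize_by_group(records, feature):
    """Return {group: (q1, median, q3)} of the difficulty for one feature."""
    groups = defaultdict(list)
    for record in records:
        groups[record[feature]].append(record["difficulty"])
    summary = {}
    for name, values in groups.items():
        # The box of a box plot spans the first to the third quartile.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = (q1, median(values), q3)
    return summary


# Hypothetical difficulties, for illustration only (not the paper's data).
records = [
    {"pitch": "low", "difficulty": 0.30},
    {"pitch": "low", "difficulty": 0.35},
    {"pitch": "low", "difficulty": 0.40},
    {"pitch": "high", "difficulty": 0.40},
    {"pitch": "high", "difficulty": 0.45},
    {"pitch": "high", "difficulty": 0.50},
]
summary = summarize_by_group(records, "pitch")
```
          </preformat>
          <p>The same function can be called with any other categorical field of the records (for example a service or locale field, if present) to compare groups of that feature.</p>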
        </sec>
      </sec>
      <sec id="sec-4-3">
        <p>The relation between gender and locale (i.e., type of English) and difficulty was also analyzed. Figures 11 and 10 show the synthesized speech difficulty distribution by gender and by locale, respectively, and Figure 12 shows the mean difficulty of each gender by locale. We see that female voices are more difficult than male ones. Regarding the type of English, synthesized speeches in United States English are the easiest ones, whereas speeches in Australian English are the most difficult, followed by British English and Indian English, respectively. Furthermore, female voices are more difficult than male voices in all locales (except for Australian English, for which there are no male voices in our database to compare).</p>
        <p>We performed an ANOVA statistical test among the groups of each feature shown in this section’s plots (Figures 7-11) to check for significant differences between them. The p-value obtained was significant in all cases (p &lt; 0.01), so we conclude that there are significant differences among the groups.</p>
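        <p>The text does not state which ANOVA implementation was used. As a rough sketch, assuming a plain one-way ANOVA over the difficulty values of each group, the F statistic can be computed as below; the function name is hypothetical. The p-value would then be read from the F distribution with the returned degrees of freedom, for example with scipy.stats.f_oneway, which computes both the statistic and the p-value.</p>
        <preformat>
```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within
```
        </preformat>
        <p>A large F value means that the difficulty differs more between groups (e.g., between services) than within each group, which is the pattern reported for all features analyzed here.</p>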
        <sec id="sec-4-3-1">
          <title>4.2. Predicting the Difficulty of a Synthetic Speech</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <p>This section presents the experiments we performed to evaluate the predictability of the synthesized speech difficulty. As we have the sentences and speaker parameters used to generate the speeches (e.g., pitch, rate, speaker, locale), we investigated whether the difficulty can be predicted using these parameters as predictor attributes (Table 2). Thus, we trained some regression models with the difficulty as the target attribute.</p>
        <p>The regression models we trained were MLP, Linear Regression and Random Forest from scikit-learn, a machine learning Python library. We encoded the categorical features (e.g., sentence, speaker, and so on) using the label encoding method, also from scikit-learn. We also ran each model using four different combinations of features (Table 3):</p>
        <p>• Combination 1: all features (Table 2).</p>
        <p>• Combination 2: all features except the sentence.</p>
        <p>• Combination 3: all features except the speaker.</p>
        <p>• Combination 4: all features except the sentence and the speaker.</p>
        <p>The Random Forest trained with all features outperformed all other models. It had a normalized MAE and R2 of 0.50 and 0.31, respectively (Table 3). Figure 13 shows the feature importances, i.e., the scores of the features we used to train the Random Forest model with all features (combination 1). Higher values mean that a feature has more effect on the prediction process. The feature with the most effect is the sentence, followed by the size of the sentence (i.e., len_sentence), service, speaker, number of words, pitch, rate, locale and gender. For instance, the feature service is more useful for predicting the synthesized speech difficulty than the rate. In fact, in Section 4.1 we could see that the tendency of some services to generate more difficult speeches is more explicit than that of some rates. In other words, the difficulty distributions differ more between services than between rates.</p>
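        <p>The pipeline described above (label-encoding the categorical synthesis parameters, fitting a Random Forest regressor and reading its feature importances) can be sketched as follows. This is an illustrative sketch only: it assumes scikit-learn’s LabelEncoder and RandomForestRegressor, and the rows, feature subset, service names and hyperparameters are hypothetical, not the configuration or data used in the experiments.</p>
        <preformat>
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows: sentence, service, pitch, rate, IRT difficulty.
rows = [
    ("open the door", "service_a", "low", "slow", 0.21),
    ("open the door", "service_b", "high", "fast", 0.48),
    ("call my mother now", "service_a", "medium", "medium", 0.30),
    ("call my mother now", "service_b", "high", "fast", 0.55),
    ("what time is it", "service_a", "low", "medium", 0.25),
    ("what time is it", "service_b", "medium", "slow", 0.41),
]
feature_names = ["sentence", "service", "pitch", "rate"]

# Label-encode each categorical column independently.
columns = list(zip(*rows))
encoded = [LabelEncoder().fit_transform(list(column)) for column in columns[:-1]]
X = [list(values) for values in zip(*encoded)]
y = list(columns[-1])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
predictions = model.predict(X)

# Importance scores sum to 1; higher means more effect on the prediction.
importances = dict(zip(feature_names, model.feature_importances_))
```
        </preformat>
        <p>Dropping a column from the feature list before encoding reproduces the idea behind the feature combinations listed above. A realistic evaluation would also hold out part of the data instead of predicting on the training rows, as done here for brevity.</p>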
        <p>This was a preliminary study to analyze the viability of using three different types of models to predict the synthesized speech difficulty. The experiments suggest that, given a sentence and the synthesis parameters of a new speech we want to synthesize, we can predict its difficulty without having to run an IRT model again: the dataset we already constructed can be used to train a model that performs that prediction. In the near future, we aim to delve deeper into this with more experiments and further analysis. We can explore adding more features related to phonemes, for instance.</p>
        <p>Also we can test our models with a newly .</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>
        In this paper, we investigated what explains the synthesized speech difficulty. We analyzed in depth the data from an experiment we performed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], focusing on finding out whether the difficulty of a synthesized speech can be explained by the sentence or by any other parameter used in the synthesis process (e.g., pitch, rate, speaker).
      </p>
      <p>The results of our descriptive analysis showed that longer sentences tend to be more difficult. Also, some services or locales generate easier speeches than others, and female voices are more difficult than male ones. We also trained regression models in order to see whether we can predict the synthesized speech difficulty. Our preliminary experiment showed that this approach may be useful in this context. So, in the future, we aim to investigate this topic further, training more robust models and adding more features to see whether we gain more insights and even better results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>This work was supported by CAPES, CNPq and FACEPE (Brazilian funding agencies) and Motorola Mobility.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Tenório</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          ,
          <article-title>Item response theory to estimate the latent ability of speech synthesizers</article-title>
          .,
          <source>in: ECAI</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1874</fpage>
          -
          <lpage>1880</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Moraes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Silva</surname>
          </string-name>
          <string-name>
            <surname>Filho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          ,
          <article-title>A two-level item response theory model to evaluate speech synthesis and recognition</article-title>
          ,
          <source>Speech Communication</source>
          <volume>137</volume>
          (
          <year>2022</year>
          )
          <fpage>19</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Filho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B. C.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Diethe</surname>
          </string-name>
          , P. Flach,
          <article-title>β³-IRT: A new item response model and its applications</article-title>
          ,
          <source>in: Proceedings of Machine Learning Research</source>
          , volume
          <volume>89</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>1013</fpage>
          -
          <lpage>1021</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Plumed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B. C.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          , A. Martínez-Usó, J.
          <string-name>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <article-title>Making sense of item response theory in machine learning</article-title>
          ,
          <source>in: Proceedings of the Twenty-second European Conference on Artificial Intelligence</source>
          , IOS Press,
          <year>2016</year>
          , pp.
          <fpage>1140</fpage>
          -
          <lpage>1148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Plumed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          , A. Martínez-Usó, J.
          <string-name>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <article-title>Item response theory in ai: Analysing machine learning classifiers at the instance level</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>271</volume>
          (
          <year>2019</year>
          )
          <fpage>18</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Moraes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Reinaldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferreira-Junior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Silva</surname>
          </string-name>
          <string-name>
            <surname>Filho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Prudêncio</surname>
          </string-name>
          ,
          <article-title>Evaluating regression algorithms at the instance level using item response theory</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          (
          <year>2022</year>
          )
          <fpage>108076</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Plumed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Castellano-Falcón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monserrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <article-title>When AI difficulty is easy: The explanatory power of predicting IRT difficulty</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] VoxForge, Voxforge,
          <year>2022</year>
          . URL: http://www.voxforge.org/, access in: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>De Ayala</surname>
          </string-name>
          ,
          <article-title>The theory and practice of item response theory</article-title>
          ,
          <source>Guilford Publications</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <collab>Amazon Web Services (AWS)</collab>
          ,
          <article-title>Amazon Polly: Turn text into lifelike speech using deep learning</article-title>
          ,
          <year>2022</year>
          . URL: https://aws.amazon.com/polly/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <collab>Google Cloud</collab>
          ,
          <article-title>Cloud text-to-speech: Text-to-speech conversion powered by machine learning</article-title>
          ,
          <year>2022</year>
          . URL: https://cloud.google.com/text-to-speech/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <collab>IBM Watson</collab>
          ,
          <article-title>Text to speech</article-title>
          ,
          <year>2022</year>
          . URL: https://text-to-speech-demo.ng.bluemix.net/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <collab>Microsoft Azure</collab>
          ,
          <article-title>Text to speech: Convert text to lifelike speech for more natural interfaces</article-title>
          ,
          <year>2022</year>
          . URL: https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <collab>Google Cloud</collab>
          ,
          <article-title>Speech-to-text: Speech-to-text conversion powered by machine learning</article-title>
          ,
          <year>2022</year>
          . URL: https://cloud.google.com/speech-to-text, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <collab>Microsoft Azure</collab>
          ,
          <article-title>Speech to text: Convert spoken audio to text for more natural interactions</article-title>
          ,
          <year>2022</year>
          . URL: https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>IBM Watson</collab>
          ,
          <article-title>Speech to text</article-title>
          ,
          <year>2022</year>
          . URL: https://speech-to-text-demo.ng.bluemix.net/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <collab>Wit.ai</collab>
          ,
          <source>Natural language for developers</source>
          ,
          <year>2022</year>
          . URL: https://wit.ai/, accessed: 08/05/2022.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>