DARC-IT: a DAtaset for Reading Comprehension in ITalian

Dominique Brunato⋄, Martina Valeriani•, Felice Dell'Orletta⋄

• University of Pisa
marti.valeriani@gmail.com

⋄ Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it


Abstract

English. In this paper, we present DARC-IT, a new reading comprehension dataset for the Italian language aimed at identifying 'question-worthy' sentences, i.e. sentences in a text which contain information that is worth asking a question about.[1] The purpose of the corpus is twofold: to investigate the linguistic profile of question-worthy sentences and to support the development of automatic question generation systems.

Italiano. In questo contributo, viene presentato DARC-IT, un nuovo corpus di comprensione scritta per la lingua italiana per l'identificazione delle frasi che si prestano ad essere oggetto di una domanda.[2] Lo scopo di questo corpus è duplice: studiare il profilo linguistico delle frasi informative e fornire un corpus di addestramento a supporto di un sistema automatico di generazione di domande di comprensione.

[1] The corpus will be made publicly available for research purposes at the following link: http://www.italianlp.it/resources/
[2] Il corpus sarà messo a disposizione liberamente per scopi di ricerca al seguente indirizzo: http://www.italianlp.it/resources/

1 Introduction

Reading comprehension (RC) can be defined as “the process of simultaneously extracting and constructing meaning through interaction and involvement with written language” (Snow, 2002). Such a definition emphasizes that RC is a complex human ability that can be decomposed into multiple operations, such as coreference resolution, understanding discourse relations, commonsense reasoning and reasoning across multiple sentences. In educational scenarios, students' comprehension and reasoning skills are typically assessed through a variety of tasks, ranging from prediction tasks (e.g. cloze tests) to retelling generation and question answering, which are costly to produce and require domain expert knowledge. Given also the challenges posed by the broad diffusion of distance learning programs, such as MOOCs (Massive Open Online Courses), the automatic assessment of RC is becoming a rapidly growing research field of Natural Language Processing (NLP). While much more work has been done on developing Automated Essay Scoring (AES) systems (Passonneau et al., 2017), recent studies have focused on the automatic generation of questions to be used for evaluating humans' reading comprehension (Du and Cardie, 2017; Afzal and Mitkov, 2014). This is not a trivial task, since it assumes the ability to understand which concepts in a text are most relevant, where relevance can here be defined as the likelihood of a passage being worth asking a question about. The availability of large and high-quality RC datasets containing questions posed by humans on a given text thus becomes a fundamental requirement to train data-driven systems able to automatically learn what makes a passage 'question-worthy'. In this regard, datasets collected for other NLP tasks, Question Answering above all, provide a valuable resource. One of the most widely used is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). It contains more than 100,000 questions posed by crowdworkers on a set of Wikipedia articles, in which the answer to each question is a segment of text from the corresponding reading passage.
More recently, other large RC datasets have been released: this is the case of the TriviaQA dataset (Joshi et al., 2017), which is intended to be more challenging than SQuAD since it contains a higher proportion of complex questions, i.e. questions requiring inference over multiple sentences. The same holds for RACE (Lai et al., 2017), which is also the only one specifically designed for educational purposes. Indeed, it covers multiple domains and written styles, and contains questions generated by domain experts, i.e. English teachers, to assess the reading comprehension skills of L2 learners. While all these datasets are available for the English language, to our knowledge no similar RC datasets exist for Italian. In this paper we introduce a new corpus for Italian specifically conceived to support research on the automatic identification of question-worthy passages. In what follows, we first describe the typology of texts it contains and the annotation process we performed on them. We then carry out a qualitative analysis based on linguistic features automatically extracted from the texts, with the aim of studying, on the one hand, which features mostly discriminate question-worthy sentences from other sentences and, on the other hand, whether the two classes of sentences have a different profile in terms of linguistic complexity.

2 Dataset Collection

The first step in the process of corpus construction was the selection of appropriate materials. As noted by Lai et al. (2017), a major drawback of many existing RC datasets is that they were either crowd-sourced or automatically generated, thus paying very little attention to the intended target user; this makes them less suitable for use in real educational scenarios. To prevent these limitations, we relied on a corpus of reading comprehension tests designed by the National Institute for the Evaluation of the Education System (INVALSI), the Italian institution in charge of developing standardized tests for the assessment of the numeracy and literacy skills of primary, middle and high school students.
To create the corpus, we focused only on tests designed to assess students' competences in the Italian language. We thus collected a total of 86 Italian tests administered between 2003 and 2013: 31 targeting primary school pupils of the second, fourth and fifth grade (cf. Table 1), 29 targeting students of the first and third year of middle school, and 26 targeting students of the first, second and third grade of high school. To each text a number of questions is associated, which aim to assess in depth students' reading and understanding abilities. As documented by the last available technical report provided by the Institute,[3] the INVALSI Italian test has been designed to cover seven main aspects underlying text comprehension, namely: understanding the meaning of words; identifying explicit information; inferring implicit information; detecting elements conveying cohesion and coherence in the text; comprehending the meaning of a passage by integrating both implicit and explicit information; comprehending the meaning of the whole text; and generating a meaningful interpretation (e.g. understanding the message, the purpose, etc.). With respect to their form, questions can be either multiple-choice (typically with 3 or 4 options, see example (1)) or, more rarely, open-ended (example (2)).

Example (1): Dove abita il ragno del racconto? (Where does the spider of the story live?)
A. In un albero del bosco. (In a forest tree)
B. Sopra un fiore del bosco. (On a forest flower)
C. In una siepe del bosco. (In a forest hedge)

Example (2): Dopo aver letto il testo, qual è secondo te il messaggio che vuole dare l'autore? (After reading the text, what do you think is the message the author wants to give?)

For the purpose of our study, we selected only the first type of questions, thus obtaining a total of 354 questions. Table 1 reports some statistics about the final corpus collected from the INVALSI tests.

 School Grade     Texts    Sentences    Questions
 2nd Primary      10       195          75
 4th Primary      9        205          36
 5th Primary      12       427          50
 1st Middle       19       513          72
 3rd Middle       10       342          48
 1st High         10       303          32
 2nd High         7        211          18
 3rd High         9        261          23
 TOT              86       2457         354

Table 1: Total number of texts, total number of sentences and corresponding questions for each school grade in DARC-IT.

[3] http://www.invalsi.it/invalsi/doc_eventi/2017/Rapporto_tecnico_SNV_2017.pdf
2.1 Annotation Scheme

For each question of the corpus, the annotation process was meant to identify the sentence (or sentence span) containing the corresponding answer. This information was marked in the text by enclosing the relevant text span in opening and closing XML tags labelled with an upper-case R.

The outcome of the annotation process was a tabular file with the following information reported in separate columns: i) the text segmented into sentences; ii) a binary value, 1 vs 0 (1 if the sentence contains the answer to the question and 0 if not); iii) the corresponding question; iv) the answer provided by the annotator. Table 2 gives an example of the dataset structure.

 Sentence                               Class   Tag                                  Question                  Answer
 La lucciola si preparò e, quando      0
 calò la sera, andò all'appuntamento.
 Entrò nel bosco scuro e raggiunse     1       <R>Entrò nel bosco scuro e          Dove abita il ragno       In una siepe
 la siepe dove viveva il ragno.                 raggiunse la siepe dove viveva       del racconto?             del bosco.
                                                il ragno.</R>

Table 2: Sample output of the dataset structure.
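
As a concrete illustration, such a file can be loaded in a few lines of code. The sketch below is ours and not part of the released resource: the file name darc_it.tsv and the tab-separated, five-column layout mirroring Table 2 are assumptions.

    import csv

    # Minimal sketch (our assumption, not the released format): read a
    # tab-separated file whose five columns mirror Table 2.
    def load_darc_it(path):
        records = []
        with open(path, encoding="utf-8") as f:
            for sentence, label, tag, question, answer in csv.reader(f, delimiter="\t"):
                records.append({
                    "sentence": sentence,
                    "label": int(label),   # 1 = contains the answer, 0 = does not
                    "tag": tag,            # answer span enclosed in <R>...</R> tags
                    "question": question,
                    "answer": answer,
                })
        return records

    # Example usage: keep only the 'question-worthy' sentences.
    worthy = [r for r in load_darc_it("darc_it.tsv") if r["label"] == 1]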

A qualitative inspection of the corpus allowed us to identify different typologies of 'question-worthy' sentences: sentences that were the target of one question only (this is the case of the second sentence reported in Table 2); sentences that were the target of multiple questions, such as (4); and sentences that only partially answered the question (i.e. the whole information required to give the answer is spread across multiple sentences), such as (5).

(4) Question-worthy sentence: Leo decide di aiutare gli animali della giungla (Leo decides to help the jungle animals)

Corresponding questions:

• Qual è la cosa più importante per Leo? (What is the most important thing for Leo?)
  Multiple choice answers: A. Essere un bravo cacciatore. (To be a good hunter); B. Diventare il più coraggioso di tutti. (To become the bravest of all); C. Rendersi utile agli altri. (To make himself useful to others); D. Fare nuove esperienze. (To have new experiences).

• Cosa sceglie di fare Leo nella giungla? (What does Leo choose to do in the jungle?)
  Multiple choice answers: A. Giocare con tutti. (To play with everybody); B. Dormire e mangiare. (To sleep and eat); C. Aiutare chi è in difficoltà. (To help people in need); D. Nuotare nell'acqua del fiume. (To swim in the river water)

(5) Question-worthy sentences: “Io farò il postino!” Disse uno. “Io farò il maestro!” Disse un altro. “E io farò lo chef!” Urlò un terzo e salì sul vagone delle marmellate. (“I'm going to be a postman!” said one. “I'm going to be a teacher!” said another. “And I'm going to be a chef!” shouted a third, climbing onto the jam wagon.)

Corresponding question: A che cosa pensano i bambini quando vedono gli oggetti sul treno? (What do the children think about when they see the items on the train?)

Multiple choice answers: A. Ai giochi che potranno fare. (The games they will be able to play); B. A cose utili che si possono vendere. (Useful things that can be sold); C. Ai regali che vorrebbero ricevere. (The presents they would like to receive); D. Ai lavori che faranno da grandi. (The jobs they will do as adults.)

3 Linguistic Analysis

As a result of the annotation process, we obtained 398 'question-worthy' sentences and 2,059 'non-question-worthy' sentences. Starting from this classification, we carried out an in-depth linguistic analysis based on a wide set of features capturing properties of a sentence at the lexical, morpho-syntactic and syntactic levels. The aim of this analysis was to understand whether there are linguistic features that allow predicting the 'likelihood' of a sentence being the target of a question. To allow the extraction of linguistic features, all sentences were automatically tagged by the part-of-speech tagger described in (Dell'Orletta, 2009) and dependency parsed by the DeSR parser described in (Attardi et al., 2009).
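
For illustration, the sketch below computes three of the features reported in Table 3 (sentence length, verbal heads, maximum link length) from a single dependency-parsed sentence. It is our simplified rendering, not the actual feature-extraction pipeline built on the tools cited above.

    # Simplified sketch (not the actual pipeline): each token is a tuple
    # (id, form, coarse POS, head id), as in a CoNLL-style parse.
    def sentence_features(tokens):
        sentence_length = len(tokens)
        # Verbal heads: verbs governing at least one dependent.
        head_ids = {head for (_, _, _, head) in tokens if head != 0}
        verbal_heads = sum(1 for (tid, _, pos, _) in tokens
                           if pos == "VERB" and tid in head_ids)
        # Max link length: longest linear distance between a head and a dependent.
        max_link = max(abs(tid - head) for (tid, _, _, head) in tokens if head != 0)
        return sentence_length, verbal_heads, max_link

    # Toy parse of "Entrò nel bosco scuro" (the root verb has head id 0).
    toy = [(1, "Entrò", "VERB", 0), (2, "nel", "ADP", 3),
           (3, "bosco", "NOUN", 1), (4, "scuro", "ADJ", 3)]
    print(sentence_features(toy))  # -> (4, 1, 2)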

Table 3 shows an excerpt of the first 20 features (of the 177 extracted) for which the average difference between their values in the 'question-worthy' and 'non-question-worthy' classes was highly statistically significant according to the Wilcoxon rank sum test.[4] As can be seen, sentences on which a comprehension question was asked are on average much longer. This could be expected: the longer the sentence, the higher the probability that it is more informative and thus contains concepts that are worth asking a question about. This is also suggested by the higher distribution of proper nouns [10], most likely referring to relevant semantic types (e.g. person, location) which typically occur in narrative, the main textual genre of the INVALSI tests. The greater length of 'question-worthy' sentences also has effects at the morpho-syntactic and syntactic levels, as shown e.g. by the higher proportion of conjunctions introducing subordinate clauses ([7] Subord. conj: 1.63 vs 1.50) and by the presence of longer syntactic relations, in which the linear distance between the 'head' and the 'dependent' is higher than 10 tokens ([20] Max link: 11.30 vs 8.30).

[4] All significant features are shown in Appendix (A).

 Features                     Question          NoQuestion
                              Avg (StDev)       Avg (StDev)
 Raw Text features
 [1] Sentence length*         29.00 (16.11)     20.00 (13.75)
 Morpho-syntactic features
 [2] Punctuation*             4.74 (2.82)       7.70 (6.23)
 [3] Negative adv*            1.23 (2.82)       1.19 (3.13)
 [4] Coord. conj*             3.50 (3.40)       3.20 (3.81)
 [5] Poss. adj*               0.96 (2.10)       0.89 (2.33)
 [6] Relative pron*           1.14 (2.00)       1.12 (2.32)
 [7] Subord. conj*            1.63 (2.80)       1.50 (2.90)
 [8] Prepositions*            7.90 (5.01)       7.60 (6.20)
 [9] Determiners*             9.13 (5.00)       9.00 (6.20)
 [10] Proper nouns*           2.05 (3.90)       2.00 (4.30)
 [11] Numbers                 0.66 (1.87)       0.64 (2.25)
 [12] Verbs                   15.98 (6.32)      16.97 (8.18)
 [13] Indicat. mood*          57.00 (30.70)     60.00 (33.82)
 [14] Particip. mood          7.13 (14.22)      6.34 (14.88)
 [15] 3rd pers. verb*         55.15 (39.50)     45.20 (42.62)
 [16] Conjunctions            5.10 (4.35)       4.34 (4.66)
 Syntactic features
 [17] Clause length*          8.63 (4.34)       7.90 (4.24)
 [18] Verbal heads*           4.00 (2.30)       3.00 (2.03)
 [19] Postverb Subj*          13.60 (27.00)     15.70 (32.00)
 [20] Max link*               11.30 (7.06)      8.30 (6.80)

Table 3: Linguistic features whose average difference between the two classes was statistically significant. For each feature, the average value (Avg) and the standard deviation (StDev) are reported. All differences are statistically significant at p<.005; those marked with * also at p<.001. (Note: Question = question-worthy sentences; NoQuestion = non-question-worthy sentences.)
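
The significance testing used here is standard and easy to reproduce. A minimal sketch with scipy, using invented placeholder values rather than the actual DARC-IT feature vectors:

    from scipy.stats import ranksums

    # Wilcoxon rank sum test on one feature (e.g. sentence length) across the
    # two classes; the value lists are invented placeholders, not DARC-IT data.
    worthy_lengths = [29, 35, 18, 41, 27]
    non_worthy_lengths = [20, 12, 25, 16, 9]
    stat, p_value = ranksums(worthy_lengths, non_worthy_lengths)
    print(f"W = {stat:.3f}, p = {p_value:.4f}")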

A further analysis was meant to investigate the profile of question-worthy sentences with respect to linguistic complexity. To this end, we exploited READ-IT (Dell'Orletta et al., 2011), a general-purpose readability assessment tool for Italian which combines traditional raw text features with lexical, morpho-syntactic and syntactic information to operationalize multiple phenomena of text complexity. READ-IT assigns different readability scores using the following four models: 1) the Base Model, relying on raw text features only (e.g. average sentence and word length); 2) the Lexical Model, relying on a combination of raw text and lexical features; 3) the Syntax Model, relying on morpho-syntactic and syntactic features; 4) the Global Model, combining all feature types (raw text, lexical, morpho-syntactic and syntactic features).

Results are reported in Table 4. As can be noted, question-worthy sentences show higher complexity under all models. Especially at the syntactic level this could be expected, given the higher values obtained for the features related to syntactic complexity, which turned out to be significantly involved in discriminating these sentences.

                      Question    NoQuestion
 READ-IT Base         59.9%       21.1%
 READ-IT Lexical      98.9%       66.4%
 READ-IT Syntactic    69.3%       37.5%
 READ-IT Global       100%        95%

Table 4: Readability scores obtained by the different READ-IT models.

4 Conclusion

We presented DARC-IT, a new reading comprehension dataset for Italian, collected from a sample of the standardized evaluation tests used to assess students' reading comprehension at different grade levels. For each text, we annotated 'question-worthy' sentences, i.e. sentences which contained the answer to a given question. A qualitative analysis of these sentences showed that the likelihood of a sentence being 'question-worthy' can be modeled using a set of linguistic features, especially those linked to syntactic complexity. We believe that this corpus can support research on the development of automatic question generation systems as well as question answering systems. Current developments go in several directions: we are carrying out a first classification experiment to automatically predict 'question-worthy' sentences and to evaluate the impact of linguistic features on classifier performance. We are also planning to enlarge the corpus and to investigate in more depth the typology of questions and answers it contains, in order to study what characterizes sentences answering, for instance, factual vs non-factual questions.
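
Purely by way of illustration, such a classification experiment could be set up along the following lines. This sketch is ours, not the experiment in progress, and the feature vectors are invented placeholders.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Sketch of a baseline: predict question-worthiness from per-sentence
    # linguistic feature vectors (placeholder values; columns could be e.g.
    # sentence length, verbal heads, max link length).
    X = [[29, 4, 11], [35, 5, 14], [20, 3, 8], [12, 2, 5], [41, 6, 15], [16, 2, 7]]
    y = [1, 1, 0, 0, 1, 0]  # 1 = question-worthy, 0 = not

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=3, scoring="f1"))  # F1 per fold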

5 Acknowledgments

The work presented in this paper was partially supported by the 2-year project (2018-2020) SchoolChain – Soluzioni innovative per la creazione, la certificazione, il riuso e la condivisione di unità didattiche digitali all'interno del sistema Scuola (innovative solutions for the creation, certification, reuse and sharing of digital learning units within the school system), funded by Regione Toscana (BANDO POR FESR 2014-2020).
References

Naveed Afzal and Ruslan Mitkov. 2014. Automatic generation of multiple choice questions using dependency-based semantic relations. Soft Computing, 18(7), 1269–1281.

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009 – Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, Italy, December 2009.

Felice Dell'Orletta. 2009. Ensemble system for part-of-speech tagging. In Proceedings of EVALITA 2009 – Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, Italy, December 2009.

Felice Dell'Orletta, Simonetta Montemagni and Giulia Venturi. 2011. READ-IT: assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies (SLPAT 2011), Edinburgh, UK, 73–83.

Xinya Du and Claire Cardie. 2017. Identifying Where to Focus in Reading Comprehension for Neural Question Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark.

Mandar Joshi, Eunsol Choi, Daniel Weld and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 1601–1611.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang and Eduard H. Hovy. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark.

Rebecca J. Passonneau, Ananya Poddar, Gaurav Gite, Alisa Krivokapic, Qian Yang and Dolores Perin. 2016. Wise Crowd Content Assessment and Educational Rubrics. International Journal of Artificial Intelligence in Education, 28, 29–55.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2383–2392.

Catherine Snow. 2002. Reading for Understanding: Toward an R&D Program in Reading Comprehension. RAND Corporation.
Appendix (A).

                                            Question-worthy sentences   Non Question-worthy Sentences
        Features                            Average (StDev)             Average (StDev)
                                                Raw Text features
        Sentence length***                  29.00      (16.11)          20.00     (13.75)
                                                  Lexical features
        % Basic Italian Vocabulary (BIV)*   88.54      (8.53)           88.99     (10.66)
        % Fundamental BIV**                 78.26      (10.83)          79.59     (13.23)
        % ‘High Usage’ BIV*                 12.31      (8.12)           12.50     (10.28)
        Lexical density*                    0.56       (0.08)           0.58      (0.11)
                                            Morpho–syntactic features
        % Adjectives*                       5.20       (4.71)           4.35      (5.55)
        % Articles***                       9.13       (5.00)           9.00      (6.20)
        % Conjunctions**                    5.1        (4.35)           4.34      (4.66)
        % Coordinat. conj***                3.50       (3.40)           3.20      (3.81)
        % Demonstrative determiners***      0.61       (1.61)           0.55      (1.90)
        % Indefinite pronouns               0.87       (2.26)           0.66      (2.24)
        % Interrogative determiners*        0.05       (0.52)           0.06      (0.67)
        % Interjections*                    0.03       (0.31)           0.09      (0.72)
        % Numbers**                         0.66       (1.87)           0.64      (2.25)
        % Negative adverbs***               1.23       (2.82)           1.19      (3.13)
        % Ordinal numbers*                  0.27       (1.04)           0.14      (0.83)
        % Possessive adjectives***          0.96       (2.10)           0.89      (2.33)
        % Prepositions***                   7.90       (5.01)           7.60      (6.20)
        % Proper nouns**                    2.05       (3.90)           2.00      (4.30)
        % Punctuation***                    4.74       (2.82)           7.70      (6.23)
        % Relative pronouns***              1.14       (2.00)           1.12      (2.32)
        % Subordin. conj***                 1.63       (2.80)           1.50      (2.90)
        % Verbs**                           15.98      (6.32)           16.97     (8.18)
        % Verb Participial mood**           7.13       (14.22)          6.34      (14.88)
        % Verb Indicative mood***           57.00      (30.70)          60.00     (33.82)
        % Verb Conditional mood**           1.37       (6.13)           2.35      (9.58)
        % Verb Past tense**                 22.19      (34.80)          23.88     (37.73)
        % Verb Imperfect tense**            29.08      (39.35)          29.04     (41.13)
        % Verb Present tense*               45.04      (43.50)          38.40     (44.91)
        % 3rd pers. verb***                 55.15      (39.50)          45.20     (42.62)
        % 2nd pers. verb*                   1.37       (7.34)           1.84      (10.25)
        Type/Token Ratio (first 100 lemmas)** 0.84     (0.10)           0.89      (0.10)
                                                Syntactic features
        Clause length (in tokens)***        8.63       (4.34)           7.90      (4.24)
        Avg verbal heads/sentence***        4.00       (2.30)           3.00      (2.03)
        Avg prep. links length*             1.11       (0.45)           0.93      (0.58)
        Max link length***                  11.30      (7.06)           8.30      (6.80)
        Verb arity                          34.93      (29.74)          33.37     (32.70)
        % Postverbal subject***             13.60      (27.00)          15.70     (32.00)
        % Preverbal objects*                10.17      (25.17)          9.22      (25.55)
        % DEP Root**                        5.52       (3.31)           8.20      (6.30)
        % DEP Mod rel***                    1.50       (2.21)           1.30      (2.50)
        % DEP Copulative Conj**             5.34       (4.92)           4.65      (5.26)
        % DEP Determiner***                 9.10       (5.00)           8.80      (6.20)
        % DEP Disjunctive Conj              0.14       (0.76)           0.20      (0.99)
        % DEP Locative Compl*               0.73       (2.03)           0.53      (1.81)
        % DEP neg***                        1.20       (2.80)           1.13      (2.84)
        % DEP conj**                        4.58       (4.12)           3.91      (4.62)
        % DEP concatenation*                0.06       (0.52)           0.08      (0.80)

Table 5: Linguistic features whose average difference between the two classes was statistically significant. For each feature, the average value and the standard deviation (StDev) are reported. *** indicates a highly significant difference (p<.001); ** a very significant difference (p<.01); * a significant difference (p<.05).