=Paper=
{{Paper
|id=Vol-2253/paper53
|storemode=property
|title=DARC-IT: a DAtaset for Reading Comprehension in ITalian
|pdfUrl=https://ceur-ws.org/Vol-2253/paper53.pdf
|volume=Vol-2253
|authors=Dominique Brunato,Martina Valeriani,Felice Dell'Orletta
|dblpUrl=https://dblp.org/rec/conf/clic-it/BrunatoVD18
}}
==DARC-IT: a DAtaset for Reading Comprehension in ITalian==
Dominique Brunato°, Martina Valeriani•, Felice Dell'Orletta°

•University of Pisa
marti.valeriani@gmail.com

°Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR), ItaliaNLP Lab – www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it

===Abstract===
English. In this paper, we present DARC-IT, a new reading comprehension dataset for the Italian language aimed at identifying 'question-worthy' sentences, i.e. sentences in a text which contain information that is worth asking a question about[1]. The purpose of the corpus is twofold: to investigate the linguistic profile of question-worthy sentences and to support the development of automatic question generation systems.

Italiano. In this contribution we present DARC-IT, a new reading comprehension corpus for the Italian language, aimed at identifying the sentences in a text that lend themselves to being the object of a question[2]. The purpose of this corpus is twofold: to study the linguistic profile of informative sentences and to provide a training corpus supporting an automatic system for the generation of comprehension questions.

[1] The corpus will be made publicly available for research purposes at the following link: http://www.italianlp.it/resources/
[2] The corpus will be made freely available for research purposes at the following address: http://www.italianlp.it/resources/

===1 Introduction===
Reading comprehension (RC) can be defined as "the process of simultaneously extracting and constructing meaning through interaction and involvement with written language" (Snow, 2002). Such a definition emphasizes that RC is a complex human ability that can be decomposed into multiple operations, such as coreference resolution, understanding discourse relations, commonsense reasoning and reasoning across multiple sentences. In educational scenarios, students' comprehension and reasoning skills are typically assessed through a variety of tasks, ranging from prediction tasks (e.g. cloze tests) to retelling generation and question answering, all of which are costly to produce and require domain-expert knowledge. Given also the challenges posed by the broad diffusion of distance learning programs, such as MOOCs (Massive Open Online Courses), the automatic assessment of RC is becoming a rapidly growing research field of Natural Language Processing (NLP). While much work has been done on developing Automated Essay Scoring (AES) systems (Passonneau et al., 2016), recent studies have focused on the automatic generation of questions to be used for evaluating human reading comprehension (Du and Cardie, 2017; Afzal and Mitkov, 2014). This is not a trivial task, since it assumes the ability to understand which concepts in a text are most relevant, where relevance can here be defined as the likelihood of a passage being worth asking a question about. The availability of large, high-quality RC datasets containing questions posed by humans on a given text thus becomes a fundamental requirement for training data-driven systems able to automatically learn what makes a passage 'question-worthy'. In this regard, datasets collected for other NLP tasks, Question Answering above all, provide a valuable resource. One of the most widely used is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which contains more than 100,000 questions posed by crowdworkers on a set of Wikipedia articles, in which the answer to each question is a segment of text from the corresponding reading passage.
More recently, other large RC datasets have been released, such as the TriviaQA dataset (Joshi et al., 2017), which is intended to be more challenging than SQuAD since it contains a higher proportion of complex questions, i.e. questions requiring inference over multiple sentences. The same holds for RACE (Lai et al., 2017), which is also the only one specifically designed for educational purposes: it covers multiple domains and written styles and contains questions generated by domain experts, i.e. English teachers, to assess the reading and comprehension skills of L2 learners. While all these datasets are available for the English language, to our knowledge no similar RC datasets exist for Italian. In this paper we introduce a new corpus for Italian specifically conceived to support research on the automatic identification of question-worthy passages. In what follows, we first describe the typology of texts it contains and the annotation process we performed on them. We then carry out a qualitative analysis based on linguistic features automatically extracted from the texts, with the aim of studying, on the one hand, which features best discriminate question-worthy sentences from other sentences and, on the other hand, whether the two classes of sentences have a different profile in terms of linguistic complexity.

===2 Dataset Collection===
The first step in the process of corpus construction was the selection of appropriate materials. As noted by Lai et al. (2017), a major drawback of many existing RC datasets is that they were either crowd-sourced or automatically generated, thus paying very little attention to the intended target user; this makes them less suitable for use in real educational scenarios. To overcome these limitations, we relied on a corpus of reading comprehension tests designed by the National Institute for the Evaluation of the Education System (INVALSI), the Italian institution in charge of developing standardized tests for the assessment of numeracy and literacy skills of primary, middle and high school students.

To create the corpus, we focused only on tests designed to assess students' competence in the Italian language. We thus collected a total of 86 Italian tests administered between 2003 and 2013, of which 31 target primary school pupils of the second, third and fifth grade, 29 target students of the first and third year of middle school, and 26 target students of the first, second and third grade of high school. A number of questions is associated with each text, which aim to assess in depth the student's ability to read and understand. As documented in the last available technical report provided by the Institute[3], the INVALSI Italian test has been designed to cover seven main aspects underlying text comprehension, namely: understanding the meaning of words; identifying explicit information; inferring implicit information; detecting elements conveying cohesion and coherence in the text; comprehending the meaning of a passage by integrating both implicit and explicit information; comprehending the meaning of the whole text; and generating a meaningful interpretation (e.g. understanding the message, the purpose, etc.).

[3] http://www.invalsi.it/invalsi/doc_eventi/2017/Rapporto_tecnico_SNV_2017.pdf

With respect to their form, questions can be either multiple-choice (typically with 3 or 4 options, see example (1)) or, more rarely, open-ended (example (2)).

Example (1): Dove abita il ragno del racconto? (Where does the spider of the story live?)
A. In un albero del bosco. (In a tree in the forest)
B. Sopra un fiore del bosco. (On a flower in the forest)
C. In una siepe del bosco. (In a hedge in the forest)

Example (2): Dopo aver letto il testo, qual è secondo te il messaggio che vuole dare l'autore? (After reading the text, what do you think is the message the author wants to convey?)

For the purposes of our study, we selected only the first type of question, thus obtaining a total of 354 questions. Table 1 reports some statistics about the final corpus collected from the INVALSI tests.
School grade | Texts | Sentences | Questions
2nd Primary | 10 | 195 | 75
4th Primary | 9 | 205 | 36
5th Primary | 12 | 427 | 50
1st Middle | 19 | 513 | 72
3rd Middle | 10 | 342 | 48
1st High | 10 | 303 | 32
2nd High | 7 | 211 | 18
3rd High | 9 | 261 | 23
TOT | 86 | 2457 | 354

Table 1: Total number of texts, total number of sentences and corresponding questions for each school grade in DARC-IT.

====2.1 Annotation Scheme====
For each question in the corpus, the annotation process was meant to identify the sentence (or sentence span) containing the corresponding answer. This information was marked in the text by enclosing the relevant text span in opening and closing XML tags labelled with an upper-case R, i.e. <R> ... </R>.

The outcome of the annotation process was a tabular file with the following information reported in separate columns: i) the text segmented into sentences; ii) a binary value, 1 vs 0 (1 if the sentence contains the answer to the question, 0 if not); iii) the corresponding question; iv) the answer provided by the annotator. Table 2 gives an example of the dataset structure.

Sentence | Class | Tag | Question | Answer
La lucciola si preparò e, quando calò la sera, andò all'appuntamento. | 0 | | |
Entrò nel bosco scuro e raggiunse la siepe dove viveva il ragno. | 1 | <R>Entrò nel bosco scuro e raggiunse la siepe dove viveva il ragno.</R> | Dove abita il ragno del racconto? | In una siepe del bosco.

Table 2: Sample output of the dataset structure.
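The paper does not describe the scripts used to produce the tabular file; the following minimal Python sketch only illustrates one way the <R>-annotated text could be converted into the four-column structure of Table 2. The function name, the tag handling and the tab-separated output are illustrative assumptions, not part of DARC-IT.

<pre>
import re

def to_rows(sentences, question, answer):
    # Convert <R>-annotated sentences into the columns of Table 2:
    # sentence, class (1/0), question, answer. Hypothetical helper,
    # not the authors' actual build script.
    rows = []
    for sent in sentences:
        if re.search(r"<R>.*?</R>", sent, flags=re.S):
            clean = re.sub(r"</?R>", "", sent)  # class 1: contains the answer span
            rows.append((clean, 1, question, answer))
        else:
            rows.append((sent, 0, "", ""))      # class 0: no answer span
    return rows

sentences = [
    "La lucciola si preparò e, quando calò la sera, andò all'appuntamento.",
    "<R>Entrò nel bosco scuro e raggiunse la siepe dove viveva il ragno.</R>",
]
for row in to_rows(sentences, "Dove abita il ragno del racconto?",
                   "In una siepe del bosco."):
    print("\t".join(str(field) for field in row))
</pre>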
A qualitative inspection of the corpus allowed us to identify different typologies of 'question-worthy' sentence: sentences that were the target of one question only (this is the case of the second sentence reported in Table 2); sentences that were the target of multiple questions, such as (4); and sentences that only partially answered the question (i.e. the whole information required to give the answer is spread across multiple sentences), such as (5).

(4) Question-worthy sentence: Leo decide di aiutare gli animali della giungla (Leo decides to help the jungle animals)
Corresponding questions:
• Qual è la cosa più importante per Leo? (What is the most important thing for Leo?)
Multiple-choice answers: A. Essere un bravo cacciatore. (To be a good hunter); B. Diventare il più coraggioso di tutti. (To become the bravest of all); C. Rendersi utile agli altri. (To make himself useful to others); D. Fare nuove esperienze. (To have new experiences).
• Cosa sceglie di fare Leo nella giungla? (What does Leo choose to do in the jungle?)
Multiple-choice answers: A. Giocare con tutti. (To play with everybody); B. Dormire e mangiare. (To sleep and eat); C. Aiutare chi è in difficoltà. (To help those in need); D. Nuotare nell'acqua del fiume. (To swim in the river water)

(5) Question-worthy sentences: "Io farò il postino!" Disse uno. "Io farò il maestro!" Disse un altro. "E io farò lo chef!" Urlò un terzo e salì sul vagone delle marmellate. ("I'm going to be a postman!" said one. "I'm going to be a teacher!" said another. "And I'm going to be a chef!" shouted a third, and climbed onto the jam wagon.)
Corresponding question: A che cosa pensano i bambini quando vedono gli oggetti sul treno? (What do the children think of when they see the items on the train?)
Multiple-choice answers: A. Ai giochi che potranno fare. (Of the games they will be able to play); B. A cose utili che si possono vendere. (Of useful things that can be sold); C. Ai regali che vorrebbero ricevere. (Of the presents they would like to receive); D. Ai lavori che faranno da grandi. (Of the jobs they will do as adults.)

===3 Linguistic Analysis===
As a result of the annotation process, we obtained 398 'question-worthy' sentences and 2,059 'non-question-worthy' sentences. Starting from this classification, we carried out an in-depth linguistic analysis based on a wide set of features capturing properties of a sentence at the lexical, morpho-syntactic and syntactic levels. The aim of this analysis was to understand whether there are linguistic features that allow predicting the 'likelihood' of a sentence being the target of a question. To allow the extraction of linguistic features, all sentences were automatically tagged by the part-of-speech tagger described in Dell'Orletta (2009) and dependency-parsed by the DeSR parser described in Attardi et al. (2009).

Table 3 shows an excerpt of the first 20 features (out of the 177 extracted) for which the average difference between their values in the 'question-worthy' and 'non-question-worthy' classes was highly statistically significant according to the Wilcoxon rank sum test[4]. As can be seen, sentences on which a comprehension question was asked are, on average, much longer. This could be expected, since the longer the sentence, the higher the probability that it is more informative and thus contains concepts that are worth asking a question about. This is also suggested by the higher distribution of proper nouns [10], most likely referring to relevant semantic types (e.g. person, location) which typically occur in narrative, i.e. the main textual genre of the INVALSI tests. The greater sentence length of 'question-worthy' sentences also has effects at the morpho-syntactic and syntactic levels, as shown e.g. by the higher proportion of conjunctions introducing subordinate clauses ([7] Subord. conj: 1.63 vs 1.50) and by the presence of longer syntactic relations, in which the linear distance between the 'head' and the 'dependent' is higher than 10 tokens ([20] Max link: 11.30 vs 8.30).

[4] All significant features are shown in Appendix (A).

Features | Question Avg (StDev) | NoQuestion Avg (StDev)
Raw Text features | |
[1] Sentence length* | 29.00 (16.11) | 20.00 (13.75)
Morpho-syntactic features | |
[2] Punctuation* | 4.74 (2.82) | 7.70 (6.23)
[3] Negative adv* | 1.23 (2.82) | 1.19 (3.13)
[4] Coord. conj* | 3.50 (3.40) | 3.20 (3.81)
[5] Poss. adj* | 0.96 (2.10) | 0.89 (2.33)
[6] Relative pron* | 1.14 (2.00) | 1.12 (2.32)
[7] Subord. conj* | 1.63 (2.80) | 1.50 (2.90)
[8] Prepositions* | 7.90 (5.01) | 7.60 (6.20)
[9] Determiners* | 9.13 (5.00) | 9.00 (6.20)
[10] Proper nouns* | 2.05 (3.90) | 2.00 (4.30)
[11] Numbers | 0.66 (1.87) | 0.64 (2.25)
[12] Verbs | 15.98 (6.32) | 16.97 (8.18)
[13] Indicat. mood* | 57.00 (30.70) | 60.00 (33.82)
[14] Particip. mood | 7.13 (14.22) | 6.34 (14.88)
[15] 3rd pers. verb* | 55.15 (39.50) | 45.20 (42.62)
[16] Conjunctions | 5.1 (4.35) | 4.34 (4.66)
Syntactic features | |
[17] Clause length* | 8.63 (4.34) | 7.90 (4.24)
[18] Verbal heads* | 4.00 (2.30) | 3.00 (2.03)
[19] Postverb Subj* | 13.60 (27.00) | 15.70 (32.00)
[20] Max link* | 11.30 (7.06) | 8.30 (6.80)

Table 3: Linguistic features whose average difference between the two classes was statistically significant. For each feature, the average value (Avg) and the standard deviation (StDev) are reported. All differences are statistically significant at p<.005; those marked with * are also significant at p<.001. (Question = question-worthy sentences; NoQuestion = non-question-worthy sentences.)
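As an illustration of the significance test used above, the sketch below applies the Wilcoxon rank sum test, as implemented in SciPy, to the values of a single feature in the two classes. The feature values here are invented placeholders, not data from DARC-IT.

<pre>
from scipy.stats import ranksums

# Per-feature comparison between the two classes (toy numbers):
# e.g. sentence length for question-worthy vs non-question-worthy sentences.
question_worthy = [29, 35, 18, 41, 27, 33, 22, 38]
non_question_worthy = [20, 15, 22, 18, 25, 12, 19, 16]

stat, p_value = ranksums(question_worthy, non_question_worthy)
print(f"Wilcoxon rank-sum statistic = {stat:.3f}, p = {p_value:.4f}")
# A feature enters Table 3 when the difference is significant at p < .005;
# Table 5 in the Appendix also marks the p < .01 and p < .05 levels.
</pre>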
A further analysis was meant to investigate the profile of question-worthy sentences with respect to linguistic complexity. To this end, we exploited READ-IT (Dell'Orletta et al., 2011), a general-purpose readability assessment tool for Italian, which combines traditional raw text features with lexical, morpho-syntactic and syntactic information to operationalize multiple phenomena of text complexity. READ-IT assigns different readability scores using the following four models: 1) the Base Model, relying on raw text features only (e.g. average sentence and word length); 2) the Lexical Model, relying on a combination of raw text and lexical features; 3) the Syntax Model, relying on morpho-syntactic and syntactic features; 4) the Global Model, combining all feature types (raw text, lexical, morpho-syntactic and syntactic features).

Results are reported in Table 4. As can be noted, question-worthy sentences have a higher complexity according to all models. Especially at the syntactic level, this could be expected, given the higher values obtained by the features related to syntactic complexity, which turned out to be significantly involved in discriminating these sentences.

READ-IT model | Question | NoQuestion
READ-IT Base | 59.9% | 21.1%
READ-IT Lexical | 98.9% | 66.4%
READ-IT Syntactic | 69.3% | 37.5%
READ-IT Global | 100% | 95%

Table 4: Readability scores obtained by the different READ-IT models.
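The classification experiment mentioned in the Conclusion below is not described in the paper. Purely as a sketch of how features such as those in Table 3 could feed a 'question-worthiness' classifier, one might train a linear model on per-sentence feature vectors; the feature choice, the invented values and the scikit-learn setup are all illustrative assumptions, not the authors' system.

<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy per-sentence feature vectors: sentence length, number of verbal
# heads, max dependency link length (features [1], [18], [20] in Table 3).
# Values are invented, not taken from DARC-IT.
X = np.array([
    [29, 4, 11], [35, 5, 14], [41, 6, 12], [27, 3, 10],   # question-worthy (1)
    [20, 3,  8], [15, 2,  6], [18, 2,  7], [22, 3,  9],   # non-question-worthy (0)
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=4)   # 4-fold cross-validation
print(f"mean accuracy: {scores.mean():.2f}")
</pre>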
===4 Conclusion===
We presented DARC-IT, a new reading comprehension dataset for Italian, collected from a sample of the standardized evaluation tests used to assess students' reading comprehension at different grade levels. For each text, we annotated 'question-worthy' sentences, i.e. sentences which contain the answer to a given question. A qualitative analysis of these sentences showed that the likelihood of a sentence being 'question-worthy' can be modeled using a set of linguistic features, especially features linked to syntactic complexity. We believe that this corpus can support research on the development of automatic question generation systems as well as question answering systems. Current developments go in several directions: we are carrying out a first classification experiment to automatically predict 'question-worthy' sentences and to evaluate the impact of linguistic features on classifier performance. We are also planning to enlarge the corpus and to investigate in more depth the typology of questions and answers it contains, in order to study what characterizes sentences answering, for instance, factual vs non-factual questions.

===5 Acknowledgments===
The work presented in this paper was partially supported by the 2-year project (2018-2020) SchoolChain – Soluzioni innovative per la creazione, la certificazione, il riuso e la condivisione di unità didattiche digitali all'interno del sistema Scuola (innovative solutions for the creation, certification, reuse and sharing of digital teaching units within the school system), funded by Regione Toscana (BANDO POR FESR 2014-2020).

===References===
Naveed Afzal and Ruslan Mitkov. 2014. Automatic generation of multiple choice questions using dependency-based semantic relations. Soft Computing, 18(7), 1269–1281.

Giuseppe Attardi, Felice Dell'Orletta, Maria Simi and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009 – Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, Italy, December 2009.

Felice Dell'Orletta. 2009. Ensemble system for part-of-speech tagging. In Proceedings of EVALITA 2009 – Evaluation of NLP and Speech Tools for Italian, Reggio Emilia, Italy, December 2009.

Felice Dell'Orletta, Simonetta Montemagni and Giulia Venturi. 2011. READ-IT: assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies (SLPAT 2011), Edinburgh, UK, 73–83.

Xinya Du and Claire Cardie. 2017. Identifying Where to Focus in Reading Comprehension for Neural Question Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark.

Mandar Joshi, Eunsol Choi, Daniel Weld and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 1601–1611.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang and Eduard H. Hovy. 2017. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark.

Rebecca J. Passonneau, Ananya Poddar, Gaurav Gite, Alisa Krivokapic, Qian Yang and Dolores Perin. 2016. Wise Crowd Content Assessment and Educational Rubrics. International Journal of Artificial Intelligence in Education, 28, 29–55.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2383–2392.

Catherine Snow. 2002. Reading for understanding: Toward an R&D program in reading comprehension. RAND Corporation.
===Appendix (A)===
Features | Question-worthy sentences, Average (StDev) | Non-question-worthy sentences, Average (StDev)
Raw Text features | |
Sentence length*** | 29.00 (16.11) | 20.00 (13.75)
Lexical features | |
% Basic Italian Vocabulary (BIV)* | 88.54 (8.53) | 88.99 (10.66)
% Fundamental BIV** | 78.26 (10.83) | 79.59 (13.23)
% 'High Usage' BIV* | 12.31 (8.12) | 12.50 (10.28)
Lexical density* | 0.56 (0.08) | 0.58 (0.11)
Morpho-syntactic features | |
% Adjectives* | 5.20 (4.71) | 4.35 (5.55)
% Articles*** | 9.13 (5.00) | 9.00 (6.20)
% Conjunctions** | 5.1 (4.35) | 4.34 (4.66)
% Coordinat. conj*** | 3.50 (3.40) | 3.20 (3.81)
% Demonstrative determiners*** | 0.61 (1.61) | 0.55 (1.90)
% Indefinite pronouns | 0.87 (2.26) | 0.66 (2.24)
% Interrogative determiners* | 0.05 (0.52) | 0.06 (0.67)
% Interjections* | 0.03 (0.31) | 0.09 (0.72)
% Numbers** | 0.66 (1.87) | 0.64 (2.25)
% Negative adverbs*** | 1.23 (2.82) | 1.19 (3.13)
% Ordinal numbers* | 0.27 (1.04) | 0.14 (0.83)
% Possessive adjectives*** | 0.96 (2.10) | 0.89 (2.33)
% Prepositions*** | 7.90 (5.01) | 7.60 (6.20)
% Proper nouns** | 2.05 (3.90) | 2.00 (4.30)
% Punctuation*** | 4.74 (2.82) | 7.70 (6.23)
% Relative pronouns*** | 1.14 (2.00) | 1.12 (2.32)
% Subordin. conj*** | 1.63 (2.80) | 1.50 (2.90)
% Verbs** | 15.98 (6.32) | 16.97 (8.18)
% Verb Participial mood** | 7.13 (14.22) | 6.34 (14.88)
% Verb Indicative mood*** | 57.00 (30.70) | 60.00 (33.82)
% Verb Conditional mood** | 1.37 (6.13) | 2.35 (9.58)
% Verb Past tense** | 22.19 (34.80) | 23.88 (37.73)
% Verb Imperfect tense** | 29.08 (39.35) | 29.04 (41.13)
% Verb Present tense* | 45.04 (43.50) | 38.40 (44.91)
% 3rd pers. verb*** | 55.15 (39.50) | 45.20 (42.62)
% 2nd pers. verb* | 1.37 (7.34) | 1.84 (10.25)
TTR ratio (first 100 lemmas)** | 0.84 (0.10) | 0.89 (0.10)
Syntactic features | |
Clause length (in tokens)*** | 8.63 (4.34) | 7.90 (4.24)
Avg verbal heads/sentence*** | 4.00 (2.30) | 3.00 (2.03)
Avg prep. links length* | 1.11 (0.45) | 0.93 (0.58)
Max link length*** | 11.30 (7.06) | 8.30 (6.80)
Verb arity | 34.93 (29.74) | 33.37 (32.70)
% Postverbal subject*** | 13.60 (27.00) | 15.70 (32.00)
% Preverbal objects* | 10.17 (25.17) | 9.22 (25.55)
% DEP Root** | 5.52 (3.31) | 8.20 (6.30)
% DEP Mod rel*** | 1.50 (2.21) | 1.30 (2.50)
% DEP Copulative Conj** | 5.34 (4.92) | 4.65 (5.26)
% DEP Determiner*** | 9.10 (5.00) | 8.80 (6.20)
% DEP Disjunctive Conj | 0.14 (0.76) | 0.20 (0.99)
% DEP Locative Compl* | 0.73 (2.03) | 0.53 (1.81)
% DEP neg*** | 1.20 (2.80) | 1.13 (2.84)
% DEP conj** | 4.58 (4.12) | 3.91 (4.62)
% DEP concatenation* | 0.06 (0.52) | 0.08 (0.80)

Table 5: Linguistic features whose average difference between the two classes was statistically significant. For each feature, the average value and the standard deviation (StDev) are reported. *** indicates a highly significant difference (p<.001); ** a very significant difference (p<.01); * a significant difference (p<.05).