EnetCollect in Italy

Lionel Nicolas (1), Verena Lyding (1), Luisa Bentivogli (2), Federico Sangati (3), Johanna Monti (3), Irene Russo (4), Roberto Gretter (5), Daniele Falavigna (5)

(1) Institute for Applied Linguistics, Eurac Research, Bolzano
(2) HLT-MT Unit, Fondazione Bruno Kessler, Trento
(3) Department of Literary, Linguistic and Comparative Studies, University of Naples “L’Orientale”, Naples
(4) Institute of Computational Linguistics “Antonio Zampolli”, CNR, Pisa
(5) SpeechTek Unit, Fondazione Bruno Kessler, Trento

Abstract

English. In this paper, we present the enetCollect COST Action (European Network for Combining Language Learning with Crowdsourcing Techniques; EnetCollect, 2018), a large network project, which aims at initiating a new Research and Innovation (R&I) trend on combining the well-established domain of language learning with recent and successful crowdsourcing approaches. We introduce its objectives and describe its organization. We then present the Italian network members and detail their research interests within enetCollect. Finally, we report on its progression so far.

Italiano. In questo articolo presentiamo la COST Action enetCollect, un ampio network il cui scopo è avviare un nuovo filone di Ricerca e Innovazione (R&I) combinando l’ambito consolidato dell’apprendimento delle lingue con i più recenti e riusciti approcci di crowdsourcing. Introduciamo i suoi obiettivi e descriviamo la sua organizzazione. Inoltre, presentiamo i membri italiani ed i loro interessi di ricerca all’interno di enetCollect. Infine, descriviamo lo stato di avanzamento finora raggiunto.

1 Introduction

In this paper, we present the COST network enetCollect that aims at kick-starting an R&I trend for combining language learning with crowdsourcing techniques in order to unlock a crowdsourcing potential for all languages, consisting in learning and teaching activities. This potential will be used to mass-produce language learning material and language-related datasets, such as NLP resources.

We also present enetCollect’s Italian members alongside their NLP-related interests. Indeed, NLP heavily relies on language resources and their availability is crucial for the delivery of reliable NLP solutions. Due to high costs of production, resources are often missing, especially for lesser-used languages. As enetCollect researches new approaches to tackle such issues, it is a project of particular interest for the Italian NLP community.

EnetCollect connects to ongoing crowdsourcing research, including Games With A Purpose approaches (Chamberlain et al., 2013; Lafourcade et al., 2015) for collecting data through gamified tasks (cf. e.g. JeuxDeMots (Lafourcade, 2007), or ZombiLingo (Guillaume et al., 2016)), collaborative approaches such as Wisdom-of-the-Crowd initiatives (e.g. dict.cc (https://www.dict.cc), Wiktionary (https://www.wiktionary.org/), and Duolingo (von Ahn, 2013)), or general Human-based Computation activities (implemented through platforms like Zooniverse (https://www.zooniverse.org/), Crowd4u (http://crowd4u.org/en/), etc.).

This paper aims at fostering the participation of the Italian NLP community while further allowing it to benefit from the research and collaboration opportunities enetCollect offers (e.g. research stay grants) for its remaining 2.5 years of funding. Sections 2 and 3 present enetCollect’s ambition and its organization, while Section 4 introduces the Italian members and their research interests. Sections 5 and 6 report on achievements up to now and the current state of affairs.

2 Challenge, Motivation and Objectives

Started in March 2017, enetCollect will pursue, until April 2021, the long-term challenge of fostering language learning in Europe and beyond by taking advantage of the ground-breaking nature of crowdsourcing and the immense and ever-growing crowd of language learners and teachers (21% of Europeans aged over 14 years, i.e. about 90 million people, according to the Eurobarometer report (European Commission, 2012)) to mass-produce language learning content and, at the same time, language-related data such as NLP resources. The prospect of mass-producing language-related data can vastly impact domains such as NLP, which in turn will impact back on language learning by fostering support from various language-related stakeholders (e.g. see Section 4 for NLP-related crowdsourcing scenarios).

As intensifying migration flows (due to economic and geopolitical reasons) increase the diversification of language learner profiles and the demand for learning material, the launch of such an R&I trend is very timely. Indeed, the effectiveness of the existing material runs the risk of gradually falling behind, and the varied combinations of languages taught and target groups can hardly be addressed by small-scale initiatives. EnetCollect timely kick-starts an overarching R&I trend to continuously foster various initiatives. Funding-wise, the timing is also favorable as both the increasing need for learning solutions and the problem-solving nature of crowdsourcing are widely acknowledged.

The creation of a new R&I community is addressed through formal Research Coordination Objectives aiming at creating a shared knowledge of the subject, at carrying out prototypical experiments and at disseminating promising results, while formal Capacity-Building Objectives aim at creating the core R&I community, communication means and new initiatives. In Section 5, we report on progress regarding these objectives.

3 Working Groups and Coordinations

EnetCollect makes a working distinction between explicit and implicit crowdsourcing approaches: while for explicit crowdsourcing the crowd intentionally participates (e.g. Wikipedia), for implicit crowdsourcing the crowd is not necessarily aware of its participation (e.g. reCaptcha, https://www.google.com/recaptcha). EnetCollect is organized along five working groups (WG) and three support groups called coordinations.

Whereas WG1 focuses on explicit crowdsourcing approaches to create data or learning content (e.g. collaboratively creating lessons), WG2 focuses on implicit crowdsourcing approaches for the same purpose (e.g. generating exercise content from language-related resources and collecting the answers to the exercises to correct and extend the resources used). WG3 focuses on user-oriented design strategies to attract and retain a crowd (e.g. studying the relevance and attractiveness of learner profiling for vocabulary training). WG4 focuses on studying the functional demands and the existing solutions related to language learning and crowdsourcing (e.g. technical solutions addressing the scalability need of some methods). Finally, WG5 focuses on application-oriented questions such as ethical issues, legal regulations, and commercialization opportunities.

The five WGs are different content-wise and can be pursued in a parallel fashion. Nonetheless, they remain interdependent in the overarching objective. For example, the boundary between explicit and implicit crowdsourcing (WG1 and WG2) is sometimes difficult to draw when the crowd is explicitly involved while their actions are being implicitly crowdsourced (e.g. crowdsourcing learner essays and their corrections by teachers to create annotated corpora). Also, any crowdsourcing approach will fail if there is no crowd to rely on (WG3), no technical solution to support its functional needs (WG4), and no appropriate ethical or legal contexts to implement it (WG5). Alongside the WGs, three coordination groups on Dissemination, Exploitation and Outreach are providing standardized support for WG-transversal tasks.

4 Research Interests of Italian Members

The Italian members are currently among the most numerous and active participants in the Action and its events. In addition, the Action coordination (Chair and Grant Holder) is carried out by two Italian members from Eurac Research (see below). Being all related to NLP, enetCollect’s Italian partners have a common interest in combining language learning with implicit crowdsourcing (WG2) so as to extend and correct NLP datasets.

All crowdsourcing scenarios described hereafter share the same overarching approach: the NLP partner uses an NLP dataset to generate exercise content and both crowdsources and cross-matches the learners’ answers in order to validate/discard the data used to generate the exercise content, just like GWAP players validate/discard data while playing. Deriving expert knowledge from cross-matched learners’ answers is a challenge enetCollect aims at addressing. Relying on a crowd of learners is however promising in two ways. First, learners should be mostly confronted with exercise content generated from reliable NLP data so as to not undermine their efforts. Their constantly-evaluated proficiency levels thus provide a reliability score for their answers. Second, as a crowd of learners renews itself over time, the set of crowdsourced answers for each question is potentially infinite and their “inferior” reliability is thus compensated by their “superior” quantity.

The Institute for Applied Linguistics (IAL) of Eurac Research is particularly concerned with research on the three official languages of South Tyrol (Italian, South Tyrolean German and the minority language Ladin). As regards NLP, Italian is the best covered, while South Tyrolean German is approximated by adapting solutions for standard German and Ladin has barely any coverage. To improve this situation, the IAL aims at crowdsourcing varied NLP resources for South Tyrolean German and Ladin, starting with wide-coverage Part-of-Speech (POS) lexica. The foreseen crowdsourcing scenario is to use POS lexica to generate exercise content for widely adopted exercises such as the one for grouping words according to their properties (e.g. “select all verbs among these five words”) or for identifying words within a grid of random letters (e.g. “select five adjectives in the grid”). By crowdsourcing the learners’ answers, the IAL aims at gradually improving the lexica while continuously adding new entries. As for the targeted crowd of learners, the IAL will build on its long-standing collaborations with schools (Vettori and Abel, 2017; Abel et al., 2014) and is considering targeting the local language certification (exam for bilingualism; BZ Alto Adige, 2018), an obligatory exam for public positions for which no dedicated learning tool is currently available online.

The Human Language Technology - Machine Translation (HLT-MT) research unit of Fondazione Bruno Kessler (FBK) is concerned with MT technologies supporting both human translators and multilingual applications. The creation of dedicated language resources is thus a core activity. Within enetCollect, HLT-MT aims at enriching existing parallel corpora and at enhancing MT evaluation by crowdsourcing multiple translations of the same sentence (Bentivogli et al., 2018). As such translations paraphrase one another, they are also of interest for monolingual NLP purposes. Following the growing number of studies on the language learning usage of MT (Somers, 2001; Niño, 2008; Case, 2015; Dongyun, 2017), HLT-MT focuses on “post-editing” exercises fostering correction and writing skills, where students are presented with a sentence and several possible translations and are asked to choose the most appropriate one and, if necessary, revise it. Existing parallel corpora and state-of-the-art MT systems trained on them will make it possible to test the learners’ skills and generate new translations. While learning, students will thus be trained, evaluated and will sometimes be allowed to correct MT outputs and extend training corpora. For such a crowdsourcing scenario, advanced L2 learners will be targeted, especially those studying Translation Studies for Italian, English and German at partners of the Universities of Trento and Bologna.

The PARSEME-IT research group (https://sites.google.com/view/parseme-it/home) of the Department of Literary, Linguistic and Comparative Studies, University of Naples “L’Orientale” aims at improving linguistic representativeness, precision, robustness and computational efficiency of NLP applications (Monti et al., 2017). It researches MultiWord Expressions (MWEs, i.e. groups of words composing one lexical unit, such as “tirare le cuoia”, En. “kick the bucket”) as a major NLP bottleneck, and investigates their representation in language resources and their integration in syntactic parsing, translation technology, and language learning. The possibility to enhance mono- and multilingual language resources focusing on MWEs is of particular interest, especially with regard to MWE lexica and corpora annotated with MWEs. Accordingly, a set of different exercises engaging students from different degrees (junior high, high school, and undergraduates) is envisioned. Examples are exercises to improve lists of Italian MWEs and their correspondences in different languages, which ask learners to identify/validate MWEs in monolingual texts and suggest possible translations, or ask learners to identify/validate MWEs and their translations in parallel corpora. The targeted students are BA and MA students of the university L’Orientale, especially those attending the translation classes with a solid curriculum in linguistics and Translation Studies.

The Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC) has carried out research at the international, European, national and regional level since 1967. It participated in several EU initiatives on language resource documentation and recently took the lead of the national CLARIN-IT (www.clarin-it.it) consortium. Its main areas of competence also include Text Processing, NLP, Knowledge Extraction, and Computational Models of Language Usage. Among ILC’s resources, ImagAct (www.imagact.it), a multimodal resource about action verbs, represents a starting point for crowdsourcing experiments, where words denoting actions could be explained through videos sharing a semantic core. Crowdsourcing could be used to build these datasets by asking learners to label actions shown in short videos. As shown with middle school pupils (Coppola et al., 2017), analyzing a video illustrating verbs and associating it with words in multiple languages reinforces metalinguistic reasoning (CARAP, 2012). Such combinations of semantic traits and action verbs can also be used for textual entailment.

The SpeechTEK research unit of Fondazione Bruno Kessler (FBK) is working on Automatic Speech Recognition (ASR) and addresses computer assisted language learning as an application field. In a first project, it aims to automatically assess children’s reading capability at primary school. ASR is used to align a given text with the speech read out by a pupil, to highlight the pupil’s errors and to score the reading. A second project concerns the use of ASR and classification tools to automatically check the proficiency of Italian students aged between 9 and 16 years in learning both English and German. Both written texts and spoken utterances have to be evaluated, using reference scores related to some proficiency indicators (e.g. pronunciation, fluency, lexical richness) given by human experts. In the first project, corrections of ASR errors can be crowdsourced and used to build more reliable models for assessing reading capabilities of children. Similarly, in the second project crowdsourcing could help both to transcribe and to score the answers uttered by the students. In both cases, crowdsourcing could make it possible to adapt ASR models and produce more reliable gold standards.

5 Progression of the Network

In this section, the most relevant achievements (see more information on http://enetcollect.eurac.edu) related to the overall progression of the network are reported in relation to the formal Research Coordination and Capacity-Building Objectives outlined earlier in Section 2. We do not report on content-related results as these are too numerous and varied and, more importantly, they are (or will be) the focus of different publications authored by the members having achieved them.

Creating a core community of stakeholders. The already large initial number of 68 individual members from 34 participating countries has increased by 67% to 114 members and by 10% to 38 countries. The number of people subscribed to enetCollect’s mailing list has increased by 149% from 79 to 197. Also, 15 financed research stays, lasting 152 days overall, led to intense cooperation.

Building the theoretical framework. The 30 presentations and 39 posters at network meetings and the 15 research stays have contributed to the first building blocks of the foreseen theoretical framework, especially with regard to the state-of-the-art review. So far, 3 meetings and 1 training school were organized (168 participations in total).

Communication and outreach. EnetCollect’s intranet and website have been online for 9 and 7 months, respectively, and already host a substantial amount of information. 11 mailing lists targeting subsets of members were created and used. 4 calls for research stays and 5 calls for meeting participation were distributed and drew attention (and members) to enetCollect. Aside from one invited talk, several early activities for publications at conferences of related research communities are ongoing.

Funding new initiatives. Funding applications were supported early on, e.g. through the advertisement of specific opportunities or dedicated internal campaigns (e.g. for Marie Sklodowska-Curie Individual Fellowships). Three applications for mid-sized projects were already submitted in the first year, of which two were positively evaluated and one was funded by a Swiss agency.

6 Conclusion

We presented enetCollect, outlined its key aspects and introduced both its Italian members and their research interests. By harnessing even a fragment of the crowdsourcing potential existing for all languages taught worldwide, enetCollect could trigger changes of noticeable impact for language learning and language-related R&I fields, such as NLP. The fast uptake and overall progression of enetCollect within its first year indicate its relevance and the potential magnitude of its ambition.

References

Andrea Abel, Aivars Glaznieks, Lionel Nicolas, and Egon Stemle. 2014. KoKo: an L1 learner corpus for German. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 2414–2421, Reykjavik, Iceland. European Language Resources Association (ELRA).

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2018. Neural versus phrase-based MT quality: An in-depth analysis on English-German and English-French. Computer Speech and Language, 49:52–70.

Provincia autonoma di BZ Alto Adige. 2018. L’esame di bilinguismo. Last accessed: 2018-07-20.

Consiglio d’Europa CARAP. 2012. Le CARAP: Un Cadre de Référence pour les Approches Plurielles des Langues et des Cultures, Compétences et Ressources. Centre Européen pour les Langues Vivantes, Strasbourg Cedex.

Megan Case. 2015. Machine translation and the disruption of foreign language learning activities. eLearning Papers, 45:4–16.

Jon Chamberlain, Karën Fort, Udo Kruschwitz, Mathieu Lafourcade, and Massimo Poesio. 2013. Using games to create language resources: Successes and limitations of the approach. In Iryna Gurevych and Jungi Kim, editors, The People’s Web Meets NLP, Theory and Applications of Natural Language Processing, pages 3–44. Springer Berlin Heidelberg.

Daria Coppola, Raffaella Moretti, Irene Russo, and Fabiana Tranchida. 2017. In quante lingue mangi? Tecniche glottodidattiche e language testing in classi plurilingui e ad abilità differenziata. In Francesca Strik Lievers Giovanna Marotta, editor, Strutture linguistiche e dati empirici in diacronia e sincronia, Studi Linguistici Pisani, pages 199–231. Pisa University Press.

Sun Dongyun. 2017. Application of post-editing in foreign language teaching: Problems and challenges. Canadian Social Science, 13(7):1–5.

COST Action EnetCollect. 2018. Enetcollect COST website. Last accessed: 2018-07-20.

Directorate-General for Communication European Commission. 2012. Europeans and their languages. Special Eurobarometer 386 report, Survey conducted by TNS Opinion & Social, and co-ordinated by the European Commission.

Bruno Guillaume, Karën Fort, and Nicolas Lefebvre. 2016. Crowdsourcing complex language resources: Playing to annotate dependency syntax. In Proceedings of the International Conference on Computational Linguistics (COLING), Osaka, Japan.

M. Lafourcade, N. Le Brun, and A. Joubert. 2015. Games with a Purpose (GWAPS). Wiley-ISTE, July.

Mathieu Lafourcade. 2007. Making people play for lexical acquisition. In 7th Symposium on Natural Language Processing (SNLP 2007), Pattaya, Thailand.

Johanna Monti, Maria Pia di Buono, and Federico Sangati. 2017. PARSEME-IT corpus: an annotated corpus of verbal multiword expressions in Italian. In Fourth Italian Conference on Computational Linguistics CLiC-it 2017, pages 228–233. Accademia University Press.

Ana Niño. 2008. Evaluating the use of machine translation post-editing in the foreign language class. Computer Assisted Language Learning, 21(1):29–49.

Harold Somers. 2001. Three perspectives on MT in the classroom. In Proceedings of the eighth Machine Translation Summit (MT Summit VIII), Santiago de Compostela, Galicia, Spain.

Chiara Vettori and Andrea Abel, editors. 2017. KOLIPSI II. Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale. / Die Südtiroler SchülerInnen und die Zweitsprache: eine linguistische und sozialpsychologische Untersuchung. Eurac Research, Bolzano/Bozen.

Luis von Ahn. 2013. Duolingo: learn a language for free while helping to translate the web. In Proceedings of the 2013 international conference on Intelligent user interfaces, pages 1–2. ACM.
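
Illustrative sketch. The overarching approach of Section 4 (cross-matching learners' answers to validate or discard the dataset entries behind an exercise, with proficiency levels serving as a reliability score) could be sketched as follows. This is only a minimal illustration under stated assumptions: the paper does not specify an aggregation method, so the function name `cross_match`, the weighted-vote scheme, the thresholds and the sample lexicon entries are all hypothetical.

```python
from collections import defaultdict


def cross_match(answers, accept=0.7, reject=0.3, min_weight=2.0):
    """Aggregate learner judgements on dataset entries (hypothetical scheme).

    answers: iterable of (entry, is_correct, proficiency) tuples, where
    proficiency in [0, 1] is used as the reliability weight of the answer.
    Returns a dict mapping each entry to "validate", "discard" or "undecided".
    """
    yes = defaultdict(float)    # summed weight of answers confirming the entry
    total = defaultdict(float)  # summed weight of all answers on the entry
    for entry, is_correct, proficiency in answers:
        total[entry] += proficiency
        if is_correct:
            yes[entry] += proficiency

    decisions = {}
    for entry, weight in total.items():
        if weight < min_weight:
            # Not enough (reliable) answers yet: keep collecting.
            decisions[entry] = "undecided"
            continue
        share = yes[entry] / weight
        if share >= accept:
            decisions[entry] = "validate"
        elif share <= reject:
            decisions[entry] = "discard"
        else:
            decisions[entry] = "undecided"
    return decisions


# Hypothetical answers to a "select all verbs" exercise built from a POS lexicon.
sample = [
    (("correre", "VERB"), True, 0.9),
    (("correre", "VERB"), True, 0.8),
    (("correre", "VERB"), True, 0.6),
    (("tavolo", "VERB"), False, 0.9),
    (("tavolo", "VERB"), False, 0.7),
    (("tavolo", "VERB"), True, 0.5),
    (("bello", "ADJ"), True, 0.4),
]
result = cross_match(sample)
```

With these sample answers, the entry for "correre" is validated, the erroneous entry for "tavolo" is discarded, and "bello" stays undecided until more answers accumulate. The minimum-weight threshold reflects the idea that the lower reliability of individual learners is compensated by the quantity of answers collected over time.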