                                               EnetCollect in Italy
Lionel Nicolas¹, Verena Lyding¹, Luisa Bentivogli², Federico Sangati³, Johanna Monti³, Irene Russo⁴, Roberto Gretter⁵, Daniele Falavigna⁵
¹ Institute for Applied Linguistics, Eurac Research, Bolzano
² HLT-MT Unit, Fondazione Bruno Kessler, Trento
³ Department of Literary, Linguistic and Comparative Studies, University of Naples “L’Orientale”, Naples
⁴ Institute of Computational Linguistics “Antonio Zampolli”, CNR, Pisa
⁵ SpeechTek Unit, Fondazione Bruno Kessler, Trento

Abstract

English. In this paper, we present the enetCollect¹ COST Action, a large network project that aims at initiating a new Research and Innovation (R&I) trend combining the well-established domain of language learning with recent and successful crowdsourcing approaches. We introduce its objectives and describe its organization. We then present the Italian network members and detail their research interests within enetCollect. Finally, we report on its progress so far.

Italiano. In questo articolo presentiamo la COST Action enetCollect, un ampio network il cui scopo è avviare un nuovo filone di Ricerca e Innovazione (R&I) combinando l’ambito consolidato dell’apprendimento delle lingue con i più recenti e riusciti approcci di crowdsourcing. Introduciamo i suoi obiettivi e descriviamo la sua organizzazione. Inoltre, presentiamo i membri italiani ed i loro interessi di ricerca all’interno di enetCollect. Infine, descriviamo lo stato di avanzamento finora raggiunto.

1   Introduction

In this paper, we present the COST network enetCollect, which aims at kick-starting an R&I trend for combining language learning with crowdsourcing techniques in order to unlock, for all languages, the crowdsourcing potential inherent in learning and teaching activities. This potential will be used to mass-produce language learning material and language-related datasets, such as NLP resources. We also present enetCollect’s Italian members alongside their NLP-related interests. Indeed, NLP heavily relies on language resources, and their availability is crucial for the delivery of reliable NLP solutions. Due to high production costs, resources are often missing, especially for lesser-used languages. As enetCollect researches new approaches to tackle such issues, it is a project of particular interest for the Italian NLP community.

EnetCollect connects to ongoing crowdsourcing research, including Games With A Purpose (GWAP) approaches (Chamberlain et al., 2013; Lafourcade et al., 2015) for collecting data through gamified tasks (e.g. JeuxDeMots (Lafourcade, 2007) or ZombiLingo (Guillaume et al., 2016)), collaborative approaches such as Wisdom-of-the-Crowd initiatives (e.g. dict.cc², Wiktionary³, and Duolingo (von Ahn, 2013)), and general Human-based Computation activities (implemented through platforms like Zooniverse⁴, Crowd4u⁵, etc.).

This paper aims at fostering the participation of the Italian NLP community while further allowing it to benefit from the research and collaboration opportunities enetCollect offers (e.g. research stay grants) for its remaining 2.5 years of funding. Sections 2 and 3 present enetCollect’s ambition and its organization, while Section 4 introduces the Italian members and their research interests. Sections 5 and 6 report on achievements up to now and the current state of affairs.

¹ European Network for Combining Language Learning with Crowdsourcing Techniques, Web: (EnetCollect, 2018)
² https://www.dict.cc
⁴ https://www.zooniverse.org/
⁵ http://crowd4u.org/en/

2   Challenge, Motivation and Objectives

Started in March 2017, enetCollect will pursue, until April 2021, the long-term challenge of fostering language learning in Europe and beyond by taking advantage of the ground-breaking nature of crowdsourcing and the immense and ever-
growing crowd of language learners and teachers⁶ to mass-produce language learning content and, at the same time, language-related data such as NLP resources. The prospect of mass-producing language-related data can vastly impact domains such as NLP, which will in turn benefit language learning by fostering support from various language-related stakeholders (e.g. see Section 4 for NLP-related crowdsourcing scenarios).

As intensifying migration flows (due to economic and geopolitical reasons) increase the diversification of language learner profiles and the demand for learning material, the launch of such an R&I trend is very timely. Indeed, the effectiveness of the existing material runs the risk of gradually falling behind, and the varied combinations of languages taught and target groups can hardly be addressed by small-scale initiatives. EnetCollect thus kick-starts, at an opportune time, an overarching R&I trend to continuously foster various initiatives. Funding-wise, the timing is also favorable, as both the increasing need for learning solutions and the problem-solving nature of crowdsourcing are widely acknowledged.

The creation of a new R&I community is addressed through formal Research Coordination Objectives, which aim at creating a shared knowledge of the subject, carrying out prototypical experiments and disseminating promising results, while formal Capacity-Building Objectives aim at creating the core R&I community, communication means and new initiatives. In Section 5, we report on progress regarding these objectives.

3   Working Groups and Coordinations

EnetCollect makes a working distinction between explicit and implicit crowdsourcing approaches: while for explicit crowdsourcing the crowd intentionally participates (e.g. Wikipedia), for implicit crowdsourcing the crowd is not necessarily aware of its participation (e.g. reCaptcha⁷). EnetCollect is organized along five working groups (WGs) and three support groups called coordinations.

Whereas WG1 focuses on explicit crowdsourcing approaches to create data or learning content (e.g. collaboratively creating lessons), WG2 focuses on implicit crowdsourcing approaches for the same purpose (e.g. generating exercise content from language-related resources and collecting the answers to the exercises to correct and extend the resources used). WG3 focuses on user-oriented design strategies to attract and retain a crowd (e.g. studying the relevance and attractiveness of learner profiling for vocabulary training). WG4 focuses on studying the functional demands and the existing solutions related to language learning and crowdsourcing (e.g. technical solutions addressing the scalability needs of some methods). Finally, WG5 focuses on application-oriented questions such as ethical issues, legal regulations, and commercialization opportunities.

The five WGs are different content-wise and can be pursued in parallel. Nonetheless, they remain interdependent in the overarching objective. For example, the boundary between explicit and implicit crowdsourcing (WG1 and WG2) is sometimes difficult to draw when the crowd is explicitly involved while their actions are being implicitly crowdsourced⁸. Also, any crowdsourcing approach will fail if there is no crowd to rely on (WG3), no technical solution to support its functional needs (WG4), or no appropriate ethical or legal context in which to implement it (WG5). Alongside the WGs, three coordination groups on Dissemination, Exploitation and Outreach provide standardized support for WG-transversal tasks.

⁶ 21% of Europeans aged over 14 years (~90 million people; Eurobarometer report (European Commission, 2012)).
⁷ https://www.google.com/recaptcha
⁸ E.g. crowdsourcing learner essays and their corrections by teachers to create annotated corpora.

4   Research Interests of Italian Members

The Italian members are currently among the most numerous and active participants in the Action and its events. In addition, the Action coordination (Chair and Grant Holder) is carried out by two Italian members from Eurac Research (see below). Being all related to NLP, enetCollect’s Italian partners have a common interest in combining language learning with implicit crowdsourcing (WG2) so as to extend and correct NLP datasets. All crowdsourcing scenarios described hereafter share the same overarching approach: the NLP partner uses an NLP dataset to generate exercise content and both crowdsources and cross-matches the learners’ answers in order to validate/discard the data used to generate the exercise content, just like GWAP players validate/discard data while playing. Deriving expert knowledge from cross-matched learners’ answers is a challenge enetCollect aims at addressing. Relying on a crowd of
learners is however promising in two ways. First, learners should be mostly confronted with exercise content generated from reliable NLP data so as not to undermine their efforts. Their constantly evaluated proficiency levels thus provide a reliability score for their answers. Second, as a crowd of learners renews itself over time, the set of crowdsourced answers for each question is potentially infinite, and their “inferior” reliability is thus compensated by their “superior” quantity.

The Institute for Applied Linguistics (IAL) of Eurac Research is particularly concerned with research on the three official languages of South Tyrol (Italian, South Tyrolean German and the minority language Ladin). As regards NLP, Italian is the best covered, while South Tyrolean German is approximated by adapting solutions for standard German, and Ladin has barely any coverage. To improve this situation, the IAL aims at crowdsourcing varied NLP resources for South Tyrolean German and Ladin, starting with wide-coverage Part-of-Speech (POS) lexica. The foreseen crowdsourcing scenario is to use POS lexica to generate exercise content for widely adopted exercises such as grouping words according to their properties (e.g. “select all verbs among these five words”) or identifying words within a grid of random letters (e.g. “select five adjectives in the grid”). By crowdsourcing the learners’ answers, the IAL aims at gradually improving the lexica while continuously adding new entries. As for the targeted crowd of learners, the IAL will build on its long-standing collaborations with schools (Vettori and Abel, 2017; Abel et al., 2014) and is considering targeting the local language certification⁹, an obligatory exam for public positions for which no dedicated learning tool is currently available online.

The Human Language Technology - Machine Translation (HLT-MT) research unit of Fondazione Bruno Kessler (FBK) is concerned with MT technologies supporting both human translators and multilingual applications. The creation of dedicated language resources is thus a core activity. Within enetCollect, HLT-MT aims at enriching existing parallel corpora and at enhancing MT evaluation by crowdsourcing multiple translations of the same sentence (Bentivogli et al., 2018). As such translations paraphrase one another, they are also of interest for monolingual NLP purposes. Following the growing number of studies on the language learning usage of MT (Somers, 2001; Niño, 2008; Case, 2015; Dongyun, 2017), HLT-MT focuses on “post-editing” exercises fostering correction and writing skills, where students are presented with a sentence and several possible translations and are asked to choose the most appropriate one and, if necessary, revise it. Existing parallel corpora and state-of-the-art MT systems trained on them will make it possible to test the learners’ skills and generate new translations. While learning, students will thus be trained and evaluated, and will sometimes be allowed to correct MT outputs and extend training corpora. For such a crowdsourcing scenario, advanced L2 learners will be targeted, especially those studying Translation Studies for Italian, English and German at partners of the Universities of Trento and Bologna.

The PARSEME-IT research group¹⁰ of the Department of Literary, Linguistic and Comparative Studies, University of Naples “L’Orientale” aims at improving the linguistic representativeness, precision, robustness and computational efficiency of NLP applications (Monti et al., 2017). It researches MultiWord Expressions (MWEs¹¹), a major NLP bottleneck, and investigates their representation in language resources and their integration in syntactic parsing, translation technology, and language learning. The possibility to enhance mono- and multilingual language resources focusing on MWEs is of particular interest, especially with regard to MWE lexica and corpora annotated with MWEs. Accordingly, a set of different exercises engaging students at different levels (junior high school, high school, and undergraduate) is envisioned. For example, exercises to improve lists of Italian MWEs and their correspondences in different languages would ask learners to identify/validate MWEs in monolingual texts and suggest possible translations, or to identify/validate MWEs and their translations in parallel corpora. The targeted students are BA and MA students of the University of Naples “L’Orientale”, especially those attending translation classes with a solid curriculum in linguistics and Translation Studies.

⁹ Exam for bilingualism, Web: (BZ Alto Adige, 2018)
¹⁰ parseme-it/home
¹¹ Groups of words composing one lexical unit, such as ‘tirare le cuoia’ (En. kick the bucket)

The Institute of Computational Linguistics “Antonio Zampolli” (CNR-ILC) has carried out research at the international, European, national and
regional level since 1967. It has participated in several EU initiatives on language resource documentation and recently took the lead of the national CLARIN-IT¹² consortium. Its main areas of competence also include Text Processing, NLP, Knowledge Extraction, and Computational Models of Language Usage. Among ILC’s resources, ImagAct¹³, a multimodal resource about action verbs, represents a starting point for crowdsourcing experiments, where words denoting actions could be explained through videos sharing a semantic core. Crowdsourcing could be used to build these datasets by asking learners to label actions shown in short videos. As shown with middle school pupils (Coppola et al., 2017), analyzing a video illustrating verbs and associating it with words in multiple languages reinforces metalinguistic reasoning (CARAP, 2012). Such combinations of semantic traits and action verbs can also be used for textual entailment.

The SpeechTEK research unit of Fondazione Bruno Kessler (FBK) works on Automatic Speech Recognition (ASR) and addresses computer-assisted language learning as an application field. In a first project, it aims to automatically assess children’s reading capability at primary school. ASR is used to align a given text with the speech read out by a pupil, to highlight the pupil’s errors and to score the reading. A second project concerns the use of ASR and classification tools to automatically check the proficiency of Italian students aged between 9 and 16 years in learning both English and German. Both written texts and spoken utterances have to be evaluated, using reference scores related to some proficiency indicators (e.g. pronunciation, fluency, lexical richness) given by human experts. In the first project, corrections of ASR errors can be crowdsourced and used to build more reliable models for assessing the reading capabilities of children. Similarly, in the second project, crowdsourcing could help both to transcribe and to score the answers uttered by the students. In both cases, crowdsourcing could make it possible to adapt ASR models and produce more reliable gold standards.

5   Progression of the Network

In this section, the most relevant achievements¹⁴ related to the overall progression of the network are reported in relation to the formal Research Coordination and Capacity-Building Objectives outlined earlier in Section 2.¹⁵

Creating a core community of stakeholders. The already large initial number of 68 individual members from 34 participating countries has increased by 67% to 114 members and by 10% to 38 countries. The number of people subscribed to enetCollect’s mailing list has increased by 149%, from 79 to 197. Also, 15 financed research stays, lasting 152 days overall, led to intense cooperation.

Building the theoretical framework. The 30 presentations and 39 posters at network meetings and the 15 research stays have contributed to the first building blocks of the foreseen theoretical framework, especially with regard to the state-of-the-art review. So far, 3 meetings and 1 training school have been organized (168 participations in total).

Communication and outreach. EnetCollect’s intranet and website have been online for 9 and 7 months, respectively, and already host a substantial amount of information. Eleven mailing lists targeting subsets of members were created and used. Four calls for research stays and five calls for meeting participation were distributed and drew attention (and members) to enetCollect. Aside from one invited talk, several early activities aimed at publications at conferences of related research communities are ongoing.

Funding new initiatives. Funding applications were supported early on, e.g. through the advertisement of specific opportunities or dedicated internal campaigns (e.g. for Marie Skłodowska-Curie Individual Fellowships). Three applications for mid-sized projects were already submitted in the first year, of which two were positively evaluated and one was funded by a Swiss agency.

6   Conclusion

We presented enetCollect, outlined its key aspects and introduced both its Italian members and their research interests. By harnessing even a fraction of the crowdsourcing potential existing for all languages taught worldwide, enetCollect could trigger changes of noticeable impact for language learning and language-related R&I fields, such as NLP. The fast uptake and overall progression of enetCollect within its first year indicate its relevance and the potential magnitude of its ambition.

¹² www.clarin-it.it
¹³ www.imagact.it
¹⁴ We do not report on content-related results as these are too numerous and varied and, more importantly, they are (or will be) the focus of different publications authored by the members having achieved them.
¹⁵ See more information on http://enetcollect.eurac.edu.
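
Illustrative sketch. The overarching approach shared by the scenarios in Section 4 (generate exercise content from an NLP dataset, then cross-match learners' answers, weighted by their evaluated proficiency, to validate or discard the underlying entries) can be illustrated with a minimal Python sketch. This is not enetCollect's implementation: the function name `cross_match`, the thresholds, the 0-to-1 proficiency scale and the toy lexicon entries are all assumptions made purely for illustration.

```python
from collections import defaultdict


def cross_match(answers, accept=0.7, reject=0.3, min_weight=2.0):
    """Aggregate learner answers per dataset entry (hypothetical helper).

    answers: iterable of (entry, proficiency, agrees) tuples, where
      - entry is the dataset item an exercise was generated from,
      - proficiency is the learner's reliability score in [0, 1] (assumed scale),
      - agrees is True if the learner's answer is consistent with the entry.
    Returns a dict mapping each entry to "validated", "discarded" or "undecided".
    """
    agree_weight = defaultdict(float)
    total_weight = defaultdict(float)

    # Each answer counts in proportion to the learner's proficiency.
    for entry, proficiency, agrees in answers:
        total_weight[entry] += proficiency
        if agrees:
            agree_weight[entry] += proficiency

    decisions = {}
    for entry, weight in total_weight.items():
        # Too little evidence so far: wait for more answers from the crowd.
        if weight == 0 or weight < min_weight:
            decisions[entry] = "undecided"
            continue
        support = agree_weight[entry] / weight
        if support >= accept:
            decisions[entry] = "validated"
        elif support <= reject:
            decisions[entry] = "discarded"
        else:
            decisions[entry] = "undecided"
    return decisions


# Toy POS-lexicon entries (assumed data): (word, POS tag) pairs.
answers = [
    (("correre", "VERB"), 0.9, True),
    (("correre", "VERB"), 0.6, True),
    (("correre", "VERB"), 0.8, True),
    (("casa", "VERB"), 0.9, False),
    (("casa", "VERB"), 0.7, False),
    (("casa", "VERB"), 0.5, True),
    (("bello", "ADJ"), 0.4, True),
]
decisions = cross_match(answers)
```

With the toy data above, the correct entry for "correre" is validated, the incorrect entry for "casa" is discarded, and the entry for "bello" stays undecided because it has gathered too little weighted evidence. The `min_weight` threshold reflects the idea that the lower reliability of individual learner answers is compensated by collecting more of them over time.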
