CLARIN-IT: An Overview on the Italian Clarin Consortium After Six Years of Activity Dario Del Fante1 , Francesca Frontini2 , Monica Monachini2 and Valeria Quochi2 1 Dipartimento di Studi Linguistici e Letterari, Università degli Studi di Padova 2 Istituto di Linguistica Computazionale «A. Zampolli», CNR, Pisa Abstract This paper offers an overview of the Italian CLARIN consortium after six years since its establishment. The members, the centres and the repositories and the most important collections are described. Lastly, in order to showcase the visibility and the accessiblity of Language Resources provided by CLARIN-IT from a user-perspective, we show how Italian resources are findable within CLARIN ERIC. Keywords Language Resources, Data Repositories and Archives, Research Infrastructures, CLARIN 1. Introduction CLARIN ERIC 1 is one of the 20 European Research Infrastructure Consortia (ERICs). Its aim is to make digital language resources (hencefoth LRs) [1] available to scholars and researchers from all disciplines. CLARIN-IT, the Italian CLARIN consortium was establised in October 2015 [2], and it has recently been recognized as "project of international significance" by the Italian government. It benefits from the support of the Ministry of Research and has been listed among the high priority research infrastructures, according to Decreto Ministeriale No.1082 del 10/09/2021 - Piano Nazionale per le Infrastrutture di Ricerca (PNIR) 2021-2027 2 . CLARIN ERIC has two fundamental objectives. The first one is to maintain digital repositories, where Language Resources (LRs), that is to say data, corpora, lexicons, tools, are catalogued, stored and retrieved in a simple way. The second one concerns the development of technological solutions that can be intuitively accessed by users. Thefore, CLARIN represents a structure where producers of language technologies and users of these technologies are connected and integrated. CLARIN’s technical infrastructure is designed in accordance with the FAIR principles and aims at supporting the best practices of Open Science, by facilitating the deposit and long term preservation of data, their persistent identification and citation, by providing standardised and machine readable metadata and clear licenses, and easy access via single sign on3 . An important added value provided by CLARIN is that of an increased findability of resources, IRCDL 2022: 18th Italian Research Conference on Digital Libraries, February 24–25, 2022, Padova, Italy $ dario.delfante@unipd.it (D. Del Fante); francesca.frontini@ilc.cnr.it (F. Frontini); monica.monachini@ilc.cnr.it (M. Monachini); valeria.quochi@ilc.cnr.it (V. Quochi) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings CEUR Workshop Proceedings (CEUR-WS.org) http://ceur-ws.org ISSN 1613-0073 1 http://www.clarin.eu 2 https://www.mur.gov.it/it/atti-e-normativa/decreto-ministeriale-n1082-del-10-09-2021 3 https://www.clarin.eu/fair thanks to the harvesting of metadata from various repositories into the central infrastructure. The aim of this paper is twofold: • To provide a clear overview of the Italian national CLARIN consortium as it currently stands, six years after its creation in terms of members, centres and collections stored • To illustrate the visibility and the accessiblity of Italian LRs within the CLARIN infras- tructure As concerns the second point, for reasons of space, we will not present here the overall archi- tecture of CLARIN ERIC, for which we refer to publications such as [3, 2]. In this contriubtion we shall concentrate in particular on the Virtual Language Observatory, the CLARIN’s meta catalogue where the metadata from all data centres are made visible and searchable from a single point of access. In section 2, we discuss the current state of affairs of the Italian consortium in terms of members, and centres within the CLARIN federation, with a special focus on what they offer to CLARIN in terms of resources, services and expertise. In section 3, we simulate two queries on the VLO to highlight how this tool can contribute to the visibility of Italian resources from a user’s perspective and we discuss about the results. 2. CLARIN-IT in 2021 2.1. Members The CLARIN-IT4 consortium includes a founding member and seven full members. The current full members are the following: 1. the Istituto di Linguistica Computazionale "A. Zampolli” (ILC) of the Consiglio Nazionale delle Ricerche in Pisa is the founding member and host of the ILC4CLARIN repository5 ; 2. The EURAC Research Association (Bolzano) signed the CLARIN-IT Membership Appli- cation in 2017. The membership establishes that the organization creates a repository compliant with the CLARIN guidelines in which to deposit the metadata relating to the resources and tools available at its headquarters. 3. The Department of Education, Human Sciences and Intercultural Communication of the University of Siena signed the CLARIN-IT Membership Application in 2016. 4. The Department of Philology and Literary Criticism of the University of Siena signed the CLARIN-IT Membership Application in 2017. 5. The Bruno Kessler Foundation (Trento) signed the CLARIN-IT Membership Application in 2018. 6. The Archival and Bibliographical Superintendence of Tuscany (Firenze) signed the CLARIN-IT Membership Application in 2019 7. The Department of Electrical Engineering and Information Technology and the Interde- partmental Research Center "URBAN/ECO" of the University of Naples Federico II signed the CLARIN-IT Membership Applications in 2020. 4 For a survey on Language Resources in CLARIN-IT, refer to [4] 5 https://ilc4clarin.ilc.cnr.it/ 8. The Catholic University of the Sacred Heart (Milano) joined CLARIN-IT by signing a Scientific Collaboration Agreement with ILC-CNR (Pisa) CLARIN in 2021. Moreover, thanks to a continuous and focused User Involvement strategy, the Consortium is constantly expanding. Many other institutions have expressed their interest in joining CLARIN-IT or in depositing their data in the CLARIN national repository. The Italian consortium embraces different research directions. One of those is the field of Digital Classics, which still suffers from shortage or restricted availability of language resources for historical languages such as Ancient Greek, Latin or Sanskrit. To this end, the consortium aims to make some of the existing digitized resources for Ancient Greek and Latin available through its repositories, as well as to create new ones by enriching existing corpora and lexical datasets with Linked Open Data. Another important direction is that of speech and oral archives, which are at the crossroads between speech sciences, digital humanities and digital heritage. CLARIN-IT collaborates with the University of Siena and the Superintendence of Tuscany to coordinate a project aiming at building a model and an architecture for the preservation, enhancement and accessibility of such archives. Finally, the EURAC partners are carrying out research around non-standard forms of language as found in learner corpora and computer- mediated communication. The founding member and the consortium members are also involved in many international infrastructural projects, which aim to strengthen the cohesion of research across a number of related fields associated with the humanities. Among these we cite ELEXIS6 (on e-lexicography), the SSHOC cluster project7 (European open cloud ecosystem of data and tools for SSH), the TRIPLE project8 (a discovery platform for SSH). CLARIN-IT researchers are also active in standardization initiatives, such as ISO, TEI, W3C, and international academic organizations and networks, such as Learner Corpus Association9 , Special Interest Groups on Computer Mediated Communication, the COST Action European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect10 ), and the COST Action Nexus Linguarum11 , for building an ecosystem of multilingual and semantically interoperable linguistic data at Web scale. 2.2. Centres The CLARIN network is composed of distributed centres, which can be of three main types12 . Firstly, there are technical centres or B-Centres: these are generally hosted by a university or a public research institution and offer access to resources, services 13 and/or knowledge. Secondly, there are Metadata Providing Centres or C-Centres: they offer deposit and metadata curation. Lastly, there are Knowledge Centres or K-Centres: centres sharing their knowledge 6 https://elex.is/ 7 https://sshopencloud.eu/ 8 https://project.gotriple.eu/ 9 https://www.learnercorpusassociation.org/ 10 https://enetcollect.eurac.edu/ 11 https://nexuslinguarum.eu/ 12 A complete list can be found here: https://www.clarin.eu/content/overview-clarin-centres 13 https://www.clarin.eu/content/services and expertise on one or more aspects of a domain covered by CLARIN. Their mission is to ensure that the available knowledge and expertise does not exist as a fragmented collection of informations, but it is made accessible in an organised way to both the CLARIN community and the social sciences and humanities research community at large. Each K-centre has its own specific areas of expertise. CLARIN-IT comprises two data centres: • The ILC4CLARIN B-centre, which is hosted and managed by the Institute for Computa- tional Linguistics "A.Zampolli" in Pisa, the founding member of CLARIN-IT . • The EURAC Research CLARIN Centre (ERCC) C-centre, which is hosted by the Institute for Applied Linguistics (IAL) at Eurac Research in Bolzano, a full member of CLARIN-IT. Both centres offer the possibility to: • deposit data, by ensuring that they are stored safely; • search for data and tools and to download them easily; • make the citation format as easy and consistent as possible; Through the two repositories, CLARIN-IT offers a variety of resources (Cf. section 2.3 for an extensive excursus on CLARIN-IT offer). ILC4CLARIN, as the national B centre, has the mission to offer deposit facility to the whole of the national community, and also hosts a number of Natural Language Processing services14 , many of which are offered as web services and integrated into the CLARIN pipeline management system WebLicht15 . In addition to these, Italy is also currently hosting two K-centers. • CLARIN Knowledge Centre for Digital and Public Textual Scholarship (DiPText-KC)16 , jointly maintained by University Ca’ Foscari in Venice and ILC-CNR • CLARIN Knowledge Centre for Computer-Mediated Communication and Social Media Corpora (CKCMC)17 , a distributed K-centre jointly hosted by the Institute for Applied Linguistics, Eurac Research (IAL) in Bolzano, the Formal Linguistics Laboratory (LLF), in Paris, the Jožef Stefan Institute (IJS) in Ljubljana and the Leibniz-Institute for the German Language (IDS) in Mannheim While these centres constitute the backbone of the national in-kind contribution to CLARIN ERIC, other activities are worth mentioning, such as the participation of members of the Italian consortium in various important CLARIN ERIC committees, such as the Legal and Ethical Issues Committee (CLIC) and the Standards Committee, as well as the work carried out to facilitate the deposit of important collections, such as for instance the Archivio della Latinità Italiana del Medioevo (ALIM). 14 https://ilc4clarin.ilc.cnr.it/services/ 15 https://weblicht.sfs.uni-tuebingen.de/ 16 https://diptext-kc.ilc4clarin.ilc.cnr.it/ 17 https://cmc-corpora.org/ckcmc 2.3. Collections offered by CLARIN-IT CLARIN-IT offers seven different digital collections, which are deposited in one of the two centres previoulsy mentioned (section 2.2). As Table 1 shows, each collection includes a number of individual language resources. Collections ALIM Literary Sources 344 ILC4CLARIN : OPEN Data and Tools 9 ILC4CLARIN 58 CIRCSE 8 Alim Documentary Sources 11 ERCC Learner Corpora 8 Eurac: Learner Language 10 ERCC Web Corpora 4 Table 1 Collections in CLARIN-IT. Situation at December 2021. For example, ALIM Literary Sources [5] collection gives access to a vast archive Latin texts produced in Italy during the Middle Ages; its publication is of great importance for providing CLARIN-IT and the CLARIN community, at large, with critically reliable texts for the use of philologists, historians of literature, historians of institutions, culture and science of the Middle Age. However, the Italian centres do not only host collections by Italian institutions. For example, the Ghent University has deposited three corpora in the ERCC centre: • ACTER (Annotated Corpora for Term Extraction Research) v1.4 • Beldeko Summary Corpus v1.0.0 • ACTER (Annotated Corpora for Term Extraction Research) v1.3 Indeed, both CLARIN-IT centers offer LRs in a variety of languages, not only Italian, as shown in Table 2. Languages Latin 369 Croatian 1 English 43 Modern Greek 2 Italian 38 Croatian 1 Arabic 32 Ladino 1 German 12 Mòcheno 1 Ancient Greek 10 Sardinian 1 French 4 Saurano 1 Dutch 4 Slovenian 1 Czech 2 Spanish; Castilian 1 Basque 1 Trentino 1 Breton 1 Tyrolean 1 Cimbrian 1 Veneto 1 Table 2 Languages in CLARIN-IT Since ILC4CLARIN is specialised in ancient texts, Latin and Ancient Greek are particularly represented. However, as discussed in [4], Latin LRs are overrepresented, even with respect to Italian, because of the metadata of the ALIM corpus: each text of that collection is deposited as a separate resource, while this is not true for other corpora [5]. 3. Italian Resources in CLARIN The most important poin of access for CLARIN is the Virtual Language Observatory (VLO)18 [6] which harvests metadata from all the official CLARIN data providing centres and makes them searchable via a unified interface offering faceted search. Other interesting and useful central discovery services are the Federated Content Search (FCS)19 , the Language Resources Switchboard (SB)20 , and the CLARIN Resource Families21 . In order to investigate the visibility and the accessiblity of Italian LRs in CLARIN-IT from a user-perspective, we showcase two simple queries which can be performed using the faceted search functionalities of the VLO. Each query was raised in different formats. Our aim here is to assess the visibility of resources from Italian centres, but also resources that may be relevant to Italian researchers offered by centres outside of Italy, by trying to replicate as much as possible the behaviour of a non-expert user, accessing the VLO for the first time. Query 1 - Search for Italian Corpora in the VLO by filtering for: 1. Language = Italian . 2. Resource type = Corpora. This query returns 159 results. If we filter them by organisation, we can easily assess that the resources from the Italian centres are correctly displayed in the meta catalogue. For instance by adding the filter • organisation = Institute for Applied Linguistics, Eurac Research all 7 corpora from ERCC are selected. At the same time, it also becomes evident that important resources for the Italian language are also offered by other centres outside Italy, such as, among others, the Universal Dependencies Consortium (20 corpora) and the Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) (12 corpora) which host multilingual collections of Treebanks, namely syntactically annotated corpora. From this query it is thus evident that the participation of Italian centres in CLARIN not only allows resources produced in Italy to gain visibility outside of the national context, but also to make important resurces hosted abroad more accessible to Italian researchers. A similar experiment can be attempted for Latin, a language that is strongly represented among CLARIN- IT resources. 18 https://vlo.clarin.eu 19 https://contentsearch.clarin.eu/ 20 https://switchboard.clarin.eu/ 21 https://www.clarin.eu/resource-families Query 2 - Search for Latin lexical resources in the VLO by filtering for: 1. Language = Latin 2. Resource type = Lexical Resource This query returns 22 results. Among these we find the important resources made available on ILC4CLARIN by the CIRCSE collection, which contains the results of the Linking Latin (LiLa) ERC project22 , produced by Marco Passarotti and his colleagues. However, an important resource by the CIRCSE lab, namely the Word Formation Latin (WFL), is hosted instead by the LINDAT repository, together with the other resources of the multilingual Universal Derivations collection. Thus the VLO view allows users to easily find links between resources that are hosted on different repositories, and so making them more accessible and reusable in research. 4. Conclusive Remarks and Future Work The infrastructures such as CLARIN play an important role to overcome future challenges of Open Science, and to deliver on the promise of a digital ecosystem of freely accessible, interconnected resources, where data from different providers, made available according to the FAIR principles, can be reused and combined to produce novel research. Within this context, the role of standards in the archivial management is fundamental. The CLARIN-IT consortium is constantly increasing the number of resources deposited in its centres, and also conducting a regular monitoring of the different collections provided by various partners across the two repositories, so as to verify their visibility in CLARIN. An important future challenge is that of increasing CLARIN’s users base at the national level. CLARIN has developed different methodologies and approaches to mesure and evaluate user engagement. Surveys can be used to test the interest in the use of digital resources and related tools [7]. A new survey, which is still on going, shows that while Italian researchers in the domain of Language Resources and Technologies are mostly aware of CLARIN-IT’s services, a large number of them is still relying on local repositories or on GitHub to store their data. It is thus crucial that CLARIN-IT provides adequate training to resources such as the VLO, in particular targeting the needs of specific communities. In this sense the newly released tutorial CLARIN Tools and Resources for Lexicographic Work [8] is a step in this direciton. References [1] J. Godfrey, A. Zampolli, Language resources, in: Survey of the State of the Art in Human Language Technology. Linguistica Computazionale, XII-XIII., Cambridge University Press, 1997, pp. 381–384. [2] M. Monachini, F. Frontini, CLARIN, l’infrastruttura europea delle risorse linguistiche per le scienze umane e sociali e il suo network italiano CLARIN-IT, Italian Journal of Computational Linguistics 2 (2016) 11–30. URL: http://journals.openedition.org/ijcol/387. doi:10.4000/ijcol.387. 22 https://lila-erc.eu/#page-top [3] F. de Jong, B. Maegaard, D. Fišer, D. van Uytvanck, A. Witt, Interoperability in an infrastruc- ture enabling multidisciplinary research: The case of CLARIN, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 3406–3413. URL: https://aclanthology.org/2020.lrec-1.417. [4] D. Del Fante, F. Frontini, M. Monachini, V. Quochi, CLARIN-IT resources in CLARIN ERIC - a bird’s-eye view, in: CLARIN Annual Conference 2021, Proceedings CLARIN Annual Conference 2021, 2021, pp. 129–133. [5] F. Boschetti, R. Del Gratta, M. Monachini, M. Buzzoni, P. Monella, R. Rosselli Del Turco, “Tea for Two”: The Archive of the Italian Latinity of the Middle Ages meets the CLARIN Infrastructure, in: C. Navarretta, M. Eskevich (Eds.), Proceedings of CLARIN Annual Confer- ence 2020. Virtual Edition, 2020. URL: https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ ConferenceProceedings.pdf. [6] D. Broeder, M. Kemps-Snijders, D. V. Uytvanck, M. Windhouwer, P. Withers, P. Wittenburg, C. Zinn, A data category registry- and component-based metadata framework, in: N. C. C. Chair), K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, D. Tapias (Eds.), Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), European Language Resources Association (ELRA), Valletta, Malta, 2010, pp. 19–21. [7] M. Monachini, A. Nicolosi, A. Stefanini, Digital classics and CLARIN-IT: What italian scholars of ancient greek expect from digital resources and technology, in: Selected papers from the CLARIN Annual Conference 2017, Budapest, 18-20 September 2017, 2018, pp. 61–74. [8] F. Frontini, A. Bellandi, V. Quochi, M. Monachini, K. Mörth, S. Zhanial, M. Ďurčo, A. Woldrich, CLARIN Tools and Resources for Lexicographic Work, 2022. URL: https://elexis. humanistika.org/en/resource/posts/clarin-tools-and-resources-for-lexicographic-work, publisher: DARIAH-Campus.