=Paper=
{{Paper
|id=Vol-3224/paper07
|storemode=property
|title=Proxecto Nós: Artificial intelligence at the service of the Galician language
|pdfUrl=https://ceur-ws.org/Vol-3224/paper07.pdf
|volume=Vol-3224
|authors=Adina Ioana Vladu,Iria de-Dios-Flores,Carmen Magariños,John E. Ortega,José Ramom Pichel Campos,Marcos García,Pablo Gamallo,Elisa Fernández Rei,Alberto Bugarín,Manuel González González,Senén Barro,Xose Luis Regueira
|dblpUrl=https://dblp.org/rec/conf/sepln/VladudMOCG0RBGB22
}}
==Proxecto Nós: Artificial intelligence at the service of the Galician language==
Proxecto Nós: Inteligencia artificial al servicio de la lengua gallega
Adina Ioana Vladu 1, Iria de-Dios-Flores 2, Carmen Magariños 1, John E. Ortega 2, José Ramom
Pichel 2, Marcos Garcia 2, Pablo Gamallo 2, Elisa Fernández Rei 1, Alberto Bugarín 2, Manuel
González González 1, Senén Barro 2 and Xosé Luis Regueira 1
1 Instituto da Lingua Galega (ILG) - Universidade de Santiago de Compostela, Spain
2 Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) - Universidade de Santiago de Compostela, Spain
Abstract
Proxecto Nós is an initiative aimed at providing the Galician language with openly licensed
resources, tools, and demonstrators in the area of intelligent technologies. The Project has two
main scientific and technological objectives: (i) to integrate the Galician language into
cutting-edge AI and language technologies, thus enabling the natural use of Galician in human-machine
interactions; and (ii) to improve the state of the art of language technologies for Galician.
Keywords
Language technologies, linguistic rights, Galician, low-resource languages.
1. Introduction
Proxecto Nós (The Nós Project) is an initiative promoted by the Galician Government (Xunta de Galicia), aimed at providing the Galician language with openly licensed resources, tools, demonstrators, and use cases in the area of intelligent technologies. The execution of Proxecto Nós has been entrusted to the University of Santiago de Compostela (USC) and is currently being carried out by a research team comprising members of the Instituto da Lingua Galega (ILG) and the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS). The first stage, spanning from the last quarter of 2021 to 2025, will lay the foundations and provide the resources that will help place Galician among the languages that are fully active in the digital society and economy.
SEPLN-PD 2022. Annual Conference of the Spanish Association for
Natural Language Processing 2022: Projects and Demonstrations,
September 21-23, 2022, A Coruña, Spain
EMAIL: adina.vladu@usc.gal (A.I. Vladu); iria.dedios@usc.gal (I. de-Dios-Flores); mariadelcarmen.magarinos@usc.gal (C. Magariños); john.ortega@usc.gal (J. Ortega); jramom.pichel@usc.gal (J.R. Pichel); marcos.garcia.gonzalez@usc.gal (M. Garcia); pablo.gamallo@usc.gal (P. Gamallo); elisa.fernandez@usc.gal (E. Fernández Rei); alberto.bugarin.diz@usc.gal (A. Bugarín); manuel.gonzalez.gonzalez@usc.gal (M. González González); senen.barro@usc.gal (S. Barro); xoseluis.regueira@usc.gal (X.L. Regueira)
ORCID: 0000-0002-3910-7820 (A.I. Vladu); 0000-0002-5941-1707 (I. de-Dios-Flores); 0000-0003-3525-1304 (C. Magariños); 0000-0002-2328-3205 (J. Ortega); 0000-0001-5172-6803 (J.R. Pichel); 0000-0002-6557-0210 (M. Garcia); 0000-0002-5819-2469 (P. Gamallo); 0000-0002-4109-0087 (E. Fernández Rei); 0000-0003-3574-3843 (A. Bugarín); 0000-0001-7948-4607 (M. González González); 0000-0001-6035-540X (S. Barro); 0000-0001-7264-3740 (X.L. Regueira)
©️ 2020 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
2. Context and motivation
The development of language technologies is a strategic innovation area geared towards the digital society and economy, and it has been a priority in both Spanish (Plan Estatal de Investigación Científica y Técnica y de Innovación, Estrategia Española de Ciencia y Tecnología y de Innovación) and European (Horizon 2020) scientific planning. Technologies such as machine translation (MT), information extraction (IE), text analytics, and dialogue systems are essential in the digital society, culture, and economy.
Languages in high demand worldwide (especially English) benefit from a large variety of computational resources that can contribute to developing new automatic language processing technologies and tools. This is due to the long-standing research tradition in these areas (e.g., the variety of projects financed by the USA’s DARPA) and the need to incorporate such languages into the AI applications associated with the latest electronic devices (such as the conversational AI or automatic dictation software developed by Google, Amazon or Apple). Other languages that have joined AI research later, such as Chinese, are currently following in the footsteps of English, through projects such as Baidu’s Qian Yan, which significantly improve the computational resources available in their respective language varieties.
Nevertheless, language technologies are also necessary for languages in lower international demand. Consequently, different languages have developed similar initiatives to Nós. Among others, we can highlight Projecte AINA, which will develop computational resources for Catalan until 2024, or the work carried out at the HiTZ Research Center, focusing on language technologies for Basque. Other projects, such as CorCenCC (in Great Britain, for Welsh) or UQAILAUT (in Canada, for Inuktitut), have been considered success cases in the promotion of the digital use of socially threatened languages.
The democratization of language technologies has a great social and cultural impact on the communities that use them. For instance, MT increases access to content in different languages, thus facilitating intercultural relations; dialogue systems allow us to communicate with machines in our own language; and semantic technologies enable advances in the automatic comprehension of texts, thus making it possible to process enormous quantities of documents. In the case of Galician, incorporating the language into state-of-the-art AI applications can not only significantly favor its prestige (a decisive factor in language normalization), but also guarantee citizens’ language rights and reduce social inequality.
In economic terms, the global Natural Language Processing (NLP) market size was valued at more than USD 10 billion in 2020 and is expected to reach USD 41 billion by 2025 (Aldabe et al., 2021). NLP technologies are used in different areas such as information retrieval, MT, IE (with notable growth in its application in the medical domain during the Covid-19 pandemic), dialogue systems, and automatic text generation, among many others. The capacity to model language, an essential ability for human beings, ensures a promising future for such technologies from both an economic and a research and innovation perspective.
3. State of the art: Galician resources and technologies
In 2012, the White Paper The Galician Language in the Digital Age (García-Mateo et al., 2012) described Galician as a language with a level of technological support that “gives rise to cautious optimism”, while highlighting the need for new resources and tools. Previous research projects on Galician resulted in speech processing resources (COTOVÍA), an annotated reference corpus (CORGA), morphosyntactic lemmatizers and taggers (XIADA, FreeLing, IXA-Pipes), other specialized corpora, both text (CLUVI, CTG, TreeGal) and speech (CORILGA, AGO), MT systems (GAIO, OpenTrad), spellcheckers (OrtoGal), grammar checkers (Avalingua), language analysis and IE tools (Linguakit), language models (SemantiGal, Bertinho), and other resources.
Furthermore, Galician is currently part of multilingual crowdsourced data collection initiatives carried out by major companies in the global IT market, which have resulted in speech databases such as Google’s SLR77 (Kjartansson et al., 2020) and Mozilla’s CommonVoice 7.0 and 8.0 (Ardila et al., 2020). This situation is reflected in a recent report on the current state of the LT (Language Technology) field for Galician (Ramírez Sánchez & García Mateo, 2022), which reported considerable growth in the production of high-quality Galician resources and services, especially text resources.
Despite the quality of these resources, it should be noted that not all are freely and publicly available for the development of LT. The LT field has undergone profound changes over the last few years since the introduction of neural network systems. Generally, training models using these state-of-the-art technologies requires large quantities of data and has high energy and computational costs, which continue to be a challenge for low-resource languages. However, as many recent studies show, end-to-end technologies and open-source multilingual pre-trained models created using large quantities of data from high-resource languages (Shen et al., 2018; Baevski et al., 2020; Wolf et al., 2020) can be used, through transfer learning and fine-tuning, to train models in low- or medium-resource languages such as Catalan (Külebi & Öktem, 2018; Külebi et al., 2020) or, in our case, Galician. To this end, the existence of resources and tools that are freely available to the scientific and business community is essential, and that constitutes one of the main objectives of Proxecto Nós.
4. Project description
4.1. Organization
The tasks that are to be carried out as part of the Project can be included in the following areas, corresponding to some of the major NLP fields:
1. Speech synthesis (TTS)
2. Speech recognition (ASR)
3. Automatic text generation
4. Dialogue systems
5. Machine translation (MT)
6. Information extraction (IE)
7. Opinion mining and fact checking
8. Language correction and assessment
These broad, mutually interdependent areas fall within the three strategic lines jointly identified by the Project’s research team and the Xunta de Galicia (in particular, with the Axencia para a Modernización Tecnolóxica de Galicia): (i) spoken or written conversation with people, (ii) language quality, and (iii) information management.
In accordance with the funding agreement signed by the Xunta de Galicia and the USC, the organization of the tasks included in Nós follows a yearly schedule. Each year, resources, language models and demonstrators from different areas will be made publicly available.
More information on the organization of Proxecto Nós can be found in de-Dios-Flores et al. (2022).
4.2. Scientific and technological objectives
Proxecto Nós has two main scientific and technological objectives: (i) to integrate the Galician language into cutting-edge AI and language technologies, thus enabling the natural use of Galician in human-machine interactions; and (ii) to improve the state of the art of language technologies for Galician.
For this purpose, resources, tools, and applications will be developed and distributed under open licenses, which will allow them to be integrated into existing devices and services (such as smart speakers or conversational agents) and future technologies. To this end, specific objectives directly related to some of the major tasks of NLP have been established.
Each of these technological objectives will be executed in a different subproject, which will allow the parallel development of different tasks and, overall, a more effective organization of the work. However, a set of general objectives is shared by all the tasks. These objectives are: (i) the compilation of high-quality linguistic resources (annotated reference corpora, web-scale corpora, specialized corpora by tasks and domains, parallel corpora, knowledge bases, dictionaries, etc.); (ii) the elaboration of language and acoustic models (both general-purpose and task-specific models); and (iii) the development of applications based on these models. The project will also have a general coordination mechanism through which resources will be distributed and shared among its subprojects.
The resources and language models developed for each task will be made available to the public, thus allowing their use in all kinds of applications, services, and products, by the scientific community, companies, institutions, and society in general. The results will be disseminated through a repository available at the project’s web portal (which can be hosted on internal servers), as well as other established and internationally recognized repositories, such as HuggingFace, GitHub, Zenodo, etc.
Finally, the project contemplates the complete development of applications based on these
resources, which will act as visible and accessible demonstrators of the developed technology and will have a pull effect that leads to the development of new products.
5. Conclusion and future work
Among the initial results of Nós, we can highlight the first crawl of a web-based Galician corpus and a language model based on the CCNet tools and data (Ortega et al., 2022a), and the development and testing of a Spanish-Galician neural machine translation (NMT) system prototype (Ortega et al., 2022b).
For the current year, Proxecto Nós aims to keep generating linguistic and computational resources to explore different subprojects. Specifically, in the first half of 2022 work will be carried out on the design of a high-quality speech corpus of sufficient size to allow training state-of-the-art TTS models, to be released in the last quarter. The second half of the year will also see the publication of a speech corpus for ASR. In the same timeframe, the project will publish several text corpora: parallel Galician-Spanish, Galician-English, and Galician-Portuguese corpora; a web-scale Galician text corpus, larger than the one already compiled, to be used in all the subprojects working with written text included in Nós; and a domain-specific corpus for automatic text generation. Based on these resources, new language models will be developed using different state-of-the-art techniques, as well as demonstrators or prototypes of a TTS system, an NMT system, and an automatic text generator for Galician. At the same time, throughout 2022 efforts will focus on extending and improving the first systems developed, and on validating the results obtained via the creation of high-quality gold standards.
Acknowledgements
This research was funded by the project “Nós: Galician in the society and economy of artificial intelligence” (Proxecto Nós: O galego na sociedade e economía da intelixencia artificial 2021-CP080), agreement between Xunta de Galicia and University of Santiago de Compostela, and grant ED431G2019/04 by the Galician Ministry of Education, University and Professional Training, and the European Regional Development Fund (ERDF/FEDER program).
References
[1] I. Aldabe, G. Rehm, G. Rigau, A. Way, Report on existing strategic documents and projects in LT/AI, European Language Equality (ELE), 2021.
[2] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in: Proceedings of LREC 2020.
[3] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, arXiv, 2020, pp. 1–19. doi: 10.48550/arXiv.2006.11477
[4] I. de-Dios-Flores, C. Magariños, A. I. Vladu, J. E. Ortega, J. R. Pichel, M. García, P. Gamallo, E. Fernández Rei, A. Bugarín-Diz, M. González González, S. Barro, X. L. Regueira, The Nós Project: Opening routes for the Galician language in the field of language technologies, in: Proceedings of the TDLE Workshop @LREC2022, Marseille, 20 June 2022, pp. 52–61.
[5] C. García Mateo, M. Arza Rodríguez (auth.), G. Rehm, H. Uszkoreit (eds.), The Galician Language in the Digital Age, Springer-Verlag, Berlin Heidelberg, 2012.
[6] B. Külebi, A. Öktem, Building an Open Source Automatic Speech Recognition System for Catalan, in: IberSPEECH, Barcelona, Spain, 2018, pp. 25–29.
[7] B. Külebi, A. Öktem, A. Peiró-Lilja, S. Pascual, M. Farrús, CATOTRON - A Neural Text-To-Speech System in Catalan, in: Proceedings of Interspeech 2020.
[8] O. Kjartansson, A. Gutkin, A. Butryna, I. Demirsahin, C. Rivera, Open-Source High Quality Speech Datasets for Basque, Catalan and Galician, in: Proceedings of the 1st Joint Workshop on SLTU and CCURL, Marseille, France, 2020, pp. 21–27.
[9] J. E. Ortega, I. de Dios Flores, P. Gamallo, J. R. Pichel, A Neural Machine Translation System for Spanish to Galician through Portuguese Transliteration, in: PROPOR 2022, Fortaleza, Brazil.
[10] J. E. Ortega, I. de Dios Flores, J. R. Pichel, P. Gamallo, Revisiting CCNet for Quality Measurements in Galician, in: PROPOR 2022, Fortaleza, Brazil.
[11] J. M. Ramírez Sánchez, C. García Mateo (auth.), M. Giagkou, S. Piperidis, G. Rehm, J. Dunne (eds.), Report on the Galician Language (Deliverable D1.15), ELE, 2022.
[12] J. Shen, R. Pang, R. J. Weiss, M. Schuster,
N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y.
Wang, R. J. Skerry-Ryan, R. A. Saurous, Y.
Agiomyrgiannakis, Y. Wu, Natural TTS
Synthesis By Conditioning Wavenet On Mel
Spectrogram Predictions, in: Proceedings of
ICASSP, 2018.
[13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C.
Delangue, A. Moi, P. Cistac, T. Rault, R.
Louf, M. Funtowicz, J. Davison, S. Shleifer,
P. von Platen, C. Ma, Y. Jernite, J. Plu, C.
Xu, T. Le Scao, S. Gugger, et al.,
Transformers: State-of-the-Art Natural
Language Processing, in: Proceedings of the
2020 Conference on Empirical Methods in
NLP: System Demonstrations, 2020, pp. 38–
45.