<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Tokens to Trees: Mapping Syntactic Structures in the Deserts of Data-Scarce Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Vilares</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Muñoz-Ortiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade da Coruña, CITIC, Departamento de Ciencias de la Computación y Tecnologías de la Información</institution>
          ,
          <addr-line>Campus de Elviña s/n, 15071, A Coruña</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Low-resource learning in natural language processing focuses on developing effective resources, tools, and technologies for languages that are less popular within industry and academia. This effort is crucial for several reasons, including ensuring that as many languages as possible are represented digitally, and enhancing access to language technologies for native speakers of minority languages. In this context, this paper outlines the motivation, research lines, and results from a Leonardo Grant - awarded by the FBBVA - on low-resource languages and parsing as sequence labeling. The project's primary aim was to devise fast and accurate methods for low-resource syntactic parsing and to examine evaluation strategies, as well as strengths and weaknesses in comparison to alternative parsing strategies.</p>
      </abstract>
      <kwd-group>
<kwd>low-resource learning</kwd>
        <kwd>natural language processing</kwd>
        <kwd>parsing</kwd>
        <kwd>cross-lingual learning</kwd>
        <kwd>multilinguality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>SEPLN-CEDI-PD 2024: Seminar of the Spanish Society for Natural Language Processing: Projects and Systems Demonstrations, June 19-20, 2024, A Coruña, Spain.</p>
      <p>david.vilares@udc.es (D. Vilares); alberto.munoz.ortiz@udc.es (A. Muñoz-Ortiz)</p>
      <p>https://www.grupolys.org/~david.vilares/ (D. Vilares); https://amunozo.github.io/ (A. Muñoz-Ortiz)</p>
      <p>ORCID: 0000-0002-1295-3840 (D. Vilares); 0000-0001-9608-2730 (A. Muñoz-Ortiz)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
<p>1 https://www.redleonardo.es/</p>
      <p>[...] monitoring social networks but considering only Indo-European languages), among others.</p>
      <p>The problem. While syntactic parsers excel with high-resource languages, they encounter significant challenges with low-resource ones. The ability to analyze sentence structure is crucial for NLP tools, including the development of applications like automatic translation, question answering, and text summarization. In other cases, the desired output is the structure itself, as is often the case for computational linguists (for instance, because they want to study languages) or when the final output is a tree or graph that aids in understanding the meaning of the utterance (e.g., relationships between symptoms, diseases, and cures in clinical reports).</p>
      <p>The approach. From a linguistic point of view, the 7,000 languages spoken in the world are organized into about 140 families. For example, Spanish, French, Galician, and Catalan are all Indo-European languages, while Turkish, Uzbek, Kazakh, and Uyghur are Turkic languages. Moreover, many of these resource-scarce languages are closely related to another language with a multitude of speakers and resources available (e.g., Galician-Spanish or Uyghur-Turkish), sharing not only linguistic typology (e.g., word order or vocabulary formation) but also syntactic structures. Just as it is easier for a person to create grammatical sentences in a new language if they already know another language with similar characteristics (e.g., for a Spanish speaker, Galician would be easier than Uyghur, and the opposite would be true for a Turkish speaker), in NLP it is also a common approach to exploit related languages, especially by using rich-resource languages to help model less-resourced ones. This is an angle that we considered throughout the project to model the syntactic structure of low-resource languages. In addition, recent studies in cognitive science suggest that humans might use the same brain regions for lexical, syntactic, and semantic processing of sentences, and that this processing is carried out according to a sequence-labeling-like process [<xref ref-type="bibr" rid="ref2">2</xref>]. The underlying idea is that the brain processes sentences as a flat sequence, whose representation is dynamically updated without the need for creating complex hierarchical abstractions of the sentence to represent its syntactic structure. Recent studies have shown that it is possible to emulate this behavior in NLP using deep learning techniques and sequence labeling models, with the great added advantage of their speed, making their use in real environments possible, unlike other syntactic analysis paradigms. However, there was little research on sequence labeling models for low-resource languages and the challenges that building them poses. This was the gap that this project aimed to fill.</p>
      <p>The evaluation. Throughout the project, we emphasized the importance of evaluating a wide variety of languages, encompassing diverse linguistic families, typologies, and alphabets. This strategy was adopted to ensure our results were more robust and generalizable. To do so, we mostly relied on the Universal Dependencies [<xref ref-type="bibr" rid="ref3">3</xref>], a collection of treebanks2, which contains syntactic annotations for more than 100 languages from different language families and alphabets.</p>
      <p>The novelty. From a technical standpoint, this project was both original and innovative, as it combined artificial intelligence and natural language processing with recent cognitive theories on how humans comprehend language structure. The approach aimed to develop new NLP models capable of swiftly and accurately obtaining the syntactic structure of sentences written in languages with a scarcity of resources. In this regard, research on languages with limited resources is recognized by the international NLP community as one of the major unsolved challenges. Several authors have made significant contributions in recent years in areas such as machine translation [<xref ref-type="bibr" rid="ref4">4</xref>], morphological analysis [<xref ref-type="bibr" rid="ref5">5</xref>], and syntactic analysis [6]. Thematically, the project addressed various concerns of contemporary society, including the development of technologies that contribute to the preservation of knowledge expressed in different languages and ensuring democratic access to artificial intelligence technologies.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The project explored three lines of work. The first focused on data collection for experiments, including training initial sequence labeling baselines, and it examined the impact of annotated data volume on model quality. Furthermore, it set up baseline models based on traditional dependency parsing paradigms, using both graph-based and transition-based strategies. This aimed to better understand the models and to compare our results with these typically slower, but more accurate, strategies. The second line of work concentrated on leveraging distant and auxiliary data to enhance the performance of the baseline models and to comprehend how neural networks perceive the structure of languages. The third line of work explored data augmentation methods for low-resource languages and dependency parsers. The second and third lines of work were partially dependent on the first one, but could be developed independently from each other. We now briefly summarize them before moving on to the project results.</p>
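<p>To make the parsing-as-sequence-labeling idea concrete, the following is a minimal, self-contained sketch (not the project's actual implementation): a toy dependency tree is linearized into one discrete label per word, here using relative head offsets, one of several possible encodings.</p>

```python
# Minimal sketch: dependency parsing cast as sequence labeling.
# Each word receives one label; here the label is the relative
# offset to its head plus the dependency relation (one common
# linearization; the project explored several).

def encode_relative(heads, rels):
    """heads[i] is the 1-based head index of word i+1 (0 = root)."""
    labels = []
    for i, (h, r) in enumerate(zip(heads, rels), start=1):
        offset = h - i  # relative position of the head
        labels.append(f"{offset:+d}@{r}")
    return labels

def decode_relative(labels):
    heads, rels = [], []
    for i, lab in enumerate(labels, start=1):
        off, rel = lab.split("@")
        heads.append(i + int(off))
        rels.append(rel)
    return heads, rels

# Toy tree for "The cat sleeps": The -> cat, cat -> sleeps, sleeps = root.
heads = [2, 3, 0]
rels = ["det", "nsubj", "root"]
labels = encode_relative(heads, rels)
print(labels)  # ['+1@det', '+1@nsubj', '-3@root']
assert decode_relative(labels) == (heads, rels)
```

<p>Decoding simply inverts the offset arithmetic, which is what lets these parsers run at the speed of a tagger.</p>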
<sec id="sec-1-1">
        <p>Research line 1 - Compilation, analysis of syntactic typology, creation of baseline models, and impact of annotated data. This line focused on: (1) collecting representative data, (2) training the initial models, and (3) exploring the impact of the amount of annotated data on sequence labeling models, depending on the chosen parsing linearization. Specifically:</p>
        <p>1. The first goal was to identify treebanks for numerous languages in collections such as the Universal Dependencies [<xref ref-type="bibr" rid="ref3">3</xref>], pinpointing both low-resource and rich-resource languages of interest for the project. The focus was on identifying languages that share substantial syntactic proximity, evaluated according to various linguistic criteria including alphabet, word order, language family, or typology, among others. To achieve this, the approach involved using automated techniques to estimate such proximity, leveraging publicly available resources like the World Atlas of Language Structures [7] and URIEL [8]. Among the treebanks studied during the project, we included several rich-resource languages - such as English, German, Portuguese, Russian, Classical Chinese, Korean, and Japanese - and low-resource languages - such as Galician, Basque, Telugu, Marathi, Lithuanian, Faroese, Afrikaans, and Wolof.</p>
        <p>2. The second goal was to develop, train, and assess base syntactic models across the chosen languages. The first step involved training sequence labeling models for both low-resource and rich-resource languages separately. This step was crucial for gathering preliminary experimental results and for having a baseline framework against which to evaluate models in the next phases. Additionally, this step was useful for preparing the high-resource models aimed at transferring syntactic knowledge in later stages of the project, for instance through zero-shot and few-shot setups.</p>
        <p>3. The third goal of this line was to examine the performance of different linearizations for sequence labeling parsing on low-resource languages. At the project's outset, various linearizations of dependency trees were available for training sequence labeling models, i.e., different strategies to create a sequence of labels that could be decoded into a dependency tree, and some others were created during the project.3 However, it was unclear whether some linearizations could be used more effectively with the same data volume. To study this, we trained sequence labeling parsers on various languages to determine whether such linearizations were equally data-hungry or not, and whether rich-resource and low-resource languages showed similar patterns.</p>
        <p>2 This is usually the name given to a dataset with syntactic annotations.</p>
        <p>3 For the details about the tested linearizations, we recommend reading [<xref ref-type="bibr" rid="ref1">1, 9, 10</xref>].</p>
      </sec>
      <sec id="sec-1-2">
        <p>Research line 2 - Auxiliary data and use of pre-trained models. This line focused on the use of distant learning, such as reliance on parsers first trained for rich-resource languages, encoders pre-trained on masked language modeling, and auxiliary data, such as part-of-speech tags, and examined their impact on the performance of sequence labeling parsers for low-resource languages and domains:</p>
      </sec>
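<p>Throughout these research lines, treebanks are consumed in the CoNLL-U format used by Universal Dependencies. The following is an illustrative, minimal reader (toy sentence; only the columns a sequence labeling parser needs), not a full CoNLL-U implementation:</p>

```python
# Minimal sketch of reading a Universal Dependencies CoNLL-U file,
# keeping the columns a sequence labeling parser needs
# (FORM, UPOS, HEAD, DEPREL). The sample sentence is a toy example.

def read_conllu(text):
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):  # sentence-level comments
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
            continue
        tokens.append({"form": cols[1], "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    if tokens:
        sentences.append(tokens)
    return sentences

sample = """# text = O gato dorme
1\tO\to\tDET\t_\t_\t2\tdet\t_\t_
2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_
"""
sents = read_conllu(sample)
print(len(sents), [t["form"] for t in sents[0]])  # 1 ['O', 'gato', 'dorme']
```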
<sec id="sec-1-3">
        <p>1. The first goal involved using sequence labeling models first trained on rich-resource languages. These models were then fine-tuned in a second phase on low-resource languages. We applied this strategy in both zero-shot and few-shot setups. The zero-shot setup operates under the assumption that there is no available data for the low-resource language. However, we expect that a related rich-resource language can still help obtain meaningful outputs for the low-resource languages. The few-shot setup, on the other hand, assumes that some data is available. This data is used to continue fine-tuning the model initially pre-trained on the rich-resource language. Alternatively, under the few-shot setup, this phase also involved training the model in a single phase by merging low-resource training data with data from a related rich-resource language.</p>
        <p>2. The second goal aimed to use related or distant tasks that provide useful information about the syntactic structure of the languages, to assess their impact on sequence labeling models for low-resource languages. On one hand, the first task involved leveraging morphological information for sequence labeling parsers in both low- and rich-resource languages. On the other hand, we explored the use of language models as encoders for sequence labeling tasks. This involved directly mapping vector representations into a sequence of labels to reconstruct the tree, and analyzing performance on data-scarce languages. The hypothesis was that during the pre-training phase, the language model would learn to encode useful information about the syntactic structure of seen languages in its latent representational space.</p>
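<p>The single-phase few-shot setup described in the first goal can be sketched as follows; the data items, the oversampling factor, and the helper name are illustrative choices, not the project's exact recipe:</p>

```python
# Sketch of the single-phase few-shot setup: merge the small
# low-resource treebank with data from a related rich-resource
# language, oversampling the former so it is not drowned out.
import random

def merge_training_data(low_res, rich_res, oversample=4, seed=0):
    merged = list(rich_res) + list(low_res) * oversample
    random.Random(seed).shuffle(merged)  # fixed seed for reproducibility
    return merged

galician = ["sent_gl_1", "sent_gl_2"]          # scarce target data
spanish = [f"sent_es_{i}" for i in range(20)]  # related rich-resource data
train = merge_training_data(galician, spanish)
print(len(train))  # 28
```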
<p>Research line 3 - Data augmentation techniques for low-resource dependency parsing. This research line explored methods for generating synthetic data to train dependency parsers for languages that suffer from a scarcity of resources. Initially, we considered various strategies, including techniques such as cropping and rotating, as well as semi-automatically annotating sentences. Finally, we focused our efforts on adapting syntactic resources annotated in a rich-resource language to a low-resource language, treating the task as a word-level translation problem that takes into account morphological information to maintain annotations across languages. We found this strategy adequate for the purpose of the project, as it offers explicit properties that should facilitate the transfer of language structure from resource-rich languages to related, less-resourced ones.</p>
      </sec>
      <sec id="sec-4">
        <title>4. Results</title>
<p>Linearizations for parsing as sequence labeling. In [9] we proposed a new family of sequence labeling encodings based on brackets. In short, these encodings use a special kind of shorthand - a series of symbols like brackets and slashes - to describe which words are connected and how. This type of linearization is particularly well-suited for certain low-resource languages, such as Ancient Greek, and for languages with high non-projectivity, which characterizes languages with relatively free word order. In [10] we proposed a set of novel linearizations derived from existing transition-based algorithms. The code is available at https://github.com/mstrise/dep2label-bert, and it supports large language models such as BERT as encoders, to exploit the structure of languages learned during their pre-training phase.</p>
        <p>Cross-lingual Inflection as a Data Augmentation Method for Parsing [13]. This paper introduced a technique for creating 'synthetic creole' treebanks, termed x-inflected treebanks, through cross-lingual morphological inflection. This process required a source dependency treebank from a closely related language, equipped with lemmas and morphological features, alongside a morphological inflection system tailored for the target language. To create the morphological inflectors, we relied on UniMorph [14]. Our aim with this approach was to produce x-inflected treebanks that mimicked the target language to a certain degree. For greater clarity, Figure 1 depicts an example from our paper summarizing the high-level process of our method. The objective was to enhance parser performance for languages that had scarce or no annotated data, by leveraging an accurately trained morphological inflection system. This system was then applied to a related rich-resource treebank to approximate the linguistic characteristics of the target low-resource language. The code was made available at: https://github.com/amunozo/x-inflection.</p>
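<p>The x-inflection idea can be sketched at a high level as follows; the lookup-table inflector is a hypothetical stand-in for a UniMorph-trained inflection system, and the word forms are toy examples:</p>

```python
# Toy sketch of x-inflection: re-inflect the words of a related
# source-language treebank with a target-language morphological
# inflector, keeping heads and dependency relations untouched.

target_inflector = {  # (lemma, morph features) -> target-language form
    ("gato", "N;PL"): "gatos_t",
    ("dormir", "V;3;SG"): "dorme_t",
}

def x_inflect(treebank):
    out = []
    for sent in treebank:
        new_sent = []
        for tok in sent:
            form = target_inflector.get((tok["lemma"], tok["feats"]), tok["form"])
            new_sent.append(dict(tok, form=form))  # syntax is preserved
        out.append(new_sent)
    return out

source = [[
    {"form": "gatos", "lemma": "gato", "feats": "N;PL", "head": 2, "deprel": "nsubj"},
    {"form": "duerme", "lemma": "dormir", "feats": "V;3;SG", "head": 0, "deprel": "root"},
]]
print([t["form"] for t in x_inflect(source)[0]])  # ['gatos_t', 'dorme_t']
```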
<p>Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing [11]. The paper summarized the main outcomes from our research line 1. It focused on the effectiveness of various sequence labeling encodings for dependency parsing, particularly in the context of low-resource languages. It compared the performance of different encodings - head selection, relative position, bracketing, and mapping from transition-based subsequences - under the constraints of limited training data. The findings suggest that while head-selection encodings may perform better in data-rich environments, bracketing encodings show greater promise in low-resource settings. This insight is crucial for developing more effective parsing strategies for languages with scarce computational resources. The study highlighted the complex connection between how information is encoded and the availability of resources.</p>
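<p>The data-hunger comparison boils down to training the same parser on growing fractions of a treebank and recording accuracy per encoding. A schematic sketch (the evaluation function is a placeholder, not a real parser):</p>

```python
# Sketch of a learning-curve experiment: train with increasing
# fractions of the training data and record a score per point.
import math

def learning_curve(train_sents, fractions, train_and_eval):
    points = []
    for frac in fractions:
        k = max(1, int(len(train_sents) * frac))
        points.append((frac, train_and_eval(train_sents[:k])))
    return points

# Placeholder metric: pretend accuracy grows with the log of data size.
fake_eval = lambda data: round(50 + 10 * math.log10(len(data) + 1), 1)
sents = list(range(1000))  # stand-ins for annotated sentences
print(learning_curve(sents, [0.1, 0.5, 1.0], fake_eval))
```

<p>Comparing such curves across encodings is what reveals which linearizations remain usable when only a few hundred annotated sentences exist.</p>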
<p>The Fragility of Multi-Treebank Parsing Evaluation [15]. This paper examined the impact of treebank selection on parser performance evaluations, drawing on insights and evaluation issues that we observed during the development of the project. It specifically demonstrated how parser rankings, in terms of performance, could vary significantly across different treebank subsets, challenging the reliability of evaluations based on a single subset. The results from several experiments emphasized the need for meticulous treebank selection to ensure robust, comprehensive, and unbiased evaluations. The study also highlighted the challenges in formulating selection guidelines and cautioned against strategies that might lead to weak conclusions. Interestingly, it revealed that the disparity in effectiveness between sequence labeling parsers and traditional parsers was considerably smaller for languages with fewer resources compared to rich-resource ones.</p>
        <p>Parsing linearizations appreciate PoS tags - but some are fussy about errors [12]. This paper summarized some of the findings that resulted from our second research line of work. In particular, it investigated the role of Part-of-Speech (PoS) tags in sequence labeling parsing in low-resource settings. It highlighted that even low-accuracy PoS taggers can enhance parsing performance, especially when more PoS tag than dependency tree annotations are available. This study is significant in computational linguistics, offering insights into the nuanced relationship between encoding strategies and resource availability. It underscored the varying utility of PoS tags for sequence labeling models (as well as for other parsing paradigms) and emphasized the encoding-dependent impact of PoS tagging accuracy. The research also explored how controlling PoS tag accuracy can influence parsing outcomes, providing valuable guidance for future work on parsing models for under-represented languages. The code was made available at: https://www.grupolys.org/software/aacl2022/.</p>
        <p>Another Dead End for Morphological Tags? Perturbed Inputs and Parsing [16]. This paper focused on a low-resource domain: how to perform effective parsing when the input text is highly corrupted with many lexical errors, which could be due to natural causes or adversarial attacks. These attacks could involve removing a character, adding a character, replacing a character, or swapping two adjacent symbols. In our study, linguistically diverse but for now restricted to languages using the Latin alphabet, we looked at 14 different sets of language data and found some interesting results. When we tested under such types of corrupted inputs, adding morphological information (such as universal and language-specific part-of-speech tags, and very detailed morphological features) actually, and counterintuitively, made the performance of traditional parsing models decline faster. However, for sequence labeling parsers, like the ones proposed in our project, adding this kind of information was beneficial. The code to replicate the experiments and create adversarial attacks was made available at: https://github.com/amunozo/parsing_perturbations.</p>
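<p>The four perturbation types studied in [16] can be expressed as simple character-level operations; this sketch is illustrative and not the paper's exact attack generator:</p>

```python
# The four lexical perturbations (remove, add, replace, swap) as
# character-level operations on a single word.
import random

def perturb(word, op, rng):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)  # position to perturb
    if op == "remove":
        return word[:i] + word[i + 1:]
    if op == "add":
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    if op == "replace":
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word

rng = random.Random(0)
print([perturb("parsing", op, rng) for op in ("remove", "add", "replace", "swap")])
```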
<p>Assessment of Pre-Trained Models Across Languages and Grammars [17]. In this paper, we built upon our initial ideas from our second line of research to introduce the first comprehensive framework that spans multiple paradigms and languages, aimed at recovering syntactic structures, including both dependency and constituent types, as learned by language models. This method serves as a proxy to estimate the extent of syntactic structure encoded by these models for various languages, which is of interest for both rich-resource and low-resource languages. To achieve this, we first carefully selected a diverse array of language models, differing in their scale, language pretraining objectives, and token representation formats. Then, to extract dependency and constituent structures directly from them, we used existing sequence labeling encodings for tree parsing. By adding just a linear layer on top of these encoders, we transformed continuous vector representations into discrete labels. The results showed that, for languages included in the pretraining data, sequence labeling models can be trained much more effectively, with the amount of available fine-tuning data not being a primary factor. The code for this research was made available at https://github.com/amunozo/multilingual-assessment.</p>
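<p>The probing setup of [17] reduces to a single linear layer over frozen representations, trained to emit discrete parsing labels. A toy sketch with hand-made two-dimensional "embeddings" and a plain perceptron update (the real framework uses pre-trained language model vectors, not these stand-ins):</p>

```python
# Sketch of a linear probe: a single linear layer over frozen word
# vectors, mapped to discrete parsing labels via perceptron updates.

def train_probe(examples, labels_set, dim, epochs=10):
    w = {lab: [0.0] * dim for lab in labels_set}  # one weight row per label
    for _ in range(epochs):
        for vec, gold in examples:
            pred = max(w, key=lambda lab: sum(a * b for a, b in zip(w[lab], vec)))
            if pred != gold:  # classic perceptron update on mistakes
                for j in range(dim):
                    w[gold][j] += vec[j]
                    w[pred][j] -= vec[j]
    return w

def predict(w, vec):
    return max(w, key=lambda lab: sum(a * b for a, b in zip(w[lab], vec)))

# Toy frozen representations: 2-d vectors, two linearized-tree labels.
data = [([1.0, 0.0], "+1@det"), ([0.0, 1.0], "-1@root")]
w = train_probe(data, {"+1@det", "-1@root"}, dim=2)
print([predict(w, v) for v, _ in data])  # ['+1@det', '-1@root']
```

<p>Because the encoder is frozen, any accuracy the probe reaches must come from structure already present in the latent space, which is exactly what the framework measures.</p>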
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
<sec id="sec-2-1">
        <p>This project was supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the FBBVA.4</p>
        <p>4 FBBVA accepts no responsibility for the opinions, statements and contents included in the project and/or the results thereof, which are entirely the responsibility of the authors.</p>
<p>[6] L. Duong, T. Cohn, S. Bird, P. Cook, Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 845–850.</p>
        <p>[7] M. Haspelmath, The typological database of the world atlas of language structures, The Use of Databases in Cross-Linguistic Studies 41 (2009) 283.</p>
        <p>[8] P. Littell, D. R. Mortensen, K. Lin, K. Kairis, C. Turner, L. Levin, URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 8–14.</p>
        <p>[9] M. Strzyz, D. Vilares, C. Gómez-Rodríguez, Bracketing encodings for 2-planar dependency parsing, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 2472–2484. URL: https://aclanthology.org/2020.coling-main.223. doi:10.18653/v1/2020.coling-main.223.</p>
        <p>[10] C. Gómez-Rodríguez, M. Strzyz, D. Vilares, A unifying theory of transition-based and sequence labeling parsing, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 3776–3793. URL: https://aclanthology.org/2020.coling-main.336. doi:10.18653/v1/2020.coling-main.336.</p>
        <p>[11] A. Muñoz-Ortiz, M. Strzyz, D. Vilares, Not all linearizations are equally data-hungry in sequence labeling parsing, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), INCOMA Ltd., Held Online, 2021, pp. 978–988. URL: https://aclanthology.org/2021.ranlp-1.111.</p>
        <p>[12] A. Muñoz-Ortiz, M. Anderson, D. Vilares, C. Gómez-Rodríguez, Parsing linearizations appreciate PoS tags - but some are fussy about errors, in: Y. He, H. Ji, S. Li, Y. Liu, C.-H. Chang (Eds.), Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online only, 2022, pp. 117–127. URL: https://aclanthology.org/2022.aacl-short.16.</p>
        <p>[13] A. Muñoz-Ortiz, C. Gómez-Rodríguez, D. Vilares, Cross-lingual inflection as a data augmentation method for parsing, in: S. Tafreshi, J. Sedoc, A. Rogers, A. Drozd, A. Rumshisky, A. Akula (Eds.), Proceedings of the Third Workshop on Insights from Negative Results in NLP, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 54–61. URL: https://aclanthology.org/2022.insights-1.7. doi:10.18653/v1/2022.insights-1.7.</p>
        <p>[14] A. D. McCarthy, C. Kirov, M. Grella, A. Nidhi, P. Xia, K. Gorman, E. Vylomova, S. J. Mielke, G. Nicolai, M. Silfverberg, T. Arkhangelskiy, N. Krizhanovsky, A. Krizhanovsky, E. Klyachko, A. Sorokin, J. Mansfield, V. Ernštreits, Y. Pinter, C. L. Jacobs, R. Cotterell, M. Hulden, D. Yarowsky, UniMorph 3.0: Universal Morphology, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 3922–3931. URL: https://aclanthology.org/2020.lrec-1.483.</p>
        <p>[15] I. Alonso-Alonso, D. Vilares, C. Gómez-Rodríguez, The fragility of multi-treebank parsing evaluation, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 5345–5359. URL: https://aclanthology.org/2022.coling-1.475.</p>
        <p>[16] A. Muñoz-Ortiz, D. Vilares, Another dead end for morphological tags? Perturbed inputs and parsing, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7301–7310. URL: https://aclanthology.org/2023.findings-acl.459. doi:10.18653/v1/2023.findings-acl.459.</p>
        <p>[17] A. Muñoz-Ortiz, D. Vilares, C. Gómez-Rodríguez, Assessment of pre-trained models across languages and grammars, in: J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Nusa Dua, Bali, 2023, pp. 359–373. URL: https://aclanthology.org/2023.ijcnlp-main.23.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Strzyz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gómez-Rodríguez</surname>
          </string-name>
          ,
          <article-title>Viable dependency parsing as sequence labeling</article-title>
, in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Solorio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>717</fpage>
          -
          <lpage>723</lpage>
. URL: https://aclanthology.org/N19-1077. doi:10.18653/v1/N19-1077.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Christiansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chater</surname>
          </string-name>
          ,
          <article-title>The now-or-never bottleneck: A fundamental constraint on language</article-title>
          ,
<source>Behavioral and Brain Sciences 39</source>
          (
          <year>2016</year>
          )
          <fpage>e62</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3]
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeman</surname>
          </string-name>
          , Universal Dependencies,
          <source>Computational Linguistics</source>
          <volume>47</volume>
          (
          <year>2021</year>
          )
          <fpage>255</fpage>
          -
          <lpage>308</lpage>
. URL: https://aclanthology.org/2021.cl-2.11. doi:10.1162/coli_a_00402.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yuret</surname>
          </string-name>
          , J. May,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <article-title>Transfer learning for low-resource neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1568</fpage>
          -
          <lpage>1575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          , Ž. Agić,
          <article-title>Distant supervision from disparate sources for low-resource part-of-speech tagging</article-title>
          ,
<source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>614</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>