Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation

Laura Occhipinti
University of Bologna, Italy

Abstract
Morphological analysis is essential for various Natural Language Processing (NLP) tasks, as it reveals the internal structure of words and deepens our understanding of their morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the limited representation of detailed morphological information in existing corpora. Using an automatic segmentation tool, we extract quantitative morphological parameters to investigate their impact on the perception of word complexity by native Italian speakers. Through correlation analysis, we demonstrate that morphological features, such as the number of morphemes and lexical morpheme frequency, significantly influence how complex words are perceived. These insights contribute to improving automatic lexical complexity prediction models and offer a deeper understanding of the role of morphology in word comprehension.

Keywords
Morphological segmentation, Lexical complexity prediction, Italian language

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
laura.occhipinti3@unibo.it (L. Occhipinti)
ORCID: 0009-0007-8799-4333 (L. Occhipinti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Morphological analysis is crucial for various NLP tasks, as it provides insights into the internal structures of words and helps us better understand the morphological and syntactic relationships between words [1]. The Italian language, with its rich morphology and extensive use of inflection and derivation, presents unique challenges and opportunities for morphological segmentation.

Automatic segmentation, a key component of morphology learning, involves dividing word forms into meaningful units such as roots, prefixes, and suffixes [2]. This task falls under the broader category of subword segmentation [3] but is distinct due to its linguistic motivation. Computational approaches typically identify subwords based on purely statistical considerations, which often results in subunits that do not correspond to recognizable linguistic units [4, 5, 6, 7]. Making this task more morphologically oriented could enable models to generalize better to new words or forms, as basic roots or morphemes are often shared among words, and it could also facilitate the interpretation of model results.

When discussing morphological segmentation, we can refer to two types: (1) surface segmentation, which involves dividing words into morphs, the surface forms of morphemes; (2) canonical segmentation, which involves dividing words into morphemes and reducing them to their standard forms [8]. For instance, consider the Italian word mangiavano (they were eating). The resulting surface segmentation would be mangi- + -avano, where mangi- is a morph derived from the root of the verb mangiare, and -avano is the suffix indicating the third person plural of the imperfect tense. In contrast, the canonical segmentation would yield mangiare + -avano, with mangiare as the canonical morpheme and -avano as the suffix¹.

¹ It is important to note that the segmentation process is not always straightforward, as it involves various linguistic criteria that may not be immediately clear. For example, one of the challenges lies in deciding whether to detach or retain the thematic vowel (a vowel that appears between the root and the inflectional suffix, especially in Romance languages). In the case of mangiavano, the thematic vowel -a- could either be considered part of the root or treated as a separate morph. Similarly, other segmentation criteria might involve distinctions between compound forms, derivational affixes, or fused morphemes that do not have clear boundaries. As a result, the segmentation criteria can vary based on linguistic theory, the specific task (e.g., computational vs. linguistic analysis), or even the intended application of the segmentation (e.g., for syntactic parsing or machine learning).

In this study, we focus on surface morphological segmentation for the Italian language. Morphological features are often not adequately represented in available corpora for this language, or they refer exclusively to morphosyntactic information, such as the grammatical category of words and a macro-level descriptive analysis mainly related to inflection. Information about the internal structure of words, such as derivation or composition, is often lacking.

The primary objective of this work is to use an automatic segmenter to extract a series of quantitative morphological parameters. We believe that our approach does not require the detailed analysis provided by canonical segmentation, which could entail longer processing times.

In addition to examining classic parameters reported in the literature that influence complexity [9], such as word frequency, length, and number of syllables, we aim to explore how morphological features integrate with these factors to affect word complexity perception. Specifically, we seek to understand how the internal structure of words contributes to the cognitive load that speakers experience when processing more complex lexical items. Our premise is that words with more morphemes are more complex because they contain more information to decode [10]. For example, consider the word infelicità (unhappiness). To decode it, one must know the word felice (happy), from which it is derived, as well as the prefix in-, which negates the quality expressed by the base term, and the suffix -ità, which transforms the adjective into an abstract noun. Therefore, to fully understand the meaning of infelicità, the reader or listener must be able to correctly recognize and interpret each of these morphemes and their contribution to the overall meaning of the word.

The main contributions of this work are: (1) providing a tool capable of automatically segmenting words into linguistically motivated base forms; (2) presenting the dataset constructed for training our model; (3) evaluating the impact of different linguistic features on speakers' perception of word complexity, with a particular focus on morphological features.

2. Related Works

The study of morphological segmentation has evolved from classical linguistics to advanced machine learning techniques [11, 12]. The main approaches include lexicon-based and boundary-detection-based methods [2]. Lexicon-based methods rely on a comprehensive database of known morphemes [13, 14, 15], while boundary-detection methods identify transition points between morphemes using statistical or machine learning techniques [16, 17, 18].

Another significant distinction is between generative models and discriminative models. Generative models, suited for unsupervised learning, generate word forms and segmentations from raw data [19, 20, 21]. In contrast, discriminative models, which require annotated data, predict segmentations based on learned relationships from labeled examples [22, 23].

Unsupervised methods do not require labeled data, making them attractive for leveraging vast amounts of raw data. They trace back to Harris (1955), who used statistical methods to identify morphological segments. Notable systems include Linguistica [24, 25] and Morfessor [26, 27], which employ the Minimum Description Length (MDL) principle to identify regularities within data. Despite their utility, unsupervised methods often suffer from oversegmentation and incorrect segmentation of affixes [19, 28]. These challenges arise due to the complex interplay of phonological, morphological, and semantic factors in natural languages.

Semi-supervised methods leverage both annotated and unannotated data, enhancing model performance with minimal manual annotation [29]. These methods are effective in scenarios with limited labeled data [30, 31], using initial labeled datasets to hypothesize and validate patterns across larger unlabeled corpora [32]. While beneficial, semi-supervised methods depend on the quality of initial labeled datasets and may struggle with languages exhibiting extensive morphological diversity [2].

Supervised methods, relying on annotated datasets, typically achieve higher accuracy due to learning from explicitly labeled examples. Techniques include neural networks, Hidden Markov Models (HMM), and Convolutional Neural Networks (CNNs) [33, 34, 35, 23]. Despite their high performance, supervised methods are limited by the need for extensive annotated corpora, which can be costly and time-consuming to create.

Given access to a large annotated dataset for the Italian language, on which we made semi-manual corrections, our study primarily adopts a supervised approach.

2.1. Resources available for the Italian language

Several computational resources and tools have been developed to manage Italian morphological information [36, 37, 38, 39, 40, 41]. These resources are essential for improving the accuracy of text processing and supporting advanced linguistic research. However, many of them focus primarily on morphological analysis, without providing detailed support for morphological segmentation, which limits their usefulness in tasks that require fine-grained word structure analysis. Even those tools that offer segmentation often approach it with different methods and objectives than ours.

Morph-it! [37] is an open-source lexicon that contains 504,906 entries and 34,968 unique lemmas, each annotated with morphological characteristics that link inflected word forms to their lemmas. While valuable for lemmatization and morphological analysis, it is not suited for morphological segmentation, as it primarily focuses on inflected forms rather than decomposing words into their individual morphemes.

MorphoPro [39] is part of the TextPro suite and is designed for morphological analysis of both English and Italian. It uses a declarative knowledge base converted into a Finite State Automaton (FSA) for detailed morphological analysis. However, MorphoPro's output is geared towards global morphological analysis and lacks support for internal word segmentation into morphemes, limiting its applicability for more granular tasks.

MAGIC [36] provides a lexicon of approximately 100,000 lemmas and performs detailed morphological and morphosyntactic analysis. However, similar to other resources, MAGIC does not focus on morphological segmentation. Instead, it provides morphological and syntactic information about word forms, making it more useful for general morphological analysis rather than segmenting words into individual morphemes.

Getarun [38] offers a lexicon of around 80,000 roots and provides sophisticated morphosyntactic analysis. However, like MAGIC, it is designed primarily for syntactic parsing and lacks functionality for detailed morphological segmentation, focusing instead on morphological and syntactic relationships.

DerIvaTario [41] is another resource that provides significant support for morphological segmentation, particularly in the context of derivational morphology. It offers detailed information on derivational patterns in Italian, mapping out how words are formed through derivational processes, which is especially useful for studying word formation in a structured manner. However, DerIvaTario focuses primarily on canonical segmentations and does not always recognize smaller morphemes, such as final morphemes. This limitation means it may miss finer-grained morphological elements, making it more suitable for analyzing larger, derivational units rather than capturing all inflectional components.

AnIta is an advanced morphological analyzer for Italian, implemented within the FSA framework [40]. It supports a comprehensive lexicon with over 120,000 lemmas and handles inflectional, derivational, and compositional phenomena. AnIta's segmentation occurs on two levels: superficial segmentation of word forms and derivation graphs. Although derivation graphs are incomplete, the tool's focus on superficial segmentation aligns with our research needs. For the segmentation of lemmas related to derivational phenomena, AnIta adopts two main rules: (1) affixes are kept unchanged; (2) lexicon entries are segmented only if their base is a recognizable independent Italian word.

3. Methods

In this study, we trained three models, originally developed for other languages, using an Italian dataset that was manually created and verified with morphological segmentations. After evaluating the performance of the models, we selected the most effective one and used it to extract morphological parameters from the words in the MultiLS-IT dataset, a resource designed for lexical simplification in the Italian language [42, 43]. The dataset comprises 600 contextualized words, annotated for complexity and accompanied by substitutes perceived as simpler than the target word. Each word was evaluated by a group of native speakers with a perceived complexity score ranging from 1 to 5. In the dataset, the aggregated and normalized complexity value is between 0 and 1, where 0 indicates very simple words and 1 indicates very complex words².

² The resource is available at https://github.com/MLSP2024/MLSP_Data.

The morphological traits extracted by the selected model were then integrated with other linguistic features typically considered influential in the perception of word complexity [9]. These combined features were analyzed in a correlation study with the perceived complexity values of MultiLS-IT to assess their impact on predicting linguistic complexity. By examining the relationships between these variables, we aim to determine whether morphological measures can be effectively used in systems designed to automatically identify word complexity.

3.1. Dataset

The primary reference for this work is the AnIta dataset, which includes data annotated with morphological segmentations based on specific rules. One rule excludes bases derived from Latin, Greek, and other languages. Since Italian, especially in technical and specialized fields, contains many such words, we modified the dataset to include these forms to ensure accurate representation.

The initial dataset consisted of numerous entries automatically generated by AnIta, often including overgenerated word-forms (possible words [44]), especially in evaluative morphology. This resulted in a comprehensive dataset with approximately two million entries. To adapt the AnIta dataset for our research needs, we undertook several steps.

1) Due to the extensive size, we reduced the sample, retaining one-third of entries for each letter, resulting in approximately 728,814 word-forms (35% of the original dataset). This sample maintains a fair representation of all linguistic categories³. 2) We systematically identified and addressed prefixes and suffixes, prioritizing longer affixes to preserve more informative morphological structures. This semi-automatic approach facilitated manual verification while enhancing segmentation quality. 3) We manually reviewed the segmented words, ensuring accuracy and consistency, preserving prefixes in their original forms as per AnIta's rule number one. 4) The final dataset was divided into training (80%) and test (20%) sets, comprising 583,051 and 145,763 words respectively. This split allowed effective training and validation of our models without needing a separate validation set, as no parameter tuning was performed. This streamlined methodology ensured a robust dataset for implementing and evaluating our automatic segmentation system.

³ Initially, we aimed to manually review the entire dataset to address any inconsistencies and overlooked segments. However, due to time constraints, we opted to reduce the dataset by randomly selecting 30% of the entries for each letter.

3.2. Segmentation Models

Given the extensive dataset at our disposal, we selected models within the domain of supervised or semi-supervised learning. The models considered include:

Morfessor FlatCat [31]: a semi-supervised model that utilizes an HMM approach for morphological segmentation. It is efficient in handling languages with complex morphological structures. The model's flat lexicon and the use of semi-supervised learning make it particularly suited for scenarios where annotated data is scarce.

Neural Morpheme Segmentation [33]: a supervised model based on CNNs, designed to segment morphemes by treating the task as a sequential labeling problem using the BMES scheme (Begin, Middle, End, Single). This model is noted for its ability to capture local dependencies within textual data. Its architecture includes multiple convolutional and pooling layers, enhancing its capability to identify and segment complex morphological patterns.

MorphemeBERT [45]: an advanced model that integrates BERT's character embeddings with CNNs to enhance morphological segmentation. BERT provides deep, context-rich linguistic representations, which can significantly improve the model's accuracy in identifying morphemic boundaries.

3.3. Evaluation

After constructing the dataset and selecting the previously described models, we proceeded with the training. Table 1 presents a comparative evaluation of the three models using precision, recall, F1 score, and accuracy. These metrics are standard for assessing the performance of boundary detection models, providing a comprehensive overview of each model's effectiveness in identifying and segmenting morphemes accurately.

Automatic segmentation system    Precision   Recall   F1       Accuracy
Neural Morpheme Segmentation     0.9879      0.9806   0.9892   0.9793
MorphemeBERT                     0.9868      0.9199   0.9522   0.9581
Morfessor FlatCat                0.7974      0.3676   0.5033   0.7399

Table 1: Results of the models on morphological segmentation.

Neural Morpheme Segmentation demonstrates the highest performance among the three systems across almost all metrics, particularly excelling in precision and F1 score. The high precision (0.9879) indicates that the model is very accurate in identifying correct morpheme boundaries, minimizing false positives. In other words, when the model segments a word, it reliably places the boundaries at the correct points. Its F1 score (0.9892), which balances precision and recall, underscores the model's ability not only to accurately segment morphemes but also to capture the majority of them with minimal oversight. The high recall (0.9806) confirms that the model rarely misses morphemes, making it particularly well-suited for handling complex or less frequent morphological patterns. This balance between high precision and recall showcases the robustness of the CNN-based architecture, which can effectively model both local dependencies between segments and the global morphological structure of words⁴.

⁴ This model is available upon request. Please contact the author directly to access the model and relevant references.

MorphemeBERT demonstrates a high level of precision, indicating that when it identifies a morpheme, it is likely correct. However, its recall is noticeably lower than that of Neural Morpheme Segmentation, which suggests that while it makes fewer errors, it also fails to detect a significant number of morphemes. This trade-off between precision and recall points to a more conservative approach in morpheme segmentation, where the model prioritizes accuracy over coverage. The F1 score of 0.9522, though still strong, highlights this imbalance between precision and recall, meaning the model performs well but lacks the comprehensive identification that would elevate its overall performance. The accuracy of 0.9581 reflects that the model is quite reliable in general, but its inability to capture as many correct morphemes as Neural Morpheme Segmentation affects its overall segmentation capability. This limitation might be due to how MorphemeBERT integrates BERT embeddings, which are optimized for context-rich predictions but may struggle with identifying morphemic boundaries in less straightforward or ambiguous cases, leading to more missed segments.

Morfessor FlatCat shows a considerably weaker performance compared to the other two models. While its precision score of 0.7974 is decent, meaning that the morphemes it identifies are mostly accurate, its recall is notably low. This indicates that the model misses a substantial number of morphemes, failing to capture the full complexity of word segmentation. The low recall suggests that Morfessor FlatCat struggles to identify many valid morphemic boundaries, which results in incomplete or inaccurate segmentations. Consequently, its F1 score (0.5033) and accuracy (0.7399) are significantly lower, suggesting that this system is less reliable for applications requiring high fidelity in morpheme segmentation.

4. Selection of Linguistic Features

Based on a thorough review of the literature on lexical complexity prediction [9, 46], we selected several linguistic features to analyze their impact on complexity. In addition to common surface characteristics, such as the number of letters, syllables, and vowels in words, commonly used in complexity studies and readability calculations, we identified other relevant parameters. One key factor is the frequency of a word, as more frequent words tend to be perceived as more familiar and thus less complex. We calculated it using the ItWac corpus [47]. Another important parameter is the number of senses a word has, measured using the lexical resource ItalWordNet [48]. Lastly, the presence of stop words, identified with a spaCy model, can influence the perceived complexity of a sentence or text, as these are common words that often carry little inherent meaning. Given the focus of this study on morphological features' impact on lexical complexity, we concentrated on several key aspects related to the internal structure of words. These features could show how morphological traits contribute to word intricacy:

Number of morphemes: Morphemes are the smallest units of meaning in words, including affixes (prefixes and suffixes) and roots. The number of morphemes gives an indication of the information load of a word. Lexical items with more morphemes typically require more decoding effort from readers. We used our Convolutional Neural Model for automatic morphological segmentation and morpheme counting.

Morphological density: This quantitative metric is defined as the ratio of the number of morphemes to word length, offering a measure of how densely packed meaningful units are within a word. Higher morphological density can indicate more cognitive load, as each unit contributes distinct information, potentially raising the complexity of the word.

Frequency of the lexical morpheme: Lexical morphemes carry the core meaning of the word. Employing our morphological segmenter on the ItWac corpus [47] enabled us to dissect the word into segments and aggregate the frequencies of individual morphemes. This frequency, transformed using a logarithmic scale, helps predict complexity by leveraging the familiarity of frequently occurring morphemes. The use of lexical morpheme frequency as a complexity indicator is based on the idea that even if a word is unfamiliar as a whole, its component morphemes may be common in the language and more recognizable [49].

By integrating these morphological features with other linguistic traits typically considered influential in speakers' perception of complexity, we aim to assess their impact on predicting linguistic complexity⁵.

⁵ For a detailed analysis of how these parameters were processed, refer to Occhipinti 2024.

5. Analysis and discussion

Through studying the correlations between these variables, we seek to determine whether morphological measures can be effectively used to develop systems capable of automatically identifying word complexity. To achieve this, we conducted a correlation and significance analysis between the features discussed earlier and the perceived complexity values for the 600 words included in MultiLS-IT.

Feature                       Correlation   p-value
Length                        0.082         0.045*
Number of vowels              0.097         0.018*
Number of syllables           0.091         0.026*
Number of morphemes           0.112         0.006*
Senses_ID                     -0.277        0.000*
Stopword                      -0.124        0.003*
Lemma frequency               -0.467        0.000*
Morphological density         0.036         0.381
Lexical morpheme frequency    -0.333        0.000*

Table 2: Spearman correlation coefficients and p-values for features and complexity. Note: * indicates statistical significance.

[Figure 1: Correlation of complexity values.]

Table 2 presents the Spearman correlation coefficients and their statistical significance for the features calculated⁶. The correlation analysis reveals several important insights.

⁶ Spearman's rank correlation was chosen because it does not assume a linear relationship between variables, making it more suitable for our dataset, where the relationships between features like word length, number of morphemes, and word complexity may not follow a strictly linear pattern. Spearman's correlation measures whether an increase in one variable tends to be consistently associated with an increase (or decrease) in another, which is more appropriate given the nature of our linguistic features.

Word length, number of vowels, and number of syllables all have small but statistically significant positive correlations with complexity. This suggests that, as expected, longer words with more vowels and syllables tend to be perceived as more complex. These factors are typical in readability studies, where more phonologically complex words are generally harder to process.

The number of morphemes also shows a positive correlation with complexity, reinforcing the idea that words with more morphemes are perceived as more complex. This feature is statistically significant as well.

Negative correlations for senses_ID, stopword presence, and lemma frequency suggest that words with more senses, those that are stopwords, or those that are more frequently used are perceived as less complex. These features are also statistically significant. It is noteworthy that the number of senses (senses_ID) is inversely proportional to complexity. This could be attributed to the incompleteness of ItalWordNet, potentially leading to unreliable predicted values.

Morphological density, however, does not show a statistically significant correlation with complexity, suggesting that the ratio of morphemes to word length may not be a strong predictor of perceived complexity.

The lexical morpheme frequency shows a significant negative correlation with complexity, indicating that more frequently occurring morphemes contribute to lower perceived complexity. This supports the notion that familiar morphemes, even within otherwise complex words, aid in comprehension.

The statistically significant correlations for most features validate their relevance in complexity prediction. However, it is important to note that our findings are based on a relatively small dataset of annotated complexity perceptions. To obtain more robust and generalizable results, it would be highly beneficial to have access to a larger and more diverse dataset of complexity annotations. Expanding the dataset to include a wider variety of texts and contexts would enhance the reliability of the correlations observed and improve the training and evaluation of automatic complexity prediction models. Future research should focus on gathering more extensive annotated datasets and exploring additional linguistic features that may influence complexity perception. By doing so, we can further refine our models and develop more effective tools for lexical simplification and other applications aimed at improving text accessibility.

6. Conclusion

This study highlights the significance of integrating morphological features into automatic models to enhance the comprehension and prediction of lexical complexity. The high performance of the Neural Morpheme Segmentation model demonstrates the efficacy of convolutional neural networks in capturing the detailed patterns of morphological segmentation in the Italian language.

The correlation analysis reveals that while traditional metrics like word length and frequency are valuable predictors of complexity, incorporating morphological features provides additional insights that enrich our understanding of lexical complexity. Notably, the positive correlation between the number of morphemes and perceived complexity suggests that words with more morphemes are inherently more complex. Conversely, frequent lexical morphemes tend to reduce perceived complexity, highlighting the importance of familiarity in complexity perception. Our study also emphasizes the need for diverse linguistic features, including both surface characteristics and morphological traits, to create more robust and accurate models for predicting word complexity.

These findings underscore the importance of considering a range of linguistic features, including morphological traits, when assessing lexical complexity. By integrating these features into computational models, we can enhance their ability to accurately predict word complexity and, subsequently, improve lexical simplification.

References

[1] J. T. Devlin, H. L. Jamison, P. M. Matthews, L. M. Gonnerman, Morphology and the internal structure of words, Proceedings of the National Academy of Sciences 101 (2004) 14984–14988.
[2] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A. Grönroos, M. Kurimo, S. Virpioja, A comparative study of minimally supervised morphological segmentation, Computational Linguistics 42 (2016) 91–120.
[3] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP, arXiv preprint arXiv:2112.10508 (2021).
[4] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[6] K. Bostrom, G. Durrett, Byte pair encoding is suboptimal for language model pretraining, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4617–4624.
[7] X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast WordPiece tokenization, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 2089–2103.
[8] R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, M. Hulden, The SIGMORPHON 2016 shared task: morphological reinflection, in: Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2016, pp. 10–22.
[9] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL-International Journal of Applied Linguistics 165 (2014) 97–135.
[10] W. U. Dressler, Ricchezza e complessità morfologica, 1999, pp. 1000–1011.
[11] S. Scalise, Morfologia, il Mulino, 1994.
[12] J. A. Goldsmith, Segmentation and morphology, in: The Handbook of Computational Linguistics and Natural Language Processing, Wiley Online Library, 2010, pp. 364–393.
[13] J. G. Wolff, The discovery of segments in natural language, British Journal of Psychology 68 (1977) 97–106.
[14] C. G. Nevill-Manning, I. H. Witten, Identifying hierarchical structure in sequences: A linear-time algorithm, Journal of Artificial Intelligence Research 7 (1997) 67–82.
[15] M. Johnson, Unsupervised word segmentation for Sesotho using adaptor grammars, in: Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, 2008, pp. 20–27.
[16] Z. S. Harris, From phoneme to morpheme, Language 31 (1955) 190–222. URL: http://www.jstor.org/stable/411036.
[17] P. Cohen, B. Heeringa, N. M. Adams, An unsupervised algorithm for segmenting categorical time-series into episodes, in: Proceedings of Pattern Detection and Discovery: ESF Exploratory Workshop London, 2002, pp. 49–62.
[18] A. Sorokin, A. Kravtsova, Deep convolutional networks for supervised morpheme segmentation of Russian language, in: Proceedings of the 7th International Conference on Artificial Intelligence and Natural Language (AINL 2018), 2018, pp. 3–10.
[19] M. Creutz, K. Lagus, Unsupervised models for morpheme segmentation and morphology learning, ACM Transactions on Speech and Language Processing (TSLP) 4 (2007) 1–34.
[20] H. Poon, C. Cherry, K. Toutanova, Unsupervised morphological segmentation with log-linear models, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 209–217.
[21] K. Sirts, S. Goldwater, Minimally-supervised morphological segmentation using adaptor grammars, Transactions of the Association for Computational Linguistics 1 (2013) 255–266.
[22] Z. S. Harris, Morpheme boundaries within words: Report on a computer test, Springer Netherlands, 1970, pp. 68–77.
[23] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Supervised morphological segmentation in a low-resource learning setting using conditional random fields, in: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 29–37.
[24] J. Goldsmith, Unsupervised learning of the morphology of a natural language, Computational Linguistics 27 (2001) 153–198.
[25] J. Goldsmith, An algorithm for the unsupervised learning of morphology, Natural Language Engineering 12 (2006) 353–371.
[26] M. Creutz, K. Lagus, Unsupervised discovery of morphemes, in: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, 2002, pp. 21–30.
[27] M. J. P. Creutz, K. H. Lagus, Morfessor in the Morpho Challenge, in: Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, 2006, pp. 12–17.
[28] Ö. Kılıç, C. Bozsahin, Semi-supervised morpheme segmentation without morphological analysis, in: Proceedings of the Workshop on Language Resources and Technologies for Turkic Languages, LREC, 2012, pp. 52–56.
[29] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Painless semi-supervised morphological segmentation using conditional random fields, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.
[37] E. Zanchetta, M. Baroni, Morph-it! A free corpus-based morphological resource for the Italian language, in: Proceedings from the Corpus Linguistics Conference Series 2005 (ISSN 1747-9398), volume 1, 2005, pp. 1–12.
[38] R. Delmonte, et al., Computational Linguistic Text Processing – Lexicon, Grammar, Parsing and Anaphora Resolution, Nova Science Publishers, 2008.
[39] E. Pianta, C. Girardi, R. Zanoli, The TextPro tool suite, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008, pp. 2603–2607.
[40] F. Tamburini, M. Melandri, AnIta: a powerful morphological analyser for Italian, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 2012, pp. 941–947.
[41] L. Talamo, C. Celata, P. M. Bertinetto, DerIvaTario: an annotated lexicon of Italian derivatives, Word Structure 9 (2016) 72–102.
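The surface/canonical contrast drawn for mangiavano can be written out as data. The following is a minimal illustration in Python of the two representations, not the paper's tooling:

```python
# Illustration of the two segmentation styles for the Italian verb form
# "mangiavano" (they were eating). Mirrors the example in the text; it is
# not the segmentation tool used in the study.
word = "mangiavano"

# Surface segmentation: morphs as they appear in the word, so joining
# them restores the original string.
surface = ["mangi", "avano"]
assert "".join(surface) == word

# Canonical segmentation: morphs are restored to their standard forms
# (the citation form "mangiare"), so joining them need not restore
# the original word.
canonical = ["mangiare", "avano"]
assert "".join(canonical) != word
```

The defining property of surface segmentation, concatenation back to the word form, is exactly what makes it cheaper to produce and verify than canonical segmentation.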
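The decoding of infelicità sketched above can be made concrete. The glosses are the article's; the list-of-pairs representation is our own illustration:

```python
# Hypothetical illustration of the decoding steps for "infelicità"
# (unhappiness): each morph paired with the meaning it contributes.
morphs = [
    ("in-", "prefix negating the quality of the base"),
    ("felic", "root of 'felice' (happy)"),
    ("-ità", "suffix turning the adjective into an abstract noun"),
]

# Joining the morphs (minus the affix hyphens) restores the word,
# which is what a reader must implicitly do to decode it.
word = "".join(m.strip("-") for m, _ in morphs)
assert word == "infelicità"
```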
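The 80/20 division of the word-forms described in step 4 can be sketched as follows. The entry count and the resulting set sizes match the numbers reported here; the shuffling procedure and the seed are assumptions of ours:

```python
import random

# Sketch of an 80/20 train/test split over the reduced AnIta sample.
# 728,814 entries yield 583,051 training and 145,763 test items, as in
# the paper; reproducible shuffling via a fixed seed is our assumption.
def split_dataset(entries, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and test portions."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    cut = int(len(entries) * train_ratio)
    return entries[:cut], entries[cut:]

train, test = split_dataset(range(728_814))
assert len(train) == 583_051 and len(test) == 145_763
```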
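The BMES labeling scheme used by the CNN-based segmenter can be illustrated with a toy tagging function. This is a sketch of the scheme itself, not of the model:

```python
# BMES sequence labeling: each character of a word is tagged as the
# Begin, Middle, or End of a multi-character morph, or as a
# Single-character morph.
def to_bmes(morphs):
    tags = []
    for morph in morphs:
        if len(morph) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(morph) - 2) + ["E"])
    return tags

# The surface segmentation mangi- + -avano yields one tag per character:
tags = to_bmes(["mangi", "avano"])
assert tags == list("BMMMEBMMME")
```

Framed this way, segmentation reduces to per-character sequence classification, which is what lets a convolutional architecture exploit local character context.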
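The precision/recall trade-off discussed here is measured over morph boundaries. A minimal sketch of the standard definitions, assuming boundaries are represented as character offsets:

```python
# Boundary-based evaluation: precision, recall, and F1 over predicted
# morph-boundary positions. The metric definitions are standard; the
# set-of-offsets encoding is our simplification.
def boundary_prf(gold, pred):
    tp = len(gold & pred)                       # correctly placed boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold "mangi|avano" has one boundary, at offset 5. A system predicting
# boundaries at 5 and 7 gets perfect recall but imperfect precision,
# the conservative/over-eager trade-off described in the text.
p, r, f1 = boundary_prf({5}, {5, 7})
assert (p, r) == (0.5, 1.0)
```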
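The morphological features just described can be sketched as computations over a surface segmentation. The definitions follow the text; the toy frequency table and the choice of the first morph as the lexical morpheme are illustrative assumptions:

```python
import math

# Sketch of three morphological features from a surface segmentation.
# morph_frequencies is a hypothetical corpus frequency lookup.
def morphological_features(morphs, morph_frequencies):
    word = "".join(morphs)
    n_morphemes = len(morphs)
    # Morphological density: morphemes per character of the word.
    density = n_morphemes / len(word)
    # Log-transformed frequency of the lexical (root) morpheme; taking
    # the first morph as the lexical one is a simplification.
    log_lex_freq = math.log(morph_frequencies.get(morphs[0], 1))
    return {"n_morphemes": n_morphemes,
            "morphological_density": density,
            "log_lexical_morpheme_frequency": log_lex_freq}

features = morphological_features(["mangi", "avano"], {"mangi": 12_000})
assert features["n_morphemes"] == 2
assert features["morphological_density"] == 2 / 10
```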
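Spearman's rank correlation, as motivated in the footnote, is simply the Pearson correlation computed on ranks. A pure-Python sketch, assuming no tied values and using invented toy numbers:

```python
# Spearman's rho as the Pearson correlation of ranks. Assumes no ties,
# which keeps the ranking step simple; real analyses (e.g. SciPy's
# spearmanr) average the ranks of tied values.
def spearman_rho(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            result[i] = float(rank)
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# e.g. morpheme counts vs. perceived complexity for five invented words:
rho = spearman_rho([1, 3, 2, 4, 5], [0.10, 0.35, 0.20, 0.80, 0.55])
assert abs(rho - 0.9) < 1e-9
```

Because only the ranks enter the computation, any monotonic (not necessarily linear) association between a feature and perceived complexity is captured, which is exactly the property the footnote appeals to.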
guistics, volume 2: Short Papers, 2014, pp. 84–89. [42] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, [30] J. Lafferty, A. McCallum, F. Pereira, et al., Condi- S. Bott, S. Calderon Ramirez, R. Cardon, T. François, tional random fields: Probabilistic models for seg- A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, menting and labeling sequence data, in: Interna- J. M. Imperial, A. Nohejl, K. North, L. Occhip- tional Conference on Machine Learning, 2001, pp. inti, N. Peréz Rojas, N. Raihan, T. Ranasinghe, 282—-289. M. Solis Salazar, M. Zampieri, H. Saggion, An [31] S.-A. Grönroos, S. Virpioja, P. Smit, M. Kurimo, Mor- extensible massively multilingual lexical simplifi- fessor flatcat: An hmm-based method for unsuper- cation pipeline dataset using the MultiLS frame- vised and semi-supervised learning of morphology, work, in: R. Wilkens, R. Cardon, A. Todirascu, in: Proceedings of COLING 2014, the 25th Inter- N. Gala (Eds.), Proceedings of the 3rd Workshop national Conference on Computational Linguistics, on Tools and Resources for People with REAd- 2014, pp. 1177–1185. ing DIfficulties (READI) @ LREC-COLING 2024, [32] X. Zhu, A. B. Goldberg, Introduction to semi- ELRA and ICCL, Torino, Italia, 2024, pp. 38–46. URL: supervised learning, Springer Nature, 2022. https://aclanthology.org/2024.readi-1.4. [33] A. Sorokin, Convolutional neural networks for [43] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, low-resource morpheme segmentation: baseline S. Bott, S. Calderon Ramirez, R. Cardon, T. François, or state-of-the-art?, in: Proceedings of the 16th A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M. Workshop on Computational Research in Phonet- Imperial, A. Nohejl, K. North, L. Occhipinti, N. P. ics, Phonology, and Morphology, 2019, pp. 154–159. Rojas, N. Raihan, T. Ranasinghe, M. S. Salazar, URL: https://aclanthology.org/W19-4218. doi:10. S. Štajner, M. Zampieri, H. Saggion, The BEA 18653/v1/W19-4218. 2024 shared task on the multilingual lexical sim- [34] L. 
Wang, Z. Cao, Y. Xia, G. De Melo, Morphological plification pipeline, in: E. Kochmar, M. Bexte, segmentation with window lstm neural networks, J. Burstein, A. Horbach, R. Laarmann-Quante, in: Proceedings of the AAAI Conference on Artifi- A. Tack, V. Yaneva, Z. Yuan (Eds.), Proceedings cial Intelligence, 2016, pp. 2842–2848. of the 19th Workshop on Innovative Use of NLP [35] R. Cotterell, T. Mueller, A. Fraser, H. Schütze, for Building Educational Applications (BEA 2024), Labeled morphological segmentation with semi- Association for Computational Linguistics, Mex- markov models, in: Proceedings of the Nineteenth ico City, Mexico, 2024, pp. 571–589. URL: https: Conference on Computational Natural Language //aclanthology.org/2024.bea-1.51. Learning, 2015, pp. 164–174. [44] M. Aronoff, A decade of morphology and word [36] M. Battista, V. Pirrelli, Una piattaforma di morfolo- formation, Annual review of anthropology (1983) gia computazionale per l’analisi e la generazione 355–375. delle parole italiane, Technical Report, ILC-CNR, [45] A. Sorokin, Improving morpheme segmentation us- 1999. ing bert embeddings, in: International Conference [37] E. Zanchetta, M. Baroni, Morph-it! a free corpus- on Analysis of Images, Social Networks and Texts, based morphological resource for the italian lan- Springer, 2021, pp. 148–161. guage, in: Proceedings of corpus linguistics confer- [46] K. North, M. Zampieri, M. Shardlow, Lexical com- plexity prediction: An overview, ACM Computing Surveys 55 (2023) 1–42. [47] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The wacky wide web: a collection of very large linguistically processed web-crawled corpora, Lan- guage resources and evaluation 43 (2009) 209–226. [48] A. Roventini, A. Alonge, N. Calzolari, B. Magnini, F. Bertagna, Italwordnet: a large semantic database for italian., in: In Proceedings of the Second Inter- national Conference on Language Resources and Evaluation (LREC-2000), 2000, pp. 783–790. [49] P. Colé, J. Segui, M. 
Taft, Words and morphemes as units for lexical access, Journal of Memory and Language 37 (1997) 312–330. [50] L. Occhipinti, Complex word identification for ital- ian language: a dictionary-based approach, in: Pro- ceedings of Clib24, Sixth International Conference on Computational Linguistics in Bulgaria, 2024, pp. 119–129.