<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Occhipinti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Morphological analysis is essential for various Natural Language Processing (NLP) tasks, as it reveals the internal structure of words and deepens our understanding of their morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the limited representation of detailed morphological information in existing corpora. Using an automatic segmentation tool, we extract quantitative morphological parameters to investigate their impact on the perception of word complexity by native Italian speakers. Through correlation analysis, we demonstrate that morphological features, such as the number of morphemes and lexical morpheme frequency, significantly influence how complex words are perceived. These insights contribute to improving automatic lexical complexity prediction models and offer a deeper understanding of the role of morphology in word comprehension.</p>
      </abstract>
      <kwd-group>
<kwd>Morphological segmentation</kwd>
        <kwd>Lexical complexity prediction</kwd>
        <kwd>Italian language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Morphological analysis is crucial for various NLP tasks, as it provides insights into the internal structures of words and helps us better understand the morphological and syntactic relationships between words [<xref ref-type="bibr" rid="ref17">1</xref>].</p>
      <p>The Italian language, with its rich morphology and extensive use of inflection and derivation, presents unique challenges and opportunities for morphological segmentation.</p>
      <p>Automatic segmentation, a key component of morphology learning, involves dividing word forms into meaningful units such as roots, prefixes, and suffixes [2]. This task falls under the broader category of subword segmentation [3] but is distinct due to its linguistic motivation. Computational approaches typically identify subwords based on purely statistical considerations, which often results in subunits that do not correspond to recognizable linguistic units [4, 5, 6, 7]. Making this task more morphologically oriented could enable models to generalize better to new words or forms, as basic roots or morphemes are often shared among words, and it could also facilitate the interpretation of model results.</p>
      <p>When discussing morphological segmentation, we can refer to two types: (1) surface segmentation, which involves dividing words into morphs, the surface forms of morphemes; (2) canonical segmentation, which involves dividing words into morphemes and reducing them to their standard forms [8].</p>
      <p>For instance, consider the Italian word mangiavano (they were eating). The resulting surface segmentation would be mangi- + -avano, where mangi- is a morph derived from the root of the verb mangiare, and -avano is the suffix indicating the third person plural of the imperfect tense. In contrast, the canonical segmentation would yield mangiare + -avano, with mangiare as the canonical morpheme and -avano as the suffix.</p>
      <p>It is important to note that the segmentation process is not always straightforward, as it involves various linguistic criteria that may not be immediately clear. For example, one of the challenges lies in deciding whether to detach or retain the thematic vowel, a vowel that appears between the root and the inflectional suffix, especially in Romance languages. In the case of mangiavano, the thematic vowel -a- could either be considered part of the root or treated as a separate morph. Similarly, other segmentation criteria might involve distinctions between compound forms, derivational affixes, or fused morphemes that do not have clear boundaries. As a result, the segmentation criteria can vary based on linguistic theory, the specific task (e.g., computational vs. linguistic analysis), or even the intended application of the segmentation (e.g., for syntactic parsing or machine learning).</p>
      <p>In this study, we focus on surface morphological segmentation for the Italian language. Morphological features are often not adequately represented in available corpora for this language, or they refer exclusively to morphosyntactic information, such as the grammatical category of words and a macro-level descriptive analysis mainly related to inflection. Information about the internal structure of words, such as derivation or composition, is often lacking. The primary objective of this work is to use an automatic segmenter to extract a series of quantitative morphological parameters. We believe that our approach does not require the detailed analysis provided by canonical segmentation, which could entail longer processing times.</p>
<p>In addition to examining classic parameters reported in the literature that influence complexity [9], such as word frequency, length, and number of syllables, we aim to explore how morphological features integrate with these factors to affect word complexity perception. Specifically, we seek to understand how the internal structure of words contributes to the cognitive load that speakers experience when processing more complex lexical items.</p>
      <p>Our premise is that words with more morphemes are more complex because they contain more information to decode [<xref ref-type="bibr" rid="ref31">10</xref>]. For example, consider the word infelicità (unhappiness). To decode it, one must know the word felice (happy), from which it is derived, as well as the prefix in-, which negates the quality expressed by the base term, and the suffix -ità, which transforms the adjective into an abstract noun. Therefore, to fully understand the meaning of infelicità, the reader or listener must be able to correctly recognize and interpret each of these morphemes and their contribution to the overall meaning of the word.</p>
      <p>The main contributions of this work are: (1) providing a tool capable of automatically segmenting words into linguistically motivated base forms; (2) presenting the dataset constructed for training our model; (3) evaluating the impact of different linguistic features on speakers' perception of word complexity, with a particular focus on morphological features.</p>
      <p>2. Related Works</p>
      <p>The study of morphological segmentation has evolved from classical linguistics to advanced machine learning techniques [11, 12]. The main approaches include lexicon-based and boundary-detection-based methods [2]. Lexicon-based methods rely on a comprehensive database of known morphemes [<xref ref-type="bibr" rid="ref2">13, 14, 15</xref>], while boundary-detection methods identify transition points between morphemes using statistical or machine learning techniques [16, 17, 18].</p>
      <p>Another significant distinction is between generative models and discriminative models. Generative models, suited for unsupervised learning, generate word forms and segmentations from raw data [<xref ref-type="bibr" rid="ref23">19, 20, 21</xref>]. In contrast, discriminative models, which require annotated data, predict segmentations based on learned relationships from labeled examples [22, 23].</p>
      <p>Unsupervised methods do not require labeled data, making them attractive for leveraging vast amounts of raw data. They trace back to Harris (1955), who used statistical methods to identify morphological segments. Notable systems include Linguistica [24, 25] and Morfessor [26, 27], which employ the Minimum Description Length (MDL) principle to identify regularities within data. Despite their utility, unsupervised methods often suffer from oversegmentation and incorrect segmentation of affixes [<xref ref-type="bibr" rid="ref23">19, 28</xref>]. These challenges arise due to the complex interplay of phonological, morphological, and semantic factors in natural languages.</p>
      <p>Semi-supervised methods leverage both annotated and unannotated data, enhancing model performance with minimal manual annotation [29]. These methods are effective in scenarios with limited labeled data [30, 31], using initial labeled datasets to hypothesize and validate patterns across larger unlabeled corpora [32]. While beneficial, semi-supervised methods depend on the quality of the initial labeled datasets and may struggle with languages exhibiting extensive morphological diversity [2].</p>
      <p>Supervised methods, relying on annotated datasets, typically achieve higher accuracy due to learning from explicitly labeled examples. Techniques include neural networks, Hidden Markov Models (HMMs), and Convolutional Neural Networks (CNNs) [33, 34, 35, 23]. Despite their high performance, supervised methods are limited by the need for extensive annotated corpora, which can be costly and time-consuming to create. Given access to a large annotated dataset for the Italian language, on which we made semi-manual corrections, our study primarily adopts a supervised approach.</p>
      <p>2.1. Resources available for the Italian language</p>
      <p>Several computational resources and tools have been developed to manage Italian morphological information [<xref ref-type="bibr" rid="ref39">36, 37, 38, 39, 40, 41</xref>]. These resources are essential for improving the accuracy of text processing and supporting advanced linguistic research. However, many of them focus primarily on morphological analysis, without providing detailed support for morphological segmentation, which limits their usefulness in tasks that require fine-grained word structure analysis. Even those tools that offer segmentation often approach it with different methods and objectives than ours.</p>
      <p>Morph-it! [<xref ref-type="bibr" rid="ref39">37</xref>] is an open-source lexicon that contains 504,906 entries and 34,968 unique lemmas, each annotated with morphological characteristics that link inflected word forms to their lemmas. While valuable for lemmatization and morphological analysis, it is not suited for morphological segmentation, as it primarily focuses on inflected forms rather than decomposing words into their individual morphemes.</p>
      <p>MorphoPro [39] is part of the TextPro suite and is designed for morphological analysis of both English and Italian. It uses a declarative knowledge base converted into a Finite State Automaton (FSA) for detailed morphological analysis. However, MorphoPro's output is geared towards global morphological analysis and lacks support for internal word segmentation into morphemes, limiting its applicability for more granular tasks.</p>
<p>MAGIC [36] provides a lexicon of approximately 100,000 lemmas and performs detailed morphological and morphosyntactic analysis. However, similar to other resources, MAGIC does not focus on morphological segmentation. Instead, it provides morphological and syntactic information about word forms, making it more useful for general morphological analysis than for segmenting words into individual morphemes.</p>
      <p>Getarun [38] offers a lexicon of around 80,000 roots and provides sophisticated morphosyntactic analysis. However, like MAGIC, it is designed primarily for syntactic parsing and lacks functionality for detailed morphological segmentation, focusing instead on morphological and syntactic relationships.</p>
      <p>DerIvaTario [41] is another resource that provides significant support for morphological segmentation, particularly in the context of derivational morphology. It offers detailed information on derivational patterns in Italian, mapping out how words are formed through derivational processes, which is especially useful for studying word formation in a structured manner. However, DerIvaTario focuses primarily on canonical segmentations and does not always recognize smaller morphemes, such as final morphemes. This limitation means it may miss finer-grained morphological elements, making it more suitable for analyzing larger, derivational units than for capturing all inflectional components.</p>
      <p>AnIta is an advanced morphological analyzer for Italian, implemented within the FSA framework [40]. It supports a comprehensive lexicon with over 120,000 lemmas and handles inflectional, derivational, and compositional phenomena. AnIta's segmentation occurs on two levels: superficial segmentation of word forms and derivation graphs. Although derivation graphs are incomplete, the tool's focus on superficial segmentation aligns with our research needs. For the segmentation of lemmas related to derivational phenomena, AnIta adopts two main rules: (1) affixes are kept unchanged; (2) lexicon entries are segmented only if their base is a recognizable independent Italian word.</p>
      <p>3. Methods</p>
      <p>In this study, we trained three models, originally developed for other languages, using an Italian dataset that was manually created and verified with morphological segmentations. After evaluating the performance of the models, we selected the most effective one and used it to extract morphological parameters from the words in the MultiLS-IT dataset, a resource designed for lexical simplification in the Italian language [42, 43].</p>
      <p>The dataset comprises 600 contextualized words, annotated for complexity and accompanied by substitutes perceived as simpler than the target word; the resource is available at https://github.com/MLSP2024/MLSP_Data. Each word was evaluated by a group of native speakers with a perceived complexity score ranging from 1 to 5. In the dataset, the aggregated and normalized complexity value is between 0 and 1, where 0 indicates very simple words and 1 indicates very complex words. The morphological traits extracted by the selected model were then integrated with other linguistic features typically considered influential in the perception of word complexity [9]. These combined features were analyzed in a correlation study with the perceived complexity values of MultiLS-IT to assess their impact on predicting linguistic complexity. By examining the relationships between these variables, we aim to determine whether morphological measures can be effectively used in systems designed to automatically identify word complexity.</p>
      <sec id="sec-1-1">
        <title>3.1. Dataset</title>
        <p>The primary reference for this work is the AnIta dataset, which includes data annotated with morphological segmentations based on specific rules. One rule excludes bases derived from Latin, Greek, and other languages. Since Italian, especially in technical and specialized fields, contains many such words, we modified the dataset to include these forms to ensure accurate representation.</p>
        <p>The initial dataset consisted of numerous entries automatically generated by AnIta, often including over-generated word forms (possible words [44]), especially in evaluative morphology. This resulted in a comprehensive dataset with approximately two million entries. To adapt the AnIta dataset for our research needs, we undertook several steps. 1) Due to the extensive size, we reduced the sample, retaining one-third of the entries for each letter, resulting in approximately 728,814 word forms (35% of the original dataset); initially, we aimed to manually review the entire dataset to address any inconsistencies and overlooked segments, but due to time constraints we opted to reduce it by randomly selecting 30% of the entries for each letter. This sample maintains a fair representation of all linguistic categories. 2) We systematically identified and addressed prefixes and suffixes, prioritizing longer affixes to preserve more informative morphological structures. This semi-automatic approach facilitated manual verification while enhancing segmentation quality. 3) We manually reviewed the segmented words, ensuring accuracy and consistency, preserving prefixes in their original forms as per AnIta's rule number one. 4) The final dataset was divided into training (80%) and test (20%) sets, comprising 583,051 and 145,763 words respectively.</p>
        <p>This split allowed effective training and validation of our models without needing a separate validation set, as no parameter tuning was performed. This streamlined methodology ensured a robust dataset for implementing and evaluating our automatic segmentation system.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3.2. Segmentation Models</title>
        <p>Given the extensive dataset at our disposal, we selected models within the domain of supervised or semi-supervised learning. The models considered include:</p>
        <p>Morfessor FlatCat [31]: a semi-supervised model that utilizes an HMM approach for morphological segmentation. It is efficient in handling languages with complex morphological structures. The model's flat lexicon and the use of semi-supervised learning make it particularly suited for scenarios where annotated data is scarce.</p>
        <p>Neural Morpheme Segmentation [33]: a supervised model based on CNNs, designed to segment morphemes by treating the task as a sequential labeling problem using the BMES scheme (Begin, Middle, End, Single). This model is noted for its ability to capture local dependencies within textual data. Its architecture includes multiple convolutional and pooling layers, enhancing its capability to identify and segment complex morphological patterns.</p>
        <p>MorphemeBERT [45]: an advanced model that integrates BERT's character embeddings with CNNs to enhance morphological segmentation. BERT provides deep, context-rich linguistic representations, which can significantly improve the model's accuracy in identifying morphemic boundaries.</p>
        <p>3.3. Evaluation</p>
        <p>After constructing the dataset and selecting the previously described models, we proceeded with the training. Table 1 presents a comparative evaluation of the three models using precision, recall, F1 score, and accuracy. These metrics are standard for assessing the performance of boundary detection models, providing a comprehensive overview of each model's effectiveness in identifying and segmenting morphemes accurately.</p>
        <p>Table 1. Evaluation of the automatic segmentation systems (Neural Morpheme Segmentation, MorphemeBERT, and Morfessor FlatCat) by precision, recall, F1 score, and accuracy.</p>
        <p>Neural Morpheme Segmentation demonstrates the highest performance among the three systems across almost all metrics, particularly excelling in precision and F1 score. The high precision (0.9879) indicates that the model is very accurate in identifying correct morpheme boundaries, minimizing false positives. In other words, when the model segments a word, it reliably places the boundaries at the correct points. Its F1 score (0.9892), which balances precision and recall, underscores the model's ability not only to accurately segment morphemes but also to capture the majority of them with minimal oversight. The high recall (0.9806) confirms that the model rarely misses morphemes, making it particularly well-suited for handling complex or less frequent morphological patterns. This balance between high precision and recall showcases the robustness of the CNN-based architecture, which can effectively model both local dependencies between segments and the global morphological structure of words. (This model is available upon request; please contact the author directly for access and relevant references.)</p>
        <p>MorphemeBERT demonstrates a high level of precision, indicating that when it identifies a morpheme, it is likely correct. However, its recall is noticeably lower than that of Neural Morpheme Segmentation, which suggests that while it makes fewer errors, it also fails to detect a significant number of morphemes. This trade-off between precision and recall points to a more conservative approach in morpheme segmentation, where the model prioritizes accuracy over coverage. The F1 score of 0.9522, though still strong, highlights this imbalance between precision and recall, meaning the model performs well but lacks the comprehensive identification that would elevate its overall performance. The accuracy of 0.9581 reflects that the model is quite reliable in general, but its inability to capture as many correct morphemes as Neural Morpheme Segmentation affects its overall segmentation capability. This limitation might be due to how MorphemeBERT integrates BERT embeddings, which are optimized for context-rich predictions but may struggle with identifying morphemic boundaries in less straightforward or ambiguous cases, leading to more missed segments.</p>
        <p>Morfessor FlatCat shows a considerably weaker performance compared to the other two models. While its precision score of 0.79744 is decent, meaning that the morphemes it identifies are mostly accurate, its recall is notably low. This indicates that the model misses a substantial number of morphemes, failing to capture the full complexity of word segmentation. The low recall suggests that Morfessor FlatCat struggles to identify many valid morphemic boundaries, which results in incomplete or inaccurate segmentations. Consequently, its F1 score (0.5033) and accuracy (0.7399) are significantly lower.</p>
        <p>By integrating these morphological features with other linguistic traits typically considered influential in speakers' perception of complexity, we aim to assess their impact on predicting linguistic complexity.</p>
      </sec>
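<p>To make the evaluation setup concrete, the sketch below shows how a segmentation can be encoded with the BMES character-labeling scheme and how boundary precision, recall, and F1 can be computed. This is an illustrative toy scorer; the exact scoring conventions of the experiments reported above may differ.</p>
<preformat>
```python
# Sketch: BMES labelling of a segmentation and boundary precision/recall/F1.
# Toy data and a simplified scorer; not the paper's exact evaluation code.

def to_bmes(morphs):
    """Label each character Begin/Middle/End/Single from a morph list."""
    tags = []
    for m in morphs:
        if len(m) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return tags

def boundaries(morphs):
    """Character positions of internal morph boundaries."""
    out, pos = set(), 0
    for m in morphs[:-1]:
        pos += len(m)
        out.add(pos)
    return out

def prf(gold, pred):
    """Boundary precision, recall, and F1 between two segmentations."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g.intersection(p))
    prec = tp / len(p) if p else 1.0
    rec = tp / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["mangi", "avano"]
print(to_bmes(gold))        # ten tags: B M M M E  B M M M E
print(prf(gold, gold))      # (1.0, 1.0, 1.0)
```
</preformat>
<p>A prediction that places the boundary one character off, such as mang + iavano, scores zero on all three boundary metrics, which is why boundary-level scoring is a strict but informative measure for this task.</p>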
    </sec>
<sec id="sec-2">
      <title>5. Analysis and discussion</title>
      <p>Morfessor FlatCat's markedly lower F1 score and accuracy suggest that this system is less reliable for applications requiring high fidelity in morpheme segmentation.</p>
    </sec>
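<p>Concretely, the morphological measures examined in this work reduce to simple computations over a word's segmentation. The sketch below uses hypothetical segmentations and corpus counts, and a simplistic longest-morph heuristic to pick the lexical morpheme; none of these stand in for the project's actual data or tooling.</p>
<preformat>
```python
# Sketch of the morphological measures used in this study: morpheme count,
# morphological density, and log-scaled lexical-morpheme frequency.
# SEG and MORPH_FREQ are hypothetical toy values (accent dropped for ASCII).

import math

SEG = {"infelicita": ["in", "felic", "ita"]}               # toy surface segmentation
MORPH_FREQ = {"in": 120000, "felic": 3500, "ita": 45000}   # toy corpus counts

def morpheme_count(word):
    return len(SEG[word])

def morphological_density(word):
    """Morphemes per character of the word."""
    return morpheme_count(word) / len(word)

def lexical_morpheme_logfreq(word):
    """Log frequency of the content-bearing morph (longest morph: a toy heuristic)."""
    lexical = max(SEG[word], key=len)
    return math.log(MORPH_FREQ[lexical])

print(morpheme_count("infelicita"))                   # 3
print(round(morphological_density("infelicita"), 2))  # 0.3
```
</preformat>
<p>The logarithmic transform compresses the heavy-tailed distribution of corpus frequencies, so that very frequent and moderately frequent morphemes are compared on a more uniform scale.</p>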
    <sec id="sec-3">
      <title>4. Selection of Linguistic Features</title>
      <sec id="sec-3-1">
<title>Linguistic features</title>
        <p>Based on a thorough review of the literature on lexical complexity prediction [9, 46], we selected several linguistic features to analyze their impact on complexity. In addition to common surface characteristics, such as the number of letters, syllables, and vowels in words, commonly used in complexity studies and readability calculations, we identified other relevant parameters. One key factor is the frequency of a word, as more frequent words tend to be perceived as more familiar and thus less complex; we calculated it using the ItWac corpus [47]. Another important parameter is the number of senses a word has, measured using the lexical resource ItalWordNet [48]. Lastly, the presence of stop words, calculated with a spaCy model, can influence the perceived complexity of a sentence or text, since stop words are common words that often carry little inherent meaning. Given the focus of this study on morphological features' impact on lexical complexity, we also concentrated on several key aspects related to the internal structure of words, features that could show how morphological traits contribute to word intricacy.</p>
        <p>Through studying the correlations between these variables, we seek to determine whether morphological measures can be effectively used to develop systems capable of automatically identifying word complexity. To achieve this, we conducted a correlation and significance analysis between the features discussed earlier and the perceived complexity values for the 600 words included in MultiLS-IT.</p>
        <p>Table 2. Spearman correlation coefficients and p-values for features and complexity (* indicates statistical significance): Length 0.082 (p = 0.045*); Number of vowels 0.097 (p = 0.018*); Number of syllables 0.091 (p = 0.026*); Number of morphemes 0.112 (p = 0.006*); Senses_ID -0.277 (p = 0.000*); Stopword -0.467 (p = 0.000*); Lemma frequency -0.124 (p = 0.003*); Lexical morpheme frequency -0.336 (p = 0.000*); Morphological density 0.033 (p = 0.381).</p>
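<p>The coefficients in Table 2 are Spearman rank correlations. As a self-contained sketch (a real analysis would use a statistics library such as SciPy), rho can be computed by ranking both variables, with ties sharing their average rank, and then taking the Pearson correlation of the ranks; the helper names below are our own.</p>
<preformat>
```python
# Hand-rolled Spearman rank correlation, for illustration only.
# Assumes neither input is constant (otherwise the denominator is zero).

def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i != len(order):
        j = i
        while j != len(order) and xs[order[j]] == xs[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        for k in order[i:j]:
            r[k] = avg
        i = j
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# A perfectly monotone relationship gives rho of 1.
print(round(spearman([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # 1.0
```
</preformat>
<p>Because only ranks enter the computation, the measure captures any consistently monotone association, not just linear ones, which motivates its use for features such as word length and morpheme counts.</p>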
<p>Number of morphemes: morphemes are the smallest units of meaning in words, including affixes (prefixes and suffixes) and roots. The number of morphemes gives an indication of the information load of a word. Lexical items with more morphemes typically require more decoding effort from readers. We used our convolutional neural model for automatic morphological segmentation and morpheme counting.</p>
        <p>Morphological density: this quantitative metric is defined as the ratio of the number of morphemes to word length, offering a measure of how densely packed meaningful units are within a word. Higher morphological density can indicate more cognitive load, as each unit contributes distinct information, potentially raising the complexity of the word.</p>
        <p>Frequency of the lexical morpheme: lexical morphemes carry the core meaning of the word. Employing our morphological segmenter on the ItWac corpus [47] enabled us to dissect each word into segments and aggregate the frequencies of individual morphemes. This frequency, transformed using a logarithmic scale, helps predict complexity by leveraging the familiarity of frequently occurring morphemes. The use of lexical morpheme frequency as a complexity indicator is based on the idea that even if a word is unfamiliar as a whole, its component morphemes may be common in the language and more recognizable [49]. For a detailed analysis of how these parameters were processed, refer to Occhipinti 2024.</p>
        <p>Table 2 presents the Spearman correlation coefficients and their statistical significance for the features calculated. Spearman's rank correlation was chosen because it does not assume a linear relationship between variables, making it more suitable for our dataset, where the relationships between features like word length, number of morphemes, and word complexity may not follow a strictly linear pattern. Spearman's correlation measures whether an increase in one variable tends to be consistently associated with an increase (or decrease) in another, which is more appropriate given the nature of our linguistic features. The correlation analysis reveals several important insights.</p>
        <p>Word length, number of vowels, and number of syllables all have small but statistically significant positive correlations with complexity. This suggests that, as expected, longer words with more vowels and syllables tend to be perceived as more complex. These factors are typical in readability studies, where more phonologically complex words are generally harder to process.</p>
        <p>The number of morphemes also shows a positive correlation with complexity, reinforcing the idea that words with more morphemes are perceived as more complex. This feature is statistically significant as well.</p>
        <p>Negative correlations for senses_ID, stopword presence, and lemma frequency suggest that words with more senses, those that are stopwords, or those that are more frequently used are perceived as less complex. These features are also statistically significant. It is noteworthy that the number of senses (senses_ID) is inversely proportional to complexity. This could be attributed to the incompleteness of ItalWordNet, potentially leading to unreliable predicted values.</p>
        <p>Morphological density, however, does not show a statistically significant correlation with complexity, suggesting that the ratio of morphemes to word length may not be a strong predictor of perceived complexity.</p>
        <p>The lexical morpheme frequency shows a significant negative correlation with complexity, indicating that more frequently occurring morphemes contribute to lower perceived complexity. This supports the notion that familiar morphemes, even within otherwise complex words, aid in comprehension.</p>
        <p>These findings underscore the importance of considering a range of linguistic features, including morphological traits, when assessing lexical complexity. By integrating these features into computational models, we can enhance their ability to accurately predict word complexity and, subsequently, improve lexical simplification.</p>
        <p>6. Conclusion</p>
        <p>This study highlights the significance of integrating morphological features into automatic models to enhance the comprehension and prediction of lexical complexity. The high performance of the Neural Morpheme Segmentation model demonstrates the efficacy of convolutional neural networks in capturing the detailed patterns of morphological segmentation in the Italian language. The correlation analysis reveals that while traditional metrics like word length and frequency are valuable predictors of complexity, incorporating morphological features provides additional insights that enrich our understanding of lexical complexity. Notably, the positive correlation between the number of morphemes and perceived complexity suggests that words with more morphemes are inherently more complex. Conversely, frequent lexical morphemes tend to reduce perceived complexity, highlighting the importance of familiarity in complexity perception. Our study also emphasizes the need for diverse linguistic features, including both surface characteristics and morphological traits, to create more robust and accurate models for predicting word complexity. The statistically significant correlations for most features validate their relevance in complexity prediction.</p>
        <p>However, it is important to note that our findings are based on a relatively small dataset of annotated complexity perceptions. To obtain more robust and generalizable results, it would be highly beneficial to have access to a larger and more diverse dataset of complexity annotations. Expanding the dataset to include a wider variety of texts and contexts would enhance the reliability of the correlations observed and improve the training and evaluation of automatic complexity prediction models. Future research should focus on gathering more extensive annotated datasets and exploring additional linguistic features that may influence complexity perception. By doing so, we can further refine our models and develop more effective tools for lexical simplification and other applications aimed at improving text accessibility.</p>
        <p>[26] M. Creutz, K. Lagus, Unsupervised discovery of morphemes, in: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, 2002, pp. 21–30.</p>
        <p>[27] M. J. P. Creutz, K. H. Lagus, Morfessor in the morpho challenge, in: Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, 2006, pp. 12–17.</p>
        <p>[28] Ö. Kılıç, C. Bozsahin, Semi-supervised morpheme segmentation without morphological analysis, in: Proceedings of the Workshop on Language Resources and Technologies for Turkic Languages, LREC, 2012, pp. 52–56.</p>
        <p>[29] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Painless semi-supervised morphological segmentation using conditional random fields, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, 2014, pp. 84–89.</p>
        <p>[30] J. Lafferty, A. McCallum, F. Pereira, et al., Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: International Conference on Machine Learning, 2001, pp. 282–289.</p>
        <p>[31] S.-A. Grönroos, S. Virpioja, P. Smit, M. Kurimo, Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, 2014, pp. 1177–1185.</p>
        <p>[32] X. Zhu, A. B. Goldberg, Introduction to semi-supervised learning, Springer Nature, 2022.</p>
        <p>ence series 2005 (ISSN 1747-9398), volume 1, 2005, pp. 1–12.</p>
        <p>[38] R. Delmonte, et al., Computational Linguistic Text Processing: Lexicon, Grammar, Parsing and Anaphora Resolution, Nova Science Publishers, 2008.</p>
        <p>[39] E. Pianta, C. Girardi, R. Zanoli, The TextPro tool suite, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008, pp. 2603–2607.</p>
        <p>[40] F. Tamburini, M. Melandri, AnIta: a powerful morphological analyser for Italian, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 2012, pp. 941–947.</p>
        <p>[41] L. Talamo, C. Celata, P. M. Bertinetto, DerIvaTario: An annotated lexicon of Italian derivatives, Word Structure 9 (2016) 72–102.</p>
        <p>[42] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, S. Bott, S. Calderon Ramirez, R. Cardon, T. François, A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M. Imperial, A. Nohejl, K. North, L. Occhipinti, N. Peréz Rojas, N. Raihan, T. Ranasinghe, M. Solis Salazar, M. Zampieri, H. Saggion, An extensible massively multilingual lexical simplification pipeline dataset using the MultiLS framework, in: R. Wilkens, R. Cardon, A. Todirascu, N. Gala (Eds.), Proceedings of the 3rd Workshop on Tools and Resources for People with Reading Difficulties (READI) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 38–46. URL: https://aclanthology.org/2024.readi-1.4.</p>
        <p>[33] A. Sorokin, Convolutional neural networks for low-resource morpheme segmentation: baseline
[43] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, S. Bott, S. Calderon Ramirez, R. Cardon, T. François,
or state-of-the-art?, in: Proceedings of the 16th A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M.
Workshop on Computational Research in Phonet- Imperial, A. Nohejl, K. North, L. Occhipinti, N. P.
ics, Phonology, and Morphology, 2019, pp. 154–159. Rojas, N. Raihan, T. Ranasinghe, M. S. Salazar,
URL: https://aclanthology.org/W19-4218. doi:10. S. Štajner, M. Zampieri, H. Saggion, The BEA
18653/v1/W19-4218. 2024 shared task on the multilingual lexical
sim[34] L. Wang, Z. Cao, Y. Xia, G. De Melo, Morphological plification pipeline, in: E. Kochmar, M. Bexte,
segmentation with window lstm neural networks, J. Burstein, A. Horbach, R. Laarmann-Quante,
in: Proceedings of the AAAI Conference on Artifi- A. Tack, V. Yaneva, Z. Yuan (Eds.), Proceedings
cial Intelligence, 2016, pp. 2842–2848. of the 19th Workshop on Innovative Use of NLP
[35] R. Cotterell, T. Mueller, A. Fraser, H. Schütze, for Building Educational Applications (BEA 2024),
Labeled morphological segmentation with semi- Association for Computational Linguistics,
Mexmarkov models, in: Proceedings of the Nineteenth ico City, Mexico, 2024, pp. 571–589. URL: https:
Conference on Computational Natural Language //aclanthology.org/2024.bea-1.51.</p>
        <p>
          Learning, 2015, pp. 164–174. [44] M. Aronof, A decade of morphology and word
[36] M. Battista, V. Pirrelli, Una piattaforma di morfolo- formation, Annual review of anthropology (1983)
gia computazionale per l’analisi e la generazione 355–375.
delle parole italiane, Technical Report, ILC-CNR, [45] A. Sorokin, Improving morpheme segmentation
us1999. ing bert embeddings, in: International Conference
[
          <xref ref-type="bibr" rid="ref39">37</xref>
          ] E. Zanchetta, M. Baroni, Morph-it! a free corpus- on Analysis of Images, Social Networks and Texts,
based morphological resource for the italian lan- Springer, 2021, pp. 148–161.
guage, in: Proceedings of corpus linguistics confer- [46] K. North, M. Zampieri, M. Shardlow, Lexical
com
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. T. Devlin, H. L. Jamison, P. M. Matthews, L. M. Gonnerman, Morphology and the internal structure of words, Proceedings of the National Academy of Sciences 101 (2004) 14984–14988.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A. Grönroos, M. Kurimo, S. Virpioja, A comparative study of minimally supervised morphological segmentation, Computational Linguistics 42 (2016) 91–120.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, arXiv preprint arXiv:2112.10508 (2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. Bostrom, G. Durrett, Byte pair encoding is suboptimal for language model pretraining, in: Findings of EMNLP 2020, 2020, pp. 4617–4624.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast wordpiece tokenization, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 2089–2103.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, M. Hulden, The sigmorphon 2016 shared task–morphological reinflection, in: Proceedings of the 14th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology, 2016, pp. 10–22.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL-International Journal of Applied Linguistics 165 (2014) 97–135.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] W. U. Dressler, Ricchezza e complessità morfologica (1999) 1000–1011.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Scalise, Morfologia, il Mulino, 1994.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. A. Goldsmith, Segmentation and morphology, in: The handbook of computational linguistics and natural language processing, Wiley Online Library, 2010, pp. 364–393.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. G. Wolf, The discovery of segments in natural language, British Journal of Psychology 68 (1977) 97–106.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] C. G. Nevill-Manning, I. H. Witten, Identifying hierarchical structure in sequences: A linear-time algorithm, Journal of Artificial Intelligence Research 7 (1997) 67–82.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Johnson, Unsupervised word segmentation for sesotho using adaptor grammars, in: Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology, 2008, pp. 20–27.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Z. S. Harris, From phoneme to morpheme, Language 31 (1955) 190–222. URL: http://www.jstor.org/stable/411036.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] P. Cohen, B. Heeringa, N. M. Adams, An unsupervised algorithm for segmenting categorical time-series into episodes, in: Proceedings of Pattern Detection and Discovery: ESF Exploratory Workshop London, 2002, pp. 49–62.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Sorokin, A. Kravtsova, Deep convolutional networks for supervised morpheme segmentation of russian language, in: Proceedings of 7th International Conference in Artificial Intelligence and Natural Language (AINL 2018), 2018, pp. 3–10.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Creutz, K. Lagus, Unsupervised models for morpheme segmentation and morphology learning, ACM Transactions on Speech and Language Processing (TSLP) 4 (2007) 1–34.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Poon, C. Cherry, K. Toutanova, Unsupervised morphological segmentation with log-linear models, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 209–217.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] K. Sirts, S. Goldwater, Minimally-supervised morphological segmentation using adaptor grammars, Transactions of the Association for Computational Linguistics 1 (2013) 255–266.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Z. S. Harris, Morpheme Boundaries within Words, 1970, pp. 68–77.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Supervised morphological segmentation in a low-resource learning setting using conditional random fields, in: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 29–37.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Goldsmith, Unsupervised learning of the morphology of a natural language, Computational Linguistics 27 (2001) 153–198.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Goldsmith, An algorithm for the unsupervised learning of morphology, Natural Language Engineering 12 (2006) 353–371.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[47] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language resources and evaluation 43 (2009) 209–226.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[48] A. Roventini, A. Alonge, N. Calzolari, B. Magnini, et al., Italwordnet: a large semantic database for italian, in: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), 2000, pp. 783–790.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[49] P. Colé, J. Segui, M. Taft, Words and morphemes as units for lexical access, Journal of Memory and Language 37 (1997) 312–330.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[50] L. Occhipinti, Complex word identification for italian, in: Proceedings of CLIB 2024, Sixth International Conference on Computational Linguistics in Bulgaria, 2024.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>