Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation

Laura Occhipinti
University of Bologna, Italy

Abstract
Morphological analysis is essential for various Natural Language Processing (NLP) tasks, as it reveals the internal structure of words and deepens our understanding of their morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the limited representation of detailed morphological information in existing corpora. Using an automatic segmentation tool, we extract quantitative morphological parameters to investigate their impact on the perception of word complexity by native Italian speakers. Through correlation analysis, we demonstrate that morphological features, such as the number of morphemes and lexical morpheme frequency, significantly influence how complex words are perceived. These insights contribute to improving automatic lexical complexity prediction models and offer a deeper understanding of the role of morphology in word comprehension.

Keywords
Morphological segmentation, Lexical complexity prediction, Italian language

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
laura.occhipinti3@unibo.it (L. Occhipinti)
ORCID: 0009-0007-8799-4333 (L. Occhipinti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Morphological analysis is crucial for various NLP tasks, as it provides insights into the internal structures of words and helps us better understand the morphological and syntactic relationships between words [1]. The Italian language, with its rich morphology and extensive use of inflection and derivation, presents unique challenges and opportunities for morphological segmentation.

Automatic segmentation, a key component of morphology learning, involves dividing word forms into meaningful units such as roots, prefixes, and suffixes [2]. This task falls under the broader category of subword segmentation [3] but is distinct due to its linguistic motivation. Computational approaches typically identify subwords based on purely statistical considerations, which often results in subunits that do not correspond to recognizable linguistic units [4, 5, 6, 7]. Making this task more morphologically oriented could enable models to generalize better to new words or forms, as basic roots or morphemes are often shared among words, and it could also facilitate the interpretation of model results.

When discussing morphological segmentation, we can refer to two types: (1) surface segmentation, which involves dividing words into morphs, the surface forms of morphemes; (2) canonical segmentation, which involves dividing words into morphemes and reducing them to their standard forms [8]. For instance, consider the Italian word mangiavano (they were eating). The resulting surface segmentation would be mangi- + -avano, where mangi- is a morph derived from the root of the verb mangiare, and -avano is the suffix indicating the third person plural of the imperfect tense. In contrast, the canonical segmentation would yield mangiare + -avano, with mangiare as the canonical morpheme and -avano as the suffix¹.

¹ It is important to note that the segmentation process is not always straightforward, as it involves various linguistic criteria that may not be immediately clear. For example, one of the challenges lies in deciding whether to detach or retain the thematic vowel (a vowel that appears between the root and the inflectional suffix, especially in Romance languages). In the case of mangiavano, the thematic vowel -a- could either be considered part of the root or treated as a separate morph. Similarly, other segmentation criteria might involve distinctions between compound forms, derivational affixes, or fused morphemes that do not have clear boundaries. As a result, the segmentation criteria can vary based on linguistic theory, the specific task (e.g., computational vs. linguistic analysis), or even the intended application of the segmentation (e.g., for syntactic parsing or machine learning).

In this study, we focus on surface morphological segmentation for the Italian language. Morphological features are often not adequately represented in available corpora for this language, or they refer exclusively to morphosyntactic information, such as the grammatical category of words and a macro-level descriptive analysis mainly related to inflection. Information about the internal structure of words, such as derivation or composition, is often lacking.

The primary objective of this work is to use an automatic segmenter to extract a series of quantitative morphological parameters. We believe that our approach does not require the detailed analysis provided by canonical segmentation, which could entail longer processing times.

In addition to examining classic parameters reported in the literature that influence complexity [9], such as word frequency, length, and number of syllables, we aim to explore how morphological features integrate with these factors to affect word complexity perception. Specifically, we seek to understand how the internal structure of words contributes to the cognitive load that speakers experience when processing more complex lexical items. Our premise is that words with more morphemes are more complex because they contain more information to decode [10]. For example, consider the word infelicità (unhappiness). To decode it, one must know the word felice (happy), from which it is derived, as well as the prefix in-, which negates the quality expressed by the base term, and the suffix -ità, which transforms the adjective into an abstract noun. Therefore, to fully understand the meaning of infelicità, the reader or listener must be able to correctly recognize and interpret each of these morphemes and their contribution to the overall meaning of the word.

The main contributions of this work are: (1) providing a tool capable of automatically segmenting words into linguistically motivated base forms; (2) presenting the dataset constructed for training our model; (3) evaluating the impact of different linguistic features on speakers' perception of word complexity, with a particular focus on morphological features.

2. Related Works

The study of morphological segmentation has evolved from classical linguistics to advanced machine learning techniques [11, 12]. The main approaches include lexicon-based and boundary-detection-based methods [2]. Lexicon-based methods rely on a comprehensive database of known morphemes [13, 14, 15], while boundary-detection methods identify transition points between morphemes using statistical or machine learning techniques [16, 17, 18].

Another significant distinction is between generative models and discriminative models. Generative models, suited for unsupervised learning, generate word forms and segmentations from raw data [19, 20, 21]. In contrast, discriminative models, which require annotated data, predict segmentations based on learned relationships from labeled examples [22, 23].

Unsupervised methods do not require labeled data, making them attractive for leveraging vast amounts of raw data. They trace back to Harris (1955), who used statistical methods to identify morphological segments. Notable systems include Linguistica [24, 25] and Morfessor [26, 27], which employ the Minimum Description Length (MDL) principle to identify regularities within data. Despite their utility, unsupervised methods often suffer from oversegmentation and incorrect segmentation of affixes [19, 28]. These challenges arise due to the complex interplay of phonological, morphological, and semantic factors in natural languages.

Semi-supervised methods leverage both annotated and unannotated data, enhancing model performance with minimal manual annotation [29]. These methods are effective in scenarios with limited labeled data [30, 31], using initial labeled datasets to hypothesize and validate patterns across larger unlabeled corpora [32]. While beneficial, semi-supervised methods depend on the quality of initial labeled datasets and may struggle with languages exhibiting extensive morphological diversity [2].

Supervised methods, relying on annotated datasets, typically achieve higher accuracy due to learning from explicitly labeled examples. Techniques include neural networks, Hidden Markov Models (HMM), and Convolutional Neural Networks (CNNs) [33, 34, 35, 23]. Despite their high performance, supervised methods are limited by the need for extensive annotated corpora, which can be costly and time-consuming to create.

Given access to a large annotated dataset for the Italian language, on which we made semi-manual corrections, our study primarily adopts a supervised approach.

2.1. Resources available for the Italian language

Several computational resources and tools have been developed to manage Italian morphological information [36, 37, 38, 39, 40, 41]. These resources are essential for improving the accuracy of text processing and supporting advanced linguistic research. However, many of them focus primarily on morphological analysis, without providing detailed support for morphological segmentation, which limits their usefulness in tasks that require fine-grained word structure analysis. Even those tools that offer segmentation often approach it with different methods and objectives than ours.

Morph-it! [37] is an open-source lexicon that contains 504,906 entries and 34,968 unique lemmas, each annotated with morphological characteristics that link inflected word forms to their lemmas. While valuable for lemmatization and morphological analysis, it is not suited for morphological segmentation, as it primarily focuses on inflected forms rather than decomposing words into their individual morphemes.

MorphoPro [39] is part of the TextPro suite and is designed for morphological analysis of both English and Italian. It uses a declarative knowledge base converted into a Finite State Automaton (FSA) for detailed morphological analysis. However, MorphoPro's output is geared towards global morphological analysis and lacks support for internal word segmentation into morphemes, limiting its applicability for more granular tasks.

MAGIC [36] provides a lexicon of approximately 100,000 lemmas and performs detailed morphological and morphosyntactic analysis. However, similar to other resources, MAGIC does not focus on morphological segmentation. Instead, it provides morphological and syntactic information about word forms, making it more useful for general morphological analysis rather than segmenting words into individual morphemes.

Getarun [38] offers a lexicon of around 80,000 roots and provides sophisticated morphosyntactic analysis. However, like MAGIC, it is designed primarily for syntactic parsing and lacks functionality for detailed morphological segmentation, focusing instead on morphological and syntactic relationships.

DerIvaTario [41] is another resource that provides significant support for morphological segmentation, particularly in the context of derivational morphology. It offers detailed information on derivational patterns in Italian, mapping out how words are formed through derivational processes, which is especially useful for studying word formation in a structured manner. However, DerIvaTario focuses primarily on canonical segmentations and does not always recognize smaller morphemes, such as final morphemes. This limitation means it may miss finer-grained morphological elements, making it more suitable for analyzing larger, derivational units rather than capturing all inflectional components.

AnIta is an advanced morphological analyzer for Italian, implemented within the FSA framework [40]. It supports a comprehensive lexicon with over 120,000 lemmas and handles inflectional, derivational, and compositional phenomena. AnIta's segmentation occurs on two levels: superficial segmentation of word forms and derivation graphs. Although derivation graphs are incomplete, the tool's focus on superficial segmentation aligns with our research needs. For the segmentation of lemmas related to derivational phenomena, AnIta adopts two main rules: (1) affixes are kept unchanged; (2) lexicon entries are segmented only if their base is a recognizable independent Italian word.

3. Methods

In this study, we trained three models, originally developed for other languages, using an Italian dataset that was manually created and verified with morphological segmentations. After evaluating the performance of the models, we selected the most effective one and used it to extract morphological parameters from the words in the MultiLS-IT dataset, a resource designed for lexical simplification in the Italian language [42, 43]. The dataset comprises 600 contextualized words, annotated for complexity and accompanied by substitutes perceived as simpler than the target word. Each word was evaluated by a group of native speakers with a perceived complexity score ranging from 1 to 5. In the dataset, the aggregated and normalized complexity value is between 0 and 1, where 0 indicates very simple words and 1 indicates very complex words².

² The resource is available at https://github.com/MLSP2024/MLSP_Data.

The morphological traits extracted by the selected model were then integrated with other linguistic features typically considered influential in the perception of word complexity [9]. These combined features were analyzed in a correlation study with the perceived complexity values of MultiLS-IT to assess their impact on predicting linguistic complexity. By examining the relationships between these variables, we aim to determine whether morphological measures can be effectively used in systems designed to automatically identify word complexity.

3.1. Dataset

The primary reference for this work is the AnIta dataset, which includes data annotated with morphological segmentations based on specific rules. One rule excludes bases derived from Latin, Greek, and other languages. Since Italian, especially in technical and specialized fields, contains many such words, we modified the dataset to include these forms to ensure accurate representation.

The initial dataset consisted of numerous entries automatically generated by AnIta, often including overgenerated word-forms (possible words [44]), especially in evaluative morphology. This resulted in a comprehensive dataset with approximately two million entries. To adapt the AnIta dataset for our research needs, we undertook several steps.

1) Due to the extensive size, we reduced the sample, retaining one-third of entries for each letter, resulting in approximately 728,814 word-forms (35% of the original dataset). This sample maintains a fair representation of all linguistic categories³. 2) We systematically identified and addressed prefixes and suffixes, prioritizing longer affixes to preserve more informative morphological structures. This semi-automatic approach facilitated manual verification while enhancing segmentation quality. 3) We manually reviewed the segmented words, ensuring accuracy and consistency, preserving prefixes in their original forms as per AnIta's rule number one. 4) The final dataset was divided into training (80%) and test (20%) sets, comprising 583,051 and 145,763 words respectively. This split allowed effective training and validation of our models without needing a separate validation set, as no parameter tuning was performed. This streamlined methodology ensured a robust dataset for implementing and evaluating our automatic segmentation system.

³ Initially, we aimed to manually review the entire dataset to address any inconsistencies and overlooked segments. However, due to time constraints, we opted to reduce the dataset by randomly selecting 30% of the entries for each letter.

3.2. Segmentation Models

Given the extensive dataset at our disposal, we selected models within the domain of supervised or semi-supervised learning. The models considered include:

Morfessor FlatCat [31]: a semi-supervised model that utilizes an HMM approach for morphological segmentation. It is efficient in handling languages with complex morphological structures. The model's flat lexicon and the use of semi-supervised learning make it particularly suited for scenarios where annotated data is scarce.

Neural Morpheme Segmentation [33]: a supervised model based on CNNs, designed to segment morphemes by treating the task as a sequential labeling problem using the BMES scheme (Begin, Middle, End, Single). This model is noted for its ability to capture local dependencies within textual data. Its architecture includes multiple convolutional and pooling layers, enhancing its capability to identify and segment complex morphological patterns.

MorphemeBERT [45]: an advanced model that integrates BERT's character embeddings with CNNs to enhance morphological segmentation. BERT provides deep, context-rich linguistic representations, which can significantly improve the model's accuracy in identifying morphemic boundaries.

3.3. Evaluation

After constructing the dataset and selecting the previously described models, we proceeded with the training. Table 1 presents a comparative evaluation of the three models using precision, recall, F1 score, and accuracy. These metrics are standard for assessing the performance of boundary detection models, providing a comprehensive overview of each model's effectiveness in identifying and segmenting morphemes accurately.

Automatic segmentation system    Precision   Recall   F1       Accuracy
Neural Morpheme Segmentation     0.9879      0.9806   0.9892   0.9793
MorphemeBERT                     0.9868      0.9199   0.9522   0.9581
Morfessor FlatCat                0.7974      0.3676   0.5033   0.7399

Table 1: Results of the models on morphological segmentation.

Neural Morpheme Segmentation demonstrates the highest performance among the three systems across almost all metrics, particularly excelling in precision and F1 score. The high precision (0.9879) indicates that the model is very accurate in identifying correct morpheme boundaries, minimizing false positives. In other words, when the model segments a word, it reliably places the boundaries at the correct points. Its F1 score (0.9892), which balances precision and recall, underscores the model's ability not only to accurately segment morphemes but also to capture the majority of them with minimal oversight. The high recall (0.9806) confirms that the model rarely misses morphemes, making it particularly well-suited for handling complex or less frequent morphological patterns. This balance between high precision and recall showcases the robustness of the CNN-based architecture, which can effectively model both local dependencies between segments and the global morphological structure of words⁴.

⁴ This model is available upon request. Please contact the author directly to access the model and relevant references.

MorphemeBERT demonstrates a high level of precision, indicating that when it identifies a morpheme, it is likely correct. However, its recall is noticeably lower than that of Neural Morpheme Segmentation, which suggests that while it makes fewer errors, it also fails to detect a significant number of morphemes. This trade-off between precision and recall points to a more conservative approach in morpheme segmentation, where the model prioritizes accuracy over coverage. The F1 score of 0.9522, though still strong, highlights this imbalance between precision and recall, meaning the model performs well but lacks the comprehensive identification that would elevate its overall performance. The accuracy of 0.9581 reflects that the model is quite reliable in general, but its inability to capture as many correct morphemes as Neural Morpheme Segmentation affects its overall segmentation capability. This limitation might be due to how MorphemeBERT integrates BERT embeddings, which are optimized for context-rich predictions but may struggle with identifying morphemic boundaries in less straightforward or ambiguous cases, leading to more missed segments.

Morfessor FlatCat shows a considerably weaker performance compared to the other two models. While its precision score of 0.7974 is decent, meaning that the morphemes it identifies are mostly accurate, its recall is notably low. This indicates that the model misses a substantial number of morphemes, failing to capture the full complexity of word segmentation. The low recall suggests that Morfessor FlatCat struggles to identify many valid morphemic boundaries, which results in incomplete or inaccurate segmentations. Consequently, its F1 score (0.5033) and accuracy (0.7399) are significantly lower, suggesting that this system is less reliable for applications requiring high fidelity in morpheme segmentation.

4. Selection of Linguistic Features

Based on a thorough review of the literature on lexical complexity prediction [9, 46], we selected several linguistic features to analyze their impact on complexity. In addition to common surface characteristics, such as the number of letters, syllables, and vowels in words, commonly used in complexity studies and readability calculations, we identified other relevant parameters. One key factor is the frequency of a word, as more frequent words tend to be perceived as more familiar and thus less complex. We calculated it using the ItWac corpus [47]. Another important parameter is the number of senses a word has, measured using the lexical resource ItalWordNet [48]. Lastly, the presence of stop words, identified with a spaCy model, can influence the perceived complexity of a sentence or text, as these are common words that often carry little inherent meaning. Given the focus of this study on morphological features' impact on lexical complexity, we concentrated on several key aspects related to the internal structure of words. These features could show how morphological traits contribute to word intricacy:

Number of morphemes: Morphemes are the smallest units of meaning in words, including affixes (prefixes and suffixes) and roots. The number of morphemes gives an indication of the information load of a word. Lexical items with more morphemes typically require more decoding effort from readers. We used our Convolutional Neural Model for automatic morphological segmentation and morpheme counting.

Morphological density: This quantitative metric is defined as the ratio of the number of morphemes to word length, offering a measure of how densely packed meaningful units are within a word. Higher morphological density can indicate more cognitive load, as each unit contributes distinct information, potentially raising the complexity of the word.

Frequency of the lexical morpheme: Lexical morphemes carry the core meaning of the word. Employing our morphological segmenter on the ItWac corpus [47] enabled us to dissect the word into segments and aggregate the frequencies of individual morphemes. This frequency, transformed using a logarithmic scale, helps predict complexity by leveraging the familiarity of frequently occurring morphemes. The use of lexical morpheme frequency as a complexity indicator is based on the idea that even if a word is unfamiliar as a whole, its component morphemes may be common in the language and more recognizable [49].

By integrating these morphological features with other linguistic traits typically considered influential in speakers' perception of complexity, we aim to assess their impact on predicting linguistic complexity⁵.

⁵ For a detailed analysis of how these parameters were processed, refer to Occhipinti 2024.

5. Analysis and discussion

Through studying the correlations between these variables, we seek to determine whether morphological measures can be effectively used to develop systems capable of automatically identifying word complexity. To achieve this, we conducted a correlation and significance analysis between the features discussed earlier and the perceived complexity values for the 600 words included in MultiLS-IT.

Feature                       Correlation   p-value
Length                        0.082         0.045*
Number of vowels              0.097         0.018*
Number of syllables           0.091         0.026*
Number of morphemes           0.112         0.006*
Senses_ID                     -0.277        0.000*
Stopword                      -0.124        0.003*
Lemma frequency               -0.467        0.000*
Morphological density         0.036         0.381
Lexical morpheme frequency    -0.333        0.000*

Table 2: Spearman correlation coefficients and p-values for features and complexity. Note: * indicates statistical significance.

[Figure 1: Correlation of complexity values.]

Table 2 presents the Spearman correlation coefficients and their statistical significance for the features calculated⁶. The correlation analysis reveals several important insights.

⁶ Spearman's rank correlation was chosen because it does not assume a linear relationship between variables, making it more suitable for our dataset, where the relationships between features like word length, number of morphemes, and word complexity may not follow a strictly linear pattern. Spearman's correlation measures whether an increase in one variable tends to be consistently associated with an increase (or decrease) in another, which is more appropriate given the nature of our linguistic features.

Word length, number of vowels, and number of syllables all have small but statistically significant positive correlations with complexity. This suggests that, as expected, longer words with more vowels and syllables tend to be perceived as more complex. These factors are typical in readability studies, where more phonologically complex words are generally harder to process.

The number of morphemes also shows a positive correlation with complexity, reinforcing the idea that words with more morphemes are perceived as more complex. This feature is statistically significant as well.

Negative correlations for senses_ID, stopword presence, and lemma frequency suggest that words with more senses, those that are stopwords, or those that are more frequently used are perceived as less complex. These features are also statistically significant. It is noteworthy that the number of senses (senses_ID) is inversely proportional to complexity. This could be attributed to the incompleteness of ItalWordNet, potentially leading to unreliable predicted values.

Morphological density, however, does not show a statistically significant correlation with complexity, suggesting that the ratio of morphemes to word length may not be a strong predictor of perceived complexity.

The lexical morpheme frequency shows a significant negative correlation with complexity, indicating that more frequently occurring morphemes contribute to lower perceived complexity. This supports the notion that familiar morphemes, even within otherwise complex words, aid in comprehension.

The statistically significant correlations for most features validate their relevance in complexity prediction. However, it is important to note that our findings are based on a relatively small dataset of annotated complexity perceptions. To obtain more robust and generalizable results, it would be highly beneficial to have access to a larger and more diverse dataset of complexity annotations. Expanding the dataset to include a wider variety of texts and contexts would enhance the reliability of the correlations observed and improve the training and evaluation of automatic complexity prediction models. Future research should focus on gathering more extensive annotated datasets and exploring additional linguistic features that may influence complexity perception. By doing so, we can further refine our models and develop more effective tools for lexical simplification and other applications aimed at improving text accessibility.

6. Conclusion

This study highlights the significance of integrating morphological features into automatic models to enhance the comprehension and prediction of lexical complexity. The high performance of the Neural Morpheme Segmentation model demonstrates the efficacy of convolutional neural networks in capturing the detailed patterns of morphological segmentation in the Italian language.

The correlation analysis reveals that while traditional metrics like word length and frequency are valuable predictors of complexity, incorporating morphological features provides additional insights that enrich our understanding of lexical complexity. Notably, the positive correlation between the number of morphemes and perceived complexity suggests that words with more morphemes are inherently more complex. Conversely, frequent lexical morphemes tend to reduce perceived complexity, highlighting the importance of familiarity in complexity perception. Our study also emphasizes the need for diverse linguistic features, including both surface characteristics and morphological traits, to create more robust and accurate models for predicting word complexity.

These findings underscore the importance of considering a range of linguistic features, including morphological traits, when assessing lexical complexity. By integrating these features into computational models, we can enhance their ability to accurately predict word complexity and, subsequently, improve lexical simplification.

References

[1] J. T. Devlin, H. L. Jamison, P. M. Matthews, L. M. Gonnerman, Morphology and the internal structure of words, Proceedings of the National Academy of Sciences 101 (2004) 14984–14988.
[2] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A. Grönroos, M. Kurimo, S. Virpioja, A comparative study of minimally supervised morphological segmentation, Computational Linguistics 42 (2016) 91–120.
[3] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP, arXiv preprint arXiv:2112.10508 (2021).
[4] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[6] K. Bostrom, G. Durrett, Byte pair encoding is suboptimal for language model pretraining, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4617–4624.
[7] X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast WordPiece tokenization, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 2089–2103.
[8] R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, M. Hulden, The SIGMORPHON 2016 shared task: morphological reinflection, in: Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2016, pp. 10–22.
[9] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL-International Journal of Applied Linguistics 165 (2014) 97–135.
[10] W. U. Dressler, Ricchezza e complessità morfologica, 1999, pp. 1000–1011.
[11] S. Scalise, Morfologia, il Mulino, 1994.
[12] J. A. Goldsmith, Segmentation and morphology, in: The Handbook of Computational Linguistics and Natural Language Processing, Wiley Online Library, 2010, pp. 364–393.
[13] J. G. Wolff, The discovery of segments in natural language, British Journal of Psychology 68 (1977) 97–106.
[14] C. G. Nevill-Manning, I. H. Witten, Identifying hierarchical structure in sequences: A linear-time algorithm, Journal of Artificial Intelligence Research 7 (1997) 67–82.
[15] M. Johnson, Unsupervised word segmentation for Sesotho using adaptor grammars, in: Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, 2008, pp. 20–27.
[16] Z. S. Harris, From phoneme to morpheme, Language 31 (1955) 190–222. URL: http://www.jstor.org/stable/411036.
[17] P. Cohen, B. Heeringa, N. M. Adams, An unsupervised algorithm for segmenting categorical time-series into episodes, in: Proceedings of Pattern Detection and Discovery: ESF Exploratory Workshop London, 2002, pp. 49–62.
[18] A. Sorokin, A. Kravtsova, Deep convolutional networks for supervised morpheme segmentation of Russian language, in: Proceedings of the 7th International Conference on Artificial Intelligence and Natural Language (AINL 2018), 2018, pp. 3–10.
[19] M. Creutz, K. Lagus, Unsupervised models for morpheme segmentation and morphology learning, ACM Transactions on Speech and Language Processing (TSLP) 4 (2007) 1–34.
[20] H. Poon, C. Cherry, K. Toutanova, Unsupervised morphological segmentation with log-linear models, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 209–217.
[21] K. Sirts, S. Goldwater, Minimally-supervised morphological segmentation using adaptor grammars, Transactions of the Association for Computational Linguistics 1 (2013) 255–266.
[22] Z. S. Harris, Morpheme boundaries within words: Report on a computer test, Springer Netherlands, 1970, pp. 68–77.
[23] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Supervised morphological segmentation in a low-resource learning setting using conditional random fields, in: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 29–37.
[24] J. Goldsmith, Unsupervised learning of the morphology of a natural language, Computational Linguistics 27 (2001) 153–198.
[25] J. Goldsmith, An algorithm for the unsupervised learning of morphology, Natural Language Engineering 12 (2006) 353–371.
[26] M. Creutz, K. Lagus, Unsupervised discovery of morphemes, in: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, 2002, pp. 21–30.
[27] M. J. P. Creutz, K. H. Lagus, Morfessor in the Morpho Challenge, in: Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, 2006, pp. 12–17.
[28] Ö. Kılıç, C. Bozsahin, Semi-supervised morpheme segmentation without morphological analysis, in: Proceedings of the Workshop on Language Resources and Technologies for Turkic Languages, LREC, 2012, pp. 52–56.
[29] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Painless semi-supervised morphological segmentation using conditional random fields, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.
[37] E. Zanchetta, M. Baroni, Morph-it! A free corpus-based morphological resource for the Italian language, in: Proceedings from the Corpus Linguistics Conference Series 2005 (ISSN 1747-9398), volume 1, 2005, pp. 1–12.
[38] R. Delmonte, et al., Computational Linguistic Text Processing – Lexicon, Grammar, Parsing and Anaphora Resolution, Nova Science Publishers, 2008.
[39] E. Pianta, C. Girardi, R. Zanoli, The TextPro tool suite, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008, pp. 2603–2607.
[40] F. Tamburini, M. Melandri, AnIta: a powerful morphological analyser for Italian, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 2012, pp. 941–947.
[41] L. Talamo, C. Celata, P. M. Bertinetto, DerIvaTario: an annotated lexicon of Italian derivatives, Word Structure 9 (2016) 72–102.
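The surface/canonical contrast drawn for mangiavano can be written out as data. The following is a minimal illustration in Python of the two representations, not the paper's tooling:

```python
# Illustration of the two segmentation styles for the Italian verb form
# "mangiavano" (they were eating). Mirrors the example in the text; it is
# not the segmentation tool used in the study.
word = "mangiavano"

# Surface segmentation: morphs as they appear in the word, so joining
# them restores the original string.
surface = ["mangi", "avano"]
assert "".join(surface) == word

# Canonical segmentation: morphs are restored to their standard forms
# (the citation form "mangiare"), so joining them need not restore
# the original word.
canonical = ["mangiare", "avano"]
assert "".join(canonical) != word
```

The defining property of surface segmentation, concatenation back to the word form, is exactly what makes it cheaper to produce and verify than canonical segmentation.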
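The decoding of infelicità sketched above can be made concrete. The glosses are the article's; the list-of-pairs representation is our own illustration:

```python
# Hypothetical illustration of the decoding steps for "infelicità"
# (unhappiness): each morph paired with the meaning it contributes.
morphs = [
    ("in-", "prefix negating the quality of the base"),
    ("felic", "root of 'felice' (happy)"),
    ("-ità", "suffix turning the adjective into an abstract noun"),
]

# Joining the morphs (minus the affix hyphens) restores the word,
# which is what a reader must implicitly do to decode it.
word = "".join(m.strip("-") for m, _ in morphs)
assert word == "infelicità"
```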
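The 80/20 division of the word-forms described in step 4 can be sketched as follows. The entry count and the resulting set sizes match the numbers reported here; the shuffling procedure and the seed are assumptions of ours:

```python
import random

# Sketch of an 80/20 train/test split over the reduced AnIta sample.
# 728,814 entries yield 583,051 training and 145,763 test items, as in
# the paper; reproducible shuffling via a fixed seed is our assumption.
def split_dataset(entries, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and test portions."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    cut = int(len(entries) * train_ratio)
    return entries[:cut], entries[cut:]

train, test = split_dataset(range(728_814))
assert len(train) == 583_051 and len(test) == 145_763
```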
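The BMES labeling scheme used by the CNN-based segmenter can be illustrated with a toy tagging function. This is a sketch of the scheme itself, not of the model:

```python
# BMES sequence labeling: each character of a word is tagged as the
# Begin, Middle, or End of a multi-character morph, or as a
# Single-character morph.
def to_bmes(morphs):
    tags = []
    for morph in morphs:
        if len(morph) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(morph) - 2) + ["E"])
    return tags

# The surface segmentation mangi- + -avano yields one tag per character:
tags = to_bmes(["mangi", "avano"])
assert tags == list("BMMMEBMMME")
```

Framed this way, segmentation reduces to per-character sequence classification, which is what lets a convolutional architecture exploit local character context.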
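The precision/recall trade-off discussed here is measured over morph boundaries. A minimal sketch of the standard definitions, assuming boundaries are represented as character offsets:

```python
# Boundary-based evaluation: precision, recall, and F1 over predicted
# morph-boundary positions. The metric definitions are standard; the
# set-of-offsets encoding is our simplification.
def boundary_prf(gold, pred):
    tp = len(gold & pred)                       # correctly placed boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Gold "mangi|avano" has one boundary, at offset 5. A system predicting
# boundaries at 5 and 7 gets perfect recall but imperfect precision,
# the conservative/over-eager trade-off described in the text.
p, r, f1 = boundary_prf({5}, {5, 7})
assert (p, r) == (0.5, 1.0)
```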
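The morphological features just described can be sketched as computations over a surface segmentation. The definitions follow the text; the toy frequency table and the choice of the first morph as the lexical morpheme are illustrative assumptions:

```python
import math

# Sketch of three morphological features from a surface segmentation.
# morph_frequencies is a hypothetical corpus frequency lookup.
def morphological_features(morphs, morph_frequencies):
    word = "".join(morphs)
    n_morphemes = len(morphs)
    # Morphological density: morphemes per character of the word.
    density = n_morphemes / len(word)
    # Log-transformed frequency of the lexical (root) morpheme; taking
    # the first morph as the lexical one is a simplification.
    log_lex_freq = math.log(morph_frequencies.get(morphs[0], 1))
    return {"n_morphemes": n_morphemes,
            "morphological_density": density,
            "log_lexical_morpheme_frequency": log_lex_freq}

features = morphological_features(["mangi", "avano"], {"mangi": 12_000})
assert features["n_morphemes"] == 2
assert features["morphological_density"] == 2 / 10
```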
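Spearman's rank correlation, as motivated in the footnote, is simply the Pearson correlation computed on ranks. A pure-Python sketch, assuming no tied values and using invented toy numbers:

```python
# Spearman's rho as the Pearson correlation of ranks. Assumes no ties,
# which keeps the ranking step simple; real analyses (e.g. SciPy's
# spearmanr) average the ranks of tied values.
def spearman_rho(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            result[i] = float(rank)
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# e.g. morpheme counts vs. perceived complexity for five invented words:
rho = spearman_rho([1, 3, 2, 4, 5], [0.10, 0.35, 0.20, 0.80, 0.55])
assert abs(rho - 0.9) < 1e-9
```

Because only the ranks enter the computation, any monotonic (not necessarily linear) association between a feature and perceived complexity is captured, which is exactly the property the footnote appeals to.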
guistics, volume 2: Short Papers, 2014, pp. 84–89. [42] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, [30] J. Lafferty, A. McCallum, F. Pereira, et al., Condi- S. Bott, S. Calderon Ramirez, R. Cardon, T. François, tional random fields: Probabilistic models for seg- A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, menting and labeling sequence data, in: Interna- J. M. Imperial, A. Nohejl, K. North, L. Occhip- tional Conference on Machine Learning, 2001, pp. inti, N. Peréz Rojas, N. Raihan, T. Ranasinghe, 282—-289. M. Solis Salazar, M. Zampieri, H. Saggion, An [31] S.-A. Grönroos, S. Virpioja, P. Smit, M. Kurimo, Mor- extensible massively multilingual lexical simplifi- fessor flatcat: An hmm-based method for unsuper- cation pipeline dataset using the MultiLS frame- vised and semi-supervised learning of morphology, work, in: R. Wilkens, R. Cardon, A. Todirascu, in: Proceedings of COLING 2014, the 25th Inter- N. Gala (Eds.), Proceedings of the 3rd Workshop national Conference on Computational Linguistics, on Tools and Resources for People with REAd- 2014, pp. 1177–1185. ing DIfficulties (READI) @ LREC-COLING 2024, [32] X. Zhu, A. B. Goldberg, Introduction to semi- ELRA and ICCL, Torino, Italia, 2024, pp. 38–46. URL: supervised learning, Springer Nature, 2022. https://aclanthology.org/2024.readi-1.4. [33] A. Sorokin, Convolutional neural networks for [43] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, low-resource morpheme segmentation: baseline S. Bott, S. Calderon Ramirez, R. Cardon, T. François, or state-of-the-art?, in: Proceedings of the 16th A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M. Workshop on Computational Research in Phonet- Imperial, A. Nohejl, K. North, L. Occhipinti, N. P. ics, Phonology, and Morphology, 2019, pp. 154–159. Rojas, N. Raihan, T. Ranasinghe, M. S. Salazar, URL: https://aclanthology.org/W19-4218. doi:10. S. Štajner, M. Zampieri, H. Saggion, The BEA 18653/v1/W19-4218. 2024 shared task on the multilingual lexical sim- [34] L. 
Wang, Z. Cao, Y. Xia, G. De Melo, Morphological plification pipeline, in: E. Kochmar, M. Bexte, segmentation with window lstm neural networks, J. Burstein, A. Horbach, R. Laarmann-Quante, in: Proceedings of the AAAI Conference on Artifi- A. Tack, V. Yaneva, Z. Yuan (Eds.), Proceedings cial Intelligence, 2016, pp. 2842–2848. of the 19th Workshop on Innovative Use of NLP [35] R. Cotterell, T. Mueller, A. Fraser, H. Schütze, for Building Educational Applications (BEA 2024), Labeled morphological segmentation with semi- Association for Computational Linguistics, Mex- markov models, in: Proceedings of the Nineteenth ico City, Mexico, 2024, pp. 571–589. URL: https: Conference on Computational Natural Language //aclanthology.org/2024.bea-1.51. Learning, 2015, pp. 164–174. [44] M. Aronoff, A decade of morphology and word [36] M. Battista, V. Pirrelli, Una piattaforma di morfolo- formation, Annual review of anthropology (1983) gia computazionale per l’analisi e la generazione 355–375. delle parole italiane, Technical Report, ILC-CNR, [45] A. Sorokin, Improving morpheme segmentation us- 1999. ing bert embeddings, in: International Conference [37] E. Zanchetta, M. Baroni, Morph-it! a free corpus- on Analysis of Images, Social Networks and Texts, based morphological resource for the italian lan- Springer, 2021, pp. 148–161. guage, in: Proceedings of corpus linguistics confer- [46] K. North, M. Zampieri, M. Shardlow, Lexical com- plexity prediction: An overview, ACM Computing Surveys 55 (2023) 1–42. [47] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The wacky wide web: a collection of very large linguistically processed web-crawled corpora, Lan- guage resources and evaluation 43 (2009) 209–226. [48] A. Roventini, A. Alonge, N. Calzolari, B. Magnini, F. Bertagna, Italwordnet: a large semantic database for italian., in: In Proceedings of the Second Inter- national Conference on Language Resources and Evaluation (LREC-2000), 2000, pp. 783–790. [49] P. Colé, J. Segui, M. 
Taft, Words and morphemes as units for lexical access, Journal of Memory and Language 37 (1997) 312–330. [50] L. Occhipinti, Complex word identification for ital- ian language: a dictionary-based approach, in: Pro- ceedings of Clib24, Sixth International Conference on Computational Linguistics in Bulgaria, 2024, pp. 119–129.