<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Occhipinti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Morphological analysis is essential for various Natural Language Processing (NLP) tasks, as it reveals the internal structure of words and deepens our understanding of their morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the limited representation of detailed morphological information in existing corpora. Using an automatic segmentation tool, we extract quantitative morphological parameters to investigate their impact on the perception of word complexity by native Italian speakers. Through correlation analysis, we demonstrate that morphological features, such as the number of morphemes and lexical morpheme frequency, significantly influence how complex words are perceived. These insights contribute to improving automatic lexical complexity prediction models and offer a deeper understanding of the role of morphology in word comprehension.</p>
      </abstract>
      <kwd-group>
<kwd>Morphological segmentation</kwd>
        <kwd>Lexical complexity prediction</kwd>
        <kwd>Italian language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Morphological analysis is crucial for various NLP tasks, as it provides insights into the internal structures of words and helps us better understand the morphological and syntactic relationships between words [<xref ref-type="bibr" rid="ref17">1</xref>].</p>
      <p>The Italian language, with its rich morphology and extensive use of inflection and derivation, presents unique challenges and opportunities for morphological segmentation.</p>
      <p>Automatic segmentation, a key component of morphology learning, involves dividing word forms into meaningful units such as roots, prefixes, and suffixes [2]. This task falls under the broader category of subword segmentation [3] but is distinct due to its linguistic motivation. Computational approaches typically identify subwords based on purely statistical considerations, which often results in subunits that do not correspond to recognizable linguistic units [4, 5, 6, 7]. Making this task more morphologically oriented could enable models to generalize better to new words or forms, as basic roots or morphemes are often shared among words, and it could also facilitate the interpretation of model results.</p>
      <p>When discussing morphological segmentation, we can refer to two types: (1) surface segmentation, which involves dividing words into morphs, the surface forms of morphemes; (2) canonical segmentation, which involves dividing words into morphemes and reducing them to their standard forms [8].</p>
      <p>For instance, consider the Italian word mangiavano (they were eating). The resulting surface segmentation would be mangi- + -avano, where mangi- is a morph derived from the root of the verb mangiare, and -avano is the suffix indicating the third person plural of the imperfect tense. In contrast, the canonical segmentation would yield mangiare + -avano, with mangiare as the canonical morpheme and -avano as the suffix.</p>
      <p>It is important to note that the segmentation process is not always straightforward, as it involves various linguistic criteria that may not be immediately clear. For example, one of the challenges lies in deciding whether to detach or retain the thematic vowel, a vowel that appears between the root and the inflectional suffix, especially in Romance languages. In the case of mangiavano, the thematic vowel -a- could either be considered part of the root or treated as a separate morph. Similarly, other segmentation criteria might involve distinctions between compound forms, derivational affixes, or fused morphemes that do not have clear boundaries. As a result, the segmentation criteria can vary based on linguistic theory, the specific task (e.g., computational vs. linguistic analysis), or even the intended application of the segmentation (e.g., for syntactic parsing or machine learning).</p>
      <p>In this study, we focus on surface morphological segmentation for the Italian language. Morphological features are often not adequately represented in available corpora for this language, or they refer exclusively to morphosyntactic information, such as the grammatical category of words and a macro-level descriptive analysis mainly related to inflection. Information about the internal structure of words, such as derivation or composition, is often lacking. The primary objective of this work is to use an automatic segmenter to extract a series of quantitative morphological parameters. We believe that our approach does not require the detailed analysis provided by canonical segmentation, which could entail longer processing times.</p>
<p>In addition to examining classic parameters reported in the literature that influence complexity [9], such as word frequency, length, and number of syllables, we aim to explore how morphological features integrate with these factors to affect word complexity perception. Specifically, we seek to understand how the internal structure of words contributes to the cognitive load that speakers experience when processing more complex lexical items.</p>
      <p>Our premise is that words with more morphemes are more complex because they contain more information to decode [<xref ref-type="bibr" rid="ref31">10</xref>]. For example, consider the word infelicità (unhappiness). To decode it, one must know the word felice (happy), from which it is derived, as well as the prefix in-, which negates the quality expressed by the base term, and the suffix -ità, which transforms the adjective into an abstract noun. Therefore, to fully understand the meaning of infelicità, the reader or listener must be able to correctly recognize and interpret each of these morphemes and their contribution to the overall meaning of the word.</p>
      <p>The main contributions of this work are: (1) providing a tool capable of automatically segmenting words into linguistically motivated base forms; (2) presenting the dataset constructed for training our model; (3) evaluating the impact of different linguistic features on speakers' perception of word complexity, with a particular focus on morphological features.</p>
      <p>2. Related Works</p>
      <p>The study of morphological segmentation has evolved from classical linguistics to advanced machine learning techniques [11, 12]. The main approaches include lexicon-based and boundary-detection-based methods [2]. Lexicon-based methods rely on a comprehensive database of known morphemes [<xref ref-type="bibr" rid="ref2">13, 14, 15</xref>], while boundary-detection methods identify transition points between morphemes using statistical or machine learning techniques [16, 17, 18].</p>
      <p>Another significant distinction is between generative models and discriminative models. Generative models, suited for unsupervised learning, generate word forms and segmentations from raw data [<xref ref-type="bibr" rid="ref23">19, 20, 21</xref>]. In contrast, discriminative models, which require annotated data, predict segmentations based on learned relationships from labeled examples [22, 23].</p>
      <p>Unsupervised methods do not require labeled data, making them attractive for leveraging vast amounts of raw data. They trace back to Harris (1955), who used statistical methods to identify morphological segments. Notable systems include Linguistica [24, 25] and Morfessor [26, 27], which employ the Minimum Description Length (MDL) principle to identify regularities within data. Despite their utility, unsupervised methods often suffer from oversegmentation and incorrect segmentation of affixes [<xref ref-type="bibr" rid="ref23">19, 28</xref>]. These challenges arise due to the complex interplay of phonological, morphological, and semantic factors in natural languages.</p>
      <p>Semi-supervised methods leverage both annotated and unannotated data, enhancing model performance with minimal manual annotation [29]. These methods are effective in scenarios with limited labeled data [30, 31], using initial labeled datasets to hypothesize and validate patterns across larger unlabeled corpora [32]. While beneficial, semi-supervised methods depend on the quality of the initial labeled datasets and may struggle with languages exhibiting extensive morphological diversity [2].</p>
      <p>Supervised methods, relying on annotated datasets, typically achieve higher accuracy due to learning from explicitly labeled examples. Techniques include neural networks, Hidden Markov Models (HMMs), and Convolutional Neural Networks (CNNs) [33, 34, 35, 23]. Despite their high performance, supervised methods are limited by the need for extensive annotated corpora, which can be costly and time-consuming to create. Given access to a large annotated dataset for the Italian language, on which we made semi-manual corrections, our study primarily adopts a supervised approach.</p>
      <p>2.1. Resources available for the Italian language</p>
      <p>Several computational resources and tools have been developed to manage Italian morphological information [<xref ref-type="bibr" rid="ref39">36, 37, 38, 39, 40, 41</xref>]. These resources are essential for improving the accuracy of text processing and supporting advanced linguistic research. However, many of them focus primarily on morphological analysis, without providing detailed support for morphological segmentation, which limits their usefulness in tasks that require fine-grained word structure analysis. Even those tools that offer segmentation often approach it with different methods and objectives than ours.</p>
      <p>Morph-it! [<xref ref-type="bibr" rid="ref39">37</xref>] is an open-source lexicon that contains 504,906 entries and 34,968 unique lemmas, each annotated with morphological characteristics that link inflected word forms to their lemmas. While valuable for lemmatization and morphological analysis, it is not suited for morphological segmentation, as it primarily focuses on inflected forms rather than decomposing words into their individual morphemes.</p>
      <p>MorphoPro [39] is part of the TextPro suite and is designed for morphological analysis of both English and Italian. It uses a declarative knowledge base converted into a Finite State Automaton (FSA) for detailed morphological analysis. However, MorphoPro's output is geared towards global morphological analysis and lacks support for internal word segmentation into morphemes, limiting its applicability for more granular tasks.</p>
<p>MAGIC [36] provides a lexicon of approximately 100,000 lemmas and performs detailed morphological and morphosyntactic analysis. However, similar to other resources, MAGIC does not focus on morphological segmentation. Instead, it provides morphological and syntactic information about word forms, making it more useful for general morphological analysis than for segmenting words into individual morphemes.</p>
      <p>Getarun [38] offers a lexicon of around 80,000 roots and provides sophisticated morphosyntactic analysis. However, like MAGIC, it is designed primarily for syntactic parsing and lacks functionality for detailed morphological segmentation, focusing instead on morphological and syntactic relationships.</p>
      <p>DerIvaTario [41] is another resource that provides significant support for morphological segmentation, particularly in the context of derivational morphology. It offers detailed information on derivational patterns in Italian, mapping out how words are formed through derivational processes, which is especially useful for studying word formation in a structured manner. However, DerIvaTario focuses primarily on canonical segmentations and does not always recognize smaller morphemes, such as final morphemes. This limitation means it may miss finer-grained morphological elements, making it more suitable for analyzing larger, derivational units than for capturing all inflectional components.</p>
      <p>AnIta is an advanced morphological analyzer for Italian, implemented within the FSA framework [40]. It supports a comprehensive lexicon with over 120,000 lemmas and handles inflectional, derivational, and compositional phenomena. AnIta's segmentation occurs on two levels: superficial segmentation of word forms and derivation graphs. Although derivation graphs are incomplete, the tool's focus on superficial segmentation aligns with our research needs. For the segmentation of lemmas related to derivational phenomena, AnIta adopts two main rules: (1) affixes are kept unchanged; (2) lexicon entries are segmented only if their base is a recognizable independent Italian word.</p>
      <p>3. Methods</p>
      <p>In this study, we trained three models, originally developed for other languages, using an Italian dataset that was manually created and verified with morphological segmentations. After evaluating the performance of the models, we selected the most effective one and used it to extract morphological parameters from the words in the MultiLS-IT dataset, a resource designed for lexical simplification in the Italian language [42, 43].</p>
      <p>The dataset comprises 600 contextualized words, annotated for complexity and accompanied by substitutes perceived as simpler than the target word; the resource is available at https://github.com/MLSP2024/MLSP_Data. Each word was evaluated by a group of native speakers with a perceived complexity score ranging from 1 to 5. In the dataset, the aggregated and normalized complexity value is between 0 and 1, where 0 indicates very simple words and 1 indicates very complex words. The morphological traits extracted by the selected model were then integrated with other linguistic features typically considered influential in the perception of word complexity [9]. These combined features were analyzed in a correlation study with the perceived complexity values of MultiLS-IT to assess their impact on predicting linguistic complexity. By examining the relationships between these variables, we aim to determine whether morphological measures can be effectively used in systems designed to automatically identify word complexity.</p>
      <sec id="sec-1-1">
        <title>3.1. Dataset</title>
        <p>The primary reference for this work is the AnIta dataset, which includes data annotated with morphological segmentations based on specific rules. One rule excludes bases derived from Latin, Greek, and other languages. Since Italian, especially in technical and specialized fields, contains many such words, we modified the dataset to include these forms to ensure accurate representation.</p>
        <p>The initial dataset consisted of numerous entries automatically generated by AnIta, often including over-generated word forms (possible words [44]), especially in evaluative morphology. This resulted in a comprehensive dataset with approximately two million entries. To adapt the AnIta dataset for our research needs, we undertook several steps. 1) Due to the extensive size, we reduced the sample, retaining one-third of the entries for each letter, resulting in approximately 728,814 word forms (35% of the original dataset); initially, we aimed to manually review the entire dataset to address any inconsistencies and overlooked segments, but due to time constraints we opted to reduce it by randomly selecting 30% of the entries for each letter. This sample maintains a fair representation of all linguistic categories. 2) We systematically identified and addressed prefixes and suffixes, prioritizing longer affixes to preserve more informative morphological structures. This semi-automatic approach facilitated manual verification while enhancing segmentation quality. 3) We manually reviewed the segmented words, ensuring accuracy and consistency, preserving prefixes in their original forms as per AnIta's rule number one. 4) The final dataset was divided into training (80%) and test (20%) sets, comprising 583,051 and 145,763 words respectively.</p>
        <p>This split allowed effective training and validation of our models without needing a separate validation set, as no parameter tuning was performed. This streamlined methodology ensured a robust dataset for implementing and evaluating our automatic segmentation system.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3.2. Segmentation Models</title>
        <p>Given the extensive dataset at our disposal, we selected models within the domain of supervised or semi-supervised learning. The models considered include:</p>
        <p>Morfessor FlatCat [31]: a semi-supervised model that utilizes an HMM approach for morphological segmentation. It is efficient in handling languages with complex morphological structures. The model's flat lexicon and the use of semi-supervised learning make it particularly suited for scenarios where annotated data is scarce.</p>
        <p>Neural Morpheme Segmentation [33]: a supervised model based on CNNs, designed to segment morphemes by treating the task as a sequential labeling problem using the BMES scheme (Begin, Middle, End, Single). This model is noted for its ability to capture local dependencies within textual data. Its architecture includes multiple convolutional and pooling layers, enhancing its capability to identify and segment complex morphological patterns.</p>
        <p>MorphemeBERT [45]: an advanced model that integrates BERT's character embeddings with CNNs to enhance morphological segmentation. BERT provides deep, context-rich linguistic representations, which can significantly improve the model's accuracy in identifying morphemic boundaries.</p>
        <p>3.3. Evaluation</p>
        <p>After constructing the dataset and selecting the previously described models, we proceeded with the training. Table 1 presents a comparative evaluation of the three models using precision, recall, F1 score, and accuracy. These metrics are standard for assessing the performance of boundary detection models, providing a comprehensive overview of each model's effectiveness in identifying and segmenting morphemes accurately.</p>
        <p>Table 1. Evaluation of the automatic segmentation systems (Neural Morpheme Segmentation, MorphemeBERT, and Morfessor FlatCat) by precision, recall, F1 score, and accuracy.</p>
        <p>Neural Morpheme Segmentation demonstrates the highest performance among the three systems across almost all metrics, particularly excelling in precision and F1 score. The high precision (0.9879) indicates that the model is very accurate in identifying correct morpheme boundaries, minimizing false positives. In other words, when the model segments a word, it reliably places the boundaries at the correct points. Its F1 score (0.9892), which balances precision and recall, underscores the model's ability not only to accurately segment morphemes but also to capture the majority of them with minimal oversight. The high recall (0.9806) confirms that the model rarely misses morphemes, making it particularly well-suited for handling complex or less frequent morphological patterns. This balance between high precision and recall showcases the robustness of the CNN-based architecture, which can effectively model both local dependencies between segments and the global morphological structure of words. (This model is available upon request; please contact the author directly for access and relevant references.)</p>
        <p>MorphemeBERT demonstrates a high level of precision, indicating that when it identifies a morpheme, it is likely correct. However, its recall is noticeably lower than that of Neural Morpheme Segmentation, which suggests that while it makes fewer errors, it also fails to detect a significant number of morphemes. This trade-off between precision and recall points to a more conservative approach in morpheme segmentation, where the model prioritizes accuracy over coverage. The F1 score of 0.9522, though still strong, highlights this imbalance between precision and recall, meaning the model performs well but lacks the comprehensive identification that would elevate its overall performance. The accuracy of 0.9581 reflects that the model is quite reliable in general, but its inability to capture as many correct morphemes as Neural Morpheme Segmentation affects its overall segmentation capability. This limitation might be due to how MorphemeBERT integrates BERT embeddings, which are optimized for context-rich predictions but may struggle with identifying morphemic boundaries in less straightforward or ambiguous cases, leading to more missed segments.</p>
        <p>Morfessor FlatCat shows a considerably weaker performance compared to the other two models. While its precision score of 0.79744 is decent, meaning that the morphemes it identifies are mostly accurate, its recall is notably low. This indicates that the model misses a substantial number of morphemes, failing to capture the full complexity of word segmentation. The low recall suggests that Morfessor FlatCat struggles to identify many valid morphemic boundaries, which results in incomplete or inaccurate segmentations. Consequently, its F1 score (0.5033) and accuracy (0.7399) are significantly lower.</p>
        <p>By integrating these morphological features with other linguistic traits typically considered influential in speakers' perception of complexity, we aim to assess their impact on predicting linguistic complexity.</p>
      </sec>
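<p>To make the evaluation setup concrete, the sketch below shows how a segmentation can be encoded with the BMES character-labeling scheme and how boundary precision, recall, and F1 can be computed. This is an illustrative toy scorer; the exact scoring conventions of the experiments reported above may differ.</p>
<preformat>
```python
# Sketch: BMES labelling of a segmentation and boundary precision/recall/F1.
# Toy data and a simplified scorer; not the paper's exact evaluation code.

def to_bmes(morphs):
    """Label each character Begin/Middle/End/Single from a morph list."""
    tags = []
    for m in morphs:
        if len(m) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return tags

def boundaries(morphs):
    """Character positions of internal morph boundaries."""
    out, pos = set(), 0
    for m in morphs[:-1]:
        pos += len(m)
        out.add(pos)
    return out

def prf(gold, pred):
    """Boundary precision, recall, and F1 between two segmentations."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g.intersection(p))
    prec = tp / len(p) if p else 1.0
    rec = tp / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["mangi", "avano"]
print(to_bmes(gold))        # ten tags: B M M M E  B M M M E
print(prf(gold, gold))      # (1.0, 1.0, 1.0)
```
</preformat>
<p>A prediction that places the boundary one character off, such as mang + iavano, scores zero on all three boundary metrics, which is why boundary-level scoring is a strict but informative measure for this task.</p>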
    </sec>
<sec id="sec-2">
      <title>5. Analysis and discussion</title>
      <p>Morfessor FlatCat's markedly lower F1 score and accuracy suggest that this system is less reliable for applications requiring high fidelity in morpheme segmentation.</p>
    </sec>
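<p>Concretely, the morphological measures examined in this work reduce to simple computations over a word's segmentation. The sketch below uses hypothetical segmentations and corpus counts, and a simplistic longest-morph heuristic to pick the lexical morpheme; none of these stand in for the project's actual data or tooling.</p>
<preformat>
```python
# Sketch of the morphological measures used in this study: morpheme count,
# morphological density, and log-scaled lexical-morpheme frequency.
# SEG and MORPH_FREQ are hypothetical toy values (accent dropped for ASCII).

import math

SEG = {"infelicita": ["in", "felic", "ita"]}               # toy surface segmentation
MORPH_FREQ = {"in": 120000, "felic": 3500, "ita": 45000}   # toy corpus counts

def morpheme_count(word):
    return len(SEG[word])

def morphological_density(word):
    """Morphemes per character of the word."""
    return morpheme_count(word) / len(word)

def lexical_morpheme_logfreq(word):
    """Log frequency of the content-bearing morph (longest morph: a toy heuristic)."""
    lexical = max(SEG[word], key=len)
    return math.log(MORPH_FREQ[lexical])

print(morpheme_count("infelicita"))                   # 3
print(round(morphological_density("infelicita"), 2))  # 0.3
```
</preformat>
<p>The logarithmic transform compresses the heavy-tailed distribution of corpus frequencies, so that very frequent and moderately frequent morphemes are compared on a more uniform scale.</p>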
    <sec id="sec-3">
      <title>4. Selection of Linguistic Features</title>
      <sec id="sec-3-1">
<title>Linguistic features</title>
        <p>Based on a thorough review of the literature on lexical complexity prediction [9, 46], we selected several linguistic features to analyze their impact on complexity. In addition to common surface characteristics, such as the number of letters, syllables, and vowels in words, commonly used in complexity studies and readability calculations, we identified other relevant parameters. One key factor is the frequency of a word, as more frequent words tend to be perceived as more familiar and thus less complex; we calculated it using the ItWac corpus [47]. Another important parameter is the number of senses a word has, measured using the lexical resource ItalWordNet [48]. Lastly, the presence of stop words, calculated with a spaCy model, can influence the perceived complexity of a sentence or text, since stop words are common words that often carry little inherent meaning. Given the focus of this study on morphological features' impact on lexical complexity, we also concentrated on several key aspects related to the internal structure of words, features that could show how morphological traits contribute to word intricacy.</p>
        <p>Through studying the correlations between these variables, we seek to determine whether morphological measures can be effectively used to develop systems capable of automatically identifying word complexity. To achieve this, we conducted a correlation and significance analysis between the features discussed earlier and the perceived complexity values for the 600 words included in MultiLS-IT.</p>
        <p>Table 2. Spearman correlation coefficients and p-values for features and complexity (* indicates statistical significance): Length 0.082 (p = 0.045*); Number of vowels 0.097 (p = 0.018*); Number of syllables 0.091 (p = 0.026*); Number of morphemes 0.112 (p = 0.006*); Senses_ID -0.277 (p = 0.000*); Stopword -0.467 (p = 0.000*); Lemma frequency -0.124 (p = 0.003*); Lexical morpheme frequency -0.336 (p = 0.000*); Morphological density 0.033 (p = 0.381).</p>
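<p>The coefficients in Table 2 are Spearman rank correlations. As a self-contained sketch (a real analysis would use a statistics library such as SciPy), rho can be computed by ranking both variables, with ties sharing their average rank, and then taking the Pearson correlation of the ranks; the helper names below are our own.</p>
<preformat>
```python
# Hand-rolled Spearman rank correlation, for illustration only.
# Assumes neither input is constant (otherwise the denominator is zero).

def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i != len(order):
        j = i
        while j != len(order) and xs[order[j]] == xs[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        for k in order[i:j]:
            r[k] = avg
        i = j
    return r

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# A perfectly monotone relationship gives rho of 1.
print(round(spearman([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # 1.0
```
</preformat>
<p>Because only ranks enter the computation, the measure captures any consistently monotone association, not just linear ones, which motivates its use for features such as word length and morpheme counts.</p>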
<p>Number of morphemes: morphemes are the smallest units of meaning in words, including affixes (prefixes and suffixes) and roots. The number of morphemes gives an indication of the information load of a word. Lexical items with more morphemes typically require more decoding effort from readers. We used our convolutional neural model for automatic morphological segmentation and morpheme counting.</p>
        <p>Morphological density: this quantitative metric is defined as the ratio of the number of morphemes to word length, offering a measure of how densely packed meaningful units are within a word. Higher morphological density can indicate more cognitive load, as each unit contributes distinct information, potentially raising the complexity of the word.</p>
        <p>Frequency of the lexical morpheme: lexical morphemes carry the core meaning of the word. Employing our morphological segmenter on the ItWac corpus [47] enabled us to dissect each word into segments and aggregate the frequencies of individual morphemes. This frequency, transformed using a logarithmic scale, helps predict complexity by leveraging the familiarity of frequently occurring morphemes. The use of lexical morpheme frequency as a complexity indicator is based on the idea that even if a word is unfamiliar as a whole, its component morphemes may be common in the language and more recognizable [49]. For a detailed analysis of how these parameters were processed, refer to Occhipinti 2024.</p>
        <p>Table 2 presents the Spearman correlation coefficients and their statistical significance for the features calculated. Spearman's rank correlation was chosen because it does not assume a linear relationship between variables, making it more suitable for our dataset, where the relationships between features like word length, number of morphemes, and word complexity may not follow a strictly linear pattern. Spearman's correlation measures whether an increase in one variable tends to be consistently associated with an increase (or decrease) in another, which is more appropriate given the nature of our linguistic features. The correlation analysis reveals several important insights.</p>
        <p>Word length, number of vowels, and number of syllables all have small but statistically significant positive correlations with complexity. This suggests that, as expected, longer words with more vowels and syllables tend to be perceived as more complex. These factors are typical in readability studies, where more phonologically complex words are generally harder to process.</p>
        <p>The number of morphemes also shows a positive correlation with complexity, reinforcing the idea that words with more morphemes are perceived as more complex. This feature is statistically significant as well.</p>
        <p>Negative correlations for senses_ID, stopword presence, and lemma frequency suggest that words with more senses, those that are stopwords, or those that are more frequently used are perceived as less complex. These features are also statistically significant. It is noteworthy that the number of senses (senses_ID) is inversely proportional to complexity. This could be attributed to the incompleteness of ItalWordNet, potentially leading to unreliable predicted values.</p>
        <p>Morphological density, however, does not show a statistically significant correlation with complexity, suggesting that the ratio of morphemes to word length may not be a strong predictor of perceived complexity.</p>
        <p>The lexical morpheme frequency shows a significant negative correlation with complexity, indicating that more frequently occurring morphemes contribute to lower perceived complexity. This supports the notion that familiar morphemes, even within otherwise complex words, aid in comprehension.</p>
        <p>These findings underscore the importance of considering a range of linguistic features, including morphological traits, when assessing lexical complexity. By integrating these features into computational models, we can enhance their ability to accurately predict word complexity and, subsequently, improve lexical simplification.</p>
        <p>6. Conclusion</p>
        <p>This study highlights the significance of integrating morphological features into automatic models to enhance the comprehension and prediction of lexical complexity. The high performance of the Neural Morpheme Segmentation model demonstrates the efficacy of convolutional neural networks in capturing the detailed patterns of morphological segmentation in the Italian language. The correlation analysis reveals that while traditional metrics like word length and frequency are valuable predictors of complexity, incorporating morphological features provides additional insights that enrich our understanding of lexical complexity. Notably, the positive correlation between the number of morphemes and perceived complexity suggests that words with more morphemes are inherently more complex. Conversely, frequent lexical morphemes tend to reduce perceived complexity, highlighting the importance of familiarity in complexity perception. Our study also emphasizes the need for diverse linguistic features, including both surface characteristics and morphological traits, to create more robust and accurate models for predicting word complexity. The statistically significant correlations for most features validate their relevance in complexity prediction.</p>
        <p>However, it is important to note that our findings are based on a relatively small dataset of annotated complexity perceptions. To obtain more robust and generalizable results, it would be highly beneficial to have access to a larger and more diverse dataset of complexity annotations. Expanding the dataset to include a wider variety of texts and contexts would enhance the reliability of the correlations observed and improve the training and evaluation of automatic complexity prediction models. Future research should focus on gathering more extensive annotated datasets and exploring additional linguistic features that may influence complexity perception. By doing so, we can further refine our models and develop more effective tools for lexical simplification and other applications aimed at improving text accessibility.</p>
        <p>[26] M. Creutz, K. Lagus, Unsupervised discovery of morphemes, in: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, 2002, pp. 21–30.</p>
        <p>[27] M. J. P. Creutz, K. H. Lagus, Morfessor in the morpho challenge, in: Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, 2006, pp. 12–17.</p>
        <p>[28] Ö. Kılıç, C. Bozsahin, Semi-supervised morpheme segmentation without morphological analysis, in: Proceedings of the Workshop on Language Resources and Technologies for Turkic Languages, LREC, 2012, pp. 52–56.</p>
        <p>[29] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Painless semi-supervised morphological segmentation using conditional random fields, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, 2014, pp. 84–89.</p>
        <p>[30] J. Lafferty, A. McCallum, F. Pereira, et al., Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: International Conference on Machine Learning, 2001, pp. 282–289.</p>
        <p>[31] S.-A. Grönroos, S. Virpioja, P. Smit, M. Kurimo, Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, 2014, pp. 1177–1185.</p>
        <p>[32] X. Zhu, A. B. Goldberg, Introduction to semi-supervised learning, Springer Nature, 2022.</p>
        <p>ence series 2005 (ISSN 1747-9398), volume 1, 2005, pp. 1–12.</p>
        <p>[38] R. Delmonte, et al., Computational Linguistic Text Processing: Lexicon, Grammar, Parsing and Anaphora Resolution, Nova Science Publishers, 2008.</p>
        <p>[39] E. Pianta, C. Girardi, R. Zanoli, The TextPro tool suite, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008, pp. 2603–2607.</p>
        <p>[40] F. Tamburini, M. Melandri, AnIta: a powerful morphological analyser for Italian, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 2012, pp. 941–947.</p>
        <p>[41] L. Talamo, C. Celata, P. M. Bertinetto, DerIvaTario: An annotated lexicon of Italian derivatives, Word Structure 9 (2016) 72–102.</p>
        <p>[42] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, S. Bott, S. Calderon Ramirez, R. Cardon, T. François, A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M. Imperial, A. Nohejl, K. North, L. Occhipinti, N. Peréz Rojas, N. Raihan, T. Ranasinghe, M. Solis Salazar, M. Zampieri, H. Saggion, An extensible massively multilingual lexical simplification pipeline dataset using the MultiLS framework, in: R. Wilkens, R. Cardon, A. Todirascu, N. Gala (Eds.), Proceedings of the 3rd Workshop on Tools and Resources for People with Reading Difficulties (READI) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 38–46. URL: https://aclanthology.org/2024.readi-1.4.</p>
        <p>[33] A. Sorokin, Convolutional neural networks for low-resource morpheme segmentation: baseline
[43] M. Shardlow, F. Alva-Manchego, R. Batista-Navarro, S. Bott, S. Calderon Ramirez, R. Cardon, T. François,
or state-of-the-art?, in: Proceedings of the 16th A. Hayakawa, A. Horbach, A. Hülsing, Y. Ide, J. M.
Workshop on Computational Research in Phonet- Imperial, A. Nohejl, K. North, L. Occhipinti, N. P.
ics, Phonology, and Morphology, 2019, pp. 154–159. Rojas, N. Raihan, T. Ranasinghe, M. S. Salazar,
URL: https://aclanthology.org/W19-4218. doi:10. S. Štajner, M. Zampieri, H. Saggion, The BEA
18653/v1/W19-4218. 2024 shared task on the multilingual lexical
sim[34] L. Wang, Z. Cao, Y. Xia, G. De Melo, Morphological plification pipeline, in: E. Kochmar, M. Bexte,
segmentation with window lstm neural networks, J. Burstein, A. Horbach, R. Laarmann-Quante,
in: Proceedings of the AAAI Conference on Artifi- A. Tack, V. Yaneva, Z. Yuan (Eds.), Proceedings
cial Intelligence, 2016, pp. 2842–2848. of the 19th Workshop on Innovative Use of NLP
[35] R. Cotterell, T. Mueller, A. Fraser, H. Schütze, for Building Educational Applications (BEA 2024),
Labeled morphological segmentation with semi- Association for Computational Linguistics,
Mexmarkov models, in: Proceedings of the Nineteenth ico City, Mexico, 2024, pp. 571–589. URL: https:
Conference on Computational Natural Language //aclanthology.org/2024.bea-1.51.</p>
        <p>
          Learning, 2015, pp. 164–174. [44] M. Aronof, A decade of morphology and word
[36] M. Battista, V. Pirrelli, Una piattaforma di morfolo- formation, Annual review of anthropology (1983)
gia computazionale per l’analisi e la generazione 355–375.
delle parole italiane, Technical Report, ILC-CNR, [45] A. Sorokin, Improving morpheme segmentation
us1999. ing bert embeddings, in: International Conference
[
          <xref ref-type="bibr" rid="ref39">37</xref>
          ] E. Zanchetta, M. Baroni, Morph-it! a free corpus- on Analysis of Images, Social Networks and Texts,
based morphological resource for the italian lan- Springer, 2021, pp. 148–161.
guage, in: Proceedings of corpus linguistics confer- [46] K. North, M. Zampieri, M. Shardlow, Lexical
com
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. T. Devlin, H. L. Jamison, P. M. Matthews, L. M. Gonnerman, Morphology and the internal structure of words, Proceedings of the National Academy of Sciences 101 (2004) 14984–14988.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Ruokolainen, O. Kohonen, K. Sirts, S.-A. Grönroos, M. Kurimo, S. Virpioja, A comparative study of minimally supervised morphological segmentation, Computational Linguistics 42 (2016) 91–120.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, arXiv preprint arXiv:2112.10508 (2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. doi:10.18653/v1/P16-1162.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. Bostrom, G. Durrett, Byte pair encoding is suboptimal for language model pretraining, in: Findings of EMNLP 2020, 2020, pp. 4617–4624.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast wordpiece tokenization, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 2089–2103.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, M. Hulden, The sigmorphon 2016 shared task–morphological reinflection, in: Proceedings of the 14th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology, 2016, pp. 10–22.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL-International Journal of Applied Linguistics 165 (2014) 97–135.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] W. U. Dressler, Ricchezza e complessità morfologica (1999) 1000–1011.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Scalise, Morfologia, il Mulino, 1994.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. A. Goldsmith, Segmentation and morphology, in: The handbook of computational linguistics and natural language processing, Wiley Online Library, 2010, pp. 364–393.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. G. Wolf, The discovery of segments in natural language, British Journal of Psychology 68 (1977) 97–106.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] C. G. Nevill-Manning, I. H. Witten, Identifying hierarchical structure in sequences: A linear-time algorithm, Journal of Artificial Intelligence Research 7 (1997) 67–82.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Johnson, Unsupervised word segmentation for sesotho using adaptor grammars, in: Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology, 2008, pp. 20–27.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Z. S. Harris, From phoneme to morpheme, Language 31 (1955) 190–222. URL: http://www.jstor.org/stable/411036.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] P. Cohen, B. Heeringa, N. M. Adams, An unsupervised algorithm for segmenting categorical time-series into episodes, in: Proceedings of Pattern Detection and Discovery: ESF Exploratory Workshop London, 2002, pp. 49–62.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Sorokin, A. Kravtsova, Deep convolutional networks for supervised morpheme segmentation of russian language, in: Proceedings of 7th International Conference in Artificial Intelligence and Natural Language (AINL 2018), 2018, pp. 3–10.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Creutz, K. Lagus, Unsupervised models for morpheme segmentation and morphology learning, ACM Transactions on Speech and Language Processing (TSLP) 4 (2007) 1–34.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Poon, C. Cherry, K. Toutanova, Unsupervised morphological segmentation with log-linear models, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 209–217.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] K. Sirts, S. Goldwater, Minimally-supervised morphological segmentation using adaptor grammars, Transactions of the Association for Computational Linguistics 1 (2013) 255–266.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Z. S. Harris, Morpheme Boundaries within Words, 1970, pp. 68–77.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] T. Ruokolainen, O. Kohonen, S. Virpioja, M. Kurimo, Supervised morphological segmentation in a low-resource learning setting using conditional random fields, in: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 29–37.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. Goldsmith, Unsupervised learning of the morphology of a natural language, Computational Linguistics 27 (2001) 153–198.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Goldsmith, An algorithm for the unsupervised learning of morphology, Natural Language Engineering 12 (2006) 353–371.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[47] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language resources and evaluation 43 (2009) 209–226.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[48] A. Roventini, A. Alonge, N. Calzolari, B. Magnini, et al., Italwordnet: a large semantic database for italian, in: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), 2000, pp. 783–790.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[49] P. Colé, J. Segui, M. Taft, Words and morphemes as units for lexical access, Journal of Memory and Language 37 (1997) 312–330.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[50] L. Occhipinti, Complex word identification for italian, in: Proceedings of CLIB 2024, Sixth International Conference on Computational Linguistics in Bulgaria, 2024.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>