<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Laura</forename><surname>Occhipinti</surname></persName>
							<email>laura.occhipinti3@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0B41DB89D05D9E99EA781086CA9776EB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Morphological segmentation</term>
					<term>Lexical complexity prediction</term>
					<term>Italian language</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Morphological analysis is essential for various Natural Language Processing (NLP) tasks, as it reveals the internal structure of words and deepens our understanding of their morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the limited representation of detailed morphological information in existing corpora. Using an automatic segmentation tool, we extract quantitative morphological parameters to investigate their impact on the perception of word complexity by native Italian speakers. Through correlation analysis, we demonstrate that morphological features, such as the number of morphemes and lexical morpheme frequency, significantly influence how complex words are perceived. These insights contribute to improving automatic lexical complexity prediction models and offer a deeper understanding of the role of morphology in word comprehension.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Morphological analysis is crucial for various NLP tasks, as it provides insights into the internal structures of words and helps us better understand the morphological and syntactic relationships between words <ref type="bibr" target="#b0">[1]</ref>.</p><p>The Italian language, with its rich morphology and extensive use of inflection and derivation, presents unique challenges and opportunities for morphological segmentation.</p><p>Automatic segmentation, a key component of morphology learning, involves dividing word forms into meaningful units such as roots, prefixes, and suffixes <ref type="bibr" target="#b1">[2]</ref>. This task falls under the broader category of subword segmentation <ref type="bibr" target="#b2">[3]</ref> but is distinct due to its linguistic motivation. Computational approaches typically identify subwords based on purely statistical considerations, which often results in subunits that do not correspond to recognizable linguistic units <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Making this task more morphologically oriented could enable models to generalize better to new words or forms, as basic roots or morphemes are often shared among words, and it could also facilitate the interpretation of model results.</p><p>When discussing morphological segmentation, we can refer to two types: (1) Surface segmentation, which involves dividing words into morphs, the surface forms of morphemes; (2) Canonical segmentation, which involves dividing words into morphemes and reducing them to their standard forms <ref type="bibr" target="#b7">[8]</ref>.</p><p>For instance, consider the Italian word mangiavano (they were eating). 
The surface segmentation would be mangi- + -avano, where mangi- is a morph derived from the root of the verb mangiare, and -avano is the suffix indicating the third person plural of the imperfect tense. In contrast, the canonical segmentation would yield mangiare + -avano, with mangiare as the canonical morpheme and -avano as the suffix<note place="foot" n="1">The segmentation process is not always straightforward, as it involves various linguistic criteria that may not be immediately clear. For example, one of the challenges lies in deciding whether to detach or retain the thematic vowel, a vowel that appears between the root and the inflectional suffix, especially in Romance languages. In the case of mangiavano, the thematic vowel -a- could either be considered part of the root or treated as a separate morph. Similarly, other segmentation criteria might involve distinctions between compound forms, derivational affixes, or fused morphemes that do not have clear boundaries. As a result, the segmentation criteria can vary based on linguistic theory, the specific task (e.g., computational vs. linguistic analysis), or even the intended application of the segmentation (e.g., for syntactic parsing or machine learning).</note>.</p><p>In this study, we focus on surface morphological segmentation for the Italian language. Morphological features are often not adequately represented in available corpora for this language, or they refer exclusively to morphosyntactic information, such as the grammatical category of words and a macro-level descriptive analysis mainly related to inflection. Information about the internal structure of words, such as derivation or composition, is often lacking.</p><p>The primary objective of this work is to use an automatic segmenter to extract a series of quantitative morphological parameters. We believe that our approach does not require the detailed analysis provided by canonical segmentation, which could entail longer processing times.</p><p>In addition to examining classic parameters reported in the literature that influence complexity <ref type="bibr" target="#b8">[9]</ref>, such as word frequency, length, and number of syllables, we aim to explore how morphological features integrate with these factors to affect word complexity perception. Specifically, we seek to understand how the internal structure of words contributes to the cognitive load that speakers experience when processing more complex lexical items.</p><p>Our premise is that words with more morphemes are more complex because they contain more information to decode <ref type="bibr" target="#b9">[10]</ref>. For example, consider the word infelicità (unhappiness). To decode it, one must know the word felice (happy), from which it is derived, as well as the prefix in-, which negates the quality expressed by the base term, and the suffix -ità, which transforms the adjective into an abstract noun. Therefore, to fully understand the meaning of infelicità, the reader or listener must be able to correctly recognize and interpret each of these morphemes and their contribution to the overall meaning of the word.</p><p>The main contributions of this work are: (1) providing a tool capable of automatically segmenting words into linguistically motivated base forms; (2) presenting the dataset constructed for training our model; (3) evaluating the impact of different linguistic features on speakers' perception of word complexity, with a particular focus on morphological features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>The study of morphological segmentation has evolved from classical linguistics to advanced machine learning techniques <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>. The main approaches include lexicon-based and boundary-detection-based methods <ref type="bibr" target="#b1">[2]</ref>. Lexicon-based methods rely on a comprehensive database of known morphemes <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>, while boundary-detection methods identify transition points between morphemes using statistical or machine learning techniques <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>.</p><p>Another significant distinction is between generative models and discriminative models. Generative models, suited for unsupervised learning, generate word forms and segmentations from raw data <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref>. In contrast, discriminative models, which require annotated data, predict segmentations based on learned relationships from labeled examples <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>.</p><p>Unsupervised methods do not require labeled data, making them attractive for leveraging vast amounts of raw data. They trace back to <ref type="bibr" target="#b15">Harris (1955)</ref>, who used statistical methods to identify morphological segments. Notable systems include Linguistica <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25]</ref> and Morfessor <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27]</ref>, which employ the Minimum Description Length (MDL) principle to identify regularities within data. 
Despite their utility, unsupervised methods often suffer from oversegmentation and incorrect segmentation of affixes <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b27">28]</ref>. These challenges arise due to the complex interplay of phonological, morphological, and semantic factors in natural languages.</p><p>Semi-supervised methods leverage both annotated and unannotated data, enhancing model performance with minimal manual annotation <ref type="bibr" target="#b28">[29]</ref>. These methods are effective in scenarios with limited labeled data <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31]</ref>, using initial labeled datasets to hypothesize and validate patterns across larger unlabeled corpora <ref type="bibr" target="#b31">[32]</ref>. While beneficial, semi-supervised methods depend on the quality of initial labeled datasets and may struggle with languages exhibiting extensive morphological diversity <ref type="bibr" target="#b1">[2]</ref>.</p><p>Supervised methods, relying on annotated datasets, typically achieve higher accuracy due to learning from explicitly labeled examples. Techniques include neural networks, Hidden Markov Models (HMM), and Convolutional Neural Networks (CNNs) <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35,</ref><ref type="bibr" target="#b22">23]</ref>. Despite their high performance, supervised methods are limited by the need for extensive annotated corpora, which can be costly and time-consuming to create.</p><p>Given access to a large annotated dataset for the Italian language, on which we made semi-manual corrections, our study primarily adopts a supervised approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Resources available for the Italian language</head><p>Several computational resources and tools have been developed to manage Italian morphological information <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b36">37,</ref><ref type="bibr" target="#b37">38,</ref><ref type="bibr" target="#b38">39,</ref><ref type="bibr" target="#b39">40,</ref><ref type="bibr" target="#b40">41]</ref>. These resources are essential for improving the accuracy of text processing and supporting advanced linguistic research. However, many of them focus primarily on morphological analysis, without providing detailed support for morphological segmentation, which limits their usefulness in tasks that require fine-grained word structure analysis. Even those tools that offer segmentation often approach it with different methods and objectives than ours.</p><p>Morph-it! <ref type="bibr" target="#b36">[37]</ref> is an open-source lexicon that contains 504,906 entries and 34,968 unique lemmas, each annotated with morphological characteristics that link inflected word forms to their lemmas. While valuable for lemmatization and morphological analysis, it is not suited for morphological segmentation, as it primarily focuses on inflected forms rather than decomposing words into their individual morphemes.</p><p>MorphoPro <ref type="bibr" target="#b38">[39]</ref> is part of the TextPro suite and is designed for morphological analysis of both English and Italian. It uses a declarative knowledge base converted into a Finite State Automaton (FSA) for detailed morphological analysis. However, MorphoPro's output is geared towards global morphological analysis and lacks support for internal word segmentation into morphemes, limiting its applicability for more granular tasks.</p><p>MAGIC <ref type="bibr" target="#b35">[36]</ref> provides a lexicon of approximately 100,000 lemmas and performs detailed morphological and morphosyntactic analysis. 
However, like the resources above, MAGIC does not focus on morphological segmentation. Instead, it provides morphological and syntactic information about word forms, making it more useful for general morphological analysis than for segmenting words into individual morphemes.</p><p>Getarun <ref type="bibr" target="#b37">[38]</ref> offers a lexicon of around 80,000 roots and provides sophisticated morphosyntactic analysis. However, like MAGIC, it is designed primarily for syntactic parsing and lacks functionality for detailed morphological segmentation, focusing instead on morphological and syntactic relationships.</p><p>DerIvaTario <ref type="bibr" target="#b40">[41]</ref>, by contrast, provides significant support for morphological segmentation, particularly in the context of derivational morphology. It offers detailed information on derivational patterns in Italian, mapping out how words are formed through derivational processes, which is especially useful for studying word formation in a structured manner. However, DerIvaTario focuses primarily on canonical segmentations and does not always recognize smaller morphemes, such as final morphemes. This limitation means it may miss finer-grained morphological elements, making it more suitable for analyzing larger, derivational units than for capturing all inflectional components.</p><p>AnIta is an advanced morphological analyzer for Italian, implemented within the FSA framework <ref type="bibr" target="#b39">[40]</ref>. It supports a comprehensive lexicon with over 120,000 lemmas and handles inflectional, derivational, and compositional phenomena. AnIta's segmentation occurs on two levels: superficial segmentation of word forms and derivation graphs. Although derivation graphs are incomplete, the tool's focus on superficial segmentation aligns with our research needs. 
For the segmentation of lemmas related to derivational phenomena, AnIta adopts two main rules:</p><p>(1) affixes are kept unchanged; (2) lexicon entries are segmented only if their base is a recognizable independent Italian word.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this study, we trained three models, originally developed for other languages, using an Italian dataset of morphological segmentations that was manually created and verified. After evaluating the performance of the models, we selected the most effective one and used it to extract morphological parameters from the words in the MultiLS-IT dataset, a resource designed for lexical simplification in the Italian language <ref type="bibr" target="#b41">[42,</ref><ref type="bibr" target="#b42">43]</ref>.</p><p>The dataset comprises 600 contextualized words, annotated for complexity and accompanied by substitutes perceived as simpler than the target word. Each word was rated by a group of native speakers for perceived complexity on a scale from 1 to 5. In the dataset, the aggregated and normalized complexity value is between 0 and 1, where 0 indicates very simple words and 1 indicates very complex words<ref type="foot" target="#foot_0">2</ref>.</p><p>The morphological traits extracted by the selected model were then integrated with other linguistic features typically considered influential in the perception of word complexity <ref type="bibr" target="#b8">[9]</ref>. These combined features were analyzed in a correlation study with the perceived complexity values of MultiLS-IT to assess their impact on predicting linguistic complexity. By examining the relationships between these variables, we aim to determine whether morphological measures can be effectively used in systems designed to automatically identify word complexity.</p></div>
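<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the 0 to 1 complexity scale concrete, the following minimal sketch shows one plausible way to aggregate 1 to 5 ratings into a normalized score. The text above does not specify the aggregation used in MultiLS-IT, so the choice of the mean followed by min-max scaling, as well as the function name normalize_ratings, are assumptions made purely for illustration.</p>
<p><code lang="python"><![CDATA[
```python
def normalize_ratings(ratings, low=1, high=5):
    """Average a list of Likert ratings and rescale the mean to [0, 1].

    Assumption: mean aggregation plus min-max scaling; the dataset's
    actual procedure may differ.
    """
    mean = sum(ratings) / len(ratings)
    return (mean - low) / (high - low)


# Three hypothetical annotators rate the same word 3, 4 and 5.
print(normalize_ratings([3, 4, 5]))  # 0.75
```
]]></code></p>
<p>Under this assumption, a word rated 1 by every annotator maps to 0 and a word rated 5 by every annotator maps to 1, which matches the interpretation of the scale given above.</p></div>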
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset</head><p>The primary reference for this work is the AnIta dataset, which includes data annotated with morphological segmentations based on specific rules. One rule excludes bases derived from Latin, Greek, and other languages. Since Italian, especially in technical and specialized fields, contains many such words, we modified the dataset to include these forms to ensure accurate representation.</p><p>The initial dataset consisted of numerous entries automatically generated by AnIta, often including overgenerated word-forms (possible words <ref type="bibr" target="#b43">[44]</ref>), especially in evaluative morphology. This resulted in a comprehensive dataset with approximately two million entries. To adapt the AnIta dataset for our research needs, we undertook several steps.</p><p>1) Due to the extensive size, we reduced the sample, retaining one-third of entries for each letter, resulting in 728,814 word-forms (about 35% of the original dataset). This sample maintains a fair representation of all linguistic categories<ref type="foot" target="#foot_1">3</ref>.</p><p>2) We systematically identified and addressed prefixes and suffixes, prioritizing longer affixes to preserve more informative morphological structures. This semi-automatic approach facilitated manual verification while enhancing segmentation quality.</p><p>3) We manually reviewed the segmented words, ensuring accuracy and consistency, preserving prefixes in their original forms as per AnIta's rule number one.</p><p>4) The final dataset was divided into training (80%) and test (20%) sets, comprising 583,051 and 145,763 words respectively. This split allowed effective training and evaluation of our models without needing a separate validation set, as no parameter tuning was performed.</p><p>This streamlined methodology ensured a robust dataset for implementing and evaluating our automatic segmentation system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Results of models on morphological segmentation.</p></div>
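<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of prioritizing longer affixes during dataset preparation can be illustrated with a small longest-match sketch. This is not the semi-automatic procedure actually used to build the dataset: the function name split_affixes and the tiny affix lists are hypothetical, and the sketch strips at most one prefix and one suffix.</p>
<p><code lang="python"><![CDATA[
```python
def split_affixes(word, prefixes, suffixes):
    """Split off at most one prefix and one suffix, trying longer affixes first."""
    # Longest prefix that matches and still leaves some material behind it.
    prefix = next(
        (p for p in sorted(prefixes, key=len, reverse=True)
         if word.startswith(p) and len(word) > len(p)),
        "",
    )
    rest = word[len(prefix):]
    # Longest suffix that matches the remainder and leaves a non-empty stem.
    suffix = next(
        (s for s in sorted(suffixes, key=len, reverse=True)
         if rest.endswith(s) and len(rest) > len(s)),
        "",
    )
    stem = rest[:len(rest) - len(suffix)]
    return [m for m in (prefix, stem, suffix) if m]


# Hypothetical affix inventories, for illustration only.
PREFIXES = ["in", "inter"]
SUFFIXES = ["e", "ale"]

print(split_affixes("internazionale", PREFIXES, SUFFIXES))
# ['inter', 'nazion', 'ale']
```
]]></code></p>
<p>Trying "inter" before "in" avoids the uninformative split in- + ternazionale, which is the kind of error that a shortest-match strategy would introduce and that manual verification would then have to correct.</p></div>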
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Segmentation Models</head><p>Given the extensive dataset at our disposal, we selected models within the domain of supervised or semi-supervised learning. The models considered include:</p><p>Morfessor FlatCat <ref type="bibr" target="#b30">[31]</ref>: a semi-supervised model that uses an HMM approach for morphological segmentation. It is efficient in handling languages with complex morphological structures. The model's flat lexicon and the use of semi-supervised learning make it particularly suited for scenarios where annotated data is scarce.</p><p>Neural Morpheme Segmentation <ref type="bibr" target="#b32">[33]</ref>: a supervised model based on CNNs, designed to segment morphemes by treating the task as a sequence labeling problem using the BMES scheme (Begin, Middle, End, Single). This model is noted for its ability to capture local dependencies within textual data. Its architecture includes multiple convolutional and pooling layers, enhancing its capability to identify and segment complex morphological patterns.</p><p>MorphemeBERT <ref type="bibr" target="#b44">[45]</ref>: an advanced model that integrates BERT's character embeddings with CNNs to enhance morphological segmentation. BERT provides deep, context-rich linguistic representations, which can significantly improve the model's accuracy in identifying morphemic boundaries.</p></div>
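<div xmlns="http://www.tei-c.org/ns/1.0"><p>The BMES scheme mentioned above assigns one label to each character of a word. The following minimal sketch, with hypothetical function names to_bmes and from_bmes, shows how a surface segmentation maps to a label sequence and back. It is a simplification that ignores any morpheme-type information a model might attach to the labels.</p>
<p><code lang="python"><![CDATA[
```python
def to_bmes(morphs):
    """Encode a list of morphs as one BMES label per character."""
    labels = []
    for morph in morphs:
        if len(morph) == 1:
            labels.append("S")  # single-character morph
        else:
            labels.append("B" + "M" * (len(morph) - 2) + "E")
    return "".join(labels)


def from_bmes(word, labels):
    """Decode a well-formed BMES label string back into morphs."""
    morphs, start = [], 0
    for i, label in enumerate(labels):
        if label in ("E", "S"):  # a morph ends at this character
            morphs.append(word[start:i + 1])
            start = i + 1
    return morphs


labels = to_bmes(["mangi", "avano"])
print(labels)                           # BMMMEBMMME
print(from_bmes("mangiavano", labels))  # ['mangi', 'avano']
```
]]></code></p>
<p>Framing segmentation this way turns boundary detection into a per-character classification task, which is what allows a convolutional sequence labeler to be applied to it.</p></div>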
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation</head><p>After constructing the dataset and selecting the previously described models, we proceeded with the training. Table <ref type="table">1</ref> presents a comparative evaluation of the three models using precision, recall, F1 score, and accuracy. These metrics are standard for assessing the performance of boundary detection models, providing a comprehensive overview of each model's effectiveness in identifying and segmenting morphemes accurately.</p><p>Neural Morpheme Segmentation demonstrates the highest performance among the three systems across almost all metrics, particularly excelling in precision and F1 score. The high precision (0.9879) indicates that the model is very accurate in identifying correct morpheme boundaries, minimizing false positives. In other words, when the model segments a word, it reliably places the boundaries at the correct points. Its F1 score (0.9892), which balances precision and recall, underscores the model's ability not only to accurately segment morphemes but also to capture the majority of them with minimal oversight. The high recall (0.9806) confirms that the model rarely misses morphemes, making it particularly well-suited for handling complex or less frequent morphological patterns. This balance between high precision and recall showcases the robustness of the CNN-based architecture, which can effectively model both local dependencies between segments and the global morphological structure of words<ref type="foot" target="#foot_2">4</ref>.</p><p>MorphemeBERT demonstrates a high level of precision, indicating that when it identifies a morpheme, it is likely correct. However, its recall is noticeably lower than that of Neural Morpheme Segmentation, which suggests that while it makes fewer errors, it also fails to detect a significant number of morphemes. 
This trade-off between precision and recall points to a more conservative approach in morpheme segmentation, where the model prioritizes accuracy over coverage. The F1 score of 0.9522, though still strong, highlights this imbalance between precision and recall, meaning the model performs well but lacks the comprehensive identification that would elevate its overall performance. The accuracy of 0.9581 reflects that the model is quite reliable in general, but its inability to capture as many correct morphemes as Neural Morpheme Segmentation affects its overall segmentation capability. This limitation might be due to how MorphemeBERT integrates BERT embeddings, which are optimized for context-rich predictions but may struggle with identifying morphemic boundaries in less straightforward or ambiguous cases, leading to more missed segments.</p><p>Morfessor FlatCat shows a considerably weaker performance compared to the other two models. While its precision score of 0.79744 is decent, meaning that the morphemes it identifies are mostly accurate, its recall is notably low. This indicates that the model misses a substantial number of morphemes, failing to capture the full complexity of word segmentation. The low recall suggests that Morfessor FlatCat struggles to identify many valid morphemic boundaries, which results in incomplete or inaccurate segmentations. Consequently, its F1 score (0.5033) and accuracy (0.7399) are significantly lower, suggesting that this system is less reliable for applications requiring high fidelity in morpheme segmentation.</p></div>
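<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with boundary-based evaluation, the sketch below shows how precision, recall, and F1 can be computed from gold and predicted segmentations. The exact evaluation protocol behind the reported scores is not detailed here, so the choice of micro-averaged boundary counts and the function names boundaries and boundary_prf are assumptions.</p>
<p><code lang="python"><![CDATA[
```python
def boundaries(morphs):
    """Return the set of character offsets where a morph boundary falls."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_prf(gold_segmentations, predicted_segmentations):
    """Micro-averaged boundary precision, recall and F1 over a word list."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_segmentations, predicted_segmentations):
        g, p = boundaries(gold), boundaries(pred)
        tp += len(g & p)  # boundaries found and correct
        fp += len(p - g)  # boundaries predicted but not in the gold standard
        fn += len(g - p)  # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: the prediction misses the boundary before the suffix.
gold = [["in", "felic", "ita"]]
pred = [["in", "felicita"]]
print(boundary_prf(gold, pred))  # precision 1.0, recall 0.5, F1 about 0.667
```
]]></code></p>
<p>The toy example reproduces the conservative pattern discussed above: every predicted boundary is correct, so precision is perfect, but one of the two gold boundaries is missed, which halves recall and lowers F1.</p></div>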
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Selection of Linguistic Features</head><p>Based on a thorough review of the literature on lexical complexity prediction <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b45">46]</ref>, we selected several linguistic features to analyze their impact on complexity. In addition to surface characteristics that are common in complexity studies and readability formulas, such as the number of letters, syllables, and vowels in a word, we identified other relevant parameters. One key factor is the frequency of a word, as more frequent words tend to be perceived as more familiar and thus less complex. We calculated it using the ItWac corpus <ref type="bibr" target="#b46">[47]</ref>. Another important parameter is the number of senses a word has, measured using the lexical resource ItalWordNet <ref type="bibr" target="#b47">[48]</ref>. Lastly, we recorded whether a word is a stop word, as identified by a spaCy model: stop words are common words that often carry little inherent meaning, and their presence can influence the perceived complexity of a sentence or text.</p><p>Given the focus of this study on morphological features' impact on lexical complexity, we concentrated on several key aspects related to the internal structure of words. These features could show how morphological traits contribute to word intricacy:</p><p>Number of morphemes: Morphemes are the smallest units of meaning in words, including affixes (prefixes and suffixes) and roots. The number of morphemes gives an indication of the information load of a word. Lexical items with more morphemes typically require more decoding effort from readers. We used our CNN-based segmentation model for automatic morphological segmentation and morpheme counting.</p><p>Morphological density: This quantitative metric is defined as the ratio of the number of morphemes to word length, offering a measure of how densely packed meaningful units are within a word. 
Higher morphological density can indicate more cognitive load, as each unit contributes distinct information, potentially raising the complexity of the word.</p><p>Frequency of the lexical morpheme: Lexical morphemes carry the core meaning of the word. Applying our morphological segmenter to the ItWac corpus <ref type="bibr" target="#b46">[47]</ref> enabled us to split each word into segments and aggregate the frequencies of individual morphemes. This frequency, transformed using a logarithmic scale, helps predict complexity by leveraging the familiarity of frequently occurring morphemes. The use of lexical morpheme frequency as a complexity indicator is based on the idea that even if a word is unfamiliar as a whole, its component morphemes may be common in the language and more recognizable <ref type="bibr" target="#b48">[49]</ref>.</p><p>By integrating these morphological features with other linguistic traits typically considered influential in speakers' perception of complexity, we aim to assess their impact on predicting linguistic complexity<ref type="foot" target="#foot_3">5</ref>.</p></div>
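<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three morphological features defined above can be computed directly from a segmented word, as the following minimal sketch illustrates. Several details are assumptions rather than the procedure of this study: the position of the lexical morpheme is passed in by hand (lexical_index), the corpus counts are a made-up dictionary, and the logarithmic transformation is taken to be the natural logarithm with add-one smoothing.</p>
<p><code lang="python"><![CDATA[
```python
import math


def morphological_features(morphs, lexical_index, morpheme_counts):
    """Compute morpheme count, morphological density and log lexical frequency.

    morphs:          surface segmentation of one word, e.g. ["mangi", "avano"]
    lexical_index:   position of the lexical morpheme in `morphs` (assumed known)
    morpheme_counts: corpus frequency of each morph (hypothetical values here)
    """
    word_length = sum(len(m) for m in morphs)
    lexical = morphs[lexical_index]
    return {
        "n_morphemes": len(morphs),
        # ratio of the number of morphemes to word length in characters
        "morph_density": len(morphs) / word_length,
        # add-one smoothing keeps unseen morphemes at log(1) = 0
        "lexical_log_freq": math.log(morpheme_counts.get(lexical, 0) + 1),
    }


counts = {"mangi": 99}  # made-up count, for illustration only
features = morphological_features(["mangi", "avano"], 0, counts)
print(features["n_morphemes"], features["morph_density"])  # 2 0.2
```
]]></code></p>
<p>Because density divides the morpheme count by word length, a long word with few morphemes and a short word with many can receive very different values even when their morpheme counts are similar.</p></div>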
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Analysis and discussion</head><p>Through studying the correlations between these variables, we seek to determine whether morphological measures can be effectively used to develop systems capable of automatically identifying word complexity. To achieve this, we conducted a correlation and significance analysis between the features discussed earlier and the perceived complexity values for the 600 words included in MultiLS-IT. Table <ref type="table" target="#tab_1">2</ref> presents the Spearman correlation coefficients and their statistical significance for the features calculated<ref type="foot" target="#foot_4">6</ref>. The correlation analysis reveals several important insights.</p><p>Word length, number of vowels, and number of syllables all have small but statistically significant positive correlations with complexity. This suggests that, as expected, longer words with more vowels and syllables tend to be perceived as more complex. These factors are typical in readability studies, where more phonologically complex words are generally harder to process.</p><p>The number of morphemes also shows a positive correlation with complexity, reinforcing the idea that words with more morphemes are perceived as more complex. This feature is statistically significant as well.</p><p>Negative correlations for senses_ID, stopword presence, and lemma frequency suggest that words with more senses, those that are stopwords, or those that are more frequently used are perceived as less complex. These features are also statistically significant. It is noteworthy that the number of senses (senses_ID) is negatively correlated with complexity. 
This could be attributed to the incompleteness of ItalWordNet, potentially leading to unreliable predicted values.</p><p>Morphological density, however, does not show a statistically significant correlation with complexity, suggesting that the ratio of morphemes to word length may not be a strong predictor of perceived complexity.</p><p>The lexical morpheme frequency shows a significant negative correlation with complexity, indicating that more frequently occurring morphemes contribute to lower perceived complexity. This supports the notion that familiar morphemes, even within otherwise complex words, aid in comprehension.</p><p>These findings underscore the importance of considering a range of linguistic features, including morphological traits, when assessing lexical complexity. By integrating these features into computational models, we can enhance their ability to accurately predict word complexity and, subsequently, improve lexical simplification.</p></div>
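<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a reminder of what the reported coefficients measure, the sketch below computes a Spearman correlation as the Pearson correlation of average ranks, using only the standard library. It is an illustrative implementation with hypothetical function names and toy data, not the code used for the analysis, and it does not compute p-values.</p>
<p><code lang="python"><![CDATA[
```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Toy data: complexity falls as (log) frequency rises, so rho is -1.
log_frequency = [9.1, 7.4, 5.0, 2.3]
complexity = [0.05, 0.20, 0.45, 0.80]
print(spearman(log_frequency, complexity))  # -1.0
```
]]></code></p>
<p>Because the coefficient depends only on ranks, it captures monotonic relationships such as the one between frequency and perceived complexity without assuming that the relationship is linear.</p></div>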
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This study highlights the significance of integrating morphological features into automatic models to enhance the comprehension and prediction of lexical complexity. The high performance of the Neural Morpheme Segmentation model demonstrates the efficacy of convolutional neural networks in capturing the detailed patterns of morphological segmentation in the Italian language.</p><p>The correlation analysis reveals that while traditional metrics like word length and frequency are valuable predictors of complexity, incorporating morphological features provides additional insights that enrich our understanding of lexical complexity. Notably, the positive correlation between the number of morphemes and perceived complexity suggests that words with more morphemes are inherently more complex. Conversely, frequent lexical morphemes tend to reduce perceived complexity, highlighting the importance of familiarity in complexity perception. Our study also emphasizes the need for diverse linguistic features, including both surface characteristics and morphological traits, to create more robust and accurate models for predicting word complexity. The statistically significant correlations for most features validate their relevance in complexity prediction. However, it is important to note that our findings are based on a relatively small dataset of annotated complexity perceptions. To obtain more robust and generalizable results, it would be highly beneficial to have access to a larger and more diverse dataset of complexity annotations. 
Expanding the dataset to include a wider variety of texts and contexts would enhance the reliability of the correlations observed and improve the training and evaluation of automatic complexity prediction models.</p><p>Future research should focus on gathering more extensive annotated datasets and exploring additional linguistic features that may influence complexity perception. By doing so, we can further refine our models and develop more effective tools for lexical simplification and other applications aimed at improving text accessibility.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Correlation of complexity values.</figDesc><graphic coords="6,131.17,84.19,332.94,206.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Spearman correlation coefficients and p-values for features and complexity. Note: * indicates statistical significance.</figDesc><table><row><cell>Feature</cell><cell>Correlation</cell><cell>p-value</cell></row><row><cell>Length</cell><cell>0.082</cell><cell>0.045*</cell></row><row><cell>Number of vowels</cell><cell>0.097</cell><cell>0.018*</cell></row><row><cell>Number of syllables</cell><cell>0.091</cell><cell>0.026*</cell></row><row><cell>Number of Morphemes</cell><cell>0.112</cell><cell>0.006*</cell></row><row><cell>Senses_ID</cell><cell>-0.277</cell><cell>0.000*</cell></row><row><cell>Stopword</cell><cell>-0.124</cell><cell>0.003*</cell></row><row><cell>Lemma Frequency</cell><cell>-0.467</cell><cell>0.000*</cell></row><row><cell>Morphological Density</cell><cell>0.036</cell><cell>0.381</cell></row><row><cell>Lexical morpheme frequency</cell><cell>-0.333</cell><cell>0.000*</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">The resource is available at https://github.com/MLSP2024/MLSP_Data.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">Initially, we aimed to manually review the entire dataset to address any inconsistencies and overlooked segments. However, due to time constraints, we opted to reduce the dataset by randomly selecting 30% of the entries for each letter.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">This model is available upon request. Please contact the author directly to access the model and relevant references.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">For a detailed analysis of how these parameters were processed, refer to Occhipinti 2024.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">Spearman's rank correlation was chosen because it does not assume a linear relationship between variables, making it more suitable for our dataset, where the relationships between features like word length, number of morphemes, and word complexity may not follow a strictly linear pattern. Spearman's correlation measures whether an increase in one variable tends to be consistently associated with an increase (or decrease) in another, which is more appropriate given the nature of our linguistic features.</note>
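			<note xmlns="http://www.tei-c.org/ns/1.0" type="example"><p>The property described in the note on Spearman's rank correlation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the analysis code used in the study: the function names and the toy values are hypothetical, and no p-values are computed. The coefficient is obtained as the Pearson correlation of the ranks, with tied values receiving the mean of their rank positions.</p><p>
```python
from itertools import groupby
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    position = 0
    for _, group in groupby(order, key=lambda i: values[i]):
        tied = list(group)
        mean_rank = position + (len(tied) + 1) / 2
        for i in tied:
            ranks[i] = mean_rank
        position += len(tied)
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: complexity grows monotonically but not linearly with length.
lengths = [2, 4, 6, 8, 10]
complexity = [0.1, 0.2, 0.4, 0.8, 1.6]
rho = spearman(lengths, complexity)
print(rho, pearson(lengths, complexity))
```
</p><p>On this toy data the rank-based coefficient reaches 1 because the relationship is perfectly monotonic, whereas the Pearson coefficient on the raw values stays below 1 because the relationship is not linear.</p></note>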
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Morphology and the internal structure of words</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Jamison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Matthews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Gonnerman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page" from="14984" to="14988" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A comparative study of minimally supervised morphological segmentation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kohonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sirts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-A</forename><surname>Grönroos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Virpioja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="91" to="120" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Mielke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Salesky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gallé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.10508</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Neural machine translation of rare words with subword units</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sennrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Haddow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P16-1162</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 54th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1715" to="1725" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Byte pair encoding is suboptimal for language model pretraining</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bostrom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4617" to="4624" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Fast WordPiece tokenization</title>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Salcianu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dopson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2089" to="2103" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The SIGMORPHON 2016 shared task: Morphological reinflection</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kirov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sylak-Glassman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yarowsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eisner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hulden</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology</title>
				<meeting>the 14th SIGMORPHON workshop on computational research in phonetics, phonology, and morphology</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="10" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Computational assessment of text readability: A survey of current and future research</title>
		<author>
			<persName><forename type="first">K</forename><surname>Collins-Thompson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ITL-International Journal of Applied Linguistics</title>
		<imprint>
			<biblScope unit="volume">165</biblScope>
			<biblScope unit="page" from="97" to="135" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">U</forename><surname>Dressler</surname></persName>
		</author>
		<title level="m">Ricchezza e complessità morfologica</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="1000" to="1011" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Morfologia</title>
		<author>
			<persName><forename type="first">S</forename><surname>Scalise</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1994">1994</date>
			<publisher>il Mulino</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Segmentation and morphology</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Goldsmith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The handbook of computational linguistics and natural language processing</title>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Wiley Online Library</publisher>
			<biblScope unit="page" from="364" to="393" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The discovery of segments in natural language</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Wolff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">British Journal of Psychology</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page" from="97" to="106" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Identifying hierarchical structure in sequences: A linear-time algorithm</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">G</forename><surname>Nevill-Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="67" to="82" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Unsupervised word segmentation for Sesotho using adaptor grammars</title>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology</title>
				<meeting>the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="20" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">From phoneme to morpheme</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Harris</surname></persName>
		</author>
		<ptr target="http://www.jstor.org/stable/411036" />
	</analytic>
	<monogr>
		<title level="j">Language</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="190" to="222" />
			<date type="published" when="1955">1955</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An unsupervised algorithm for segmenting categorical timeseries into episodes</title>
		<author>
			<persName><forename type="first">P</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Heeringa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Adams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Pattern Detection and Discovery: ESF Exploratory Workshop</title>
				<meeting>Pattern Detection and Discovery: ESF Exploratory Workshop<address><addrLine>London</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="49" to="62" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Deep convolutional networks for supervised morpheme segmentation of Russian language</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sorokin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kravtsova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 7th International Conference in Artificial Intelligence and Natural Language (AINL 2018)</title>
				<meeting>7th International Conference in Artificial Intelligence and Natural Language (AINL 2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Unsupervised models for morpheme segmentation and morphology learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Creutz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lagus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Speech and Language Processing (TSLP)</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="1" to="34" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Unsupervised morphological segmentation with log-linear models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cherry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics</title>
				<meeting>Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="209" to="217" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Minimally-supervised morphological segmentation using adaptor grammars</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sirts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goldwater</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="255" to="266" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Harris</surname></persName>
		</author>
		<title level="m">Morpheme Boundaries within Words: Report on a Computer Test</title>
		<imprint>
			<publisher>Springer Netherlands</publisher>
			<date type="published" when="1970">1970</date>
			<biblScope unit="page" from="68" to="77" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Supervised morphological segmentation in a low-resource learning setting using conditional random fields</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kohonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Virpioja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventeenth Conference on Computational Natural Language Learning</title>
				<meeting>the Seventeenth Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="29" to="37" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Unsupervised learning of the morphology of a natural language</title>
		<author>
			<persName><forename type="first">J</forename><surname>Goldsmith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="153" to="198" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">An algorithm for the unsupervised learning of morphology</title>
		<author>
			<persName><forename type="first">J</forename><surname>Goldsmith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="353" to="371" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Unsupervised discovery of morphemes</title>
		<author>
			<persName><forename type="first">M</forename><surname>Creutz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lagus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning</title>
				<meeting>the ACL-02 Workshop on Morphological and Phonological Learning</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="21" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Morfessor in the Morpho Challenge</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J P</forename><surname>Creutz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Lagus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes</title>
				<meeting>the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="12" to="17" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Semi-supervised morpheme segmentation without morphological analysis</title>
		<author>
			<persName><forename type="first">Ö</forename><surname>Kılıç</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bozsahin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the workshop on language resources and technologies for Turkic languages</title>
				<meeting>the workshop on language resources and technologies for Turkic languages</meeting>
		<imprint>
			<publisher>LREC</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="52" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Painless semi-supervised morphological segmentation using conditional random fields</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kohonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Virpioja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Short Papers</title>
		<meeting>the 14th Conference of the European Chapter of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="84" to="89" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McCallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Morfessor FlatCat: An HMM-based method for unsupervised and semi-supervised learning of morphology</title>
		<author>
			<persName><forename type="first">S.-A</forename><surname>Grönroos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Virpioja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kurimo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics</title>
				<meeting>COLING 2014, the 25th International Conference on Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1177" to="1185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Introduction to semi-supervised learning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Goldberg</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>Springer Nature</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sorokin</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W19-4218</idno>
		<ptr target="https://aclanthology.org/W19-4218" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology</title>
				<meeting>the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="154" to="159" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Morphological segmentation with window LSTM neural networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>de Melo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2842" to="2848" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Labeled morphological segmentation with semi-Markov models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fraser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Nineteenth Conference on Computational Natural Language Learning</title>
				<meeting>the Nineteenth Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="164" to="174" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Una piattaforma di morfologia computazionale per l&apos;analisi e la generazione delle parole italiane</title>
		<author>
			<persName><forename type="first">M</forename><surname>Battista</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pirrelli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>ILC-CNR</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Morph-it! a free corpus-based morphological resource for the Italian language</title>
		<author>
			<persName><forename type="first">E</forename><surname>Zanchetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Corpus Linguistics Conference Series 2005</title>
				<meeting>Corpus Linguistics Conference Series 2005</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Computational Linguistic Text Processing: Lexicon, Grammar, Parsing and Anaphora Resolution</title>
		<author>
			<persName><forename type="first">R</forename><surname>Delmonte</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Nova Science Publishers</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">The TextPro tool suite</title>
		<author>
			<persName><forename type="first">E</forename><surname>Pianta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Girardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanoli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC&apos;08)</title>
				<meeting>the Sixth International Conference on Language Resources and Evaluation (LREC&apos;08)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="2603" to="2607" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">AnIta: a powerful morphological analyser for Italian</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Melandri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC&apos;12)</title>
				<meeting>the Eighth International Conference on Language Resources and Evaluation (LREC&apos;12)</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="941" to="947" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Derivatario: An annotated lexicon of Italian derivatives</title>
		<author>
			<persName><forename type="first">L</forename><surname>Talamo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Celata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Bertinetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Word Structure</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="72" to="102" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">An extensible massively multilingual lexical simplification pipeline dataset using the MultiLS framework</title>
		<author>
			<persName><forename type="first">M</forename><surname>Shardlow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Batista-Navarro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Calderon</forename><surname>Ramirez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cardon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>François</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hayakawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Horbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hülsing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Imperial</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nohejl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>North</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Occhipinti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Raihan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Solis</forename><surname>Salazar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.readi-1.4" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Wilkens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Cardon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Todirascu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Gala</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="38" to="46" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">The BEA 2024 shared task on the multilingual lexical simplification pipeline</title>
		<author>
			<persName><forename type="first">M</forename><surname>Shardlow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Batista-Navarro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Calderon</forename><surname>Ramirez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cardon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>François</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hayakawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Horbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hülsing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Imperial</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nohejl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>North</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Occhipinti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">P</forename><surname>Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Raihan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ranasinghe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Salazar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Štajner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.bea-1.51" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kochmar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Bexte</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Horbach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Laarmann-Quante</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tack</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Yaneva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</editor>
		<meeting>the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)<address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="571" to="589" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">A decade of morphology and word formation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Aronoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annual Review of Anthropology</title>
		<imprint>
			<biblScope unit="page" from="355" to="375" />
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Improving morpheme segmentation using BERT embeddings</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sorokin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Analysis of Images, Social Networks and Texts</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="148" to="161" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Lexical complexity prediction: An overview</title>
		<author>
			<persName><forename type="first">K</forename><surname>North</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zampieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shardlow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="42" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">The WaCky wide web: a collection of very large linguistically processed web-crawled corpora</title>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bernardini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ferraresi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zanchetta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="209" to="226" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">ItalWordNet: a large semantic database for Italian</title>
		<author>
			<persName><forename type="first">A</forename><surname>Roventini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alonge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bertagna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000)</title>
				<meeting>the Second International Conference on Language Resources and Evaluation (LREC-2000)</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="783" to="790" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Words and morphemes as units for lexical access</title>
		<author>
			<persName><forename type="first">P</forename><surname>Colé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Segui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Memory and Language</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="312" to="330" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Complex word identification for Italian language: a dictionary-based approach</title>
		<author>
			<persName><forename type="first">L</forename><surname>Occhipinti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sixth International Conference on Computational Linguistics in Bulgaria</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="119" to="129" />
		</imprint>
	</monogr>
	<note>Proceedings of Clib24</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
