=Paper=
{{Paper
|id=Vol-2664/alexs_paper1
|storemode=property
|title=Vicomtech at ALexS 2020: Unsupervised Complex Word Identification Based on Domain Frequency
|pdfUrl=https://ceur-ws.org/Vol-2664/alesx_paper1.pdf
|volume=Vol-2664
|authors=Elena Zotova,Montse Cuadros,Naiara Perez,Aitor García-Pablos
|dblpUrl=https://dblp.org/rec/conf/sepln/ZotovaCPP20
}}
==Vicomtech at ALexS 2020: Unsupervised Complex Word Identification Based on Domain Frequency==
Elena Zotova, Montse Cuadros, Naiara Perez and Aitor García-Pablos
SNLT group at Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Mikeletegi Pasealekua 57, Donostia/San-Sebastián, 20009, Spain

email: ezotova@vicomtech.org (E. Zotova); mcuadros@vicomtech.org (M. Cuadros); nperez@vicomtech.org (N. Perez); agarciap@vicomtech.org (A. García-Pablos)
orcid: 0000-0002-8350-1331 (E. Zotova); 0000-0002-3620-1053 (M. Cuadros); 0000-0001-8648-0428 (N. Perez); 0000-0001-9882-7521 (A. García-Pablos)

Abstract
This paper introduces Vicomtech's systems for unsupervised complex word identification submitted to the ALexS "Análisis Léxico en la SEPLN 2020" task. The systems are based on clustering algorithms with domain-specific features, such as word frequency and probability in several Wikipedia corpora, word length, and number of synsets in WordNet. Our systems are designed to identify complex words while taking into account the occurrence of each word in domain-specific texts, so that they can adapt to the domain. Our systems reported good results, ranking in second position.

Keywords
Complex Word Identification, Lexical Simplification, Unsupervised Learning

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), September 2020, Málaga, Spain.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, http://ceur-ws.org), ISSN 1613-0073.

1. Introduction

Complex word identification (CWI) is one of the steps in the process of lexical simplification, which helps learners and children with text comprehension [1]. CWI and the subsequent substitution of the complex words may significantly improve the readability and understandability of a given text. CWI is a relatively recent area of interest, with at least two shared tasks focused on it in the past years: the CWI Shared Tasks at SemEval 2016 [2] and NAACL-HLT 2018 [3]. Both challenges set out the problem of detecting words that are difficult to understand for non-native speakers, and introduce annotated corpora in English, Spanish, German and French in order to develop supervised machine learning systems to that end. The majority of the CWI systems presented in those shared tasks explore a large number of features: morphological, lexical, semantic, collocational, syntactic, psycho-linguistic, etc. For instance, the winners in 2016 [4] and 2018 [5] leverage 69 and 27 features, respectively.

ALexS "Análisis Léxico en la SEPLN 2020" [6] is the first shared task on lexical analysis of university educational texts in Spanish. The organizers of the task propose to implement automatic systems that identify difficult words in the texts, with the following key challenges:

• The difficult terms have to be within the scope of academic content, which implies that many technical terms should be excluded because they are commonly used in the domain; that is, the systems should be able to adapt to the domains/university subjects in the dataset.
• The task has been evaluated according to the manual annotations of the organizers over the given corpus. The corpus was released without the annotations and no annotation scheme has been published; thus, the participants have had to elaborate their own definition of 'complex word'.

• As no training data has been released, all the automatic systems are expected to use unsupervised learning algorithms.

The VYTEDU corpus [7], provided as the dataset for this task, consists of 55 transcriptions of videotaped lectures at the University of Guayaquil (Ecuador). The lectures are delivered in Spanish and belong to different subjects, such as botany, psychology, economy, programming, architecture, etc. The purpose of the corpus is to analyze the lexical complexity of written and oral text, and to develop an automatic text simplification system. Table 1 provides a quantitative description of the dataset.

Table 1
Corpus characteristics

  Transcriptions (documents)    55
  Subjects                      25
  Words                         68,414
  Complex words                 1,084
  Words/document, avg           1,244
  Words/document, max           2,646
  Words/document, min           465

This paper is organized as follows. Section 2 describes the proposed systems for automatic complex word identification, covering feature engineering (Subsection 2.1) and clustering methods (Subsection 2.2). We then report the results of the task and the performance of the proposed systems in Section 3. Finally, we draw conclusions in Section 4.

2. Systems Description

This section describes the automatic CWI systems with which the reported results have been obtained. First, we explain the features used for training the systems; then, we explain the process of building a system with unsupervised learning methods, in this case clustering, to solve the task.

Previous studies [8, 9, 10] proved that a word's difficulty is closely related to its frequency in a corpus, so frequency can be used to detect complex words automatically. Words that occur rarely may be new to a reader or listener and thus not easy to understand. However, the frequency of a word may vary among domains, and the word's understandability may vary with the reader's or listener's familiarity with the topic. For instance, the word 'inflation' may be easy to understand for those who have a thorough knowledge of finance, but difficult for a student that has just started studying economics. Our CWI system therefore takes into account word frequency not only in a given text or in the corpus, but also in the domain of the target text.

2.1. Features

Several word-level features have been selected for use in the clustering process. These features are based on the length, frequencies and probabilities of the words and their lemmas in large corpora, in domain-related corpora, and in the VYTEDU documents. Word or lemma frequency is the count of all occurrences in a given corpus. Word or lemma probability is the ratio of the frequency of a word/lemma to the total number of words/lemmas in the corpus. Frequency and probability in a large corpus are expected to separate out words commonly used in all domains, while the domain-specific metrics capture well-known terminology of the domain.

We have used the Spanish Wikipedia data dump and category list from March 20, 2020 [11] and WikiExtractor [12] to obtain a large corpus of text in Spanish. We have also extracted domain-specific corpora related to each subject in the VYTEDU dataset. We first manually mapped Wikipedia categories to the 25 subjects encountered in the dataset (e.g., Contabilidad, accounting; Investigación, research; etc.). Then, we selected all the Wikipedia articles in each category and its immediate sub-categories in the category hierarchy. The resulting domain-specific corpora contain 71 to 1,432 articles, depending on how broadly the topic of the subject is represented in Wikipedia.
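As a rough illustration of the frequency and probability features defined above, the following minimal sketch computes lemma frequency and probability from a tokenized and lemmatized corpus with a simple counter. This is our own reconstruction under stated assumptions, not the authors' released code, and the function name lemma_stats is hypothetical.

```python
from collections import Counter
from typing import Iterable

def lemma_stats(documents: Iterable[list[str]]) -> dict[str, tuple[int, float]]:
    """Compute (frequency, probability) per lemma over a corpus.

    `documents` is an iterable of lemmatized documents, each a list of lemmas.
    Probability is the lemma frequency divided by the total number of lemmas
    in the corpus, as defined in Section 2.1.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    total = sum(counts.values())
    return {lemma: (freq, freq / total) for lemma, freq in counts.items()}

# Toy example: a "domain corpus" of two lemmatized documents.
domain_corpus = [
    ["contabilidad", "activo", "pasivo", "activo"],
    ["activo", "balance", "contabilidad"],
]
stats = lemma_stats(domain_corpus)
print(stats["activo"])  # (3, 3/7): frequency 3, probability ~0.43
```

The same routine can in principle be run over the whole Wikipedia corpus (using surface words instead of lemmas) and over each domain-specific corpus to obtain the corresponding frequency and probability tables.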
The whole Wikipedia corpus has been tokenized, and the domain-specific corpora and VYTEDU documents have been tokenized and also lemmatized. All the pre-processing has been done with spaCy's [13] statistical model es_core_news_sm, pretrained on AnCora and Spanish WikiNER. The features are explained below.

• Lemma length: The length of each word's lemma is calculated. Word length is a common measure of text complexity; for instance, average word length is used in the Flesch-Kincaid readability tests [14], and the Automated Readability Index [15] takes into account the proportion of words and characters.

• Lemma frequency in the subject documents: We calculate the frequency of all lemmas in the documents of a given subject in the VYTEDU corpus.

• Number of synsets in WordNet: We take the number of synsets in the Spanish WordNet [16] for each lemma using the NLTK toolkit [17]. According to [18], older (and therefore better known) words are known to be more polysemous than recent words, and frequently used words are more polysemous than infrequent ones. Hence, we assume that the more polysemous a word is, the less complex it is likely to be.

• Lemma frequency in domain corpora: We calculate the frequency of all lemmas in the domain-specific corpora.

• Lemma probability in domain corpora: We calculate the probability of all lemmas in the domain-specific corpora.

• Word frequency in Wikipedia: We calculate the frequency of all words in Wikipedia.

• Word probability in Wikipedia: We calculate the probability of all words in Wikipedia.

All features were normalized from 0 to 1. The workflow of the process is shown in Figure 1.

Figure 1: The process of feature engineering using large corpora from Wikipedia

2.2. Clustering

We use clustering as the unsupervised learning method to classify the words in the documents of the VYTEDU corpus. The clustering has been carried out with each domain-related dataset separately, grouping the documents by subject, in order to obtain domain-specific complex words.

As a pre-processing step, we have automatically extracted candidate complex words from the VYTEDU dataset. First, we have computed a bag of words by tokenizing and lemmatizing the dataset. Next, we have filtered out stopwords and words that do not belong to meaningful parts of speech, namely nouns, verbs, adjectives and adverbs. We argue that the removed words are never complex, regardless of the context they occur in. Next, we have lemmatized all the target words.

Then, the clustering has been carried out as follows. We have assigned a feature vector to each target word from the pre-calculated values explained in Section 2.1. We have used the lemma of the word for some of the features: lemma length, lemma frequency in the subject-grouped documents of the VYTEDU corpus, number of synsets in WordNet, and frequency and probability in the domain-related corpus. In addition, we have used the word itself to assign word frequency and probability in Wikipedia. The resulting feature matrix has size 7 × 28,798, i.e., the number of features times the total number of candidate complex words.
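The candidate extraction and feature assembly step just described can be sketched as follows. This is an illustrative reconstruction, not the system's actual implementation: spaCy's es_core_news_sm model filters candidate words by part of speech and stopword status, NLTK's interface to the Spanish WordNet (which requires the wordnet and omw-1.4 data packages) supplies synset counts, and scikit-learn's MinMaxScaler performs the 0-to-1 normalization. The dictionaries of pre-computed corpus statistics (subj_freq, dom_freq, dom_prob, wiki_freq, wiki_prob) are assumed to come from routines such as the frequency sketch above; their names are ours.

```python
import numpy as np
import spacy
from nltk.corpus import wordnet as wn  # needs nltk.download("wordnet") and nltk.download("omw-1.4")
from sklearn.preprocessing import MinMaxScaler

nlp = spacy.load("es_core_news_sm")
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # nouns, verbs, adjectives, adverbs

def extract_candidates(text: str) -> list[tuple[str, str]]:
    """Return (word, lemma) pairs for candidate complex words:
    alphabetic content words that are not stopwords."""
    return [
        (tok.text.lower(), tok.lemma_.lower())
        for tok in nlp(text)
        if tok.pos_ in CONTENT_POS and not tok.is_stop and tok.is_alpha
    ]

def build_feature_matrix(candidates, subj_freq, dom_freq, dom_prob, wiki_freq, wiki_prob):
    """Assemble the 7 word-level features per candidate and scale them to [0, 1].

    subj_freq, dom_freq and dom_prob are keyed by lemma; wiki_freq and wiki_prob
    by surface word. All are plain dicts of pre-computed corpus statistics.
    """
    rows = []
    for word, lemma in candidates:
        rows.append([
            len(lemma),                          # lemma length
            subj_freq.get(lemma, 0),             # lemma frequency in subject documents
            len(wn.synsets(lemma, lang="spa")),  # number of synsets in Spanish WordNet
            dom_freq.get(lemma, 0),              # lemma frequency in domain corpus
            dom_prob.get(lemma, 0.0),            # lemma probability in domain corpus
            wiki_freq.get(word, 0),              # word frequency in Wikipedia
            wiki_prob.get(word, 0.0),            # word probability in Wikipedia
        ])
    return MinMaxScaler().fit_transform(np.array(rows, dtype=float))
```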
In order to select the clustering methods and adequate parameters, we previously conducted various experiments with the annotated Spanish corpus from the 2018 CWI shared task [2]. As a result, we have applied the two best-performing clustering methods, Gaussian Mixture Models (GMM) [19] and K-means [20], to form two clusters, with the intention that one of the clusters comprises complex words and the other simple words. The GMM model uses spherical covariance. The K-means model uses Elkan's algorithm [21]. Both models have been implemented in Python with the scikit-learn framework [22].

Once the clusters have been established, the system must be able to assign a label (i.e., simple or complex) to each of them. On the grounds that less frequent, less probable and less polysemous words are more likely to be difficult, we automatically assign the label 'complex' to the words in the cluster with the lowest average feature value. The entire workflow of the clustering process is depicted in Figure 2. A minimal sketch of this clustering and labelling step, together with the G-score computation, is given at the end of Section 3.

Figure 2: The workflow of the clustering process

3. Results

We have presented three runs of the system:

• Run 1: A GMM clustering model trained on all the features (described in Section 2.1).

• Run 2: A K-means clustering model trained on all the features.

• Run 3: A K-means clustering model trained on all the features except the number of synsets in WordNet.

The systems have been evaluated in terms of Accuracy, Precision, Recall and F1-score, F1-score being the key metric used to compare classification systems. In addition, we provide one more metric, the G-score, which is the harmonic mean of Accuracy and Recall. The G-score was used in the previous shared tasks [2] because in the task of CWI it is important to minimise the number of false negatives (i.e., complex words being identified as simple) and false positives (i.e., simple words being identified as complex), and also to maximise the number of true positives (i.e., complex words identified as complex). One way to measure whether a system achieves these goals is to give more emphasis to Accuracy, to account for the former two, and to Recall, for the latter. This measure takes into account that the complex class is a minority class in any text and therefore gives less weight to the false positives. It is possible to tune the system towards higher recall by regulating the proportion between complex words and all words of the corpus.

The results of the proposed systems are shown in Table 2. The challenging nature of the task is reflected in the performance of the systems of all participants. High accuracy with low precision is a result of the imbalanced dataset, in which the positive class is underrepresented: there were only 1,084 complex words out of 68,414, that is, 1.6% of all the words of the corpus.

Table 2
Results of the proposed systems

  System      Accuracy  Precision  Recall  F1 score  G score  # CW
  Run 1       90.55      9.68      59.69   16.66     71.95    6,682
  Run 2       88.68      9.18      68.82   16.20     77.50    8,123
  Run 3       91.29     10.40      59.32   17.70     71.91    6,158
  Organizers  92.17     12.32      65.50   20.74     76.58    5,794
  Best        98.25     34.16      22.67   27.25     36.84      726

Our best performing system in terms of F1 score (Run 3) is built without the WordNet feature; thus, we can conclude that this feature is not as relevant for this task. If we take the G-score into consideration, the most proficient system is Run 2, which uses all the available features and the K-means clustering method.
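To make the clustering, cluster labelling and G-score computation described above concrete, the following sketch (our own reconstruction under stated assumptions, not the submitted code) applies scikit-learn's GaussianMixture with spherical covariance or KMeans with Elkan's algorithm to the normalized feature matrix, labels the cluster with the lowest average feature value as 'complex', and computes the G-score as the harmonic mean of Accuracy and Recall.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, recall_score

def cluster_complex_words(X: np.ndarray, method: str = "kmeans") -> np.ndarray:
    """Cluster the normalized feature matrix X into two groups and return a
    binary array where 1 marks the 'complex' cluster, i.e. the cluster whose
    features have the lowest average value."""
    if method == "gmm":
        model = GaussianMixture(n_components=2, covariance_type="spherical", random_state=0)
    else:
        model = KMeans(n_clusters=2, algorithm="elkan", n_init=10, random_state=0)
    assignments = model.fit_predict(X)
    # Mean feature value per cluster; the lower-mean cluster is labelled complex.
    cluster_means = [X[assignments == c].mean() for c in (0, 1)]
    complex_cluster = int(np.argmin(cluster_means))
    return (assignments == complex_cluster).astype(int)

def g_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Harmonic mean of Accuracy and Recall, as used to compare CWI systems."""
    acc = accuracy_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    return 2 * acc * rec / (acc + rec) if (acc + rec) > 0 else 0.0

# Usage sketch (X from the feature-assembly sketch; y_true would come from gold annotations):
# y_pred = cluster_complex_words(X, method="kmeans")
# print(g_score(y_true, y_pred))
```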
4. Conclusions

Automatic complex word detection with unsupervised methods is a highly challenging task for various reasons. First of all, the definition of a complex word is very subjective and depends on the annotators, their level of education, whether they are familiar with the domain, and so on. Secondly, no criteria have been provided about what a complex word is in this particular task. That is why our system is based on the intuition that the frequency of a word in a given corpus and its polysemy in WordNet may provide significant information. Lastly, the task of CWI always deals with unbalanced data, where the proportion of complex words depends on the level of difficulty of the text. Our systems consider the domain of the document, but not its difficulty.

Our systems leverage a semi-automatic process of corpus extraction from Wikipedia, based on a manual mapping of the subjects. One possible improvement could be to create an automatic topic modeling system. This method could also be used to create silver labels for further training of supervised learning models.

Acknowledgments

This work has been supported by Vicomtech and partially funded by the project DeepReading (RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE).

References

[1] M. Shardlow, A Comparison of Techniques to Automatically Identify Complex Words, in: 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 103–109.

[2] G. Paetzold, L. Specia, SemEval 2016 Task 11: Complex Word Identification, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, San Diego, California, 2016, pp. 560–569. doi:10.18653/v1/S16-1085.

[3] S. M. Yimam, C. Biemann, S. Malmasi, G. Paetzold, L. Specia, S. Štajner, A. Tack, M. Zampieri, A Report on the Complex Word Identification Shared Task 2018, in: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 66–78. doi:10.18653/v1/W18-0507.

[4] G. Paetzold, L. Specia, SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, San Diego, California, 2016, pp. 969–974. doi:10.18653/v1/S16-1149.

[5] S. Gooding, E. Kochmar, CAMB at CWI Shared Task 2018: Complex Word Identification with Ensemble-Based Voting, in: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 184–194. doi:10.18653/v1/W18-0520.

[6] J. Ortiz Zambrano, A. Montejo-Ráez, ALexS 2020: Lexical Analysis Task at SEPLN, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR-WS, Malaga, Spain, 2020.

[7] J. A. Ortiz Zambrano, A. Montejo-Ráez, VYTEDU: Un Corpus de Vídeos y sus Transcripciones para Investigación en el Ámbito Educativo, Procesamiento del Lenguaje Natural 59 (2017) 167–170.

[8] L. Specia, S. K. Jauhar, R. Mihalcea, SemEval-2012 Task 1: English Lexical Simplification, in: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 2012, pp. 347–355.

[9] G. H. Paetzold, L. Specia, A Survey on Lexical Simplification, Journal of Artificial Intelligence Research 60 (2017) 549–593. doi:10.1613/jair.5526.

[10] T. Kajiwara, M. Komachi, Complex Word Identification Based on Frequency in a Learner Corpus, in: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 195–199. doi:10.18653/v1/W18-0521.
[11] Wikimedia, Wikimedia Downloads, 2020. URL: https://dumps.wikimedia.org/.

[12] G. Attardi, Wikiextractor, 2015. URL: https://github.com/attardi/wikiextractor.

[13] spaCy.io, spaCy, 2016. URL: https://spacy.io/.

[14] R. Flesch, A New Readability Yardstick, The Journal of Applied Psychology 32 (1948) 221–233. doi:10.1037/h0057532.

[15] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, Naval Technical Training Command Millington TN Research Branch (1975).

[16] A. Fernández-Montraveta, G. Vázquez, C. Fellbaum, The Spanish Version of WordNet 3.0, Text Resources and Lexical Knowledge, Mouton de Gruyter (2008) 175–182.

[17] S. Bird, E. Loper, E. Klein, Natural Language Processing with Python, 2009. URL: https://www.nltk.org.

[18] C. J. Lee, Some Hypotheses Concerning the Evolution of Polysemous Words, Journal of Psycholinguistic Research 19 (1990) 211–219. doi:10.1007/BF01077257.

[19] G. J. McLachlan, K. E. Basford, Mixture Models: Inference and Applications to Clustering, volume 84, Marcel Dekker, 1988. doi:10.2307/2348072.

[20] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, Berkeley, Calif., 1967, pp. 281–297.

[21] C. Elkan, Using the Triangle Inequality to Accelerate K-Means, in: Proceedings of the Twentieth International Conference on International Conference on Machine Learning, AAAI Press, 2003, pp. 147–153.

[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.