Benchmark for Automatic Keyword Extraction in Spanish: Datasets and Methods

Pablo Calleja, Patricia Martín-Chozas and Elena Montiel-Ponsoda
Ontology Engineering Group, Universidad Politécnica de Madrid

Abstract
Tasks such as document indexing or information retrieval still seem to rely heavily on keywords, even in the LLM era. However, there is still a need for automatic keyword extraction systems and training sets in languages other than English. To the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation purposes. Additionally, the innovative keyword extraction methods that rely on language models are not being adapted to language models in languages other than English. To palliate this situation, this work proposes a method to translate into Spanish two of the main gold standard datasets used by the community, while preserving semantics and terms. Then, the main state-of-the-art methods are evaluated against the new translated datasets. The methods used for the evaluation have been configured or re-implemented for Spanish.

Keywords
Spanish Automatic Keyword Extraction, Spanish language, SemEval2017, SemEval2010

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
p.calleja@upm.es (P. Calleja); patricia.martin@upm.es (P. Martín-Chozas); elena.montiel@upm.es (E. Montiel-Ponsoda)
ORCID: 0000-0001-8423-8240 (P. Calleja); 0000-0002-8922-7521 (P. Martín-Chozas); 0000-0003-3263-3403 (E. Montiel-Ponsoda)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Keywords, typically defined as words or terms that best characterise the topics discussed in a document, have proven essential for different NLP tasks such as information extraction (IE), text mining, or information retrieval (IR) [1]. With the exponential growth of available digital documents, a need emerged for algorithms capable of automatically identifying the single or compound terms (also referred to as key segments or key phrases) that best represent the most relevant information of a document, a task better known as Automatic Keyword or KeyPhrase Extraction (AKE).

Nowadays, even in the face of generative AI algorithms and Large Language Models (LLMs), AKE algorithms are not only used to classify, retrieve, or inspect large corpora [1, 2, 3], but also to fine-tune LLMs and post-process their output.

However, automatically extracting keywords is a challenging task due to the complexities of natural language, document heterogeneity, and the type of keywords that are usually needed. The current state of the art is full of proposed methods and tools, ranging from the earliest ones, based on lexico-syntactic patterns and frequencies [4], to those purely based on statistics [5, 6], and the most recent ones, based on language models.

Keyword extraction methods have generally been classified into supervised and unsupervised methods. Traditional supervised methods are based on decision trees [7], naive Bayes [8] or Conditional Random Fields [9]. In the past 10 years, several models have emerged based on neural networks and deep learning [10, 11]. The most recent approaches rely on language models and attention mechanisms [12, 13].

Supervised methods tend to offer the best results in the machine learning literature, but they require a large labelled training corpus. To achieve that, human experts have to manually annotate large amounts of data, which is a costly and tedious task. The resulting annotations refer to the specific keywords that should be extracted from each sentence, paragraph or document in the corpus. On the other hand, unsupervised methods, such as statistical or graph-based approaches, do not require labelled corpora. Statistical methods [5, 14] use candidate position, frequency, length, and capitalisation to determine the importance of a word. Graph-based approaches [15, 16] construct a graph with the candidates as nodes, whose edges indicate similarity or co-occurrence of candidates.

Some of the best-known datasets for automatic keyword extraction, such as SemEval2010 [17], SemEval2017 [18] or Inspec [19], have been created for evaluation tasks and are commonly used to evaluate new methods (both supervised and unsupervised), and not so much for training.

However, all these efforts are not language agnostic. Most of the work so far has been oriented towards the English language, giving small coverage to other languages such as Spanish. As far as we know, there are no publicly available annotated training corpora in Spanish. Therefore, supervised algorithms cannot be easily implemented, and evaluations of supervised or unsupervised algorithms are difficult to perform.

In this paper, a method to translate two of the most important corpora for AKE is proposed and applied to their translation into Spanish. The main aim of this work is to create a 'silver standard' to support the training and evaluation of automatic keyword extraction in Spanish. The translation process has been performed so as to preserve the semantics and terminological representation of the original texts and the annotations. The translation is supported by the Google Translate service and by ChatGPT 3.5.

Additionally, a benchmark has been generated with five of the most relevant methods in the current state of the art on the two translated corpora. The methods have been configured for Spanish, and two of them have been re-implemented to use Spanish language models.

The rest of the paper is structured as follows: in Section 2 we provide a summary of the state of the art in Automatic Keyword Extraction. Section 3 is devoted to the method for the translation of the corpora. Section 4 describes the different AKE methods with their configurations or adaptations for the Spanish language, and Section 5 presents the results of the evaluation benchmark. Finally, Section 6 highlights the conclusions and recommendations for future work. Both experiments and results are reported in an anonymised GitHub repository (https://github.com/oeg-upm/spanish-termex).

2. State of the art

As stated by [1], 'keywords' and 'keyphrases' do not refer to any theory. An element is considered a 'key' element within a document when it is an important descriptor of the document content. The use of 'word' versus 'phrase' refers to the number of textual units, which can be one (1-gram) or several (n-grams). Since such keywords or keyphrases mostly correspond to terms, defined as words that are specific to a domain, the AKE task is closely related to the so-called Automatic Terminology Extraction/Retrieval (ATE/ATR) task, i.e., the task of identifying relevant terms in a corpus [20].

Lossio-Ventura et al. [21] described in their work some fundamental differences between the term extraction and keyword extraction tasks. One major difference is that extracting terms requires a large collection of texts, which is not a necessary requirement in keyword extraction, which can take only a single document as input. Also, ATE methods aim to extract term-like units and remove those that may not be terms, syntactically or terminologically. On the other hand, AKE methods extract the 'key' elements of a document, which are not limited to terms. Thus, while AKE methods can be domain independent, ATE methods apply to specific fields or professional domains, since their main goal is to build resources that contain the lexical units that are representative of a domain.

Although these two tasks have been conceived for different purposes, the truth is that, when performed automatically, they obtain similar results and performance, as both rely on linguistic and textual features (at sentence, paragraph or document level). Thus, several state-of-the-art methods have been used for both tasks.

In this section, we review the most relevant works in this area, making a distinction between traditional approaches (linguistic and statistical) and machine learning and neural approaches.

2.1. Traditional approaches

The algorithms considered in this section are usually based on linguistic patterns, relying on parsing and part-of-speech tagging processes to identify terms [22]. These patterns were very prolific in the 1990s, with systems such as LEXTER [23]. This kind of approach [24] has persisted until today, as patterns are the main starting point to automatically identify keywords or terms in documents and corpora. More advanced pattern-based works went further to identify the concept evoked by term variants in several languages, as in the work by [25] for English and French. In any case, the majority of these works are language dependent.

Later on, researchers started to combine various types of linguistic techniques, such as pattern-based techniques, regular expressions, stop word lists, and post-processing algorithms, to mention but a few. In this context, tools such as TermExtractor emerged, a system that combines several of the previously mentioned techniques and applies post-processing filters like domain pertinence, lexical cohesion or structural relevance [26].

More advanced works in the literature started to use statistical approaches in combination with linguistic functionalities, which appeared to improve the results. The process behind statistical approaches generally consists of weighting the frequency of occurrence of a combination of words (n-grams) in a text. Normally, statistical algorithms are divided into two types: 1) those based on unithood, which measures the strength of unity of complex units (such as X², T-score and z-score), and 2) those based on termhood, which measures the degree of representation of domain-specific concepts, such as C-Value or co-occurrence [27, 28]. Some of these purely statistical term extractors are INDEX for English [29], Lexterm [30] for Spanish, and RAKE [5] for keyword extraction in English.

In contrast, it is most common to find mixed approaches, such as TerMine, a term extractor that combines C-Value with linguistic information [4], or TermSuite, which applies distributional and compositional methods [31]. In [32], the authors combine linguistic processes such as segmentation, PoS tagging and morphological analysis with semantic knowledge extracted from external resources and statistical techniques. Other works, such as TextRank [33], create a graph from the text to extract keywords based on statistical metrics.
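As a toy illustration of the termhood idea behind measures such as C-Value, the following sketch scores candidate n-grams by their frequency, discounted by the mean frequency of the longer candidates that contain them. It is our own minimal sketch, not code from any of the cited tools, and the `log2(len + 1)` smoothing (so that unigrams get a non-zero weight) is a choice of this illustration rather than part of the original formulation:

```python
# Minimal C-Value-style termhood sketch over candidate n-grams.
import math
from collections import Counter

def c_value(candidates: Counter) -> dict:
    """candidates maps each candidate phrase (a tuple of words) to its corpus frequency."""
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates that contain `term` as a contiguous subsequence.
        nested_in = [f for other, f in candidates.items()
                     if len(other) > len(term)
                     and any(other[i:i + len(term)] == term
                             for i in range(len(other) - len(term) + 1))]
        # Discount the term's frequency by the mean frequency of its containers.
        adjusted = freq - (sum(nested_in) / len(nested_in) if nested_in else 0.0)
        # log2(|term| + 1): longer candidates weigh more; +1 keeps unigrams above zero.
        scores[term] = math.log2(len(term) + 1) * adjusted
    return scores
```

On a toy input such as {("deep", "neural", "network"): 2, ("neural", "network"): 4, ("network",): 5}, the nested candidates discount the score of the phrases they appear inside, which is exactly the unithood/termhood intuition described above.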
2.2. Machine Learning and Neural approaches

These approaches exploit different features (linguistic or not) to identify keywords. For instance, Rose et al. [5] identified keywords based on word frequency, the number of co-occurring neighbours, and the ratio between the co-occurrence and the frequency. Campos et al. [34] proposed YAKE, which calculated the importance of each candidate using frequency, offsets, and co-occurrence. The SemCluster method [35] first clustered the candidates based on semantic similarity, and the cluster centroids were selected as keywords. TopicRank [36] first assigned a score to each topic by clustering candidate keywords; the topics were scored using the TextRank ranking model, and keywords were extracted using the most representative candidate from the top-ranked topics. Florescu et al. [37] proposed PositionRank, which uses the position of word occurrences to improve TextRank on a document.

Word embeddings have also been widely used. Wang et al. [38] made use of pre-trained word embeddings and the frequency of each word to generate weighted edges between the words in a document; a weighted PageRank algorithm was then used to compute the final scores of the words. Also, Key2Vec [39] used a similar approach, using phrase embeddings to represent the candidates and ranking the importance of the phrases by calculating their semantic similarity and co-occurrences.

Currently, new approaches based on pre-trained neural language models have appeared in the literature. For instance, Text2TCS (https://live.european-language-grid.eu/catalogue/tool-service/8122) [40] is able to extract terms and relations from raw text, creating taxonomies automatically. [41] proposed SIFRank, the integration of a statistical model and a pre-trained language model, to calculate the relevance between candidates and document topics. Other works focus on the extraction of multilingual terminology across domains using transformers [42].

Two of the most recent works in the field of AKE using language models are AttentionRank and MDERank. AttentionRank [13] integrates self-attention weights extracted from a pre-trained language model with a calculated cross-attention relevancy value to identify keywords that are important to the local sentence context and also have strong relevancy to all sentences within the whole document. MDERank [12] bases the identification of keywords on the embedding representation of the sentence using masked tokens. Moreover, their work proposes a new type of BERT architecture to be trained as a language model, but for the purpose of keyword identification.

3. Dataset generation

In the era of machine learning approaches, datasets are an essential requirement to train and, what is more important, evaluate algorithms for different NLP tasks. For instance, in the field of Automatic Keyword Extraction, there are well-known gold standard datasets that are commonly used to evaluate approaches within the literature, such as SemEval2010 Task 5 [17] and SemEval2017 Task 10 [18]. However, these datasets are scarcely available in languages other than English [43]. Consequently, a common approach to overcome this limitation is to translate the available datasets into the target language [44, 45], including Spanish [46].

To the best of our knowledge, there is no consolidated dataset in Spanish for Automatic Keyword Extraction; therefore, the first contribution of this work is the development of an evaluation corpus for keyword extraction in Spanish, which results from translating two of the most common English AKE datasets: SemEval2010 and SemEval2017. The target of this contribution is to generate a 'silver standard' labelled dataset, to provide researchers in the field with a consolidated framework to test and evaluate their approaches.

However, the translation process for labelled datasets is not a straightforward task. As [47] demonstrated in their work, labelled datasets have their labels linked to one token or a span of tokens. Since sentence structure can vary across languages, it is very challenging to retain the same annotation structure after the translation process. To overcome such difficulties, we have organised the translation process into two phases: Phase 1) Source Dataset Analysis and Source Dataset Preprocessing, described in Section 3.1, and Phase 2) Source Dataset Translation and Target Dataset Postprocessing, described in Section 3.2.

Figure 1 summarises the method for the translation process in which, given the two original datasets, a set of four datasets translated into Spanish is obtained, using two different translation systems.

Figure 1: Method for dataset translation. [Diagram: in Phase 1, terms are annotated with quotes (Google Translate service) or with an HTML tag and a few-shot prompt (ChatGPT 3.5); in Phase 2, the four datasets Spa_SemEval2010GT, Spa_SemEval2010GPT, Spa_SemEval2017GT and Spa_SemEval2017GPT are obtained and manually revised.]

3.1. Phase 1: Dataset analysis and preprocessing

In order to generate the proposed silver standard for Spanish AKE, we have selected the two previously mentioned datasets, as they are widely used in experiments of this kind: SemEval2010 Task 5 [17] and SemEval2017 Task 10 [18]. Both datasets are published following the same structure: a set of documents containing the raw text (named docsutf8) and a set of documents containing the extracted keywords (named keys). Both types of documents present the same identifiers to match keywords with source documents.

Despite their similar structure, they present several differences. As shown in Table 1, the main difference lies in their size. With a smaller number of documents, SemEval2010 far exceeds SemEval2017 in the total number of tokens, which means that it contains fewer documents, but of a much larger size. SemEval2017 contains shorter documents with an average of 6 to 7 sentences, whereas SemEval2010 contains full scientific papers with hundreds of sentences. It is interesting to note that, although SemEval2010 is much bigger in number of tokens, SemEval2017 has a larger number of extracted keywords. This means that the keywords from SemEval2010 have greater representation and a higher number of occurrences than the keywords from 2017. These differences in size are important because they require a different treatment of the documents during the preprocessing and translation stages.

Table 1: Metrics for the SemEval2010 and SemEval2017 datasets, including keywords.

                          SemEval2010    SemEval2017
    Documents                     243            493
    Tokens                  2,334,613         95,877
    Keywords                    3,785          8,529
    Unmatched keywords            555              0

In both datasets, over 50% of the keywords are unigrams or bigrams. However, in SemEval2010 we observe that 555 keywords are not present in the documents with a similar span of text. The reason for this is to be found in the way in which the original dataset was created: in SemEval2010, some of the keywords come from the ones manually provided by the authors of the papers themselves, and they may not have an exact correspondence in the text.

Regarding the preprocessing of the datasets, there are two main aspects involved in the translation process. The first one refers to the original text. Not many issues were found during the translation of the SemEval2017 corpus, since it had a manageable size and a clean structure. However, the original texts of SemEval2010 were arbitrarily segmented, very long, and contained references and formulas, which posed many problems for the automatic translator when processing them.

The second aspect refers to the keywords. For the translation of the keywords, we did not simply translate the list of keywords out of context, but decided to mark them in the texts with annotation marks (quotation marks or an HTML tag, depending on the translation system). Then, we translated the texts and retrieved the translated terms contained within the annotation marks.

3.2. Phase 2: Dataset translation and postprocessing

Most of the existing approaches that create silver standards from existing gold standards by leveraging machine translation rely on at least two translation sources: one from a common online translator such as DeepL (https://www.deepl.com/es/translator) or Google Translate (https://translate.google.es/), and the other using a Neural Machine Translation model, as suggested in [44]. As already announced, in this work we have used the Google Translate and ChatGPT 3.5 Turbo (https://platform.openai.com/docs/models/gpt-3-5-turbo) APIs.

The keywords from the texts that were translated with Google Translate were annotated with quotation marks. However, on some occasions the system returned errors in which the annotation marks were missing or misplaced in the translated sentence, and either it was not possible to extract the translated term from the annotated sentence or the extracted term was not correct.
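The annotate-translate-extract idea described above can be sketched as follows. This is a minimal illustration of the quotation-mark variant only: `annotate` and `extract_translated` are hypothetical helper names of this sketch (not code from the paper's repository), and the call to the actual translation service is omitted:

```python
# Sketch: mark keywords in the source text so they survive machine translation.
import re

def annotate(text: str, keywords: list[str]) -> str:
    """Wrap each keyword occurrence in double quotes before sending to the translator."""
    for kw in keywords:
        text = re.sub(re.escape(kw), f'"{kw}"', text)
    return text

def extract_translated(translated: str) -> list[str]:
    """Retrieve the keyword translations between the quote marks that survived."""
    return re.findall(r'"([^"]+)"', translated)
```

For example, annotating 'has held two mobile computing design competitions' with the keyword 'mobile computing' yields 'has held two "mobile computing" design competitions'; after translation, the Spanish keyword is read back from between the surviving quotes.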
was used to mark the key- words before and after. The prompt sent to the generative The implementation of the original authors9 had to be model described the purpose of the model (i.e., ’You are reimplemented from scratch. The original repository a Spanish translator specialised in terminology’), and does not have libraries and version specifications. More- then some examples of annotations in English and its over, the original code relies on libraries for language translations in Spanish with the annotated and trans- models that are not maintained as well as the noun lated keywords were provided. This is called few-shot phrases identification component, which relies on the prompting. The full prompt is presented in Annex A. part-of-speech annotation of Stanford CoreNLP and a Regarding the postprocessing stage, several actions third-party library. Reproducibility was not possible in were performed. First, we extracted all the annotated this work. occurrences of each keyword in the sentence, creating a A new repository10 has been created for the implemen- list of translation candidates per keyword. In some cases, tation of the Attention rank method. This repository uses reconciliation between candidates was necessary to pro- HuggingFace’s library transformer to manage language vide a single translation for each keyword. In the case models and spaCy to identify noun phrases. The reposi- that no disparities between the candidates were found, tory details the specific libraries and versions needed and the translated keyword was automatically assigned. In the external modules needed. The new repository allows case of disparities, terms were manually reviewed and the use of BERT (as in the original work) and RoBERTa a translated keyword was manually assigned. In total, architecture models in different languages. 
we manually reviewed an average number of 2000 key- The adaptation for RoBERTa models had to deal with words per dataset (220 documents in SemEval2010 and two specific issues regarding the tokeniser. The first one 360 documents in SemEval2017). is the use of different special tokens to delimit sentences at the beginning and at the end to focus the attention mechanisms, as BERT uses ’[CLS]’ and ’[SEP]’ tokens, 4. AKE Adaptation to Spanish RoBERTa uses ’’ and ’’. The second issue is the generated tokens, as BERT uses a WordPiece tokeniser In this section, the different AKE methods used for the ex- in which subwords are marked with the ’##’ tag (e.g., periments and their implementation are presented. Some the word thicknesses is divided into tokens thickness and of them have already been implemented and maintained ##es). In contrast, RoBERTa models use Byte-level Pair by well-known Python libraries and contain adapters to Encoding (BPE) and classifies different tokens for char work with other languages. Two of them, those that are sequences that start a word or that are inside. The tokens based on language models, had to be re-implemented and that start a word include the white space before the word, adapted. In addition to different technical aspects, both and they are marked whith the special character ’G’. ̇ For methods use the original BERT model [48] for English, instance, the word extrapolate is divided into two tokens: and the RoBERTa MarIA model [49] for Spanish. ̇ ’Gextrap’ and ’olate’. Beyond the differences studied in previous works on 4.1. Already implemented methods the benefits or differences between both types of tokenis- ers [50], this work had to develop the alignment process The methods used for the evaluation are TopicRank, between the words of keywords and their correspond- YAKE and RAKE. The Python library PKE6 has been used ing tokens. With WordPiece is easier to find tokens and for the execution of the TopicRank and YAKE methods. 
recompose the original word, but BPE is sensible to ap- PKE uses the Python library spaCy7 , as many other meth- pearance of the white space before the token. If it does ods, to identify candidate chunks or nominal phrases that not appear, the token is different and its attention value can be relevant terms or keywords. Thus, the Spanish 8 https://github.com/vgrabovets/multi_rake 6 9 https://github.com/boudinfl/pke https://github.com/hd10-iupui/AttentionRank 7 10 https://spacy.io/ https://github.com/oeg-upm/AttentionRankLib changes. This issue has been solved by ensuring that the The results of the AKE algorithms on the Spanish input sentences always have a white space before a word. datasets, both multilingual and adapted for Spanish, show a lower performance compared to the original datasets. 4.3. MDERank However, they are in line with the results for English. Unlike many other NLP experiments, where a good result The original implementation11 contains a better descrip- is represented by metrics starting at 0.6 or 0.7 of f1 score, tion of the requirements. However, it is described for the highest metrics achieved by the algorithms tested in Python 3.7 which is no longer supported by the commu- SemEval2010 and 2017 do not exceed 0.3821 (BR17 and nity and most of the versions of the required libraries are K= 15). deprecated. Also, parts of the execution code are wrong We already expected lower values, as the translation such as the command line execution or the arguments, process is not perfect and it is not always possible to and there is no code related to the KPEBERT model, a maintain the correlation of one keyword in English to model which is trained and used for keyword identifi- the same keyword in Spanish. Apart from the errors cation. Only it is possible to execute it with traditional detected (explained in Section 5.2), GPT3 showed better BERT models. 
performance in maintaining the structure and terminol- To update the code and method, a new repository has ogy of the translated document. been created12 . In which the requirements, code and exe- It is also important to mention the different results cution process have improved. As AttentionRank, MDER- obtained for each dataset. For Spa SemEval2017GT and ank used Stanford CoreNLP for the identification of noun Spa SemEval2017GPT the best results, in terms of preci- fragments and it has been updated to spaCy. Finally, the sion, recall and f1-score, are obtained by the two methods method can now support RoBERTa models by taking into that are based on language models: AttentionRank and account the problems mentioned in AttentionRank. MDERank. Although the original dataset contains com- plex keywords, the language models perform well as in the English dataset. 5. Evaluation Surprisingly, for Spa SemEval2010GT and Spa This section discusses the evaluation results obtained SemEval2010 GPT the best results are obtained by YAKE. from the execution of the five AKE methods on The nature of the documents in SemEval2010, which are the four translated datasets (Spa_SemEval2010GT , full papers without any cleaning, including formulas, Spa_SemEval2010GPT , Spa_SemEval2017GT and references and citations, makes it difficult for a language Spa_SemEval2017GPT ). The metrics used in the evalu- model to perform well. An added issue is the large ation are precision, recall and f1-measure. Following length of the documents, which in the case of RAKE previous works in the literature, the methods are produces results close to zero. evaluated with the three metrics at the top K of the keywords extracted in each method. K equals 5, 10, and 5.2. Error Analysis and Discussion 15. Finally, we perform an error analysis and present a After a thorough analysis of the results, we conclude discussion around it. Table 2 shows the results obtained. 
5.2. Error Analysis and Discussion

After a thorough analysis of the results, we conclude that, beyond some translation errors, the main reason behind the low numbers seems to be the poor quality of some keywords in the original datasets. Although both datasets are claimed to have been either generated or reviewed by humans, we have detected a great number of anomalies that may be the main source of errors, as we try to illustrate below:

• Duplicated structures: We find similar structures with small variations which produce noise and inconsistencies, such as terms with determiners (i.e. metal and the metal), terms with symbols or special characters (i.e. logical inference and "logical inference"), and terms with different spellings (i.e. reputation mechanism and Reputation mechanism).

• Misspelled structures: We found several examples of misspelled structures and, specifically, missing letters both at the beginning and at the end of the structure (i.e. aked instead of baked).

• Non-terminological structures: This is the most common anomaly in both datasets, and one of the main causes of the low performance of the algorithms, both in English and in Spanish. Examples of such non-terminological structures are: full sentences (i.e. dynamics which clearly reveal the origins of the roaming), sentence fragments (i.e. loading force and penetration depth were recorded and their respective values were correlated with the observed), concatenated structures (i.e.1. well defined phase space dividing surfaces attached to, i.e.2. austenitic or austenitic & ferritic stainless steel), or even text fragments with references (i.e.1. comparison between the realistic calculations for positive parity [12] and negative parity [14], based on the same quark model [15], i.e.2. calculation by Martinez-Pinedo et al.).

In addition to the inaccuracies and anomalies mentioned before, in the results we observe that in some instances the same keyword has been translated differently into Spanish in different parts of the text. For example, the term deployment has been translated both as despliegue and implementación within the same text; and the compound term information aggregation can be found translated as agregación de información and agregación de la información. In itself, this would not be a problem, because these are correct translations in Spanish. Moreover, even in specialised domains, term variants are commonly used to designate the same concept.

A similar issue occurs when Spanish terms vary in gender and number. For instance, the keyword ferromagnetic can be found translated into two different keywords throughout the text, as ferromagnética and ferromagnéticos. However, with the aim of being faithful to the original evaluation datasets, we decided to choose one of the translations and discard the alternatives, although we believe that the datasets would benefit from including such variation.

Table 2: Evaluation of five AKE methods against the translated datasets, measuring Precision (p), Recall (r) and F-measure (F1). Each evaluation has taken into account the K (top n) value for 5, 10 and 15. Also, the best F1 obtained for the original SemEval2010 and SemEval2017 in English (BR10 and BR17) with each method is reported.

                       Spa_SE2010GT          Spa_SE2010GPT         BR10    Spa_SE2017GT          Spa_SE2017GPT         BR17
 k   Method            p      r      F1      p      r      F1      F1      p      r      F1      p      r      F1      F1
 5   RAKE              0.00   0.00   0.00    0.08   0.03   0.04    0.67    12.17  3.97   5.98    14.88  5.15   7.66    13.24
     TopicRank         4.77   1.65   2.45    7.08   2.53   3.73    5.26    19.39  5.85   8.99    21.94  6.87   10.47   15.92
     YAKE              7.49   2.58   3.83    10.95  3.85   5.69    8.46    10.47  3.39   5.13    18.86  6.45   9.61    12.05
     AttentionRank     7.52   2.60   3.86    9.30   3.32   4.89    11.39   19.51  5.88   9.03    24.66  7.84   11.89   23.59
     MDERank           7.63   2.44   3.70    9.62   3.11   4.70    12.95   19.39  5.60   8.69    27.46  7.94   12.32   22.81
 10  RAKE              0.00   0.00   0.00    0.16   0.11   0.13    1.33    12.70  8.16   9.93    14.86  10.07  12.00   22.61
     TopicRank         4.77   3.28   3.89    6.38   4.50   5.28    7.43    15.98  9.45   11.88   17.97  11.07  13.70   20.60
     YAKE              7.37   5.07   6.01    9.42   6.56   7.74    11.98   11.87  7.62   9.28    18.09  12.19  14.56   18.16
     AttentionRank     7.22   4.38   5.45    9.11   5.45   6.81    15.12   16.71  9.96   12.48   20.54  12.91  15.85   34.37
     MDERank           7.17   4.59   5.60    8.88   5.74   6.97    17.07   15.92  9.20   11.66   22.45  12.98  16.45   32.51
 15  RAKE              0.05   0.05   0.05    0.11   0.11   0.11    1.78    11.98  11.25  11.60   14.02  13.90  13.96   26.87
     TopicRank         4.36   4.39   4.38    5.38   5.65   5.51    8.02    13.61  12.10  12.81   15.09  13.85  14.44   22.37
     YAKE              6.83   7.02   6.93    8.56   9.04   8.79    12.87   11.33  10.70  11.01   17.20  17.09  17.15   20.72
     AttentionRank     6.70   5.83   6.23    7.90   7.97   7.93    16.66   14.20  12.52  13.31   17.09  15.93  16.49   38.21
     MDERank           6.27   6.03   6.15    7.79   7.54   7.66    20.09   13.84  12.01  12.86   19.31  16.75  17.93   37.18

6. Conclusions

This work has analysed the current state of the art of automatic keyword extraction and, in particular, the Spanish landscape. In this analysis, we have identified the lack of an evaluation framework (including datasets and ready-to-test algorithms) for AKE in Spanish. Consequently, this paper proposes two contributions. First, the generation of a silver standard for the Spanish language community by translating two English datasets widely used to evaluate AKE approaches: SemEval2010 and SemEval2017. Second, the configuration of a set of state-of-the-art algorithms in an easily executable manner to facilitate the evaluation task, including the adaptation of two current methods that rely on language models: AttentionRank and MDERank.

With the benchmark in place, we have performed an evaluation of the implemented algorithms on the translated datasets. To be consistent with the evaluations in English, the translated datasets maintain the original inner structure. The results in Spanish show the same tendency as in English, although they are lower. The error analysis shows that the low results are due to several factors: 1) the quality of the original datasets, as they contain noisy texts, non-terminological structures, and terms that are not contained in the texts; 2) the quality of the translations of the labelled datasets, as both systems present translation inconsistencies and have difficulties keeping track of the translated keyword in the text; and 3) the fact that a 1-to-1 translation of keywords is not always possible nor desirable, and that it would be advisable to include term variants.

In light of the results and taking these remarks into account, we conclude that maintaining the dataset structure in English to evaluate AKE tasks in Spanish might not be the most appropriate approach. For this reason, as part of future work we are considering two ap-

References

    in: Proceedings of the International Conference Recent Advances in Natural Language Processing, 2015, pp. 473-479.
[7] P. D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval 2 (2000) 303-336.
[8] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, KEA: Practical automatic keyphrase extraction, in: Proceedings of the Fourth ACM Conference on Digital Libraries, 1999, pp. 254-255.
[9] D. Sahrawat, D. Mahata, M. Kulkarni, H. Zhang, R. Gosangi, A. Stent, A. Sharma, Y. Kumar, R. R.
proaches for generating evaluation datasets in Spanish: Shah, R. Zimmermann, Keyphrase extraction 1) automatically postprocessing existing datasets, such from scholarly articles as sequence labeling us- as the two dealt with in this work, to eliminate all non- ing contextualized embeddings, arXiv preprint terminological structures and produce a list of candidate arXiv:1910.08840 (2019). terms instead of just one in the translation process, and [10] R. Alzaidy, C. Caragea, C. L. Giles, Bi-lstm-crf 2) semi-automatically generating a dataset with similar sequence labeling for keyphrase extraction from characteristics to the ones mentioned, but based on texts scholarly documents, in: The world wide web con- originally written in Spanish. ference, 2019, pp. 2551–2557. [11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, arXiv preprint Acknowledgments arXiv:1704.06879 (2017). [12] L. Zhang, Q. Chen, W. Wang, C. Deng, S. Zhang, This work has been partially founded by INESDATA B. Li, W. Wang, X. Cao, Mderank: A masked (https://inesdata-project.eu/) project, funded by the Span- document embedding rank approach for unsu- ish Ministry of Digital Transformation and Public Affairs pervised keyphrase extraction, arXiv preprint and NextGenerationEU, in the framework of the UNICO arXiv:2110.06651 (2021). I+D CLOUD Program - Real Decreto 959/2022. [13] H. Ding, X. Luo, Attentionrank: Unsupervised keyphrase extraction using self and cross attentions, References in: Proceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, 2021, [1] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, pp. 1919–1928. Keyword extraction: Issues and methods, Natural [14] R. Campos, V. Mangaravite, A. Pasquali, A. M. Jorge, Language Engineering 26 (2020) 259–291. doi:10. C. Nunes, A. Jatowt, Yake! collection-independent 1017/S1351324919000457. automatic keyword extractor, in: Advances in In- [2] O. Borisov, M. Aliannejadi, F. 
Crestani, Keyword formation Retrieval: 40th European Conference on extraction for improved document retrieval in con- IR Research, ECIR 2018, Grenoble, France, March versational search, arXiv preprint arXiv:2109.05979 26-29, 2018, Proceedings 40, Springer, 2018, pp. 806– (2021). 810. [3] H. Shah, R. Mariescu-Istodor, P. Fränti, We- [15] X. Wan, J. Xiao, Single document keyphrase extrac- brank: Language-independent extraction of key- tion using neighborhood knowledge., in: AAAI, words from webpages, in: 2021 IEEE International volume 8, 2008, pp. 855–860. Conference on Progress in Informatics and Com- [16] S. D. Gollapalli, C. Caragea, Extracting keyphrases puting (PIC), IEEE, 2021, pp. 184–192. from research papers using citation networks, in: [4] K. Frantzi, S. Ananiadou, H. Mima, Automatic Proceedings of the AAAI conference on artificial recognition of multi-word terms:. the c-value/nc- intelligence, volume 28, 2014. value method, International journal on digital li- [17] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, braries 3 (2000) 115–130. SemEval-2010 Task 5 : Automatic Keyphrase Ex- [5] S. Rose, D. Engel, N. Cramer, W. Cowley, Auto- traction from Scientific Articles, in: K. Erk, C. Strap- matic keyword extraction from individual docu- parava (Eds.), Proceedings of the 5th International ments, Text mining: applications and theory 1 Workshop on Semantic Evaluation, Association for (2010) 1–20. Computational Linguistics, Uppsala, Sweden, 2010, [6] A. Oliver, M. Vàzquez, Tbxtools: a free, fast and pp. 21–26. URL: https://aclanthology.org/S10-1004. flexible tool for automatic terminology extraction, [18] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. Mc- Callum, SemEval 2017 Task 10: ScienceIE - Extract- minología gratuita, Translation Journal (2007). ing Keyphrases and Relations from Scientific Publi- [31] J. Rocheteau, B. Daille, Ttc termsuite: A uima ap- cations, in: S. Bethard, M. Carpuat, M. 
Apidianaki, plication for multilingual terminology extraction S. M. Mohammad, D. Cer, D. Jurgens (Eds.), Proceed- from comparable corpora, in: 5th International ings of the 11th International Workshop on Seman- Joint Conference on Natural Language Processing tic Evaluation (SemEval-2017), Association for Com- (IJCNLP), 2011, pp. 9–12. putational Linguistics, Vancouver, Canada, 2017, pp. [32] J. Vivaldi, H. Rodríguez, Improving term extraction 546–555. URL: https://aclanthology.org/S17-2091. by combining different techniques, Terminology. doi:10.18653/v1/S17-2091. International Journal of Theoretical and Applied [19] A. Hulth, Improved Automatic Keyword Extraction Issues in Specialized Communication 7 (2001) 31– Given More Linguistic Knowledge, in: Proceedings 48. of the 2003 Conference on Empirical Methods in [33] R. Mihalcea, P. Tarau, Textrank: Bringing order Natural Language Processing, 2003, pp. 216–223. into text, in: Proceedings of the 2004 conference on URL: https://aclanthology.org/W03-1028. empirical methods in natural language processing, [20] A. Oliver, M. Vàzquez, A free terminology extrac- 2004, pp. 404–411. tion suite, in: Proceedings of Translating and the [34] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, Computer 29, 2007. C. Nunes, A. Jatowt, Yake! keyword extraction [21] J. A. Lossio-Ventura, C. Jonquet, M. Roche, M. Teis- from single documents using multiple local features, seire, Combining c-value and keyword extraction Information Sciences 509 (2020) 257–289. doi:10. methods for biomedical terms extraction, in: LBM: 1016/j.ins.2019.09.013. languages in biology and medicine, 2013. [35] H. H. Alrehamy, C. Walker, Semcluster: unsuper- [22] J. S. Justeson, S. M. 
Katz, Technical terminology: vised automatic keyphrase extraction using affinity some linguistic properties and an algorithm for propagation, in: Advances in Computational In- identification in text, Natural language engineering telligence Systems: Contributions Presented at the 1 (1995) 9–27. 17th UK Workshop on Computational Intelligence, [23] D. Bourigault, Surface grammatical analysis for September 6-8, 2017, Cardiff, UK, Springer, 2018, pp. the extraction of terminological noun phrases, in: 222–235. COLING 1992 Volume 3: The 14th International [36] A. Bougouin, F. Boudin, B. Daille, Topicrank: Conference on Computational Linguistics, 1992. Graph-based topic ranking for keyphrase extrac- [24] K. Kageura, E. Marshman, Terminology extraction tion, in: International joint conference on natural and management, in: The Routledge Handbook of language processing (IJCNLP), 2013, pp. 543–551. Translation and Technology, Routledge, 2019, pp. [37] C. Florescu, C. Caragea, A position-biased pagerank 61–77. algorithm for keyphrase extraction, in: Proceedings [25] B. Daille, Conceptual structuring through term vari- of the AAAI conference on artificial intelligence, ations, in: Proceedings of the ACL 2003 workshop volume 31, 2017. on Multiword expressions: analysis, acquisition and [38] B. Wang, S. Yu, W. Lou, Y. T. Hou, Privacy- treatment, 2003, pp. 9–16. preserving multi-keyword fuzzy search over en- [26] F. Sclano, P. Velardi, Termextractor: a web applica- crypted data in the cloud, in: IEEE INFOCOM tion to learn the shared terminology of emergent 2014-IEEE conference on computer communica- web communities, in: Enterprise Interoperability tions, IEEE, 2014, pp. 2112–2120. II, Springer, 2007, pp. 287–290. [39] D. Mahata, J. Kuriakose, R. Shah, R. Zimmermann, [27] K. Kageura, B. Umino, Methods of automatic term Key2vec: Automatic ranked keyphrase extraction recognition: A review, Terminology. 
International from scientific articles using phrase embeddings, in: Journal of Theoretical and Applied Issues in Spe- Proceedings of the 2018 Conference of the North cialized Communication 3 (1996) 259–289. American Chapter of the Association for Computa- [28] M. T. Pazienza, M. Pennacchiotti, F. M. Zanzotto, tional Linguistics: Human Language Technologies, Terminology extraction: an analysis of linguistic Volume 2 (Short Papers), 2018, pp. 634–639. and statistical approaches, in: Knowledge mining, [40] D. Gromann, L. Wachowiak, C. Lang, B. Heinisch, Springer, 2005, pp. 255–279. Multilingual extraction of terminological concept [29] L. P. Jones, E. W. Gassie, Jr, S. Radhakrishnan, Index: systems, Deep Learning and Neural Approaches The statistical basis for an automatic conceptual for Linguistic Data (2021) 5. phrase-indexing system, Journal of the American [41] Y. Sun, H. Qiu, Y. Zheng, Z. Wang, C. Zhang, Society for Information Science 41 (1990) 87–97. Sifrank: A new baseline for unsupervised [30] A. Oliver, M. Vázquez, J. Moré, Linguoc lexterm: keyphrase extraction based on pre-trained lan- una herramienta de extracción automática de ter- guage model, IEEE Access 8 (2020) 10896–10906. doi:10.1109/ACCESS.2020.2965087. English sentence: "The University of Florida, in part- [42] C. Lang, L. Wachowiak, B. Heinisch, D. Gromann, nership with Motorola, has held two
mobile comput- Transforming term extraction: Transformer-based ing
design competitions". Spanish sentence : "La approaches to multilingual term extraction across Universidad de Florida, en asociación con Motorola, ha domains, in: Findings of the Association for Com- celebrado dos concursos de diseño de computación móvil". putational Linguistics: ACL-IJCNLP 2021, 2021, pp. Output: computación móvil English sentence: "There, 3607–3620. we assume that
coefficients of non-renormalizable [43] A. Ghafoor, A. S. Imran, S. M. Daudpota, Z. Kas- terms
are suppressed enough to be neglected". Span- trati, R. Batra, M. A. Wani, et al., The impact of ish sentence: "Aquí, asumimos que los coeficientes de translating resource-rich datasets to low-resource los términos no renormalizables están suficientemente languages through multi-lingual text processing, suprimidos como para ser ignorados". Output: coefi- IEEE Access 9 (2021) 124478–124490. cientes de los términos no renormalizables [44] L. Bonifacio, V. Jeronymo, H. Q. Abonizio, I. Cam- English sentence: "It often exploits an
optical dif- piotti, M. Fadaee, R. Lotufo, R. Nogueira, mmarco: fusion model-based image reconstruction algorithm
A multilingual version of the ms marco passage to estimate spatial property values from measurements ranking dataset, arXiv preprint arXiv:2108.13897 of the light flux at the surface of the tissue." Spanish (2021). sentence: "A menudo se utiliza un algoritmo de recon- [45] M. Araújo, A. Pereira, F. Benevenuto, A compara- strucción de imágenes basado en un modelo de difusión tive study of machine translation for multilingual óptica para estimar los valores de propiedades espaciales sentence-level sentiment analysis, Information Sci- a partir de medidas de la flujo de luz en la superficie del ences 512 (2020) 1078–1102. tejido." Output: algoritmo de reconstrucción de imágenes [46] C. P. Carrino, M. R. Costa-Jussà, J. A. Fonollosa, basado en un modelo de difusión óptica Automatic spanish translation of the squad dataset English: "A second group of experiments is aimed at for multilingual question answering, arXiv preprint extensions of the baseline methods that exploit charac- arXiv:1912.05200 (2019). teristic features of the UvT Expert Collection; specifically, [47] G. M. Rosa, L. H. Bonifacio, L. R. de Souza, R. Lotufo, we propose and evaluate refined expert finding and pro- R. Nogueira, A cost-benefit analysis of cross-lingual filing methods that incorporate
topicality and orga- transfer methods, arXiv preprint arXiv:2105.06813 nizational structure
." Spanish: "Un segundo grupo (2021). de experimentos está dirigido a extensiones de los méto- [48] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, dos base que aprovechan las características distintivas de Bert: Pre-training of deep bidirectional transform- la Colección de Expertos de UvT; específicamente, pro- ers for language understanding, arXiv preprint ponemos y evaluamos métodos refinados de búsqueda y arXiv:1810.04805 (2018). perfilado de expertos que incorporan la topicalidad y la [49] A. Gutiérrez-Fandiño, J. Armengol-Estapé, estructura organizativa." output: topicalidad y la estruc- M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. tura organizativa Carrino, A. Gonzalez-Agirre, C. Armentano-Oller, C. Rodriguez-Penagos, M. Villegas, Maria: Spanish language models, arXiv preprint arXiv:2107.07253 (2021). [50] C. Toraman, E. H. Yilmaz, F. Şahinuç, O. Ozcelik, Impact of tokenization on language models: An analysis for turkish, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22 (2023). URL: https://doi.org/ 10.1145/3578707. doi:10.1145/3578707. A. Term Translation Prompt You are a scientific translator of English to Spanish spe- cialized in terminology. I give you one sentence in En- glish and the same sentence translated to Spanish. The English sentence has a term between the marks
and
. Identify in the Spanish sentence which words cor- respond to the same original term. The output term is in Spanish. Some examples
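The prompt above can be assembled programmatically. The following Python sketch illustrates one way to do so; the `**` delimiter and the `build_prompt` helper are our own illustrative choices and do not reproduce the exact marker tokens used in the experiments.

```python
# Illustrative construction of the term-alignment prompt from Appendix A.
# The "**" delimiter is a placeholder; the marker tokens actually used in
# the experiments are not reproduced in this sketch.

SYSTEM = (
    "You are a scientific translator of English to Spanish specialized in "
    "terminology. I give you one sentence in English and the same sentence "
    "translated to Spanish. The English sentence has a term between the "
    "marks ** and **. Identify in the Spanish sentence which words "
    "correspond to the same original term. The output term is in Spanish."
)

def build_prompt(english: str, term: str, spanish: str) -> str:
    """Mark `term` inside `english` and ask for its Spanish counterpart."""
    marked = english.replace(term, f"**{term}**", 1)
    return (f"{SYSTEM}\n\n"
            f'English sentence: "{marked}"\n'
            f'Spanish sentence: "{spanish}"\n'
            f"Output:")

prompt = build_prompt(
    "The University of Florida, in partnership with Motorola, has held two "
    "mobile computing design competitions.",
    "mobile computing",
    "La Universidad de Florida, en asociación con Motorola, ha celebrado "
    "dos concursos de diseño de computación móvil.")
```

The completion returned for this prompt would then be taken as the Spanish rendering of the marked term (here, computación móvil).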
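For reference, the top-K precision, recall and F-measure reported in Table 2 can be sketched as in the following minimal Python example. It uses naive lowercased exact matching; the official SemEval evaluation scripts additionally handle stemming and alternative gold answers, which are omitted here, and the document and keyword lists below are invented for illustration.

```python
# Minimal sketch of top-K evaluation (precision, recall, F1 at K).

def evaluate_at_k(predicted, gold, k):
    """Precision, recall and F1 of the top-k predicted keywords.

    predicted: ranked list of candidate keywords (best first)
    gold: collection of gold-standard keywords for the document
    """
    top_k = [p.lower().strip() for p in predicted[:k]]
    gold_norm = {g.lower().strip() for g in gold}
    matches = sum(1 for p in top_k if p in gold_norm)
    precision = matches / k if k else 0.0
    recall = matches / len(gold_norm) if gold_norm else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with invented data: 2 of the top-5 candidates match the gold set,
# which contains 4 keywords -> precision 0.40, recall 0.50.
p, r, f1 = evaluate_at_k(
    ["computación móvil", "algoritmo", "difusión óptica", "tejido", "flujo"],
    {"computación móvil", "difusión óptica",
     "reconstrucción de imágenes", "luz"},
    k=5)
```

Per-document scores computed this way are then averaged over the dataset to obtain the figures reported in Table 2.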
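The duplicated-structure anomalies listed in the error analysis (determiners, quotation marks, capitalisation) could be mitigated by a simple normalisation step before matching. The sketch below is a hypothetical helper, not part of the released benchmark code, and its determiner list is illustrative only.

```python
# Illustrative normalisation of keyword variants of the kind listed in the
# error analysis: "metal" vs "the metal", 'logical inference' vs
# '"logical inference"', "reputation mechanism" vs "Reputation mechanism".
import re

# Hypothetical determiner list covering English and Spanish.
DETERMINERS = {"the", "a", "an", "el", "la", "los", "las", "un", "una"}

def normalise_keyword(kw: str) -> str:
    """Lowercase, strip quotation marks and a leading determiner."""
    kw = kw.lower().strip()
    kw = re.sub(r'["\u201c\u201d\'`]', "", kw)  # drop straight/curly quotes
    tokens = kw.split()
    if tokens and tokens[0] in DETERMINERS:     # drop a leading determiner
        tokens = tokens[1:]
    return " ".join(tokens)

# "the metal" and "Metal" collapse to the same surface form.
assert normalise_keyword("the metal") == normalise_keyword("Metal")
```

Such a step would collapse the spurious variant pairs into a single gold entry, although it would not address the non-terminological structures, which require filtering rather than normalisation.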