Benchmark for Automatic Keyword Extraction in Spanish:
                                Datasets and Methods
                                Pablo Calleja1 , Patricia Martín-Chozas1 and Elena Montiel-Ponsoda1
                                1
                                    Ontology Engineering Group, Universidad Politécnica de Madrid


                                                                             Abstract
                                                                             Tasks such as document indexing or information retrieval still seem to heavily rely on keywords, even in the LLMs era.
                                                                             However, there is still a need for automatic keyword extraction works and training sets in languages other than English. To
                                                                             the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation
                                                                             purposes. Additionally, those innovative keyword extraction methods that rely on language models are not being adapted
                                                                             to language models in other languages. To palliate this situation, this work proposes a method to translate into Spanish
                                                                             two of the main gold standard datasets used by the community, while preserving semantics and terms. Then, the main
                                                                             state-of-the-art methods are evaluated against the new translated datasets. The methods used for the evaluation have been
                                                                             configured or re-implemented for Spanish.

                                                                             Keywords
                                                                             Spanish Automatic Keyword Extraction, Spanish language, SemEval2017, SemEval2010


                                1. Introduction                                                                                                       tional supervised methods are based on decision trees [7],
                                                                                                                                                      naive Bayes [8] or Conditional Random Fields [9]. In
                                Keywords, typically defined as words or terms that best                                                               the past 10 years, several models have emerged based on
                                characterise the topics discussed in a document, have                                                                 neural networks and deep learning [10, 11]. The most
                                proven essential for different NLP tasks such as informa-                                                             recent approaches rely on language models and attention
                                tion extraction (IE), text mining, or information retrieval                                                           mechanisms [12, 13].
                                (IR) [1]. With the exponential growth of available digital                                                               Supervised methods tend to offer the best results in
                                documents, a need emerged for algorithms capable of                                                                   the literature of machine learning, but they require a
                                automatically identifying single or compound terms (also                                                              large dataset of labelled training corpora. To achieve
                                referred to as key segments or key phrases) that best rep-                                                            that, human experts have to manually annotate large
                                resented the most relevant information of a document, a                                                               amounts of data, which is a costly and tedious task. The
                                task better known as Automatic Keyword or KeyPhrase                                                                   resulting annotations refer to the specific keywords that
                                Extraction (AKE).                                                                                                     should be extracted from each sentence, paragraph or
                                   Nowadays, even in the face of AI Generative algo-                                                                  document in the corpus. On the other hand, unsupervised
                                rithms and Large Language Models (LLMs), AKE algo-                                                                    methods, such as statistical or graph-based approaches,
                                rithms are not only used to classify, retrieve, or inspect                                                            do not require labelled corpora. Statistical-based meth-
                                large corpora [1, 2, 3], but also to fine-tune LLMs and                                                               ods [5, 14] use candidate position, frequency, length, and
                                post-process their output.                                                                                            capitalisation to determine the importance of a word.
                                   However, automatically extracting keywords is a chal-                                                              Graph-based approaches [15, 16] construct a graph with
                                lenging task due to the complexities of natural language,                                                             the candidates as nodes. The edges indicate similarity or
                                document heterogeneity and the type of keywords that                                                                  co-occurrence of candidates.
                                usually are needed. The current state-of-the-art is full                                                                 Some of the best-known datasets for automatic key-
                                of proposed methods and tools. From the earliest based                                                                word extraction such as SemEval2010 [17], SemEval2017
                                on lexico-syntactic patterns and frequencies [4] to those                                                             [18] or Inspec [19], have been created for evaluation
                                purely based on statistics [5, 6] or the most recent ones                                                             tasks and are commonly used to evaluate new methods
                                based on language models.                                                                                             (both supervised and unsupervised), and not so much for
                                   Keyword extraction methods have generally been clas-                                                               training.
                                sified into supervised or unsupervised methods. Tradi-                                                                   However, all these efforts are not language agnostic.
                                SEPLN-2024: 40th Conference of the Spanish Society for Natural                                                        Most of the works so far have been oriented towards
                                Language Processing. Valladolid, Spain. 24-27 September 2024.                                                         the English language, giving a small coverage to other
                                $ p.calleja@upm.es (P. Calleja); patricia.martin@upm.es                                                               languages such as Spanish. As far as we know, there
                                (P. Martín-Chozas); elena.montiel@upm.es (E. Montiel-Ponsoda)                                                         are no publicly available annotated training corpora in
                                 0000-0001-8423-8240 (P. Calleja); 0000-0002-8922-7521
                                                                                                                                                      Spanish. Therefore, supervised algorithms cannot be
                                (P. Martín-Chozas); 0000-0003-3263-3403 (E. Montiel-Ponsoda)
                                                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative   easily implemented, and evaluations for supervised or
                                                                       Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)                                        unsupervised algorithms are difficult to perform.


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
   In this paper, a method to translate two of the most        resources that contain the lexical units that are represen-
important corpora for AKE is proposed and applied to           tative of a domain.
their translation into Spanish. The main aim of this work         Although these two tasks have been conceived for
is to create a ’silver standard´ to support the training and   different purposes, the truth is that, when performed au-
evaluation of automatic keyword extraction in Spanish.         tomatically, they obtain similar results and performance,
The translation process has been performed to preserve         as both rely on linguistic and textual features (at sentence,
the semantics and terminological representation of the         paragraph or document levels). Thus, several state-of-
original texts and the annotations. The translation is         the-art methods have been used for both tasks.
supported by the Google Translate service and by Chat-            In this section, we will review the most relevant works
GPT3.5.                                                        in this area, making a distinction between traditional
   Additionally, a benchmark has been generated with           approaches (linguistic and statistic) and machine learning
five of the most relevant methods in the current state-of-     and neural approaches.
the-art on the two translated corpus. The methods have
been configured for Spanish, and two of them have been         2.1. Traditional approaches
re-implemented to use Spanish language models.
   The rest of the paper is structured as follows: In sec-     The algorithms considered in this section are usually
tion 2 we provide a summary of the state-of-the-art in         based on linguistic patterns, relying on parsing and part-
Automatic Keyword Extraction. Section 3 is devoted to          speech tagging processes to identify terms [22]. These
the method for the translation of the corpora. Section 4       patterns were very prolific in the 1990s, with systems
describes the different AKE methods with their config-         such as LEXTER [23]. This kind of approaches [24] has
urations or adaptations for the Spanish language, and          persisted until today, as patterns are the main starting
section 5 presents the results of the evaluation bench-        point to automatically identify keywords or terms in
mark. Finally, section 6 highlights the conclusions and        documents and corpora. More advanced works based on
recommendations for future work. Both experiments              patterns went further to identify the concept evoked by
and results are reported in an anonymised GitHub repos-        term variants in several languages, as the work by [25]
itory1 .                                                       for English and French. In any case, the majority of these
                                                               works are language dependent.
                                                                  Later on, researchers started to combine various types
2. State of the art                                            of linguistic techniques, such as pattern-based tech-
                                                               niques, regular expressions, stop word lists, and post-
As stated by [1], ‘keywords’ and ‘keyphrases’ do not re-       processing algorithms, to mention but a few. In this
fer to any theory. An element is considered as a ‘key’         context, tools such as TermExtractor emerge, a system
element within a document, when it is an important de-         that combines several of the previously mentioned tech-
scriptor of the document content. The use of ‘word’ ver-       niques and applies post-processing filters like domain
sus ‘phrase’ refers to the number of textual units, which      pertinence, lexical cohesion or structural relevance [26].
can be one (1-gram) or several (n-grams). Since such              More advanced works in the literature started to use
keywords or keyphrases mostly correspond to terms, de-         statistical approaches in combination with linguistic func-
fined as words that are specific to a domain, the AKE task     tionalities, which appeared to improve the results. The
is closely related to the so-called Automatic Terminol-        process behind statistical approaches generally consists
ogy Extraction/Retrieval (ATE/ATR) task, i.e., the task of     of weighting the frequency of occurrence of a combina-
identifying relevant terms in a corpus [20].                   tion of words (n-grams) in a text. Normally, statistical
   Lossio-Ventura et al. [21] described in their work that     algorithms are divided into two types: 1) those based
there are some fundamental differences between term            on the unithood that measures the strength of unity of
extraction and keyword extraction tasks. One major dif-        complex units (such as X2 , T-score and z-score), and 2)
ference is that extracting terms requires a large collection   those based on the termhood that measures the degree
of texts, which is not a necessary requirement in keyword      of representation of domain-specific concepts, such as
extraction, which can take only a single document as in-       C-Value or co-occurrence [27, 28]. Some of these purely
put. Also, ATE methods aim to extract term-like units          statistical term extractors are INDEX for English [29],
and remove those that may not be terms, syntactically          Lexterm [30] for Spanish, and RAKE [5], for keyword
or terminologically. On the other hand, AKE methods            extraction in English.
extract the ‘key’ elements of a document, which are not           In contrast, it is most common to find mixed ap-
limited to terms. Thus, while AKE methods can be do-           proaches, such as TerMine, a term extractor that
main independent, ATE methods apply to specific fields         combines C-Value with linguistic information [4], or
or professional domains, since their main goal is to build     TermSuite, which applies distributional and composi-
    1
        https://github.com/oeg-upm/spanish-termex              tional methods [31]. In [32], authors combine linguistic
processes such as segmentation, PoS tagging and mor-           cation of keywords on the embedding representation of
phological analysis, with semantic knowledge extracted         the sentence using masked tokens. Moreover, their work
from external resources and statistical techniques. Other      proposes a new type of BERT architecture to be trained
works, such as TextRank [33], create a graph from the          as a language model, but for the purpose of keyword
text to extract keywords based on statistical metrics.         identification.

2.2. Machine Learning and Neural                               3. Dataset generation
     approaches
                                                               In the era of machine learning approaches, datasets are
These approaches exploit different features (linguistic or     an essential requirement to train and, what is more im-
not) to identify keywords. For instance, Rose et al. [5]       portant, evaluate algorithms for different NLP tasks. For
identified keywords based on word frequency, the num-          instance, in the field of Automatic Keyword Extraction,
ber of co-occurring neighbors, and the ratio between the       there are well-known gold standard datasets that are com-
co-occurrence and the frequency. Campos et al. [34] pro-       monly used to evaluate approaches within the literature
posed YAKE which calculated the importance of each can-        such as the SemEval2010 Task 5 [17] and SemEval2017
didate using frequency, offsets, and co-occurrence. Sem-       Task 10 [18]. However, the availability of these data sets
Cluster method [35] first clustered the candidates based       is limited to languages other than English [43]. Conse-
on the semantic similarity in which the centroids were se-     quently, a common approach to overcome this limitation
lected as keywords. TopicRank [36] first assigned a score      is to translate the available datasets into the target lan-
to each topic by candidate keywords clustering. The            guage [44, 45], including Spanish [46].
topics were scored using the TextRank ranking model,              To the best of our knowledge, there is no consolidated
and keywords were extracted using the most represen-           dataset in Spanish for Automated Keyword Extraction,
tative candidate from the top-ranked topics. Florescu          therefore, the first contribution of this work is the devel-
et al. [37] proposed PositionRank to use the position of       opment of an evaluation corpus for keyword extraction in
word occurrences to improve TextRank on a document.            Spanish which results from translating two of the most
   Word embeddings have also been widely used. Wang et         common English AKE datasets: SemEval2010 and Se-
al. [38] made use of the pre-trained word embedding and        mEval2017. The target of this contribution is to generate
the frequency of each word to generate weighted edges          a ‘silver standard’ labelled dataset, to provide researchers
between words in a document. A weighted PageRank               in the field with a consolidated framework to test and
algorithm was used to compute the final scores of words.       evaluate their approaches.
Also, Key2Vec [39] used a similar approach using the              However, the translation process for labelled datasets
phrase embeddings for representing the candidates and          is not a straightforward task. As [47] demonstrated in
ranking the importance of the phrases by calculating the       their work, labelled datasets have their labels linked to
semantic similarity and co-occurrences of the phrases.         one token or a span of tokens. Since the sentence struc-
   Currently, new approaches based on pre-trained neu-         ture can vary in different languages, it is very challenging
ral language models have appeared in the literature. For       to retain the same annotation structure after the trans-
instance, Text2TCS2 [40], which is able to extract terms       lation process. To overcome such difficulties, we have
and relations from raw text, creating taxonomies auto-         organised the translation process into two phases: Phase
matically. [41] proposed SIFRank, the integration of a         1) Source Dataset Analysis and Source Dataset Prepro-
statistical model and a pre-trained language model, to         cessing, described in Section 3.1, and Phase 2) Source
calculate the relevance between candidates and docu-           Dataset Translation and Target Dataset Postprocessing,
ment topics. Other works are focused on the extraction         described in Section 3.2.
of multilingual terminology across domains using trans-           Figure 1 summarises the method for the translation
formers [42].                                                  process in which, given the two original datasets, a set of
   Two of the most recent works in the field of AKE us-        four datasets translated into Spanish is obtained, using
ing language models are AttentionRank and MDERank.             two different translation systems.
AttentionRank [13] integrates self-attention weights ex-
tracted from a pre-trained language model with the cal-
culated cross-attention relevancy value to identify key-       3.1. Phase 1: Dataset analysis and
words that are important to the local sentence context              preprocessing
and also have strong relevancy to all sentences within
                                                               In order to generate the proposed silver standard for
the whole document. MDERank [12] bases the identifi-
                                                               Spanish AKE, we have selected the two previously men-
                                                               tioned datasets, as they are widely used in experiments
    2
      https://live.european-language-grid.eu/catalogue/tool-   of this kind: SemEval2010 Task 5 [17] and SemEval2017
service/8122
                                                              Table 1
                               SemEval2010                    Metrics for SemEval2010 and SemEval2017 datasets, including
                                                              keywords.
   Phase 1


                               SemEval2017                                            SemEval2010           SemEval2017
                                                                Documents                  243                  493
                                                                  Tokens                2.334.613              95.877
             Term Annotation           Term Annotation
                                                                 Keywords                 3.785                8.529
               with quotes              with HTML tag           Unmatched
                                                                                             555                    0
                     service                    few-shot         Keywords
                                                 prompt
                Google
                                        ChatGPT 3.5
               Translator                                  some of the keywords come from the ones manually
                                                           provided by the authors of the papers themselves, and
   Phase 2


                                                           they may not have an exact correspondence in the text.
               Spa_SemEval2010GT        Spa_SemEval2010GPT    Regarding the preprocessing of the datasets, there are
                                                           two main aspects involved in the translation process.
               Spa_SemEval2017GT        Spa_SemEval2017GPT The first one refers to the original text. Not many issues
                                                           were found during the translation of SemEval2017 cor-
                                                           pus, since it had a manageable size and a clean structure.
                                                           However, the original texts of SemEval2010 were arbitrar-
                       Manual revision
                                                           ily segmented, very long, and contained references and
                                                           formulas, which posed many problems for the automatic
                                                           translator when processing them.
Figure 1: Method for dataset translation                      The second aspect refers to the keywords. For the
                                                           translation of the keywords, we did not simply trans-
                                                           late the list of keywords out of context, but decided to
                                                           mark them in the texts with annotations marks (quo-
Task 10 [18]. Both datasets are published following the
                                                           tation marks or the HTML tag <br>, depending on the
same structure, a set of documents containing the raw
                                                           translation system). Then, we translated the texts and
text (named docsutf8) and a set of documents containing
                                                           retrieved the translated terms contained within the an-
the extracted keywords (named keys). Both types of doc-
                                                           notation marks.
uments present the same identifiers to match keywords
with source documents.
   Despite their similar structure, they present several 3.2. Phase 2: Dataset translation and
differences. As shown in Table 1, the main difference             postprocessing
lies in their size. With a smaller number of documents,
SemEval2010 far exceeds SemEval2017 in the total num- Most of the existing approaches to create silver standards
ber of tokens, which means that it contains fewer docu- from existing gold standards by leveraging machine trans-
ments, but of a much larger size. SemEval2017 contains lation rely on at least two translation sources:   3
                                                                                                             one from
shorter documents with an average of 6 to 7 sentences,     a common     online translator such  as DeepL    or Google
                                                                     4
whereas SemEval2010 contains full scientific papers with Translate , and the other using a Neural Machine Trans-
hundreds of sentences. It is interesting to note that, al- lation model, as suggested in [44]. As already announced,
though SemEval2010 is bigger in number of documents in this work      5
                                                                         we have used Google Translate and ChatGPT
and number of tokens, SemEval2017 has a bigger number 3.5 Turbo APIs.
of extracted keywords. This means that the keywords           The keywords from the texts that were translated with
from SemEval2010 have greater representation and num-      Google  Translate were annotated with quotation marks.
ber of occurrences than the keywords from 2017. These However, on some occasions the system retrieved errors
differences in size are important because they require a in which the annotation marks were missing or misplaced
different treatment of the documents during the prepro- in the translated sentence, and either it was not possible
cessing and the translation stage.                         to extract the translated term from the annotated sen-
   In both datasets, over 50% of the keywords are unigram tence or the extracted term was not correct. To avoid
or bigram. However, in SemEval2010 we observe that 555
keywords are not present in the documents with a similar          3
                                                                      https://www.deepl.com/es/translator
span text. The reason for this is to be found in the way in       4
                                                                      https://translate.google.es/
which the original dataset was created. In SemEval2010,           5
                                                                      https://platform.openai.com/docs/models/gpt-3-5-turbo
that, we decided to append the original term to each anno-    model of spaCy has to be downloaded before the methods
tated sentence, to force the system to take that term into    can be run.
account and provide a translation. For instance, in the          For the RAKE method, the original library cannot be
translation of the sentence ‘...has held two "mobile com-     used as it is only oriented to the English language. How-
puting" design competitions’ focused on the term ‘mobile      ever, there is a version named Multi-rake8 which covers
computing’ the translation lost the quotation marks: ‘ha      different languages. As the method is statistical, to per-
celebrado dos concursos de diseño de computación móvil’.      form multilingually, the addition of stopword lists from
Thus, we add the term repeated to obtain the translation      the different target languages is necessary.
of the term: ‘...has held two "mobile computing" design
competitions. Mobile computing’.                              4.2. Attention Rank
   With ChatGPT, the tag <br>was used to mark the key-
words before and after. The prompt sent to the generative     The implementation of the original authors9 had to be
model described the purpose of the model (i.e., ’You are      reimplemented from scratch. The original repository
a Spanish translator specialised in terminology’), and        does not have libraries and version specifications. More-
then some examples of annotations in English and its          over, the original code relies on libraries for language
translations in Spanish with the annotated and trans-         models that are not maintained as well as the noun
lated keywords were provided. This is called few-shot         phrases identification component, which relies on the
prompting. The full prompt is presented in Annex A.           part-of-speech annotation of Stanford CoreNLP and a
   Regarding the postprocessing stage, several actions        third-party library. Reproducibility was not possible in
were performed. First, we extracted all the annotated         this work.
occurrences of each keyword in the sentence, creating a          A new repository10 has been created for the implemen-
list of translation candidates per keyword. In some cases,    tation of the Attention rank method. This repository uses
reconciliation between candidates was necessary to pro-       HuggingFace’s library transformer to manage language
vide a single translation for each keyword. In the case       models and spaCy to identify noun phrases. The reposi-
that no disparities between the candidates were found,        tory details the specific libraries and versions needed and
the translated keyword was automatically assigned. In         the external modules needed. The new repository allows
case of disparities, terms were manually reviewed and         the use of BERT (as in the original work) and RoBERTa
a translated keyword was manually assigned. In total,         architecture models in different languages.
we manually reviewed an average number of 2000 key-              The adaptation for RoBERTa models had to deal with
words per dataset (220 documents in SemEval2010 and           two specific issues regarding the tokeniser. The first one
360 documents in SemEval2017).                                is the use of different special tokens to delimit sentences
                                                              at the beginning and at the end to focus the attention
                                                              mechanisms, as BERT uses ’[CLS]’ and ’[SEP]’ tokens,
4. AKE Adaptation to Spanish                                  RoBERTa uses ’<s>’ and ’</s>’. The second issue is the
                                                              generated tokens, as BERT uses a WordPiece tokeniser
In this section, the different AKE methods used for the ex-   in which subwords are marked with the ’##’ tag (e.g.,
periments and their implementation are presented. Some        the word thicknesses is divided into tokens thickness and
of them have already been implemented and maintained          ##es). In contrast, RoBERTa models use Byte-level Pair
by well-known Python libraries and contain adapters to        Encoding (BPE) and classifies different tokens for char
work with other languages. Two of them, those that are        sequences that start a word or that are inside. The tokens
based on language models, had to be re-implemented and        that start a word include the white space before the word,
adapted. In addition to different technical aspects, both     and they are marked whith the special character ’G’. ̇ For
methods use the original BERT model [48] for English,         instance, the word extrapolate is divided into two tokens:
and the RoBERTa MarIA model [49] for Spanish.                  ̇
                                                              ’Gextrap’  and ’olate’.
                                                                 Beyond the differences studied in previous works on
4.1. Already implemented methods                              the benefits or differences between both types of tokenis-
                                                              ers [50], this work had to develop the alignment process
The methods used for the evaluation are TopicRank,
                                                              between the words of keywords and their correspond-
YAKE and RAKE. The Python library PKE6 has been used
                                                              ing tokens. With WordPiece is easier to find tokens and
for the execution of the TopicRank and YAKE methods.
                                                              recompose the original word, but BPE is sensible to ap-
PKE uses the Python library spaCy7 , as many other meth-
                                                              pearance of the white space before the token. If it does
ods, to identify candidate chunks or nominal phrases that
                                                              not appear, the token is different and its attention value
can be relevant terms or keywords. Thus, the Spanish
                                                                  8
                                                                      https://github.com/vgrabovets/multi_rake
    6                                                             9
      https://github.com/boudinfl/pke                                 https://github.com/hd10-iupui/AttentionRank
    7                                                            10
      https://spacy.io/                                               https://github.com/oeg-upm/AttentionRankLib
changes. This issue has been solved by ensuring that the       The results of the AKE algorithms on the Spanish
input sentences always have a white space before a word. datasets, both multilingual and adapted for Spanish, show
                                                            a lower performance compared to the original datasets.
4.3. MDERank                                                However, they are in line with the results for English.
                                                            Unlike many other NLP experiments, where a good result
The original implementation11 contains a better descrip- is represented by metrics starting at 0.6 or 0.7 of f1 score,
tion of the requirements. However, it is described for the highest metrics achieved by the algorithms tested in
Python 3.7 which is no longer supported by the commu- SemEval2010 and 2017 do not exceed 0.3821 (BR17 and
nity and most of the versions of the required libraries are K= 15).
deprecated. Also, parts of the execution code are wrong        We already expected lower values, as the translation
such as the command line execution or the arguments, process is not perfect and it is not always possible to
and there is no code related to the KPEBERT model, a maintain the correlation of one keyword in English to
model which is trained and used for keyword identifi- the same keyword in Spanish. Apart from the errors
cation. Only it is possible to execute it with traditional detected (explained in Section 5.2), GPT3 showed better
BERT models.                                                performance in maintaining the structure and terminol-
   To update the code and method, a new repository has ogy of the translated document.
been created12 . In which the requirements, code and exe-      It is also important to mention the different results
cution process have improved. As AttentionRank, MDER- obtained for each dataset. For Spa SemEval2017GT and
ank used Stanford CoreNLP for the identification of noun Spa SemEval2017GPT the best results, in terms of preci-
fragments and it has been updated to spaCy. Finally, the sion, recall and f1-score, are obtained by the two methods
method can now support RoBERTa models by taking into that are based on language models: AttentionRank and
account the problems mentioned in AttentionRank.            MDERank. Although the original dataset contains com-
                                                            plex keywords, the language models perform well as in
                                                            the English dataset.
5. Evaluation                                                  Surprisingly, for Spa SemEval2010GT and Spa
This section discusses the evaluation results obtained      SemEval2010    GPT the best results are obtained by YAKE.

from the execution of the five AKE methods on The nature of the documents in SemEval2010, which are
the four translated datasets (Spa_SemEval2010GT , full papers without any cleaning, including formulas,
Spa_SemEval2010GPT ,         Spa_SemEval2017GT         and references and citations, makes it difficult for a language
Spa_SemEval2017GPT ). The metrics used in the evalu- model to perform well. An added issue is the large
ation are precision, recall and f1-measure. Following length of the documents, which in the case of RAKE
previous works in the literature, the methods are produces results close to zero.
evaluated with the three metrics at the top K of the
keywords extracted in each method. K equals 5, 10, and 5.2. Error Analysis and Discussion
15. Finally, we perform an error analysis and present a
                                                          After a thorough analysis of the results, we conclude
discussion around it. Table 2 shows the results obtained.
                                                          that, beyond some translation errors, the main reason
                                                          behind the low numbers seems to be the poor quality of
5.1. Results                                              some keywords in the original datasets. Although both
Table 2 shows the results for each top K (5, 10, datasets are claimed to have been either generated or
15) and method. The results have been grouped reviewed by humans, we have detected a great number
by the type of dataset and the translation system of anomalies that may be the main source of errors, as
used: Spa SemEval2010 , Spa SemEval2010 , Spa we try to illustrate below:
                            GT                  GPT
SemEval2017GT and Spa SemEval2017GPT , where GT                  • Duplicated structures: We find similar structures
stands for Google Translate and GPT stands for Chat-               with small variations which produce noise and
GPT 3.5. Additionally, the column named BR, that stands            inconsistencies, such as terms with determiners
for Best Result, shows the best f1 result reported in the          (i.e. metal and the metal), terms with symbols or
original datasets in English (BR10 for SemEval2010 and             special characters (i.e. logical inference and “logi-
BR17 for SemEval2017). These results are taken from                cal inference"), and terms with different spellings
the AttentionRank work [13], except for the results for            (i.e reputation mechanism and Reputation mecha-
MDERank, which are taken from their own published                  nism).
work [12].
                                                                 • Misspelled structures: We found several examples
   11
      https://github.com/LinhanZ/mderank                           of misspelled structures, and, specifically, missing
   12
      https://github.com/oeg-upm/mderanklib
Table 2
Evaluation of five AKE methods against the translated datasets measuring Precision (𝑝), Recall (𝑟) and F-measure (𝐹 ). Each
evaluation has taken into account the K (n top) value for 5, 10 and 15. Also, the best F1 obtained for the original SemEval2010
and SemEval2017 in English (BR10 and BR17 ) with each method is reported.
                        Spa_SE2010GT         Spa_SE2010GPT       BR10         Spa_SE2017GT             Spa_SE2017GPT       BR17
  k      Method
                       p      r     F1      p       r     F1      F1      p        r         F1    p         r      F1      F1
          RAKE        0.00   0.00   0.00   0.08    0.03   0.04    0.67   12.17    3.97   5.98     14.88    5.15     7.66   13.24
        TopicRank     4.77   1.65   2.45   7.08    2.53   3.73    5.26   19.39    5.85   8.99     21.94    6.87    10.47   15.92
  5       YAKE        7.49   2.58   3.83   10.95   3.85   5.69    8.46   10.47    3.39   5.13     18.86    6.45     9.61   12.05
      AttentionRank   7.52   2.60   3.86   9.30    3.32   4.89   11.39   19.51    5.88   9.03     24.66    7.84    11.89   23.59
        MDERank       7.63   2.44   3.70   9.62    3.11   4.70   12.95   19.39    5.60   8.69     27.46    7.94    12.32   22.81
          RAKE        0.00   0.00   0.00   0.16    0.11   0.13    1.33   12.70    8.16    9.93    14.86    10.07   12.00   22.61
        TopicRank     4.77   3.28   3.89   6.38    4.50   5.28    7.43   15.98    9.45   11.88    17.97    11.07   13.70   20.60
 10       YAKE        7.37   5.07   6.01   9.42    6.56   7.74   11.98   11.87    7.62    9.28    18.09    12.19   14.56   18.16
      AttentionRank   7.22   4.38   5.45   9.11    5.45   6.81   15.12   16.71    9.96   12.48    20.54    12.91   15.85   34.37
        MDERank       7.17   4.59   5.60   8.88    5.74   6.97   17.07   15.92    9.20   11.66    22.45    12.98   16.45   32.51
          RAKE        0.05   0.05   0.05   0.11    0.11   0.11    1.78   11.98   11.25   11.60    14.02    13.90   13.96   26.87
        TopicRank     4.36   4.39   4.38   5.38    5.65   5.51    8.02   13.61   12.10   12.81    15.09    13.85   14.44   22.37
 15       YAKE        6.83   7.02   6.93   8.56    9.04   8.79   12.87   11.33   10.70   11.01    17.20    17.09   17.15   20.72
      AttentionRank   6.70   5.83   6.23   7.90    7.97   7.93   16.66   14.20   12.52   13.31    17.09    15.93   16.49   38.21
        MDERank       6.27   6.03   6.15   7.79    7.54   7.66   20.09   13.84   12.01   12.86    19.31    16.75   17.93   37.18


        letters both at the beginning and at the end of the    netic can be found translated into two different keywords
        structure (i.e. aked instead of baked).                throughout the text, as ferromagnética and ferromagnéti-
                                                               cos. However, with the aim to be faithful to the original
     • Non-terminological structures: This is the most evaluation datasets, we decided to choose one of the
        common anomaly in both datasets, and one of the translations and discard the alternatives, although we
        main causes for the low performance of the algo- believe that the datasets would benefit from including
        rithms, both in English and in Spanish. Examples such variation.
        of such non-terminological structures are: full
        sentences (i.e. dynamics which clearly reveal the
        origins of the roaming), sentence fragments (i.e. 6. Conclusions
        loading force and penetration depth were recorded
        and their respective values were correlated with This work has analysed the current state-of-the-art of au-
        the observed), concatenated structures (i.e.1. well tomatic keyword extraction and, in particular, the Span-
        defined phase space dividing surfaces attached to, ish landscape. In this analysis, we have identified the
        i.e.2. austenitic or austenitic & ferritic stainless lack of an evaluation framework (including datasets and
        steel), or even text fragments with references (i.e.1. ready-to-test algorithms) for AKE in Spanish. Conse-
        comparison between the realistic calculations for quently, this paper proposes two contributions. First, the
        positive parity [12] and negative parity [14], based generation of a silver standard for the Spanish language
        on the same quark model [15], i.e.2. calculation by community by the translation of two English datasets
        Martinez-Pinedo et al.).                               widely used to evaluate AKE approaches: SemEval2010
                                                               and SemEval2017. Second, the configuration of a set of
   Additionally to inaccuracies and anomalies mentioned state-of-the-art algorithms in an easily executable man-
before, in the results we observe that in some instances ner to facilitate the evaluation task, including the adapta-
the same keyword has been translated differently into tion of two current methods that rely on language models:
Spanish in different parts of the text. For example, the Attention Rank and MDERank.
term deployment has been translated both as despliegue           With the benchmark in place, we have performed an
and implementación within the same text; or the com- evaluation of the implemented algorithms and the trans-
pound term information aggregation can be found trans- lated datasets. To be consistent with the evaluations in
lated as agregación de información and agregación de la English, the translated datasets maintain the original in-
información. In itself, this would not be a problem be- ner structure. The results in Spanish suggest the same
cause these are correct translations in Spanish. Moreover, tendency as in English, although they are lower. The
even in specialised domains, term variants are commonly error analysis shows that low results are due to several
used to designate the same concept.                            factors: 1) the quality of the original datasets, as they
   A similar issue occurs when Spanish terms vary in contain noisy texts, non-terminological structures, and
gender and number. For instance, the keyword ferromag-
terms that are not contained in the texts, 2) the qual-          in: Proceedings of the international conference re-
ity of the translations for the labelled datasets, as both       cent advances in natural language processing, 2015,
systems present translation inconsistencies and have dif-        pp. 473–479.
ficulties to keep track of the translated keyword in the     [7] P. D. Turney, Learning algorithms for keyphrase
text, 3) the fact that a 1 to 1 translation of keywords is       extraction, Information retrieval 2 (2000) 303–336.
not always possible nor desirable, and that it would be      [8] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin,
recommendable to include term variants.                          C. G. Nevill-Manning, Kea: Practical automatic
   In light of the results and taking these remarks into         keyphrase extraction, in: Proceedings of the fourth
account, we conclude that maintaining the dataset struc-         ACM conference on Digital libraries, 1999, pp. 254–
ture in English to evaluate AKE tasks in Spanish might           255.
not be the most appropriate approach. For this rea- [9] D. Sahrawat, D. Mahata, M. Kulkarni, H. Zhang,
son, as part of future work we are considering two ap-           R. Gosangi, A. Stent, A. Sharma, Y. Kumar, R. R.
proaches for generating evaluation datasets in Spanish:          Shah, R. Zimmermann, Keyphrase extraction
1) automatically postprocessing existing datasets, such          from scholarly articles as sequence labeling us-
as the two dealt with in this work, to eliminate all non-        ing contextualized embeddings, arXiv preprint
terminological structures and produce a list of candidate        arXiv:1910.08840 (2019).
terms instead of just one in the translation process, and [10] R. Alzaidy, C. Caragea, C. L. Giles, Bi-lstm-crf
2) semi-automatically generating a dataset with similar          sequence labeling for keyphrase extraction from
characteristics to the ones mentioned, but based on texts        scholarly documents, in: The world wide web con-
originally written in Spanish.                                   ference, 2019, pp. 2551–2557.
                                                            [11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky,
                                                                 Y. Chi, Deep keyphrase generation, arXiv preprint
Acknowledgments                                                  arXiv:1704.06879 (2017).
                                                            [12] L. Zhang, Q. Chen, W. Wang, C. Deng, S. Zhang,
This work has been partially founded by INESDATA
                                                                 B. Li, W. Wang, X. Cao, Mderank: A masked
(https://inesdata-project.eu/) project, funded by the Span-
                                                                 document embedding rank approach for unsu-
ish Ministry of Digital Transformation and Public Affairs
                                                                 pervised keyphrase extraction, arXiv preprint
and NextGenerationEU, in the framework of the UNICO
                                                                 arXiv:2110.06651 (2021).
I+D CLOUD Program - Real Decreto 959/2022.
                                                            [13] H. Ding, X. Luo, Attentionrank: Unsupervised
                                                                 keyphrase extraction using self and cross attentions,
References                                                       in: Proceedings of the 2021 Conference on Empiri-
                                                                 cal Methods in Natural Language Processing, 2021,
  [1] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille,           pp. 1919–1928.
      Keyword extraction: Issues and methods, Natural [14] R. Campos, V. Mangaravite, A. Pasquali, A. M. Jorge,
      Language Engineering 26 (2020) 259–291. doi:10.            C. Nunes, A. Jatowt, Yake! collection-independent
      1017/S1351324919000457.                                    automatic keyword extractor, in: Advances in In-
  [2] O. Borisov, M. Aliannejadi, F. Crestani, Keyword           formation Retrieval: 40th European Conference on
      extraction for improved document retrieval in con-         IR Research, ECIR 2018, Grenoble, France, March
      versational search, arXiv preprint arXiv:2109.05979        26-29, 2018, Proceedings 40, Springer, 2018, pp. 806–
      (2021).                                                    810.
  [3] H. Shah, R. Mariescu-Istodor, P. Fränti, We- [15] X. Wan, J. Xiao, Single document keyphrase extrac-
      brank: Language-independent extraction of key-             tion using neighborhood knowledge., in: AAAI,
      words from webpages, in: 2021 IEEE International           volume 8, 2008, pp. 855–860.
      Conference on Progress in Informatics and Com- [16] S. D. Gollapalli, C. Caragea, Extracting keyphrases
      puting (PIC), IEEE, 2021, pp. 184–192.                     from research papers using citation networks, in:
  [4] K. Frantzi, S. Ananiadou, H. Mima, Automatic               Proceedings of the AAAI conference on artificial
      recognition of multi-word terms:. the c-value/nc-          intelligence, volume 28, 2014.
      value method, International journal on digital li- [17] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin,
      braries 3 (2000) 115–130.                                  SemEval-2010 Task 5 : Automatic Keyphrase Ex-
  [5] S. Rose, D. Engel, N. Cramer, W. Cowley, Auto-             traction from Scientific Articles, in: K. Erk, C. Strap-
      matic keyword extraction from individual docu-             parava (Eds.), Proceedings of the 5th International
      ments, Text mining: applications and theory 1              Workshop on Semantic Evaluation, Association for
      (2010) 1–20.                                               Computational Linguistics, Uppsala, Sweden, 2010,
  [6] A. Oliver, M. Vàzquez, Tbxtools: a free, fast and          pp. 21–26. URL: https://aclanthology.org/S10-1004.
      flexible tool for automatic terminology extraction, [18] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. Mc-
     Callum, SemEval 2017 Task 10: ScienceIE - Extract-             minología gratuita, Translation Journal (2007).
     ing Keyphrases and Relations from Scientific Publi-       [31] J. Rocheteau, B. Daille, Ttc termsuite: A uima ap-
     cations, in: S. Bethard, M. Carpuat, M. Apidianaki,            plication for multilingual terminology extraction
     S. M. Mohammad, D. Cer, D. Jurgens (Eds.), Proceed-            from comparable corpora, in: 5th International
     ings of the 11th International Workshop on Seman-              Joint Conference on Natural Language Processing
     tic Evaluation (SemEval-2017), Association for Com-            (IJCNLP), 2011, pp. 9–12.
     putational Linguistics, Vancouver, Canada, 2017, pp.      [32] J. Vivaldi, H. Rodríguez, Improving term extraction
     546–555. URL: https://aclanthology.org/S17-2091.               by combining different techniques, Terminology.
     doi:10.18653/v1/S17-2091.                                      International Journal of Theoretical and Applied
[19] A. Hulth, Improved Automatic Keyword Extraction                Issues in Specialized Communication 7 (2001) 31–
     Given More Linguistic Knowledge, in: Proceedings               48.
     of the 2003 Conference on Empirical Methods in            [33] R. Mihalcea, P. Tarau, Textrank: Bringing order
     Natural Language Processing, 2003, pp. 216–223.                into text, in: Proceedings of the 2004 conference on
     URL: https://aclanthology.org/W03-1028.                        empirical methods in natural language processing,
[20] A. Oliver, M. Vàzquez, A free terminology extrac-              2004, pp. 404–411.
     tion suite, in: Proceedings of Translating and the        [34] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge,
     Computer 29, 2007.                                             C. Nunes, A. Jatowt, Yake! keyword extraction
[21] J. A. Lossio-Ventura, C. Jonquet, M. Roche, M. Teis-           from single documents using multiple local features,
     seire, Combining c-value and keyword extraction                Information Sciences 509 (2020) 257–289. doi:10.
     methods for biomedical terms extraction, in: LBM:              1016/j.ins.2019.09.013.
     languages in biology and medicine, 2013.                  [35] H. H. Alrehamy, C. Walker, Semcluster: unsuper-
[22] J. S. Justeson, S. M. Katz, Technical terminology:             vised automatic keyphrase extraction using affinity
     some linguistic properties and an algorithm for                propagation, in: Advances in Computational In-
     identification in text, Natural language engineering           telligence Systems: Contributions Presented at the
     1 (1995) 9–27.                                                 17th UK Workshop on Computational Intelligence,
[23] D. Bourigault, Surface grammatical analysis for                September 6-8, 2017, Cardiff, UK, Springer, 2018, pp.
     the extraction of terminological noun phrases, in:             222–235.
     COLING 1992 Volume 3: The 14th International              [36] A. Bougouin, F. Boudin, B. Daille, Topicrank:
     Conference on Computational Linguistics, 1992.                 Graph-based topic ranking for keyphrase extrac-
[24] K. Kageura, E. Marshman, Terminology extraction                tion, in: International joint conference on natural
     and management, in: The Routledge Handbook of                  language processing (IJCNLP), 2013, pp. 543–551.
     Translation and Technology, Routledge, 2019, pp.          [37] C. Florescu, C. Caragea, A position-biased pagerank
     61–77.                                                         algorithm for keyphrase extraction, in: Proceedings
[25] B. Daille, Conceptual structuring through term vari-           of the AAAI conference on artificial intelligence,
     ations, in: Proceedings of the ACL 2003 workshop               volume 31, 2017.
     on Multiword expressions: analysis, acquisition and       [38] B. Wang, S. Yu, W. Lou, Y. T. Hou, Privacy-
     treatment, 2003, pp. 9–16.                                     preserving multi-keyword fuzzy search over en-
[26] F. Sclano, P. Velardi, Termextractor: a web applica-           crypted data in the cloud, in: IEEE INFOCOM
     tion to learn the shared terminology of emergent               2014-IEEE conference on computer communica-
     web communities, in: Enterprise Interoperability               tions, IEEE, 2014, pp. 2112–2120.
     II, Springer, 2007, pp. 287–290.                          [39] D. Mahata, J. Kuriakose, R. Shah, R. Zimmermann,
[27] K. Kageura, B. Umino, Methods of automatic term                Key2vec: Automatic ranked keyphrase extraction
     recognition: A review, Terminology. International              from scientific articles using phrase embeddings, in:
     Journal of Theoretical and Applied Issues in Spe-              Proceedings of the 2018 Conference of the North
     cialized Communication 3 (1996) 259–289.                       American Chapter of the Association for Computa-
[28] M. T. Pazienza, M. Pennacchiotti, F. M. Zanzotto,              tional Linguistics: Human Language Technologies,
     Terminology extraction: an analysis of linguistic              Volume 2 (Short Papers), 2018, pp. 634–639.
     and statistical approaches, in: Knowledge mining,         [40] D. Gromann, L. Wachowiak, C. Lang, B. Heinisch,
     Springer, 2005, pp. 255–279.                                   Multilingual extraction of terminological concept
[29] L. P. Jones, E. W. Gassie, Jr, S. Radhakrishnan, Index:        systems, Deep Learning and Neural Approaches
     The statistical basis for an automatic conceptual              for Linguistic Data (2021) 5.
     phrase-indexing system, Journal of the American           [41] Y. Sun, H. Qiu, Y. Zheng, Z. Wang, C. Zhang,
     Society for Information Science 41 (1990) 87–97.               Sifrank: A new baseline for unsupervised
[30] A. Oliver, M. Vázquez, J. Moré, Linguoc lexterm:               keyphrase extraction based on pre-trained lan-
     una herramienta de extracción automática de ter-               guage model, IEEE Access 8 (2020) 10896–10906.
     doi:10.1109/ACCESS.2020.2965087.                              English sentence: "The University of Florida, in part-
[42] C. Lang, L. Wachowiak, B. Heinisch, D. Gromann,           nership with Motorola, has held two <br>mobile comput-
     Transforming term extraction: Transformer-based           ing</br> design competitions". Spanish sentence : "La
     approaches to multilingual term extraction across         Universidad de Florida, en asociación con Motorola, ha
     domains, in: Findings of the Association for Com-         celebrado dos concursos de diseño de computación móvil".
     putational Linguistics: ACL-IJCNLP 2021, 2021, pp.        Output: computación móvil English sentence: "There,
     3607–3620.                                                we assume that <br>coefficients of non-renormalizable
[43] A. Ghafoor, A. S. Imran, S. M. Daudpota, Z. Kas-          terms</br> are suppressed enough to be neglected". Span-
     trati, R. Batra, M. A. Wani, et al., The impact of        ish sentence: "Aquí, asumimos que los coeficientes de
     translating resource-rich datasets to low-resource        los términos no renormalizables están suficientemente
     languages through multi-lingual text processing,          suprimidos como para ser ignorados". Output: coefi-
     IEEE Access 9 (2021) 124478–124490.                       cientes de los términos no renormalizables
[44] L. Bonifacio, V. Jeronymo, H. Q. Abonizio, I. Cam-            English sentence: "It often exploits an <br>optical dif-
     piotti, M. Fadaee, R. Lotufo, R. Nogueira, mmarco:        fusion model-based image reconstruction algorithm</br>
     A multilingual version of the ms marco passage            to estimate spatial property values from measurements
     ranking dataset, arXiv preprint arXiv:2108.13897          of the light flux at the surface of the tissue." Spanish
     (2021).                                                   sentence: "A menudo se utiliza un algoritmo de recon-
[45] M. Araújo, A. Pereira, F. Benevenuto, A compara-          strucción de imágenes basado en un modelo de difusión
     tive study of machine translation for multilingual        óptica para estimar los valores de propiedades espaciales
     sentence-level sentiment analysis, Information Sci-       a partir de medidas de la flujo de luz en la superficie del
     ences 512 (2020) 1078–1102.                               tejido." Output: algoritmo de reconstrucción de imágenes
[46] C. P. Carrino, M. R. Costa-Jussà, J. A. Fonollosa,        basado en un modelo de difusión óptica
     Automatic spanish translation of the squad dataset            English: "A second group of experiments is aimed at
     for multilingual question answering, arXiv preprint       extensions of the baseline methods that exploit charac-
     arXiv:1912.05200 (2019).                                  teristic features of the UvT Expert Collection; specifically,
[47] G. M. Rosa, L. H. Bonifacio, L. R. de Souza, R. Lotufo,   we propose and evaluate refined expert finding and pro-
     R. Nogueira, A cost-benefit analysis of cross-lingual     filing methods that incorporate <br>topicality and orga-
     transfer methods, arXiv preprint arXiv:2105.06813         nizational structure</br>." Spanish: "Un segundo grupo
     (2021).                                                   de experimentos está dirigido a extensiones de los méto-
[48] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,             dos base que aprovechan las características distintivas de
     Bert: Pre-training of deep bidirectional transform-       la Colección de Expertos de UvT; específicamente, pro-
     ers for language understanding, arXiv preprint            ponemos y evaluamos métodos refinados de búsqueda y
     arXiv:1810.04805 (2018).                                  perfilado de expertos que incorporan la topicalidad y la
[49] A. Gutiérrez-Fandiño, J. Armengol-Estapé,                 estructura organizativa." output: topicalidad y la estruc-
     M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P.       tura organizativa
     Carrino, A. Gonzalez-Agirre, C. Armentano-Oller,
     C. Rodriguez-Penagos, M. Villegas, Maria: Spanish
     language models, arXiv preprint arXiv:2107.07253
     (2021).
[50] C. Toraman, E. H. Yilmaz, F. Şahinuç, O. Ozcelik,
     Impact of tokenization on language models: An
     analysis for turkish, ACM Trans. Asian Low-Resour.
     Lang. Inf. Process. 22 (2023). URL: https://doi.org/
     10.1145/3578707. doi:10.1145/3578707.


A. Term Translation Prompt
You are a scientific translator of English to Spanish spe-
cialized in terminology. I give you one sentence in En-
glish and the same sentence translated to Spanish. The
English sentence has a term between the marks <br> and
</br>. Identify in the Spanish sentence which words cor-
respond to the same original term. The output term is in
Spanish. Some examples