Benchmark for Automatic Keyword Extraction in Spanish:
Datasets and Methods
Pablo Calleja1 , Patricia Martín-Chozas1 and Elena Montiel-Ponsoda1
1 Ontology Engineering Group, Universidad Politécnica de Madrid
Abstract
Tasks such as document indexing or information retrieval still seem to rely heavily on keywords, even in the era of LLMs.
However, there is still a need for automatic keyword extraction tools and training sets in languages other than English. To
the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation
purposes. Additionally, innovative keyword extraction methods that rely on language models are not being adapted
to language models in other languages. To remedy this situation, this work proposes a method to translate into Spanish
two of the main gold standard datasets used by the community, while preserving semantics and terms. Then, the main
state-of-the-art methods are evaluated against the newly translated datasets. The methods used for the evaluation have been
configured or re-implemented for Spanish.
Keywords
Spanish Automatic Keyword Extraction, Spanish language, SemEval2017, SemEval2010
1. Introduction

Keywords, typically defined as words or terms that best characterise the topics discussed in a document, have proven essential for different NLP tasks such as information extraction (IE), text mining, or information retrieval (IR) [1]. With the exponential growth of available digital documents, a need emerged for algorithms capable of automatically identifying the single or compound terms (also referred to as key segments or key phrases) that best represent the most relevant information of a document, a task better known as Automatic Keyword or KeyPhrase Extraction (AKE).

Nowadays, even in the face of generative AI algorithms and Large Language Models (LLMs), AKE algorithms are not only used to classify, retrieve, or inspect large corpora [1, 2, 3], but also to fine-tune LLMs and post-process their output.

However, automatically extracting keywords is a challenging task due to the complexities of natural language, document heterogeneity, and the type of keywords that are usually needed. The current state of the art offers a wealth of methods and tools, from the earliest based on lexico-syntactic patterns and frequencies [4], to those purely based on statistics [5, 6], to the most recent ones based on language models.

Keyword extraction methods have generally been classified into supervised and unsupervised methods. Traditional supervised methods are based on decision trees [7], naive Bayes [8] or Conditional Random Fields [9]. In the past 10 years, several models have emerged based on neural networks and deep learning [10, 11]. The most recent approaches rely on language models and attention mechanisms [12, 13].

Supervised methods tend to offer the best results in the machine learning literature, but they require a large labelled training corpus. To achieve that, human experts have to manually annotate large amounts of data, which is a costly and tedious task. The resulting annotations identify the specific keywords that should be extracted from each sentence, paragraph or document in the corpus. On the other hand, unsupervised methods, such as statistical or graph-based approaches, do not require labelled corpora. Statistical methods [5, 14] use candidate position, frequency, length, and capitalisation to determine the importance of a word. Graph-based approaches [15, 16] construct a graph with the candidates as nodes; the edges indicate similarity or co-occurrence of candidates.

Some of the best-known datasets for automatic keyword extraction, such as SemEval2010 [17], SemEval2017 [18] or Inspec [19], have been created for evaluation tasks and are commonly used to evaluate new methods (both supervised and unsupervised), and not so much for training.

However, all these efforts are not language agnostic. Most of the works so far have been oriented towards the English language, giving little coverage to other languages such as Spanish. As far as we know, there are no publicly available annotated training corpora in Spanish. Therefore, supervised algorithms cannot be easily implemented, and evaluations for supervised or unsupervised algorithms are difficult to perform.

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
p.calleja@upm.es (P. Calleja); patricia.martin@upm.es (P. Martín-Chozas); elena.montiel@upm.es (E. Montiel-Ponsoda)
ORCID: 0000-0001-8423-8240 (P. Calleja); 0000-0002-8922-7521 (P. Martín-Chozas); 0000-0003-3263-3403 (E. Montiel-Ponsoda)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
In this paper, a method to translate two of the most important corpora for AKE is proposed and applied to their translation into Spanish. The main aim of this work is to create a 'silver standard' to support the training and evaluation of automatic keyword extraction in Spanish. The translation process has been designed to preserve the semantics and terminological representation of the original texts and the annotations. The translation is supported by the Google Translate service and by ChatGPT 3.5.

Additionally, a benchmark has been generated with five of the most relevant methods in the current state of the art on the two translated corpora. The methods have been configured for Spanish, and two of them have been re-implemented to use Spanish language models.

The rest of the paper is structured as follows: in section 2 we provide a summary of the state of the art in Automatic Keyword Extraction. Section 3 is devoted to the method for the translation of the corpora. Section 4 describes the different AKE methods with their configurations or adaptations for the Spanish language, and section 5 presents the results of the evaluation benchmark. Finally, section 6 highlights the conclusions and recommendations for future work. Both experiments and results are reported in a GitHub repository¹.
¹ https://github.com/oeg-upm/spanish-termex

2. State of the art

As stated by [1], 'keywords' and 'keyphrases' do not refer to any theory. An element is considered a 'key' element within a document when it is an important descriptor of the document content. The use of 'word' versus 'phrase' refers to the number of textual units, which can be one (1-gram) or several (n-grams). Since such keywords or keyphrases mostly correspond to terms, defined as words that are specific to a domain, the AKE task is closely related to the so-called Automatic Terminology Extraction/Retrieval (ATE/ATR) task, i.e., the task of identifying relevant terms in a corpus [20].

Lossio-Ventura et al. [21] described in their work some fundamental differences between the term extraction and keyword extraction tasks. One major difference is that extracting terms requires a large collection of texts, which is not a necessary requirement in keyword extraction, which can take only a single document as input. Also, ATE methods aim to extract term-like units and remove those that may not be terms, syntactically or terminologically. On the other hand, AKE methods extract the 'key' elements of a document, which are not limited to terms. Thus, while AKE methods can be domain independent, ATE methods apply to specific fields or professional domains, since their main goal is to build resources that contain the lexical units that are representative of a domain.

Although these two tasks have been conceived for different purposes, the truth is that, when performed automatically, they obtain similar results and performance, as both rely on linguistic and textual features (at sentence, paragraph or document levels). Thus, several state-of-the-art methods have been used for both tasks.

In this section, we review the most relevant works in this area, making a distinction between traditional approaches (linguistic and statistical) and machine learning and neural approaches.

2.1. Traditional approaches

The algorithms considered in this section are usually based on linguistic patterns, relying on parsing and part-of-speech tagging processes to identify terms [22]. These patterns were very prolific in the 1990s, with systems such as LEXTER [23]. This kind of approach [24] has persisted until today, as patterns are the main starting point to automatically identify keywords or terms in documents and corpora. More advanced pattern-based works went further to identify the concept evoked by term variants in several languages, as in the work by [25] for English and French. In any case, the majority of these works are language dependent.

Later on, researchers started to combine various types of linguistic techniques, such as pattern-based techniques, regular expressions, stop word lists, and post-processing algorithms, to mention but a few. In this context, tools such as TermExtractor emerged, a system that combines several of the previously mentioned techniques and applies post-processing filters like domain pertinence, lexical cohesion or structural relevance [26].

More advanced works in the literature started to use statistical approaches in combination with linguistic functionalities, which appeared to improve the results. The process behind statistical approaches generally consists of weighting the frequency of occurrence of a combination of words (n-grams) in a text. Normally, statistical algorithms are divided into two types: 1) those based on unithood, which measures the strength of unity of complex units (such as χ², T-score and z-score), and 2) those based on termhood, which measures the degree of representation of domain-specific concepts, such as C-value or co-occurrence [27, 28]. Some of these purely statistical term extractors are INDEX for English [29], Lexterm [30] for Spanish, and RAKE [5] for keyword extraction in English.

In contrast, it is most common to find mixed approaches, such as TerMine, a term extractor that combines C-value with linguistic information [4], or TermSuite, which applies distributional and compositional methods [31]. In [32], the authors combine linguistic processes such as segmentation, PoS tagging and morphological analysis with semantic knowledge extracted from external resources and statistical techniques.
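The termhood notion mentioned above can be made concrete with a minimal implementation of the C-value measure (a sketch following Frantzi and Ananiadou's standard definition, not code from any of the cited extractors; candidate terms are represented as word tuples with pre-computed corpus frequencies):

```python
from math import log2

def c_value(freq):
    """Compute C-value termhood scores for multi-word candidate terms.

    freq maps a candidate (tuple of words) to its corpus frequency.
    Candidates nested inside longer candidates are penalised by the
    mean frequency of the longer terms that contain them.
    """
    def contains(longer, shorter):
        n, m = len(longer), len(shorter)
        return any(longer[i:i + m] == shorter for i in range(n - m + 1))

    scores = {}
    for a, fa in freq.items():
        nests = [fb for b, fb in freq.items()
                 if len(b) > len(a) and contains(b, a)]
        if nests:
            scores[a] = log2(len(a)) * (fa - sum(nests) / len(nests))
        else:
            scores[a] = log2(len(a)) * fa
    return scores

# Toy corpus counts (illustrative numbers only).
freq = {
    ("floating", "point"): 4,
    ("floating", "point", "arithmetic"): 3,
}
scores = c_value(freq)
# The bigram is nested in the trigram, so it scores log2(2) * (4 - 3) = 1.0,
# while the non-nested trigram scores log2(3) * 3, roughly 4.75.
```

The penalty term is what distinguishes termhood measures from raw frequency: a frequent substring that mostly occurs inside a longer term is demoted in favour of the longer, more specific unit.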
Other works, such as TextRank [33], create a graph from the text to extract keywords based on statistical metrics.

2.2. Machine Learning and Neural approaches

These approaches exploit different features (linguistic or not) to identify keywords. For instance, Rose et al. [5] identified keywords based on word frequency, the number of co-occurring neighbours, and the ratio between the co-occurrence and the frequency. Campos et al. [34] proposed YAKE, which calculates the importance of each candidate using frequency, offsets, and co-occurrence. The SemCluster method [35] first clusters the candidates based on semantic similarity, and the cluster centroids are selected as keywords. TopicRank [36] first assigns a score to each topic by clustering candidate keywords; the topics are scored using the TextRank ranking model, and keywords are extracted by taking the most representative candidate from the top-ranked topics. Florescu et al. [37] proposed PositionRank, which uses the position of word occurrences to improve TextRank on a document.

Word embeddings have also been widely used. Wang et al. [38] made use of pre-trained word embeddings and the frequency of each word to generate weighted edges between words in a document; a weighted PageRank algorithm was then used to compute the final scores of words. Also, Key2Vec [39] followed a similar approach, using phrase embeddings to represent the candidates and ranking the importance of the phrases by calculating their semantic similarity and co-occurrences.

Currently, new approaches based on pre-trained neural language models have appeared in the literature. For instance, Text2TCS² [40] is able to extract terms and relations from raw text, creating taxonomies automatically. [41] proposed SIFRank, the integration of a statistical model and a pre-trained language model, to calculate the relevance between candidates and document topics. Other works focus on the extraction of multilingual terminology across domains using transformers [42].
² https://live.european-language-grid.eu/catalogue/tool-service/8122

Two of the most recent works in the field of AKE using language models are AttentionRank and MDERank. AttentionRank [13] integrates self-attention weights extracted from a pre-trained language model with a calculated cross-attention relevancy value to identify keywords that are important to the local sentence context and also have strong relevancy to all sentences within the whole document. MDERank [12] bases the identification of keywords on the embedding representation of the sentence using masked tokens. Moreover, their work proposes a new type of BERT architecture to be trained as a language model, but for the purpose of keyword identification.

3. Dataset generation

In the era of machine learning approaches, datasets are an essential requirement to train and, more importantly, evaluate algorithms for different NLP tasks. For instance, in the field of Automatic Keyword Extraction, there are well-known gold standard datasets that are commonly used to evaluate approaches within the literature, such as SemEval2010 Task 5 [17] and SemEval2017 Task 10 [18]. However, the availability of such datasets in languages other than English is limited [43]. Consequently, a common approach to overcome this limitation is to translate the available datasets into the target language [44, 45], including Spanish [46].

To the best of our knowledge, there is no consolidated dataset in Spanish for Automatic Keyword Extraction; therefore, the first contribution of this work is the development of an evaluation corpus for keyword extraction in Spanish which results from translating two of the most common English AKE datasets: SemEval2010 and SemEval2017. The target of this contribution is to generate a 'silver standard' labelled dataset, to provide researchers in the field with a consolidated framework to test and evaluate their approaches.

However, the translation process for labelled datasets is not a straightforward task. As [47] demonstrated in their work, labelled datasets have their labels linked to one token or a span of tokens. Since sentence structure can vary across languages, it is very challenging to retain the same annotation structure after the translation process. To overcome such difficulties, we have organised the translation process into two phases: Phase 1) Source Dataset Analysis and Source Dataset Preprocessing, described in Section 3.1, and Phase 2) Source Dataset Translation and Target Dataset Postprocessing, described in Section 3.2.

Figure 1 summarises the method for the translation process in which, given the two original datasets, a set of four datasets translated into Spanish is obtained, using two different translation systems.

3.1. Phase 1: Dataset analysis and preprocessing

In order to generate the proposed silver standard for Spanish AKE, we have selected the two previously mentioned datasets, as they are widely used in experiments of this kind: SemEval2010 Task 5 [17] and SemEval2017 Task 10 [18].
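The annotation-preservation challenge described above can be illustrated with a toy check (the Spanish strings are hypothetical translator outputs, not taken from the datasets): a keyword translated in isolation may no longer occur verbatim in the independently translated text, which is why this work annotates the keywords in context before translating.

```python
def unmatched_keywords(text, keywords):
    """Return the keywords that do not occur verbatim in the text."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() not in lowered]

# English source: the keyword list aligns with the text.
en_text = "We evaluate mobile computing platforms."
assert unmatched_keywords(en_text, ["mobile computing"]) == []

# Hypothetical independent translations: the keyword translated in
# isolation ("informática móvil") differs from the in-context wording
# chosen by the translator ("computación móvil"), so the label no
# longer matches the translated text.
es_text = "Evaluamos plataformas de computación móvil."
assert unmatched_keywords(es_text, ["informática móvil"]) == ["informática móvil"]
```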
Both datasets are published following the same structure: a set of documents containing the raw text (named docsutf8) and a set of documents containing the extracted keywords (named keys). Both types of documents share the same identifiers to match keywords with source documents.

[Figure 1: Method for dataset translation. In Phase 1, the terms in SemEval2010 and SemEval2017 are annotated: with quotes for the Google Translator, and with an HTML tag for the ChatGPT 3.5 few-shot prompt. In Phase 2, translation yields Spa_SemEval2010GT, Spa_SemEval2010GPT, Spa_SemEval2017GT and Spa_SemEval2017GPT, followed by a manual revision.]

Table 1
Metrics for the SemEval2010 and SemEval2017 datasets, including keywords.

                        SemEval2010   SemEval2017
  Documents                     243           493
  Tokens                  2,334,613        95,877
  Keywords                    3,785         8,529
  Unmatched keywords            555             0

Despite their similar structure, they present several differences. As shown in Table 1, the main difference lies in their size. With a smaller number of documents, SemEval2010 far exceeds SemEval2017 in the total number of tokens: it contains fewer documents, but of a much larger size. SemEval2017 contains shorter documents with an average of 6 to 7 sentences, whereas SemEval2010 contains full scientific papers with hundreds of sentences. It is interesting to note that, although SemEval2010 is far bigger in number of tokens, SemEval2017 has a bigger number of extracted keywords. This means that the keywords from SemEval2010 have greater representation and number of occurrences than the keywords from SemEval2017. These differences in size are important because they require a different treatment of the documents during the preprocessing and translation stages.

In both datasets, over 50% of the keywords are unigrams or bigrams. However, in SemEval2010 we observe that 555 keywords are not present in the documents with a similar text span. The reason for this is to be found in the way the original dataset was created: in SemEval2010, some of the keywords come from the ones manually provided by the authors of the papers themselves, and they may not have an exact correspondence in the text.

Regarding the preprocessing of the datasets, there are two main aspects involved in the translation process. The first one refers to the original text. Not many issues were found during the translation of the SemEval2017 corpus, since it has a manageable size and a clean structure. However, the original texts of SemEval2010 were arbitrarily segmented, very long, and contained references and formulas, which posed many problems for the automatic translator when processing them.

The second aspect refers to the keywords. For the translation of the keywords, we did not simply translate the list of keywords out of context, but decided to mark them in the texts with annotation marks (quotation marks or an HTML tag, depending on the translation system). Then, we translated the texts and retrieved the translated terms contained within the annotation marks.

3.2. Phase 2: Dataset translation and postprocessing

Most of the existing approaches that create silver standards from existing gold standards by leveraging machine translation rely on at least two translation sources: one from a common online translator such as DeepL³ or Google Translate⁴, and the other using a Neural Machine Translation model, as suggested in [44]. As already announced, in this work we have used the Google Translate and ChatGPT 3.5 Turbo⁵ APIs.
³ https://www.deepl.com/es/translator
⁴ https://translate.google.es/
⁵ https://platform.openai.com/docs/models/gpt-3-5-turbo

The keywords from the texts that were translated with Google Translate were annotated with quotation marks. However, on some occasions the system returned errors in which the annotation marks were missing or misplaced in the translated sentence, and either it was not possible to extract the translated term from the annotated sentence or the extracted term was not correct.
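The quotation-mark scheme just described can be sketched as follows (the translation call itself is omitted, and the regular expressions are an illustration, not the exact implementation used in this work):

```python
import re

def mark(text, keyword):
    """Wrap every occurrence of the keyword in quotation marks."""
    return re.sub(re.escape(keyword), lambda m: f'"{m.group(0)}"', text)

def extract_marked(translated):
    """Retrieve the spans that survived translation inside quotes."""
    return re.findall(r'"([^"]+)"', translated)

marked = mark("advances in mobile computing and sensors", "mobile computing")
assert marked == 'advances in "mobile computing" and sensors'

# Hypothetical translator output: if the quotes survive, they locate
# the translated keyword; if the translator drops or misplaces them,
# extraction fails and a fallback is needed.
translated = 'avances en "computación móvil" y sensores'
assert extract_marked(translated) == ["computación móvil"]
```

When the marks are lost, `extract_marked` returns nothing for that keyword, which motivates the fallback of appending the original term to the sentence.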
To avoid that, we decided to append the original term to each annotated sentence, to force the system to take that term into account and provide a translation. For instance, in the translation of the sentence '...has held two "mobile computing" design competitions', focused on the term 'mobile computing', the translation lost the quotation marks: 'ha celebrado dos concursos de diseño de computación móvil'. Thus, we append the term to the sentence to obtain its translation: '...has held two "mobile computing" design competitions. Mobile computing'.

With ChatGPT, an HTML tag was used to mark the keywords before and after. The prompt sent to the generative model described the purpose of the model (i.e., 'You are a Spanish translator specialised in terminology'), and then some examples of annotations in English and their translations into Spanish with the annotated and translated keywords were provided. This is called few-shot prompting. The full prompt is presented in Annex A.

Regarding the postprocessing stage, several actions were performed. First, we extracted all the annotated occurrences of each keyword in the sentences, creating a list of translation candidates per keyword. In some cases, reconciliation between candidates was necessary to provide a single translation for each keyword. In the case that no disparities between the candidates were found, the translated keyword was automatically assigned. In case of disparities, terms were manually reviewed and a translated keyword was manually assigned. In total, we manually reviewed an average of 2000 keywords per dataset (220 documents in SemEval2010 and 360 documents in SemEval2017).

4. AKE Adaptation to Spanish

In this section, the different AKE methods used for the experiments and their implementation are presented. Some of them are already implemented and maintained by well-known Python libraries and contain adapters to work with other languages. Two of them, those that are based on language models, had to be re-implemented and adapted. In addition to different technical aspects, both methods use the original BERT model [48] for English, and the RoBERTa MarIA model [49] for Spanish.

4.1. Already implemented methods

The methods used for the evaluation are TopicRank, YAKE and RAKE. The Python library PKE⁶ has been used for the execution of the TopicRank and YAKE methods. PKE uses the Python library spaCy⁷, as many other methods do, to identify candidate chunks or nominal phrases that can be relevant terms or keywords. Thus, the Spanish model of spaCy has to be downloaded before the methods can be run.
⁶ https://github.com/boudinfl/pke
⁷ https://spacy.io/

For the RAKE method, the original library cannot be used, as it is only oriented to the English language. However, there is a version named Multi-rake⁸ which covers different languages. As the method is statistical, to perform multilingually, the addition of stopword lists for the different target languages is necessary.
⁸ https://github.com/vgrabovets/multi_rake

4.2. AttentionRank

The implementation of the original authors⁹ had to be reimplemented from scratch. The original repository does not specify libraries and versions. Moreover, the original code relies on libraries for language models that are no longer maintained, as well as on a noun phrase identification component that depends on the part-of-speech annotation of Stanford CoreNLP and a third-party library. Reproducing it was therefore not possible in this work.
⁹ https://github.com/hd10-iupui/AttentionRank

A new repository¹⁰ has been created for the implementation of the AttentionRank method. This repository uses HuggingFace's transformers library to manage language models and spaCy to identify noun phrases. The repository details the specific libraries and versions needed, as well as the required external modules. The new repository allows the use of BERT (as in the original work) and RoBERTa architecture models in different languages.
¹⁰ https://github.com/oeg-upm/AttentionRankLib

The adaptation for RoBERTa models had to deal with two specific issues regarding the tokeniser. The first one is the use of different special tokens to delimit sentences at the beginning and at the end to focus the attention mechanisms: where BERT uses the '[CLS]' and '[SEP]' tokens, RoBERTa uses '<s>' and '</s>'. The second issue is the generated tokens: BERT uses a WordPiece tokeniser in which subwords are marked with the '##' prefix (e.g., the word thicknesses is divided into the tokens thickness and ##es). In contrast, RoBERTa models use byte-level Byte-Pair Encoding (BPE) and assign different tokens to character sequences that start a word and to those inside one. The tokens that start a word include the white space before the word, and they are marked with the special character 'Ġ'. For instance, the word extrapolate is divided into two tokens: 'Ġextrap' and 'olate'.

Beyond the differences studied in previous works on the benefits and differences of both types of tokenisers [50], this work had to develop the alignment process between the words of keywords and their corresponding tokens. With WordPiece it is easier to find tokens and recompose the original word, but BPE is sensitive to the presence of the white space before the token: if it does not appear, the token is different and its attention value changes.
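The word-token alignment can be sketched with a small recomposition helper (an illustration of the two conventions, not the repository's actual code):

```python
def recompose(tokens, wordpiece=True):
    """Rebuild words from subword tokens.

    WordPiece (BERT): continuation pieces carry a '##' prefix; any
    other piece starts a new word.
    Byte-level BPE (RoBERTa): a leading 'Ġ' encodes the white space
    that starts a new word; pieces without it continue the previous word.
    """
    words = []
    for tok in tokens:
        if wordpiece:
            if tok.startswith("##") and words:
                words[-1] += tok[2:]        # WordPiece continuation
            else:
                words.append(tok)           # new word
        else:
            if tok.startswith("Ġ") or not words:
                words.append(tok[1:] if tok.startswith("Ġ") else tok)
            else:
                words[-1] += tok            # BPE continuation (no space)
    return words

# Examples from the text: BERT splits 'thicknesses' into thickness + ##es,
# while RoBERTa splits ' extrapolate' into Ġextrap + olate.
assert recompose(["thickness", "##es"]) == ["thicknesses"]
assert recompose(["Ġextrap", "olate"], wordpiece=False) == ["extrapolate"]

# BPE is sensitive to the leading white space: without 'Ġ' the same
# characters continue the previous word instead of starting a new one.
assert recompose(["Ġthe", "extrap"], wordpiece=False) == ["theextrap"]
```

The last assertion shows why a missing leading space changes the tokenisation, and hence the attention values assigned to a keyword.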
This issue has been solved by ensuring that the input sentences always have a white space before each word.

4.3. MDERank

The original implementation¹¹ contains a better description of the requirements. However, it targets Python 3.7, which is no longer supported by the community, and most of the required library versions are deprecated. Also, parts of the execution code are wrong, such as the command line invocation and its arguments, and there is no code related to the KPEBERT model, a model which is trained and used for keyword identification; it is only possible to execute the method with standard BERT models.
¹¹ https://github.com/LinhanZ/mderank

To update the code and method, a new repository¹² has been created, in which the requirements, code and execution process have been improved. Like AttentionRank, MDERank used Stanford CoreNLP for the identification of noun fragments, and this has been updated to spaCy. Finally, the method can now support RoBERTa models by taking into account the issues mentioned for AttentionRank.
¹² https://github.com/oeg-upm/mderanklib

5. Evaluation

This section discusses the evaluation results obtained from the execution of the five AKE methods on the four translated datasets (Spa_SemEval2010GT, Spa_SemEval2010GPT, Spa_SemEval2017GT and Spa_SemEval2017GPT). The metrics used in the evaluation are precision, recall and F1-measure. Following previous works in the literature, the methods are evaluated with the three metrics at the top K keywords extracted by each method, with K equal to 5, 10, and 15. Finally, we perform an error analysis and present a discussion around it. Table 2 shows the results obtained.

5.1. Results

Table 2 shows the results for each top K (5, 10, 15) and method. The results are grouped by the type of dataset and the translation system used: Spa_SemEval2010GT, Spa_SemEval2010GPT, Spa_SemEval2017GT and Spa_SemEval2017GPT, where GT stands for Google Translate and GPT stands for ChatGPT 3.5. Additionally, the column named BR, which stands for Best Result, shows the best F1 result reported on the original datasets in English (BR10 for SemEval2010 and BR17 for SemEval2017). These results are taken from the AttentionRank work [13], except for the results for MDERank, which are taken from its own published work [12].

The results of the AKE algorithms on the Spanish datasets, both multilingual and adapted for Spanish, show a lower performance compared to the original datasets. However, they are in line with the results for English. Unlike many other NLP experiments, where a good result is represented by metrics starting at 0.6 or 0.7 F1 score, the highest metric achieved by the algorithms tested on SemEval2010 and 2017 does not exceed 0.3821 (BR17 at K=15).

We already expected lower values, as the translation process is not perfect and it is not always possible to maintain the correspondence of a keyword in English to the same keyword in Spanish. Apart from the errors detected (explained in Section 5.2), ChatGPT 3.5 showed better performance in maintaining the structure and terminology of the translated documents.

It is also important to mention the different results obtained for each dataset. For Spa_SemEval2017GT and Spa_SemEval2017GPT the best results, in terms of precision, recall and F1-score, are obtained by the two methods that are based on language models: AttentionRank and MDERank. Although the original dataset contains complex keywords, the language models perform well, as on the English dataset.

Surprisingly, for Spa_SemEval2010GT and Spa_SemEval2010GPT the best results are obtained by YAKE. The nature of the documents in SemEval2010, which are full papers without any cleaning, including formulas, references and citations, makes it difficult for a language model to perform well. An added issue is the large length of the documents, which in the case of RAKE produces results close to zero.

5.2. Error Analysis and Discussion

After a thorough analysis of the results, we conclude that, beyond some translation errors, the main reason behind the low numbers seems to be the poor quality of some keywords in the original datasets. Although both datasets are claimed to have been either generated or reviewed by humans, we have detected a great number of anomalies that may be the main source of errors, as we try to illustrate below:

• Duplicated structures: we find similar structures with small variations which produce noise and inconsistencies, such as terms with determiners (i.e. metal and the metal), terms with symbols or special characters (i.e. logical inference and "logical inference"), and terms with different spellings (i.e. reputation mechanism and Reputation mechanism).

• Misspelled structures: we found several examples of misspelled structures and, specifically, missing letters both at the beginning and at the end of the structure (i.e. aked instead of baked).
Table 2
Evaluation of five AKE methods against the translated datasets measuring Precision (p), Recall (r) and F-measure (F1). Each evaluation has taken into account the K (top n) value for 5, 10 and 15. Also, the best F1 obtained for the original SemEval2010 and SemEval2017 in English (BR10 and BR17) with each method is reported.

                   Spa_SE2010GT         Spa_SE2010GPT        BR10    Spa_SE2017GT         Spa_SE2017GPT        BR17
k   Method         p     r     F1       p     r     F1               p     r     F1       p     r     F1

5   RAKE           0.00  0.00  0.00     0.08  0.03  0.04     0.67   12.17  3.97  5.98    14.88  5.15  7.66    13.24
    TopicRank      4.77  1.65  2.45     7.08  2.53  3.73     5.26   19.39  5.85  8.99    21.94  6.87 10.47    15.92
    YAKE           7.49  2.58  3.83    10.95  3.85  5.69     8.46   10.47  3.39  5.13    18.86  6.45  9.61    12.05
    AttentionRank  7.52  2.60  3.86     9.30  3.32  4.89    11.39   19.51  5.88  9.03    24.66  7.84 11.89    23.59
    MDERank        7.63  2.44  3.70     9.62  3.11  4.70    12.95   19.39  5.60  8.69    27.46  7.94 12.32    22.81

10  RAKE           0.00  0.00  0.00     0.16  0.11  0.13     1.33   12.70  8.16  9.93    14.86 10.07 12.00    22.61
    TopicRank      4.77  3.28  3.89     6.38  4.50  5.28     7.43   15.98  9.45 11.88    17.97 11.07 13.70    20.60
    YAKE           7.37  5.07  6.01     9.42  6.56  7.74    11.98   11.87  7.62  9.28    18.09 12.19 14.56    18.16
    AttentionRank  7.22  4.38  5.45     9.11  5.45  6.81    15.12   16.71  9.96 12.48    20.54 12.91 15.85    34.37
    MDERank        7.17  4.59  5.60     8.88  5.74  6.97    17.07   15.92  9.20 11.66    22.45 12.98 16.45    32.51

15  RAKE           0.05  0.05  0.05     0.11  0.11  0.11     1.78   11.98 11.25 11.60    14.02 13.90 13.96    26.87
    TopicRank      4.36  4.39  4.38     5.38  5.65  5.51     8.02   13.61 12.10 12.81    15.09 13.85 14.44    22.37
    YAKE           6.83  7.02  6.93     8.56  9.04  8.79    12.87   11.33 10.70 11.01    17.20 17.09 17.15    20.72
    AttentionRank  6.70  5.83  6.23     7.90  7.97  7.93    16.66   14.20 12.52 13.31    17.09 15.93 16.49    38.21
    MDERank        6.27  6.03  6.15     7.79  7.54  7.66    20.09   13.84 12.01 12.86    19.31 16.75 17.93    37.18
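The top-K precision, recall and F1 reported in Table 2 can be computed as in the following sketch (exact string matching over keyword sets; the paper's matching criteria may differ in detail):

```python
def prf_at_k(predicted, gold, k):
    """Precision/recall/F1 over the top-k predicted keywords.

    Precision uses k as the denominator; min(k, len(predicted)) is a
    common variant when fewer than k keywords are returned.
    """
    top = set(predicted[:k])
    gold = set(gold)
    correct = len(top & gold)
    p = correct / k
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example with made-up Spanish keywords (not from the datasets).
predicted = ["redes neuronales", "modelo", "aprendizaje profundo",
             "datos", "extracción de palabras clave"]
gold = ["redes neuronales", "aprendizaje profundo", "corpus"]
p, r, f1 = prf_at_k(predicted, gold, k=5)
# 2 of the top-5 predictions are gold keywords: p = 0.4, r = 2/3, f1 = 0.5.
```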
letters both at the beginning and at the end of the structure (i.e. aked instead of baked).

• Non-terminological structures: This is the most common anomaly in both datasets, and one of the main causes of the low performance of the algorithms, both in English and in Spanish. Examples of such non-terminological structures are: full sentences (i.e. dynamics which clearly reveal the origins of the roaming), sentence fragments (i.e. loading force and penetration depth were recorded and their respective values were correlated with the observed), concatenated structures (i.e.1. well defined phase space dividing surfaces attached to, i.e.2. austenitic or austenitic & ferritic stainless steel), or even text fragments with references (i.e.1. comparison between the realistic calculations for positive parity [12] and negative parity [14], based on the same quark model [15], i.e.2. calculation by Martinez-Pinedo et al.).

In addition to the inaccuracies and anomalies mentioned above, we observe in the results that in some instances the same keyword has been translated differently into Spanish in different parts of the text. For example, the term deployment has been translated both as despliegue and as implementación within the same text; likewise, the compound term information aggregation can be found translated as agregación de información and as agregación de la información. In itself, this is not a problem, since these are all correct translations in Spanish. Moreover, even in specialised domains, term variants are commonly used to designate the same concept.

A similar issue occurs when Spanish terms vary in gender and number. For instance, the keyword ferromagnetic can be found translated into two different keywords throughout the text, as ferromagnética and ferromagnéticos. However, with the aim of being faithful to the original evaluation datasets, we decided to choose one of the translations and discard the alternatives, although we believe that the datasets would benefit from including such variation.

6. Conclusions

This work has analysed the current state of the art in automatic keyword extraction and, in particular, the Spanish landscape. In this analysis, we have identified the lack of an evaluation framework (including datasets and ready-to-test algorithms) for AKE in Spanish. Consequently, this paper makes two contributions. First, the generation of a silver standard for the Spanish language community through the translation of two English datasets widely used to evaluate AKE approaches: SemEval2010 and SemEval2017. Second, the configuration of a set of state-of-the-art algorithms in an easily executable manner to facilitate the evaluation task, including the adaptation of two recent methods that rely on language models: AttentionRank and MDERank.

With the benchmark in place, we have evaluated the implemented algorithms on the translated datasets. To be consistent with the evaluations in English, the translated datasets maintain the original inner structure. The results in Spanish show the same tendency as in English, although the scores are lower. The error analysis shows that the low results are due to several factors: 1) the quality of the original datasets, as they contain noisy texts, non-terminological structures, and
terms that are not contained in the texts; 2) the quality of the translations of the labelled datasets, as both systems present translation inconsistencies and have difficulties keeping track of the translated keyword in the text; and 3) the fact that a one-to-one translation of keywords is not always possible or desirable, and that it would be advisable to include term variants.

In light of the results and taking these remarks into account, we conclude that maintaining the English dataset structure to evaluate AKE tasks in Spanish might not be the most appropriate approach. For this reason, as part of future work, we are considering two approaches for generating evaluation datasets in Spanish: 1) automatically postprocessing existing datasets, such as the two dealt with in this work, to eliminate all non-terminological structures and to produce a list of candidate terms, instead of just one, in the translation process; and 2) semi-automatically generating a dataset with similar characteristics to the ones mentioned, but based on texts originally written in Spanish.

Acknowledgments

This work has been partially supported by the INESDATA project (https://inesdata-project.eu/), funded by the Spanish Ministry of Digital Transformation and Public Affairs and NextGenerationEU, within the framework of the UNICO I+D CLOUD Program - Real Decreto 959/2022.

References

[1] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, Keyword extraction: Issues and methods, Natural Language Engineering 26 (2020) 259–291. doi:10.1017/S1351324919000457.
[2] O. Borisov, M. Aliannejadi, F. Crestani, Keyword extraction for improved document retrieval in conversational search, arXiv preprint arXiv:2109.05979 (2021).
[3] H. Shah, R. Mariescu-Istodor, P. Fränti, Webrank: Language-independent extraction of keywords from webpages, in: 2021 IEEE International Conference on Progress in Informatics and Computing (PIC), IEEE, 2021, pp. 184–192.
[4] K. Frantzi, S. Ananiadou, H. Mima, Automatic recognition of multi-word terms: the c-value/nc-value method, International Journal on Digital Libraries 3 (2000) 115–130.
[5] S. Rose, D. Engel, N. Cramer, W. Cowley, Automatic keyword extraction from individual documents, Text Mining: Applications and Theory 1 (2010) 1–20.
[6] A. Oliver, M. Vàzquez, Tbxtools: a free, fast and flexible tool for automatic terminology extraction, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, 2015, pp. 473–479.
[7] P. D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval 2 (2000) 303–336.
[8] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, C. G. Nevill-Manning, Kea: Practical automatic keyphrase extraction, in: Proceedings of the Fourth ACM Conference on Digital Libraries, 1999, pp. 254–255.
[9] D. Sahrawat, D. Mahata, M. Kulkarni, H. Zhang, R. Gosangi, A. Stent, A. Sharma, Y. Kumar, R. R. Shah, R. Zimmermann, Keyphrase extraction from scholarly articles as sequence labeling using contextualized embeddings, arXiv preprint arXiv:1910.08840 (2019).
[10] R. Alzaidy, C. Caragea, C. L. Giles, Bi-lstm-crf sequence labeling for keyphrase extraction from scholarly documents, in: The World Wide Web Conference, 2019, pp. 2551–2557.
[11] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, Y. Chi, Deep keyphrase generation, arXiv preprint arXiv:1704.06879 (2017).
[12] L. Zhang, Q. Chen, W. Wang, C. Deng, S. Zhang, B. Li, W. Wang, X. Cao, Mderank: A masked document embedding rank approach for unsupervised keyphrase extraction, arXiv preprint arXiv:2110.06651 (2021).
[13] H. Ding, X. Luo, Attentionrank: Unsupervised keyphrase extraction using self and cross attentions, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 1919–1928.
[14] R. Campos, V. Mangaravite, A. Pasquali, A. M. Jorge, C. Nunes, A. Jatowt, Yake! collection-independent automatic keyword extractor, in: Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings 40, Springer, 2018, pp. 806–810.
[15] X. Wan, J. Xiao, Single document keyphrase extraction using neighborhood knowledge, in: AAAI, volume 8, 2008, pp. 855–860.
[16] S. D. Gollapalli, C. Caragea, Extracting keyphrases from research papers using citation networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
[17] S. N. Kim, O. Medelyan, M.-Y. Kan, T. Baldwin, SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles, in: K. Erk, C. Strapparava (Eds.), Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 21–26. URL: https://aclanthology.org/S10-1004.
[18] I. Augenstein, M. Das, S. Riedel, L. Vikraman, A. McCallum, SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications, in: S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. Cer, D. Jurgens (Eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 546–555. URL: https://aclanthology.org/S17-2091. doi:10.18653/v1/S17-2091.
[19] A. Hulth, Improved Automatic Keyword Extraction Given More Linguistic Knowledge, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, 2003, pp. 216–223. URL: https://aclanthology.org/W03-1028.
[20] A. Oliver, M. Vàzquez, A free terminology extraction suite, in: Proceedings of Translating and the Computer 29, 2007.
[21] J. A. Lossio-Ventura, C. Jonquet, M. Roche, M. Teisseire, Combining c-value and keyword extraction methods for biomedical terms extraction, in: LBM: Languages in Biology and Medicine, 2013.
[22] J. S. Justeson, S. M. Katz, Technical terminology: some linguistic properties and an algorithm for identification in text, Natural Language Engineering 1 (1995) 9–27.
[23] D. Bourigault, Surface grammatical analysis for the extraction of terminological noun phrases, in: COLING 1992 Volume 3: The 14th International Conference on Computational Linguistics, 1992.
[24] K. Kageura, E. Marshman, Terminology extraction and management, in: The Routledge Handbook of Translation and Technology, Routledge, 2019, pp. 61–77.
[25] B. Daille, Conceptual structuring through term variations, in: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003, pp. 9–16.
[26] F. Sclano, P. Velardi, Termextractor: a web application to learn the shared terminology of emergent web communities, in: Enterprise Interoperability II, Springer, 2007, pp. 287–290.
[27] K. Kageura, B. Umino, Methods of automatic term recognition: A review, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 3 (1996) 259–289.
[28] M. T. Pazienza, M. Pennacchiotti, F. M. Zanzotto, Terminology extraction: an analysis of linguistic and statistical approaches, in: Knowledge Mining, Springer, 2005, pp. 255–279.
[29] L. P. Jones, E. W. Gassie Jr., S. Radhakrishnan, Index: The statistical basis for an automatic conceptual phrase-indexing system, Journal of the American Society for Information Science 41 (1990) 87–97.
[30] A. Oliver, M. Vázquez, J. Moré, Linguoc lexterm: una herramienta de extracción automática de terminología gratuita, Translation Journal (2007).
[31] J. Rocheteau, B. Daille, Ttc termsuite: A uima application for multilingual terminology extraction from comparable corpora, in: 5th International Joint Conference on Natural Language Processing (IJCNLP), 2011, pp. 9–12.
[32] J. Vivaldi, H. Rodríguez, Improving term extraction by combining different techniques, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 7 (2001) 31–48.
[33] R. Mihalcea, P. Tarau, Textrank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.
[34] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, A. Jatowt, Yake! keyword extraction from single documents using multiple local features, Information Sciences 509 (2020) 257–289. doi:10.1016/j.ins.2019.09.013.
[35] H. H. Alrehamy, C. Walker, Semcluster: unsupervised automatic keyphrase extraction using affinity propagation, in: Advances in Computational Intelligence Systems: Contributions Presented at the 17th UK Workshop on Computational Intelligence, September 6-8, 2017, Cardiff, UK, Springer, 2018, pp. 222–235.
[36] A. Bougouin, F. Boudin, B. Daille, Topicrank: Graph-based topic ranking for keyphrase extraction, in: International Joint Conference on Natural Language Processing (IJCNLP), 2013, pp. 543–551.
[37] C. Florescu, C. Caragea, A position-biased pagerank algorithm for keyphrase extraction, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[38] B. Wang, S. Yu, W. Lou, Y. T. Hou, Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud, in: IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, IEEE, 2014, pp. 2112–2120.
[39] D. Mahata, J. Kuriakose, R. Shah, R. Zimmermann, Key2vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018, pp. 634–639.
[40] D. Gromann, L. Wachowiak, C. Lang, B. Heinisch, Multilingual extraction of terminological concept systems, Deep Learning and Neural Approaches for Linguistic Data (2021) 5.
[41] Y. Sun, H. Qiu, Y. Zheng, Z. Wang, C. Zhang, Sifrank: A new baseline for unsupervised keyphrase extraction based on pre-trained language model, IEEE Access 8 (2020) 10896–10906. doi:10.1109/ACCESS.2020.2965087.
[42] C. Lang, L. Wachowiak, B. Heinisch, D. Gromann, Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 3607–3620.
[43] A. Ghafoor, A. S. Imran, S. M. Daudpota, Z. Kastrati, R. Batra, M. A. Wani, et al., The impact of translating resource-rich datasets to low-resource languages through multi-lingual text processing, IEEE Access 9 (2021) 124478–124490.
[44] L. Bonifacio, V. Jeronymo, H. Q. Abonizio, I. Campiotti, M. Fadaee, R. Lotufo, R. Nogueira, mmarco: A multilingual version of the ms marco passage ranking dataset, arXiv preprint arXiv:2108.13897 (2021).
[45] M. Araújo, A. Pereira, F. Benevenuto, A comparative study of machine translation for multilingual sentence-level sentiment analysis, Information Sciences 512 (2020) 1078–1102.
[46] C. P. Carrino, M. R. Costa-Jussà, J. A. Fonollosa, Automatic spanish translation of the squad dataset for multilingual question answering, arXiv preprint arXiv:1912.05200 (2019).
[47] G. M. Rosa, L. H. Bonifacio, L. R. de Souza, R. Lotufo, R. Nogueira, A cost-benefit analysis of cross-lingual transfer methods, arXiv preprint arXiv:2105.06813 (2021).
[48] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[49] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, A. Gonzalez-Agirre, C. Armentano-Oller, C. Rodriguez-Penagos, M. Villegas, Maria: Spanish language models, arXiv preprint arXiv:2107.07253 (2021).
[50] C. Toraman, E. H. Yilmaz, F. Şahinuç, O. Ozcelik, Impact of tokenization on language models: An analysis for turkish, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22 (2023). URL: https://doi.org/10.1145/3578707. doi:10.1145/3578707.

A. Term Translation Prompt

You are a scientific translator of English to Spanish specialized in terminology. I give you one sentence in English and the same sentence translated to Spanish. The English sentence has a term between two marks. Identify in the Spanish sentence which words correspond to the same original term. The output term is in Spanish. Some examples:

English sentence: "The University of Florida, in partnership with Motorola, has held two mobile computing design competitions". Spanish sentence: "La Universidad de Florida, en asociación con Motorola, ha celebrado dos concursos de diseño de computación móvil". Output: computación móvil

English sentence: "There, we assume that coefficients of non-renormalizable terms are suppressed enough to be neglected". Spanish sentence: "Aquí, asumimos que los coeficientes de los términos no renormalizables están suficientemente suprimidos como para ser ignorados". Output: coeficientes de los términos no renormalizables

English sentence: "It often exploits an optical diffusion model-based image reconstruction algorithm to estimate spatial property values from measurements of the light flux at the surface of the tissue." Spanish sentence: "A menudo se utiliza un algoritmo de reconstrucción de imágenes basado en un modelo de difusión óptica para estimar los valores de propiedades espaciales a partir de medidas de la flujo de luz en la superficie del tejido." Output: algoritmo de reconstrucción de imágenes basado en un modelo de difusión óptica

English sentence: "A second group of experiments is aimed at extensions of the baseline methods that exploit characteristic features of the UvT Expert Collection; specifically, we propose and evaluate refined expert finding and profiling methods that incorporate topicality and organizational structure." Spanish sentence: "Un segundo grupo de experimentos está dirigido a extensiones de los métodos base que aprovechan las características distintivas de la Colección de Expertos de UvT; específicamente, proponemos y evaluamos métodos refinados de búsqueda y perfilado de expertos que incorporan la topicalidad y la estructura organizativa." Output: topicalidad y la estructura organizativa
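A few-shot prompt of this kind can be assembled programmatically before being sent to the translation model. The sketch below is illustrative only: the characters used to mark the term in the original prompt are not preserved in this version of the text, so the <term>/</term> marks, the placement of the marks in the example, and the build_prompt helper are hypothetical, and only the first example is included.

```python
# Illustrative reconstruction of the Appendix A few-shot prompt.
# NOTE: "<term>"/"</term>" are placeholder marks (the original mark characters
# are not preserved here), and the marked span in the example is inferred.

INSTRUCTION = (
    "You are a scientific translator of English to Spanish specialized in "
    "terminology. I give you one sentence in English and the same sentence "
    "translated to Spanish. The English sentence has a term between the marks "
    "<term> and </term>. Identify in the Spanish sentence which words "
    "correspond to the same original term. The output term is in Spanish. "
    "Some examples:"
)

# (English sentence with marked term, Spanish sentence, expected output term)
EXAMPLES = [
    (
        'The University of Florida, in partnership with Motorola, has held two '
        '<term>mobile computing</term> design competitions',
        'La Universidad de Florida, en asociación con Motorola, ha celebrado '
        'dos concursos de diseño de computación móvil',
        'computación móvil',
    ),
]


def build_prompt(en_sentence, es_sentence):
    """Concatenate the instruction, the few-shot examples and the query pair."""
    parts = [INSTRUCTION]
    for en, es, out in EXAMPLES:
        parts.append(f'English sentence: "{en}" Spanish sentence: "{es}" '
                     f'Output: {out}')
    # The query ends with "Output:" so the model completes the Spanish term.
    parts.append(f'English sentence: "{en_sentence}" '
                 f'Spanish sentence: "{es_sentence}" Output:')
    return "\n".join(parts)
```

The resulting string would then be passed to the chosen language model once per annotated sentence pair, and the completion taken as the Spanish keyword.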