<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Translation Journal (2007).
ing Keyphrases and Relations from Scientific Publi</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/S17-2091</article-id>
      <title-group>
        <article-title>Benchmark for Automatic Keyword Extraction in Spanish: Datasets and Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pablo Calleja</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patricia Martín-Chozas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Montiel-Ponsoda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>3</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Tasks such as document indexing or information retrieval still seem to rely heavily on keywords, even in the era of LLMs. However, there is still a need for work on automatic keyword extraction and for training sets in languages other than English. To the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation purposes. Additionally, innovative keyword extraction methods that rely on language models are not being adapted to language models in other languages. To alleviate this situation, this work proposes a method to translate into Spanish two of the main gold standard datasets used by the community, while preserving semantics and terms. Then, the main state-of-the-art methods are evaluated against the new translated datasets. The methods used for the evaluation have been configured or re-implemented for Spanish.</p>
      </abstract>
      <kwd-group>
        <kwd>Spanish Automatic Keyword Extraction</kwd>
        <kwd>Spanish language</kwd>
        <kwd>SemEval2017</kwd>
        <kwd>SemEval2010</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, a method to translate two of the most important corpora for AKE is proposed and applied to their translation into Spanish. The main aim of this work is to create a ‘silver standard’ to support the training and evaluation of automatic keyword extraction in Spanish. The translation process has been performed to preserve the semantics and terminological representation of the original texts and the annotations. The translation is supported by the Google Translate service and by ChatGPT 3.5.</p>
      <p>Additionally, a benchmark has been generated with five of the most relevant methods in the current state of the art on the two translated corpora. The methods have been configured for Spanish, and two of them have been re-implemented to use Spanish language models.</p>
      <p>The rest of the paper is structured as follows: Section 2 provides a summary of the state of the art in Automatic Keyword Extraction. Section 3 is devoted to the method for the translation of the corpora. Section 4 describes the different AKE methods with their configurations or adaptations for the Spanish language, and Section 5 presents the results of the evaluation benchmark. Finally, Section 6 highlights the conclusions and recommendations for future work. Both experiments and results are reported in an anonymised GitHub repository<sup>1</sup>.</p>
    </sec>
    <sec>
      <title>2. State of the art</title>
      <p>As stated by [1], ‘keywords’ and ‘keyphrases’ do not refer to any theory. An element is considered a ‘key’ element within a document when it is an important descriptor of the document content. The use of ‘word’ versus ‘phrase’ refers to the number of textual units, which can be one (1-gram) or several (n-grams). Since such keywords or keyphrases mostly correspond to terms, defined as words that are specific to a domain, the AKE task is closely related to the so-called Automatic Terminology Extraction/Retrieval (ATE/ATR) task, i.e., the task of identifying relevant terms in a corpus [20].</p>
      <p>Lossio-Ventura et al. [21] described in their work that there are some fundamental differences between term extraction and keyword extraction tasks. One major difference is that extracting terms requires a large collection of texts, which is not a necessary requirement in keyword extraction, which can take only a single document as input. Also, ATE methods aim to extract term-like units and remove those that may not be terms, syntactically or terminologically. On the other hand, AKE methods extract the ‘key’ elements of a document, which are not limited to terms. Thus, while AKE methods can be domain independent, ATE methods apply to specific fields or professional domains, since their main goal is to build resources that contain the lexical units that are representative of a domain.</p>
      <p>Although these two tasks have been conceived for different purposes, the truth is that, when performed automatically, they obtain similar results and performance, as both rely on linguistic and textual features (at sentence, paragraph or document levels). Thus, several state-of-the-art methods have been used for both tasks.</p>
      <p>In this section, we will review the most relevant works in this area, making a distinction between traditional approaches (linguistic and statistical) and machine learning and neural approaches.</p>
      <sec>
        <title>2.1. Traditional approaches</title>
        <p>The algorithms considered in this section are usually based on linguistic patterns, relying on parsing and part-of-speech tagging processes to identify terms [22]. These patterns were very prolific in the 1990s, with systems such as LEXTER [23]. This kind of approach [24] has persisted until today, as patterns are the main starting point to automatically identify keywords or terms in documents and corpora. More advanced works based on patterns went further to identify the concept evoked by term variants in several languages, such as the work by [25] for English and French. In any case, the majority of these works are language dependent.</p>
        <p>Later on, researchers started to combine various types of linguistic techniques, such as pattern-based techniques, regular expressions, stop word lists, and post-processing algorithms, to mention but a few. In this context, tools such as TermExtractor emerged, a system that combines several of the previously mentioned techniques and applies post-processing filters like domain pertinence, lexical cohesion or structural relevance [26].</p>
        <p>More advanced works in the literature started to use statistical approaches in combination with linguistic functionalities, which appeared to improve the results. The process behind statistical approaches generally consists of weighting the frequency of occurrence of a combination of words (n-grams) in a text. Normally, statistical algorithms are divided into two types: 1) those based on unithood, which measures the strength of unity of complex units (such as χ², T-score and z-score), and 2) those based on termhood, which measures the degree of representation of domain-specific concepts, such as C-Value or co-occurrence [27, 28]. Some of these purely statistical term extractors are INDEX for English [29], Lexterm [30] for Spanish, and RAKE [5] for keyword extraction in English.</p>
        <p>In contrast, it is most common to find mixed approaches, such as TerMine, a term extractor that combines C-Value with linguistic information [4], or TermSuite, which applies distributional and compositional methods [31]. In [32], the authors combine linguistic processes such as segmentation, PoS tagging and morphological analysis with semantic knowledge extracted from external resources and statistical techniques. Other works, such as TextRank [33], create a graph from the text to extract keywords based on statistical metrics.</p>
      </sec>
      <sec>
        <title>2.2. Machine Learning and Neural approaches</title>
        <p>These approaches exploit different features (linguistic or not) to identify keywords. For instance, Rose et al. [5] identified keywords based on word frequency, the number of co-occurring neighbors, and the ratio between the co-occurrence and the frequency. Campos et al. [34] proposed YAKE, which calculates the importance of each candidate using frequency, offsets, and co-occurrence. The SemCluster method [35] first clusters the candidates based on semantic similarity and then selects the centroids as keywords. TopicRank [36] first assigns a score to each topic by clustering candidate keywords. The topics are scored using the TextRank ranking model, and keywords are extracted using the most representative candidate from the top-ranked topics. Florescu et al. [37] proposed PositionRank, which uses the position of word occurrences to improve TextRank on a document.</p>
        <p>Word embeddings have also been widely used. Wang et al. [38] made use of pre-trained word embeddings and the frequency of each word to generate weighted edges between words in a document. A weighted PageRank algorithm was used to compute the final scores of words. Also, Key2Vec [39] used a similar approach, using phrase embeddings for representing the candidates and ranking the importance of the phrases by calculating the semantic similarity and co-occurrences of the phrases.</p>
        <p>Currently, new approaches based on pre-trained neural language models have appeared in the literature. For instance, Text2TCS<sup>2</sup> [40] is able to extract terms and relations from raw text, creating taxonomies automatically. [41] proposed SIFRank, the integration of a statistical model and a pre-trained language model, to calculate the relevance between candidates and document topics. Other works are focused on the extraction of multilingual terminology across domains using transformers [42].</p>
        <p>Two of the most recent works in the field of AKE using language models are AttentionRank and MDERank. AttentionRank [13] integrates self-attention weights extracted from a pre-trained language model with the calculated cross-attention relevancy value to identify keywords that are important to the local sentence context and also have strong relevancy to all sentences within the whole document. MDERank [12] bases the identification of keywords on the embedding representation of the sentence using masked tokens. Moreover, their work proposes a new type of BERT architecture to be trained as a language model, but for the purpose of keyword identification.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset generation</title>
      <p>In the era of machine learning approaches, datasets are an essential requirement to train and, what is more important, evaluate algorithms for different NLP tasks. For instance, in the field of Automatic Keyword Extraction, there are well-known gold standard datasets that are commonly used to evaluate approaches within the literature, such as SemEval2010 Task 5 [17] and SemEval2017 Task 10 [18]. However, the availability of such datasets is limited for languages other than English [43]. Consequently, a common approach to overcome this limitation is to translate the available datasets into the target language [44, 45], including Spanish [46].</p>
      <p>To the best of our knowledge, there is no consolidated dataset in Spanish for Automatic Keyword Extraction. Therefore, the first contribution of this work is the development of an evaluation corpus for keyword extraction in Spanish, which results from translating two of the most common English AKE datasets: SemEval2010 and SemEval2017. The target of this contribution is to generate a ‘silver standard’ labelled dataset, to provide researchers in the field with a consolidated framework to test and evaluate their approaches.</p>
      <p>However, the translation process for labelled datasets is not a straightforward task. As [47] demonstrated in their work, labelled datasets have their labels linked to one token or a span of tokens. Since the sentence structure can vary in different languages, it is very challenging to retain the same annotation structure after the translation process. To overcome such difficulties, we have organised the translation process into two phases: Phase 1) Source Dataset Analysis and Source Dataset Preprocessing, described in Section 3.1, and Phase 2) Source Dataset Translation and Target Dataset Postprocessing, described in Section 3.2.</p>
      <p>Figure 1 summarises the method for the translation process in which, given the two original datasets, a set of four datasets translated into Spanish is obtained, using two different translation systems.</p>
      <sec id="sec-2-1">
        <title>3.1. Phase 1: Dataset analysis and preprocessing</title>
      </sec>
      <sec id="sec-2-2">
        <title>In order to generate the proposed silver standard for</title>
        <p>Spanish AKE, we have selected the two previously
mentioned datasets, as they are widely used in experiments
of this kind: SemEval2010 Task 5 [17] and SemEval2017
2https://live.european-language-grid.eu/catalogue/toolservice/8122
1
e
s
a
h
P
2
e
s
a
h
P</p>
        <p>SemEval2017
Term Annotation
with quotes</p>
        <p>service
Google
Translator</p>
        <p>Term Annotation
with HTML tag
few-shot
prompt</p>
        <p>ChatGPT 3.5</p>
        <sec id="sec-2-2-1">
          <title>Spa_SemEval2010GT</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Spa_SemEval2010GPT</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>Spa_SemEval2017GT</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Spa_SemEval2017GPT</title>
          <p>Manual revision
some of the keywords come from the ones manually
provided by the authors of the papers themselves, and
they may not have an exact correspondence in the text.</p>
          <p>Regarding the preprocessing of the datasets, there are
two main aspects involved in the translation process.
The first one refers to the original text. Not many issues
were found during the translation of SemEval2017
corpus, since it had a manageable size and a clean structure.
However, the original texts of SemEval2010 were
arbitrarily segmented, very long, and contained references and
formulas, which posed many problems for the automatic
translator when processing them.</p>
          <p>The second aspect refers to the keywords. For the
translation of the keywords, we did not simply
translate the list of keywords out of context, but decided to
mark them in the texts with annotations marks
(quotation marks or the HTML tag &lt;br&gt;, depending on the
translation system). Then, we translated the texts and
retrieved the translated terms contained within the
annotation marks.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Task 10 [18]. Both datasets are published following the</title>
        <p>same structure, a set of documents containing the raw
text (named docsutf8) and a set of documents containing
the extracted keywords (named keys). Both types of
documents present the same identifiers to match keywords
with source documents.</p>
        <p>Despite their similar structure, they present several 3.2. Phase 2: Dataset translation and
diferences. As shown in Table 1, the main diference postprocessing
lies in their size. With a smaller number of documents,
SemEval2010 far exceeds SemEval2017 in the total num- Most of the existing approaches to create silver standards
ber of tokens, which means that it contains fewer docu- from existing gold standards by leveraging machine
transments, but of a much larger size. SemEval2017 contains lation rely on at least two translation sources: one from
shorter documents with an average of 6 to 7 sentences, a common online translator such as DeepL3 or Google
whereas SemEval2010 contains full scientific papers with Translate4, and the other using a Neural Machine
Transhundreds of sentences. It is interesting to note that, al- lation model, as suggested in [44]. As already announced,
though SemEval2010 is bigger in number of documents in this work we have used Google Translate and ChatGPT
and number of tokens, SemEval2017 has a bigger number 3.5 Turbo5 APIs.
of extracted keywords. This means that the keywords The keywords from the texts that were translated with
from SemEval2010 have greater representation and num- Google Translate were annotated with quotation marks.
ber of occurrences than the keywords from 2017. These However, on some occasions the system retrieved errors
diferences in size are important because they require a in which the annotation marks were missing or misplaced
diferent treatment of the documents during the prepro- in the translated sentence, and either it was not possible
cessing and the translation stage. to extract the translated term from the annotated
sen</p>
        <p>In both datasets, over 50% of the keywords are unigram tence or the extracted term was not correct. To avoid
or bigram. However, in SemEval2010 we observe that 555
keywords are not present in the documents with a similar 3https://www.deepl.com/es/translator
span text. The reason for this is to be found in the way in 4https://translate.google.es/
which the original dataset was created. In SemEval2010, 5https://platform.openai.com/docs/models/gpt-3-5-turbo
that, we decided to append the original term to each anno- model of spaCy has to be downloaded before the methods
tated sentence, to force the system to take that term into can be run.
account and provide a translation. For instance, in the For the RAKE method, the original library cannot be
translation of the sentence ‘...has held two "mobile com- used as it is only oriented to the English language.
Howputing" design competitions’ focused on the term ‘mobile ever, there is a version named Multi-rake8 which covers
computing’ the translation lost the quotation marks: ‘ha diferent languages. As the method is statistical, to
percelebrado dos concursos de diseño de computación móvil’. form multilingually, the addition of stopword lists from
Thus, we add the term repeated to obtain the translation the diferent target languages is necessary.
of the term: ‘...has held two "mobile computing" design
competitions. Mobile computing’. 4.2. Attention Rank</p>
        <p>With ChatGPT, the tag &lt;br&gt;was used to mark the
keywords before and after. The prompt sent to the generative The implementation of the original authors9 had to be
model described the purpose of the model (i.e., ’You are reimplemented from scratch. The original repository
a Spanish translator specialised in terminology’), and does not have libraries and version specifications.
Morethen some examples of annotations in English and its over, the original code relies on libraries for language
translations in Spanish with the annotated and trans- models that are not maintained as well as the noun
lated keywords were provided. This is called few-shot phrases identification component, which relies on the
prompting. The full prompt is presented in Annex A. part-of-speech annotation of Stanford CoreNLP and a</p>
        <p>Regarding the postprocessing stage, several actions third-party library. Reproducibility was not possible in
were performed. First, we extracted all the annotated this work.
occurrences of each keyword in the sentence, creating a A new repository10 has been created for the
implemenlist of translation candidates per keyword. In some cases, tation of the Attention rank method. This repository uses
reconciliation between candidates was necessary to pro- HuggingFace’s library transformer to manage language
vide a single translation for each keyword. In the case models and spaCy to identify noun phrases. The
reposithat no disparities between the candidates were found, tory details the specific libraries and versions needed and
the translated keyword was automatically assigned. In the external modules needed. The new repository allows
case of disparities, terms were manually reviewed and the use of BERT (as in the original work) and RoBERTa
a translated keyword was manually assigned. In total, architecture models in diferent languages.
we manually reviewed an average number of 2000 key- The adaptation for RoBERTa models had to deal with
words per dataset (220 documents in SemEval2010 and two specific issues regarding the tokeniser. The first one
360 documents in SemEval2017). is the use of diferent special tokens to delimit sentences
at the beginning and at the end to focus the attention
mechanisms, as BERT uses ’[CLS]’ and ’[SEP]’ tokens,
4. AKE Adaptation to Spanish RoBERTa uses ’&lt;s&gt;’ and ’&lt;/s&gt;’. The second issue is the
generated tokens, as BERT uses a WordPiece tokeniser
In this section, the diferent AKE methods used for the ex- in which subwords are marked with the ’##’ tag (e.g.,
periments and their implementation are presented. Some the word thicknesses is divided into tokens thickness and
of them have already been implemented and maintained ##es). In contrast, RoBERTa models use Byte-level Pair
by well-known Python libraries and contain adapters to Encoding (BPE) and classifies diferent tokens for char
work with other languages. Two of them, those that are sequences that start a word or that are inside. The tokens
based on language models, had to be re-implemented and that start a word include the white space before the word,
adapted. In addition to diferent technical aspects, both and they are marked whith the special character ’ Ġ’. For
methods use the original BERT model [48] for English, instance, the word extrapolate is divided into two tokens:
and the RoBERTa MarIA model [49] for Spanish. ’ Ġextrap’ and ’olate’.</p>
        <p>Beyond the diferences studied in previous works on
4.1. Already implemented methods the benefits or diferences between both types of
tokenisers [50], this work had to develop the alignment process
between the words of keywords and their
corresponding tokens. With WordPiece is easier to find tokens and
recompose the original word, but BPE is sensible to
appearance of the white space before the token. If it does
not appear, the token is diferent and its attention value</p>
      </sec>
      <sec id="sec-2-4">
        <title>The methods used for the evaluation are TopicRank,</title>
        <p>YAKE and RAKE. The Python library PKE6 has been used
for the execution of the TopicRank and YAKE methods.</p>
        <p>PKE uses the Python library spaCy7, as many other
methods, to identify candidate chunks or nominal phrases that
can be relevant terms or keywords. Thus, the Spanish
6https://github.com/boudinfl/pke
7https://spacy.io/
8https://github.com/vgrabovets/multi_rake
9https://github.com/hd10-iupui/AttentionRank
10https://github.com/oeg-upm/AttentionRankLib
changes. This issue has been solved by ensuring that the The results of the AKE algorithms on the Spanish
input sentences always have a white space before a word. datasets, both multilingual and adapted for Spanish, show
a lower performance compared to the original datasets.
4.3. MDERank However, they are in line with the results for English.
Unlike many other NLP experiments, where a good result
The original implementation11 contains a better descrip- is represented by metrics starting at 0.6 or 0.7 of f1 score,
tion of the requirements. However, it is described for the highest metrics achieved by the algorithms tested in
Python 3.7 which is no longer supported by the commu- SemEval2010 and 2017 do not exceed 0.3821 (BR17 and
nity and most of the versions of the required libraries are K= 15).
deprecated. Also, parts of the execution code are wrong We already expected lower values, as the translation
such as the command line execution or the arguments, process is not perfect and it is not always possible to
and there is no code related to the KPEBERT model, a maintain the correlation of one keyword in English to
model which is trained and used for keyword identifi- the same keyword in Spanish. Apart from the errors
cation. Only it is possible to execute it with traditional detected (explained in Section 5.2), GPT3 showed better
BERT models. performance in maintaining the structure and
terminol</p>
        <p>To update the code and method, a new repository has ogy of the translated document.
been created12. In which the requirements, code and exe- It is also important to mention the diferent results
cution process have improved. As AttentionRank, MDER- obtained for each dataset. For Spa SemEval2017GT and
ank used Stanford CoreNLP for the identification of noun Spa SemEval2017GPT the best results, in terms of
precifragments and it has been updated to spaCy. Finally, the sion, recall and f1-score, are obtained by the two methods
method can now support RoBERTa models by taking into that are based on language models: AttentionRank and
account the problems mentioned in AttentionRank. MDERank. Although the original dataset contains
complex keywords, the language models perform well as in
5. Evaluation the English dataset.</p>
        <p>Surprisingly, for Spa SemEval2010GT and Spa
SemEval2010GPT the best results are obtained by YAKE.</p>
        <p>The nature of the documents in SemEval2010, which are
full papers without any cleaning, including formulas,
references and citations, makes it dificult for a language
model to perform well. An added issue is the large
length of the documents, which in the case of RAKE
produces results close to zero.</p>
      </sec>
      <sec id="sec-2-5">
        <title>This section discusses the evaluation results obtained</title>
        <p>from the execution of the five AKE methods on
the four translated datasets (Spa_SemEval2010GT,
Spa_SemEval2010GPT, Spa_SemEval2017GT and
Spa_SemEval2017GPT). The metrics used in the
evaluation are precision, recall and f1-measure. Following
previous works in the literature, the methods are
evaluated with the three metrics at the top K of the
keywords extracted in each method. K equals 5, 10, and 5.2. Error Analysis and Discussion
15. Finally, we perform an error analysis and present a After a thorough analysis of the results, we conclude
discussion around it. Table 2 shows the results obtained. that, beyond some translation errors, the main reason
behind the low numbers seems to be the poor quality of
5.1. Results some keywords in the original datasets. Although both
datasets are claimed to have been either generated or
reviewed by humans, we have detected a great number
of anomalies that may be the main source of errors, as
we try to illustrate below:
• Duplicated structures: We find similar structures
with small variations which produce noise and
inconsistencies, such as terms with determiners
(i.e. metal and the metal), terms with symbols or
special characters (i.e. logical inference and
“logical inference"), and terms with diferent spellings
(i.e reputation mechanism and Reputation
mechanism).
• Misspelled structures: We found several examples
of misspelled structures, and, specifically, missing</p>
        <p>11https://github.com/LinhanZ/mderank
12https://github.com/oeg-upm/mderanklib
letters both at the beginning and at the end of the
structure (i.e. aked instead of baked).</p>
        <p>netic can be found translated into two diferent keywords
throughout the text, as ferromagnética and
ferromagnéticos. However, with the aim to be faithful to the original
evaluation datasets, we decided to choose one of the
translations and discard the alternatives, although we
believe that the datasets would benefit from including
such variation.
• Non-terminological structures: This is the most
common anomaly in both datasets, and one of the
main causes for the low performance of the
algorithms, both in English and in Spanish. Examples
of such non-terminological structures are: full
sentences (i.e. dynamics which clearly reveal the
origins of the roaming), sentence fragments (i.e. 6. Conclusions
loading force and penetration depth were recorded
and their respective values were correlated with This work has analysed the current state-of-the-art of
authe observed), concatenated structures (i.e.1. well tomatic keyword extraction and, in particular, the
Spandefined phase space dividing surfaces attached to , ish landscape. In this analysis, we have identified the
i.e.2. austenitic or austenitic &amp; ferritic stainless lack of an evaluation framework (including datasets and
steel), or even text fragments with references (i.e.1. ready-to-test algorithms) for AKE in Spanish.
Consecomparison between the realistic calculations for quently, this paper proposes two contributions. First, the
positive parity [12] and negative parity [14], based generation of a silver standard for the Spanish language
on the same quark model [15], i.e.2. calculation by community by the translation of two English datasets
Martinez-Pinedo et al.). widely used to evaluate AKE approaches: SemEval2010
and SemEval2017. Second, the configuration of a set of</p>
        <p>Additionally to inaccuracies and anomalies mentioned state-of-the-art algorithms in an easily executable
manbefore, in the results we observe that in some instances ner to facilitate the evaluation task, including the
adaptathe same keyword has been translated diferently into tion of two current methods that rely on language models:
Spanish in diferent parts of the text. For example, the Attention Rank and MDERank.
term deployment has been translated both as despliegue With the benchmark in place, we have performed an
and implementación within the same text; or the com- evaluation of the implemented algorithms and the
transpound term information aggregation can be found trans- lated datasets. To be consistent with the evaluations in
lated as agregación de información and agregación de la English, the translated datasets maintain the original
ininformación. In itself, this would not be a problem be- ner structure. The results in Spanish suggest the same
cause these are correct translations in Spanish. Moreover, tendency as in English, although they are lower. The
even in specialised domains, term variants are commonly error analysis shows that low results are due to several
used to designate the same concept. factors: 1) the quality of the original datasets, as they</p>
        <p>A similar issue occurs when Spanish terms vary in contain noisy texts, non-terminological structures, and
gender and number. For instance, the keyword
ferromag</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This work has been partially founded by INESDATA</title>
        <p>(https://inesdata-project.eu/) project, funded by the
Spanish Ministry of Digital Transformation and Public Afairs
and NextGenerationEU, in the framework of the UNICO
I+D CLOUD Program - Real Decreto 959/2022.
terms that are not contained in the texts, 2) the
quality of the translations for the labelled datasets, as both
systems present translation inconsistencies and have
dififculties to keep track of the translated keyword in the
text, 3) the fact that a 1 to 1 translation of keywords is
not always possible nor desirable, and that it would be
recommendable to include term variants.</p>
        <p>In light of the results and taking these remarks into
account, we conclude that maintaining the dataset
structure in English to evaluate AKE tasks in Spanish might
not be the most appropriate approach. For this
reason, as part of future work we are considering two
approaches for generating evaluation datasets in Spanish:
1) automatically postprocessing existing datasets, such
as the two dealt with in this work, to eliminate all
non-terminological structures and produce a list of candidate
terms instead of just one in the translation process, and
2) semi-automatically generating a dataset with similar
characteristics to the ones mentioned, but based on texts
originally written in Spanish.</p>
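        <p>The first post-processing direction can be illustrated with a minimal Python sketch. It assumes that each original keyword comes with a list of candidate translations, keeps the candidates that literally occur in the translated text, and reports the keywords for which no candidate is found. The function names normalize and filter_candidates, the normalisation steps and the example sentence are hypothetical choices made for illustration; they are not part of the benchmark described in this work.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata


def normalize(text):
    """Apply NFC, case-fold and collapse whitespace so surface forms compare reliably."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def filter_candidates(text, candidates):
    """Separate keywords with a candidate present in the text from those without.

    `candidates` maps each original keyword to a list of candidate translations
    (an assumed input format). Returns a pair (found, missing).
    """
    haystack = normalize(text)
    found, missing = {}, []
    for keyword, variants in candidates.items():
        present = [
            variant for variant in variants
            # Word boundaries keep "red" from matching inside "redes".
            if re.search(r"\b" + re.escape(normalize(variant)) + r"\b", haystack)
        ]
        if present:
            found[keyword] = present
        else:
            missing.append(keyword)
    return found, missing


# Illustrative input only: the sentence below is invented for this sketch.
text = ("El despliegue del sistema permite la agregación de la información. "
        "La implementación fue sencilla.")
candidates = {
    "deployment": ["despliegue", "implementación"],
    "information aggregation": ["agregación de información",
                                "agregación de la información"],
    "mobile computing": ["computación móvil"],
}
found, missing = filter_candidates(text, candidates)
print(found)
print(missing)
```
]]></preformat>
        <p>In this invented example, deployment keeps both of its variants, information aggregation keeps only the variant that includes the article, and mobile computing is reported as missing. Exact matching of this kind does not handle gender or number variation; that would need lemmatisation or a similar step, which is left out of the sketch.</p>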
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Term Translation Prompt</title>
      <sec id="sec-4-1">
        <title>You are a scientific translator of English to Spanish</title>
        <p>specialized in terminology. I give you one sentence in
English and the same sentence translated to Spanish. The
English sentence has a term between the marks &lt;br&gt; and
&lt;/br&gt;. Identify in the Spanish sentence which words
correspond to the same original term. The output term is in
Spanish. Some examples:</p>
        <p>English sentence: "The University of Florida, in partnership with Motorola, has held two &lt;br&gt;mobile comput- [...] Output: computación móvil</p>
        <p>English sentence: "There, we assume that &lt;br&gt;coefficients of non-renormalizable terms&lt;/br&gt; are suppressed enough to be neglected". Span- [...] cientes de los términos no renormalizables</p>
        <p>English sentence: "It often exploits an &lt;br&gt;optical dif- [...] of the light flux at the surface of the tissue." Spanish sentence: "A menudo se utiliza un algoritmo de reconstrucción de imágenes basado en un modelo de difusión [...] tejido." Output: algoritmo de reconstrucción de imágenes basado en un modelo de difusión óptica</p>
        <p>[...] teristic features of the UvT Expert Collection; specifically, we propose and evaluate refined expert finding and pro- [...] nizational structure&lt;/br&gt;." Spanish: "Un segundo grupo de experimentos está dirigido a extensiones de los métodos base que aprovechan las características distintivas de [...] perfilado de expertos que incorporan la topicalidad y la estructura organizativa." output: topicalidad y la estruc-</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>