1. Introduction

Translation Journal (2007). ing Keyphrases and Relations from Scientific Publi

10.18653/v1/S17-2091

Benchmark for Automatic Keyword Extraction in Spanish: Datasets and Methods

Pablo Calleja

Patricia Martín-Chozas

Elena Montiel-Ponsoda

0 0 Ontology Engineering Group, Universidad Politécnica de Madrid

2017

3 9 12

Tasks such as document indexing or information retrieval still seem to heavily rely on keywords, even in the LLMs era. However, there is still a need for automatic keyword extraction works and training sets in languages other than English. To the best of our knowledge, no datasets for keyword extraction in Spanish are publicly available for training or evaluation purposes. Additionally, those innovative keyword extraction methods that rely on language models are not being adapted to language models in other languages. To palliate this situation, this work proposes a method to translate into Spanish two of the main gold standard datasets used by the community, while preserving semantics and terms. Then, the main state-of-the-art methods are evaluated against the new translated datasets. The methods used for the evaluation have been configured or re-implemented for Spanish.

eol>Spanish Automatic Keyword Extraction Spanish language SemEval2017 SemEval2010

1. Introduction

In this paper, a method to translate two of the most resources that contain the lexical units that are represenimportant corpora for AKE is proposed and applied to tative of a domain. their translation into Spanish. The main aim of this work Although these two tasks have been conceived for is to create a ’silver standard´ to support the training and diferent purposes, the truth is that, when performed auevaluation of automatic keyword extraction in Spanish. tomatically, they obtain similar results and performance, The translation process has been performed to preserve as both rely on linguistic and textual features (at sentence, the semantics and terminological representation of the paragraph or document levels). Thus, several state-oforiginal texts and the annotations. The translation is the-art methods have been used for both tasks. supported by the Google Translate service and by Chat- In this section, we will review the most relevant works GPT3.5. in this area, making a distinction between traditional

Additionally, a benchmark has been generated with approaches (linguistic and statistic) and machine learning ifve of the most relevant methods in the current state-of- and neural approaches. the-art on the two translated corpus. The methods have been configured for Spanish, and two of them have been 2.1. Traditional approaches re-implemented to use Spanish language models.

The rest of the paper is structured as follows: In sec- The algorithms considered in this section are usually tion 2 we provide a summary of the state-of-the-art in based on linguistic patterns, relying on parsing and partAutomatic Keyword Extraction. Section 3 is devoted to speech tagging processes to identify terms [22]. These the method for the translation of the corpora. Section 4 patterns were very prolific in the 1990s, with systems describes the diferent AKE methods with their config- such as LEXTER [23]. This kind of approaches [24] has urations or adaptations for the Spanish language, and persisted until today, as patterns are the main starting section 5 presents the results of the evaluation bench- point to automatically identify keywords or terms in mark. Finally, section 6 highlights the conclusions and documents and corpora. More advanced works based on recommendations for future work. Both experiments patterns went further to identify the concept evoked by and results are reported in an anonymised GitHub repos- term variants in several languages, as the work by [25] itory1. for English and French. In any case, the majority of these works are language dependent.

Later on, researchers started to combine various types 2. State of the art of linguistic techniques, such as pattern-based techniques, regular expressions, stop word lists, and postAs stated by [1], ‘keywords’ and ‘keyphrases’ do not re- processing algorithms, to mention but a few. In this fer to any theory. An element is considered as a ‘key’ context, tools such as TermExtractor emerge, a system element within a document, when it is an important de- that combines several of the previously mentioned techscriptor of the document content. The use of ‘word’ ver- niques and applies post-processing filters like domain sus ‘phrase’ refers to the number of textual units, which pertinence, lexical cohesion or structural relevance [26]. can be one (1-gram) or several (n-grams). Since such More advanced works in the literature started to use keywords or keyphrases mostly correspond to terms, de- statistical approaches in combination with linguistic funcifned as words that are specific to a domain, the AKE task tionalities, which appeared to improve the results. The is closely related to the so-called Automatic Terminol- process behind statistical approaches generally consists ogy Extraction/Retrieval (ATE/ATR) task, i.e., the task of of weighting the frequency of occurrence of a combinaidentifying relevant terms in a corpus [20]. tion of words (n-grams) in a text. Normally, statistical

Lossio-Ventura et al. [21] described in their work that algorithms are divided into two types: 1) those based there are some fundamental diferences between term on the unithood that measures the strength of unity of extraction and keyword extraction tasks. One major dif- complex units (such as X2, T-score and z-score), and 2) ference is that extracting terms requires a large collection those based on the termhood that measures the degree of texts, which is not a necessary requirement in keyword of representation of domain-specific concepts, such as extraction, which can take only a single document as in- C-Value or co-occurrence [27, 28]. Some of these purely put. Also, ATE methods aim to extract term-like units statistical term extractors are INDEX for English [29], and remove those that may not be terms, syntactically Lexterm [30] for Spanish, and RAKE [5], for keyword or terminologically. On the other hand, AKE methods extraction in English. extract the ‘key’ elements of a document, which are not In contrast, it is most common to find mixed aplimited to terms. Thus, while AKE methods can be do- proaches, such as TerMine, a term extractor that main independent, ATE methods apply to specific fields combines C-Value with linguistic information [4], or or professional domains, since their main goal is to build TermSuite, which applies distributional and compositional methods [31]. In [32], authors combine linguistic processes such as segmentation, PoS tagging and mor- cation of keywords on the embedding representation of phological analysis, with semantic knowledge extracted the sentence using masked tokens. Moreover, their work from external resources and statistical techniques. Other proposes a new type of BERT architecture to be trained works, such as TextRank [33], create a graph from the as a language model, but for the purpose of keyword text to extract keywords based on statistical metrics. identification. 2.2. Machine Learning and Neural approaches

3. Dataset generation In the era of machine learning approaches, datasets are

These approaches exploit diferent features (linguistic or an essential requirement to train and, what is more imnot) to identify keywords. For instance, Rose et al. [5] portant, evaluate algorithms for diferent NLP tasks. For identified keywords based on word frequency, the num- instance, in the field of Automatic Keyword Extraction, ber of co-occurring neighbors, and the ratio between the there are well-known gold standard datasets that are comco-occurrence and the frequency. Campos et al. [34] pro- monly used to evaluate approaches within the literature posed YAKE which calculated the importance of each can- such as the SemEval2010 Task 5 [17] and SemEval2017 didate using frequency, ofsets, and co-occurrence. Sem- Task 10 [18]. However, the availability of these data sets Cluster method [35] first clustered the candidates based is limited to languages other than English [43]. Conseon the semantic similarity in which the centroids were se- quently, a common approach to overcome this limitation lected as keywords. TopicRank [36] first assigned a score is to translate the available datasets into the target lanto each topic by candidate keywords clustering. The guage [44, 45], including Spanish [46]. topics were scored using the TextRank ranking model, To the best of our knowledge, there is no consolidated and keywords were extracted using the most represen- dataset in Spanish for Automated Keyword Extraction, tative candidate from the top-ranked topics. Florescu therefore, the first contribution of this work is the develet al. [37] proposed PositionRank to use the position of opment of an evaluation corpus for keyword extraction in word occurrences to improve TextRank on a document. Spanish which results from translating two of the most

Word embeddings have also been widely used. Wang et common English AKE datasets: SemEval2010 and Seal. [38] made use of the pre-trained word embedding and mEval2017. The target of this contribution is to generate the frequency of each word to generate weighted edges a ‘silver standard’ labelled dataset, to provide researchers between words in a document. A weighted PageRank in the field with a consolidated framework to test and algorithm was used to compute the final scores of words. evaluate their approaches.

Also, Key2Vec [39] used a similar approach using the However, the translation process for labelled datasets phrase embeddings for representing the candidates and is not a straightforward task. As [47] demonstrated in ranking the importance of the phrases by calculating the their work, labelled datasets have their labels linked to semantic similarity and co-occurrences of the phrases. one token or a span of tokens. Since the sentence struc

Currently, new approaches based on pre-trained neu- ture can vary in diferent languages, it is very challenging ral language models have appeared in the literature. For to retain the same annotation structure after the transinstance, Text2TCS2 [40], which is able to extract terms lation process. To overcome such dificulties, we have and relations from raw text, creating taxonomies auto- organised the translation process into two phases: Phase matically. [41] proposed SIFRank, the integration of a 1) Source Dataset Analysis and Source Dataset Preprostatistical model and a pre-trained language model, to cessing, described in Section 3.1, and Phase 2) Source calculate the relevance between candidates and docu- Dataset Translation and Target Dataset Postprocessing, ment topics. Other works are focused on the extraction described in Section 3.2. of multilingual terminology across domains using trans- Figure 1 summarises the method for the translation formers [42]. process in which, given the two original datasets, a set of

Two of the most recent works in the field of AKE us- four datasets translated into Spanish is obtained, using ing language models are AttentionRank and MDERank. two diferent translation systems.

AttentionRank [13] integrates self-attention weights extracted from a pre-trained language model with the calculated cross-attention relevancy value to identify key- 3.1. Phase 1: Dataset analysis and words that are important to the local sentence context preprocessing and also have strong relevancy to all sentences within the whole document. MDERank [12] bases the identifi

In order to generate the proposed silver standard for

Spanish AKE, we have selected the two previously mentioned datasets, as they are widely used in experiments of this kind: SemEval2010 Task 5 [17] and SemEval2017 2https://live.european-language-grid.eu/catalogue/toolservice/8122 1 e s a h P 2 e s a h P

SemEval2017 Term Annotation with quotes

service Google Translator

Term Annotation with HTML tag few-shot prompt

ChatGPT 3.5

Spa_SemEval2010GT Spa_SemEval2010GPT Spa_SemEval2017GT Spa_SemEval2017GPT

Manual revision some of the keywords come from the ones manually provided by the authors of the papers themselves, and they may not have an exact correspondence in the text.

Regarding the preprocessing of the datasets, there are two main aspects involved in the translation process. The first one refers to the original text. Not many issues were found during the translation of SemEval2017 corpus, since it had a manageable size and a clean structure. However, the original texts of SemEval2010 were arbitrarily segmented, very long, and contained references and formulas, which posed many problems for the automatic translator when processing them.

The second aspect refers to the keywords. For the translation of the keywords, we did not simply translate the list of keywords out of context, but decided to mark them in the texts with annotations marks (quotation marks or the HTML tag , depending on the translation system). Then, we translated the texts and retrieved the translated terms contained within the annotation marks.

Task 10 [18]. Both datasets are published following the

same structure, a set of documents containing the raw text (named docsutf8) and a set of documents containing the extracted keywords (named keys). Both types of documents present the same identifiers to match keywords with source documents.

Despite their similar structure, they present several 3.2. Phase 2: Dataset translation and diferences. As shown in Table 1, the main diference postprocessing lies in their size. With a smaller number of documents, SemEval2010 far exceeds SemEval2017 in the total num- Most of the existing approaches to create silver standards ber of tokens, which means that it contains fewer docu- from existing gold standards by leveraging machine transments, but of a much larger size. SemEval2017 contains lation rely on at least two translation sources: one from shorter documents with an average of 6 to 7 sentences, a common online translator such as DeepL3 or Google whereas SemEval2010 contains full scientific papers with Translate4, and the other using a Neural Machine Transhundreds of sentences. It is interesting to note that, al- lation model, as suggested in [44]. As already announced, though SemEval2010 is bigger in number of documents in this work we have used Google Translate and ChatGPT and number of tokens, SemEval2017 has a bigger number 3.5 Turbo5 APIs. of extracted keywords. This means that the keywords The keywords from the texts that were translated with from SemEval2010 have greater representation and num- Google Translate were annotated with quotation marks. ber of occurrences than the keywords from 2017. These However, on some occasions the system retrieved errors diferences in size are important because they require a in which the annotation marks were missing or misplaced diferent treatment of the documents during the prepro- in the translated sentence, and either it was not possible cessing and the translation stage. to extract the translated term from the annotated sen

In both datasets, over 50% of the keywords are unigram tence or the extracted term was not correct. To avoid or bigram. However, in SemEval2010 we observe that 555 keywords are not present in the documents with a similar 3https://www.deepl.com/es/translator span text. The reason for this is to be found in the way in 4https://translate.google.es/ which the original dataset was created. In SemEval2010, 5https://platform.openai.com/docs/models/gpt-3-5-turbo that, we decided to append the original term to each anno- model of spaCy has to be downloaded before the methods tated sentence, to force the system to take that term into can be run. account and provide a translation. For instance, in the For the RAKE method, the original library cannot be translation of the sentence ‘...has held two "mobile com- used as it is only oriented to the English language. Howputing" design competitions’ focused on the term ‘mobile ever, there is a version named Multi-rake8 which covers computing’ the translation lost the quotation marks: ‘ha diferent languages. As the method is statistical, to percelebrado dos concursos de diseño de computación móvil’. form multilingually, the addition of stopword lists from Thus, we add the term repeated to obtain the translation the diferent target languages is necessary. of the term: ‘...has held two "mobile computing" design competitions. Mobile computing’. 4.2. Attention Rank

With ChatGPT, the tag was used to mark the keywords before and after. The prompt sent to the generative The implementation of the original authors9 had to be model described the purpose of the model (i.e., ’You are reimplemented from scratch. The original repository a Spanish translator specialised in terminology’), and does not have libraries and version specifications. Morethen some examples of annotations in English and its over, the original code relies on libraries for language translations in Spanish with the annotated and trans- models that are not maintained as well as the noun lated keywords were provided. This is called few-shot phrases identification component, which relies on the prompting. The full prompt is presented in Annex A. part-of-speech annotation of Stanford CoreNLP and a

Regarding the postprocessing stage, several actions third-party library. Reproducibility was not possible in were performed. First, we extracted all the annotated this work. occurrences of each keyword in the sentence, creating a A new repository10 has been created for the implemenlist of translation candidates per keyword. In some cases, tation of the Attention rank method. This repository uses reconciliation between candidates was necessary to pro- HuggingFace’s library transformer to manage language vide a single translation for each keyword. In the case models and spaCy to identify noun phrases. The reposithat no disparities between the candidates were found, tory details the specific libraries and versions needed and the translated keyword was automatically assigned. In the external modules needed. The new repository allows case of disparities, terms were manually reviewed and the use of BERT (as in the original work) and RoBERTa a translated keyword was manually assigned. In total, architecture models in diferent languages. we manually reviewed an average number of 2000 key- The adaptation for RoBERTa models had to deal with words per dataset (220 documents in SemEval2010 and two specific issues regarding the tokeniser. The first one 360 documents in SemEval2017). is the use of diferent special tokens to delimit sentences at the beginning and at the end to focus the attention mechanisms, as BERT uses ’[CLS]’ and ’[SEP]’ tokens, 4. AKE Adaptation to Spanish RoBERTa uses ’<s>’ and ’</s>’. The second issue is the generated tokens, as BERT uses a WordPiece tokeniser In this section, the diferent AKE methods used for the ex- in which subwords are marked with the ’##’ tag (e.g., periments and their implementation are presented. Some the word thicknesses is divided into tokens thickness and of them have already been implemented and maintained ##es). In contrast, RoBERTa models use Byte-level Pair by well-known Python libraries and contain adapters to Encoding (BPE) and classifies diferent tokens for char work with other languages. Two of them, those that are sequences that start a word or that are inside. The tokens based on language models, had to be re-implemented and that start a word include the white space before the word, adapted. In addition to diferent technical aspects, both and they are marked whith the special character ’ Ġ’. For methods use the original BERT model [48] for English, instance, the word extrapolate is divided into two tokens: and the RoBERTa MarIA model [49] for Spanish. ’ Ġextrap’ and ’olate’.

Beyond the diferences studied in previous works on 4.1. Already implemented methods the benefits or diferences between both types of tokenisers [50], this work had to develop the alignment process between the words of keywords and their corresponding tokens. With WordPiece is easier to find tokens and recompose the original word, but BPE is sensible to appearance of the white space before the token. If it does not appear, the token is diferent and its attention value

The methods used for the evaluation are TopicRank,

YAKE and RAKE. The Python library PKE6 has been used for the execution of the TopicRank and YAKE methods.

PKE uses the Python library spaCy7, as many other methods, to identify candidate chunks or nominal phrases that can be relevant terms or keywords. Thus, the Spanish 6https://github.com/boudinfl/pke 7https://spacy.io/ 8https://github.com/vgrabovets/multi_rake 9https://github.com/hd10-iupui/AttentionRank 10https://github.com/oeg-upm/AttentionRankLib changes. This issue has been solved by ensuring that the The results of the AKE algorithms on the Spanish input sentences always have a white space before a word. datasets, both multilingual and adapted for Spanish, show a lower performance compared to the original datasets. 4.3. MDERank However, they are in line with the results for English. Unlike many other NLP experiments, where a good result The original implementation11 contains a better descrip- is represented by metrics starting at 0.6 or 0.7 of f1 score, tion of the requirements. However, it is described for the highest metrics achieved by the algorithms tested in Python 3.7 which is no longer supported by the commu- SemEval2010 and 2017 do not exceed 0.3821 (BR17 and nity and most of the versions of the required libraries are K= 15). deprecated. Also, parts of the execution code are wrong We already expected lower values, as the translation such as the command line execution or the arguments, process is not perfect and it is not always possible to and there is no code related to the KPEBERT model, a maintain the correlation of one keyword in English to model which is trained and used for keyword identifi- the same keyword in Spanish. Apart from the errors cation. Only it is possible to execute it with traditional detected (explained in Section 5.2), GPT3 showed better BERT models. performance in maintaining the structure and terminol

To update the code and method, a new repository has ogy of the translated document. been created12. In which the requirements, code and exe- It is also important to mention the diferent results cution process have improved. As AttentionRank, MDER- obtained for each dataset. For Spa SemEval2017GT and ank used Stanford CoreNLP for the identification of noun Spa SemEval2017GPT the best results, in terms of precifragments and it has been updated to spaCy. Finally, the sion, recall and f1-score, are obtained by the two methods method can now support RoBERTa models by taking into that are based on language models: AttentionRank and account the problems mentioned in AttentionRank. MDERank. Although the original dataset contains complex keywords, the language models perform well as in 5. Evaluation the English dataset.

Surprisingly, for Spa SemEval2010GT and Spa SemEval2010GPT the best results are obtained by YAKE.

The nature of the documents in SemEval2010, which are full papers without any cleaning, including formulas, references and citations, makes it dificult for a language model to perform well. An added issue is the large length of the documents, which in the case of RAKE produces results close to zero.

This section discusses the evaluation results obtained

from the execution of the five AKE methods on the four translated datasets (Spa_SemEval2010GT, Spa_SemEval2010GPT, Spa_SemEval2017GT and Spa_SemEval2017GPT). The metrics used in the evaluation are precision, recall and f1-measure. Following previous works in the literature, the methods are evaluated with the three metrics at the top K of the keywords extracted in each method. K equals 5, 10, and 5.2. Error Analysis and Discussion 15. Finally, we perform an error analysis and present a After a thorough analysis of the results, we conclude discussion around it. Table 2 shows the results obtained. that, beyond some translation errors, the main reason behind the low numbers seems to be the poor quality of 5.1. Results some keywords in the original datasets. Although both datasets are claimed to have been either generated or reviewed by humans, we have detected a great number of anomalies that may be the main source of errors, as we try to illustrate below: • Duplicated structures: We find similar structures with small variations which produce noise and inconsistencies, such as terms with determiners (i.e. metal and the metal), terms with symbols or special characters (i.e. logical inference and “logical inference"), and terms with diferent spellings (i.e reputation mechanism and Reputation mechanism). • Misspelled structures: We found several examples of misspelled structures, and, specifically, missing

11https://github.com/LinhanZ/mderank 12https://github.com/oeg-upm/mderanklib letters both at the beginning and at the end of the structure (i.e. aked instead of baked).

netic can be found translated into two diferent keywords throughout the text, as ferromagnética and ferromagnéticos. However, with the aim to be faithful to the original evaluation datasets, we decided to choose one of the translations and discard the alternatives, although we believe that the datasets would benefit from including such variation. • Non-terminological structures: This is the most common anomaly in both datasets, and one of the main causes for the low performance of the algorithms, both in English and in Spanish. Examples of such non-terminological structures are: full sentences (i.e. dynamics which clearly reveal the origins of the roaming), sentence fragments (i.e. 6. Conclusions loading force and penetration depth were recorded and their respective values were correlated with This work has analysed the current state-of-the-art of authe observed), concatenated structures (i.e.1. well tomatic keyword extraction and, in particular, the Spandefined phase space dividing surfaces attached to , ish landscape. In this analysis, we have identified the i.e.2. austenitic or austenitic & ferritic stainless lack of an evaluation framework (including datasets and steel), or even text fragments with references (i.e.1. ready-to-test algorithms) for AKE in Spanish. Consecomparison between the realistic calculations for quently, this paper proposes two contributions. First, the positive parity [12] and negative parity [14], based generation of a silver standard for the Spanish language on the same quark model [15], i.e.2. calculation by community by the translation of two English datasets Martinez-Pinedo et al.). widely used to evaluate AKE approaches: SemEval2010 and SemEval2017. Second, the configuration of a set of

Additionally to inaccuracies and anomalies mentioned state-of-the-art algorithms in an easily executable manbefore, in the results we observe that in some instances ner to facilitate the evaluation task, including the adaptathe same keyword has been translated diferently into tion of two current methods that rely on language models: Spanish in diferent parts of the text. For example, the Attention Rank and MDERank. term deployment has been translated both as despliegue With the benchmark in place, we have performed an and implementación within the same text; or the com- evaluation of the implemented algorithms and the transpound term information aggregation can be found trans- lated datasets. To be consistent with the evaluations in lated as agregación de información and agregación de la English, the translated datasets maintain the original ininformación. In itself, this would not be a problem be- ner structure. The results in Spanish suggest the same cause these are correct translations in Spanish. Moreover, tendency as in English, although they are lower. The even in specialised domains, term variants are commonly error analysis shows that low results are due to several used to designate the same concept. factors: 1) the quality of the original datasets, as they

A similar issue occurs when Spanish terms vary in contain noisy texts, non-terminological structures, and gender and number. For instance, the keyword ferromag

Acknowledgments This work has been partially founded by INESDATA

(https://inesdata-project.eu/) project, funded by the Spanish Ministry of Digital Transformation and Public Afairs and NextGenerationEU, in the framework of the UNICO I+D CLOUD Program - Real Decreto 959/2022. terms that are not contained in the texts, 2) the quality of the translations for the labelled datasets, as both systems present translation inconsistencies and have dififculties to keep track of the translated keyword in the text, 3) the fact that a 1 to 1 translation of keywords is not always possible nor desirable, and that it would be recommendable to include term variants.

In light of the results and taking these remarks into account, we conclude that maintaining the dataset structure in English to evaluate AKE tasks in Spanish might not be the most appropriate approach. For this reason, as part of future work we are considering two approaches for generating evaluation datasets in Spanish: 1) automatically postprocessing existing datasets, such as the two dealt with in this work, to eliminate all nonterminological structures and produce a list of candidate terms instead of just one in the translation process, and 2) semi-automatically generating a dataset with similar characteristics to the ones mentioned, but based on texts originally written in Spanish.

A. Term Translation Prompt You are a scientific translator of English to Spanish spe

cialized in terminology. I give you one sentence in English and the same sentence translated to Spanish. The English sentence has a term between the marks and . Identify in the Spanish sentence which words correspond to the same original term. The output term is in Spanish. Some examples

doi:10 .1109/ACCESS. 2020 . 2965087 . English

sentence:

"The University of Florida, in part[42]

Lang ,

Wachowiak ,

Heinisch , D.

Gromann, nership with Motorola, has held two mobile comput-

putational Linguistics: ACL-IJCNLP 2021 , 2021 , pp. Output: computación móvil English sentence: "There,

3607- 3620 . we assume that coeficients of non-renormalizable [43]

Ghafoor ,

A. S.

Imran ,

S. M.

Daudpota , Z. Kas- terms are suppressed enough to be neglected" . Span-

IEEE Access 9 ( 2021 ) 124478 - 124490 . cientes de los términos no renormalizables [44]

Bonifacio ,

Jeronymo ,

H. Q.

Abonizio , I.

Cam- English sentence: "It often exploits an optical dif-

ranking dataset , arXiv preprint arXiv:2108 . 13897 of the light flux at the surface of the tissue." Spanish

( 2021 ). sentence: "A menudo se utiliza un algoritmo de recon[45]

Araújo ,

Pereira ,

Benevenuto , A compara- strucción de imágenes basado en un modelo de difusión

ences 512 ( 2020 ) 1078 - 1102 . tejido." Output: algoritmo de reconstrucción de imágenes [46]

C. P.

Carrino ,

M. R.

Costa-Jussà ,

J. A.

Fonollosa , basado en un modelo de difusión óptica

arXiv: 1912 . 05200 ( 2019 ). teristic features of the UvT Expert Collection ; specifically, [47]

G. M.

Rosa ,

L. H.

Bonifacio , L. R. de Souza, R. Lotufo, we propose and evaluate refined expert finding and pro-

transfer methods , arXiv preprint arXiv:2105 .06813 nizational structure. " Spanish: "Un segundo grupo

( 2021 ). de experimentos está dirigido a extensiones de los méto[48]

Devlin , M.-

Chang ,

Lee , K.

Toutanova, dos base que aprovechan las características distintivas de

arXiv: 1810 . 04805 ( 2018 ). perfilado de expertos que incorporan la topicalidad y la [49] A . Gutiérrez-Fandiño , J.

Armengol-Estapé, estructura organizativa." output: topicalidad y la estruc-

language models , arXiv preprint arXiv:2107.07253

( 2021 ). [50]

Toraman ,

E. H.

Yilmaz ,

Şahinuç ,

Ozcelik ,

Lang . Inf. Process. 22 ( 2023 ). URL: https://doi.org/

10.1145/3578707. doi: 10 .1145/3578707.