PyTerrier-based Research Data Recommendations for Scientific Articles in the Social Sciences

Narges Tavakolpoursaleh, Johann Schaible
GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany

Abstract

Research data is of high importance in scientific research, especially when making progress in experimental investigations. However, finding appropriate research data is difficult. One possible way to alleviate the situation is to recommend research data to users of scholarly search systems based on the research articles they are viewing. With LiLAS, the lab organizers provide the opportunity i) to present such recommendations to users of the live system GESIS Search and ii) to evaluate the experimental recommender system in this live system with its actual users. As part of our participation in LiLAS, we implemented a simple method for recommending research data and evaluated our approach in two rounds, each lasting approximately one month. For our approach, we applied the classical TF-IDF method to rank research data by its relevance to existing publications. We measured our method’s usefulness using implicit user feedback, i.e., simple clicks on the recommendations. In both rounds, our experimental system obtained almost the same outcomes as the baseline.

Keywords

recommender systems, information retrieval, online evaluation, research data, living lab

1. Introduction

Evaluating recommender systems and their underlying approaches is a crucial step in assessing the overall quality of recommendations. Typically, such approaches are evaluated offline using test collections. Using such collections, which include relevance assessments, we can quantify how well a recommendation fits the users’ information need. The CLEF lab LiLAS makes use of the STELLA framework [1, 2, 3], which allows the organizers to provide an environment for evaluating recommendations online. This means that recommendations are presented to real users in a live system, and the recommendation quality is assessed by how well the users perceive the recommendations. Thus, no test collection or manual relevance assessment by domain experts is needed. Instead, solely the actions of the actual users of the search system “indicate” whether a recommendation fits the information need or not. This pseudo-relevance is estimated from implicit user feedback, such as clicks on well-fitting recommendations.

Recommending research data based on a currently viewed publication aims at alleviating the difficulty of finding appropriate research data for a specific use. However, to do so, a scholarly search system must contain information on both scientific publications and research data in a given domain. Luckily, the live search system GESIS Search [4], covering the broad domain of the social sciences, is integrated into the STELLA framework provided in the LiLAS lab.

In this paper, we present our recommendation approach for GESIS Search, which suggests research data based on a currently viewed scientific publication. We considered the basic content-representing metadata of publications and research data as the features.
After extracting the features, we compute the term frequency–inverse document frequency (TF-IDF) for each term in each dataset for text matching and for making recommendations. The implemented approach is provided as a Docker image in the format required by LiLAS for reproducibility. This way, our recommendation approach is able to compute recommendations on-the-fly instead of using pre-computed result lists for only a small portion of the scientific publications. The results show that our approach does not outperform the provided baseline in a statistically significant manner. However, it demonstrates its general usefulness, especially as it is a simple approach that can be improved further in many ways, e.g., by using multilingual features to detect more precise similarities.

In the following, we briefly present the scientific search system GESIS Search as well as the data and task provided by LiLAS (cf. Section 2). We describe our approach in detail in Section 3. The results of the evaluation and the discussion of the results are presented in Section 4 and Section 5, respectively, before we conclude the paper in Section 6.

2. System, Data and Task

2.1. System and the Data

GESIS Search (http://search.gesis.org) [4] offers an integrated search system for information on the broad topic of the social sciences and facilitates finding research data and publications in one portal. To train a recommendation approach, GESIS provides a corpus of social science publications and research data for the LiLAS participants. The research data set consists of about 78k records in the first round and 99k records in the second round. The number of publications provided increased from 93k records in the first round to 110k in the second round. The records are composed of document metadata in different languages. This metadata consists of a title, an abstract, and topics for both research data and publications, plus some type-specific metadata, such as authors and DOI for publications, and temporal and geographical coverage for research data. Examples of such research data and publications can be seen in Figure 2.

Figure 1: Density of the number of words in (a) abstracts and (b) titles

2.2. LiLAS Task

The recommendation task is defined as ranking the most relevant research data at the top for a given source publication. In submission type B, the participants should provide a REST API for receiving requests and returning the ranking. The system should be prepared as a Docker container service, which LiLAS integrates into the evaluation environment in GESIS Search. LiLAS also provides sample templates that implement minimal REST-based web services and can be extended by the participants. Finally, the participants register their publicly accessible GitHub repository at the central dashboard service of LiLAS for the living lab evaluation. To perform the task, the participants are provided with a list of seed publications (i.e., the publication IDs), a list of research data IDs, the metadata for both obtained from GESIS Search, as well as a list of research data candidates for 1k seed publications.

3. Approach

Krämer et al. [5] classified the relevance of research data to a research question in the social science domain into three main aspects: relevance of the content, relevance criteria related to data characteristics or factors, and documentation needed to assess relevance.
In their study, the participants assess the topical relevance of research data based on how well the research data content fits the research questions. Besides the topical relevance, participants assess relevance based on other metadata, such as the type of publication of the primary research connected to a research dataset, the temporal and spatial extent of the data, and the characteristics of the sample. Context-based recommendation of contextual data is rarely covered, due to the lack of standard evaluation data and the cold-start problem [6]. Given the opportunity to participate in a living lab evaluation infrastructure, we aim to assess our experimental content-based methods for recommending research data based on publications with the actual users of the GESIS portal.

The records (research data and publications) are rarely described with the complete metadata set, but most publications have at least a title, with an average length of 11.5 words. More than 40% of the publications have no abstract; where an abstract is present, it contains 86.7 words on average. We selected a few descriptive metadata elements from the publications and research data as the entities’ features. This metadata includes information that characterizes the entities and can semantically connect the two entity types: the title, abstract, and topics of research data and publications, which constitute the set of features used by our approach. The title, as a noteworthy minimal description of the data, expresses its content very briefly and is what users scan when looking for data [5]. Unlike abstracts and topics, titles are always available for both data types.

3.1. Experimental System: gesis_rec_pyterrier

We collected three fundamental descriptive metadata elements to identify the resources of both types: title, abstract, and topics. Publication titles are, in most cases, highly representative and informative with respect to the content of the paper. However, the titles of research data are not always informative; for example, “German General Social Survey - ALLBUS 2012”. The abstract information of both resource types is limited but suitable for identifying them. Also, the topics or keywords hold compact, essential descriptive information about the content of the resources. Figures 1 and 3 depict the distributions of title, abstract, and topic lengths (in number of words) for both document types.

As the first experimental system, we decided to utilize PyTerrier, a Python wrapper on top of Terrier for performing information retrieval experiments [7], and to compare it with the baseline. We chose this system to establish a comparison between the baseline and this simplistic out-of-the-box approach. PyTerrier makes it easy to conduct IR experiments with different weighting models, such as TF-IDF and BM25. Terrier also supports non-English texts, since it represents terms in UTF-8. It has additional plugins for BERT, EPIC, ColBERT, and other methods; however, we did not apply them in the first two experimental rounds. We considered the simple TF-IDF term weighting model, which scores a document regardless of term positions in the text, in order to compare it directly to the baseline using BM25. We collected the title, abstract, and topics of the publications for issuing the queries, and those of the research data for the indexing.
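For illustration, the textbook form of TF-IDF scores a query term t (i.e., a term from the publication’s metadata) against a research data record d as

\[ w(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}, \]

where tf(t, d) is the frequency of t in d, N is the number of indexed research data records, and df(t) is the number of records containing t; a record’s score is the sum of the weights of the matching query terms. The exact variant used by Terrier combines a length-normalized Robertson TF with the Sparck Jones IDF, as described in the next paragraph.

The following sketch shows how such an indexing and retrieval pipeline can be assembled with PyTerrier. It is a minimal illustration of the idea rather than our exact system code; the index path, the concatenation of the metadata fields into a single text field, and the sample records are assumptions made for the example.

    # Minimal sketch of the indexing/retrieval pipeline (illustrative only;
    # docnos, paths, and records are invented for this example).
    import re
    import pandas as pd
    import pyterrier as pt

    if not pt.started():
        pt.init()  # starts the Terrier JVM

    # Research data records: one text field concatenating title, abstract, topics.
    research_data = pd.DataFrame([
        {"docno": "rd-1", "text": "German General Social Survey - ALLBUS 2012 ..."},
        {"docno": "rd-2", "text": "Eurobarometer 84.3 ... attitudes towards ..."},
    ])

    # Index with PyTerrier's defaults (tokenization, stemming, stop-word removal).
    indexref = pt.DFIndexer("./rd_index", overwrite=True).index(
        research_data["text"], research_data["docno"])

    # TF_IDF is one of Terrier's built-in weighting models.
    tfidf = pt.BatchRetrieve(indexref, wmodel="TF_IDF")

    def recommend(publication: dict, k: int = 3) -> pd.DataFrame:
        """Build a query from a seed publication's metadata and rank research data."""
        query = " ".join(filter(None, [
            publication.get("title", ""),
            publication.get("abstract", ""),
            " ".join(publication.get("topics", [])),
        ]))
        # Strip punctuation that Terrier's query parser does not accept.
        query = re.sub(r"[^\w\s]", " ", query)
        return tfidf.search(query).head(k)

In the deployed system, this retrieval step sits behind the REST API described in Section 3.2, so that STELLA can request recommendations on-the-fly.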
The research data recommendations are based on the terms in the title, topics, and abstract of the research data as well as of the publications. Given a publication identifier (the seed item of the recommendation), it is translated into the corresponding publication title, abstract, and topics, which, in turn, are used to query the index of research data with a basic TF-IDF-based algorithm without extra features. This means that, during both the indexing and the retrieval process, the text from the title, abstract, and topics of research data and publications is analyzed using the standard tokenizer, stemmer, and stop-word removal provided by PyTerrier. The Terrier weighting model employs Robertson’s TF (the term frequency of the term in the document) and the standard Sparck Jones IDF [7]. The experimental system implements an API for indexing and searching, and it is provided as a Docker image in the format required by LiLAS for reproducibility (https://github.com/stella-project/gesis_rec_pyterrier).

Figure 3: Density of the number of words in the topics of (a) publications and (b) research data

3.2. Experimental Setup

The STELLA infrastructure used for the LiLAS lab integrates the participants’ experimental recommender systems in two forms: A) pre-computed runs and B) Docker containers (Dockerfiles and their source code) [8]. The participants can decide whether they submit type A or type B. We chose to submit type B, i.e., a Docker container comprising our recommendation approach. Our experimental ranking is merged with the baseline through the STELLA interleaving mechanism to generate the final result list, which is presented to the users (see Figure 2). User feedback in the form of clicks is collected and sent to the central STELLA server, where the evaluation metrics and some statistics are calculated, displayed, and reported to the participants (Figure 4). A sketch of the REST interface that such a type-B submission exposes is given after the figure captions below.

Figure 2: Screenshot of GESIS Search: an example of the experimental recommendation ranking, gesis_rec_pyterrier, interleaved with the baseline ranking

Figure 4: Screenshot of the LiLAS dashboard containing the evaluation metrics for the experimental system
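To illustrate submission type B, the sketch below shows the kind of minimal REST endpoint such a dockerized recommendation service could expose. It is a hypothetical example: the endpoint path, the parameter name item_id, the response format, and the helper recommend_for_publication are assumptions made for illustration, since the actual interface contract is defined by the STELLA sample templates.

    # Hypothetical sketch of a type-B REST service (Flask); the route, parameter,
    # and response schema are illustrative, not the exact STELLA contract.
    from typing import List
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def recommend_for_publication(item_id: str, k: int = 3) -> List[str]:
        """Stand-in for the PyTerrier retrieval sketched in Section 3.1:
        look up the publication's metadata and query the research data index."""
        return []  # placeholder; the real system returns ranked research data IDs

    @app.route("/recommendation", methods=["GET"])
    def recommendation():
        # STELLA sends the seed publication id; we return a ranked item list.
        item_id = request.args.get("item_id", "")
        return jsonify({"item_id": item_id,
                        "itemlist": recommend_for_publication(item_id)})

    if __name__ == "__main__":
        # Inside the Docker container, the service listens on a fixed port.
        app.run(host="0.0.0.0", port=5000)

A container built from such a service is then registered via the participant’s GitHub repository and queried by STELLA within the GESIS Search evaluation environment, as described above.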
4. Results

In the first round, in March 2021, the 100 most frequently viewed publications in GESIS Search were clicked between 4 and 29 times per session; in some sessions, items were viewed several times. Across all systems, the recommended research data obtained a click-through rate (CTR) of about 0.013, with 91 clicks (on 82 unique research data records) in 6,765 impressions. The CTRs for gesis_rec_pyterrier and the baseline are 0.0055 and 0.0069, respectively (Figure 5). In the first round, the two systems performed almost the same (Figure 6): the baseline shows a slightly, but not substantially, better performance than the experimental system, with an outcome of 0.5168.

Figure 5: Cumulative number of user clicks on the recommended items during the (a) first and (b) second round

In the second round, from 15 April to 25 May, we proceeded with our single experimental system gesis_rec_pyterrier. The experimental setting remained unchanged; only new publication and research data records were added to the corpus. In the second round, GESIS received 131 user clicks on the recommended items of all three systems. The CTR is 0.016, with 7,753 impressions in total. Our PyTerrier system received 25 clicks and obtained a CTR of 0.0068. In this round, the CTRs for the baseline and the new experimental system are 0.007 and 0.011, respectively. As in the first round, our PyTerrier system and the baseline perform almost the same (see Figure 6).

5. Discussion

Although the pair of ranking systems is interleaved, starting randomly from one system’s ranking, the number of clicks on the top-ranked item differs greatly from that of items at the other positions, as shown in Figure 8: 52.8% of the clicked items were at ranking position one. This shows that the first item in the ranking list is clicked the most, regardless of which system produced it. Recommendations at the second (13.5%) and third (15.7%) positions were clicked almost equally often. Several studies [9, 10] have also found that ranking higher yields a higher CTR. However, in the LiLAS setting, the interleaving algorithm gives the systems an equal chance of being presented first at the highest rank.

Figure 6: Outcomes for the baseline and PyTerrier during the (a) first and (b) second round

                                 WIN   LOSS   TIE   OUTCOME
    Round #1  BASE                46     43     1    0.5168
              gesis_rec_pyterrier 31     34     0    0.4769
    Round #2  BASE                54     68     4    0.442
              gesis_rec_pyterrier 25     27     1    0.48

Figure 7: Click-through rate in the (a) first and (b) second round

The recommendation service of GESIS displays just the top three ranks of two systems: the baseline, which is always present, and an experimental system interleaved by LiLAS. Due to the low traffic of active users viewing publications (i.e., the number of impressions per day) and the limited number of recommended items, the number of user clicks is low. Therefore, the collected clicks might not be entirely sufficient for a two-month evaluation of several test systems (Figure 9).

Figure 8: Position of the clicked item on the recommendation page: (a) distribution of the clicked items’ ranks in the two systems and (b) rank of the clicked item

Figure 9: Daily number of clicks on all recommended items during the first and second round

6. Conclusion

We conducted a first-hand experiment on the online evaluation of cross-domain recommendations between publications and research data. As the experimental system, we implemented a naive content-based recommendation using the metadata available in most documents, i.e., title, abstract, and topics, primarily to compare this out-of-the-box approach to the baseline. We submitted a dockerized system capable of reproducing the ranking with new data. Our recommender system is implemented using the PyTerrier library, and we applied a TF-IDF-based weighting model, which enables a direct comparison to the baseline’s BM25 ranking. This simplistic approach was not able to outperform the baseline. However, we used only the out-of-the-box PyTerrier system without any further configuration or inclusion of other features, such as multilingual support.

Future work comprises extending this simplistic approach step by step. For example, we will focus on utilizing the BERT plugin to integrate our embedding-based recommendation approach [11] into PyTerrier. We will also focus on user needs and decision factors for selecting research data, and apply information extraction, translation, semantic representations, and contextual text representation methods for the research data recommendation.
On a general note, the low number of clicks during the first two evaluation rounds (Figure 9) indicates that more “traffic” is needed to better evaluate recommendation approaches in a live setting, as a recommended item is clicked only after the users have searched for a publication, clicked on a publication of interest, and only then clicked on the recommended research data. However, we still believe that LiLAS provides an excellent opportunity for researchers to evaluate their approaches with real users in a live system. It supports both the researchers and the portal in developing and evaluating experimental systems for recommending cross-domain data.

References

[1] T. Breuer, P. Schaer, N. Tavakolpoursaleh, J. Schaible, B. Wolff, B. Müller, STELLA: Towards a framework for the reproducibility of online search experiments, in: OSIRRC@SIGIR, 2019, pp. 8–11.
[2] J. Schaible, T. Breuer, N. Tavakolpoursaleh, B. Müller, B. Wolff, P. Schaer, Evaluation infrastructures for academic shared tasks, Datenbank-Spektrum (2020) 1–8.
[3] P. Schaer, T. Breuer, L. J. Castro, B. Wolff, J. Schaible, N. Tavakolpoursaleh, Overview of LiLAS 2021 – Living Labs for Academic Search, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, 2021.
[4] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research data and related information in the social sciences, in: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2019, pp. 148–157.
[5] T. Krämer, A. Papenmeier, Z. Carevic, D. Kern, B. Mathiak, Data-seeking behaviour in the social sciences, International Journal on Digital Libraries (2021) 1–21.
[6] Y. Li, J. Nie, Y. Zhang, B. Wang, B. Yan, F. Weng, Contextual recommendation based on text mining, in: Coling 2010: Posters, 2010, pp. 692–700.
[7] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using PyTerrier, in: Proceedings of ICTIR 2020, 2020.
[8] P. Schaer, J. Schaible, L. J. G. Castro, Overview of LiLAS 2020 – Living Labs for Academic Search, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp. 364–371.
[9] Y. Chen, T. W. Yan, Position-normalized click prediction in search advertising, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, 2012. URL: https://doi.org/10.1145/2339530.2339654. doi:10.1145/2339530.2339654.
[10] D. Yankov, P. Berkhin, L. Li, Evaluation of explore-exploit policies in multi-result ranking systems, arXiv preprint arXiv:1504.07662 (2015).
[11] N. Tavakolpoursaleh, J. Schaible, S. Dietze, Using word embeddings for recommending datasets based on scientific publications, in: LWDA, 2019.