TEKMA at CLEF-2021: BM25 based rankings for scientific publication retrieval and data set recommendation

Jüri Keller, Leon P. M. Munz
Technische Hochschule Köln, Ubierring 48, 50678 Cologne, Germany

Abstract
In this paper we report the results of our participation in the Living Labs for Academic Search (LiLAS) CLEF Challenge, which aims to strengthen the concept of user-centered living labs for the academic search domain. We made one submission for each of the two tasks. For both submissions we focused on data enrichment and Solr's implementation of the probabilistic BM25 ranking function. The proposed systems were evaluated live using the STELLA infrastructure. The live results show that the submitted pre-computed ranking for ad-hoc search (tekma_s) cannot compete with the live baseline system. However, our approach of a pre-computed hybrid recommendation system for research data sets (tekma_n) produced better results than the baseline system.

Keywords
Living Labs, Social Science, Life Science, (Online) Evaluation in IR

1. Introduction
Due to the continuing flood of information and the steadily growing number of scientific publications and research data sets, finding them remains an ongoing challenge. To find suitable publications in a multilingual scientific database, sophisticated search systems are required that can rank the most relevant results for a query at the top. In addition, recommendations of suitable research data sets can be equally relevant to completely cover the information need. Since the search for data sets can be tedious, even with dedicated search engines, a possible solution is to recommend relevant research data sets directly for corresponding publications. For this reason, as participants in the Living Labs [1] for Academic Search (LiLAS) CLEF Challenge, we decided to submit pre-computed rankings for both tasks presented below. An introduction to the LiLAS lab at CLEF can be found in the corresponding overview paper [2]. We participated in both tasks of LiLAS 2021:

- Task 1 - Ad-hoc retrieval of multilingual scientific documents. The goal of Task 1 is to support researchers in finding the most relevant documents for a head query. Participants are asked to create an experimental ranking system for the multilingual life science search portal LIVIVO (https://www.livivo.de/). A good ranking system should present users the most relevant documents for a query at the top of the result set. Multiple languages can be used for querying (e.g. English, German, French); regardless of the language used in the query, the retrieved results can include candidate documents in other languages.

- Task 2 - Research data set recommendations. The main task here is to provide a recommendation system for the social science portal GESIS Search (https://search.gesis.org). Given a seed publication, relevant research data sets should be recommended.
For example, if a user is interested in the impact of religion on political elections and has found a publication on that topic, she will be presented with a list of research data sets on the same topic (https://clef-lilas.github.io/tasks/).

Both proposed systems are based on an approach that uses the probabilistic BM25 ranking function [3] to determine the similarity between index and query. Results from the TREC-COVID Challenge (https://ir.nist.gov/covidSubmit/index.html) described by Roberts et al. [4] show that almost all top-performing systems used BM25 as a first-stage ranker to produce already strong baselines. As part of a semester project, we had successfully implemented a similar approach on the TREC-COVID data set. While those evaluations were offline, it is especially interesting to see the online performance using live data from real users in the Living Labs for Academic Search (LiLAS) Challenge at CLEF using the STELLA [5] infrastructure. Furthermore, we decided to modify the approach to function as a recommender system as well.

In general, there are three approaches to recommender systems: content-based recommendations, collaborative recommendations and hybrid approaches [6]. Since no user or profile data was initially available to accomplish the task, we used a type of content-based recommendation. After completing the first round, we could use the obtained click data to re-rank the results.

The remainder of this paper is structured as follows. Sections 2 and 3 outline the submitted systems tekma_s for ad-hoc retrieval and tekma_n for recommendations. In these sections, the corresponding corpora, enrichment approaches and experiments are described for each system. The results achieved are summarized in Section 4. Section 5 ends the paper with a conclusion.

2. Task 1: Ad-hoc retrieval of scientific documents
For the first evaluation round in Task 1, Ad-hoc Search Ranking, a pre-computed ranking approach was proposed. The system was implemented using Apache Solr (https://solr.apache.org/) and Pseudo-Relevance Feedback. To evaluate its ranking ability, the head queries and corresponding candidates are used to replicate the baseline system. Based on the given head queries and the full document corpus, multiple runs are pre-computed and evaluated.

2.1. The LIVIVO corpus
Through the lab organizers, two data sets are provided by the cooperating research infrastructure platform LIVIVO for Task 1: documents and candidates.

With the documents data set, metadata for over 22 million documents from the bio-medical field of the LIVIVO search portal was provided. The metadata includes, among others, titles, abstracts, tags from controlled vocabularies like the Medical Subject Headings (MESH, https://www.nlm.nih.gov/mesh/meshhome.html) and the Chemical Thesaurus (CHEM, https://images.webofknowledge.com/WOKRS534DR1/help/MEDLINE/hp_chemical_thesaurus.html) as well as the language of the document. Even though over three-quarters of the documents are labelled as English, documents in over 30 other languages are provided as well. The metadata is not distributed consistently, leaving some documents even without a title.

The candidate data set contains the head queries from the LIVIVO search portal and the ranked document identifiers. Every head query includes the query string and its frequency. The query strings are multilingual and sometimes include Boolean operators. Since the candidates are ordered based on the current LIVIVO ranking system, they can serve as a baseline ranking.

2.2. tekma_s
To pre-compute rankings, the full document corpus is indexed as provided using Apache Solr, as sketched below. The documents are processed by a Solr analyzer stack and then queried using Pseudo-Relevance Feedback. To fine-tune the queries, several fields are boosted. In general, only English documents are considered for the ranking.
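The following sketch illustrates how such an indexing step could look. It is a minimal example, not the submitted implementation (which is available in the linked repository): the pysolr client, the core name "livivo" and the record keys are assumptions, while the field names follow those described in this section.

```python
import pysolr

# Assumption: a local Solr core named "livivo" and metadata records already
# parsed into Python dicts; record keys below are illustrative.
solr = pysolr.Solr("http://localhost:8983/solr/livivo", always_commit=False)

def index_documents(records, batch_size=1000):
    """Push LIVIVO metadata records into the Solr index in batches."""
    batch = []
    for rec in records:
        batch.append({
            "id": rec.get("id"),                 # hypothetical identifier key
            "title": rec.get("title", ""),
            "abstract": rec.get("abstract", ""),
            "author": rec.get("author", []),
            "mesh": rec.get("mesh", []),
            "chem": rec.get("chem", []),
            "language": rec.get("language", []),
        })
        if len(batch) >= batch_size:
            solr.add(batch)
            batch = []
    if batch:
        solr.add(batch)
    solr.commit()
```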
The same analyzer is used for indexing and querying. It consists of the Solr standard tokenizer, the Solr classic filter, a stopword filter with a corresponding English stopword list, the Porter stem filter and an English possessive filter.

Queries are generated by searching multiple document fields with various boostings. Besides the title and abstract fields, the author, mesh, chem and language fields are considered for querying. Since the query and document analyzers are designed for English documents only and the vast majority of documents are English anyway, all other documents are ignored while querying.

In order to improve the baseline ranking, Pseudo-Relevance Feedback is used to extend the query. Based on the assumption that the best-ranked documents are somehow relevant, information from them is used to rewrite and extend the query [7]. Using the base query, a ranking is generated and the MESH terms are extracted from the ten best-ranked result documents. These MESH terms are ranked by frequency and the five most frequent terms are added to almost all fields in the final query, except the author and language fields, which do not contain topical information and therefore should not be expanded with MESH terms. Thus, to retrieve the final ranking, two queries are actually sent to the system: the first one gathers information from the initial search results and the second one uses this information in addition to produce the final ranking.

By using the provided head query candidates as a baseline, several query configurations and field boostings were tested. The submitted run tekma_s queries the fields title, abstract, author, mesh, chem and language. As described above, only English documents are utilized; therefore, all fields except the language field are optional and are boosted. The fields mesh and chem are boosted by 1.5. If they exist, they are considered highly relevant, since they classify the content of the document precisely. The fields title and author are boosted by 1.0, and because some head queries include author names, the author field is included. The abstract field is included as well, but is boosted down to 0.3 because abstracts contain many more words and the chances are higher that individual words are irrelevant for the document. The corresponding source code can be found in a public repository (https://github.com/stella-project/tekmas_precom).
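The two-pass pseudo-relevance feedback described above can be sketched as follows. This is a simplified illustration under assumptions: Solr's eDisMax parser is queried over its JSON API, the field names and boosts are taken from this section, and, unlike the submitted system, the expansion terms are applied to all query fields rather than excluding author and language.

```python
from collections import Counter
import requests

# Assumption: a Solr core "livivo"; field names/boosts as described above.
SOLR_SELECT = "http://localhost:8983/solr/livivo/select"
FIELD_BOOSTS = "title^1.0 author^1.0 mesh^1.5 chem^1.5 abstract^0.3"

def search(query, rows):
    params = {
        "q": query,
        "defType": "edismax",
        "qf": FIELD_BOOSTS,
        "fq": "language:eng",   # only English documents are considered
        "rows": rows,
        "fl": "id,mesh,score",
    }
    return requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]

def prf_ranking(query, k_docs=10, k_terms=5, rows=1000):
    # First pass: retrieve the ten best-ranked documents for the raw query.
    feedback_docs = search(query, rows=k_docs)
    # Count MESH terms in the feedback documents; keep the five most frequent.
    mesh_counts = Counter(t for d in feedback_docs for t in d.get("mesh", []))
    expansion = [term for term, _ in mesh_counts.most_common(k_terms)]
    # Second pass: extend the query with the expansion terms for the final ranking.
    expanded_query = " ".join([query] + expansion)
    return search(expanded_query, rows=rows)
```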
3. Task 2: Research data set recommendations
Building on the pre-computed ranking approach from round one, a variation of the system was proposed for this task. The system was adapted to the recommendation task, different re-rankers were added and the data sets were enriched. These changes are evaluated following the same strategy as described in Section 2.

The smaller document corpus for Task 2 allowed pre-computing rankings for every single seed document, making this task more suitable for the pre-computed system type. Before indexing, the baseline data sets are enriched with translations and with additional topics from the Consortium of European Social Science Data Archives (CESSDA, https://www.cessda.eu/). In this way, multiple languages can be used to query the corpus and the topic distribution becomes more complete. To pre-compute the recommendations, the tekma_n system utilizes the whole data set and not just the provided candidate lists. Instead of user-generated queries, the queries for this task are generated by the system itself from the seed documents. Retrieved results are re-ranked and then serve as recommendations for the seed document. The corresponding source code can be found in a public repository (https://github.com/stella-project/tekma_n_precom).

3.1. The GESIS corpus
Three data sets originating from GESIS Search are provided by the lab organizers for Task 2: publications, data sets and candidates.

The publication data set contains metadata for 110420 documents from GESIS Search, a social science database. The metadata includes 11 attribute fields, e.g. title, abstract and authors of the documents. Again, the metadata fields are filled inconsistently: 56% of the publications contain an abstract and topics are assigned in 67% of the cases. The publications are the seed documents for which data set recommendations should be made.

In addition, metadata for 99541 research data sets is provided. This metadata contains 16 fields, among them title, abstract and topic as well as dedicated fields for their English counterparts. The distribution of the content is shown in Table 1.

Table 1
Research data set statistics

                       Title   Title_en   Abstract   Abstract_en   Topic   Topic_en
Before preprocessing   99541   6320       94479      4725          83384   5067
After preprocessing    99541   6320       83957      4725          83957   7426

Like the LIVIVO corpus described in Section 2.1, the GESIS corpus also contains collections of candidates. The top 100 most frequently used seed documents and their data set recommendations are listed there. Thus, they can likewise be used as a baseline ranking.

3.2. Data enrichment
3.2.1. Field translation
The publication metadata fields for title and abstract are linguistically inconsistent. Using the Python library langdetect (https://pypi.org/project/langdetect/), we found that 53% of the titles and 46% of the abstracts are in German, while 40% of the titles and 35% of the abstracts are in English. In the metadata of the research data sets, in addition to the fields for titles and abstracts, there are also fields for English titles and abstracts; however, different languages are mixed here as well. To solve this problem of multilingualism and to homogenize and extend the publications' metadata, all titles and abstracts of the publications are translated into both languages using the Python library deep-translator (https://pypi.org/project/deep-translator/). For this purpose, two additional fields, title_en and abstract_en, were created and filled with the respective translated content. Since 93% of the titles and 81% of the abstracts are in German or English, we narrowed the translation down to these two languages and ignored all others. Using this method, we were able to sort the publication metadata linguistically and to populate the added field title_en with 110420 entries and abstract_en with 62013 entries.
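A minimal sketch of this detection and translation step is shown below, using the two libraries named above. The record field names and the decision to also fill German counterpart fields are illustrative assumptions, not a description of the submitted code.

```python
from langdetect import detect
from deep_translator import GoogleTranslator

to_en = GoogleTranslator(source="de", target="en")
to_de = GoogleTranslator(source="en", target="de")

def enrich_record(record):
    """Add translated title/abstract fields to a publication record (illustrative)."""
    for field in ("title", "abstract"):
        text = record.get(field)
        if not text:
            continue
        try:
            lang = detect(text)
        except Exception:   # langdetect raises an error on undetectable text
            continue
        if lang == "de":
            record[f"{field}_de"] = text
            record[f"{field}_en"] = to_en.translate(text)
        elif lang == "en":
            record[f"{field}_en"] = text
            record[f"{field}_de"] = to_de.translate(text)
        # all other languages are ignored, as described above
    return record
```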
3.2.2. Assigning missing topics
Not all metadata records have topics assigned. The assigned topics come from a controlled vocabulary managed by CESSDA. To automatically assign appropriate topics in a simple way, only topics that already exist in the corpus are used. Therefore, a collection of all topics in the corpus is created and then translated into German or English, depending on the source language. Since existing information or attributes should not be overwritten, two additional attribute fields were added: topic_ext_ger and topic_ext_en. For these collections, a matching procedure was performed on the title: if one of the topics appeared in the title of a metadata record, it was added to the corresponding attribute field. This approach should result in more matches being generated between the topics of the data sets. With this method, the German topics were expanded by 556 entries and the English topics by 2359 entries.

3.3. Indexing
By separating multilingual content into dedicated fields for each language, language-dependent analyzers could be used within one index. The same index and query analyzers were applied to the respective field types to achieve as many matches as possible in the search query. The filters and tokenizers correspond to the standard repertoire of Solr. For the German fields, the tokens were separated at blanks by a whitespace tokenizer; for the English fields, the standard tokenizer of Solr (https://solr.apache.org/guide/8_8/tokenizers.html#standard-tokenizer) is used. For both the German and the English field type, a lowercase filter and a stopword filter with a corresponding stopword list are applied; the English stopword list is based on WordNet (https://wordnet.princeton.edu/). For the German field type, the Snowball Porter stemmer (https://snowballstem.org/) with the language parameter "German2" is used to reduce the tokens uniformly to their root forms. For the English field type, the Porter stem filter (https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-PorterStemFilter) and the English possessive filter (https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-EnglishPossessiveFilter) are used.

Figure 1: Visualization of the full system used to pre-compute the tekma_n run, from data input on the left to the final output on the right. Curvy boxes represent data inputs, rectangular boxes processing steps.

3.4. tekma_n
To generate recommendations for a publication, the publication is used as a query to search the created Solr data set index. As the baseline search, Solr's default BM25 ranking function and a variety of field combinations and boosting factors are used.

3.4.1. Querying
Since not all fields are given for all seed publications, the queries are generated dynamically considering all available field data and therefore differ in length and complexity. If available, the fields title, abstract and topic as well as their language variants title_en, title_de, abstract_en, abstract_de, topic_en and topic_de and the extended topic fields ext_topic_de and ext_topic_en are used for the search. Each searched field is boosted individually per run, according to its ability to describe the searched data set; in general, title fields are boosted higher than abstract fields, for example. A sketch of this dynamic query construction is given below.
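The following sketch shows how such a query could be built from a seed publication and sent to the data set index. It assumes the pysolr client, Solr's eDisMax parser and an "id" result field; the boost values are illustrative placeholders, not the submitted configuration (see Section 3.5 for the values actually tested).

```python
import pysolr

# Assumption: a Solr core holding the enriched GESIS data sets.
solr = pysolr.Solr("http://localhost:8983/solr/gesis_datasets")

# Illustrative per-field boosts; title fields are weighted higher than abstracts.
QUERY_FIELDS = {
    "title": 1.0, "title_de": 1.0, "title_en": 1.0,
    "abstract": 0.5, "abstract_de": 0.5, "abstract_en": 0.5,
    "topic": 0.3, "topic_de": 0.3, "topic_en": 0.3,
    "ext_topic_de": 0.3, "ext_topic_en": 0.3,
}

def recommend(seed, rows=100):
    """Build a query from the available seed fields and search the data set index."""
    parts = []
    for field in ("title", "title_en", "topic"):   # seed fields used for querying
        value = seed.get(field)
        if isinstance(value, list):
            parts.extend(str(v) for v in value)
        elif value:
            parts.append(str(value))
    terms = " ".join(parts)                        # query escaping omitted for brevity
    qf = " ".join(f"{field}^{boost}" for field, boost in QUERY_FIELDS.items())
    results = solr.search(terms, **{"defType": "edismax", "qf": qf, "rows": rows})
    return [doc["id"] for doc in results]
```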
3.4.2. Re-Ranking
To improve recommendation quality, the baseline results are re-ranked in two ways. First, a re-ranker based on the results from round one is applied. On top of these re-ranked results, a second re-ranker is applied that considers similarity based on document embeddings.

3.4.3. Re-Ranking by User Feedback
As direct evidence of relevance, the click feedback from round one is used to boost certain data sets. Given a ranking from the baseline system, data sets that were clicked in round one for the same query document are boosted. Because clicks are sparse but important, a strong, static boost is added.

3.4.4. Embedding-based Similarities
Since documents and data sets have broad similarities in structure and nature, the overall document similarity is considered as another factor of relevance. To calculate similarity across documents and data sets, document embeddings and a k-nearest neighbors (k-NN) [8] algorithm are used. The document embeddings are calculated using SPECTER [9], a transformer-based SciBERT language model, through its publicly available web API (https://github.com/allenai/paper-embedding-public-apis). From the title and abstract of a document, this language model calculates a vector that represents the document. With vectors for all documents, the documents can be mapped into a multidimensional space and the distances between them can be measured: the closer two documents are, the greater the similarity between them. Using the k-NN algorithm with the Euclidean distance, the closest documents to a seed document are determined. Given a baseline ranking, the most similar data sets are calculated for the query document and all matches gain a strong, static boost.
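A minimal sketch of this embedding-based re-ranker is given below. It assumes that the SPECTER vectors for seed publications and data sets have already been retrieved from the public API and that scikit-learn is used for the k-NN search; the paper does not specify the implementation, and the boost value is purely illustrative.

```python
from sklearn.neighbors import NearestNeighbors

def build_knn(dataset_vectors):
    """Fit a Euclidean k-NN index over the data set embeddings (array-like, n x d)."""
    knn = NearestNeighbors(n_neighbors=10, metric="euclidean")
    knn.fit(dataset_vectors)
    return knn

def rerank(baseline, seed_vector, knn, dataset_ids, boost=100.0):
    """Boost data sets in the baseline ranking that are embedding neighbors of the seed.

    baseline: list of (dataset_id, BM25 score); seed_vector: SPECTER vector of the seed.
    """
    _, idx = knn.kneighbors(seed_vector.reshape(1, -1))
    neighbors = {dataset_ids[i] for i in idx[0]}
    rescored = [
        (score + boost if doc_id in neighbors else score, doc_id)
        for doc_id, score in baseline
    ]
    rescored.sort(reverse=True)
    return [doc_id for _, doc_id in rescored]
```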
Furthermore the fields title, title_en and topic from seed publications where used to query the fields title, abstract, topic and the created field ext_topic_de as well as the English fields title_en, abstract_en and the created field ext_topic_en. Evaluating re-ranking and boosting was done using the second baseline. Results are shown in Table 2. Each line contains the results for a single run. Runs marked as .1 and re-ranked "True" contain results for the same run above, but are re-ranked. The first three runs compare the boostings for the topic fields. The boosts 0.5, 0.7 and 0.3 are tested. Surprisingly, boosting the topics down to 0.3, tested in run 3, showed the best results. In the remaining runs 4 and 5 negative boosts are applied to the abstract field. In run 4 abstract fields are boost down to 0.3 and in run 5 slightly less harsh, down to 0.5. Results in general, are close to each other; run 3 and 5 are all almost the same. Even though run 5 performed slightly better in overall metrics like ndcg, the P@5 and R@5 for run 3 were slightly better. Since just little recommendations can be provided, these metrics were privileged and the configuration of the highlighted run 3.1 was used for the final system tekma_n. For the experiments in Table 2 allmost all runs without re-ranking performed slightly better. P@10 and R@10 from run 2.1 being an exception are marked with an asterisk. For the final run, re-ranking was included in any way to test its performance in a live system and on the full data. 4. Results tekma_s As the results in Table 3 show, the ranking ability of the proposed pre-computed run is limited by multiple factors. Overall, the system received 124 impressions in 80 days. This is mostly because it could only be utilized for former head queries which therefore were pre-computed. Furthermore, 61 pre-computed rankings from the submitted run have ten or fewer results, so the chance of being clicked is even smaller. Another limiting factor is the Table 3 Final results from system tekma_s after 80 days Metric Win Loss Tie Session Impression Clicks CTR Value 12 17 2 104 124 15 0.121 Table 4 Final results from system tekma_n after two rounds Metric Win Loss Tie Session Impression Clicks CTR Value 26 17 1 1144 2026 28 0.0138 Table 5 Ranking position distribution of data set recommendations clicked from task 2. Ranking Position 1 2 3 4 5 6 Amount documents clicked 21 8 6 5 2 5 language. By only including English documents, even for German queries, highly relevant documents were ignored. Because of the few resulting data, further investigations do not promise any information gain and therefore haven’t been done. tekma_n Over a course of 28 days, from 12. April to 9. May 2021 the system tekma_n received 1980 Impressions. Recommendation rankings for 57 seed documents received clicks, by that an overall click-through rate of 0,0227 was achieved. Two recommendation rankings received more than one click, resulting in a total of 45 data sets clicked. With 28 clicks total, the experimental system tekma_n wins 24 times, while the baseline system wins 16 times. The results are summarized as well in Table 4. One recommendation ranking revived equally one click for the experimental and one for the baseline system. The clicks are distributed unevenly favoring the first ranking positions. While data sets in the first position were clicked 21 times, data sets ranked lower were clicked less often. The full distribution of ranking positions documents were clicked is shown in Table 5. 
Considering only the clicked recommendation lists, both systems, the baseline and the experimental one, occupied the first ranking position almost equally often: the baseline system ranked first 11 times and the experimental system 10 times. Comparing all recommendation rankings, this difference grows to 1021 versus 958 in favor of the baseline system.

Analyzing the recommendations themselves, the experimental system tekma_n performed worse than the comparison system. The originally submitted recommendations were compared with the recommendations that were actually clicked: the experimental system does not rank 9 of the clicked data sets at all, but it ranked four data sets at exactly the same position at which they were ranked by the baseline system and clicked.

Investigating clicked data sets and seed publications helps to better understand the experimental system, which recommends data sets based on similarity. Given the publication with the German title "Kriminalität im Deutschen Kaiserreich, 1883-1902: eine sozialökologische Analyse", tekma_n recommended the data set "Sozialökologische Analyse der Kriminalität in Deutschland am Ende des 19. Jahrhunderts unter besonderer Berücksichtigung der Jugendkriminalität", which was clicked.

Seed publication:
Title: Kriminalität im Deutschen Kaiserreich, 1883-1902: eine sozialökologische Analyse [10]
Topic: GESIS-Studie

Clicked data set recommendation from tekma_n:
Title: Sozialökologische Analyse der Kriminalität in Deutschland am Ende des 19. Jahrhunderts unter besonderer Berücksichtigung der Jugendkriminalität [11]
Abstract: "Daten zur Kriminalität im Kaiserreich. Die Untersuchungseinheiten sind die Stadt- und Landkreise des Deutschen Reiches unter Berücksichtigung von Gebietsänderungen. Für alle erfassten Kreise wurden Kriminalitätsraten in den Kategorien Gesamtkriminalität, gefährliche Körperverletzung, sowie einfacher und schwerer Diebstahl erhoben. Themen: Entwicklung der Kriminalität in den Untersuchungsperioden 1893. (...)"
Extended Topic: Kriminalität

This match came about through multiple matching tokens in the title, topic and abstract fields. The matching title tokens already describe the broader topic shared by both documents; this is reinforced by the extended topic, and the temporal dimension is added through matching abstract tokens.

When creating the tekma_n system, one focus was data enrichment. As described in Section 3.2, the titles and abstracts of the publications were translated and the topics were extended. To measure whether these approaches had any effect on the data sets that were clicked, the interleaved recommendations returned by STELLA were compared with the recommendations computed with data enrichment (as submitted) and without data enrichment. If a data set is ranked lower without data enrichment, this directly affects its chance of being clicked for that query. Surprisingly, none of the applied data enrichment methods, neither the translations nor the newly assigned, formerly missing topics, resulted in changed positions for the clicked documents. Considering the small amount of data, data enrichment did not affect the results. The same method was used to measure any effects of the applied re-rankers, with the same outcome: the clicked documents were not re-ranked. A sketch of this comparison is given below.
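The comparison itself amounts to checking, for every clicked data set, whether its rank changes between the two pre-computed runs. The sketch below illustrates this under assumptions: rankings are plain lists of data set identifiers, clicks map a seed document to its clicked data sets, and all names are illustrative.

```python
def position(ranking, dataset_id):
    """Return the 1-based rank of a data set, or None if it was not ranked."""
    try:
        return ranking.index(dataset_id) + 1
    except ValueError:
        return None

def enrichment_effect(clicks, runs_with, runs_without):
    """Compare the rank of every clicked data set with and without data enrichment."""
    changed = []
    for seed_id, clicked_ids in clicks.items():
        for dataset_id in clicked_ids:
            rank_with = position(runs_with.get(seed_id, []), dataset_id)
            rank_without = position(runs_without.get(seed_id, []), dataset_id)
            if rank_with != rank_without:
                changed.append((seed_id, dataset_id, rank_with, rank_without))
    return changed   # an empty list means enrichment moved no clicked data set
```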
5. Conclusion
The goal of our participation in the Living Labs for Academic Search (LiLAS) CLEF Challenge was to extend our existing approach from the TREC-COVID Challenge and to evaluate how well it performs on different tasks and in a live environment. Building on the baseline system for the first task, we developed a recommender system for the second task with the same underlying calculation of token similarity, implemented with Solr. We extracted terms from the seed publications to generate queries. We paid special attention to the translation and expansion of the data sets as well as to re-rankings based on click data and embedding similarities. It turned out that our data enrichment methods and re-ranking did not affect the position of the clicked documents. Nevertheless, our experimental system performed better than the baseline system, with 28 clicks compared to 16. The better performance of our system can therefore be attributed to the BM25 function and the configured analyzers. These findings can be used as a guide for future experiments: the data expansion procedure could be extended so that significantly more topics are added, and multilingual processing and translation could be extended to other languages. It would also be interesting to see how well the system would perform over a longer runtime.

References
[1] P. Schaer, Living labs - an ethical challenge for researchers and platform providers, in: M. Zimmer, K. Kinder-Kurlanda (Eds.), Internet Research Ethics for the Social Age: New Challenges, Cases, and Contexts, Digital Formations, Peter Lang, 2017.
[2] P. Schaer, T. Breuer, L. J. Castro, B. Wolff, J. Schaible, N. Tavakolpoursaleh, Overview of LiLAS 2021 - Living Labs for Academic Search, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, 2021.
[3] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, NIST Special Publication 500-225 (1995) 109.
[4] K. Roberts, T. Alam, S. Bedrick, D. Demner-Fushman, K. Lo, I. Soboroff, E. M. Voorhees, L. L. Wang, W. R. Hersh, Searching for scientific evidence in a pandemic: An overview of TREC-COVID, CoRR abs/2104.09632 (2021). URL: https://arxiv.org/abs/2104.09632.
[5] T. Breuer, P. Schaer, N. Tavakolpoursaleh, J. Schaible, B. Wolff, B. Müller, STELLA: Towards a framework for the reproducibility of online search experiments, in: OSIRRC@SIGIR, 2019, pp. 8-11.
[6] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 734-749. doi:10.1109/TKDE.2005.99.
[7] I. Ruthven, M. Lalmas, A survey on the use of relevance feedback for information access systems, Knowledge Engineering Review 18 (2003) 95-145. doi:10.1017/S0269888903000638.
[8] E. Fix, J. L. Hodges Jr., Discriminatory analysis: Nonparametric discrimination, consistency properties, volume 1, USAF School of Aviation Medicine, 1951.
[9] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. Weld, SPECTER: Document-level representation learning using citation-informed transformers, in: ACL, 2020.
[10] H. Thome, Kriminalität im Deutschen Kaiserreich, 1883-1902: eine sozialökologische Analyse, Geschichte und Gesellschaft 28 (2002).
[11] H. Thome, Sozialökologische Analyse der Kriminalität in Deutschland am Ende des 19. Jahrhunderts unter besonderer Berücksichtigung der Jugendkriminalität, GESIS Datenarchiv, Köln.
ZA8100 Datenfile Version 1.0.0, 2006. doi:10.4232/1.8100.