Towards an IR Test Collection for the German National Library Johanna Munkelt1 , Philipp Schaer2 , and Klaus Lepsky2 1 Fachhochschulbibliothek Dortmund, 44227 Dortmund, Germany johanna.munkelt@fh-dortmund.de 2 Technische Hochschule Köln, 50678 Cologne, Germany firstname.lastname@th-koeln.de 1 Introduction Automatic content indexing is one of the innovations that are increasingly chang- ing the way libraries work. In theory, it promises a cataloguing service that would hardly be possible with humans in terms of speed, quantity and maybe quality. The German National Library (DNB) has also recognised this potential and is increasingly relying on the automatic indexing of their catalogue content. The DNB took a major step in this direction in 2017, which was announced in two pa- pers. Since September 2017, the DNB has discontinued the intellectual indexing of their series B (monographs and periodicals outside the publishing industry) and H (university publications) and has switched to an automatic process for these series. The subject indexing of online publications (series O) has been purely automatical since 2010. This again raises the well-known question: What is the quality of the automatic indexing compared to the manual work or in other words to which degree can the automatic indexing replace people without a significant drop regarding quality? As an argument for the conversion, the DNB primarily mentions the pro- gressing modernisation and the currently prevailing heterogeneity in the subject indexing of the various document types. The DNB intends to make a large pro- portion of access to document types accessible in a uniform manner using a process that saves time and is intended to contrast this heterogeneity with a more homogeneous subject indexing. A critical article by Klaus Ceynowa [1], General Director of the Bayerische Staatsbibliothek, which was published in the Frankfurter Allgemeine Zeitung at the end of July 2017, triggered a discussion about the DNB’s decision. Hei- drun Wiesenmüller, Professor for Library and Information Management at the Stuttgart Media University, also discusses the topic and shows some understand- ing for Ceynova’s position, although she puts some of his criticisms into perspec- tive and shows understanding for several arguments of the DNB [6]. A typical counterargument put forward by critics of automatic indexation is that the quality of content extraction suffers. A regulation which favours a uni- form development of more documents than before is only justified if the general development standard does not fall to a lower, “unacceptable level” [1], without 2 specifying the actual thresholds for such a measure, which Wiesenmüller recom- mends. She points out that there aren’t any common quality measures at the moment accepted in the community. In order to set up certain quality standards a discussion that includes the concerned interest groups would be necessary. The decision of the DNB and the various opinions on the same one raised the following question: Is automatic content indexing ready to satisfy the demands of librarians? We argue that it is necessary to carry out tests that examine the quality of the indexing by looking on the retrievability of the content. So far there is no evaluation of the subject indexing quality of the DNB data. It is impossible to make a comparison with previous practice because there is a lack of suitable tools, for example a test collection. A test collection designed specifically for the purpose of testing the subject indexing quality in DNB data is intended to provide a remedy. With the help of this collection a retrieval test and thus an objective evaluation is made possible, which so far could not take place and is nevertheless of great importance for the further developments at the DNB. Currently there is a project group working on constructing and working with such a test collection [3] at the Department of Information Science at the Tech- nische Hochschule Köln. The test collection should be based on principles similar to those used by the TREC community. In the following, the development pro- cess, the most relevant steps of construction and the prospects of using a test collection are presented. 2 Designing a Test Collection with DNB Catalogue Data Three basic building blocks are required for a standard retrieval test and its successful execution: (1) a document collection, (2) real information needs, which are imitated with the help of so-called topics with a short description and a more precise explanation (narrative), and (3) relevance assessments that are made for the result sets from the collection for the respective topics. With reference to the change in the subject indexing practice of DNB ex- plained at the beginning, some special features can be noted for a retrieval test and the underlying test collection: – the test collection consists of sufficient data taken from the DNB database, – the data must be processed accordingly and freed from unnecessary ballast for the sake of clarity, – the content distribution of the functional areas is at best based on the dis- tribution of functional areas within the overall data of the DNB, – the topics for which relevance assessments are to be made are not linked to a specific subject area in terms of content, but are as broadly dispersed as possible, – a suitable procedure must be chosen for the preparation of the relevance judgements, which is both practicable and expedient, – the implementation of all planned steps must take place, – the test collection must be tested for functionality and suitability. 3 2.1 The Document Corpus The corpus supplied by DNB consists of 200,000 documents and is broadly di- versified with a content distribution of over 100 subject groups. Of the data records, 131,538 document contents were intellectually indexed and 68,462 doc- ument contents machine-accessed; the share of machine-accessed document con- tents is therefore around 35 percent. DNB’s general catalogue contains more than 32 million records. For technical reasons, the entire data pool of the DNB cannot be used as a corpus for the test collection. The supplied Pica+ format is an internal bibliographic data format of special software, which in its pure form has very poor readability. The Pica+ format had to be converted into a clearer format, in this case a Solr-friendly XML. In this transformation, categories con- taining unnecessary information such as internal DNB codes were filtered out of the metadata, for reasons of clarity and data record reduction. The categories may have a different value internally in the DNB, but in this case they lead to a very confusing view of the documents and therefore were removed. 2.2 Topics and Pooling An existing topic pool was used as a basis for creating the topics. The topics that were already used in a retrieval test in the MILOS II project [4] were considered for this purpose. The MILOS II project dealt with the automatic indexing of catalogue data and was carried out in 1998 with data from the DNB. The data for the MILOS II retrieval test is comparable to the data of the test collection that is the subject of this work. The written results for the MILOS II test include 100 topics. These topics were created for a corpus consisting of title recordings of the DNB and covering all subject groups, with the exception of fiction, children and young people literature, and calendars. The database is therefore based on similar requirements. The number of topics to be created has been set to 50. The 100 topics from MILOS II were sifted and filtered. In particular, the topics were removed that had led to very few or no hits, since some of them are very specific. As with TREC, the topics consist of a topic ID, the topic title, a description that summarises in one sentence what is searched for, and the narrative, an explanation of what is relevant, partially relevant and not relevant for the topic. The description is on average 8.44 words long, the narrative is much more specific with 48.26 words on average. Description and narratives are also of great importance for the reuse of the test collection. One can use these topic components to find out why certain documents were evaluated as relevant or not relevant for a topic. The corpus of the test collection consists of 200,000 documents and 50 topics have been selected, which form the foundation for the search queries. The ideal case, in which a relevance judgement is available for each individual document topic pair, would mean the production of 10 million relevance judgements, which must be assessed by human judges. To avoid this enormous amount of work, a document pool was created for each topic, which was used for the relevance assessments. 4 DNB database processing of delivered data corpus with filtered categories, indexed in solr search query A search query B search query C search query D result list A result list B result list C result list D for topic 1 for topic 1 for topic 1 for topic 1 total of all results for one topic removal of duplicates document pool for topic 1 Preparation of the relevance assessments complete pool for topic 1 Fig. 1. Overview on the pooling process. The pooling procedure used for this (see Figure 1) is based on the procedure used at TREC for pool creation as part of the ad hoc task. At TREC, the systems of different participants search in the same corpus for the same topic. Since the systems usually work differently, the result sets differ. The top k ranked results are transmitted to TREC, from which a large pool is formed and this pool is evaluated. For the creation of these test collections, it was not possible to use different systems that searched the corpus. This results in slight changes in the pooling process. Four people with a background in library sciences and previous knowledge of the DNB corpus developed a search strategy for each topic and formulated a search query. This means that there are four search queries for each of the 50 topics, whose result sets can be combined into a pool. The four search queries for each topic are documented with the respective result sets. For each topic, the result sets are more or less similar. For topic 03 (Gerontology), for example, 31, 90, 179 and 89 records were found using different search strategies. The duplicates were removed and the remaining documents were combined in a pool. Such a pool was created for each topic. These pools are the basis for the relevance assessment. Some topics are characterised by a higher number of records than others, so the pools are of different sizes. The majority of the results per pool are between 50 and 150 documents. The smallest pool consists of 13 records (39 - Medicine in the Third Reich), the largest pool consists of 514 records (14 - Alternative energies), such outliers are exceptions. 2.3 Relevance Assessments With the help of the open-source toolkit Relevation! [2] a total of 6,984 relevance assessments were made. Relevation! ensures a clearer presentation of the docu- ments from the respective pools for a topic, simplifies the evaluation and finally outputs a result list of the relevance ratings in TREC format. The relevance was documented in graded form. A hit in a result set can therefore have three values: “0” for not relevant, “1” for partially relevant and “2” for relevant. From 5 Table 1. Examples of topics with topic ID, description and narrative (in German). ID Topic Description Narrative 24 Geschichte Finde Dokumente, die Relevante Dokumente behandeln die Geschichte des Israels sich mit der Geschichte Staates Israel und die territorialen Dispute, die des Staates Israel damit einhergehen. Relevant sind auch Dokumente, beschäftigen. die sich mit dem historischen, kulturellen und re- ligiösen Israel befassen und dessen Bedeutung für den gegenwärtigen Staat und die Konflikte. Teilweise relevante Dokumente behandeln die Geschichte Is- raels im Hinblick auf bestimmte Einzelaspekte. Nicht relevant sind Dokumente, in denen das Wort Geschichte im Sinne von “Erzählung” vorkommt. 25 Heilfasten Finde Dokumente, die Relevante Dokumente behandeln unterschiedliche sich mit Heilfasten Methoden von Nulldiäten und Heilfasten und der beschäftigen. Wirkung auf die körperliche und seelische Gesund- heit sowie den Risiken, die damit verbunden sind. Relevant sind auch Dokumente, die sich mit dem esotherischen Aspekt von Fasten und Diät beschäfti- gen. Teilweise relevant sind Dokumente, die das re- ligiöse Fasten zum Thema haben. 26 Elektroautos Finde Dokumente, die Relevante Dokumente behandeln die Entwicklung sich mit elektrisch be- von Kraftfahrzeugen, die mit Elektromotoren be- triebenen Fahrzeugen be- trieben werden und die Bedeutung für die En- fassen. ergiewirtschaft. Relevant sind auch Dokumente, die sich mit dem Umweltaspekt elektrischer Motoren beschäftigen. Teilweise relevant sind Dokumente, die Elektromobilität im Allgemeinen mit dem Thema Elektroautos verbinden. 27 Homöo- Finde Dokumente, die Relevante Dokumente behandeln die verschiedenen pathische sich mit homöopathis- homöopathischen Wirkstoffe sowie die Diskussion Mittel chen Mitteln und ihren um Wirksamkeit homöopathischer Mittel allgemein. Wirkungen beschäftigen. Auch tiermedizinische Dokumente sind relevant. the total number of 6,984 judgements 909 were relevant, 1,457 were partially relevant, and 4,616 were not relevant. In the preparation of the judgements, care was taken to ensure that they are based on the guidelines from the topics and the narratives. However, relevance assessments can never claim to be perfect because of their subjectivity. The use of a pooling procedure also does not guarantee the completeness of the relevant assessments. In this case, the pooling procedure is a compromise between the amount of work, the time available and the human and monetary resources. All topics and all relevance assessments were prepared by one person or finally revised. This avoids that different relevance judgements of different persons can influence the results of possible retrieval tests. It is impossible to completely avoid subjectivity in the relevance judgements. However, if the judgements on a topic come from the same person, they are stringent in themselves and can be considered valid. 3 Conclusion and Future Work The final test collection consists of the three essential components: (1) 200.000 revised document records containing all content-bearing fields of original DNB title recordings, (2) real information needs in the form of 50 topics with ID, title, 6 description and narratives, and (3) relevance judgements that were made using a pooling process and the Relevation! toolkit. The test collection can serve as a basis for retrieval tests to answer the ques- tion on the quality of the automatic indexing practices at DNB. By documenting all work steps and thought processes, the construction of the test collection as such has been made comprehensible. All data is static, the re-use of the collec- tion with the contained documents, topics and relevance judgements is possible. The test collection can be reused, transferred to other systems or adapted to the requirements of other retrieval tests. The wide range of subject groups makes the test collection suitable as a general test collection with library metadata from the German National Library. The size of the collection also facilitates reuse. The collection itself is not published yet but should be available in autumn 2018. The general feasibility of the IR collection was tested during the construction process. It was not yet used to evaluate the initial questions on indexing quality of the new DNB procedures. While this would be the primary area of application after the completion of this retrieval test, it should be assessed to what extent the performance of the test collection is convincing and which further areas of application are conceivable. In the area of relevance assessments, there is still a need to expand the test collection. With 200,000 documents and 50 topics, there are 10 million possible relevance assessments, of which almost 7,000 were compiled in this case. An expansion of this quota is conceivable with a corresponding investment in working time. Also interesting can be the use of the document set with a new set of topics, to which relevance judgements can be made. New topics and more specific topics are an option for further areas of application of the test collection due to the deliberately general distribution of subject areas in the document quantity. Finally the test collection should be tested for inter-rater reliability by drawing up various relevance assessments for the existing topics [5]. References 1. Ceynowa, K.: Deutsche Nationalbibliothek: In Frankfurt lesen jetzt zuerst Maschi- nen. FAZ.NET (Jul 2017), http://www.faz.net/1.5128954 2. Koopman, B., Zuccon, G.: Relevation!: An open source system for information re- trieval relevance assessment. In: SIGIR ’14. pp. 1243–1244. ACM, New York, NY, USA (2014), http://doi.acm.org/10.1145/2600428.2611175 3. Munkelt, J.: Erstellung einer DNB-Retrieval-Testkollektion. Bachelor thesis, Tech- nische Hochschule Köln (2018) 4. Sachse, E., Liebig, M., Gödert, W.: Automatische Indexierung unter Einbeziehung semantischer Relationen: Ergebnisse des Retrievaltests zum MILOS II-Projekt. Fachhochschule Köln, Fachbereich Bibliotheks- und Informationswesen, Köln (1998) 5. Schaer, P.: Better than their reputation? on the reliability of relevance assessments with students. In: Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. pp. 124–135. Springer, Berlin, Heidelberg (2012) 6. Wiesenmüller, H.: Das neue Sacherschließungskonzept der DNB in der FAZ (Aug 2017), http://www.basiswissen-rda.de/neues-sacherschliessungskonzept- faz/