Smart Topic Miner: Supporting Springer Nature Editors with Semantic Web Technologies Francesco Osborne1, Angelo Salatino1, Aliaksandr Birukou2, Enrico Motta1 1 Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK {francesco.osborne,angelo.salatino,enrico.motta}@open.ac.uk 2 Springer-Verlag GmbH, Tiergartenstrasse 17, 69121 Heidelberg, Germany aliaksandr.birukou@springer.com Abstract. Academic publishers, such as Springer Nature, annotate scholarly products with the appropriate research topics and keywords to facilitate the marketing process and to support (digital) libraries and academic search engines. This critical process is usually handled manually by experienced editors, leading to high costs and slow throughput. In this demo paper, we present Smart Topic Miner (STM), a semantic application designed to support the Springer Nature Computer Science editorial team in classifying scholarly publications. STM analyses conference proceedings and annotates them with a set of topics drawn from a large automatically generated ontology of research areas and a set of tags from Springer Nature Classification. Keywords: Scholarly Data, Ontology Learning, Bibliographic Data, Scholarly Ontologies, Data Mining, Conference Proceedings, Metadata. 1 Introduction An important challenge for academic publishers is to categorize their editorial products with respect to relevant research topics and keywords. This is critical for a variety of tasks that benefit both the publisher and the research community. First, the use of appropriate descriptors helps researchers in identifying relevant papers and supports academic search engines and recommender systems. In the second instance, a topic-based representation can inform marketing decisions, such as in which venues or communities to present a book. Finally, a granular description of the editorial content can be useful for producing advanced analytics about research trends, thus supporting publishing strategies. Traditionally, editors classify proceedings manually, by associating to each proceedings book a list of categories from existing classifications (e.g., ACM, MeSH) and a set of keywords. This task is performed according to their experience in the research field, after analysing titles, abstracts, keywords and a list of additional terms from the call for papers. It is thus a costly and time-consuming process that may be biased by the editor view of the academic landscape. In addition, this manual analysis may miss the subtle emergence of some innovative topics or fail to detect the decline of traditional ones. In this demo paper we present Smart Topic Miner (STM), a web application developed in collaboration with Springer Nature (SN) that classifies scholarly publications according to an automatically generated ontology of research areas. This paper is complementary to the one accepted in the ISWC 2016 Applications Track and focuses on the main functionalities and the technical implementation of the system. We refer the reader to [1] for a comprehensive exposition of the set-covering algorithm, the knowledge bases and the system evaluation. The demo version of STM is available at http://rexplore.kmi.open.ac.uk/STM_demo. The reader can try it by using the ‘Example Springer Nature Proceedings’ option, which allows testing the application by using six default SN proceedings. Figure 1. The STM interface. 2 Smart Topic Miner When conference organizers send the proceedings to Springer Nature, the papers are typeset and copyedited. In the typesetting phase, XML files with relevant metadata are produced. The in-house editors analyse these metadata with the Smart Topic Miner for selecting a number of keywords and SN classification tags. The output includes: 1) a set of research areas structured according to an ontology of research areas, 2) a set of Springer Nature Classification tags, and 3) a variety of analytics for allowing editors to analyse the content of the proceedings and the quality of the classification. The web interface of STM is shown in Figure 1. 2.1 Knowledge Bases STM categorizes publications according to two classifications: the Klink-2 Computer Science Ontology (CSO) and the SN Classification for Computer Science (SNC). CSO was created by applying the Klink-2 algorithm [2] on a dataset of about 16 million publications, mainly drawn from Computer Science. Klink-2 is an algorithm which generates an ontology of research areas by inferring semantic relationships from scholarly metadata and external sources – e.g., DBpedia, calls for papers, web pages. It is integrated in Rexplore [3], an innovative system which uses semantic technologies for exploring and making sense of scholarly data. The current version of CSO includes about 17k topics linked by 70k semantic relations and structured in terms of 8 levels of granularity. The Springer Nature Classification for Computer Science is an internal company classification, which is used to categorize proceedings, books and journals. It contains 76 categories in a three level taxonomy and was mapped to CSO by means of 349 relationships, so that every SN category is associated to a set of related topics. 2.2 Architecture Figure 2 shows the STM architecture. When the user submits a collection of XML files, the parser extracts the relevant metadata, which are then sent as a JSON file to the background API via a POST query. Figure 2. The STM architecture. The backend tags each paper with a list of frequent terms extracted from its abstract, title, and keywords. Then it associates to each topic in the CSO ontology all papers tagged with its label or the label of a sub-area. For example, a paper containing the keyword “support vector machines” will be associated with the Support Vector Machines research area and with all its super-topics, such as Machine Learning and Artificial Intelligence. The set of topics is then pruned by applying a greedy set- covering algorithm. The size of the resulting set and the granularity of the topics depend on a number of parameters controlled by the users, as discussed in the next section. STM then uses the mapping between CSO and SNC to produce a relevant set of SNC tags. Finally, the outcome of the process is cached to improve performances of future queries and returned to the front-end. 2.3 Main Functionalities The STM web interface was iteratively improved by taking in consideration the feedback of experienced SN editors and thus includes many functionalities to enhance their ability to customize the output and to assess its quality and coverage. In fact, editors did not want a completely automatic process, but a flexible tool that could be used to investigate the proceedings and to produce different kinds of annotations according to their needs. In the following we will discuss the main functionalities and their rationales (further details are discussed in [1]). The most used setting is the granularity value, which goes from 1 to 5 (default is 3) and allows users to intuitively choose how comprehensive the classification should be. Every level of granularity is associated with a number of settings of the set- covering algorithm. Figure 3 shows as an example a proceedings book processed with different granularities. A second important functionality is the show explanation one, which displays near each topic (e.g., Semantic Web) the list of keywords that were used to infer it (e.g., “OWL”, “linked data”, “ontology matching”) and how many papers they cover. In fact, editors often want to investigate new or unexpected topics to decide if they have to be included in the final version of the annotations. Editors need also to check how representative a certain set of topics and tags actually is. For this reason, STM offers an advanced analytics functionality that provides additional information, such as the percentage coverage of the outcome and a list of uncovered and covered papers associated with their keywords and topics. Figure 3. The same proceedings book processed with granularity 2, 3 and 4. 3 Conclusions In this demo paper we summarized the main characteristics of Smart Topic Miner, a Semantic Web application designed to assist Springer Nature editors in classifying conference proceedings. We are now working on integrating STM into the Springer Nature workflow and we also plan to release a public version of the application to help researchers in choosing the set of topics which best describe their work. References 1. Osborne, F., Salatino, A., Birukou, A., Motta, E.: Automatic Classification of Springer Nature Proceedings with Smart Topic Miner. In ISWC 2016 Application Track. (2016) 2. Osborne, F., Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. The Semantic Web-ISWC 2015, pp. 408-424. Springer. (2015) 3. Osborne, F., Motta, E. and Mulholland, P.: Exploring scholarly data with Rexplore. In International Semantic Web Conference (pp. 460-477). Springer. (2013)