What is in the proceedings? Combining publisher’s and researcher’s perspectives

Volha Bryl¹, Aliaksandr Birukou², Kai Eckert¹, and Mirjam Kessler²

¹ University of Mannheim, Germany
{volha|kai}@informatik.uni-mannheim.de
² Springer-Verlag, Heidelberg, Germany
{aliaksandr.birukou|mirjam.kessler}@springer.com

Abstract. Despite many efforts for making data about scholarly publications available on the Web of Data, lots of information about academic conferences is still contained in (at best) free-text format. When available in a structured format, these data would provide an essential input for the decisions researchers, libraries, publishers, funding and evaluation bodies take every day. In this paper we present a vision for having such data available as Linked Open Data (LOD), and we argue that this is only possible – and for the mutual benefit – in cooperation between researchers and publishers. We also present a pilot project aimed at publishing data about 8,500 computer science conferences as LOD.

Keywords: Linked Open Data, linked science, research evaluation

1 Motivation: why we need more data

Data on scientific publications, authors, institutions, and conferences are widely and publicly available on the web. Moreover, there have been many initiatives aimed at publishing these data as linked and open: for example, DBLP data on publications¹ or bibliographic information on books and authors provided by the German National Library². Both examples provide trusted data on publications with high coverage. A few applications have been developed to browse and query these data [1, 3], with a focus on authors, publications and research topics.

However, making sense of data about conference proceedings is still an issue. The conference series in which research results are published appears to be a crucial provenance dimension – along with the author metadata – based on which the research results are evaluated and trusted.
The problem becomes even more complex when one takes into account that conferences change names over time, as well as the recent rise of the “predatory publishing” and “fake conferences” phenomena³. Here are some examples of trust-related questions about conferences that various stakeholders in the academic world regularly face:

– Shall I submit a paper to this conference? How good and relevant is it? What are the alternatives? (younger researcher)
– Shall I accept a PC membership invitation, give an invited talk, send a workshop proposal to this conference? (more senior researcher)
– Shall we publish the proceedings of this conference? (publisher)
– Is it worth sponsoring this conference? (sponsor)

What do you do when you face these questions? You google, read many documents and webpages, ask people, and you are never sure whether you have found all relevant data and numbers.

¹ D2R Server for DBLP data – http://dblp.l3s.de/d2r/
² http://datahub.io/dataset/dnb-gemeinsame-normdatei
³ http://scholarlyoa.com/publishers/

Data about conferences are spread across several sources in a largely chaotic and unstructured way, and are duplicated multiple times. Let us take the example of PC (Program Committee) membership: being involved in paper reviewing and other activities related to conference organization is hard work that should be credited [2]. The data on the conference organizers and PC are also essential for a publisher when evaluating a new conference proposal. On the one hand, the Semantic Web Conference Ontology⁴ provides a way to describe the roles of scientists in conference organization, such as “chair” or “PC member”, and any conference management system (CMS), e.g. EasyChair, contains the list of PC members. On the other hand, hardly any conference reuses such PC data throughout the conference lifecycle.
Instead, the PC membership information is copied to appear on the conference webpage, in the call for papers (on WikiCfP, Eventseer or mailing lists), and in the preface of the proceedings. Moreover, traces of such PC data are also present on author webpages and in CVs. Obviously, changes in one system (e.g. a PC member declining their role via a CMS) are not necessarily reflected in other data sources (CfP, conference website).

The key issues to address here are data exchange between the various systems involved in conference organization and the lack of trusted, sustainable⁵, large-scale data sources providing detailed conference data. Currently, the LOD cloud includes several resources that contain conference metadata. The Semantic Web Conference Corpus⁶ includes information on major Semantic Web conferences (37) and workshops (235), thus providing high-quality but low-coverage data. Another example is COLINDA [4]⁷, which contains information about 15,000 conferences in the 2003–2013 time span. Its main data sources are WikiCfP and Eventseer, which aggregate information from calls for papers, meaning that there is no guarantee that the events (especially workshops) actually happened and had formal proceedings.

In the following, we show how the reliability, coverage and sustainability of such data can be improved by cooperation between publishers and researchers.

⁴ http://data.semanticweb.org/ns/swc/ontology
⁵ Not many resources and tools outlive the research projects they originate from.
⁶ http://data.semanticweb.org/
⁷ http://www.colinda.org/

2 Filling the gap: linked open conference data

The issues outlined above motivated the launch of the Springer LOD pilot, which aims at publishing data about Computer Science conferences as a linked open dataset.
The availability of such a dataset will contribute to the broader goals of publishing scholarly data as LOD:

– accessible science: data about publications, authors, topics, and conferences should be easy to explore;
– transparent science: the data on productivity and impact of authors, research institutions, and conferences should be open and easy to analyze.

But these goals are only marginally relevant for publishers, whose primary goal is, not surprisingly, commercial benefit. So, how do the interests of publishers and researchers align?

Publishing conference data as LOD would allow Springer to enrich the bibliographic data provided via data services to libraries, data agencies and aggregators. This would also allow linking to other data, thus increasing the visibility of the proceedings in the SpringerLink digital library. This would in turn benefit the conference community (i.e. researchers): more readers, more downloads, more citations, more conference submissions and participants. Moreover, Springer sees this as a way of collaborating with the research community and other stakeholders (libraries, indexing services, conference-related systems) to get new insights on the data. Also, the data would allow detecting trends in the conference business and planning accordingly: knowing that many conferences go to Russia or China, publishers need to establish agreements with local printers, take into account customs regulations, etc.

As with any LOD resource, sustainability is crucial, and in our opinion it is directly related to the economic value the data brings. Moreover, the benefits of boosting content usage and discoverability, and of data enrichment via linking, outweigh potential profits from selling these data.

The pilot started in 2013 and is ongoing.
In the conference dataset that will be made available as a result of the pilot (later in 2014), the following information is provided for each conference: conference series name and ID; conference ID, acronym, and number in the series; city, country, start and end dates. See Figure 1 for an internal XML representation. The starting point is the conference data present in the subtitles of the proceedings, i.e. in a free-text format: e.g. “12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I”. In the pilot, the Springer internal conference data management system was extended with a module that extracts and structures this information from the subtitles. Then, the quality of the extracted data is manually assured with the help of an interactive GUI, following the same philosophy for data quality standards as DBLP. The resulting conference data are stored in a database, which makes their conversion to RDF straightforward. As the example shows, the data contain some fields, e.g., conference acronym, number, city, country, link to the proceedings, which are either not available in COLINDA or the Semantic Web Conference Corpus, or available there only in free-text form.

Fig. 1. Data about conferences: example

Currently, the data for 8,500 conferences (which correspond to around 2,000 conference series) published in the LNCS, LNAI, LNBI, LNBIP, CCIS, IFIP-AICT and LNICST series since 1973 have been processed following the above procedure. As 650 new conferences are published in these series every year, the information about them will be added to the system, structured and exported into RDF. The data are curated at the publisher’s end, using processes that have been well established over the course of the last 40 years: the process of producing metadata for SpringerLink was augmented with an additional step, during which the conference metadata is extracted and its quality is assured.
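
To make the subtitle extraction step more concrete, the following is a minimal sketch of how a subtitle in the layout of the ISWC example above could be split into structured fields. It is an illustration under stated assumptions, not the actual Springer module: the names `parse_subtitle`, `ConferenceRecord` and `SUBTITLE_RE` are hypothetical, the pattern covers only subtitles of the form "<ordinal> <name>, <city>, ..., <country>, <month> <day>-<day>, <year>", and date ranges spanning two months are not handled. Subtitles that do not fit return `None`, which corresponds to leaving them for the manual quality assurance step.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# English month names, listed explicitly to avoid locale-dependent parsing.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical pattern (an assumption, not Springer's actual rules).
SUBTITLE_RE = re.compile(
    r"^(?P<number>\d+)(?:st|nd|rd|th)\s+"          # number in the series
    r"(?P<name>[^,]+),\s+"                         # conference name
    r"(?P<location>.+),\s+"                        # city, ..., country
    r"(?P<month>[A-Z][a-z]+)\s+"                   # month name
    r"(?P<first>\d{1,2})\s*[-\u2013]\s*(?P<last>\d{1,2}),\s+"  # day range
    r"(?P<year>\d{4})\b"
)


@dataclass(frozen=True)
class ConferenceRecord:
    """Structured fields recovered from a free-text proceedings subtitle."""
    number: int
    name: str
    city: str
    country: str
    start: date
    end: date


def parse_subtitle(subtitle: str) -> Optional[ConferenceRecord]:
    """Return a record, or None if the subtitle needs manual curation."""
    m = SUBTITLE_RE.match(subtitle.strip())
    if m is None or m["month"] not in MONTHS:
        return None
    # First location component is taken as the city, last as the country.
    places = [p.strip() for p in m["location"].split(",")]
    if len(places) < 2:
        return None
    month = MONTHS.index(m["month"]) + 1
    year = int(m["year"])
    try:
        start = date(year, month, int(m["first"]))
        end = date(year, month, int(m["last"]))
    except ValueError:  # e.g. "February 30"
        return None
    return ConferenceRecord(int(m["number"]), m["name"].strip(),
                            places[0], places[-1], start, end)


if __name__ == "__main__":
    print(parse_subtitle(
        "12th International Semantic Web Conference, Sydney, NSW, "
        "Australia, October 21-25, 2013, Proceedings, Part I"
    ))
```

Returning `None` instead of guessing keeps the sketch in line with the curation philosophy described above: automatic extraction proposes values only for subtitles it fully recognizes, and everything else is left to a human curator.
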
Services such as scholarly search engines will be able to use the conference data directly from SpringerLink. The very same conference metadata will then be published as LOD, in RDF format. Such separation of data from formats allows for adding third-party LOD conference data (for conferences not published by Springer) in the future.

In the future we plan to provide richer metadata that includes the number of submitted and accepted papers, acceptance rates, information on the best paper awards, PC and chairs, co-located workshops, links to CORE rankings⁸, etc. Moreover, in the future the data would go beyond the computer science scope, extending to the approximately 350 conferences published annually by Springer in other disciplines.

According to internal Springer statistics, the 8,500 conferences account for almost 300,000 articles published in the proceedings and slightly over 300,000 distinct authors contributing to the papers. Making publication and author metadata available is not the focus of the current stage of the pilot, but such information can be provided in the future by linking to other datasets, such as DBLP. Linking to citation figures (e.g. from CrossRef⁹) and ORCIDs will further enrich the data.

⁸ http://core.edu.au/index.php/categories/conference%20rankings
⁹ http://www.crossref.org/

3 How to move further?

The result of this initial data publishing stage is a well-structured, carefully maintained conference dataset, which can be interlinked with other datasets (DBLP or national libraries’ data, GeoNames and DBpedia for locations, etc.) and used in applications. However, the effort will hardly go beyond this initial data publishing stage unless both researchers and publishers actively participate in providing more data, linking them, and developing new applications that help answer the questions we posed in the introduction.
One example application is based on the Rexplore tool [3], which focuses on sensemaking tasks: Rexplore combines statistical analysis, semantic technologies and visual analytics, and allows answering complex queries to make sense of scholarly data. Importing a conference dataset into Rexplore and linking it with the publication datasets and the topic ontology the tool uses would allow analyzing how the focus and main topics of a specific conference series have changed over time, how “good” the conference is in terms of citations, which top researchers publish there or are involved in its organization, etc.

Another application is using the conference data during the conference lifecycle. Once entered in a CMS, the data about PC membership could be exported¹⁰ to become part of the LOD cloud and then displayed on the website in one of n standard ways (e.g. using specific plugins), or be included in the preface of the proceedings, various conference apps, etc. Such coordination between researchers and publishers would prevent data duplication and enable data reuse.

¹⁰ In fact, in the OCS (Online Conference Service) conference management system – http://ocs.cs.uni-dortmund.de – such an export already exists.

4 Acknowledgments

This work has been supported by the LOD2 and DM2E EU FP7 projects. We thank Max Schmachtenberg, University of Mannheim, for providing the metadata on the scholarly domain in LOD, and Markus Richter for developing the Springer conference data management tools.

References

1. Diederich, J., Balke, W.T., Thaden, U.: Demonstrating the semantic GrowBag: Automatically creating topic facets for FacetedDBLP. In: JCDL’07. pp. 505–505. ACM (2007)
2. Ley, M.: DBLP – some lessons learned. PVLDB 2(2), 1493–1500 (2009)
3. Osborne, F., Motta, E., Mulholland, P.: Exploring scholarly data with Rexplore. In: International Semantic Web Conference (1). pp. 460–477 (2013)
4. Softic, S., Vocht, L.D., Mannens, E., de Walle, R.V.: COLINDA – conference linked data. Submitted to Semantic Web Journal (2013), available at http://semantic-web-journal.net/content/colinda-conference-linked-data