Editorial for the Second Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics (CLBib2017) Iana Atanassova Centre Tesnière - CRIT, University of Bourgogne Franche-Comté, Besançon, France iana.atanassova@univ-fcomte.fr Marc Bertin ELICO Laboratory, University of Lyon, Lyon, France marc.bertin@univ-lyon1.fr Philipp Mayr GESIS – Leibniz-Institute for the Social Sciences, Cologne, Germany Philipp.Mayr@gesis.org 1 Introduction The Open Access movement in scientific publishing and search engines like Google Scholar have made scientific articles more broadly accessible. During the last decade, the availability of scientific papers in full text has become more and more widespread thanks to the growing number of publications on online platforms such as ArXiv, CiteSeer and Public Library of Science (PLOS). In this context, new needs arise around the processing and efficient exploitation of scientific corpora. Scientific papers are highly structured texts and display specific properties related to their references but also argumentative and rhetorical structure. Recent research in this field has concentrated on the construction of ontologies for citations and scientific articles (e.g. FaBiO and CiTO [8]) and studies of the distribution of references (see [2]). However, up to now full-text mining efforts are rarely used to provide data for bibliometric analyses. While bibliometrics traditionally relies on the analysis of metadata of scientific papers (see e.g. a recent special issue on ”Combining Bibliometrics and Information Retrieval”, Mayr & Scharnhorst [6]), we will explore the ways full-text processing of scientific papers and linguistic analyses can play. The CLBib workshop series provides a forum to discuss novel approaches and insights into scientific writing that can bring new perspectives to understand both the nature of citations and the nature of scientific arti- cles. The possibility to enrich metadata by the full-text processing of papers offers new fields of application to bibliometrics studies. 2 Scope and Motivation The CLBib workshops aim to bring together researchers in bibliometrics and computational linguistics in order to study the ways bibliometrics can benefit from large-scale text analytics and sense mining of scientific papers, thus exploring the interdisciplinarity of Bibliometrics and Natural Language Processing. Working with full text allows us to go beyond metadata used in bibliometrics. Full text offers a new field of investigation, where the major problems arise around the organization and structure of text, the extraction of information and its representation on the level of metadata. Furthermore, the study of contexts around in-text citations offers new perspectives related to the semantic dimension of citations. The analyses of citation contexts and the semantic Copyright c by the paper’s authors. Copying permitted for private and academic purposes. In: Iana Atanassova, Marc Bertin and Philipp Mayr (eds.): Proceedings of the CLBib2017 Workshop, Wuhan, China, 17-oct-2017, published at http://ceur-ws.org categorization of publications will allow us to rethink co-citation networks, bibliographic coupling and other bibliometric techniques. The first edition of this workshop1 , co-located with the International Society of Scientometrics and Informetrics Conference (ISSI) in 2015, attracted more than 70 participants and six full paper contributions, showing a large interest in these topics in the community. From a technical point of view, during the first edition of the workshop, the efforts to provide articles in machine-readable formats and the rise of Open Access publishing have resulted in a number of standardized formats for scientific papers, full-text datasets for research experiments and corpora and focus on number of open source tools for versatile text processing. The goal of this second edition of the CLBib workshop, co-located with the ISSI conference 2017, is to continue to encourage the collaboration between these two domains and to answer questions like: How can we enhance author network analysis and Bibliometrics using data obtained by text analytics? What insights can NLP provide on the structure of scientific writing, on citation networks, and on in-text citation analysis? Natural Language Processing and Bibliometrics meet again in this second workshop in a context where Open Access is at the heart of exchanges between scientists and publishers and raises many economic and ethical issues, but also new research problems through the access to articles in full text. Indeed, the possibility of enriching metadata currently used in bibliometrics with information from the text is an essential step towards building the tools of tomorrow. As the CLBib 2017 workshop was held in China, at Wuhan University, the discussions raised important questions not only around the processing of scientific papers but also on the need to take into account the multilingual aspect of the scientific production. Even if today English is essential on the international stage, national level publications can also be rich in information and relevant for bibliometric studies. The linguistic aspect, which is more and more present at the ISSI conference, must be taken into consideration and highlights the importance of this workshop series and the growing interest in the community of bibliometricians but also in other communities for Natural Language Processing. 3 Overview of the papers The call for papers2 attracted several submissions, of which 50% were accepted for publication. The workshop featured an introduction and four paper presentations. All papers are included in the current proceedings. We shortly summarize each workshop paper below. The publications selected for this workshop CLBib2017 concern both, theory and application aspects of bibliometric. For this second edition of the CLBib workshop, the authors used methods that come from various fields, such as Information Retrieval with traditional measures (e.g. tf-idf, vector models, . . . ), network analysis and visualisation with Ucinet, Netdraw, CiteSpace (see [3]), and also NLP with word embedding techniques and co-words analysis. The paper ”Understanding the Changing Roles of Scientific Publications via Citation Embeddings” by Jian- gen He and Chaomei Chen [5] describes an approach which helps to understand the changing and complex role of a publication characterized by its citation contexts. The authors propose a temporal representation of in-text citations of publications as a sequence of vectors and they apply their method in the biomedical domain. These in-text citations represent the changing role of publications in a community. The authors end with an interpretatation of the proposed methods on the basis of one PubMed article from 2006. In the paper ”CitationAS: A Summary Generation Tool Based on Clustering of Retrieved Citation Content”, Jie Wang, Shutian Ma and Chengzhi Zhang [9] focus on an automatic summary generation tool CitationAS that uses citation sentences to construct summaries. The authors build a new application which can automatically generate a summary on a given topic by optimizing the search results using clustering engine, named Carrot2, in three stages: similar cluster label merging, important sentences extraction and summary generation. The paper ”Temporal Evolution, Research Themes, and Emerging Trends in Case-Based Reasoning Litera- ture” by Dongxiao Gu, Bo Liu, Isabelle Bichindaritz and Changyong Liang [4] proposes a study on the temporal evolution and emerging trends in a specific scientific field which is case-based reasoning. They analyze a dataset of 4460 papers published from 2000 to 2015. The authors study the temporal distribution of papers, the reference and journal co-citation network, and also the co-occurrence of keywords. The methodology used in this paper to provides an extensive study of a scientific field can further be applied to other fields. The discovery of potential collaborations is in the focus of the last paper, ”Mining the Potential Collaborative Relationships Based on the Author Keyword Coupling Analysis and Social Network Analysis” by Yufang Peng, 1 See the proceedings of the first edition of the workshop: http://ceur-ws.org/Vol-1384/, [1]. 2 https://easychair.org/cfp/CLBib2017 Gu Dongxiao and Shi Jin [7]. Among the methods that are used are social network analysis, keyword and co- word analysis and clustering. Considering the hypothesis that authors that work on similar topics and keywords could potentially be contributors, this paper provides a method for author similarity analysis. 4 Outlook The interest for this interdisciplinary research has been growing during the last years (see e.g. the workshops of BIRNDL - ”Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries” and WoSP - ”Workshop on Mining Scientific Publications”) and the series of CLBib workshops up to now have shown that both fields of Natural Language Processing and Bibliometrics can benefit from addressing the problem of the full text processing of papers. As a result of this workshop series, a new Research Topic ”Mining Scientific Papers: NLP-enhanced Biblio- metrics”3 has been launched as part of the ”Frontiers in Research Metrics and Analytics” journal published in Open Access. We intend to continue the effort to bring both communities together and foster the development of semantic technologies dedicated to Bibliometrics and Scientometrics. 4.0.1 Acknowledgements Part of this research has been funded by the FEDER (Fonds européen de développement régional) and selected by the French-Swiss programme Interreg V: Webso+ project4 . References [1] Atanassova, I., Bertin, M., Mayr, P.: Editorial for the first workshop on mining scientific papers: Com- putational linguistics and bibliometrics. In: Proceedings of the First Workshop on Mining Scientific Pa- pers: Computational Linguistics and Bibliometrics co-located with 15th International Society of Scien- tometrics and Informetrics Conference (ISSI 2015), Istanbul, Turkey, June 29, 2015. pp. 1–4 (2015), http://ceur-ws.org/Vol-1384/editorial.pdf [2] Bertin, M., Atanassova, I., Larivière, V., Gingras, Y.: The invariant distribution of references in scientific articles. Journal of the Association for Information Science and Technology (JASIST) 67(1), 164–177 (2016), http://dx.doi.org/10.1002/asi.23367 [3] Chen, C.: Citespace ii: Detecting and visualizing emerging trends and transient patterns in scientific lit- erature. Journal of the American Society for Information Science and Technology 57(3), 359–377 (2006), http://dx.doi.org/10.1002/asi.20317 [4] Gu, D., Liu, B., Bichindaritz, I., Liang, C.: Temporal evolution, research themes, and emerging trends in case-based reasoning literature. In: Atanassova, I., Bertin, M., Mayr, P. (eds.) 2nd Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics collocated with 16th International Conference on Scientometrics and Informetrics (ISSI 2017). CEUR-WS.org (2017) [5] He, J., Chen, C.: Understanding the changing roles of scientific publications via citation embeddings. In: Atanassova, I., Bertin, M., Mayr, P. (eds.) 2nd Workshop on Mining Scientific Papers: Computational Lin- guistics and Bibliometrics collocated with 16th International Conference on Scientometrics and Informetrics (ISSI 2017). CEUR-WS.org (2017) [6] Mayr, P., Scharnhorst, A.: Combining bibliometrics and information retrieval: preface. Scientometrics 102(3), 2191–2192 (Mar 2015), https://doi.org/10.1007/s11192-015-1529-2 [7] Peng, Y., Gu, D., Jin, S.: Mining the potential collaborative relationships based on the author keyword coupling analysis and social network analysis. In: Atanassova, I., Bertin, M., Mayr, P. (eds.) 2nd Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics collocated with 16th International Conference on Scientometrics and Informetrics (ISSI 2017). CEUR-WS.org (2017) [8] Shotton, D.: Cito, the citation typing ontology. Journal of Biomedical Semantics 1(1), S6 (Jun 2010), https://doi.org/10.1186/2041-1480-1-S1-S6 3 https://www.frontiersin.org/research-topics/7043/mining-scientific-papers-nlp-enhanced-bibliometrics 4 http://tesniere.univ-fcomte.fr/projet-webso/ [9] Wang, J., Ma, S., Zhang, C.: Citationas: A summary generation tool based on clustering of retrieved citation content. In: Atanassova, I., Bertin, M., Mayr, P. (eds.) 2nd Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics collocated with 16th International Conference on Scientometrics and Informetrics (ISSI 2017). CEUR-WS.org (2017)