EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Preface to the 1st Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents at JCDL 2020

Chengzhi Zhang (1), Philipp Mayr (2), Wei Lu (3), Yi Zhang (4)

1. Nanjing University of Science and Technology, Nanjing, China, zhangcz@njust.edu.cn
2. GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany, philipp.mayr@gesis.org
3. Wuhan University, Wuhan, China, weilu@whu.edu.cn
4. University of Technology Sydney, Sydney, Australia, Yi.Zhang@uts.edu.au

1. Introduction

The 1st Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2020) was held at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) on August 1, 2020. The goal of this workshop is to engage the related communities in open problems in the extraction and evaluation of knowledge entities from scientific documents. Participants are encouraged to identify knowledge entities, explore the features of various entities, analyze the relationships between entities, and construct extraction platforms or knowledge bases. Results of this workshop are expected to provide scholars, especially early career researchers, with knowledge recommendations and other knowledge entity-based services [1].

2. Overview of the papers

This year, 14 papers (3 long papers, 6 short papers, 4 posters and 1 demo) were accepted for presentation and inclusion in the proceedings. In addition, the workshop featured two keynote talks in different EEKE-related fields. All workshop contributions are documented on the workshop website (footnote 1). The following section briefly lists the various contributions.

2.1 Keynotes

Two keynotes were presented at EEKE2020. The first one was given by Min Song: Entitymetrics 2.0: Measuring the Impact of Entities and Relations Extracted from Scientific Documents.
Since the concept was first introduced in 2013, entitymetrics [2] has been applied to measure the impact of entities and to gauge the knowledge usage and transfer anchored on entities for knowledge discovery. In this talk, previous studies employing entitymetrics are summarized and the limitations of current approaches are discussed. In addition, future directions for entitymetrics are suggested.

1 https://eeke2020.github.io/
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The second keynote was given by Markus Stocker: Building Scholarly Knowledge Bases with Crowdsourcing and Text Mining. Building on the Open Research Knowledge Graph (http://orkg.org) as a concrete research infrastructure, Dr. Stocker presented how humans and machines can collaboratively build scholarly knowledge bases using crowdsourcing and text mining. He discussed some key challenges that human and technical infrastructures face, as well as the possibilities that scholarly knowledge bases enable.

2.2 Research papers, Posters and Demo

The following papers were presented in 5 sessions.

Session 1: Knowledge Entity Extraction and Application

-Jennifer D'Souza and Sören Auer
NLPContributions: An Annotation Scheme for Machine Reading of Scholarly Contributions in Natural Language Processing Literature
This paper describes an annotation initiative to capture the scholarly contributions in natural language processing (NLP) articles, particularly those that discuss machine learning (ML) approaches for various information extraction tasks.
The authors attempt to find a systematic set of patterns of subject-predicate-object statements for the semantic structuring of scholarly contributions, and to apply the discovered patterns in the creation of a larger annotated dataset for training machine readers of research contributions.

-Mengjia Wu and Yi Zhang
Intelligent Bibliometrics for Discovering the Associations between Genes and Diseases: Methodology and Case study
This paper proposes an adaptable and transferable methodology to extract biomedical entities, including diseases, chemicals, genes and genetic variations, from literature data. A heterogeneous co-occurrence network is constructed and a semantic adjacency matrix is generated to identify key genes and genetic variants, and to capture emerging disease-gene associations via a link prediction approach.

Session 2: Entity Extraction from Scientific Documents

-Liangping Ding, Zhixiong Zhang, Huan Liu, Jie Li and Gaihong Yu
Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling
In this paper, the authors regard automatic keyphrase extraction from Chinese text as a character-level sequence labeling task. Unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory network (BiLSTM) and BiLSTM-CRF, are used to extract keyphrases from academic papers in the medical domain. The character-level sequence labeling model based on BERT obtains the best result.

-Jin Mao, Shiyun Wang and Xianli Shang
Investigating interdisciplinary knowledge flow through citances
This study investigates the content of knowledge flow towards an interdisciplinary field by analyzing the citation sentences (i.e., citances) in articles from the eHealth field.
The associated knowledge phrases between citances and their references are identified and categorized to analyze the content and categories of knowledge spread from the source disciplines to the field. In general, this study contributes to the understanding of the content characteristics of interdisciplinary knowledge integration.

-Yu Li, Tao Yue (Speaker) and Wu Zhenxin
IEKM-MD: An Intelligent Platform for Information Extraction and Knowledge Mining in Multi-Domains
This paper constructs a platform for information extraction and knowledge mining, namely IEKM-MD. Two innovative technologies are proposed. First, a phrase-level scientific entity extraction model combining a neural network and active learning is designed to reduce the model’s dependence on large-scale corpora. Second, a translation-based relation prediction model is provided, which improves the relation embedding by optimizing the loss function. In addition, the platform integrates an advanced entity recognition model and a keyword extraction model, and provides abundant services for fine-grained and multi-dimensional knowledge.

-Liang Chen, Shuo Xu, Weijiao Shang, Zheng Wang, Chao Wei and Haiyun Xu
What is Special about Patent Information Extraction?
This article explores what is particular about patent information extraction, in order to point out directions for further research. More specifically, the authors discuss: (1) What is special about labeled patent datasets? (2) What is special about word embedding in patent information extraction? (3) What kind of method is more suitable for patent information extraction?

Session 3: Interactive demos

-Zi Xiong, Yue Qi, Wei Lu and Qikai Cheng
Design and Implementation of an Academic Search System Based on a General Query Language and Automatic Question Answering
This research designs and implements an academic search system with two major innovations: 1) a general query language, SSL, that describes academic search intentions in a unified and standardized way, and 2) a user intention recognition method that helps improve traditional automatic question-answering systems. The SSL language and the intention recognition method are applied to a QA-oriented academic search system, which is innovative compared with traditional query-based systems.

Session 4: Entity Relation Extraction and Application

-Xin An, Jinghong Li, Shuo Xu, Liang Chen and Sainan Pi
A Novel Approach for Patent Similarity Measurement Based on Sequence Alignment
To measure the similarity among different patents, this study proposes a novel approach based on sequence alignment. The method takes the semantic direction of each sequence structure and the word order information of each component into consideration, and an algorithm for calculating the global importance of each sequence structure is put forward. Extensive experimental results show that the proposed approach is significantly more accurate and is not sensitive to several core parameters.

-Fang Tan, Siting Yang, Xiaoyan Wu and Jian Xu
Exploring the Relation between Biomedical Entities and Government Funding
To analyze the effect of government funding on the promotion of scientific research, and to help the government manage research funds more rationally, this study proposes a framework for analyzing the relationship between entities in the field of medicine and funding.
The results reveal that the field of genetic research is in a period of rapid development and that disease research continues to attract NIH’s attention. However, the stimulating effect of government funding on research popularity is decreasing.

-Sahand Vahidnia, Alireza Abbasi and Hussein A. Abbass
Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping
In this study, a method is proposed to extract research trends and their temporal evolution across discrete time periods. Adapting contextualized word embedding techniques, the method uses published academic documents as knowledge units and clusters them into groups. Various labeling techniques are explored to evaluate the quality of the clusters and explore their explainability. The results show that using neural embeddings in conjunction with paragraph-term weights provides a simple and reliable paragraph embedding that can be used for clustering textual data.

Session 5: Poster / Greeting Notes of EEKE2020

-Qikai Liu, Pengcheng Li, Wei Lu and Qikai Cheng
Long-tail dataset entity recognition based on Data Augmentation
Datasets play an important role in data-driven scientific research. It is important to recognize dataset entities correctly, especially unusual long-tail dataset entities. However, it is very difficult to obtain a high-quality training corpus for named entity recognition. This paper obtains its data through a distant supervision method along with two data augmentation methods. A BERT-BiLSTM-CRF model is used to predict long-tail dataset entities.

-Xiaole Li, Yuzhuo Wang
Assessing Impact of Method Entities in a Special Task
Methods play an important role in research. Identifying and analyzing entities about research methods can help scholars understand the methods used in their field and improve the efficiency of scientific research.
This paper takes named entity recognition (NER) as an example and evaluates the impact of method entities in this domain. The authors found that conditional random field (CRF) is the most influential algorithm in NER. Deep learning algorithms have developed rapidly in the past 5 years. F-measure, precision and recall are the most widely used indices and measurements. Scholars do not pay enough attention to the use of tools, and they prefer to use classic datasets.

-Chong Chen, Jingying Zhang, Xiaoyu Chu and Jinglin Zheng
Study on the Difference between Summary Peer Reviews and Abstracts of Scientific Papers
This article proposes preliminary measures to compare summary peer reviews with abstracts in terms of readability and semantic function types. The results show that summary peer reviews highlight some distinct function types, and that the terminology in peer reviews is not as dense as in abstracts. Summary peer reviews can complement abstracts in literature searches and can help readers understand papers more thoroughly.

-Wei Shao, Hua Bolin, Qiang Ma, Jiaying Liu, Hongwei He, Keqi Chen
An Unsupervised Method for Terminology Extraction from Scientific Text
Finding new terminology is a kind of named entity recognition (NER) problem. However, many high-performance methods need labeled data. This paper proposes an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. The authors initialize a few patterns to extract terminologies from matching sentences, and then use the POS sequences of the extracted terminologies to find the same POS sequences in sentences that the initial patterns did not match. New patterns and more terminologies are obtained after several iterations.

3. Outlook and further reading

The EEKE2020 organizers are currently editing the following two special issues:
-Special Issue on “Extraction and Evaluation of Knowledge Entities from Scientific Documents” in Journal of Data and Information Science (https://mc03.manuscriptcentral.com/jdis).
-Special Issue on “Scientific Documents Mining and Applications” in Data and Information Management (https://www.editorialmanager.com/dim/default.aspx).

References

[1] Chengzhi Zhang, Philipp Mayr, Wei Lu, Yi Zhang. (2020). Extraction and Evaluation of Knowledge Entities from Scientific Documents: EEKE2020. In: Proceedings of the 20th ACM/IEEE Joint Conference on Digital Libraries (JCDL2020), Wuhan, China, 2020. https://doi.org/10.1145/3383583.3398504
[2] Ying Ding, Min Song, Jia Han, Qi Yu, Erjia Yan, Lili Lin, Tamy Chambers. (2013). Entitymetrics: measuring the impact of entities. PLoS ONE, 8(8), e71416.