EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Extracting Domain Entities from Scientific Papers Leveraging Author Keywords

Jiabin Peng, Jing Chen, Guo Chen
School of Economics & Management, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
2542505085@qq.com, chenjinguuu@126.com, delphi1987@qq.com

ABSTRACT
Current methods of domain entity extraction from scientific texts rely heavily on manually annotated corpora and therefore generalize poorly. In this paper, we propose a two-stage methodology that makes good use of the existing author keywords of a given domain to solve this problem. First, the author keyword set is used to mark the boundaries of candidate entities; their features are then integrated to classify each entity's type. In an experiment on artificial intelligence (AI) documents from WOS, our approach obtains an F1 value of 0.753 without manual annotation, only slightly lower than a BERT-BiLSTM-CRF baseline model (F1 = 0.772) trained on a manually annotated corpus, which shows the practical usability of our approach.

CCS CONCEPTS
• Computing methodologies • Artificial intelligence • Natural language processing • Information extraction

KEYWORDS
Information Extraction, Domain Named Entity Recognition, Analysis of Scientific Papers, Author Keywords

1 Introduction
At present, there have been many studies on knowledge entity extraction from scientific papers, and the biggest problem is the lack of labeled data[1]. Scientific papers usually belong to a specific domain, so manual annotation requires corresponding domain knowledge. This makes annotation expensive, and many popular named entity recognition (NER) models cannot reach their otherwise excellent performance. To ensure the generalization ability of NER models, it is necessary to reduce their dependence on manual annotation. Thanks to the rapid development of databases and the Internet, a large number of knowledge resources have accumulated in many domains, such as knowledge bases, gazetteers, glossaries, and dictionaries. These resources are widely used in NER models based on distant supervision[2] or semi-supervised learning[3], which reduces the models' dependence on labeled data and improves their generalization ability to a certain extent.

Domain entity extraction can in fact be divided into two subtasks: entity boundary recognition and entity type classification. Taking the domain of artificial intelligence (AI) as an example, we first used a domain glossary to help identify entity boundaries and then constructed low-cost training data to classify the entities. Problems and solutions are viewed as the key insights of scientific papers[1], so we took them as the main entity types in the experiment. Following related studies, we treat the research objectives, domains, applications, and tasks described in technical papers as problems, and the methods, schemes, models, technologies, tools, software, algorithms, and theories used to solve them as solutions[4][5][6]. The experimental results show that our methodology achieves a good score without any manual annotation, with an F1-measure of 0.753.

2 Related Studies
At present, the mainstream methods of domain NER fall into two categories: methods based on statistical machine learning (ML) and methods based on deep learning (DL). ML-based NER is essentially classification: given multiple types of named entities, models are used to classify the entities in a text. There are two implementation ideas. One is to first identify the boundaries of all named entities in the text and then classify them into different types, as in CoBoost[7]. The other is sequence annotation, where each word in the text is given several candidate type labels corresponding to its position in various entities.
Classical ML-based sequence-annotation NER models include HMM[8], CRF[9], etc. DL-based NER models use pre-trained word vectors to represent words, which alleviates the data sparsity problem in high-dimensional vector spaces. Moreover, pre-trained word vectors contain more semantic information than manually selected features and provide a unified vector-space representation of heterogeneous texts, which has strongly advanced sequence annotation tasks, especially NER[10].

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The biggest problem of domain NER today is the lack of labeled corpora. When a general NER method is applied to a specific domain, corresponding adjustment strategies need to be adopted for the domain corpus. A common idea is to use transfer learning to share data and models among domains. Ni et al. projected labeled data and distributed word representations into the target domain without manual annotation[11]. Giorgi et al. transferred source-domain model parameters to the target domain for initialization and then fine-tuned them for the task[12]. Another idea is to make full use of existing domain knowledge resources to automatically build datasets and carry out distant supervision, semi-supervision, weak supervision, etc. Nooralahzadeh et al. adopted a partial-annotation technique and implemented a reinforcement learning strategy with a neural network policy for distantly supervised NER[2]. Peters et al. demonstrated a general semi-supervised approach that adds pre-trained context embeddings from bidirectional language models to NLP systems and applied it to NER[3]. Lison et al. relied on a broad spectrum of labeling functions to automatically annotate texts from the target domain[13].

From the above research, it can be seen that various domain resources are widely used to reduce the manual annotation cost as much as possible, with good results. However, domain NER models based on transfer learning, semi-supervision, etc. still cannot avoid manual participation in dataset construction. Therefore, after analyzing the essence of the NER task, we divide domain NER into two subtasks in this paper, which avoids manual annotation with the help of domain resources. In addition, new ideas such as zero-shot learning[14] and learning with noisy labels[15] have also been applied to domain NER to further reduce labor costs.

3 Methodology

3.1 Framework
Traditional NER is treated as a sequence labeling task that assigns a corresponding entity type and position label to each token in the text. In fact, NER can be regarded as two subtasks: boundary recognition and entity classification. That is, we can first identify the boundaries of named entities in the text and then classify them into different types. Sequence-labeling NER treats the two subtasks as a whole, so the same labeled data is shared by both subtasks and the quality requirements on that data are high. As a result, many classic NER methods cannot be applied in some subdivided domains. In addition, sequence-labeling NER cannot effectively exploit existing domain resources; the common practice is merely to use domain terms as auxiliary data for rough labeling. By contrast, dividing NER into boundary recognition and entity classification lets us make full use of existing domain knowledge resources.

Entity boundary recognition can be regarded as a word segmentation task, which requires a large-scale resource (i.e., a user-defined lexicon). In a given domain there usually exist domain glossaries and a large-scale author keyword set, which can help solve this segmentation task. Compared with word segmentation, entity classification requires a smaller-scale resource (i.e., training data). At present, much domain knowledge can easily be obtained through online databases or knowledge graphs, which can provide the necessary training data for entity type classification without manual annotation. To sum up, the framework of this paper is shown in Figure 1.

[Figure 1: Framework of knowledge entity extraction. Domain resources (glossary, author keywords, documents/abstracts) feed entity boundary recognition (lexicon, segmentation) and entity classification (training data, model, and features: word vector, part of speech, word case), followed by evaluation and optimization.]
The framework is divided into three parts. The first is the acquisition of domain resources. The domain resources used in this paper include a domain glossary and domain documents. The domain glossary can be obtained directly through a search engine (such as Google) or from relevant domain knowledge websites (such as Wikipedia, Baidu Baike, etc.); when constructing the glossary, we also obtained the types of the terms. Domain documents can be obtained through databases (such as WOS, CNKI, etc.), from which author keywords and abstracts are extracted. Author keywords are the indispensable large-scale resource for entity boundary recognition, and abstracts can be used to construct features for the training data. The second part is entity boundary recognition, which is regarded as a word segmentation task. The user-defined lexicon for this task is constructed by combining the domain glossary and the author keyword set, and it helps realize entity boundary recognition at a low cost. The third part is entity classification. The training data required for classification are extracted from the domain glossary, and the text features are obtained from the abstracts by training or counting.

3.2 Implementation
3.2.1 Entity Boundary Recognition. As mentioned above, entity boundary recognition is transformed into word segmentation. Inspired by Chinese automatic word segmentation, we use the forward maximum matching algorithm based on string matching. Slightly differently from Chinese word segmentation, English words must be stemmed before segmentation to avoid the influence of word forms. In addition, the segmentation lexicon introduces some noise when labeling candidate entities, so the subsequent entity classification task is actually multi-class.
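The forward maximum matching step above can be sketched in a few lines. This is a minimal illustration, not the authors' exact implementation: the toy suffix-stripping stemmer, the example lexicon, and the phrase-length cap are all assumptions.

```python
# Forward maximum matching over stemmed tokens: at each position, try the
# longest lexicon phrase first, fall back to a single token otherwise.
# A toy suffix-stripping stemmer stands in for a real one (e.g. Porter).

def stem(word):
    # Hypothetical minimal stemmer: lowercases and strips a plural "s".
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def forward_max_match(tokens, lexicon, max_len=5):
    """Greedily segment `tokens`, matching stemmed n-grams against `lexicon`."""
    stems = [stem(t) for t in tokens]
    lex = {tuple(stem(w) for w in phrase.split()) for phrase in lexicon}
    out, i = [], 0
    while i < len(stems):
        for n in range(min(max_len, len(stems) - i), 0, -1):
            if n > 1 and tuple(stems[i:i + n]) in lex:
                out.append("_".join(tokens[i:i + n]))  # phrase = candidate entity
                i += n
                break
        else:
            out.append(tokens[i])  # no phrase match: emit the single token
            i += 1
    return out

segments = forward_max_match(
    ["Convolutional", "neural", "networks", "classify", "sentences"],
    {"convolutional neural network"},
)
```

Stemming before matching is what lets the surface form "networks" match the lexicon entry "network", as the paper requires for English.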
3.2.2 Entity Classification. Entity classification is essentially word (phrase) classification, a typical supervised task for which training data is indispensable. At present it is difficult to construct a large amount of high-quality classification data in a given domain, but a small amount of high-quality data can usually be obtained at low cost with the help of domain knowledge bases or domain experts.
1) Construct training data. The training data consist of positive and negative samples. Positive samples consist of entities and their types; since the pre-constructed glossary contains the types of terms, we directly extracted some high-quality terms together with their types as positive samples. Negative samples, i.e., non-entities, were randomly extracted from the keyword set and the texts.
2) Construct text features. Word vectors, a part-of-speech (POS) feature, and a word case feature were constructed. Word vectors are obtained by training on a large-scale unlabeled domain corpus and give discrete words semantic information according to their context. The POS feature is obtained by counting over the corpus without word segmentation. The case feature is acquired in essentially the same way as the POS feature, but uses the segmented corpus, and the case categories need to be defined manually.
3) Model selection, training, evaluation, and optimization. The models we used include four classical machine learning models: Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), plus TextCNN, which performs well in sentence classification[16]. The detailed steps of our experiment were: ① feed the training data to the models to obtain baseline results; ② optimize the word vectors according to model performance and add features during training; ③ evaluate the models to decide whether to continue optimizing.

3.3 Feature Processing
3.3.1 Word Vector. Common models for training word vectors include Word2Vec, GloVe, ELMo, GPT, BERT, etc. The word vectors trained by the first two are context-independent. Because our core task is phrase classification, which does not need context information, we chose Word2Vec. Before training, we used underscores to concatenate the words of each phrase in the segmented corpus, so that phrases are treated as single tokens when training word vectors. Word2Vec includes two algorithms: Skip-gram and CBOW. Research shows that Skip-gram captures more semantic information, while CBOW captures more grammatical information[17]. The window size is also very important for training word vectors, and the commonly used window sizes are 5 and 10 (see https://www.bbsmax.com/A/A2dm2D7zen/). We therefore first explore these two factors in the classification experiment. Following related studies[18], the other parameters are shown in Table 1. In addition, to make the word vectors more robust, we trained Word2Vec on the stemmed corpus.

Parameters          Values
sg                  1 / 0
window size (w)     5 / 10
min count           5
iteration number    20
embedding size      200

Table 1 Parameters of Word2Vec. sg=1 means the algorithm is Skip-gram, sg=0 means CBOW; w denotes the window size below.
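The underscore-joining trick described in 3.3.1 can be sketched as below. The helper name and the toy corpus are illustrative; the gensim call in the trailing comment mirrors the Table 1 parameters but is not executed here.

```python
# Rewrite a tokenized corpus so that every lexicon phrase becomes a single
# underscore-joined token; Word2Vec then learns one vector per phrase.

def join_phrases(sentence, lexicon, max_len=5):
    """Replace multi-word lexicon phrases in `sentence` with w1_w2_... tokens."""
    lex = {tuple(p.split()) for p in lexicon}
    out, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if tuple(sentence[i:i + n]) in lex:
                out.append("_".join(sentence[i:i + n]))
                i += n
                break
        else:
            out.append(sentence[i])
            i += 1
    return out

corpus = [join_phrases(["we", "train", "a", "hidden", "markov", "model"],
                       {"hidden markov model"})]
# from gensim.models import Word2Vec            # parameters as in Table 1
# model = Word2Vec(corpus, vector_size=200, sg=1, window=5,
#                  min_count=5, epochs=20)
```

After this rewrite, "hidden_markov_model" is a single vocabulary item, so the trained model yields one vector for the whole phrase rather than three word vectors.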
3.3.2 POS Feature. The POS tags of words in sentences can be obtained through the Python third-party package nltk, which distinguishes 36 POS tags, so the POS vector of a single word has length 36. In the classification experiment, the POS vector of a training item was obtained by concatenating the POS vectors of its component words. To avoid inconsistent vector lengths, we counted the lengths of the phrases (the number of words per phrase) in the segmentation lexicon to obtain the maximum phrase length. When a training item is shorter than the maximum phrase length, its POS vector is padded with zeros. The final length of each POS vector is therefore 36 * max(len(phrase in lexicon)), i.e., 36 times the maximum phrase length in the lexicon. POS vectors were used by concatenating them with the word vectors.
3.3.3 Case Feature. Three phrase case types are defined in this paper: initial uppercase, all uppercase, and all lowercase. The case vector of a training item therefore has length 3. As with the POS feature, case vectors were concatenated with the word vectors.
3.4 Classification Models
The classification models used in this paper are RF, KNN, SVM, MLP, and TextCNN. The first four were implemented with sklearn in Python. In the RF, the number of decision trees was set to 100. All KNN parameters took their default values. In the SVM, probability was set to True, i.e., probability estimation was enabled. In the MLP, the hidden layer sizes were (100, 50). TextCNN was originally designed to classify sentences, and phrases can be regarded as shorter sentences. The inputs of TextCNN (https://github.com/cjymz886/text-cnn) were the vectors generated by Word2Vec; its embedding size, sequence length, batch size, and number of training epochs were set to 200, 10, 32, and 20, respectively. Parameters not mentioned above took their default values.
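The sklearn setup above can be sketched on a toy feature matrix. Only the stated hyperparameters (100 trees, SVM probability=True, MLP hidden layers (100, 50)) come from the paper; the synthetic data, labels, and the soft-voting ensemble construction are assumptions for illustration.

```python
# Sketch of the 3.4 model setup plus a voting ensemble on synthetic data.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]] * 3   # toy feature vectors
y = ["non-entity", "non-entity", "solution", "solution"] * 3

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(),                      # all default parameters
    "SVM": SVC(probability=True, random_state=0),       # enables predict_proba
    "MLP": MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                         random_state=0),
}
# Soft voting averages the per-class probabilities of the four classifiers,
# a plausible reading of the "Voting" row in Table 3.
voter = VotingClassifier([(n, m) for n, m in models.items()], voting="soft")
voter.fit(X, y)

predictions = {name: m.fit(X, y).predict([[0.95, 0.95]])[0]
               for name, m in models.items()}
```

In the paper the input vectors would be the concatenated word, POS, and case features rather than this two-dimensional toy matrix.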
4 Experiment
To verify the effectiveness of our methodology, we took the AI domain as an example.

4.1 Data Acquisition and Preprocessing
First, we obtained the bibliographic data of the AI domain from the AI category of the WOS (Web of Science) core collection. Documents were retrieved with WC = computer science and WC = artificial intelligence, with the time range set from 1996 to 2020. Abstracts and keywords were then extracted from the bibliographic data, yielding 927,675 abstracts and 161,169 keywords.
Second, we constructed a glossary of the AI domain. The data came from a knowledge website in the AI domain (https://paperswithcode.com/), from which we obtained all problem and solution entities. Problem entities came from the tasks on the Browse State-of-the-Art page, and solution entities from the machine learning components on the Methods page. After removing duplicates, 1887 problem entities and 1209 solution entities remained.
Finally, we processed the above data to obtain the final experimental data. The user-defined lexicon for English word segmentation was constructed by merging the keyword set and the domain glossary. The training data for the classifiers consisted of entities and non-entities: 360 entities of each type were manually extracted from the glossary, and 360 non-entities were manually constructed. Non-entities included phrases and words, with phrases extracted from high-frequency keywords and words constructed randomly. The ratio of phrases to words among non-entities was about 2:1, because almost all entities are phrases, so more phrase-level non-entities were needed to train the models. In total, 1080 classification examples were obtained and randomly split into training and validation sets at a ratio of 5:1. In addition, to evaluate the performance of our methodology, we set the traditional BERT-BiLSTM-CRF NER model as the baseline. A previously annotated corpus of 3000 sentences was used for the baseline model, of which 2000 randomly selected sentences served as training data and the remaining 1000 sentences were used as the common test set.

4.2 Result Analysis
The macro averages of precision, recall, and F1-measure were used to evaluate the models. Word vectors are the basic input of the models, so we first explored the influence of word vectors trained by the two algorithms with different window sizes; the results are shown in Table 2.

          sg=1            sg=0
          w=5     w=10    w=5     w=10
RF        0.672   0.681   0.666   0.655
KNN       0.550   0.672   0.151   0.146
SVM       0.736   0.701   0.709   0.679
MLP       0.672   0.685   0.428   0.478
TextCNN   0.695   0.685   0.670   0.670

Table 2 Macro F1-measure of models using different word vectors on the test set.

Comparing the results in Table 2 vertically: when sg=1, the five models achieved good results at both window sizes, and only KNN with w=5 performed poorly. When sg=0, however, the performance of all models decreased, especially KNN and MLP. A possible reason is that Skip-gram focuses on semantics, which is more conducive to the NER task than CBOW; in addition, KNN and MLP place higher demands on data quality, which word vectors trained by CBOW could not meet. Comparing the results horizontally, the best F1-measure of each model was about 0.7. In the following experiments, the word vectors that made each model perform best were used, and the POS and case features were added. The results are shown in Table 3.

Table 3 shows that adding the two features effectively improved the F1-measures, and the best results for all models were obtained when all features were fused. Among the five models, SVM performed best, with an F1-measure of 0.753; this may be because the underlying training mechanism of SVM makes it more suitable for small-sample classification. With word vector parameters sg=1 and w=10, the voting model performed best, with an F1-measure of 0.752. Either SVM or the voting model can be used in practical applications. The best F1-measure of TextCNN was 0.715, far from its performance in sentence classification; one possible reason is that phrases are much shorter than sentences.

The baseline BERT-BiLSTM-CRF (https://github.com/macanv/BERT-BiLSTM-CRF-NER) performed well on the domain NER task. Its F1-measure was 0.772, far below its performance on general NER tasks but a very good result for a subdivided domain, and 0.019 higher than our optimal model. There is thus still a gap in our methodology, but considering the cost of the experimental data, the gap is acceptable. In future work, we can further optimize the word vectors and add more features to improve performance.
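The macro-averaged metrics used throughout 4.2 can be computed as below; the label names are illustrative, and ties to the paper's exact scoring script are not implied.

```python
# Macro-averaged precision, recall, and F1: per-class scores are computed
# independently and averaged with equal weight per class.

def macro_prf(gold, pred):
    labels = sorted(set(gold) | set(pred))
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["problem", "solution", "solution", "non-entity"]
pred = ["problem", "solution", "non-entity", "non-entity"]
p, r, f1 = macro_prf(gold, pred)
```

Macro averaging treats the problem, solution, and non-entity classes equally regardless of how many examples each has, which matters here because the classes are balanced by construction (360 each).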
5 Conclusion
Aiming at the problem that current domain NER models rely heavily on manually annotated data and therefore generalize poorly across domains, we propose a two-stage knowledge entity extraction methodology that removes the dependence on manually annotated data. Experiments on WOS documents in the AI domain showed that our approach achieves good results in extracting problem and solution entities without manual annotation.
In general, our approach generalizes well across domains because it needs no manual annotation and can be applied to many subdivided domains at low cost. However, its performance still has room for improvement. In follow-up work, we can try better word vectors and more features to improve extraction accuracy, and gradually extend the model to more knowledge types.

         f1                    f1+f2                 f1+f3                 f1+f2+f3
         P      R      F1      P      R      F1      P      R      F1      P      R      F1
RF       0.588  0.812  0.681   0.629  0.824  0.713   0.618  0.812  0.702   0.661  0.833  0.736
KNN      0.593  0.780  0.672   0.603  0.773  0.676   0.604  0.784  0.681   0.614  0.783  0.687
SVM      0.677  0.810  0.736   0.701  0.809  0.749   0.690  0.812  0.744   0.706  0.812  0.753
MLP      0.593  0.813  0.685   0.604  0.815  0.694   0.605  0.814  0.694   0.621  0.826  0.709
TextCNN  0.630  0.785  0.695   0.631  0.815  0.701   0.640  0.765  0.697   0.650  0.810  0.715
Voting   -      -      -       -      -      -       -      -      -       0.689  0.831  0.752
BERT-BiLSTM-CRF (baseline): P = 0.756, R = 0.789, F1 = 0.772

Table 3 Macro P, R, and F1-measure of models using different features on the test set. f1 is the word vector, f2 the POS feature, f3 the case feature.
ACKNOWLEDGMENTS
This study is supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences.

REFERENCES
[1] Zara Nasar, Syed Waqar Jaffry and Muhammad Kamran Malik, 2018. Information extraction from scientific articles: a survey. Scientometrics 117, 3 (2018), 1931-1990. DOI: https://doi.org/10.1007/s11192-018-2921-5.
[2] Farhad Nooralahzadeh, Jan Tore Lønning and Lilja Øvrelid, 2019. Reinforcement-based denoising of distantly supervised NER with partial annotation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP. 225-233.
[3] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, et al., 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1756-1765.
[4] Sonal Gupta and Christopher D. Manning, 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing. 1-9.
[5] Mayank Singh, Soham Dan, Sanyam Agarwal, Pawan Goyal and Animesh Mukherjee, 2017. AppTechMiner: Mining Applications and Techniques from Scientific Articles. In Proceedings of the Joint Conference on Digital Libraries. 1-8.
[6] Kevin Heffernan and Simone Teufel, 2018. Identifying Problems and Solutions in Scientific Text. Scientometrics 116, 2 (2018), 1367-1382. DOI: https://doi.org/10.1007/s11192-018-2718-6.
[7] Michael Collins and Yoram Singer, 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. 100-110.
[8] Guodong Zhou and Jian Su, 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 473-480.
[9] Andrew McCallum and Wei Li, 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL. 188-191.
[10] Colin Cherry and Hongyu Guo, 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL. 735-745.
[11] Jian Ni, Georgiana Dinu and Radu Florian, 2017. Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1470-1480.
[12] John M. Giorgi and Gary D. Bader, 2018. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 34, 23 (2018), 4087-4094.
[13] Pierre Lison, Jeremy Barnes, Aliaksandr Hubin and Samia Touileb, 2020. Named Entity Recognition without Labelled Data: A Weak Supervision Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1518-1533.
[14] Damai Dai, et al., 2020. Inductively Representing Out-of-Knowledge-Graph Entities by Optimal Estimation Under Translational Assumptions. (2020).
[15] David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg and Dietrich Klakow, 2020. Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá. arXiv preprint arXiv:2003.08370 (2020).
[16] Yoon Kim, 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
[17] Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[18] Siwei Lai, Kang Liu, Liheng Xu and Jun Zhao, 2016. How to Generate a Good Word Embedding. IEEE Intelligent Systems 31, 6 (2016), 5-14.