=Paper=
{{Paper
|id=Vol-3004/paper6
|storemode=property
|title=Extracting Domain Entities from Scientific Papers Leveraging Author Keywords
|pdfUrl=https://ceur-ws.org/Vol-3004/paper6.pdf
|volume=Vol-3004
|authors=Jiabin Peng,Jing Chen,Guo Chen
|dblpUrl=https://dblp.org/rec/conf/jcdl/PengCC21
}}
==Extracting Domain Entities from Scientific Papers Leveraging Author Keywords==
EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents
Jiabin Peng, Jing Chen, Guo Chen
School of Economics & Management, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
2542505085@qq.com, chenjinguuu@126.com, delphi1987@qq.com
ABSTRACT
Current methods of domain entity extraction from scientific texts rely heavily on manually annotated corpora and thus have poor generalization ability. In this paper, we propose a two-stage methodology that makes good use of the existing author keywords of a given domain to solve this problem. Firstly, the author keyword set is used to mark the boundaries of candidate entities, and then their features are integrated to classify their entity types. In an experiment on artificial intelligence (AI) documents from WOS, our approach obtains an F1 value of 0.753 without manual annotation, which is slightly lower than the BERT-BiLSTM-CRF baseline model (F1=0.772) trained on a manually annotated corpus, showing the usability of our approach in practice.

CCS CONCEPTS
• Computing methodologies • Artificial intelligence • Natural language processing • Information extraction

KEYWORDS
Information Extraction, Domain Named Entity Recognition, Analysis of Scientific Papers, Author Keywords

1 Introduction
At present, there have been many studies on knowledge entity extraction from scientific papers, and the biggest problem is the lack of labeled data[1]. Scientific papers usually belong to a specific domain, so manual annotation requires corresponding domain knowledge, which makes annotation more expensive, and many popular named entity recognition (NER) models cannot achieve their usual excellent performance. To ensure the generalization ability of NER models, it is necessary to reduce their dependence on manual annotation. At present, thanks to the rapid development of databases and the Internet, a large number of knowledge resources have been accumulated in many domains, such as knowledge bases, gazetteers, glossaries, dictionaries, etc. These resources are widely used in NER models based on distant supervision[2] or semi-supervised learning[3], which reduces the dependence of models on labeled data and improves their generalization ability to a certain extent.
Actually, domain entity extraction can be divided into two subtasks: entity boundary recognition and entity type classification. Taking the domain of artificial intelligence (AI) as an example, we first used the domain glossary to help identify entity boundaries and then constructed low-cost training data to classify the entities. Problems and solutions are viewed as the key insights of scientific papers[1], so we took them as the main entity types in the experiment. Following related studies, we summarized the research objectives, domains, applications, and tasks in technical papers as problems, and the methods, schemes, models, technologies, tools, software, algorithms, and theories used to solve these problems as solutions[4][5][6]. The experimental results showed that our methodology achieved a good score without manual annotation, with the F1-measure reaching 0.753.

2 Related Studies
At present, the mainstream methods of domain NER fall into two categories: methods based on statistical machine learning (ML) and methods based on deep learning (DL). NER based on ML is essentially classification: given multiple types of named entities, models are used to classify the entities in the text. There are two ideas in the implementation. One is to first identify the boundaries of all named entities in the text and then classify them into different types, such as CoBoost[7]. The other is sequence annotation: each word in the text is given several candidate type labels, which correspond to its position in various entities. Classical ML-based NER models using sequence annotation include HMM[8], CRF[9], etc. NER models based on DL use pre-trained word vectors to represent words, which can alleviate the problem of data sparsity in high-dimensional vector spaces. Meanwhile, pre-trained word vectors contain more semantic information than manually selected features and can obtain feature representations in a unified vector space from heterogeneous texts, which has brought strong progress to sequence annotation tasks, especially NER[10].
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
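As a toy illustration of the two ML formulations just described (illustrative tokens and labels, not data from this paper), the sequence-annotation encoding and the boundary-then-classify encoding carry the same information and are interconvertible:

```python
# Two ways to cast NER, per the related studies above.
sentence = ["we", "apply", "conditional", "random", "fields", "here"]

# (a) Sequence annotation: each token gets a position+type label (BIO scheme).
bio_labels = ["O", "O", "B-Solution", "I-Solution", "I-Solution", "O"]

# (b) Boundary-first: mark entity spans, then classify each span separately.
spans = [(2, 5)]                      # token offsets of candidate entities
span_types = {(2, 5): "Solution"}    # the classification step assigns the type

# The encodings are interconvertible: derive (b)'s spans from (a)'s labels.
def bio_to_spans(labels):
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):       # sentinel flushes last span
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i))
            start = None
    return spans

print(bio_to_spans(bio_labels))  # [(2, 5)]
```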
The biggest problem of domain NER nowadays is the lack of labeled corpora. When a general NER method is applied to a specific domain, corresponding adjustment strategies need to be taken according to the domain corpus. A common idea is to use transfer learning to share data and models among domains. Ni et al. projected labeled data and distributed representations of words into the target domain without manual annotation[11]. Giorgi et al. transferred the source-domain model parameters to the target domain for initialization, and then fine-tuned the parameters to fit the task[12]. Another idea is to make full use of the existing knowledge resources in a domain to automatically build datasets and carry out distant supervision, semi-supervision, weak supervision, etc. Nooralahzadeh et al. adopted a technique of partial annotation and implemented a reinforcement learning strategy with a neural network policy in distantly supervised NER[2]. Peters et al. demonstrated a general semi-supervised approach for adding pre-trained context embeddings from bidirectional language models to NLP systems and applied it to NER[3]. Lison et al. relied on a broad spectrum of labeling functions to automatically annotate texts from the target domain[13].
From the above research, it can be seen that various domain resources are widely used to reduce the manual annotation cost as much as possible, and they achieve good results. However, domain NER models based on transfer learning, semi-supervision, etc. still cannot avoid manual participation in the construction of datasets. Therefore, after analyzing the essence of the NER task, we divide domain NER into two subtasks in this paper, which avoids manual annotation with the help of domain resources. In addition, some new ideas such as zero-shot learning[14] and learning with noisy labels[15] have also been applied to domain NER to further reduce labor costs.

3 Methodology

3.1 Framework
Traditional NER is regarded as a sequence labeling task, which assigns the corresponding entity type and location label to each token in the text. In fact, NER can be regarded as two subtasks: boundary recognition and entity classification. That is, we can first identify the boundaries of named entities in the text, and then classify them into different types. The NER method based on sequence labeling treats the two subtasks as a whole, in which the same labeled data is shared by both subtasks, so the requirement for its quality is quite high. As a result, many classic NER methods cannot be applied in some subdivided domains. In addition, the NER method based on sequence labeling cannot effectively integrate existing domain resources; at present, the common practice is to use domain terms as auxiliary data to help roughly label data. By contrast, by dividing NER into boundary recognition and entity classification, we can make full use of the existing domain knowledge resources.
Entity boundary recognition can be regarded as a word segmentation task, which requires large-scale resources (i.e. a user-defined lexicon). There usually exist some domain glossaries and a large-scale author keyword set in a given domain, which can help solve the word segmentation task. Compared with word segmentation, entity classification requires smaller-scale resources (i.e. training data). At present, much domain knowledge can be obtained easily through online databases or knowledge graphs, which can provide the necessary training data for entity type classification without manual annotation. To sum up, the framework of this paper is shown in Figure 1.
[Figure 1 is a flow diagram: domain resources (glossary, author keywords, documents/abstracts) feed entity boundary recognition (user-defined lexicon, document segmentation) and entity classification (training data; word vector, part-of-speech, and word case features; model; evaluation & optimization).]
Figure 1 Framework of knowledge entity extraction.
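The boundary-recognition stage of Figure 1 is realized later in the paper as forward maximum matching against the merged lexicon (Section 3.2.1). A minimal sketch with a toy lexicon follows; stemming, which the paper applies before matching, is omitted here for brevity:

```python
# Forward maximum matching against a user-defined lexicon
# (merged domain glossary + author keywords), as in Figure 1.
# Toy lexicon; the paper's lexicon holds the full keyword set and glossary.
LEXICON = {"support vector machine", "named entity recognition", "word segmentation"}
MAX_LEN = max(len(term.split()) for term in LEXICON)  # longest phrase, in words

def fmm_segment(tokens, lexicon=LEXICON, max_len=MAX_LEN):
    """Greedy longest-match scan: at each position try the longest window first."""
    segments, i = [], 0
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + span])
            if candidate in lexicon:
                segments.append((candidate, True))   # candidate entity boundary
                i += span
                break
        else:  # no lexicon match at this position: emit an ordinary word
            segments.append((tokens[i], False))
            i += 1
    return segments

tokens = "we use support vector machine for named entity recognition".split()
print(fmm_segment(tokens))
```

The `True`/`False` flags mark candidate entities for the downstream classification stage; because lexicon matches are noisy, that stage also has to reject non-entities (hence the multi-class setup described in Section 3.2.1).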
The framework is divided into three parts. The first is the acquisition of domain resources. The domain resources used in this paper include a domain glossary and domain documents. The domain glossary can be obtained directly via web search (e.g. Google) or from relevant domain knowledge websites (such as Wikipedia, Baidu Baike, etc.). In addition, we also obtain the types of terms when constructing the domain glossary. Domain documents can be obtained through databases (such as WOS, CNKI, etc.), from which author keywords and abstracts are then extracted. Author keywords are indispensable large-scale resources for entity boundary recognition, and abstracts can be used to construct features for the training data. The second part is entity boundary recognition, which is treated as a word segmentation task. The user-defined lexicon for segmentation is constructed by combining the domain glossary and the author keyword set, and helps realize entity boundary recognition at a low cost. The third part is entity classification. The training data required for
classification are extracted from the domain glossary, and the text features are obtained from the abstracts by training or counting.

3.2 Implementation
3.2.1 Entity Boundary Recognition. As mentioned above, entity boundary recognition is transformed into word segmentation. Inspired by Chinese automatic word segmentation methods, the forward maximum matching algorithm based on string matching is used in this paper. Slightly differently from Chinese word segmentation, it is necessary to stem English words before segmentation to avoid the influence of word forms on the result. In addition, there is some noise when using the segmentation lexicon to label candidate entities, so the following entity classification task is actually multi-class.
3.2.2 Entity Classification. Entity classification is essentially word classification, which is a typical supervised task, so training data are indispensable. At present, it is difficult to construct a large amount of high-quality classification data in a given domain. However, a small amount of high-quality data can usually be obtained at a low cost with the help of domain knowledge bases or domain experts.
1) Construct training data. Training data consist of positive samples and negative samples. Positive samples consist of entities and their corresponding types. The pre-constructed glossary contains the types of terms, so we directly extract some high-quality terms and their types as positive samples. Negative samples, i.e. non-entities, are randomly extracted from keyword sets and texts.
2) Construct text features. According to the task, word vector, part-of-speech (POS), and word case features are constructed. Word vectors can be obtained by training on a large-scale unlabeled domain corpus, which gives discrete words semantic information according to their context. The POS feature is obtained by counting the corpus without word segmentation. The acquisition of the case feature is basically consistent with the POS feature, except that the corpus with word segmentation is used and the cases need to be self-defined.
3) Model selection, training, evaluation, and optimization. According to the task, the models we use include four classical machine learning models, namely Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), plus TextCNN, which performs well in sentence classification[16]. The detailed steps of our experiment are as follows: ① feed the training data to the models to obtain the basic results; ② optimize the word vectors according to the model performance and add features in the training process; ③ evaluate the models to decide whether to continue optimizing.

3.3 Feature Processing
3.3.1 Word Vector. At present, common models for training word vectors include Word2Vec, GloVe, ELMo, GPT, BERT, etc. The word vectors trained by the first two models are context-independent. Because our core task is phrase classification, which does not need context information, we choose Word2Vec to train word vectors. Before training, we use underscores to concatenate the words of each phrase in the corpus after word segmentation, to ensure that phrases are treated as a whole when training word vectors.
Word2Vec includes two algorithms: Skip-gram and CBOW. Research shows that Skip-gram captures more semantic information, while CBOW captures more grammatical information[17]. The window size is also very important for training word vectors, and the commonly used window sizes are 5 and 10 (see https://www.bbsmax.com/A/A2dm2D7zen/). Therefore, we first explore these two factors affecting word vector quality in the classification experiment. Following related studies[18], the other parameters are shown in Table 1. In addition, to make the word vectors more robust, we use the stemmed corpus to train Word2Vec.

Parameters         Values
sg                 1 / 0
window size (w)    5 / 10
min count          5
iteration number   20
embedding size     200

Table 1 Parameters of Word2Vec. sg=1 means the algorithm is Skip-gram, sg=0 CBOW. w is used to denote the window size below.

3.3.2 POS Feature. The POS of words in sentences can be obtained through the Python third-party package nltk. There are 36 POS tags in nltk, so the POS vector of a single word has length 36. In the classification experiment, the POS vector of a training sample is obtained by concatenating the POS vectors of its component words. To avoid inconsistent vector lengths, we count the lengths of the phrases (the number of words in a phrase) in the segmentation lexicon to obtain the maximum phrase length. When a training sample is shorter than the maximum phrase length, its POS vector is padded with 0. Finally, the length of the POS vector of a training sample is 36 * the maximum phrase length in the lexicon. POS vectors are used by concatenating them with word vectors in the experiment.
3.3.3 Case Feature. Three types of phrase case are defined in this paper: initial uppercase, all uppercase, and all lowercase. The case vector of a training sample therefore has length 3. Similarly, case vectors are used by concatenating them with word vectors.

3.4 Classification Models
The classification models used in this paper include RF, KNN, SVM, MLP, and TextCNN. The first four models are implemented with sklearn in Python. In the RF, the number of decision trees is set to 100. All parameters of the KNN take the default values. In the SVM, probability is set to True, that is, probability estimation is enabled. In the MLP, the number of neurons in the
hidden layer is (100, 50). TextCNN was originally used to classify sentences, and phrases can be regarded as shorter sentences. The input of TextCNN (https://github.com/cjymz886/text-cnn) is the vectors generated by Word2Vec. The embedding size, sequence length, batch size, and number of training epochs in TextCNN are set to 200, 10, 32, and 20 respectively. Parameters not mentioned above take the default values.

4 Experiment
To verify the effectiveness of our methodology, we take the domain of AI as an example in the experiment.

4.1 Data Acquisition and Preprocessing
Firstly, we obtained the bibliographic data of the AI domain. The data came from the AI category of the WOS (Web of Science) core collection. Documents were retrieved with WC = computer science and WC = artificial intelligence, and the time range was set from 1996 to 2020. Then, abstracts and keywords were extracted from the bibliographic data, yielding 927675 abstracts and 161169 keywords.
Secondly, we constructed a glossary of the AI domain. The data came from a knowledge website in the AI domain (https://paperswithcode.com/), from which we obtained all problem and solution entities. The problem entities came from the tasks on the Browse State-of-the-Art page of the website, and the solution entities came from the machine learning components of the Methods page. After removing duplicates, 1887 problem entities and 1209 solution entities remained.
Finally, we processed the above data to get the final experimental data. The user-defined lexicon for English word segmentation was constructed by merging the keyword set and the domain glossary. The training data of the classifiers consisted of entities and non-entities, in which 360 entities of each type were manually extracted from the glossary, and 360 non-entities were manually constructed. Non-entities included phrases and words: phrases were extracted from high-frequency keywords, and words were constructed randomly. The ratio of phrases to words among the non-entities was about 2:1, because almost all entities are phrases, so more phrase-level non-entities were needed to help train the models. Finally, 1080 pieces of classification data were obtained. The training set and validation set were randomly divided at a ratio of 5:1. In addition, to evaluate the performance of our methodology, we set our baseline as the traditional BERT-BiLSTM-CRF NER model. A previously annotated corpus containing 3000 sentences was used for the baseline model, in which 2000 sentences were randomly selected as training data and the remaining 1000 sentences were used as the common test set.

4.2 Result Analysis
The macro averages of precision, recall, and F1-measure were used to evaluate the models.
Word vectors are the basic input of the models, so we first explored the influence of word vectors trained by the two algorithms with different window sizes; the results are shown in Table 2.

              sg=1            sg=0
          w=5     w=10    w=5     w=10
RF        0.672   0.681   0.666   0.655
KNN       0.550   0.672   0.151   0.146
SVM       0.736   0.701   0.709   0.679
MLP       0.672   0.685   0.428   0.478
TextCNN   0.695   0.685   0.670   0.670

Table 2 Macro F1-measure of models using different word vectors on the test set.

Firstly, we compared the results in Table 2 vertically. When sg=1, the five models achieved good results on the whole in both window sizes; only KNN with w=5 performed poorly. However, when sg=0, the performance of all models decreased, especially KNN and MLP. The possible reason is that Skip-gram focuses on semantics, which is more conducive to the NER task than CBOW; in addition, KNN and MLP have higher requirements for data quality, which word vectors trained by CBOW could not meet. Secondly, we compared the results in Table 2 horizontally: the best F1-measure of each model was about 0.7 (per model: RF 0.681, KNN 0.672, SVM 0.736, MLP 0.685, TextCNN 0.695). In the following experiments, the word vector that let each model achieve its best performance was used, and POS and case features were added to the models. The results are shown in Table 3.
It can be seen from Table 3 that adding the two features effectively improved the F1-measures. When all features were fused, optimal results were obtained for all models. Among the five models, SVM had the best performance, with an F1-measure of 0.753. This might be because the underlying training mechanism of SVM makes it more suitable for small-sample classification. When the word vector parameters were sg=1 and w=10, the voting model had the best performance, with an F1-measure of 0.752. SVM or the voting model can be selectively used in practical applications. The best F1-measure of TextCNN was 0.715, far from its performance in sentence classification; one possible reason is that phrases are much shorter than sentences.
The baseline BERT-BiLSTM-CRF (https://github.com/macanv/BERT-BiLSTM-CRF-NER) performed well on the domain NER task. Its F1-measure was 0.772, notably lower than its performance on general NER tasks but still a very good result in a subdivided domain, and 0.019 higher than our optimal model. From the experimental results, there is still a gap in our methodology, but considering the cost of the experimental data, the gap is acceptable. In follow-up work, we can further optimize the word vectors and add more features to improve the performance.
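The macro-averaged metrics used above can be computed from per-class counts: precision, recall, and F1 are calculated for each class, then averaged with equal weight. A self-contained sketch with synthetic labels (not the paper's predictions), using the three classes of our training data:

```python
# Macro-averaged P/R/F1 as used in Tables 2 and 3:
# per-class precision/recall/F1, then the unweighted mean over classes.
def macro_prf(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Three classes as in the paper: problem, solution, non-entity.
y_true = ["problem", "solution", "non-entity", "problem", "solution", "non-entity"]
y_pred = ["problem", "solution", "non-entity", "solution", "solution", "problem"]
p, r, f = macro_prf(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.722 0.667 0.656
```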
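The paper does not specify whether the voting model combines the five classifiers by hard or soft voting. Assuming soft voting over per-class probabilities (which setting probability=True in the SVM makes available via predict_proba), a sketch with made-up scores:

```python
# Soft voting over several classifiers: average their per-class probability
# vectors, then take the argmax. Class order here: problem, solution, non-entity.
def soft_vote(prob_rows):
    n = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n for i in range(len(prob_rows[0]))]
    return max(range(len(avg)), key=avg.__getitem__), avg

# One candidate entity, scored by three of the models (illustrative numbers).
probs = [
    [0.6, 0.3, 0.1],   # e.g. SVM
    [0.5, 0.4, 0.1],   # e.g. RF
    [0.2, 0.7, 0.1],   # e.g. MLP
]
label, avg = soft_vote(probs)
print(label, [round(x, 3) for x in avg])  # 1 [0.433, 0.467, 0.1]
```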
5 Conclusion
Aiming at the problem that current domain NER models rely heavily on manually annotated data and thus have poor domain generalization ability, we propose a two-stage knowledge entity extraction methodology that gets rid of the dependence on manually annotated data. Experiments on WOS documents in the domain of AI showed that, using our approach, good results can be achieved in the extraction of problem and solution entities without manual annotation.
In general, our approach has good domain generalization because it does not need manual annotation, and it can be applied to many subdivided domains at a low cost. However, the performance of our scheme still has some room for improvement. In follow-up work, we can try to use better word vectors and more features to improve the accuracy of entity extraction, and gradually extend the model to the extraction of more knowledge types.
                f1                      f1+f2                   f1+f3                   f1+f2+f3
                P      R      F1        P      R      F1        P      R      F1        P      R      F1
RF              0.588  0.812  0.681     0.629  0.824  0.713     0.618  0.812  0.702     0.661  0.833  0.736
KNN             0.593  0.780  0.672     0.603  0.773  0.676     0.604  0.784  0.681     0.614  0.783  0.687
SVM             0.677  0.810  0.736     0.701  0.809  0.749     0.690  0.812  0.744     0.706  0.812  0.753
MLP             0.593  0.813  0.685     0.604  0.815  0.694     0.605  0.814  0.694     0.621  0.826  0.709
TextCNN         0.630  0.785  0.695     0.631  0.815  0.701     0.640  0.765  0.697     0.650  0.810  0.715
Voting          -      -      -         -      -      -         -      -      -         0.689  0.831  0.752
BERT-BiLSTM-CRF (baseline)   P: 0.756   R: 0.789   F1: 0.772

Table 3 Macro P, R, F1-measure of models using different features on the test set. f1 is the word vector, f2 the POS feature, f3 the case feature.
ACKNOWLEDGMENTS
This study is supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences.

REFERENCES
[1] Zara Nasar, Syed Waqar Jaffry and Muhammad Kamran Malik, 2018. Information extraction from scientific articles: a survey. Scientometrics 117, 3 (2018), 1931-1990. DOI: https://doi.org/10.1007/s11192-018-2921-5
[2] Farhad Nooralahzadeh, Jan Tore Lønning and Lilja Øvrelid, 2019. Reinforcement-based denoising of distantly supervised NER with partial annotation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP. 225-233.
[3] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, et al., 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1756-1765.
[4] Sonal Gupta and Christopher D. Manning, 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing. 1-9.
[5] Mayank Singh, Soham Dan, Sanyam Agarwal, Pawan Goyal and Animesh Mukherjee, 2017. AppTechMiner: Mining Applications and Techniques from Scientific Articles. In Proceedings of the Joint Conference on Digital Libraries. 1-8.
[6] Kevin Heffernan and Simone Teufel, 2018. Identifying Problems and Solutions in Scientific Text. Scientometrics 116, 2 (2018), 1367-1382. DOI: https://doi.org/10.1007/s11192-018-2718-6
[7] Michael Collins and Yoram Singer, 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. 100-110.
[8] Guodong Zhou and Jian Su, 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 473-480.
[9] Andrew McCallum and Wei Li, 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL. 188-191.
[10] Colin Cherry and Hongyu Guo, 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Conference of the North American Chapter of the ACL. 735-745.
[11] Jian Ni, Georgiana Dinu and Radu Florian, 2017. Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1470-1480.
[12] John M. Giorgi and Gary D. Bader, 2018. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 34, 23 (2018), 4087-4094.
[13] Pierre Lison, Jeremy Barnes, Aliaksandr Hubin and Samia Touileb, 2020. Named Entity Recognition without Labelled Data: A Weak Supervision Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1518-1533.
[14] Damai Dai, et al., 2020. Inductively Representing Out-of-Knowledge-Graph Entities by Optimal Estimation Under Translational Assumptions. (2020).
[15] David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg and Dietrich Klakow, 2020. Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá. arXiv preprint arXiv:2003.08370 (2020).
[16] Yoon Kim, 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
[17] Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[18] Siwei Lai, Kang Liu, Liheng Xu and Jun Zhao, 2016. How to Generate a Good Word Embedding. IEEE Intelligent Systems 31, 6 (2016), 5-14.