=Paper=
{{Paper
|id=Vol-2658/paper13
|storemode=property
|title=A Unsupervised Method for Terminology Extraction from Scientific Text
|pdfUrl=https://ceur-ws.org/Vol-2658/paper13.pdf
|volume=Vol-2658
|authors=Wei Shao,Hua Bolin,Qiang Ma,Jiaying Liu,Hongwei He,Keqi Chen
|dblpUrl=https://dblp.org/rec/conf/jcdl/ShaoHMLHC20
}}
==A Unsupervised Method for Terminology Extraction from Scientific Text==
EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents
An Unsupervised Method for Terminology Extraction from Scientific Text

Wei Shao (1600016634@pku.edu.cn), Bolin Hua (huabolin@pku.edu.cn), Qiang Ma, Jiaying Liu, Hongwei He, Keqi Chen

Department of Information Management, Peking University

CCS Concepts: • Information systems → Data mining; Information extraction; • Applied computing → Document management and text processing.

Keywords: terminology extraction, unsupervised method, scientific text

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Finding new terminology is a kind of named entity recognition (NER) problem. However, many high-performance methods need labelled data. Although they obtain excellent results on training and testing data, they struggle to process new unlabelled data. One factor behind this gap is that the features of new text differ from the features a model learned on its training data, because the two come from different domains. In addition, new scientific texts usually lack labels for extraction. An unsupervised method that can also adapt to different domains is therefore needed.

To overcome this problem, we propose an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. In detail, we initialize a few patterns to extract terminologies from certain sentences. This step yields some terminologies and their POS sequences. We then search for the same POS sequences in the sentences that the initial patterns did not match. If a sentence is matched, we use suitable words in it to replace the extendable parts of the initial patterns. This gives new patterns, which in turn extract more terminologies. After several iterations, most terminology in scientific sentences can be extracted.

2 Related Work

In recent years, terminology extraction has attracted more and more attention, and many kinds of methods have been proposed. Some methods rely on string, syntactic and other basic features. Liu Li et al.[2] and Zeng Wen et al.[8] use word length and grammatical features to choose terminology candidates. More recently, methods based on machine learning and deep learning have been put forward. Among these, LSTM[1], CRF[6] and their variants achieve the best performance. However, they rely on labelled data and perform poorly on new unlabelled data. To solve this problem, some semi-supervised and unsupervised methods have been proposed. A graph-based semi-supervised algorithm[4] achieves a high F1 on SemEval Task 10. Automatic rule learning based on morphological features[7] has been used to extract entities without annotated data. However, owing to the difficulty of searching for optimal parameters, these methods have not been fully developed.

3 Method

3.1 Overview

Our method aims to extract terminology from unlabelled data. For this purpose, we use two features of terminology: surrounding words and POS sequences. The process has two steps. The first step cold-starts the model with unlabelled data; here the model collects sentence patterns and the POS sequences of terminology from the data. The second step extracts terminology with the POS sequences and sentence patterns the model has learned. For a given sentence, the model can extract terminology with either a learned sentence pattern or a learned POS sequence.

3.2 Sentence Patterns

Figure 1. Pattern Examples

Our sentence patterns are represented as regular expressions. Examples are given in Figure 1: two patterns that aim to extract method terminology. "propose" is a word that often co-occurs with method words. Boundary words such as "by", "to" and "for" limit the range of the terminology words. The text we want is captured by "(.+?)". When generating new patterns, we can use words from the matched
sentence to replace the extendable part of an existing pattern. For the examples in Figure 1, the extendable parts are "propose" and "proposed". They can be replaced by "develop", "present", "put forward" and so on. The new patterns obtained this way can then be used to extract terminology from other sentences.

Figure 2. Cold Start Process (flowchart linking the sentence base, pattern base, extracted sentence base, unextracted sentence base, POS sequence base and candidate patterns; the loop runs until no new sentence can be extracted)

3.3 Cold Start

The cold-start process of our method is shown in Figure 2. The inputs are sentences and their POS sequences, which together form the sentence base. First, we match each pattern in the pattern base against each sentence in the sentence base. At the beginning, the pattern base contains only the initial sentence patterns. A matched sentence is moved to the extracted sentence base, and we obtain its terminology words and their POS sequences. Otherwise, the sentence is moved to the unextracted sentence base. Both bases are empty at the start. After obtaining terminology words and their POS sequences, we filter them to get more accurate results. The filtered POS sequences are moved to the POS sequence base. Then, for each POS sequence in the POS sequence base, we check whether the POS sequence of any sentence in the unextracted sentence base contains it. If it does, we choose candidate words from the matched sentence to generate new patterns. We then match the new patterns against the sentences in the unextracted sentence base and obtain new terminology words. Next, we filter the newly generated patterns according to their matching results and move the suitable ones to the pattern base. The new terminology words replace the initially extracted ones and take part in the next round of the extraction loop, which repeats until no new sentence can be extracted.

3.4 Extraction from New Data

After cold start, we have sentence patterns and POS sequences of terminology words. There are two ways to obtain new terminologies from new unlabelled data. When only the sentence string is given as input, we match the patterns against the sentence. When both the sentence string and its POS sequence (produced by natural language tools) are given, we match the learned POS sequences against the POS sequence of the sentence to get a more accurate result.

4 Experiment and Result

4.1 Data and Preprocessing

To test our method, we crawled more than 200k abstracts from Web of Knowledge. Their topics include machine learning, big data and data mining. We use nltk[3] to split abstracts into sentences and sentences into tokens. We also use stanfordnlp[5] to obtain POS tags and dependency relations for the split sentences. Our method needs only the tokenized sentences of the abstracts and their POS tags.

In the experiment, we use 54000 sentences and their POS sequences as training data and 1000 sentences and their POS sequences as testing data. All sentences are unlabelled.

4.2 Extraction Results

Owing to the lack of labels, we use human evaluation to measure the performance of our method. We use the training data to cold-start our model and extract 146902 terminologies from the training and testing data. The accuracy of our method on the testing data is 0.64. Inspection of sample results suggests that the method can partly solve the problem of extracting terminologies from unlabelled texts. However, for highly specialized terminology the performance may be lower.

5 Conclusion

To extract terminologies from scientific texts, we propose an unsupervised method based on sentence patterns and the POS sequences of sentences. The method extracts terminologies without learning from labelled data and needs only a few initial sentence patterns to cold-start. It can then learn new patterns and POS sequences from unlabelled data and use them to extract new terminologies. In the future, we will test our model on standard datasets and compare it with some baselines.

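The cold-start loop of Section 3.3 can be illustrated with a minimal sketch. It is not the authors' implementation: the patterns of Figure 1 are not reproduced in the text, so the regular-expression template below is a hypothetical stand-in built from the trigger word "propose" and the boundary words "by", "to" and "for". Skipping a leading determiner and accepting only verbs as new trigger words are assumed filters, and the toy sentences with Penn Treebank style tags are invented for illustration.

```python
import re

BOUNDARY_WORDS = ("by", "to", "for")   # boundary words named in Section 3.2
DETERMINERS = {"a", "an", "the"}        # assumption: skipped in front of a term


def make_pattern(trigger):
    """Build a regex whose extendable part is the trigger word (hypothetical template)."""
    boundary = "|".join(BOUNDARY_WORDS)
    return re.compile(
        r"\b" + re.escape(trigger) + r" (?:an? |the )?(.+?) (?:" + boundary + r")\b"
    )


def match_sentence(sentence, triggers):
    """Return (term, POS sequence) for the first matching trigger, else None."""
    tokens = [tok for tok, _ in sentence]
    tags = [tag for _, tag in sentence]
    text = " ".join(tokens)
    for trigger in sorted(triggers):
        m = make_pattern(trigger).search(text)
        if m:
            start = text[:m.start(1)].count(" ")   # character offset -> token index
            length = m.group(1).count(" ") + 1
            return m.group(1), tuple(tags[start:start + length])
    return None


def candidate_triggers(sentence, pos_base):
    """Propose trigger words from a sentence that contains a known POS sequence."""
    tokens = [tok for tok, _ in sentence]
    tags = [tag for _, tag in sentence]
    found = set()
    for seq in pos_base:
        n = len(seq)
        for i in range(1, len(tags) - n):
            if tuple(tags[i:i + n]) != seq or tokens[i + n] not in BOUNDARY_WORDS:
                continue
            j = i - 1
            if tokens[j].lower() in DETERMINERS and j > 0:
                j -= 1
            if tags[j].startswith("VB"):   # assumed filter: keep verbs only
                found.add(tokens[j])
    return found


def cold_start(sentences, initial_triggers):
    """Alternate extraction and pattern generation until nothing new is found."""
    triggers = set(initial_triggers)
    terms, pos_base = [], set()
    unextracted = list(sentences)
    while True:
        remaining = []
        for sentence in unextracted:
            hit = match_sentence(sentence, triggers)
            if hit is None:
                remaining.append(sentence)
            else:
                terms.append(hit[0])
                pos_base.add(hit[1])
        new_triggers = set()
        for sentence in remaining:
            new_triggers |= candidate_triggers(sentence, pos_base)
        new_triggers -= triggers
        extracted_any = len(remaining) < len(unextracted)
        unextracted = remaining
        if not new_triggers and not extracted_any:
            break
        triggers |= new_triggers
    return terms, triggers, pos_base


def tag(text, tags):
    """Pair whitespace-separated tokens with whitespace-separated POS tags."""
    return list(zip(text.split(), tags.split()))


# Invented toy data; a real run would use tokenized, POS-tagged abstracts.
TOY_SENTENCES = [
    tag("We propose a graph model for term extraction .", "PRP VBP DT NN NN IN NN NN ."),
    tag("They develop a sequence tagger to label entities .", "PRP VBP DT NN NN TO VB NNS ."),
    tag("The results are shown in Table 2 .", "DT NNS VBP VBN IN NNP CD ."),
]
terms, triggers, pos_base = cold_start(TOY_SENTENCES, {"propose"})
print(terms, sorted(triggers), pos_base)
```

In this toy run the initial "propose" pattern matches only the first sentence. The POS sequence it yields is then found in the second sentence, which contributes "develop" as a new trigger word, and the third sentence stays in the unextracted base. Calling `match_sentence` with the learned triggers on an unseen sentence corresponds to the first, string-only approach of Section 3.4.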
References

[1] Zhao Dongyue, Du Yongping, and Shi Chongde. 2018. Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model. Technology Intelligence Engineering 4, 1 (2018), 67–74.

[2] Liu Li and Xiao Yingyuan. 2017. A statistical domain terminology extraction method based on word length and grammatical feature. Journal of Harbin Engineering University 38, 9 (2017), 1437–1443.

[3] Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002).

[4] Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific information extraction with semi-supervised neural tagging. arXiv preprint arXiv:1708.06075 (2017).

[5] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.

[6] Wang Miping, Wang Hao, Deng Sanhong, et al. 2016. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields. New Technology of Library and Information Service 6 (2016), 28–36.

[7] Serhan Tatar and Ilyas Cicekli. 2011. Automatic rule learning exploiting morphological features for named entity recognition in Turkish. Journal of Information Science 37, 2 (2011), 137–151.

[8] Zeng Wen, Xu Shuo, Zhang Yunliang, et al. 2014. The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms. New Technology of Library and Information Service 1 (2014), 51–55.