=Paper= {{Paper |id=Vol-2658/paper13 |storemode=property |title=A Unsupervised Method for Terminology Extraction from Scientific Text |pdfUrl=https://ceur-ws.org/Vol-2658/paper13.pdf |volume=Vol-2658 |authors=Wei Shao,Hua Bolin,Qiang Ma,Jiaying Liu,Hongwei He,Keqi Chen |dblpUrl=https://dblp.org/rec/conf/jcdl/ShaoHMLHC20 }} ==A Unsupervised Method for Terminology Extraction from Scientific Text== https://ceur-ws.org/Vol-2658/paper13.pdf
                   EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents




An Unsupervised Method for Terminology Extraction from Scientific Text

Wei Shao (1600016634@pku.edu.cn), Bolin Hua (huabolin@pku.edu.cn), Qiang Ma, Jiaying Liu, Hongwei He, Keqi Chen
Department of Information Management, Peking University

CCS Concepts: • Information systems → Data mining; Information extraction; • Applied computing → Document management and text processing.

Keywords: terminology extraction, unsupervised method, scientific text

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Finding new terminology is a kind of named entity recognition (NER) problem. However, many high-performance methods need labelled data. Although they can obtain excellent results on training and testing data, it is hard for them to process new unlabelled data. One factor leading to this gap is that the features of new text differ from the features that models learn from training data, owing to the difference between their domains. In addition, new scientific texts usually lack labels for extraction. An unsupervised method that can also adapt to different domains is therefore needed.

To overcome this problem, we propose an unsupervised method based on sentence patterns and part of speech (POS). In detail, we initialize a few patterns to extract terminology from certain sentences. In this step, we obtain some terms and their POS sequences. Then, using the POS sequences of the obtained terms, we look for the same POS sequences in sentences not matched by the initial patterns. If a sentence is matched, we use suitable words in this sentence to replace the extendable parts of the initial patterns. In this way, we obtain new patterns and extract more terminology with them. After several iterations, most terminology in scientific sentences can be extracted.

2   Related Work

In recent years, terminology extraction has attracted more and more attention, and many kinds of methods have been proposed. Some methods rely on string, syntactic and other basic features. Liu Li [2] and Zeng Wen [8] use word length and grammatical features to choose terminology candidates. More recently, methods based on machine learning and deep learning have been put forward. Among these methods, LSTM [1] and CRF [6] and their variants achieve the best performance. However, they rely on labelled data and perform poorly on new unlabelled data. To solve this problem, some semi-supervised and unsupervised methods have been proposed. A graph-based semi-supervised algorithm [4] achieves a high F1 score on SemEval Task 10. An automatic rule learning method based on morphological features [7] is used to extract entities without annotated data. However, because optimal parameters are difficult to find, these methods have not been fully developed.

3   Method

3.1 Overview

Our method aims to extract terminology from unlabelled data. For this purpose, we utilize two features of terminology: surrounding words and POS sequences. The process can be divided into two steps. The first step is to cold-start the model with unlabelled data. In this step, the model obtains sentence patterns and POS sequences of terminology from the data. The second step is to extract terminology with the POS sequences and sentence patterns learned by the model. For a given sentence, the model can extract terminology with either the learned sentence patterns or the learned POS sequences.

3.2 Sentence Patterns

Figure 1. Pattern Examples

Our sentence patterns are represented by regular expressions. Examples are given in Figure 1. These are two patterns aimed at extracting method terminology. "propose" is a word that often co-occurs with method words. Boundary words like "by", "to" and "for" are used to limit the range of terminology words. The target terminology is matched by "(.+?)". When generating new patterns, we can use words from the matched




sentence to replace the extendable part of an existing pattern. For the examples in Figure 1, the extendable parts are "propose" and "proposed". They can be replaced by "develop", "present", "put forward" and so on. In this way, new patterns are obtained and can be used to extract terminology from other sentences.

Figure 2. Cold Start Process (flowchart components: Pattern Base, Sentence Base, Extracted Sentence Base, Unextracted Sentence Base, POS Sequence Base, Candidate Patterns, Terminology Words; the loop repeats until no new sentence can be extracted)

3.3 Cold Start

The cold start process of our method is shown in Figure 2. The inputs are sentences and their POS sequences, which form the sentence base. First, we use each pattern from the pattern base to match each sentence from the sentence base. At the beginning, the pattern base contains only the initial sentence patterns. A matched sentence is moved to the extracted sentence base, and we obtain terminology words and their POS sequences. Otherwise, the sentence is moved to the unextracted sentence base. Both bases are empty at the start. After getting terminology words and their POS sequences, we need to filter them to obtain more accurate results. The filtered POS sequences are moved to the POS sequence base. Then, for each POS sequence in the POS sequence base, we check whether the POS sequence of each sentence in the unextracted sentence base contains it. If it does, we choose candidate words from the matched sentence to generate new patterns. After new patterns are generated, we use them to match sentences in the unextracted sentence base, and new terminology words are obtained. Then we filter the newly generated patterns according to their matching results and move suitable patterns to the pattern base. The new terminology words replace the initially extracted terminology words and take part in the extraction loop, which repeats until no new sentence can be extracted.

3.4 Extraction from New Data

After the cold start, we have sentence patterns and POS sequences of terminology words. There are two approaches to getting new terminology from new unlabelled data. First, when only the sentence string is given as input, we can use the patterns to match sentences and obtain new terminology. Second, when both the sentence string and its POS sequence (produced by natural language processing tools) are given, we can match the learned POS sequences against the POS sequence of the sentence to get a more accurate result.

4   Experiment and Result

4.1 Data and Preprocessing

To test our method, we crawled 200k+ abstracts from Web of Knowledge. Their topics include machine learning, big data and data mining. We use NLTK [3] to split abstracts into sentences and sentences into tokens. We also use stanfordnlp [5] to get POS tags and dependency relations of the split sentences. Our method only needs the tokenized sentences of the abstracts and their POS tags.

In the experiment, we use 54000 sentences and their POS sequences as training data and 1000 sentences and their POS sequences as testing data. All sentences are unlabelled.

4.2 Extraction Results

Owing to the lack of labels, we use human evaluation to measure our method's performance. We use the training data to cold-start our model and extract 146902 terms from the training and testing data. Specifically, the accuracy of our method on the testing data is 0.64. From some result cases, we find that this method can partly solve the problem of extracting terminology from unlabelled texts. However, for highly specialized terminology, the performance may be lower.

5   Conclusion

To extract terminology from scientific texts, we propose an unsupervised method based on sentence patterns and the POS sequences of sentences. This method can extract terminology without learning from labelled data and needs only a few initial sentence patterns to cold-start. It can then learn new patterns and POS sequences from unlabelled data and use them to extract new terminology. In the future, we will test our model on standard datasets and compare it with some baselines.
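As an illustration of the mechanism described in Sections 3.2 and 3.3, the following Python sketch runs one round of the cold start on two hand-tagged sentences. It is a minimal sketch under stated assumptions, not the authors' implementation. The regular expression built by `make_pattern` is a hypothetical stand-in for the patterns of Figure 1, which are not reproduced in the text. The rule for choosing a candidate word (the nearest verb before the POS match) is an assumption, since the paper does not specify how candidate words are chosen. The example sentences and their POS tags are invented, and all function names are hypothetical. The filtering steps are omitted.

```python
import re

BOUNDARY = r"(?:by|to|for)"  # boundary words named in Section 3.2


def make_pattern(trigger):
    """Build a sentence pattern whose extendable part is `trigger` (hypothetical form)."""
    return re.compile(
        rf"\b{re.escape(trigger)} (?:an? |the )?(.+?) {BOUNDARY}\b",
        re.IGNORECASE,
    )


def find_sublist(seq, sub):
    """Return the first index where `sub` occurs contiguously in `seq`, else -1."""
    n = len(sub)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def extract(pattern, tagged):
    """Return (term tokens, term POS tags) if the pattern matches, else None."""
    tokens = [tok for tok, _ in tagged]
    match = pattern.search(" ".join(tokens))
    if match is None:
        return None
    term = match.group(1).split()
    start = find_sublist(tokens, term)
    pos = [tag for _, tag in tagged[start:start + len(term)]]
    return term, pos


def candidate_trigger(tagged, term_pos):
    """Pick a candidate word from a sentence whose POS sequence contains `term_pos`.

    Assumption: the nearest verb before the POS match is taken as the candidate.
    """
    tags = [tag for _, tag in tagged]
    start = find_sublist(tags, term_pos)
    if start == -1:
        return None
    for tok, tag in reversed(tagged[:start]):
        if tag.startswith("VB"):
            return tok
    return None


# Invented, hand-tagged example sentences (Penn Treebank style tags).
s1 = [("We", "PRP"), ("propose", "VBP"), ("a", "DT"), ("graph-based", "JJ"),
      ("ranking", "NN"), ("algorithm", "NN"), ("to", "TO"), ("extract", "VB"),
      ("keywords", "NNS"), (".", ".")]
s2 = [("We", "PRP"), ("develop", "VBP"), ("a", "DT"), ("novel", "JJ"),
      ("clustering", "NN"), ("method", "NN"), ("for", "IN"), ("large", "JJ"),
      ("datasets", "NNS"), (".", ".")]

initial = make_pattern("propose")           # the only pattern in the pattern base
term, term_pos = extract(initial, s1)       # s1 is matched by the initial pattern
assert extract(initial, s2) is None         # s2 would go to the unextracted base
trigger = candidate_trigger(s2, term_pos)   # s2's POS sequence contains term_pos
new_term, _ = extract(make_pattern(trigger), s2)
print(term, term_pos, trigger, new_term)
```

In this sketch the initial pattern extracts a term from the first sentence only. The POS sequence of that term is then found in the second sentence, which yields a new trigger word and therefore a new pattern that matches the second sentence. A fuller version would repeat this loop over the whole unextracted sentence base and filter both POS sequences and candidate patterns, as Section 3.3 describes.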







References
[1] Zhao Dongyue, Du Yongping, and Shi Chongde. 2018. Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model. Technology Intelligence Engineering 4, 1 (2018), 67–74.
[2] Liu Li and Xiao Yingyuan. 2017. A statistical domain terminology extraction method based on word length and grammatical feature. Journal of Harbin Engineering University 38, 9 (2017), 1437–1443.
[3] Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002).
[4] Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific information extraction with semi-supervised neural tagging. arXiv preprint arXiv:1708.06075 (2017).
[5] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.
[6] Wang Miping, Wang Hao, Deng Sanhong, et al. 2016. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields. New Technology of Library and Information Service 6 (2016), 28–36.
[7] Serhan Tatar and Ilyas Cicekli. 2011. Automatic rule learning exploiting morphological features for named entity recognition in Turkish. Journal of Information Science 37, 2 (2011), 137–151.
[8] Zeng Wen, Xu Shuo, Zhang Yunliang, et al. 2014. The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms. New Technology of Library and Information Service 1 (2014), 51–55.



