<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Unsupervised Method for Terminology Extraction from Scientific Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Shao</string-name>
          <email>1600016634@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiaying Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bolin Hua</string-name>
          <email>huabolin@pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongwei He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiang Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keqi Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Management, Peking University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>86</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>CCS Concepts: • Information systems → Data mining; Information extraction; • Applied computing → Document management and text processing.</p>
      </abstract>
      <kwd-group>
        <kwd>terminology extraction</kwd>
        <kwd>unsupervised method</kwd>
        <kwd>scientific text</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Finding new terminology is a kind of named entity
recognition (NER) problem. However, many high-performance
methods need labelled data. Although they obtain excellent
results on training and testing data, they struggle to
process new unlabelled data. One factor behind this gap is
that the features of new text differ from the features the models
learn on training data, owing to the difference between their
domains. Moreover, new scientific texts usually lack labels
for extraction. An unsupervised method that can also
adapt to different domains is therefore needed.</p>
      <p>
        To overcome this problem, we propose an unsupervised
method based on sentence patterns and part of speech (POS). In
detail, we initialize a few patterns to extract terminologies
from certain sentences. In this step, we obtain some
terminologies and their POS sequences. Then, we try to
find the same POS sequences in sentences not matched by the
initial patterns, using the obtained terminologies' POS sequences.
If a sentence is matched, we utilize suitable words from that
sentence to replace the extendable parts of the initial patterns.
In this way, we obtain new patterns and extract more
terminologies with them. After several iterations,
most terminology in scientific sentences can be extracted.
      </p>
      <p>
        In recent years, terminology extraction has attracted more
and more attention, and many kinds of methods have been produced.
Some methods rely on string, syntactic and other surface
features. Liu Li[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Zeng Wen[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] use word length and
grammatical features to choose terminology candidates.
More recently, methods based on machine learning and deep
learning have been put forward. Among these, LSTM[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and CRF[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] models and their variants achieve the best performance.
However, they rely on labelled data and perform poorly
on new unlabelled data. To address this, some
semi-supervised and unsupervised methods have been proposed. A
graph-based semi-supervised algorithm[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] achieves a high
F1 on SemEval Task 10. An automatic rule-learning method based on
morphological features[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is used to extract entities
without annotated data. However, owing to the difficulty of
searching for optimal parameters, these methods have not been fully
developed.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Overview</title>
        <p>Our method aims to extract terminology from unlabelled
data. For this purpose, we utilize two features of terminology:
surrounding words and POS sequences. The process can be
divided into two steps. The first step is to cold-start the model with
unlabelled data. In this step, the model acquires sentence
patterns and POS sequences of terminology from the data. The second step
is to extract terminology with the POS sequences and sentence
patterns learned by the model. For a given sentence, the model can
extract terminology with either a learned sentence pattern or a POS
sequence.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Sentence Patterns</title>
        <p>Our sentence pattern is represented by regular expression.</p>
        <p>Examples are given in Figure 1. These are two patterns
aiming to extract method terminology. "propose" is a word which
often appears together with method words. Boundary
words like "by", "to" and "for" are used to limit the range of the
terminology words. What we want is matched by "(.+?)". When
generating new patterns, we can use words from a matched
sentence to replace the extendable part of an existing pattern. For the
examples in Figure 1, the extendable parts are "propose" and
"proposed". They can be replaced by "develop", "present", "put
forward" and so on. In this way, new patterns are obtained
and can be used to extract terminology from other sentences.
We then filter the newly generated patterns according to their matching
results and move suitable patterns to the pattern base. The new
terminology words replace the initially extracted
terminology words to participate in the extraction loop until no
new sentence can be extracted.</p>
        <p>Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
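        <p>The pattern-extension step can be sketched with Python regular expressions. The patterns below are assumptions for illustration, not the paper's originals from Figure 1; "propose" plays the role of the extendable part, "by"/"to"/"for" act as boundary words, and "(.+?)" captures the candidate terminology.</p>
```python
import re

# Illustrative initial pattern (an assumption, not the paper's exact pattern):
# the extendable word is "propose"; "by", "to", "for" bound the capture.
initial_pattern = r"we propose (?:a|an|the) (.+?) (?:by|to|for)\b"

def generate_patterns(base_pattern, extendable, replacements):
    """Derive new patterns by swapping the extendable word of an existing one."""
    return [base_pattern.replace(extendable, word) for word in replacements]

# Swapping the extendable part yields new patterns such as "we present ...".
new_patterns = generate_patterns(initial_pattern, "propose",
                                 ["present", "develop", "introduce"])

sentence = "we present a graph-based ranking model to score candidate terms"
match = re.search(new_patterns[0], sentence)
print(match.group(1))  # -> graph-based ranking model
```
        <p>With such derived patterns, sentences introduced by other verbs can be matched without any labelled data.</p>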
      </sec>
      <sec id="sec-2-3">
        <title>Extraction from New Data</title>
        <p>After the cold start, we have obtained sentence patterns and POS
sequences of terminology words. There are two approaches to
getting new terminologies from new unlabelled data. One is that,
when only the sentence string is input, we use patterns to
match sentences and obtain new terminologies. The other is
that, when the sentence string and its POS sequence (produced by
natural language tools) are input, we use the terminology POS sequences
to match the POS sequence of the sentence, which gives a more accurate
result.</p>
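        <p>A minimal sketch of the POS-sequence route (the tag names follow Penn Treebank conventions; the sentence and the learned sequence are invented for illustration): a learned terminology POS sequence is searched for as a contiguous run inside the sentence's tag sequence, and the aligned tokens become a candidate term.</p>
```python
# Search for a learned terminology POS sequence as a contiguous run in the
# sentence's POS tags; the aligned tokens form a candidate terminology.
def match_pos_sequence(tokens, tags, term_pos_seq):
    n, m = len(tags), len(term_pos_seq)
    candidates = []
    for i in range(n - m + 1):
        if tags[i:i + m] == term_pos_seq:
            candidates.append(" ".join(tokens[i:i + m]))
    return candidates

# Toy sentence with hand-assigned Penn Treebank tags (for illustration only).
tokens = ["We", "apply", "latent", "semantic", "analysis", "to", "abstracts"]
tags   = ["PRP", "VBP", "JJ", "JJ", "NN", "TO", "NNS"]

print(match_pos_sequence(tokens, tags, ["JJ", "JJ", "NN"]))
# -> ['latent semantic analysis']
```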
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment and Result</title>
      <sec id="sec-3-1">
        <title>Data and Preprocessing</title>
        <p>
          To test our method, we crawled 200k+ abstracts from Web of
Knowledge. Their topics include machine learning, big data
and data mining. We utilize nltk[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to split abstracts into
sentences and split sentences into tokens. We also use
stanfordnlp[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to get the POS tags and dependency relations of
the tokenized sentences. Our method only needs the tokenized
sentences of the abstracts and their POS tags.
        </p>
        <p>In our experiments, we use 54,000 sentences and their POS
sequences as training data and 1,000 sentences and their POS
sequences as testing data. All sentences are unlabelled.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Cold Start</title>
        <p>The cold-start process of our method is shown in
Figure 2. The inputs are sentences and their POS sequences,
which form the sentence base. First, we use each pattern from the
pattern base to match each sentence from the sentence base. At the
beginning, the pattern base contains only the initial sentence
patterns. A matched sentence is moved to the extracted sentence
base, and we obtain terminology words and their POS
sequences. Otherwise, the sentence is moved to the
unextracted sentence base. Both bases are empty before the first iteration. After
getting terminology words and their POS sequences, we
filter them to obtain more accurate results. The filtered
POS sequences are moved to the POS sequence base. Then,
each POS sequence from the POS sequence base is used to
check whether a sentence POS sequence in the unextracted sentence
base contains it. If a sentence POS sequence does, we
choose candidate words from the matched sentence to
generate new patterns. After new patterns are generated,
we use them to match sentences in the unextracted sentence
base and obtain new terminology words, and the loop repeats until no
new sentences can be extracted.</p>
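        <p>The cold-start loop can be sketched end-to-end in a few lines of Python. Everything here, including the trigger words, the pattern syntax, and the toy sentences, is an assumption for illustration; the real system additionally filters patterns and terminology by matching quality and tracks POS sequences.</p>
```python
import re

# Hypothetical extendable trigger words for pattern generation.
TRIGGERS = ["propose", "present", "develop", "introduce"]

def cold_start(sentences, initial_patterns):
    """Iterate pattern matching and pattern generation until a fixpoint."""
    pattern_base = set(initial_patterns)
    unextracted = set(sentences)
    terminology = set()
    while True:
        newly_matched = set()
        for sent in unextracted:
            for pat in pattern_base:
                m = re.search(pat, sent)
                if m:
                    terminology.add(m.group(1))
                    newly_matched.add(sent)
                    break
        unextracted -= newly_matched
        # Generate new patterns by swapping the extendable trigger word.
        new_patterns = {pat.replace(t1, t2)
                        for pat in pattern_base
                        for t1 in TRIGGERS if t1 in pat
                        for t2 in TRIGGERS}
        if not newly_matched and new_patterns.issubset(pattern_base):
            break  # fixpoint: no new sentences and no new patterns
        pattern_base |= new_patterns
    return terminology, pattern_base

sentences = [
    "we propose a neural tagging model for term extraction",
    "we present a bootstrapping framework for pattern learning",
]
terms, patterns = cold_start(sentences, [r"we propose a (.+?) for\b"])
print(sorted(terms))
# -> ['bootstrapping framework', 'neural tagging model']
```
        <p>The second sentence is only matched after the first iteration has derived a "we present" pattern from the initial "we propose" pattern, which is exactly the bootstrapping behaviour of the cold start.</p>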
      </sec>
      <sec id="sec-3-3">
        <title>Extraction Results</title>
        <p>Owing to the lack of labels, we use human evaluation to
measure our method’s performance. We use the training data
to cold-start our model and extract 146,902 terminologies
from the training and testing data. Specifically, the accuracy of
our method on the testing data is 0.64. From sample cases in the
results, we find that this method can partly solve the
problem of extracting terminologies from unlabelled texts.
However, for highly specialized terminologies,
the performance may be lower.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>To extract terminologies from scientific texts, we propose an
unsupervised method based on sentence patterns and the POS
sequences of sentences. This method can extract terminologies
without learning from labelled data and needs only a few initial
sentence patterns to cold-start. It then learns new
patterns and POS sequences from unlabelled data and uses them
to extract new terminologies. In the future, we will test our
model on standard datasets and compare it with
baselines.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Zhao</surname> <given-names>Dongyue</given-names></string-name>,
          <string-name><surname>Du</surname> <given-names>Yongping</given-names></string-name>, and
          <string-name><surname>Shi</surname> <given-names>Chongde</given-names></string-name>.
          <year>2018</year>.
          <article-title>Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model</article-title>.
          <source>Technology Intelligence Engineering</source>
          <volume>4</volume>,
          <issue>1</issue>
          (2018),
          <fpage>67</fpage>-<lpage>74</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Liu</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Yingyuan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A statistical domain terminology extraction method based on word length and grammatical feature</article-title>
          .
          <source>Journal of Harbin Engineering University</source>
          <volume>38</volume>
          ,
          <issue>9</issue>
          (
          <year>2017</year>
          ),
          <fpage>1437</fpage>
          -
          <lpage>1443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>arXiv preprint cs/0205028</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Luan</surname>
          </string-name>
          , Mari Ostendorf, and
          <string-name>
            <given-names>Hannaneh</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Scientific information extraction with semi-supervised neural tagging</article-title>
          .
          <source>arXiv preprint arXiv:1708.06075</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Christopher D.</given-names> <surname>Manning</surname></string-name>,
          Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and
          <string-name><given-names>David</given-names> <surname>McClosky</surname></string-name>.
          <year>2014</year>.
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>.
          In <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>.
          <fpage>55</fpage>-<lpage>60</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Wang</surname> <given-names>Miping</given-names></string-name>,
          Wang Hao, Deng Sanhong, et al.
          <year>2016</year>.
          <article-title>Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields</article-title>.
          <source>New Technology of Library and Information Service</source>
          <volume>6</volume>
          (2016),
          <fpage>28</fpage>-<lpage>36</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Serhan</given-names>
            <surname>Tatar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ilyas</given-names>
            <surname>Cicekli</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Automatic rule learning exploiting morphological features for named entity recognition in Turkish</article-title>
          .
          <source>Journal of Information Science</source>
          <volume>37</volume>
          ,
          <issue>2</issue>
          (
          <year>2011</year>
          ),
          <fpage>137</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Zeng</surname> <given-names>Wen</given-names></string-name>,
          Xu Shuo, Zhang Yunliang, et al.
          <year>2014</year>.
          <article-title>The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms</article-title>.
          <source>New Technology of Library and Information Service</source>
          <volume>1</volume>
          (2014),
          <fpage>51</fpage>-<lpage>55</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>