An Unsupervised Method for Terminology Extraction from Scientific Text

Introduction

Finding new terminology is a kind of named entity recognition (NER) problem. However, many high-performance methods need labelled data. Although they obtain excellent results on their training and test data, they struggle to process new unlabelled data. One factor behind this gap is that the features of new text differ from the features a model learned from its training data, because the two come from different domains. In addition, new scientific texts usually lack labels for extraction. An unsupervised method that can also adapt to different domains is therefore needed.

To overcome this problem, we propose an unsupervised method based on sentence patterns and part-of-speech (POS) tags. In detail, we initialize a few patterns to extract terminology from the sentences they match. This step yields some terminology together with its POS sequences. We then search the sentences not matched by the initial patterns for the POS sequences of the terminology obtained so far. If a sentence is matched, we use suitable words in that sentence to replace the extendable parts of the initial patterns. This gives new patterns, which in turn extract more terminology. After several iterations, most terminology in scientific sentences can be extracted.
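
The step that records the POS sequence of each extracted term can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names term_pos_sequence and collect_pos_sequences, the Penn Treebank style tags, and the example sentence are all assumptions made for the example.

```python
def term_pos_sequence(tokens, tags, term_tokens):
    """Return the POS tags covering the first occurrence of term_tokens, or None."""
    n = len(term_tokens)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == term_tokens:
            return tuple(tags[start:start + n])
    return None


def collect_pos_sequences(extractions):
    """Collect the distinct POS sequences of (tokens, tags, term_tokens) triples."""
    sequences = set()
    for tokens, tags, term_tokens in extractions:
        sequence = term_pos_sequence(tokens, tags, term_tokens)
        if sequence is not None:
            sequences.add(sequence)
    return sequences


# Hypothetical tagged sentence and a term already extracted by a seed pattern.
tokens = ["We", "propose", "a", "graph-based", "ranking", "algorithm",
          "to", "extract", "terms", "."]
tags = ["PRP", "VBP", "DT", "JJ", "NN", "NN", "TO", "VB", "NNS", "."]
term = ["graph-based", "ranking", "algorithm"]

pos_sequences = collect_pos_sequences([(tokens, tags, term)])
```

Storing each sequence as a tuple keeps it hashable, so the set of learned sequences can be extended cheaply on every iteration.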

Related Work

In recent years, terminology extraction has attracted increasing attention, and many kinds of methods have been proposed. Some methods rely on string, syntactic and other basic features. Liu and Xiao [2] and Zeng et al. [8] use word length and grammatical features to choose terminology candidates. More recently, methods based on machine learning and deep learning have been put forward. Among these methods, LSTM [1] and CRF [6] and their variants achieve the best performance.

However, these methods rely on labelled data and perform poorly on new unlabelled data. To solve this problem, semi-supervised and unsupervised methods have been proposed. A graph-based semi-supervised algorithm [4] achieves a high F1 score on SemEval Task 10. A method based on automatic rule learning over morphological features [7] extracts entities without annotated data. However, because optimal parameters are difficult to search for, these methods have not been fully developed.

Method

Overview

Our method aims to extract terminology from unlabelled data. For this purpose, we use two features of terminology: surrounding words and POS sequences. The process has two steps. The first step is to cold-start the model with unlabelled data; in this step, the model learns sentence patterns and the POS sequences of terminology from the data. The second step is to extract terminology with the learned POS sequences and sentence patterns. For a given sentence, the model can extract terminology with either a learned sentence pattern or learned POS sequences.

Examples are given in Figure 1, which shows two patterns for extracting method terminology. "propose" is a word that often co-occurs with method terms. Boundary words such as "by", "to" and "for" limit the range of the terminology words. The text we want is captured by "(.+?)".

When generating new patterns, we use words from a matched sentence to replace the extendable part of an existing pattern. For the examples in Figure 1, the extendable parts are "propose" and "proposed". They can be replaced by "develop", "present", "put forward" and so on. The new patterns obtained this way can be used to extract terminology from other sentences. We filter the newly generated patterns according to their matching results and move suitable patterns to the pattern base. The newly extracted terminology words replace the initially extracted ones and take part in the extraction loop, which runs until no new sentence can be extracted.
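
The pattern mechanism described above can be sketched with regular expressions. This is only an illustration under stated assumptions: the exact patterns of Figure 1 are not reproduced here, the template "trigger (.+?) boundary word" is a simplified stand-in, and the names build_pattern, extract_term, generate_patterns and the min_matches threshold are hypothetical. In particular, the paper does not state how patterns are filtered; counting matching sentences is one possible criterion.

```python
import re

# Boundary words named in the text; they close the captured span.
BOUNDARY_WORDS = ("by", "to", "for")


def build_pattern(trigger):
    """Compile '<trigger> (.+?) <boundary word>' for one extendable trigger."""
    boundary = "|".join(re.escape(word) for word in BOUNDARY_WORDS)
    return re.compile(
        r"\b" + re.escape(trigger) + r"\s+(.+?)\s+(?:" + boundary + r")\b",
        re.IGNORECASE,
    )


def extract_term(pattern, sentence):
    """Return the text captured by '(.+?)', or None if the pattern fails."""
    match = pattern.search(sentence)
    return match.group(1) if match else None


def generate_patterns(candidate_triggers, sentences, min_matches=1):
    """Build one candidate pattern per trigger and keep those that match
    at least min_matches sentences (an assumed filtering criterion)."""
    kept = {}
    for trigger in candidate_triggers:
        pattern = build_pattern(trigger)
        hits = sum(1 for sentence in sentences if pattern.search(sentence))
        if hits >= min_matches:
            kept[trigger] = pattern
    return kept


# Hypothetical example sentences.
sentences = [
    "We propose a graph-based ranking algorithm to extract key phrases.",
    "We develop a new clustering method for large graphs.",
]

seed = build_pattern("propose")
first_term = extract_term(seed, sentences[0])

# Replace the extendable part with candidate words and filter the results.
new_patterns = generate_patterns(["develop", "present"], sentences)
second_term = extract_term(new_patterns["develop"], sentences[1])
```

The lazy group "(.+?)" stops at the first boundary word, which is what keeps the captured span short. A real system would likely also strip determiners such as "a" from the captured text.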

Sentence Patterns

Extraction from New Data

After the cold start, we have sentence patterns and the POS sequences of terminology words. There are two ways to extract new terminology from new unlabelled data. When only the sentence string is given as input, we match the sentence against the patterns to obtain new terminology. When both the sentence string and its POS sequence (produced by natural language processing tools) are given, we match the learned POS sequences against the POS sequence of the sentence to get a more accurate result.
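
The second approach, matching learned POS sequences against a tagged sentence, can be sketched as below. This is a minimal sketch rather than the authors' procedure: the function name find_terms_by_pos, the longest-match-first policy, and the example tags are assumptions.

```python
def find_terms_by_pos(tokens, tags, pos_sequences):
    """Return token spans whose tags equal a learned POS sequence.

    Longer sequences are tried first, and matched spans do not overlap.
    """
    # Drop empty sequences so that every match advances the scan position.
    ordered = sorted({seq for seq in pos_sequences if seq}, key=len, reverse=True)
    terms = []
    i = 0
    while i < len(tags):
        for seq in ordered:
            n = len(seq)
            if tuple(tags[i:i + n]) == seq:
                terms.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No learned sequence starts at this position.
            i += 1
    return terms


# Hypothetical tagged sentence and learned POS sequences.
tokens = ["We", "use", "a", "support", "vector", "machine", "for", "classification"]
tags = ["PRP", "VBP", "DT", "NN", "NN", "NN", "IN", "NN"]
learned = {("NN", "NN", "NN"), ("JJ", "NN")}

terms = find_terms_by_pos(tokens, tags, learned)
```

Trying longer sequences first avoids splitting a multi-word term into shorter fragments when several learned sequences start at the same token.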

Experiment and Result

Data and Preprocessing

To test our method, we crawled more than 200k abstracts from Web of Knowledge. Their topics include machine learning, big data and data mining. We use nltk [3] to split the abstracts into sentences and the sentences into tokens. We also use stanfordnlp [5] to get the POS tags and dependency relations of the split sentences. Our method only needs the tokenized sentences of the abstracts and their POS tags.

In the experiment, we use 54000 sentences and their POS sequences as training data, and 1000 sentences and their POS sequences as test data. All sentences are unlabelled.

Extraction Results

Owing to the lack of labels, we use human evaluation to measure the performance of our method. We use the training data to cold-start our model and extract 146902 terms from the training and test data. The accuracy of our method on the test data is 0.64. The extracted cases suggest that the method can partly solve the problem of extracting terminology from unlabelled text. However, for highly specialized terminology, the performance may be lower.

Conclusion

To extract terminology from scientific texts, we propose an unsupervised method based on sentence patterns and the POS sequences of sentences. The method extracts terminology without learning from labelled data and needs only a few initial sentence patterns to cold-start. It then learns new patterns and POS sequences from unlabelled data and uses them to extract new terminology. In the future, we will test our model on standard datasets and compare it with baselines.

Figure 1. Pattern Examples

Figure 2. Cold Start Process

[Figure 2 flowchart labels: Pattern Base; Sentence Base; Pattern; Sentence; Terminology Words; Match the Sentence (Matched / Not Matched); Extracted Sentence Base; Unextracted Sentence Base; POS Sequence; POS Sequence Base; filter; choose candidate words; POS Sequence in Sentence; POS (Matched / Not Matched); Candidate Patterns; some parts are replaced by candidate words; loop until no new sentence can be extracted]

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

References

[1] Zhao Dongyue, Du Yongping, Shi Chongde. Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model. Technology Intelligence Engineering, 4(1), 2018.
[2] Liu Li, Xiao Yingyuan. A statistical domain terminology extraction method based on word length and grammatical feature. Journal of Harbin Engineering University, 38(9), 2017.
[3] Edward Loper, Steven Bird. NLTK: the natural language toolkit. arXiv preprint cs/0205028, 2002.
[4] Yi Luan, Mari Ostendorf, Hannaneh Hajishirzi. Scientific information extraction with semi-supervised neural tagging. arXiv preprint arXiv:1708.06075, 2017.
[5] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky. The Stanford CoreNLP natural language processing toolkit. Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 2014.
[6] Wang Miping, Wang Hao, Deng Sanhong. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields. New Technology of Library and Information Service, 6, 2016.
[7] Serhan Tatar, Ilyas Cicekli. Automatic rule learning exploiting morphological features for named entity recognition in Turkish. Journal of Information Science, 37(2), 2011.
[8] Zeng Wen, Xu Shuo, Zhang Yunliang. The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms. New Technology of Library and Information Service, 1, 2014.