<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An Unsupervised Method for Terminology Extraction from Scientific Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Wei</forename><surname>Shao</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Bolin</forename><surname>Hua</surname></persName>
							<email>huabolin@pku.edu.cn</email>
						</author>
						<author>
							<persName><forename type="first">Qiang</forename><surname>Ma</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Jiaying</forename><surname>Liu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Keqi</forename><surname>Chen</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hongwei</forename><surname>He</surname></persName>
							<affiliation key="aff3">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff5">
								<orgName type="department">Department of Information Management</orgName>
								<orgName type="institution">Peking University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">An Unsupervised Method for Terminology Extraction from Scientific Text</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">851D284071D0112806559FB24FDB8BC5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T04:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>• Information systems → Data mining</term>
					<term>Information extraction</term>
					<term>• Applied computing → Document management and text processing</term>
					<term>terminology extraction</term>
					<term>unsupervised method</term>
					<term>scientific text</term>
				</keywords>
			</textClass>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Finding new terminology is a kind of named entity recognition (NER) problem. However, many high-performance methods need labelled data. Although they obtain excellent results on their training and test data, they struggle to process new unlabelled data. One factor behind this gap is that the features of new text differ from the features the models learned on training data, because the domains differ. Also, new scientific texts usually lack labels for extraction. An unsupervised method that can also adapt to different domains is therefore needed.</p><p>To overcome this problem, we propose an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. In detail, we initialize a few patterns to extract terminologies from certain sentences. This step yields some terminologies and their POS sequences. Then, in the sentences not matched by the initial patterns, we look for the POS sequences of the terminologies already obtained. If a sentence is matched, we use suitable words in this sentence to replace the extendable parts of the initial patterns. This gives new patterns, which in turn extract more terminologies. After several iterations, most terminology in scientific sentences can be extracted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>In recent years, terminology extraction has attracted more and more attention, and many kinds of methods have been proposed. Some methods rely on string, syntactic and other basic features. Liu Li <ref type="bibr" target="#b1">[2]</ref> and Zeng Wen <ref type="bibr" target="#b7">[8]</ref> use word length and grammatical features to choose terminology candidates. More recently, methods based on machine learning and deep learning have been put forward. Among these methods, LSTM <ref type="bibr" target="#b0">[1]</ref> and CRF <ref type="bibr" target="#b5">[6]</ref> and their variants achieve the best performance.</p><p>However, they rely on labelled data and perform poorly on new unlabelled data. To solve this problem, some semi-supervised and unsupervised methods have been proposed. A graph-based semi-supervised algorithm <ref type="bibr" target="#b3">[4]</ref> achieves a high F1 on SemEval Task 10. Automatic rule learning based on morphological features <ref type="bibr" target="#b6">[7]</ref> has been used to extract entities without annotated data. However, owing to the difficulty of searching for optimal parameters, these methods have not been fully developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Overview</head><p>Our method aims to extract terminology from unlabelled data. For this purpose, we utilize two features of terminology: surrounding words and POS sequences. The process can be divided into two steps. The first step is to cold-start the model with unlabelled data. In this step, the model learns sentence patterns and POS sequences of terminology from the data. The second step is to extract terminology with the POS sequences and sentence patterns learned by the model. For a sentence, the model can extract terminology with either a learned sentence pattern or learned POS sequences. Examples are given in Figure <ref type="figure" target="#fig_0">1</ref>. These are two patterns that aim to extract method terminology. "propose" is a word that often co-occurs with method words. Boundary words like "by", "to" and "for" are used to limit the range of terminology words. The text we want is matched by "(.+?)". When generating new patterns, we can use words from a matched sentence to replace the extendable part of an existing pattern. For the examples in Figure 1, the extendable parts are "propose" and "proposed". They can be replaced by "develop", "present", "put forward" and so on. In this way, new patterns are obtained and can be used to extract terminology in other sentences. We filter newly generated patterns according to their matching results and move suitable patterns to the pattern base. New terminology words replace the initially extracted terminology words and take part in the extraction loop until no new sentence can be extracted.</p></div>
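<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pattern mechanism described in the overview can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the template, the optional article, the helper names build_pattern and extract, and the example sentences are all hypothetical. Only the trigger words, the boundary words and the lazy group "(.+?)" come from the text.</p>
<p><![CDATA[
```python
import re

# Hypothetical pattern template: "{trigger}" is the extendable part and the
# boundary words close the lazy capture group. The real patterns in Figure 1
# may differ; this only mirrors the description in the text.
BOUNDARY_WORDS = ("by", "to", "for")
TEMPLATE = r"\b{trigger}\s+(?:an?\s+|the\s+)?(.+?)\s+(?:{boundaries})\b"


def build_pattern(trigger):
    """Compile a sentence pattern for one trigger word or phrase."""
    return re.compile(
        TEMPLATE.format(
            trigger=re.escape(trigger),
            boundaries="|".join(BOUNDARY_WORDS),
        ),
        re.IGNORECASE,
    )


def extract(patterns, sentence):
    """Return the first captured candidate term, or None if nothing matches."""
    for pattern in patterns:
        match = pattern.search(sentence)
        if match:
            return match.group(1)
    return None


# One initial pattern, then new patterns obtained by swapping the
# extendable part for other trigger words named in the text.
initial = [build_pattern("propose")]
expanded = initial + [
    build_pattern(word) for word in ("develop", "present", "put forward")
]

# Made-up example sentences.
s1 = "We propose a graph-based ranking algorithm to extract keyphrases."
s2 = "We develop a pattern matching approach for term extraction."

print(extract(initial, s1))   # matched by the initial pattern
print(extract(initial, s2))   # not matched until the pattern base grows
print(extract(expanded, s2))  # matched by a generated pattern
```
]]></p>
<p>In this sketch the second sentence is only matched after the pattern base has been extended, which is the effect the iteration relies on. The filtering of generated patterns is not shown.</p></div>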
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Sentence Patterns</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Extraction from New Data</head><p>After the cold start, we have sentence patterns and POS sequences of terminology words. There are two approaches to extracting new terminologies from new unlabelled data. First, when only the sentence string is input, we can match sentences against the patterns to obtain new terminologies. Second, when both the sentence string and its POS sequence (produced by natural language tools) are input, we can match the learned POS sequences against the POS sequence of the sentence to get a more accurate result.</p></div>
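<div xmlns="http://www.tei-c.org/ns/1.0"><p>Matching learned POS sequences against a tagged sentence can be illustrated with a small Python sketch. It assumes that tokens and Penn Treebank POS tags are already available, that the POS sequence base is a set of tag tuples, and that longer sequences are tried first. The function name find_pos_matches, the longest-first rule and the example data are hypothetical and are not taken from the paper.</p>
<p><![CDATA[
```python
def find_pos_matches(tokens, tags, pos_sequences):
    """Return token spans whose POS tags equal a known POS sequence.

    Longer sequences are tried first (an assumption of this sketch), and
    matched spans do not overlap.
    """
    ordered = sorted(pos_sequences, key=len, reverse=True)
    matches = []
    i = 0
    while i < len(tokens):
        for seq in ordered:
            n = len(seq)
            if tuple(tags[i:i + n]) == seq:
                matches.append(" ".join(tokens[i:i + n]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no sequence starts here, move one token forward
    return matches


# Hypothetical POS sequence base, as it might look after the cold start.
pos_base = {("NN", "NN", "NN"), ("JJ", "NN", "NN")}

# Made-up tagged sentence.
tokens = ["The", "support", "vector", "machine", "outperforms",
          "the", "linear", "regression", "model", "."]
tags = ["DT", "NN", "NN", "NN", "VBZ",
        "DT", "JJ", "NN", "NN", "."]

print(find_pos_matches(tokens, tags, pos_base))
```
]]></p></div>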
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiment and Result</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data and Preprocessing</head><p>To test our method, we crawled over 200k abstracts from Web of Knowledge. Their topics include machine learning, big data and data mining. We use NLTK <ref type="bibr" target="#b2">[3]</ref> to split abstracts into sentences and sentences into tokens. We also use stanfordnlp <ref type="bibr" target="#b4">[5]</ref> to get POS tags and dependency relations of the tokenized sentences. Our method only needs the tokenized sentences of the abstracts and their POS tags.</p><p>In the experiment, we use 54,000 sentences and their POS sequences as training data and 1,000 sentences and their POS sequences as test data. All sentences are unlabelled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Extraction Results</head><p>Owing to the lack of labels, we use human evaluation to measure our method's performance. We use the training data to cold-start our model and extract 146,902 terminologies from the training and test data. The accuracy of our method on the test data is 0.64. The extracted cases suggest that this method can partly solve the problem of extracting terminologies from unlabelled texts. However, for highly specialized terminologies, the performance may be lower.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>To extract terminologies from scientific texts, we propose an unsupervised method based on sentence patterns and the POS sequences of sentences. This method extracts terminologies without learning from labelled data and needs only a few initial sentence patterns to cold-start. It then learns new patterns and POS sequences from unlabelled data and uses them to extract new terminologies. In the future, we will test our model on standard datasets and compare it with some baselines.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1.</head><label>1</label><figDesc>Figure 1. Pattern Examples</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2.</head><label>2</label><figDesc>Figure 2. Cold Start Process</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Flowchart labels recovered from the cold start process diagram: Pattern Base; Sentence Base; Pattern; Sentence; Terminology Words; Match the Sentence (Matched / Not Matched); Extracted Sentence Base; Unextracted Sentence Base; POS Sequence; POS Sequence Base; filter; Sentence POS Sequence; choose candidate words; POS Sequence in Sentence; POS Matched / Not Matched; Candidate Patterns (some parts are replaced by candidate words); Terminology Words; filter; loop until no new sentence can be extracted.</head><label></label><figDesc></figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model</title>
		<author>
			<persName><forename type="first">Dongyue</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongping</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chongde</forename><surname>Shi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technology Intelligence Engineering</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="67" to="74" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A statistical domain terminology extraction method based on word length and grammatical feature</title>
		<author>
			<persName><forename type="first">Li</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yingyuan</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Harbin Engineering University</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1437" to="1443" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">NLTK: the natural language toolkit</title>
		<author>
			<persName><forename type="first">Edward</forename><surname>Loper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Bird</surname></persName>
		</author>
		<idno>arXiv preprint cs/0205028</idno>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Scientific information extraction with semi-supervised neural tagging</title>
		<author>
			<persName><forename type="first">Yi</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mari</forename><surname>Ostendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hannaneh</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.06075</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Stanford CoreNLP natural language processing toolkit</title>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mihai</forename><surname>Surdeanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jenny</forename><forename type="middle">Rose</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Bethard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>McClosky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations</title>
				<meeting>52nd annual meeting of the association for computational linguistics: system demonstrations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields</title>
		<author>
			<persName><forename type="first">Miping</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hao</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanhong</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New Technology of Library and Information Service</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="28" to="36" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatic rule learning exploiting morphological features for named entity recognition in Turkish</title>
		<author>
			<persName><forename type="first">Serhan</forename><surname>Tatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilyas</forename><surname>Cicekli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Science</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="137" to="151" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms</title>
		<author>
			<persName><forename type="first">Wen</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuo</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yunliang</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New Technology of Library and Information Service</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="51" to="55" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
