<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Unified Approach for Short Question Entity Discovery and Linking</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Qin</forename><surname>Wei</surname></persName>
							<email>weiqin@putao.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Shanghai for Science and Technology</orgName>
								<address>
									<postCode>200093</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jiong</forename><surname>Zhang</surname></persName>
							<email>zhangjiong@putao.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Shanghai Putao Technology Co., Ltd</orgName>
								<address>
									<postCode>200233</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Huimin</forename><surname>Zhang</surname></persName>
							<email>zhanghuimin@putao.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Shanghai Putao Technology Co., Ltd</orgName>
								<address>
									<postCode>200233</postCode>
									<settlement>Shanghai</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Unified Approach for Short Question Entity Discovery and Linking</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1378D60516AD28B091C547AA907FB9D7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Entity discovery and linking (EDL) in short questions aims to find the entities a question focuses on and to disambiguate them, usually against linked-data sources. For Chinese questions, the problem mainly involves three tasks: segmenting the words and phrases in a question, disambiguating the segments, and mapping mentions to semantic entities. In this paper, we propose an integer linear programming (ILP) based method that solves the three tasks jointly. Our solution harnesses the rich feature types provided by the question context and the linked-data source (CN-DBpedia in our experiments) to constrain a semantic-coherence objective function, and a genetic algorithm (GA) is used to tune the parameters. In the evaluation of CCKS2017 shared task one, our approach achieves an F1 score of 0.804 in mention discovery and 0.56 in entity linking, ranking 1st among all 17 teams by the entity-linking F1 score.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Entity Discovery and Linking (EDL) in Natural Language Processing (NLP) is the task of matching entity mentions in text to unique identifiers in linked-data sources such as CN-DBpedia, and it has become a hot topic as linked data grows. Conventional Named Entity Recognition (NER) focuses on identifying the occurrence of an entity and its type, but not the specific entity a mention refers to; EDL takes a further step in understanding text and thus plays a critical role in downstream applications such as Information Retrieval (IR) and Knowledge-Based Question Answering (KBQA) systems.</p><p>EDL is commonly divided into two sub-tasks: Mention Detection (MD) and Entity Linking (EL). MD is concerned with identifying potential mentions of entities in the text, and EL maps those mentions to semantic entities. EDL is complex and challenging due to the ambiguity not only of word and phrase senses but also of entity mentions, which is affected by the context of words and phrases, the similarity between mentions and entities, the priors of entities, the coherence among entities, and so on.</p><p>Many approaches have been proposed in the literature to solve EDL. The majority treat the two tasks separately and fall into two main types: rule-based approaches and machine-learning models. Rule-based approaches invest heavily in linguistic analysis <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref> and can be built into practical systems within limited time, so they achieved good performance in the early related shared tasks. Machine-learning models such as Maximum Entropy (ME) <ref type="bibr" target="#b3">[4]</ref>, generative models <ref type="bibr" target="#b4">[5]</ref> and ranking methods <ref type="bibr" target="#b5">[6]</ref>, benefitting from the data explosion of the last decade and good at balancing precision and recall, are becoming increasingly dominant. These separate approaches, however, suffer from cascading errors, leaving a gap to the theoretically best performance. Some joint approaches have also been reported <ref type="bibr" target="#b6">[7]</ref> <ref type="bibr" target="#b7">[8]</ref>; they are good at bringing the factors affecting EDL into a single model but lack methods for tuning the model parameters.</p><p>This paper presents our approach for the CCKS2017 shared task of Question Entity Discovery and Linking (QEDL) in Chinese. Compared to EDL, one additional sub-task arises, the segmentation of words and phrases in questions, because Chinese words lack explicit boundaries. The three sub-tasks are solved jointly by an Integer Linear Programming (ILP) based model whose parameters are tuned by a Genetic Algorithm (GA). F1 scores of 0.804 in mention discovery and 0.56 in entity linking were achieved in the evaluation.</p><p>The paper is structured as follows. After describing the four steps of the online prediction framework in section 2, we discuss the joint disambiguation step in detail in section 3. Section 4 presents the offline parameter tuning of the online model. The evaluation results are outlined in section 5. Finally, we review the main conclusions and preview future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Framework</head><p>As shown in figure <ref type="figure" target="#fig_0">1</ref>, given a Chinese short question, our online approach takes four steps for QEDL: word detection, mention discovery, entity mapping and joint disambiguation. A question sentence is processed as a sequence of characters, qNL = (t0, t1, …, tn), while a word is a contiguous subsequence of that character sequence, wij = (ti, ti+1, …, tj), 0 ≤ i ≤ j ≤ n. The input question is handled by the pipeline in figure <ref type="figure" target="#fig_0">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Word Detection</head><p>Words are detected by "jieba", a commonly used Chinese text segmenter. In jieba's cut-for-search mode, all word candidates are generated and put into a word set. For the sample question "李娜是在哪一年拿的澳网冠军", the word set contains the following candidates: "李娜", "是", "在", "哪一年", "一年", "拿", "的", "澳网" and "冠军".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Mention Discovery</head><p>Mentions are discovered using CN-DBpedia by querying the contiguous subsequences of the question. A subsequence becomes a mention candidate if the query returns results, and the candidates are added to the word set of all word candidates, with duplicates removed. For the sample question, the mention candidates "李", "李娜", "一", "一年", etc. are added to the word set.</p></div>
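The candidate generation above can be sketched in a few lines of Python. The set-based knowledge-base lookup here is an illustrative assumption standing in for a real CN-DBpedia query:

```python
def mention_candidates(question, kb):
    # Enumerate every contiguous subsequence of the question; a
    # subsequence becomes a mention candidate if the KB lookup succeeds.
    n = len(question)
    return {question[i:j] for i in range(n)
            for j in range(i + 1, n + 1) if question[i:j] in kb}

# toy stand-in for CN-DBpedia: a set of known entity mentions
kb = {"李", "李娜", "一", "一年", "哪一年", "澳网", "冠军"}
word_set = {"是", "在", "拿", "的"}          # candidates from word detection
word_set |= mention_candidates("李娜是在哪一年拿的澳网冠军", kb)
print(sorted(word_set))
```

Deduplication comes for free from the set union, matching the "with duplicates removed" behaviour described above.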
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Entity Mapping</head><p>A mention candidate can be mapped to multiple semantic entities. During this step, a semantic entity mapping space for the mentions is constructed. "李娜" in the sample question is mapped to "李娜(中国女子网球名将)", "李娜(南京师范大学讲师)", "李娜(2016年陈可辛导演电影)", etc. in the mapping space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Joint Disambiguation</head><p>During this step, word boundary determination, mention disambiguation and semantic entity disambiguation are solved jointly over a disambiguation graph. For the sample question, by decoding the resulting graph, we obtain the entity mentions "李娜" and "澳网" and the entity links "李娜(中国女子网球名将)" and "澳大利亚网球公开赛". The details of this step are described in section 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Joint Disambiguation</head><p>As the disambiguation of one word, mention or entity can influence the others, a disambiguation graph encoding all possible mappings is constructed. To simplify the problem, we model it as an ILP rather than with graph models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Overlap-Word-Entity Graph</head><p>A weighted, undirected graph DG = (V, E) is defined, with words Vw, entities Ve and word-overlap constraints Vo as nodes. The graph for the sample question is shown in figure <ref type="figure" target="#fig_1">2</ref>. Edges are drawn as solid and dashed lines, corresponding to the values 1 and 0 assigned to the ILP variables; the word-entity edges are tied closely to the final predictions.</p><p>Features and constraints are expressed by the weighted edges in the graph, and by optimizing the final objective function the best word and entity nodes are selected. The features and constraints harnessed in the model are presented below. </p></div>
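The graph construction can be sketched as follows. The node naming scheme, the placeholder edge weights and the span-intersection test are illustrative assumptions, not the paper's exact implementation:

```python
# Hedged sketch of the Overlap-Word-Entity graph DG = (V, E) of section 3.1.
# word_candidates: list of (word, (start, end)) span pairs;
# entity_map: mention -> list of candidate entities.

def build_graph(word_candidates, entity_map):
    nodes = {("w", w) for w, _ in word_candidates} | \
            {("e", e) for ents in entity_map.values() for e in ents}
    edges = {}
    # word-entity edges tie each mention to its candidate entities
    for w, _ in word_candidates:
        for e in entity_map.get(w, []):
            edges[(("w", w), ("e", e))] = 1.0   # placeholder weight
    # one overlap-constraint node per pair of words whose spans intersect
    for w1, s1 in word_candidates:
        for w2, s2 in word_candidates:
            if w1 < w2 and s1[1] > s2[0] and s2[1] > s1[0]:
                o = ("o", w1, w2)
                nodes.add(o)
                edges[(o, ("w", w1))] = edges[(o, ("w", w2))] = 0.0
    return nodes, edges

nodes, edges = build_graph([("李", (0, 1)), ("李娜", (0, 2))],
                           {"李娜": ["李娜(中国女子网球名将)"]})
print(sorted(nodes))
```

The overlap node for "李" and "李娜" records that at most one of the two overlapping words may be selected, which the ILP later enforces.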
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Over-all Objective Function</head><p>Our framework combines the mention prior, the entity prior, the entity-context similarity, the similarity between entity and mention, and the coherence between entities into a single objective function:</p><formula xml:id="formula_0">max Σi [ λ1·prior(mi) + λ2·prior(eji) + λ3·sim(eji, cxt(mi)) + λ4·sim(mi, eji) ] + λ5·Σ(i,k) sim(eji, elk) − λ6·penalty(sum(m))<label>(1)</label></formula><p>Subject to:</p><formula xml:id="formula_1">I(mi) + I(mj) ≤ 1 if scope(mi) ∩ scope(mj) ≠ ∅;  I(eji) ≤ I(mi);  Σj I(eji) = I(mi);  I(·) ∈ {0, 1}<label>(2)</label></formula><formula xml:id="formula_2">λ1 + λ2 + λ3 + λ4 + λ5 + λ6 = 1</formula><p>where cxt(mi) denotes the context of mention mi, and sum(m) is the number of selected mentions, which is penalized when it exceeds 2, 3 or 4. Each feature corresponds to one coefficient λ, which is changed by the GA tuning. scope(mi) is the interval between the start and end positions of mi in the question sentence. Section 3.2 gives details on each of these components.</p></div>
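The joint selection that the objective and constraints describe can be illustrated with a tiny brute-force search over the binary indicators; real instances use an ILP solver (section 3.4). The mentions, spans and scores below are made-up toy values, and the scoring collapses all feature terms into a single per-entity weight:

```python
from itertools import product

# Toy joint disambiguation: pick mention indicators I(m_i) and one entity
# per chosen mention so that overlapping mentions are mutually exclusive
# and the total weighted score is maximal.
mentions = {"李": (0, 1), "李娜": (0, 2), "澳网": (8, 10)}
entities = {"李": {"李(姓氏)": 0.1},
            "李娜": {"李娜(网球名将)": 0.9, "李娜(讲师)": 0.2},
            "澳网": {"澳大利亚网球公开赛": 0.8}}

def overlaps(a, b):
    return a[1] > b[0] and b[1] > a[0]

def best_assignment():
    names = list(mentions)
    best, best_score = None, float("-inf")
    for pick in product([0, 1], repeat=len(names)):
        chosen = [n for n, p in zip(names, pick) if p]
        # overlap constraint: no two chosen mentions may share characters
        if any(overlaps(mentions[a], mentions[b])
               for i, a in enumerate(chosen) for b in chosen[i + 1:]):
            continue
        # for each chosen mention, take its best-scoring entity
        score = sum(max(entities[m].values()) for m in chosen)
        if score > best_score:
            best = {m: max(entities[m], key=entities[m].get) for m in chosen}
            best_score = score
    return best

print(best_assignment())
```

The search rejects "李" together with "李娜" because their spans overlap, and keeps "李娜" plus "澳网" with their highest-weighted entities, mirroring the decoding result given for the sample question.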
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">ILP Processing</head><p>We use binary variables to indicate whether nodes and edges are selected, integrate the features and constraints into the ILP objective function and constraints as shown in equation 1, and linearize the objective function by introducing new variables and the corresponding spinoff constraints. The resulting ILP may seem sophisticated, but because the questions are short it stays well within the regime of modern ILP solvers. In our experiments we use PuLP and achieve run-times of usually less than one second.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Parameter Tuning</head><p>The parameters of the ILP objective function, about 30 in quantity, are optimized by a GA, a random search and optimization method based on natural selection and the genetic mechanisms of living beings <ref type="bibr" target="#b9">[10]</ref>, without computing gradients. As the target parameters are floats, a real-number encoding is adopted.</p><p>CCKS2017 shared task one uses the F1 scores of mention discovery and entity linking, so we use the score in equation (<ref type="formula">5</ref>) as our fitness function in the GA; it grows as the F1 scores increase.</p></div>
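A minimal real-coded GA of the kind described above can be sketched as follows. The population size, operator choices and rates are illustrative assumptions, and the toy fitness stands in for the task score of equation (5):

```python
import random

# Minimal real-coded GA sketch for tuning a vector of float coefficients.
def ga_optimize(fitness, dim, pop_size=20, generations=60,
                mutation_rate=0.1, seed=0):
    rnd = random.Random(seed)
    pop = [[rnd.uniform(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # selection: keep top half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rnd.sample(survivors, 2)
            # arithmetic crossover on real-valued genes
            child = [(x + y) / 2 for x, y in zip(a, b)]
            # Gaussian mutation on a fraction of the genes
            child = [g + rnd.gauss(0, 0.1) if rnd.random() < mutation_rate
                     else g for g in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# toy fitness with its peak at (0.5, 0.5, 0.5)
best = ga_optimize(lambda v: -sum((g - 0.5) ** 2 for g in v), dim=3)
print(best)
```

Because only fitness evaluations are needed, no gradients of the downstream F1 scores are required, which is what makes the GA a natural fit here.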
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><formula xml:id="formula_3">score_e = beta · |A ∩ B| / (alpha · |A| + |B|)<label>(3)</label> score_l = beta · |A ∩ B| / (alpha · entityDiff(A, B) + |B|)<label>(4)</label> score = gamma · score_e + (1 − gamma) · score_l<label>(5)</label></formula><p>where alpha is the recognition-rate adjustment coefficient, beta is the bonus coefficient and gamma is the fitting coefficient. As shown in Table <ref type="table" target="#tab_0">1</ref>, our model makes reasonable predictions in QEDL. Further experiments show that F1 scores of 0.804 in mention discovery and 0.56 in entity linking are achieved on the test data, exceeding the second-best team in the shared task by more than 0.10 on both scores.</p></div>
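The combined fitness of equation (5) might be computed as below. This is a sketch under stated assumptions: A is taken as the predicted set, B as the gold set, the component scores use F1-like forms, and the definition of entityDiff (only partially recoverable from the paper) is passed in as a caller-supplied function:

```python
def score_e(A, B, alpha=1.0, beta=1.0):
    # mention-discovery component: overlap of predicted set A and gold
    # set B; alpha adjusts the recognition rate, beta is a bonus weight
    return beta * len(A & B) / (alpha * len(A) + len(B))

def score_l(A, B, entity_diff, alpha=1.0, beta=1.0):
    # entity-linking component: entity_diff(A, B) takes the place of |A|
    # in the denominator (assumed form)
    return beta * len(A & B) / (alpha * entity_diff(A, B) + len(B))

def combined_score(A, B, entity_diff, gamma=0.5, **kw):
    # equation (5): interpolate the two components with gamma
    return (gamma * score_e(A, B, **kw)
            + (1 - gamma) * score_l(A, B, entity_diff, **kw))

A, B = {"李娜", "澳网"}, {"李娜", "澳网"}
print(combined_score(A, B, entity_diff=lambda a, b: len(a)))
```

With a perfect prediction and the default coefficients, both components reach their maximum for these denominators, so the combined score is their gamma-weighted value.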
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>In this paper, an ILP-based approach has been proposed for CCKS2017 shared task one. The approach harnesses the rich feature types provided by the question context and the linked-data source to constrain a semantic-coherence objective function, using a GA to tune the parameters, and achieves the best performance in the task. Future work includes mining additional features, improving the computational efficiency of the online model, and extending its applicability to other corpora.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Architecture of the online system</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Overlap-Mention-Entity Graph Example</figDesc><graphic coords="4,135.72,147.36,330.12,268.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>We evaluate our system on CCKS2017 shared task one, QEDL. 1400 questions are provided as training data, with mentions and entities annotated, while another 750 questions serve as test data without annotations. The training and testing procedures are carried out on CN-DBpedia, which consists of 16,601,597 Baike entities and 213,945,421 Baike relationships. Output Examples of Our System.</figDesc><table><row><cell>entityDiff(A, B) is the count of</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Mention detection: First steps in the development of a Basque coreference resolution system</title>
		<author>
			<persName><forename type="first">Ander</forename><surname>Soraluze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KONVENS</title>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">UBIU: A language-independent system for coreference resolution</title>
		<author>
			<persName><forename type="first">Desislava</forename><surname>Zhekova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandra</forename><surname>Kübler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics</title>
				<meeting>the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Stanford&apos;s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task</title>
		<author>
			<persName><forename type="first">Heeyoung</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifteenth conference on computational natural language learning: Shared task. Association for Computational Linguistics</title>
				<meeting>the fifteenth conference on computational natural language learning: Shared task. Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The IBM systems for English entity discovery and linking and Spanish entity linking at TAC 2014</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Florian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text Analysis Conference (TAC)</title>
				<meeting><address><addrLine>Gaithersburg, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A generative entity-mention model for linking entities with knowledge base</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1</title>
				<meeting>the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="945" to="954" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Entity linking with a knowledge base: Issues, techniques, and solutions</title>
		<author>
			<persName><forename type="first">W</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="443" to="460" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Natural language questions for the web of data</title>
		<author>
			<persName><forename type="first">Mohamed</forename><surname>Yahya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics</title>
				<meeting>the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Robust disambiguation of named entities in text</title>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Hoffart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Survey On Genetic Algorithm</title>
		<author>
			<persName><forename type="first">Lingen</forename><surname>Ji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Applications and Software</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="69" to="73" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Survey On Genetic Algorithm</title>
		<author>
			<persName><forename type="first">Jike</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuhui</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chunming</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><surname>Pu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Application Research</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="2911" to="2916" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
