=Paper=
{{Paper
|id=Vol-1976/paper02
|storemode=property
|title=Unsupervised Resource-Free Entity Discovery and Linking in Natural Language Questions
|pdfUrl=https://ceur-ws.org/Vol-1976/paper02.pdf
|volume=Vol-1976
|authors=Shu Guo,Jiangxia Cao,Quan Wang,Lihong Wang,Bin Wang
}}
==Unsupervised Resource-Free Entity Discovery and Linking in Natural Language Questions==
Shu Guo1,2, Jiangxia Cao3, Quan Wang1,2⋆, Lihong Wang4, and Bin Wang1,2

1 Institute of Information Engineering, Chinese Academy of Sciences
2 School of Cyber Security, University of Chinese Academy of Sciences
3 School of Computer Science & Technology, Heilongjiang University
4 National Computer Network Emergency Response Technical Team Coordination Center of China

Abstract. We present our solution to the CCKS 2017 question entity discovery and linking (QEDL) task. The task is to discover entity mentions in natural language questions and link them with their referent entities in a knowledge base (KB). For entity discovery, we devise recognition patterns based on word segmentation and POS tagging. For entity linking, we leverage contextual similarity refined by the rich side information contained in the KB. Our solution is fully unsupervised and resource-free, requiring neither labeled data nor auxiliary resources. Experimental results show that our solution is simple yet effective, achieving an F1-score of 44.3%, which ranks third in the QEDL task.

Keywords: Entity discovery, entity linking, natural language questions

1 Introduction

The CCKS 2017 question entity discovery and linking (QEDL) task1 is to recognize entity mentions in natural language questions, and link them with their referent entities in a given knowledge base (KB), i.e., CN-DBpedia [1]. For example, given the question “吴晓敏演过什么电视剧?/What TV shows did Xiaomin Wu play in?”, we should recognize the mention “吴晓敏/Xiaomin Wu” and link it to its referent entity “吴晓敏(演员)/Xiaomin Wu (actress)” in the KB. Such linking results are extremely useful for answering these questions [2].

Entity discovery and linking has long been regarded as a challenging task in natural language processing (NLP) [3, 4]. The specific scenario of CCKS 2017 QEDL further poses new challenges to this traditional NLP task.
– Entities are no longer restricted to the three classical types of person, location, and organization, but instead can be more generic, e.g., “手指/finger” and “发型/hairstyle”. Most currently available well-performing systems (usually trained on massive labeled data) can only recognize entities of the three classical types, and hence fail to work here.
– Questions are very short, containing 13 Chinese characters on average, which cannot provide sufficient contextual information for entity linking.
– Only a small number of training instances are provided, i.e., 1,400 questions with 1,980 entities manually annotated, some of which might even be mislabeled. For instance, in the question “像我这脸型适合剪什么发型?/What hairstyle fits my facial shape?”, “发型/hairstyle” is labeled as an entity while “脸型/facial shape” is not, although both have redirects in CN-DBpedia. This limited (and potentially inconsistent) supervision makes it difficult to train supervised models for both entity discovery and linking.

To address these challenges, we devise a fully unsupervised method for QEDL. In our approach, entity discovery is conducted based solely on the results of word segmentation and POS tagging. Entity linking is performed by measuring contextual similarity between entity mentions and their referent entities in the KB. As questions are short with insufficient contexts, we further leverage side information, e.g., titles, types, and primary tags of entities, to refine contextual similarity. Our approach is fully unsupervised and resource-free, requiring neither labeled training data nor auxiliary resources like hand-crafted dictionaries or thesauri. It is simple yet effective, achieving an F1-score of 44.3%, which ranks third in the CCKS 2017 QEDL task.

⋆ Corresponding author: Quan Wang (wangquan@iie.ac.cn)
1 http://www.ccks2017.com/?page_id=51
2 Related Work

Entity discovery is closely related to named entity recognition (NER), which recognizes entities of specific types (person, location, and organization). Studies on NER roughly fall into three categories: 1) rule-based methods, which use hand-crafted rules and dictionaries to design recognition patterns; 2) machine learning-based methods, which pose NER as a sequence classification problem solved by, e.g., hidden Markov models or conditional random fields; and 3) hybrid methods, which combine rule-based and machine learning-based approaches. For more details about NER methods, please refer to [3].

Entity linking is to link textual mentions with their referent entities in a given knowledge base. Existing approaches can be roughly categorized into two groups: 1) supervised methods, which rely on massive annotated data to learn how to rank candidate entities for each textual mention; and 2) unsupervised methods, which do not require any annotated data to train the ranking model. See [4] for a thorough review of entity linking techniques. Given that only a small amount of annotated data is provided in the QEDL task, we employ a rule-based method for entity discovery and an unsupervised method for entity linking.

3 Our Approach

Fig. 1 provides a simple illustration of our approach. (Fig. 1. Simple illustration of our approach: a question such as “What TV shows did Xiaomin Wu play in?” passes through entity discovery, using recognition patterns based on word segmentation and POS tagging, and then through entity linking, which ranks candidate entities such as “吴晓敏(演员)/Xiaomin Wu (actress)” by contextual similarity and side information: entity titles, entity types, and entity primary tags.) Given a question, entity discovery is first conducted by using recognition patterns devised on the basis of word segmentation and POS tagging.
Entity linking is then performed by measuring contextual similarity between textual mentions and their referent entities, refined by rich side information in the KB. Our approach is fully unsupervised and resource-free, requiring neither labeled training data nor auxiliary resources.

3.1 Entity Discovery

Given an input natural language question, we employ the SWJTU Chinese word segmentation system to perform word segmentation and POS tagging. This system supports two segmentation modes, i.e., coarse-grained and fine-grained segmentation. The former uses a longest-matching algorithm, and the latter can further split words into smaller units. For example, “新浪微博/Sina Microblog” is segmented into a single word “新浪微博/nt” in coarse-grained segmentation, but into two separate words “[新浪/ntc 微博/n]/nt” in fine-grained segmentation. Here, “nt”, “ntc”, and “n” are POS tags.2 After word segmentation and POS tagging, we detect entity mentions as follows (see Table 1 for concrete examples).

Rule 1. Words with coarse-grained POS tags of nr (person name), ns (place), or nt (organization) are recognized as entity mentions, e.g., “霍建华”, “宁波”, and “普陀山”.

Rule 2. Words with coarse-grained POS tags of nz (proper noun) or n (noun) are recognized as entity mentions if they have redirects in the KB, e.g., “湘潭火车站”. Otherwise, any fine-grained units therein that have redirects are determined to be entity mentions, e.g., “中国” in the coarse-grained segmentation “[中国/ns 领土/n]/nz”.

Rule 3. Words with coarse-grained POS tags of nz (proper noun) can further be concatenated with their antecedent or succedent words. If the combined words have redirects in CN-DBpedia, they are also identified as entity mentions, e.g., “qq木马病毒” and “三星note2”.

2 These tags stand for organization, company name, and noun, respectively. A full description of POS tags is available at http://ics.swjtu.edu.cn.
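As a minimal, self-contained sketch, the three rules above can be implemented roughly as follows. This is illustrative Python, not the authors' code: the POS-tagged input is assumed to be pre-computed (the paper uses the SWJTU segmenter), and a small hard-coded set stands in for the CN-DBpedia redirect lookup.

```python
# Hypothetical stand-in for "does this string have a redirect in CN-DBpedia?"
KB_REDIRECTS = {"湘潭火车站", "中国", "钓鱼岛", "qq木马病毒", "三星note2"}

def has_redirect(s):
    return s in KB_REDIRECTS

def detect_mentions(words):
    """Apply Rules 1-3 to a POS-tagged question.

    words: list of (surface, coarse_tag, fine_units) triples, where
    fine_units is the fine-grained segmentation of the word
    (just [surface] if the word is not split further).
    """
    mentions = []
    for i, (w, tag, units) in enumerate(words):
        if tag in ("nr", "ns", "nt"):          # Rule 1: classical entity tags
            mentions.append(w)
        elif tag in ("nz", "n"):
            if has_redirect(w):                # Rule 2: whole word has a redirect
                mentions.append(w)
            else:                              # Rule 2: fall back to fine-grained units
                mentions.extend(u for u in units if has_redirect(u))
        if tag == "nz":                        # Rule 3: concatenate with neighbours
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    combined = words[j][0] + w if j < i else w + words[j][0]
                    if has_redirect(combined):
                        mentions.append(combined)
    return mentions
```

For example, on the segmentation “三星/nz note2/x” the sketch finds “三星note2” via Rule 3, and on “钓鱼岛/ns 属/v [中国/ns 领土/n]/nz” it finds “钓鱼岛” (Rule 1) and “中国” (Rule 2's fine-grained fallback).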
Table 1. Examples of recognition rules and corresponding discovered entity mentions.

{| class="wikitable"
! Rule !! Word segmentation & POS tagging results !! Entity mentions
|-
| rowspan="2" | Rule 1 || 霍建华/nr 演/v 过/uguo 哪些/ry 电视剧/n || 霍建华
|-
| 从/p 宁波/ns 到/v 普陀山/ns 怎么走/nz 最/d 方便/a || 宁波, 普陀山
|-
| rowspan="2" | Rule 2 || [湘潭/ns 火车站/n]/nz 什么/ry 时候/n 通车/vi || 湘潭火车站
|-
| 求/v 钓鱼岛/ns 属/v [中国/ns 领土/n]/nz 的/ude 资料/n || 中国, 钓鱼岛
|-
| rowspan="2" | Rule 3 || qq/x [木马/n 病毒/n]/nz 怎么/ryv 编写/v || qq木马病毒
|-
| 三星/nz note2/x 电池/n 怎么样/ryv || 三星note2
|}

3.2 Entity Linking

Entity linking consists of three modules: candidate entity selection, candidate entity ranking, and NIL (unlinkable mention) detection, detailed as follows.

Candidate entity selection. For each recognized mention, we query it directly in the CN-DBpedia search engine3 and retrieve a list of relevant entities. These entities are taken as candidates for that mention.

Candidate entity ranking. We rank the candidates for each mention by measuring their contextual similarity. Specifically, given a mention m and a candidate entity e, we construct two feature vectors m and e: the former is composed of the context words of the mention in the question, and the latter of the context words of the entity in its abstract. Only words tagged as nouns and verbs are considered. The contextual similarity between m and e can then be calculated as, e.g., the dot product of m and e, i.e., s(m, e) = ⟨m, e⟩. However, since both the question and the entity abstract are short, we might not get enough contextual information into m and e. We therefore propose to further use side information in CN-DBpedia, and calculate a refined contextual similarity s̃(m, e) = w × s(m, e). The candidate with the largest s̃(m, e) score is selected as the true referent.

Three types of side information are considered to calculate the refining factor w: entity title, entity type, and primary tag.

Entity title is the title of an entity page in CN-DBpedia. The intuition here is that referent entities usually have titles similar to their mentions (string matching).
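Before turning to the individual refining factors, the candidate-ranking computation described above — TF feature vectors over context words, a dot-product similarity s(m, e), and a multiplicative refining weight — can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: context words are assumed to be pre-extracted (the paper keeps only nouns and verbs), and the refining factor w is passed in as a function, so the title-, type-, and tag-based factors defined next can be plugged in.

```python
from collections import Counter

def tf_vector(context_words):
    # Term-frequency vector over already POS-filtered context words
    # (the paper keeps only nouns and verbs).
    return Counter(context_words)

def dot(u, v):
    # Dot-product similarity s(m, e) = <m, e> between sparse TF vectors.
    return sum(cnt * v[t] for t, cnt in u.items() if t in v)

def rank_candidates(mention_context, candidates, refine):
    """Rank candidates by the refined similarity s~(m, e) = w * s(m, e).

    mention_context: context words of the mention in the question.
    candidates: list of (entity_id, abstract_context_words) pairs.
    refine: entity_id -> refining factor w (e.g., w1 * w3 in the paper).
    Returns (entity_id, score) pairs, best first.
    """
    m = tf_vector(mention_context)
    scored = [(eid, refine(eid) * dot(m, tf_vector(ctx)))
              for eid, ctx in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With `refine = lambda e: 1.0` this reduces to plain contextual similarity; supplying the side-information factors yields the refined score used for ranking.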
For example, given the mention “格林豪泰”, the entity “格林豪泰” is more likely to be the true referent than “林豪泰”. We therefore define the refining factor as the string similarity between the mention m and the entity title e, i.e.,

w1 = 1 − edit(m, e) / max(|m|, |e|),

where |·| is the length of a string and edit(·, ·) the edit distance.

Entity type is the category to which an entity belongs, denoted as type(e). Usually, a mention can only be linked to entities of certain types, e.g., China/ns should be linked to countries. So we specify a type set T for each POS tag, e.g., T = {Place, Country, City} for ns (place), and define the refining factor as

w2 = 1, if type(e) ∈ T;
w2 = α1, if type(e) = ∅;
w2 = α2, if type(e) ∉ T and type(e) ≠ ∅,

where 0 ≤ α2 < α1 < 1. Here we specify type sets for only three POS tags, i.e., nr (person name), ns (place), and nt (organization).

Primary tags indicate the most popular entities for the given mentions. Candidates with primary tags are more likely to be the true referent. So we define the refining factor according to the presence or absence of a primary tag, i.e.,

w3 = 1, if e has a primary tag;
w3 = β, otherwise,

where β is a parameter in the range [0, 1).

These three weights can further be aggregated, giving a combined refining factor. For example, aggregating all three weights gives w = w1 × w2 × w3.

NIL detection. Not all mentions have referent entities in the KB. To detect such unlinkable mentions, we use a simple heuristic: mentions with no candidates after the candidate entity selection module are predicted to be NIL. To yield more accurate NIL predictions, we do this only for mentions discovered by Rule 1.

3 http://knowledgeworks.cn:30001/?p=** http://knowledgeworks.cn:20313/cndbpedia/api/entity?mention=**

4 Experiments

Datasets and evaluation metrics.
The training set consists of 1,400 questions with 1,917 mentions manually linked to their referent entities, and 63 mentions labeled as NIL. The test set consists of 749 unlabeled questions. As our approach is fully unsupervised, we use the training data only as a development set for parameter tuning. Submissions are finally evaluated on the test set. Three metrics, Precision, Recall, and F1-score, are used in the QEDL task.

Implementation details. We use Rule 1, Rule 2, and Rule 3 to detect entity mentions (Section 3.1). For entity linking (Section 3.2), we test different settings. In the calculation of contextual similarity, we use different term weighting schemes, including Boolean, TF, and TF-IDF [5], to compute the feature vectors, and we explore two similarity measures, i.e., cosine similarity (Cos) and dot product (Dot). In the calculation of the refining factor, we apply each of the three types of side information (Title, Type, and Tag) alone, obtaining refining factors of w1, w2, and w3, respectively. We also test all possible combinations, e.g., Title+Type, with refining factors such as w1 × w2. Due to space limitations, we only report the combination with the highest entity linking F1-score on the training set, i.e., Title+Tag, which gives a refining factor of w1 × w3. All parameters in our approach are determined by maximizing the entity linking F1-score on the training set. The optimal configuration is: α1 = 0.8, α2 = 0.75, and β = 0.3.

Results. Results of entity discovery and linking are shown in Table 2.

Table 2. Results of entity discovery and entity linking.

{| class="wikitable"
!  !! colspan="3" | Train !! colspan="3" | Test
|-
!  !! Precision !! Recall !! F1 !! Precision !! Recall !! F1
|-
| Entity Discovery || 0.515 || 0.715 || 0.599 || 0.530 || 0.744 || 0.619
|-
! colspan="7" | Entity Linking
|-
| Boolean (Cos) || 0.193 || 0.291 || 0.232 || 0.246 || 0.348 || 0.288
|-
| Boolean (Dot) || 0.273 || 0.376 || 0.316 || 0.295 || 0.415 || 0.345
|-
| TF (Cos) || 0.225 || 0.310 || 0.261 || 0.296 || 0.419 || 0.347
|-
| TF (Dot) || 0.305 || 0.424 || 0.355 || 0.337 || 0.477 || 0.395
|-
| TF-IDF (Cos) || 0.283 || 0.389 || 0.328 || 0.298 || 0.421 || 0.349
|-
| TF-IDF (Dot) || 0.245 || 0.337 || 0.283 || 0.248 || 0.351 || 0.291
|-
| TF (Dot)+Title || 0.315 || 0.437 || 0.366 || 0.345 || 0.487 || 0.404
|-
| TF (Dot)+Type || 0.309 || 0.428 || 0.359 || 0.349 || 0.493 || 0.408
|-
| TF (Dot)+Tag || 0.315 || 0.437 || 0.366 || 0.348 || 0.491 || 0.408
|-
| TF (Dot)+Title+Tag || 0.328 || 0.456 || 0.382 || 0.378 || 0.534 || 0.443
|}

For discovery, we can see that the recognition rules are simple yet effective in recognizing most entity mentions in short questions. However, the rules may still miss some difficult cases, such as “红/a 米/n note2/x”, which will be studied in our future work. For linking, we can see that: 1) the TF term weighting scheme combined with the dot-product similarity measure performs best in calculating contextual similarity; 2) incorporating each of the three types of side information alone further improves contextual matching; 3) among all possible combinations of side information, Title+Tag performs best, achieving an F1-score of 44.3% on the test set;4 4) the performance on the test set is better than that on the training set, which might indicate a higher annotation quality of the test data.

5 Conclusion

This paper introduces our solution to the CCKS 2017 QEDL task. We first devise recognition patterns based on word segmentation and POS tagging to discover mentions. Then, we utilize contextual similarity refined by rich side information for entity linking. Our solution is simple yet effective for short questions, achieving an F1-score of 44.3%, which ranks third in the QEDL task.

References

1. B. Xu, Y. Xu, J. Liang, C. Xie, B. Liang, W. Cui, Y. Xiao: CN-DBpedia: A Never-Ending Chinese Knowledge Extraction. In: Proceedings of IEA/AIE, pp. 428–438 (2017)
2. C. Welty, J. W. Murdock, A. Kalyanpur, J. Fan: A Comparison of Hard Filters and Soft Evidence for Answer Typing in Watson. In: Proceedings of ISWC, pp. 243–256 (2015)
3. A. Mansouri, L. S. Affendy, A. Mamat: Named Entity Recognition Approaches. IJCSNS, 8(2), pp. 339–344 (2008)
4. W. Shen, J. Wang, J. Han: Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2), pp. 443–460 (2015)
5. G. Salton, C. Buckley: Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5), pp. 513–523 (1988)

4 During the test phase, we refine the outputs using labeled data in the training set.