CEUR Workshop Proceedings Vol-1976, paper02: Unsupervised Resource-Free Entity Discovery and Linking in Natural Language Questions. https://ceur-ws.org/Vol-1976/paper02.pdf
    Unsupervised Resource-Free Entity Discovery
     and Linking in Natural Language Questions

Shu Guo1,2 , Jiangxia Cao3 , Quan Wang1,2⋆ , Lihong Wang4 , and Bin Wang1,2
            1
              Institute of Information Engineering, Chinese Academy of Sciences
        2
            School of Cyber Security, University of Chinese Academy of Sciences
            3
               School of Computer Science & Technology, Heilongjiang University
            4
              National Computer Network Emergency Response Technical Team /
                                 Coordination Center of China



        Abstract. We present the solution to the CCKS 2017 question entity
        discovery and linking (QEDL) task. This task is to discover and link en-
        tity mentions in natural language questions with their referent entities in
        a knowledge base (KB). For entity discovery, we devise recognition pat-
        terns based on word segmentation and POS tagging. For entity linking,
        we leverage contextual similarity refined by rich side information con-
        tained in the KB. Our solution is fully unsupervised and resource-free,
        requiring neither labeled data nor auxiliary resources. Experimental re-
        sults show that our solution is simple yet effective, achieving an F1-score
        of 44.3% which ranks the third in the QEDL task.

        Keywords: Entity discovery, entity linking, natural language questions


1     Introduction
The CCKS 2017 question entity discovery and linking (QEDL) task1 is to recog-
nize entity mentions from natural language questions, and link them with their
referent entities in a given knowledge base (KB), i.e., CN-DBpedia [1]. For ex-
ample, given a question “吴晓敏演过什么电视剧?/What TV shows did Xiaomin Wu
play in?”, we should recognize the mention “吴晓敏/Xiaomin Wu” and link it to
its referent entity “吴晓敏(演员)/Xiaomin Wu (actress)” in the KB. Such linking
results are extremely useful for answering these questions [2].
    Entity discovery and linking has long been regarded as a challenging task in
natural language processing (NLP) [3, 4]. The specific scenario of CCKS 2017
QEDL further poses new challenges to this traditional NLP task.
 – Entities are no longer restricted to the three classical types of person, loca-
   tion, and organization, but instead could be more generic, e.g., “手指/finger”
   and “发型/hairstyle”. Most of the currently available well-performing sys-
   tems (usually trained from massive labeled data) can only recognize entities
   of the three classical types, and hence fail to work here.
⋆ Corresponding author: Quan Wang (wangquan@iie.ac.cn)
1 http://www.ccks2017.com/?page_id=51

    – Questions are very short, containing only 13 Chinese characters on average,
      which cannot provide sufficient contextual information for entity linking.
    – Only a small number of training instances are provided, i.e., 1,400 ques-
      tions with 1,980 entities manually annotated, some of which might even be
      mislabeled. For instance, in the question “像我这脸型适合剪什么发型?/What
      hairstyle fits my facial shape?”, “发型/hairstyle” is labeled as an entity
      while “脸型/facial shape” is not, although both of them have redirects in
      CN-DBpedia. This limited (and potentially inconsistent) supervision makes
      it difficult to train supervised models for both entity discovery and linking.
    To address these challenges, we devise a fully unsupervised method for QEDL.
In our approach, entity discovery is conducted based solely on the results of word
segmentation and POS tagging. Entity linking is performed by measuring con-
textual similarity between entity mentions and their referent entities in the KB.
As questions are short with insufficient contexts, we further leverage side information,
e.g., titles, types, and primary tags of entities, to refine contextual similarity.
Our approach is fully unsupervised and resource-free, requiring neither labeled
training data nor auxiliary resources like hand-crafted dictionaries or thesauri.
Our approach is simple yet effective, achieving an F1-score of 44.3%, which ranks
third in the CCKS 2017 QEDL task.


2      Related Work
Entity discovery is closely related to named entity recognition (NER) which
recognizes entities of specific types (person, location, and organization). Studies
on NER roughly fall into three categories: 1) rule-based methods which use hand-
crafted rules and dictionaries to design recognition patterns; 2) machine learning-
based methods which pose NER as a sequence classification problem, solved by
hidden Markov models or conditional random fields; 3) hybrid methods which
combine rule-based and machine learning-based approaches. For more details
about NER methods, please refer to [3].
    Entity linking is to link textual mentions with their referent entities in a
given knowledge base. Existing approaches can be roughly categorized into two
groups: 1) supervised methods which rely on massive annotated data to learn
how to rank candidate entities for each textual mention; 2) unsupervised methods
which do not require any annotated data to train the ranking model. See [4] for
a thorough review of entity linking techniques.
    Given that only a small amount of annotated data is provided in the QEDL
task, we employ a rule-based method for entity discovery, and an unsupervised
method for entity linking.


3      Our Approach
Fig. 1 provides a simple illustration of our approach. Given a question, entity
discovery is first conducted by using recognition patterns devised on the basis of

[Figure: a question ("What TV shows did Xiaomin Wu play in?") passes through
Entity Discovery (recognition patterns based on word segmentation and POS
tagging) to produce a mention, whose candidate entities, e.g., Xiaomin Wu
(professor), Xiaomin Wu (actress), and Xiaomin Wu (mayor), are then ranked by
Entity Linking (contextual similarity plus side information: entity titles,
entity types, and entity primary tags).]

                        Fig. 1. Simple illustration of our approach.

word segmentation and POS tagging. Entity linking is then performed by mea-
suring contextual similarity between textual mentions and their referent entities,
refined by rich side information in the KB. Our approach is fully unsupervised
and resource-free, requiring neither labeled training data nor auxiliary resources.


3.1     Entity Discovery

Given an input natural language question, we employ the SWJTU Chinese word
segmentation system to perform word segmentation and POS tagging. This system
supports two segmentation granularities, i.e., coarse-grained and fine-grained.
The former uses a longest-matching algorithm, while the latter splits words
into smaller units. For example, “新浪微博/Sina Microblog”
is segmented into a single word “新浪微博/nt” in coarse-grained segmentation,
but two separate words “[新浪/ntc 微博/n]/nt” in fine-grained segmentation.
Here, “nt”, “ntc”, and “n” are POS tags.2 After word segmentation and POS
tagging, we detect entity mentions as follows (see Table 1 for concrete examples).

Rule 1 Words with coarse-grained POS tags of nr (person name), ns (place),
   and nt (organization) are recognized as entity mentions, e.g., “霍建华”, “宁
   波”, and “普陀山”.
Rule 2 Words with coarse-grained POS tags of nz (proper noun) and n (noun)
   are recognized as entity mentions if they have redirects in the KB, e.g., “湘潭
   火车站”. Otherwise, any fine-grained units therein that have redirects are de-
   termined as entity mentions, e.g., “中国” in the coarse-grained segmentation
   “[中国/ns 领土/n]/nz”.
Rule 3 Words with coarse-grained POS tags of nz (proper noun) can further
   be concatenated with their antecedent or succedent words. If the combined
   words have redirects in CN-DBpedia, they are also identified as entity men-
   tions, e.g., “qq木马病毒” and “三星note2”.

2 They stand for organization, company name, and noun, respectively. A full
  description of POS tags is available at http://ics.swjtu.edu.cn.

Table 1. Examples of recognition rules and corresponding discovered entity mentions.

          Word segmentation & POS tagging results              Entity mentions
  Rule 1  霍建华/nr 演/v 过/uguo 哪些/ry 电视剧/n                  霍建华
          从/p 宁波/ns 到/v 普陀山/ns 怎么走/nz 最/d 方便/a        宁波, 普陀山
  Rule 2  [湘潭/ns 火车站/n]/nz 什么/ry 时候/n 通车/vi             湘潭火车站
          求/v 钓鱼岛/ns 属/v [中国/ns 领土/n]/nz 的/ude 资料/n    中国, 钓鱼岛
  Rule 3  qq/x [木马/n 病毒/n]/nz 怎么/ryv 编写/v                 qq木马病毒
          三星/nz note2/x 电池/n 怎么样/ryv                       三星note2
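The three rules can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes the segmenter output is available as (word, coarse tag, fine-grained units) tuples, and `has_redirect` is a hypothetical callback standing in for a CN-DBpedia redirect lookup.

```python
def discover_mentions(tokens, has_redirect):
    """Rule-based mention discovery (sketch).

    tokens: list of (word, coarse_tag, fine_units) tuples, where fine_units
            is a list of (sub_word, sub_tag) pairs from the fine-grained
            segmentation (empty if the word is not further split).
    has_redirect: hypothetical callable str -> bool, True if the KB has a
            redirect page for the string.
    """
    mentions = []
    for i, (word, tag, fine_units) in enumerate(tokens):
        if tag in ("nr", "ns", "nt"):
            # Rule 1: person names, places, and organizations.
            mentions.append(word)
        elif tag in ("nz", "n"):
            # Rule 2: proper nouns and nouns that have a redirect in the KB;
            # otherwise fall back to their fine-grained units.
            if has_redirect(word):
                mentions.append(word)
            else:
                mentions.extend(w for w, _ in fine_units if has_redirect(w))
        if tag == "nz":
            # Rule 3: try concatenating with the preceding/following word.
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    combo = tokens[j][0] + word if j < i else word + tokens[j][0]
                    if has_redirect(combo):
                        mentions.append(combo)
    return mentions
```

For instance, on the segmentation “三星/nz note2/x”, Rule 3 would test the concatenation “三星note2” against the KB.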


3.2    Entity Linking

Entity linking consists of three modules: candidate entity selection, candidate
entity ranking, and NIL (unlinkable entities) detection, detailed as follows.
Candidate entity selection. For each recognized mention, we query it directly
in the CN-DBpedia search engine3 and retrieve a list of relevant entities. These
entities are taken as candidates for that mention.
Candidate entity ranking. We rank candidates for each mention by measuring
their contextual similarity. Specifically, given a mention m and a candidate entity
e, we construct two feature vectors m and e for them. The former is composed
of context words of the mention in the question, and the latter context words
of the entity in its abstract. Here only words that are tagged as noun and verb
are considered. The contextual similarity between m and e can be calculated as,
e.g., the dot product of m and e, i.e., s(m, e) = ⟨m, e⟩. However, since both the
question and the entity abstract are short, we might not get enough contextual
information in m and e. So we propose to further use side information in CN-
DBpedia, and calculate a refined contextual similarity s̃(m, e) = w × s(m, e).
The candidate with the largest s̃(m, e) score is selected as the true referent.
Three types of side information are considered to calculate the refining factor w,
including entity title, entity type, and primary tag.
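As a minimal sketch (not the authors' code), the unrefined similarity s(m, e) under TF weighting reduces to a dot product of word-count vectors; extracting the noun/verb context words is assumed to happen upstream:

```python
from collections import Counter

def contextual_similarity(mention_ctx, entity_ctx):
    # TF-weighted dot product: mention_ctx holds the noun/verb context words
    # of the mention in the question, entity_ctx those of the candidate
    # entity's abstract.
    m, e = Counter(mention_ctx), Counter(entity_ctx)
    return sum(m[w] * e[w] for w in m)
```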
    Entity title is the title of an entity page in CN-DBpedia. The intuition here is
that referent entities usually have titles similar to their mentions (string
matching). For example, given the mention “格林豪泰”, the entity “格林豪泰” is
more likely to be the true referent than “林豪泰”. So we define the refining
factor as string similarity between the mention m and entity title e, i.e.,
                        w1 = 1 − edit(m, e) / max(|m|, |e|),
where | · | is the length of a string, and edit(·, ·) the edit distance.
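A direct sketch of w1, using a standard dynamic-programming Levenshtein distance (the paper does not specify which edit-distance implementation was used):

```python
def edit_distance(a, b):
    # Levenshtein distance computed with a single rolling row of the DP table.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def title_factor(mention, title):
    # w1 = 1 - edit(m, e) / max(|m|, |e|)
    return 1 - edit_distance(mention, title) / max(len(mention), len(title))
```

For the example above, title_factor("格林豪泰", "格林豪泰") is 1.0, while title_factor("格林豪泰", "林豪泰") is 0.75.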
    Entity type is the category to which an entity belongs, denoted as type(e).
Usually, a mention can only be linked to entities of certain types, e.g., China/ns
should be linked to countries. So we specify a type set T for each POS tag, e.g.,
3 http://knowledgeworks.cn:30001/?p=**
  http://knowledgeworks.cn:20313/cndbpedia/api/entity?mention=**

T = {Place, Country, City} for ns (place), and define the refining factor as

                   w2 = 1,  if type(e) ∈ T;
                        α1, if type(e) = ∅;
                        α2, if type(e) ∉ T and type(e) ≠ ∅,

where 0 ≤ α2 < α1 < 1. Here we specify type sets for only three POS tags, i.e.,
nr (person name), ns (place), and nt (organization).
    Primary tags indicate the most popular entities of the given mentions. Can-
didates with primary tags are more likely to be the true referent. So we define
the refining factor according to the presence or absence of primary tags, i.e.,
                   w3 = 1, if e has a primary tag;
                        β, otherwise,
where β is a parameter in the range of [0, 1). These three weights can further be
aggregated together, giving a combined refining factor. For example, aggregating
all the three weights gives a combined refining factor of w = w1 × w2 × w3 .
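Putting the pieces together, the refined score can be sketched as below; the function and argument names are illustrative rather than from the paper, and the default parameter values are the tuned configuration reported in Section 4 (α1 = 0.8, α2 = 0.75, β = 0.3):

```python
def refined_score(s, w1, etype, allowed_types, has_primary_tag,
                  alpha1=0.8, alpha2=0.75, beta=0.3):
    """Compute the refined similarity s~(m, e) = w1 * w2 * w3 * s(m, e).

    s: contextual similarity s(m, e); w1: title (string-similarity) factor;
    etype: the entity's type, or None if it has no type;
    allowed_types: the type set T for the mention's POS tag.
    """
    if etype is None:
        w2 = alpha1                  # entity has no type in the KB
    elif etype in allowed_types:
        w2 = 1.0                     # type agrees with the POS tag
    else:
        w2 = alpha2                  # type disagrees
    w3 = 1.0 if has_primary_tag else beta
    return w1 * w2 * w3 * s
```

The candidate with the highest refined score is then selected as the referent.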
NIL detection. Not all mentions have referent entities in the KB. To detect
such unlinkable mentions, we use a simple heuristic: mentions with no candidates
after performing the candidate entity selection module are predicted to be NIL.
To yield more accurate NIL, we do this only for mentions discovered by Rule 1.
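The NIL heuristic amounts to a one-line check (argument names here are illustrative, not from the paper):

```python
def is_nil(mention, candidates, rule1_mentions):
    # Predict NIL only for Rule-1 mentions that retrieved no candidates
    # from the KB search engine.
    return mention in rule1_mentions and not candidates
```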


4   Experiments

Datasets and evaluation metrics. The training set consists of 1,400 ques-
tions with 1,917 mentions manually linked with their referent entities, and 63
mentions labeled as NIL. The test set consists of 749 unlabeled questions. As
our approach is fully unsupervised, we use the training data as a development
set only for parameter tuning. Submissions are finally evaluated on the test set.
Three metrics, Precision, Recall, and F1-score, are used for the QEDL task.
Implementation details. We use Rule 1, Rule 2, and Rule 3 to detect entity
mentions (Section 3.1). For entity linking (Section 3.2), we test different
settings. In the calculation of contextual similarity, we use different term weighting
schemes including Boolean, TF, and TF-IDF [5] to compute feature vectors, and
we explore two similarity measures, i.e., cosine similarity (Cos) and dot product
(Dot). In the calculation of refining factor, we apply each of the three types of
side information (Title, Type, and Tag) alone, and get a refining factor of w1 , w2 ,
and w3 , respectively. We also test all possible combinations, e.g., Title+Type,
and get a refining factor of, e.g., w1 × w2. Due to space limitations, we only
report the combination with the highest entity linking F1-score on the training
set, i.e., Title+Tag which gives a refining factor of w1 ×w3 . All parameters in our
approach are determined by maximizing entity linking F1-score on the training
set. The optimal configurations are: α1 = 0.8, α2 = 0.75, and β = 0.3.
Results. Results of entity discovery and linking are shown in Table 2. For discov-
ery, we can see that the recognition rules are simple yet effective in recognizing

                             Table 2. Results of entity discovery and entity linking.

                                          Train                    Test
                              Precision Recall   F1    Precision Recall   F1
  Entity Discovery              0.515   0.715  0.599     0.530   0.744  0.619
  Entity Linking
    Boolean (Cos)               0.193   0.291  0.232     0.246   0.348  0.288
    Boolean (Dot)               0.273   0.376  0.316     0.295   0.415  0.345
    TF (Cos)                    0.225   0.310  0.261     0.296   0.419  0.347
    TF (Dot)                    0.305   0.424  0.355     0.337   0.477  0.395
    TF-IDF (Cos)                0.283   0.389  0.328     0.298   0.421  0.349
    TF-IDF (Dot)                0.245   0.337  0.283     0.248   0.351  0.291
    TF (Dot)+Title              0.315   0.437  0.366     0.345   0.487  0.404
    TF (Dot)+Type               0.309   0.428  0.359     0.349   0.493  0.408
    TF (Dot)+Tag                0.315   0.437  0.366     0.348   0.491  0.408
    TF (Dot)+Title+Tag          0.328   0.456  0.382     0.378   0.534  0.443


most entity mentions in short questions. However, the rules we created may
still miss some difficult cases such as “红/a 米/n note2/x”, which will be stud-
ied in our future work. For linking, we can see that: 1) the TF term weighting
scheme combined with the dot product similarity measure performs the best in
calculating contextual similarity; 2) incorporating each of the three types of side
information alone can further improve contextual matching; 3) among all possible
combinations of side information, Title+Tag performs the best, achieving
an F1-score of 44.3% on the test set;4 4) the performance on the test set is better
than that on the training set, which might indicate higher annotation quality of
the test data.


5                  Conclusion
This paper introduces our solution to the CCKS 2017 QEDL task. We first devise
recognition patterns based on word segmentation and POS tagging to discover
mentions. Then, we utilize contextual similarity refined by rich side informa-
tion for entity linking. Our solution is simple yet effective for short questions,
achieving an F1-score of 44.3%, which ranks third in the QEDL task.


References
 1. Xu, B., Xu, Y., Liang, J., Xie, C., Liang, B., Cui, W., Xiao, Y.: CN-DBpedia: A Never-Ending
    Chinese Knowledge Extraction. In: Proceedings of IEA/AIE, pp. 428–438 (2017)
 2. Welty, C., Murdock, J.W., Kalyanpur, A., Fan, J.: A Comparison of Hard Filters and Soft
    Evidence for Answer Typing in Watson. In: Proceedings of ISWC, pp. 243–256 (2015)
 3. Mansouri, A., Affendy, L.S., Mamat, A.: Named Entity Recognition Approaches. IJCSNS, 8(2),
    pp. 339–344 (2008)
 4. Shen, W., Wang, J., Han, J.: Entity Linking with a Knowledge Base: Issues, Techniques, and
    Solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2), pp. 443–460 (2015)
 5. Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information
    Processing & Management, 24(5), pp. 513–523 (1988)


4 During the test phase, we refine the outputs using labeled data in the training set.