Collective Learning From Diverse Datasets for Entity Typing in the Wild

Abhishek Abhishek, Ashish Anand, Amit Awekar
Indian Institute of Technology Guwahati, Guwahati, Assam, India
abhishek.abhishek@iitg.ac.in, anand.ashish@iitg.ac.in, awekar@iitg.ac.in

Amar Prakash Azad, Balaji Ganesan
IBM Research Lab, India
amarazad@in.ibm.com, bganesa1@in.ibm.com
ABSTRACT
Entity typing (ET) is the problem of assigning labels to entity mentions in a sentence. Existing work on ET requires knowledge of the domain and target label set of a given test instance. ET in the absence of such knowledge is a novel problem that we address as ET in the wild. We hypothesize that the solution to this problem is to build supervised models that generalize better on the ET task as a whole, rather than on a specific dataset. In this direction, we propose a Collective Learning Framework (CLF), which enables learning from diverse datasets in a unified way. The CLF first creates a unified hierarchical label set (UHLS) and a label mapping by aggregating label information from all available datasets. It then builds a single neural network classifier using the UHLS, the label mapping, and a partial loss function. The single classifier predicts the finest possible label across all available domains, even though these labels may not be present in any domain-specific dataset. We also propose a set of evaluation schemes and metrics to evaluate the performance of models on this novel problem. Extensive experimentation on seven diverse real-world datasets demonstrates the efficacy of our CLF.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Machine learning.

KEYWORDS
entity typing, hierarchy creation, learning from multiple datasets

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). EYRE'19 Workshop, CIKM, November 2019, Beijing, China.

1 INTRODUCTION
The evolution of ET has led to the generation of multiple datasets. These datasets differ from each other in their domain, their label set, or both. Here, the domain of a dataset represents the data distribution of its sentences, and the label set represents the entity types annotated. Existing work on ET requires knowledge of the domain and the target label set of a test instance [22]. Figure 1 illustrates this issue, where four learning models are typing four entity mentions. We can observe that, in order to make a reasonable prediction (output with a solid border), labels must be assigned by a model that was trained on a dataset with a domain and label set similar to those of the test instance. However, the domain and target label information of a test instance is unknown in several NLP applications, such as entity ranking for web question answering systems [6] and knowledge base completion [7], where ET models are used.

[Figure 1: The output of four learning models on typing four entity mentions. For example, the model M1 trained on the CoNLL dataset assigned the type ORG to the entity mention Wallaby, from the same dataset.]

We address ET in the absence of domain and target label set knowledge as the ET in the wild problem. As a result, we have to predict the best possible labels for all test instances, as illustrated in Figure 1 (output with a dashed line border). These labels may not be present in the same-domain dataset. For example, consider the prediction of the label sports team for the entity mention Wallaby, when the best possible fine-grained label (sports team) is not present in the same-domain CoNLL dataset [25]. We hypothesize that the solution to this problem is to build supervised models that generalize better on the ET task as a whole, rather than on a specific dataset. This solution requires collective learning from several diverse datasets.

However, collectively learning from diverse datasets is a challenging problem. Figure 2 illustrates the diversity of seven ET datasets. We can observe that every dataset provides some distinct information for the ET task, such as its domain and labels. For example, the CADEC dataset [11] contains informally written sentences from a medical forum, whereas the JNLPBA dataset [12] contains formally written sentences from scientific abstracts in the life sciences. Moreover, there is an overlap in the label sets as well as a relation between the labels of these datasets. For example, both the CoNLL and Wiki [14] datasets have a label person, but only the Wiki dataset has a label athlete, a subtype of person. This means that the CoNLL dataset can also contain athlete mentions that were only annotated with the coarse label person. Thus, learning collectively from these diverse datasets
requires models to learn useful features or representations of sentences from diverse domains, as well as the relations among labels.

[Figure 2: Illustration of the diversity of the seven ET datasets in their label sets and domains.]

This study proposes a collective learning framework (CLF) for the ET in the wild problem. The CLF first builds a unified hierarchical label set (UHLS) and an associated label mapping by pooling labels from diverse datasets. Then, a single classifier¹ collectively learns from the pooled dataset using the UHLS, the label mapping, and a partial hierarchy aware loss function.

¹ We use the term single classifier to denote a learning model with a single classification head trained on multiple datasets with different labels together.

In the UHLS, nodes are contributed by different datasets, and a parent-child relation between nodes translates to a coarse-fine label relation. During construction of the UHLS, a mapping from every dataset-specific label to the UHLS nodes is also constructed. We expect one-to-many mappings, as is the case in real-world datasets: for example, a coarse-grained label from one dataset could be mapped to multiple nodes in the UHLS introduced by some other dataset. During the UHLS construction, human judgment is used when comparing two labels. This effort is four orders of magnitude smaller than annotating every dataset with fine-grained labels.

Utilizing the UHLS and the mapping, we can view several domain-specific datasets as a single multi-domain dataset with a shared label set. On this combined dataset, we use an LSTM [10] based encoder to learn a useful representation of the text, followed by a partial hierarchical loss function [29] for label classification. This setup enables a single neural network classifier to predict fine-grained labels across all domains, even though the fine-grained label may not be present in any in-domain dataset.

We also propose a set of evaluation schemes and metrics for the ET in the wild problem. In our evaluation schemes, we evaluate learning models' performance on a test set formed by merging the test instances of seven diverse datasets. To excel on this merged test set, learning models must generalize beyond a single dataset. Our evaluation metrics are designed to measure a learning model's ability to predict the best possible fine-grained label. We compared a single classifier model trained with our proposed framework against an ensemble of various models. Our model outperforms competitive baselines by a significant margin.

[Figure 3: An overview of the proposed collective learning framework.]

Our contributions can be highlighted as below:
(1) We propose the novel problem of ET in the wild, with the objective of building better generalizable ET models (§ 2).
(2) We propose a novel collective learning framework which makes it possible to train a single classifier on an amalgam of diverse ET datasets, enabling fine-grained prediction across all the datasets, i.e., a generalized model for the ET task as a whole (§ 3).
(3) We propose evaluation schemes and evaluation metrics to compare learning models in the ET in the wild problem setting (§ 4.5, 4.6).

2 TERMINOLOGIES AND PROBLEM DEFINITION
In this section, we formally define the ET in the wild problem and related terminologies.

Dataset: A dataset, D, is a collection (X, D, Y). Here, X corresponds to a corpus of sentences with entity boundaries annotated, D corresponds to the domain, and Y = {y_1, ..., y_n} is the set of labels used to annotate each entity mention in X. It is possible that two datasets share a domain but differ in their label sets, or vice versa. Here the domain means data characteristics such as writing style and vocabulary. For example, sentences in the CoNLL dataset are sampled from Reuters news stories from around 1999, whereas sentences in the CADEC dataset are from medical forum posts from around 2015; thus, these datasets have different domains.

Label space: A label space L(y) for a particular label y is defined as the set of entities that can be assigned the label y. For example, the label space for a label car includes mentions of all cars, including the label spaces of different car types such as hatchback, SUV, etc. Across datasets, even if two labels with the same name exist, their label spaces can differ. The label space information is defined in the annotation guidelines used to create each dataset.
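To make these definitions concrete, the following is a minimal Python sketch of the dataset abstraction (the names Mention and ETDataset are hypothetical; this is an illustration, not part of our released code):

from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass
class Mention:
    sentence: List[str]      # tokenized sentence
    start: int               # mention start token index (inclusive)
    end: int                 # mention end token index (exclusive)
    labels: FrozenSet[str]   # gold labels; a singleton for every dataset except Wiki

@dataclass
class ETDataset:
    name: str                # e.g. "CoNLL"
    domain: str              # data characteristics, e.g. "Reuters news stories, ~1999"
    mentions: List[Mention]  # the corpus X, with entity boundaries annotated

    def label_set(self) -> FrozenSet[str]:
        # Y: the set of labels used to annotate entity mentions in X
        return frozenset(l for m in self.mentions for l in m.labels)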
[Figure 4: A simplified illustration of the UHLS and the label mapping from individual datasets.]

Type Hierarchy: A type or label hierarchy, T, is a natural way to organize a label set. It is formally defined as (Y, R), where Y is the type set and R = {(y_i, y_j) | y_i, y_j ∈ Y, i ≠ j, L(y_i) ≺ L(y_j)} is the relation set, in which (y_i, y_j) means that y_i is a subtype of y_j, or in other words, the label space of y_i is subsumed within the label space of y_j.

ET in the Wild problem definition: Given n datasets, D_1, ..., D_n, each having its own domain and label set, D_i and Y_i respectively, the objective is to predict the best possible fine-grained label from the set of all labels, Y = ⋃_{i=1}^{n} Y_i, for a test entity mention. The fine-grained label might not be present in any in-domain dataset.

3 COLLECTIVE LEARNING FRAMEWORK (CLF)
Figure 3 provides a complete overview of the CLF, which is based on the following key observations and ideas:
(1) From the set of all available labels Y, it is possible to construct a type hierarchy T_u = (Y_u, R_u), where Y_u ⊆ Y (§ 3.1).
(2) We can map each y ∈ Y to one or more nodes in T_u, such that L(y) is the same as the label space of the union of the mapped nodes (§ 3.1).
(3) Using the above hierarchy and mapping, even for datasets where we only have coarse labels, i.e., labels mapped to non-leaf nodes, a learning model with a partial hierarchy aware loss function can predict fine labels (§ 3.2.2, 3.2.3).

3.1 Unified Hierarchical Label Set and Label Mapping
The labels of entity mentions can be arranged in a hierarchy. For example, the label space of airports is subsumed in the label space of facilities. In the literature, several hierarchies, such as WordNet [16] and ConceptNet [15], exist, and two ET datasets, BBN [27] and Wiki, organize their labels in a hierarchy. However, none of these hierarchies can be used directly, as discussed next.

Our analysis of the labels of several ET datasets suggests that the presence of the same label name in two or more datasets does not necessarily imply that their label spaces are the same. For example, in the CoNLL dataset, the label space of the label location includes facilities, whereas in the OntoNotes dataset [28] the location label space excludes facilities. These differences exist because the datasets were created by different organizations, at different times, and for different objectives. Figure 4 illustrates this label space interaction. Additionally, some of these labels are very specific to their domains, and not all of them are present in publicly available hierarchies such as WordNet or ConceptNet, or even in knowledge bases (Freebase [2] or WikiData [26]).

Thus, to construct the UHLS, we analyzed the annotation guidelines of several datasets and came up with an algorithm, formally described in Algorithm 1 and explained below.

Algorithm 1: UHLS and label mapping creation algorithm.
    Data: Y = ⋃_{i=1}^{n} Y_i
    Result: Unified Hierarchical Label Set (UHLS), T_u = (Y_u, R_u), and label mapping ϕ.
 1  Initialize: Y_u = {root}, R_u = {}
 2  for y ∈ Y do
 3      if ∃ S ⊆ Y_u s.t. L(y) == L(S) then               // Case 2
 4          ϕ(y) ↦ S
 5      else                                               // Case 1
 6          v = argmin_{size(L(v))} {v | v ∈ Y_u and L(y) ≺ L(v)}
 7          Y_u = Y_u ∪ {y}
 8          R_u = R_u ∪ {(y, v)}
 9          ϕ(y) ↦ y
10          for (x, v) ∈ R_u do                            // Update existing nodes
11              if x ≠ y and L(x) ≺ L(y) then
12                  R_u = R_u − {(x, v)}
13                  R_u = R_u ∪ {(x, y)}
14          for v̂ ∈ Y_u do                                // Restrict to a tree hierarchy
15              if L(v̂) ≺ L(y) and v̂ ∉ subtree(y) then
16                  ϕ(y) ↦ v̂

Given the set of all labels, Y, the goal is to construct a type hierarchy, T_u = (Y_u, R_u), and a label mapping ϕ : Y ↦ P(Y_u). Here, Y_u is the set of labels present in the hierarchy, R_u is the relation set, and P(Y_u) is the power set of the label set. To construct T_u, we start with an initial type hierarchy, which can be Y_u = {root}, R_u = {} or any existing hierarchy. We process each label y ∈ Y in turn, decide whether T_u needs to be updated, and update the mapping ϕ. For each label y there are only two possible cases: either T_u is updated or it is not.
Case 1, T_u is updated: In this case y is added as a child of an existing node in T_u, say v. While updating T_u, it is ensured that v = argmin_{size(L(v))} {v | v ∈ Y_u and L(y) ≺ L(v)}, i.e., L(v) is the smallest label space that completely subsumes the label space of y (lines 6-8). After the update, if the label space of y subsumes the space of any existing subtree rooted at v, then y becomes the root of those subtrees (lines 10-13). In this case the label mapping is updated as ϕ(y) ↦ y, i.e., the label in an individual dataset is mapped to the same label name in the UHLS. Additionally, if there exist any other nodes v̂ ∈ Y_u such that L(v̂) ≺ L(y) and v̂ ∉ subtree(y), we add ϕ(y) ↦ v̂ for all such nodes (lines 14-16). This additional condition ensures that even in cases where the actual hierarchy would be a directed acyclic graph, we restrict it to a tree by adding additional mappings.

Case 2, T_u is not updated: In this case, ∃ S ⊆ Y_u s.t. L(y) == L(S), i.e., there exists a subset of nodes whose union of label spaces is equal to the label space of y. If |S| > 1, intuitively this means that the label space of y is a mixed space, and labels with finer label spaces were added to Y_u from other datasets. If |S| = 1, it means that some other dataset added a label with the same label space. In this case we only update the label mapping, as ϕ(y) ↦ S (lines 3-4).
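The following is a minimal Python sketch of Algorithm 1, under the assumption that the two label-space queries (strict subsumption L(a) ≺ L(b), and an exact cover L(S) == L(y)) are answered by an oracle; as discussed next, in our setting this oracle is a domain expert. The names and data structures are illustrative, not those of our released code.

from collections import defaultdict

def build_uhls(all_labels, finer_than, exact_cover):
    # finer_than(a, b): oracle for "L(a) is strictly subsumed by L(b)";
    #   the root is treated as a universal label space, so
    #   finer_than(y, "root") is True for every label y.
    # exact_cover(y, nodes): oracle returning S, a subset of Y_u with
    #   L(S) == L(y), or None if no such subset exists.
    nodes = {"root"}               # Y_u
    parent = {}                    # R_u, stored as child -> parent edges
    phi = defaultdict(set)         # label mapping, Y -> P(Y_u)

    def subtree(y):                # y together with all of its descendants
        desc, frontier = {y}, {y}
        while frontier:
            frontier = {c for c, p in parent.items() if p in frontier}
            desc |= frontier
        return desc

    for y in all_labels:
        S = exact_cover(y, nodes - {"root"})
        if S:                                          # Case 2 (lines 3-4)
            phi[y] |= set(S)
            continue
        # Case 1 (lines 6-9): attach y under the subsuming node whose label
        # space is smallest, i.e. the candidate that no other candidate refines.
        cand = [v for v in nodes if finer_than(y, v)]
        v = next(c for c in cand
                 if not any(finer_than(d, c) for d in cand if d != c))
        nodes.add(y)
        parent[y] = v
        phi[y] |= {y}
        # Lines 10-13: children of v whose spaces y subsumes move under y.
        for x in [x for x, p in parent.items()
                  if p == v and x != y and finer_than(x, y)]:
            parent[x] = y
        # Lines 14-16: map y onto finer nodes elsewhere in the tree instead
        # of adding edges, keeping the hierarchy a tree rather than a DAG.
        for w in nodes - subtree(y) - {"root"}:
            if finer_than(w, y):
                phi[y] |= {w}
    return nodes, parent, dict(phi)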
                                                                                          the situation where training example have a set of candidate labels
    In Algorithm 1 whenever a decision has to be made related to a
                                                                                          and among which only a subset is correct for that given example
comparison between two label spaces, we refer a domain expert.
                                                                                          [4, 18, 30].
The expert makes the decision based on the annotation guidelines
                                                                                             In our case, this situation arises because of the mapping of the
for the queried labels and using existing organization of the queried
                                                                                          individual dataset labels to the UHLS. We use a hierarchy aware
label space in WordNet or Freebase if the queried labels are present
                                                                                          partial loss function as proposed in [29]. We first compute the
in these resources. We argue that since the overall size of Y is
                                                                                          probability distribution for the labels available in Yu as described
several order of magnitude less than the size of annotated instances
                                                                                          in equation 1. Here W is a weight matrix of size |R| × |Yu | and x is
(≈ 250 << ≈ 3 × 106 ), having a human in the loop preserves
                                                                                          the input entity mention along with its context.
the overall semantic property of the tree, which will be exploited
by a partial loss function to enable fine-grained prediction across                                           p(y|x) = so f tmax(RW + b)                       (1)
domains. An illustration of UHLS and label mapping is provided in
                                                                                          Then we compute p̂(y|x), a distribution adjusted to include a weighted
Figure 4.
                                                                                          sum of the ancestors probability for each label as defined in equa-
    In the next section, we will describe how the UHLS and the label
                                                                                          tion 2. Here At is the set of ancestors of the label y in Ru and β is
mapping will be used by a learning model to make finest possible
                                                                                          a hyperparameter.
predictions across datasets.                                                                                                       Õ
                                                                                                          p̂(y|x) = p(y|x) + β ∗        p(t |x)             (2)
3.2 Learning Model
Our learning model can be decomposed into two parts: (1) Neural Mention and Context Encoders, which encode the entity mention and its surrounding context into a feature vector; and (2) a Unified Type Predictor, which infers entity types in the UHLS.

3.2.1 Neural Mention and Context Encoder. The input to our model is a sentence with the start and end indices of entity mentions. Following the work of [1, 24, 29], we use bi-directional LSTMs [8] to encode the left and right context surrounding the entity mention, and a character-level LSTM to encode the entity mention itself. We then concatenate the outputs of the three encoders to generate a single representation (R) for the input.
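As an illustration, here is a minimal PyTorch sketch of this encoder; the embedding dimensions and hidden sizes are placeholders, not the hyperparameters used in our experiments.

import torch
import torch.nn as nn

class MentionContextEncoder(nn.Module):
    # Bi-directional LSTMs over the left and right context plus a
    # character-level LSTM over the mention, concatenated into R.
    def __init__(self, word_dim=300, char_dim=50, hidden=100):
        super().__init__()
        self.left_lstm = nn.LSTM(word_dim, hidden, bidirectional=True,
                                 batch_first=True)
        self.right_lstm = nn.LSTM(word_dim, hidden, bidirectional=True,
                                  batch_first=True)
        self.char_lstm = nn.LSTM(char_dim, hidden, batch_first=True)

    def forward(self, left_ctx, right_ctx, mention_chars):
        # left_ctx, right_ctx: (batch, seq_len, word_dim) word embeddings;
        # mention_chars: (batch, num_chars, char_dim) character embeddings.
        _, (h_left, _) = self.left_lstm(left_ctx)
        _, (h_right, _) = self.right_lstm(right_ctx)
        _, (h_char, _) = self.char_lstm(mention_chars)
        # Concatenate the final hidden states of the three encoders.
        R = torch.cat([h_left[-2], h_left[-1],
                       h_right[-2], h_right[-1],
                       h_char[-1]], dim=-1)
        return R  # (batch, 5 * hidden)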
3.2.2 Unified Type Predictor. Given the input representation, R, the objective of the predictor is to assign a type from the unified label set Y_u. Thus, during model training, using the mapping function ϕ : Y ↦ P(Y_u), we convert individual dataset-specific labels to the unified label set Y_u. Due to the one-to-many mapping, there are now multiple positive labels available for each individual input label y. Let us call the mapped label set for an input label y as Y_m. Now, if any mapped label ŷ ∈ Y_m has descendants, then the descendants are also added to Y_m². For example, if the label GPE from the OntoNotes dataset is mapped to the label GPE in the UHLS, then GPE as well as all descendants of GPE are possible candidates. This is because, even though the original example in OntoNotes may be the name of a city, the annotation guidelines restrict the fine-labeling. Thus the mapped set would be updated to {GPE, City, Country, County, ...}. Additionally, some labels have a one-to-many mapping; for example, for the label MISC in the CoNLL dataset, the candidate labels could be {product, event, ...}.

² This is exempted when the annotated label is a coarse label and a fine label from the same dataset exists in the subtree.

From the set of mapped candidate labels, a partial label loss function selects the best candidate label. Due to the inherent design of the UHLS and the label mapping, there will always be examples that are mapped to only a single leaf node. Thus, allowing fine labels in the candidate sets of actually coarse labels encourages the model to predict finer labels across datasets.

3.2.3 Partial Hierarchical Label Loss. A partial label loss deals with the situation where a training example has a set of candidate labels, among which only a subset is correct for that given example [4, 18, 30].

In our case, this situation arises because of the mapping of the individual dataset labels to the UHLS. We use the hierarchy aware partial loss function proposed in [29]. We first compute the probability distribution over the labels available in Y_u as described in Equation 1. Here W is a weight matrix of size |R| × |Y_u| and x is the input entity mention along with its context.

p(y|x) = softmax(RW + b)    (1)

Then we compute p̂(y|x), a distribution adjusted to include a weighted sum of the ancestor probabilities for each label, as defined in Equation 2. Here A_y is the set of ancestors of the label y in R_u and β is a hyperparameter.

p̂(y|x) = p(y|x) + β · Σ_{t ∈ A_y} p(t|x)    (2)

Then we normalize p̂(y|x). From this normalized distribution, we select the label which has the highest probability and is also a member of the mapped label set Y_m. We assume the selected label to be correct and propagate the log-likelihood loss. The intuition behind this is that, given the design of the UHLS and label mapping, there will always be examples where Y_m contains only one element; in that case, the model is trained directly on that label. Where there are multiple candidate labels, the model has already built a belief about the fine label suitable for that example, because it is simultaneously trained on inputs having a single mapped label. Restricting that belief to the mapped labels encourages correct fine predictions for these coarsely labeled examples.
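A minimal PyTorch sketch of this training step follows. The dense ancestor matrix and the 0/1 candidate mask encoding Y_m are illustrative implementation choices, not necessarily those of [29] or of our released code.

import torch
import torch.nn.functional as F

def partial_hierarchical_loss(R, W, b, ancestors, candidate_mask, beta=0.3):
    # R: (batch, d) mention-and-context representations.
    # W: (d, |Y_u|) weight matrix; b: (|Y_u|,) bias.
    # ancestors: (|Y_u|, |Y_u|) 0/1 matrix; ancestors[t, y] = 1 iff
    #   t is an ancestor of y in R_u.
    # candidate_mask: (batch, |Y_u|) 0/1 encoding of the mapped set Y_m
    #   (including descendants) for each example.
    # beta: the hyperparameter from Equation (2); 0.3 is an arbitrary default.
    p = F.softmax(R @ W + b, dim=-1)                   # Equation (1)
    p_hat = p + beta * (p @ ancestors)                 # Equation (2)
    p_hat = p_hat / p_hat.sum(dim=-1, keepdim=True)    # renormalize
    # Select the highest-probability label inside the candidate set Y_m
    # and treat it as the correct label for this example.
    target = (p_hat * candidate_mask).argmax(dim=-1)
    return F.nll_loss(torch.log(p_hat + 1e-12), target)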
4 EXPERIMENTS AND ANALYSIS
In this section, we describe the datasets used, the details of experiments related to UHLS creation, the baseline models, model training, the evaluation schemes, and the result analysis.

4.1 Datasets
Table 1 describes the seven datasets used in this work. These datasets are diverse: they span several domains, none of them have an identical label set, and some datasets capture fine-grained labels while others only have coarse labels. Also, the Wiki [14] dataset was
automatically generated using a distant supervision process [5] and has multiple labels per entity mention; the remaining datasets have a single label per entity mention.

[Figure 5: A pictorial illustration of the complete experimental setup.]

Table 1: Description of the seven ET datasets.
Dataset        | Domain                                        | No. of labels | Mention count | Fine labels
BC5CDR [13]    | Clinical abstracts                            | 2             | 9,385         | No
CoNLL [25]     | Reuters news stories                          | 4             | 23,499        | No
JNLPBA [12]    | Life sciences abstracts                       | 5             | 46,750        | Yes
CADEC [11]     | Medical forum                                 | 5             | 5,807         | Yes
OntoNotes [28] | Newswire, conversations, newsgroups, weblogs  | 18            | 116,465       | No
BBN [27]       | Wall Street Journal text                      | 73            | 86,921        | Yes
Wiki [14]      | Wikipedia                                     | 116           | 2,000,000     | Yes

4.2 UHLS and Label Mapping
We followed Algorithm 1 to create the UHLS and the label mapping. To reduce the load on domain experts for verification of the label spaces, we initialized the UHLS with the BBN dataset hierarchy. We kept updating this initial hierarchy until all the labels from the seven datasets had been processed. There were 223 labels in total in Y, and in the end Y_u had 168 labels. The difference in label count is due to the mapping of several labels to one or multiple existing nodes without the creation of a new node; this corresponds to Case 2 of the UHLS creation process (lines 3-4, Algorithm 1) and indicates the overlapping nature of the seven datasets. The label set overlap is illustrated in Figure 2. The MISC label from the CoNLL dataset has the highest number of mappings to UHLS nodes, with ten. The Wiki and BBN datasets were the largest contributors of fine labels, with 96 and 57 labels, respectively, at the leaves of the UHLS. However, only 25 fine-grained labels are shared by these two datasets. This indicates that even though these are the fine-grained datasets with among the largest label sets, each of them contributes complementary labels.

4.3 Baselines
We compare our learning model with two baseline models. The first baseline is an ensemble of seven learning models, where each model is trained on one of the seven datasets. We name this model the silo ensemble model³. In this ensemble, each silo model has the same mention and context encoder structure described in Section 3.2.1; however, the loss function differs. For single-label datasets, we use a standard softmax based cross-entropy loss. For multi-label datasets, we use a sigmoid based cross-entropy loss.

The second baseline is a learning model trained using a classic hard parameter sharing multi-task learning framework [3]. In this baseline, all seven datasets are fed through a common mention and context encoder. For each dataset, there is a separate classifier head whose output labels are the same as those available in the respective original dataset. We name this baseline the multi-head ensemble baseline⁴. As with the silo models, the appropriate loss function is selected for each head. The only difference between the silo and multi-head models is the way the mention and context representations are learned: in the multi-head model, the representations are shared across datasets, whereas in the silo models they are learned separately for each dataset.

³ Here, unlike traditional ensemble models, in the silo ensemble the learning models are trained on different datasets.
⁴ Here, since the "task" is the same, i.e., entity typing, we use the term multi-head instead of multi-task for this baseline.
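The structural contrast between the two baselines can be summarized in a short PyTorch sketch; the encoder is the one sketched in Section 3.2.1, and the names and sizes are illustrative assumptions.

import torch.nn as nn

class SiloModel(nn.Module):
    # One model per dataset: a private encoder and a private classifier head.
    def __init__(self, encoder, num_labels, rep_dim=500):
        super().__init__()
        self.encoder = encoder                    # trained on one dataset only
        self.head = nn.Linear(rep_dim, num_labels)

    def forward(self, *inputs):
        return self.head(self.encoder(*inputs))

class MultiHeadModel(nn.Module):
    # Hard parameter sharing: one encoder shared by all seven datasets,
    # with a separate classifier head per dataset.
    def __init__(self, shared_encoder, labels_per_dataset, rep_dim=500):
        super().__init__()
        self.encoder = shared_encoder
        self.heads = nn.ModuleDict({name: nn.Linear(rep_dim, n)
                                    for name, n in labels_per_dataset.items()})

    def forward(self, dataset_name, *inputs):
        return self.heads[dataset_name](self.encoder(*inputs))

In both cases, the resulting logits feed a softmax cross-entropy loss for single-label datasets and a sigmoid cross-entropy loss for the multi-label Wiki dataset.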
4.4 Model Training
For each of the seven datasets, we use the standard train, validation and test split. If standard splits are not available, we randomly split the available data into 70%, 15%, and 15%, and use these as the train, validation, and test sets respectively. For the silo baseline, we train a model on each dataset's training split and select the best model using its validation split. For the multi-head baseline and our proposed model, we train on the training splits of all seven datasets together and select the best model using the combined validation split⁵.

⁵ The source code and the implementation details are available at: https://github.com/abhipec/ET_in_the_wild

4.5 Experimental Setup
Figure 5 illustrates the complete experimental setup along with the learning models compared. In this setup, the objective is to measure a learning model's generalizability on the ET task as a whole, rather than on any specific dataset. To achieve this, we
merged the test instances from the seven datasets listed in Table 1 to form a combined test corpus. On this test set, we compare the performance of the baseline models with the learning model trained via our proposed framework, using the following evaluation schemes.

4.5.1 Evaluation schemes.
Idealistic scheme: Given a test instance, this scheme picks the silo model from the silo ensemble (or the head of the multi-head ensemble) that was trained on a dataset with the same domain and target label set as the test instance. This scheme gives an advantage to the ensemble baselines and compares the models in the traditional way.

Realistic scheme: In this scheme, all test instances are indistinguishable in their domain and candidate label set. In other words, given a test instance, learning models have no information about its domain and target labels. This is a challenging evaluation scheme and close to the real-world setting: once learning models are deployed, it cannot be guaranteed that user-submitted test instances will come from the same domain. In this scheme, the silo ensemble and multi-head ensemble models assign a label to a test instance based on one of the following criteria:

Highest confidence label (HCL): The label with the highest confidence score among the different models/heads of an ensemble. For example, let there be two models/heads, MA and MB, in a silo/multi-head ensemble. For a test instance, MA assigns scores of 0.1, 0.2 and 0.7 to the labels l1, l2 and l3 respectively, and MB assigns scores of 0.05 and 0.95 to the labels l4 and l5 respectively. The final label is then l5, which has a confidence score of 0.95.

Relative highest confidence label (RHCL): The label with the highest normalized confidence score among the different models/heads of an ensemble. Continuing the example above, under this criterion we scale each model's confidence scores by the number of labels the model predicts. MA predicts three labels and MB predicts two, so the normalized scores for MA are 0.3, 0.6 and 2.1 for l1, l2 and l3 respectively, and the normalized scores for MB are 0.1 and 1.9 for l4 and l5. The final label is then l3, with a confidence score of 2.1.
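The two criteria amount to the following sketch, where scores_per_model (a hypothetical helper structure) maps each model or head to its per-label confidence scores:

def hcl(scores_per_model):
    # Highest confidence label: global maximum over all models/heads.
    return max(((label, score)
                for scores in scores_per_model.values()
                for label, score in scores.items()),
               key=lambda pair: pair[1])[0]

def rhcl(scores_per_model):
    # Relative HCL: scale each model's scores by the number of labels it
    # predicts before taking the global maximum.
    return max(((label, score * len(scores))
                for scores in scores_per_model.values()
                for label, score in scores.items()),
               key=lambda pair: pair[1])[0]

# For the example above:
# hcl({"MA": {"l1": 0.1, "l2": 0.2, "l3": 0.7},
#      "MB": {"l4": 0.05, "l5": 0.95}})  returns "l5";
# rhcl(...) on the same input returns "l3" (score 2.1).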
Recall that the experimental setup includes multiple models, each with a different label set. Existing classifier integration strategies [31], such as the sum rule or majority voting, are therefore not suitable in this setup. For these evaluation schemes, we use the evaluation metrics described in the following section.

4.6 Evaluation metrics
In the evaluation schemes, there are cases where the predicted label is not part of the gold dataset's label set. For example, our proposed model or an ensemble model might predict the label city for a test instance whose gold label is geopolitical entity. Here, the model is predicting a fine-grained label, but the dataset from which the test instance came only has annotations at the coarse level. Thus, without manual verification, it is not possible to know whether the model's prediction was correct. To overcome this issue, we propose two evaluation metrics, which allow us to compare learning models that make predictions in different label sets with minimal re-annotation effort.

In the first metric, we compute an aggregate micro-averaged F1 score on a best-effort basis. It is based on the intuition that if labels are only annotated at a coarse level in the gold test annotations, then a model that predicts a fine label within that coarse label should not be penalized⁶. To find the fine-coarse subtype information, we use the UHLS and the label mapping: we map both the prediction and the gold label into the UHLS and evaluate in that space. We compute this metric in both the idealistic and realistic schemes. By design, this metric will not capture errors made at the finer level, which the second metric captures.

In the second metric, we measure how good the fine-grained predictions are on examples where the gold dataset has only coarse labels. We re-annotate a representative sample of a coarse-grained dataset and evaluate the models' performance on this sample.

⁶ The exception is where the source dataset also has fine-grained labels.
[Figure 6: Comparison of learning models in the idealistic and realistic schemes.]

4.7 Result and Analysis
4.7.1 Analysis of the idealistic scheme results. In Figure 6, we can observe that the multi-head ensemble model outperforms the silo ensemble model (95.19% vs. 94.12%). The primary reason could be that the multi-head model learns better representations through the multi-task framework while also having an independent head for each dataset to learn dataset-specific idiosyncrasies. The performance of our single model (UHLS) lies between the silo ensemble and the multi-head ensemble. Note that this comparison is made in a setting that is the best possible case for the ensemble models, where they have complete information about the test instance's domain and label set. Despite this, the UHLS model, which does not require any information about the test instance's domain and candidate labels, performs competitively (94.29%), even better than the silo ensemble model. Moreover, the ensemble models do not always predict the finest possible label, whereas the UHLS model can (§ 4.7.3).

4.7.2 Analysis of the realistic scheme results. In Figure 6, we can observe that both the silo ensemble and multi-head ensemble models perform poorly in this scheme. The best ensemble result (73.08%) is obtained by the silo ensemble when labels are assigned using the HCL criterion. We analyzed some of the outputs of the ensemble models and found that there were several cases
where a narrowly focused model predicts out-of-scope labels with very high confidence (a probability of 0.99 or above); for example, the prediction of the label ADR with confidence 0.999 by a silo model trained on the CADEC dataset for a sports event test instance from the Wiki domain. The performance of our UHLS model is 94.29%, an absolute improvement of 21.21% over the next best model, Silo (HCL), in the realistic evaluation scheme.

4.7.3 Analysis of the fine-grained predictions. For this analysis, we re-annotated the examples of type MISC from the CoNLL test set into nationality (support of 351), sports event (support of 117) and others (support of 234), and analyzed the predictions of different models for the nationality and sports event labels. Note that this is an interesting evaluation: the test instances' domain is Reuters news, and the in-domain dataset does not have the labels nationality and sports event. The nationality label is contributed by the BBN dataset, whose domain is Wall Street Journal text, and the sports event label is contributed by the Wiki dataset, whose domain is Wikipedia. The results (Figure 7) are categorized into three parts as described below.

[Figure 7: Analysis of fine-grained label predictions. The two columns specify results for the nationality and sports event labels. Each row represents a model used for prediction. The results can be read as: out of 351 entity mentions with type nationality, the model Silo (CoNLL) predicted 338 as the MISC type and the remaining as the other types illustrated.]

In-domain results: The bottom two rows, Silo (CoNLL) and MH (CoNLL), represent these results. We can observe that, since the train and test data are from the same domain, these models accurately predict the label MISC for both the nationality and sports event instances. However, MISC is not a fine-grained label. These results are from the idealistic scheme, where the test instances' characteristics are known.

Out of domain but with known candidate labels: The middle four rows, Silo (BBN), MH (BBN), Silo (Wiki) and MH (Wiki), represent these results. In this case, we assume that the candidate labels are known and pick the models which can predict those labels. However, there is no single silo/head model in the ensembles which can predict both the nationality and sports event labels. For example, a model/head with the BBN label set can predict the label nationality but not the label sports event; for sports event instances, it assigns a coarse label events other, which also subsumes other events such as elections. Similarly, a model/head with the Wiki label set can predict the label sports event but not the label nationality; for nationality instances, it assigns completely out-of-scope labels such as location and organization. These out-of-scope predictions are due to the domain mismatch.

No information about domain or candidate labels: The top two rows, Silo (HCL) and UHLS, represent these results. Silo (HCL) is the silo ensemble model under the realistic evaluation scheme. We can observe that this model makes out-of-scope predictions, such as predicting ADR for sports event instances. The UHLS model is trained using our proposed framework. It predicts fine-grained labels for both the nationality and sports event test instances, even though two different datasets contributed these labels, and it does not use any information about the test instance's domain or candidate labels.

[Figure 8: Example output of our proposed approach. Sentences 1, 2 and 3 are from the CoNLL, BBN and BC5CDR datasets respectively.]

4.7.4 Example output on different datasets. In Figure 8, we show the labels assigned by the model trained using the proposed framework to sentences from the CoNLL, BBN and BC5CDR datasets. We can observe that, even though the BBN dataset is fine-grained, it has complementary labels compared with the Wiki dataset. For example, for the entity mention Magellan, the label spacecraft is assigned; the spacecraft label is only present in the Wiki dataset. Additionally, even in sentences from clinical abstracts, the proposed approach assigns fine types that came from a dataset in the medical forum domain: for example, the ADR label is only present in the CADEC dataset. The proposed approach aggregates fine labels across datasets and makes unified fine-grained predictions.

4.7.5 Result and analysis summary. The collective learning framework allows the limitations of one dataset to be covered by other dataset(s). Our results convey that a model trained using the CLF on an amalgam of diverse datasets generalizes better for the ET task as a whole. Thus, the framework is suitable for the ET in the wild problem.

5 RELATED WORK
To the best of our knowledge, the work of [21] on the visual object recognition task is closest to ours. They consider two datasets, the first coarse-grained and the second fine-grained, where the label set of the first dataset is assumed to be subsumed by the label set of the second
dataset. Thus, coarse-grained labels can be mapped to fine-grained dataset labels in a one-to-one mapping. Additionally, they did not propagate the coarse labels to the finer labels. As demonstrated by our experiments, when several real-world datasets are merged, a one-to-one mapping is not possible. In our work, we provide a principled approach where multiple datasets can contribute fine-grained labels, and a partial loss function enables fine-label propagation on datasets with coarse labels.

In the area of cross-lingual syntactic parsing, there is a notion of a universal POS tagset [20]. This tagset is a collection of coarse tags that exist in similar form across languages. Utilizing this tagset and a mapping from language-specific fine tags, it becomes possible to train a single model in a cross-lingual setting. In this case, the mapping is many-to-one, i.e., from a fine category to a coarse category; thus the models are limited to predicting coarse-grained labels.

Related to the use of partial label loss functions in the context of the ET problem, there exist other notable works, including [22] and [1]. In our work, we use the current state-of-the-art hierarchical partial loss function proposed in [29].

6 CONCLUSION
In this paper, we propose building learning models that generalize better on ET as a whole, rather than on a specific dataset. We comprehensively studied the ET in the wild task, covering the problem definition, the collective learning framework, and the evaluation setup. We demonstrated that by using a UHLS, one-to-many label mappings, and a partial hierarchical loss function in conjunction, we can train a single classifier on several diverse datasets together. The single classifier collectively learns from the diverse datasets and predicts the best possible fine-grained label across all datasets, outperforming an ensemble of narrowly focused models even in their best possible case. Moreover, during collective learning there is a multi-directional knowledge flow: there is no single source or target dataset. This knowledge flow is different from the well-studied multi-task and transfer learning approaches [19], where the prime objective is to transfer knowledge from a source dataset to a target dataset.

In NLP there are several tasks, such as entity linking [23], relation classification [9], and named entity recognition [17], where the current focus is on excelling at a particular dataset rather than at the task as a whole. We expect that collective learning approaches will open up a new research direction for each of these tasks.

REFERENCES
[1] Abhishek Abhishek, Ashish Anand, and Amit Awekar. 2017. Fine-grained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 797–807.
[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 1247–1250.
[3] Rich Caruana. 1997. Multitask learning. Machine Learning 28, 1 (1997), 41–75.
[4] Timothee Cour, Ben Sapp, and Ben Taskar. 2011. Learning from partial labels. Journal of Machine Learning Research 12, May (2011), 1501–1536.
[5] Mark Craven and Johan Kumlien. 1999. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 77–86.
[6] Li Dong, Furu Wei, Hong Sun, Ming Zhou, and Ke Xu. 2015. A hybrid neural model for type classification of entity mentions. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[7] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 601–610.
[8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
[9] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, 94–99.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics 55 (2015), 73–81.
[12] Jin-Dong Kim, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Nigel Collier. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics, 70–75.
[13] Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016).
[14] Xiao Ling and Daniel S Weld. 2012. Fine-Grained Entity Recognition. In AAAI, Vol. 12. 94–100.
[15] Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211–226.
[16] George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
[17] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30, 1 (2007), 3–26.
[18] Nam Nguyen and Rich Caruana. 2008. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 551–559.
[19] Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[20] Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A Universal Part-of-Speech Tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012).
[21] Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 6517–6525.
[22] Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. 2016. AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1369–1378.
[23] Wei Shen, Jianyong Wang, and Jiawei Han. 2014. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27, 2 (2014), 443–460.
[24] Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2017. Neural Architectures for Fine-grained Entity Type Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 1271–1280.
[25] Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. Association for Computational Linguistics, 142–147.
[26] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[27] Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus. Linguistic Data Consortium, Philadelphia 112 (2005).
[28] Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA (2013).
[29] Peng Xu and Denilson Barbosa. 2018. Neural Fine-Grained Entity Type Classification with Hierarchy-Aware Loss. arXiv preprint arXiv:1803.03378 (2018).
[30] Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang. 2017. Disambiguation-free partial label learning. IEEE Transactions on Knowledge and Data Engineering 29, 10 (2017), 2155–2167.
[31] Zhi-Hua Zhou. 2012. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC.