<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collective Learning From Diverse Datasets for Entity Typing in the Wild</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhishek</string-name>
          <email>abhishek.abhishek@iitg.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amar Prakash Azad</string-name>
          <email>amarazad@in.ibm.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Balaji Ganesan</string-name>
          <email>bganesa1@in.ibm.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashish Anand</string-name>
          <email>anand.ashish@iitg.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Awekar</string-name>
          <email>awekar@iitg.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Guwahati</institution>
          ,
          <addr-line>Guwahati, Assam</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research Lab India</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Indian Institute of Technology Guwahati</institution>
          ,
          <addr-line>Guwahati, Assam</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Entity typing (ET) is the problem of assigning labels to given entity mentions in a sentence. Existing works for ET require knowledge about the domain and target label set for a given test instance. ET in the absence of such knowledge is a novel problem that we address as ET in the wild. We hypothesize that the solution to this problem is to build supervised models that generalize better on the ET task as a whole, rather than on a specific dataset. In this direction, we propose a Collective Learning Framework (CLF), which enables learning from diverse datasets in a unified way. The CLF first creates a unified hierarchical label set (UHLS) and a label mapping by aggregating label information from all available datasets. Then it builds a single neural network classifier using the UHLS, the label mapping and a partial loss function. The single classifier predicts the finest possible label across all available domains, even though these labels may not be present in any domain-specific dataset. We also propose a set of evaluation schemes and metrics to evaluate the performance of models on this novel problem. Extensive experimentation on seven diverse real-world datasets demonstrates the efficacy of our CLF.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Natural language
processing; Machine learning.</p>
      <p>Keywords: entity typing, hierarchy creation, learning from multiple datasets</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        The evolution of ET has led to the creation of multiple datasets.
These datasets differ from each other in their domain, their
label set, or both. Here, the domain of a dataset represents the data
distribution of its sentences. The label set represents the entity types
annotated. Existing work on ET requires knowledge of the domain
and the target label set of a test instance [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Figure 1 illustrates this
issue where four learning models are typing four entity mentions.
We can observe that, in order to make a reasonable prediction
(output with a solid border), it is required to assign labels from a
model which has been trained on a dataset with similar domain
and labels as that of test instances. However, domain and target
label information of a test instance is unknown in several NLP
applications such as entity ranking for web question answering
systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and knowledge base completion [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where ET models
are used.
      </p>
      <p>
        We refer to ET in the absence of domain and target label set
knowledge as the ET in the wild problem. In this setting, we have to
predict the best possible labels for all test instances, as illustrated in
Figure 1 (output with dashed line border). These labels may not be
present in the same-domain dataset. For example, the label sports team
should be predicted for the entity mention Wallaby, even though this
fine-grained label is not present in the same-domain
CoNLL dataset [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. We hypothesize that the solution to
this problem is to build supervised models that generalize better on
the ET task as a whole, rather than a specific dataset. This solution
requires collective learning from several diverse datasets.
      </p>
      <p>
        However, collectively learning from diverse datasets is a
challenging problem. Figure 2 illustrates the diversity of seven ET datasets.
We can observe that every dataset provides some distinct
information for the ET task such as domain and labels. For example,
CADEC dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] contains informally written sentences from
a medical forum, whereas JNLPBA dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] contains formally
written sentences from scientific abstracts in life sciences. Moreover,
there is an overlap in the label sets as well as a relation between the
labels of these datasets. For example, both CoNLL and Wiki [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
datasets have a label person. However, only the Wiki dataset has a label
athlete, a subtype of person. This means that the CoNLL dataset can also
contain athlete mentions, but they were annotated only with the coarse
label person. Thus, learning collectively from these diverse datasets
requires models to learn useful representations of
sentences from diverse domains as well as the relations
among labels.
      </p>
      <p>This study proposes a collective learning framework (CLF) for the ET
in the wild problem. The CLF first builds a unified hierarchical label
set (UHLS) and an associated label mapping by pooling labels from
diverse datasets. Then, a single classifier<sup>1</sup> collectively learns from the
pooled dataset using the UHLS, the label mapping and a partial
hierarchy-aware loss function.</p>
      <p>In the UHLS, the nodes are contributed by different datasets,
and a parent-child relation among nodes translates to a coarse-fine
label relation. During construction of the UHLS, a mapping from every
dataset-specific label to the UHLS nodes is also constructed. We
expect to have one-to-many mappings, as is the case for real-world
datasets. For example, a coarse-grained label of one dataset could
be mapped to multiple nodes in the UHLS introduced by some
other dataset. During the UHLS construction, human judgment
is used when comparing two labels. This effort is about four orders of
magnitude smaller than annotating every dataset with
fine-grained labels.</p>
      <p>
        Using the UHLS and the mapping, we can view several
domain-specific datasets as a single multi-domain dataset with
the same label set. On this combined dataset, we use an LSTM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
based encoder to learn a useful representation of the text followed
by a partial hierarchical loss function [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] for label classification.
This setup enables a single neural network classifier to predict
fine-grained labels across all domains, even when the fine-grained
label is not present in any in-domain dataset.
      </p>
      <p>We also propose a set of evaluation schemes and metrics for
the ET in the wild problem. In our evaluation schemes, we
evaluate a learning model's performance on a test set formed by
merging the test instances of seven diverse datasets. To excel on this
merged test set, learning models must generalize beyond a single
dataset. Our evaluation metrics are designed to measure a learning
model's ability to predict the best possible fine-grained label.
We compared a single classifier model trained with our proposed
framework with an ensemble of various models. Our model
outperforms competitive baselines by a significant margin.</p>
      <p><sup>1</sup> We use the term single classifier to denote a learning model with a single
classification head that is trained on multiple datasets with different labels together.</p>
      <p>Our contributions can be summarized as follows:</p>
      <p>(1) We propose the novel problem of ET in the wild, with the
objective of building ET models that generalize better (§ 2).</p>
      <p>(2) We propose a novel collective learning framework that
makes it possible to train a single classifier on an amalgam of
diverse ET datasets, enabling fine-grained prediction across
all the datasets, i.e., a generalized model for the ET task as a
whole (§ 3).</p>
      <p>(3) We propose evaluation schemes and evaluation metrics to
compare learning models in the ET in the wild problem
setting (§ 4.5, 4.6).</p>
    </sec>
    <sec id="sec-3">
      <title>2 TERMINOLOGIES AND PROBLEM DEFINITION</title>
      <p>In this section, we formally define the ET in the wild problem and
related terminologies.</p>
      <p>Dataset: A dataset, 𝒟, is a tuple (X, D, Y). Here, X
corresponds to a corpus of sentences with entity boundaries annotated,
D corresponds to the domain, and Y = {y1, . . . , yn} is the set of
labels used to annotate each entity mention in X. It is possible that
two datasets share a domain but differ in their label sets, or vice versa.
Here, the domain means data characteristics such as writing
style and vocabulary. For example, sentences in the CoNLL dataset
are sampled from Reuters news stories around 1999, whereas
sentences in the CADEC dataset are from medical forum posts around
2015. Thus, these datasets have different domains.</p>
      <p>Label space: A label space L for a particular label y is defined
as the set of entities that can be assigned the label y. For example, the
label space for the label car includes mentions of all cars, including
the label spaces of different car types such as hatchback, SUV, etc.
Even if two labels with the same name exist in different datasets,
their label spaces can be different. The label space information is
defined in the annotation guidelines used to create the datasets.</p>
      <p>Type Hierarchy: A type or label hierarchy, T, is a natural way to
organize a label set. It is formally defined as (Y, R),
where Y is the type set and <inline-formula><tex-math>R = \{(y_i, y_j) \mid y_i, y_j \in Y,\ i \neq j,\ L(y_i) \prec L(y_j)\}</tex-math></inline-formula>
is the relation set, in which (yi, yj) means
that yi is a subtype of yj, or in other words, the label space of yi is
subsumed within the label space of yj.</p>
      <p>ET in the Wild problem definition: Given n datasets, 𝒟1, . . . , 𝒟n,
each having its own domain and label set, Di and Yi respectively,
the objective is to predict the best possible fine-grained label from
the set of all labels, <inline-formula><tex-math>Y = \bigcup_{i=1}^{n} Y_i</tex-math></inline-formula>, for a test entity mention. The
fine-grained label might not be present in any in-domain dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>3 COLLECTIVE LEARNING FRAMEWORK (CLF)</title>
      <p>(1) From the set of all available labels Y, it is possible to construct
a type hierarchy Tu = (Yu, Ru), where Yu ⊆ Y (§ 3.1).</p>
      <p>(2) We can map each y ∈ Y to one or more nodes in Tu,
such that L(y) is the same as the union of the label spaces of
the mapped nodes (§ 3.1).</p>
      <p>(3) Using the above hierarchy and mapping, even if
some datasets only have coarse labels, i.e., labels
that are mapped to non-leaf nodes, a learning model with
a partial hierarchy-aware loss function can predict fine labels
(§ 3.2.2, 3.2.3).</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Unified Hierarchy Label Set and Label Mapping</title>
      <p>
        The labels of entity mentions can be arranged in a hierarchy. For
example, the label space of airports is subsumed in the label space of
facilities. In literature, several hierarchies, such as WordNet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and
ConceptNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] exist. Even two ET datasets, BBN [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and Wiki,
organize labels in a hierarchy. However, none of these hierarchies
can be used directly, as discussed next.
      </p>
      <preformat>
 6: v = arg min_{size(L(v))} {v | v ∈ Yu &amp; L(y) ≺ L(v)}
 7: Yu = Yu ∪ {y}
 8: Ru = Ru ∪ {(y, v)}
 9: ϕ(y) ↦ y
10: for (x, v) ∈ Ru do                       // Update existing nodes
11:     if x ≠ y &amp; L(x) ≺ L(y) then
12:         Ru = Ru − {(x, v)}
13:         Ru = Ru ∪ {(x, y)}
14: for v̂ ∈ Yu do                            // Restrict to tree hierarchy
15:     if L(v̂) ≺ L(y) &amp; v̂ ∉ subtree(y) then
16:         ϕ(y) ↦ v̂
</preformat>
      <p>Algorithm 1: UHLS and label mapping creation algorithm.</p>
      <p>
        Our analysis of the labels of several ET datasets suggests that the
presence of the same label name in two or more datasets does
not necessarily imply that their label spaces are the same. For example,
in the CoNLL dataset, the label space for the label location includes
facilities, whereas in the OntoNotes dataset [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] the location
label space excludes facilities. These differences arise because these
datasets were created by different organizations, at different times
and for different objectives. Figure 4 illustrates this label space
interaction. Additionally, some of these labels are very specific to their
domains, and not all of them are present in publicly available
hierarchies such as WordNet and ConceptNet, or even in knowledge bases
(Freebase [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or WikiData [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]).
      </p>
      <p>Thus, to construct UHLS, we analyzed the annotation
guidelines of several datasets and came up with an algorithm formally
described in Algorithm 1 and explained below.</p>
      <p>Given the set of all labels, Y, the goal is to construct a type
hierarchy, Tu = (Yu, Ru), and a label mapping ϕ : Y ↦ P(Yu).
Here, Yu is the set of labels present in the hierarchy, Ru is the
relation set and P(Yu) is the power set of Yu. To construct
Tu, we start with an initial type hierarchy, which can be Yu =
{root}, Ru = {} or can be initialized with any existing hierarchy. We
then process each label y ∈ Y in turn and decide whether there is a need to
update Tu, and update the mapping ϕ. For each label y there are
only two possible cases: either Tu is updated or it is not.</p>
      <p>Case 1, Tu is updated: In this case, y is added as a child of an
existing node in Tu, say v. While updating Tu, it is ensured
that <inline-formula><tex-math>v = \arg\min_{\mathrm{size}(L(v))} \{v \mid v \in Y_u,\ L(y) \prec L(v)\}</tex-math></inline-formula>, i.e., L(v) is the
smallest possible label space that completely subsumes the label
space of y (lines 6-8). After the update, if there are existing subtrees
rooted at v and the label space of y subsumes the label space of any of these
subtrees, then y becomes the root of those subtrees (lines 10-13). In
this case, the label mapping is updated as ϕ(y) ↦ y, i.e., the label
in an individual dataset is mapped to the same label name in the UHLS.
Additionally, if there exist any other nodes v̂ ∈ Yu such that L(v̂) ≺
L(y) and v̂ ∉ subtree(y), we add ϕ(y) ↦ v̂ for all such nodes (lines
14-16). This additional condition ensures that even in cases
where the actual hierarchy would be a directed acyclic graph, we
restrict it to a tree hierarchy by adding additional mappings.</p>
      <p>Case 2, Tu is not updated: In this case, ∃S ⊆ Yu such that L(y) =
L(S), i.e., there exists a subset of nodes whose union of label spaces
is equal to the label space of y. If |S| &gt; 1, intuitively this means that
the label space of y is a mixed space, and labels with finer label spaces
from some other datasets were already added to Yu. If |S| = 1, this
means that some other dataset added a label which has the same
label space. In this case, we only update the label mapping as
ϕ(y) ↦ S (lines 3-4).</p>
      <p>In Algorithm 1, whenever a decision has to be made that involves a
comparison between two label spaces, we consult a domain expert.
The expert makes the decision based on the annotation guidelines
for the queried labels and on the existing organization of the queried
label space in WordNet or Freebase, if the queried labels are present
in these resources. We argue that, since the overall size of Y is
several orders of magnitude smaller than the number of annotated instances
(≈ 250 ≪ ≈ 3 × 10<sup>6</sup>), having a human in the loop is affordable and preserves
the overall semantic property of the tree, which will be exploited
by a partial loss function to enable fine-grained prediction across
domains. An illustration of the UHLS and label mapping is provided in
Figure 4.</p>
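      <p>The two cases above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that every label space is available as an explicit set of entity identifiers, so that set comparison stands in for the domain expert's judgment, and that every label space is a non-empty strict subset of the root's space. The function names build_uhls and in_subtree and the toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
def in_subtree(node, root, parent):
    """Return True if `node` lies strictly below `root` in the tree."""
    while node in parent:
        node = parent[node]
        if node == root:
            return True
    return False


def build_uhls(labels, universe):
    """Build a toy UHLS from (label, label_space) pairs.

    Label spaces are explicit sets here (an assumption); in the paper a
    domain expert compares label spaces instead.
    """
    space = {"root": frozenset(universe)}  # L(v) for every node v in Y_u
    parent = {}                            # child -> parent, i.e. R_u
    phi = {}                               # label mapping
    for y, members in labels:
        ly = frozenset(members)
        # Case 2: existing nodes already cover L(y) exactly, so only map.
        inside = [v for v in space if v != "root" and space[v] <= ly]
        covered = frozenset().union(*(space[v] for v in inside))
        if covered == ly:
            # Keep only the maximal nodes; their union is still L(y).
            phi[y] = {v for v in inside
                      if not any(space[v] < space[u] for u in inside)}
            continue
        # Case 1: add y under the smallest node that strictly subsumes it.
        v = min((n for n in space if ly < space[n]),
                key=lambda n: len(space[n]))
        space[y], parent[y], phi[y] = ly, v, {y}
        # Children of v that are subsumed by L(y) move below y.
        for x, p in list(parent.items()):
            if p == v and x != y and space[x] < ly:
                parent[x] = y
        # Subsumed nodes outside subtree(y) become extra mappings,
        # which keeps the hierarchy a tree rather than a DAG.
        for u in space:
            if u != y and space[u] < ly and not in_subtree(u, y, parent):
                phi[y].add(u)
    return space, parent, phi


# Toy labels (hypothetical): integers stand for entity mentions.
labels = [
    ("athlete", {1, 2}),
    ("person", {1, 2, 3, 4}),
    ("location", {5, 6}),
    ("facility", {7}),
    ("conll_location", {5, 6, 7}),  # a location label that includes facilities
]
space, parent, phi = build_uhls(labels, universe=range(1, 10))
```
]]></preformat>
      <p>In this toy run, athlete is first attached to the root and later moved below person (Case 1 with re-parenting), while conll_location creates no new node and is mapped to both location and facility (Case 2 with |S| &gt; 1).</p>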
      <p>In the next section, we describe how the UHLS and the label
mapping are used by a learning model to make the finest possible
predictions across datasets.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Learning Model</title>
      <p>
        Our learning model can be decomposed into two parts: (1) Neural
Mention and Context Encoders, which encode the entity mention and its
surrounding context into a feature vector; and (2) a Unified Type Predictor,
which infers entity types in the UHLS.
      </p>
      <p>
        3.2.1 Neural Mention and Context Encoder. The input to our model
is a sentence with the start and end indices of the entity mention.
Following the work of [
        <xref ref-type="bibr" rid="ref1 ref24 ref29">1, 24, 29</xref>
        ] we use Bi-directional LSTMs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to
encode the left and right context surrounding the entity mention and
use a character-level LSTM to encode the entity mention itself. We then
concatenate the outputs of the three encoders to generate a
single representation (R) for the input.
      </p>
      <p>3.2.2 Unified Type Predictor. Given the input representation, R, the
objective of the predictor is to assign a type from the unified label
set Yu. Thus, during model training, we use the mapping function
ϕ : Y ↦ P(Yu) to convert individual dataset-specific labels to the
unified label set, Yu. Because the mapping is one-to-many, there can be
multiple positive labels for each individual input label y.
Let us call the mapped label set for an input label y Ym. If any
of the mapped labels ŷ ∈ Ym has descendants, then the descendants
are also added to Ym<sup>2</sup>. For example, if the label GPE from the
OntoNotes dataset is mapped to the label GPE in the UHLS, then
GPE as well as all descendants of GPE are possible candidates. This
is because, even though the original example in OntoNotes may be the
name of a city, the annotation guidelines restrict fine-grained labeling.
Thus, the mapped set would be updated to {GPE, City, Country,
County, ...}. Additionally, some labels have a one-to-many mapping;
for example, for the label MISC in the CoNLL dataset, the candidate
labels could be {product, event, ...}.</p>
      <p><sup>2</sup> This is exempted when the annotated label is a coarse label and a fine label from the
same dataset exists in the subtree.</p>
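      <p>As a rough sketch of how the candidate set Ym could be assembled from the label mapping and the hierarchy, consider the following Python fragment. It is illustrative only: the function names, the children dictionary, and the same_dataset_nodes argument (our reading of the exemption for coarse labels whose dataset also annotates a finer label in the same subtree) are assumptions, and the tiny hierarchy mirrors the GPE and MISC examples from the text.</p>
      <preformat><![CDATA[
```python
def descendants(node, children):
    """Collect all nodes strictly below `node` in the hierarchy."""
    stack, found = list(children.get(node, [])), set()
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def candidate_labels(label, phi, children, same_dataset_nodes=frozenset()):
    """Return the mapped label set Y_m for a dataset-specific label.

    `same_dataset_nodes` holds UHLS nodes that the source dataset itself
    annotates; if one of them sits below a mapped node, that node is not
    expanded (an assumed reading of the exemption in the text).
    """
    candidates = set()
    for node in phi[label]:
        candidates.add(node)
        below = descendants(node, children)
        if not (below & set(same_dataset_nodes)):
            candidates |= below
    return candidates


# Hypothetical fragment of a UHLS and label mapping.
children = {
    "root": ["GPE", "product", "event"],
    "GPE": ["City", "Country", "County"],
}
phi = {
    "ontonotes:GPE": {"GPE"},
    "conll:MISC": {"product", "event"},
}

gpe_candidates = candidate_labels("ontonotes:GPE", phi, children)
misc_candidates = candidate_labels("conll:MISC", phi, children)
gpe_restricted = candidate_labels("ontonotes:GPE", phi, children,
                                  same_dataset_nodes={"City"})
```
]]></preformat>
      <p>Here gpe_candidates contains GPE together with City, Country and County, misc_candidates contains product and event, and gpe_restricted stays at the coarse node GPE because the hypothetical source dataset already distinguishes City.</p>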
      <p>
        From the set of mapped candidate labels, a partial label loss
function selects the best candidate label. Due to the inherent design
of the UHLS and label mapping, there will always be examples
that are mapped to only a single leaf node. Thus,
allowing fine labels in the candidate set for coarse gold labels
encourages the model to predict finer labels across datasets.
      </p>
      <p>
        3.2.3 Partial Hierarchical Label Loss. A partial label loss deals with
the situation where training examples have a set of candidate labels,
among which only a subset is correct for a given example
[
        <xref ref-type="bibr" rid="ref18 ref30 ref4">4, 18, 30</xref>
        ].
      </p>
      <p>
        In our case, this situation arises because of the mapping of the
individual dataset labels to the UHLS. We use a hierarchy aware
partial loss function as proposed in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. We first compute the
probability distribution for the labels available in Yu as described
in equation 1. Here W is a weight matrix of size |R | × |Yu | and x is
the input entity mention along with its context.
      </p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>p(y \mid x) = \mathrm{softmax}(RW + b)</tex-math>
      </disp-formula>
      <p>Then we compute p̂(y|x), a distribution adjusted to include a weighted
sum of the ancestors' probabilities for each label, as defined in
equation 2. Here, At is the set of ancestors of the label y in Ru and β is
a hyperparameter.</p>
      <disp-formula>
        <label>(2)</label>
        <tex-math>\hat{p}(y \mid x) = p(y \mid x) + \beta \sum_{t \in A_t} p(t \mid x)</tex-math>
      </disp-formula>
      <p>Then we normalize p̂(y|x). From this normalized distribution, we
select the label which has the highest probability and is also a member
of the mapped labels Ym. We assume the selected label to be
correct and propagate the log-likelihood loss. The intuition behind
this is that, given the design of the UHLS and label mapping, there
will always be examples where Ym contains only one element;
in that case, the model gets trained for that label. In the case where
there are multiple labels, the model has already built a belief about
the fine label suitable for that example because it is simultaneously
trained on inputs having a single mapped label. Restricting that
belief to the mapped labels encourages correct fine-grained predictions for
these coarsely labeled examples.</p>
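      <p>The computation in equations 1 and 2 can be illustrated with a small dependency-free Python sketch. It is a simplified illustration and not the authors' code: it works on a single example with given logits instead of a trained network, and it assumes that the log-likelihood is taken on the normalized adjusted distribution. The function names, the toy labels and the value of β are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores (equation 1)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def partial_hierarchical_loss(logits, labels, ancestors, candidates, beta):
    """Pick the most probable candidate label and return it with its loss.

    `ancestors` maps a label to its ancestors in the hierarchy and
    `candidates` is the mapped label set Y_m for this example.
    """
    p = dict(zip(labels, softmax(logits)))
    # Equation 2: add a beta-weighted sum of ancestor probabilities.
    adjusted = {y: p[y] + beta * sum(p[t] for t in ancestors.get(y, ()))
                for y in labels}
    norm = sum(adjusted.values())
    p_hat = {y: value / norm for y, value in adjusted.items()}
    # Restrict the model's current belief to the candidate labels.
    chosen = max(candidates, key=lambda y: p_hat[y])
    # Assumption: negative log-likelihood of the normalized p_hat.
    return chosen, -math.log(p_hat[chosen])


labels = ["GPE", "City", "Country"]
ancestors = {"City": ["GPE"], "Country": ["GPE"]}
logits = [1.0, 2.0, 0.5]

chosen, loss = partial_hierarchical_loss(
    logits, labels, ancestors, candidates={"GPE", "City", "Country"}, beta=0.3)
coarse_only, _ = partial_hierarchical_loss(
    logits, labels, ancestors, candidates={"GPE", "Country"}, beta=0.3)
```
]]></preformat>
      <p>With these toy logits the fine label City is selected when it is among the candidates, and the selection falls back to GPE when the candidate set excludes City.</p>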
    </sec>
    <sec id="sec-9">
      <title>4 EXPERIMENTS AND ANALYSIS</title>
      <p>In this section, we describe the datasets used, the details of the
experiments related to UHLS creation, the baseline models, model training,
the evaluation schemes and the result analysis.</p>
    </sec>
    <sec id="sec-10">
      <title>4.1 Datasets</title>
      <p>
        Table 1 describes the seven datasets used in this work. These datasets
are diverse, as they span several domains, none of them have an
identical label set and some datasets capture fine-grained labels
while others only have coarse labels. Also, the Wiki [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] dataset is
automatically generated using distant supervision process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
has multiple labels per entity mention in its label set. The
remaining datasets have a single label per entity mention.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4.2 UHLS and Label Mapping</title>
      <p>We followed Algorithm 1 to create the UHLS and the label
mapping. To reduce the load on domain experts for verification of the
label spaces, we initialized the UHLS with the BBN dataset
hierarchy. We kept updating the initial hierarchy until all the labels
from the seven datasets were processed. There were 223 labels in total
in Y, and in the end Yu had 168 labels. This difference in label count
is due to the mapping of several labels to one or more existing
nodes without the creation of a new node. This corresponds to case
2 of the UHLS creation process (lines 3-4, Algorithm 1). It also
indicates the overlapping nature of the seven datasets. The label
set overlap is illustrated in Figure 2. The MISC label from the CoNLL
dataset has the highest number of mappings to UHLS nodes, namely ten.
The Wiki and BBN datasets were the largest contributors of fine
labels, with 96 and 57 labels at the leaves of the UHLS, respectively. However, only 25
fine-grained labels were shared by these two datasets. This
indicates that even though these are fine-grained datasets with some
of the largest label sets, each of them has complementary labels.</p>
    </sec>
    <sec id="sec-12">
      <title>4.3 Baselines</title>
      <p>We compared our learning model with two baseline models. The
first baseline is an ensemble of seven learning models, where each
model is trained on one of the seven datasets. We name this model
the silo ensemble model<sup>3</sup>. In this ensemble model, each silo model
has the same mention and context encoder structure described in
Section 3.2.1. However, the loss function is different. For single-label
datasets, we use a standard softmax-based cross-entropy loss. For
multi-label datasets, we use a sigmoid-based cross-entropy loss.</p>
      <p>
        The second baseline is a learning model trained using a classic
hard parameter sharing multi-task learning framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this
baseline, all seven datasets are fed through a common mention
and context encoder. For each dataset, there is a separate classifier
head whose output labels are the same as those available in the
respective original dataset. We name this baseline the multi-head
ensemble baseline<sup>4</sup>. Similar to the silo models, the appropriate loss
function is selected for each head. The only difference between the
silo and multi-head models is the way mention and context
representations are learned. In the multi-head model, the representations
are shared across datasets. In the silo models, the representations are
learned separately for each dataset.
      </p>
    </sec>
    <sec id="sec-13">
      <title>4.4 Model Training</title>
      <p>For each of the seven datasets, we use the standard train, validation
and test splits. If standard splits are not available, we randomly
split the available data into 70%, 15%, and 15%, and use them as the train,
validation, and test sets respectively. In the case of the silo model,
for each dataset, we train a model on its training split and select the
best model using its validation split. In the case of the multi-head
model and our proposed model, we train the model on the training splits
of all seven datasets together and select the best model using the
combined validation split<sup>5</sup>.</p>
    </sec>
    <sec id="sec-14">
      <title>4.5 Experimental Setup</title>
      <p>Figure 5 illustrates the complete experimental setup along with
the learning models compared. In this setup, the objective is to
measure a learning model’s generalizability for the ET task as
a whole, rather than on any specific dataset. To achieve this, we
merged the test instances from the seven datasets listed in Table
1 to form a combined test corpus. On this test set, we compared
the performance of the baseline models with the learning model
trained via our proposed framework. We compare the performance of
these models using the following evaluation schemes.</p>
      <p><sup>3</sup> Unlike traditional ensemble models, in the silo ensemble the learning models are
trained on different datasets.</p>
      <p><sup>4</sup> Since the “task” is the same, i.e., entity typing, we use the term multi-head
instead of multi-task for the baseline.</p>
      <p><sup>5</sup> The source code and the implementation details are available at: https://github.com/abhipec/ET_in_the_wild</p>
      <p>4.5.1 Evaluation schemes. Idealistic scheme: Given a test instance,
this scheme picks the silo model from the silo ensemble model (or the
head of the multi-head ensemble model) that has been trained
on a training dataset with the same domain and target label set as
the test instance. This scheme gives an advantage to the ensemble
baselines and compares the models in the traditional way.</p>
      <p>Realistic scheme: In this scheme, all of the test instances are
indistinguishable in their domain and candidate label set. In other
words, given a test instance, learning models do not have
information about its domain and target labels. This is a challenging
evaluation scheme and close to the real-world setting, where once
learning models are deployed, it cannot be guaranteed that
user-submitted test instances will be from the same domain. In this
scheme, the silo ensemble and multi-head ensemble models assign
a label to a test instance based on the following criteria:</p>
      <sec id="sec-14-1">
        <title>Highest confidence label (HCL)</title>
        <p>The label which has the highest
confidence score among the different models/heads of an ensemble
model. For example, let there be two models/heads, MA and MB, in
a silo/multi-head ensemble model. For a test instance, MA assigns
scores of 0.1, 0.2 and 0.7 to the labels l1, l2 and l3 respectively.
For the same test instance, MB assigns scores of 0.05 and 0.95
to the labels l4 and l5 respectively. Then the final label will be the
label l5, which has a confidence score of 0.95.</p>
        <p>Relative highest confidence label (RHCL): The label which has
the highest normalized confidence score among the different
models/heads of an ensemble model. Continuing with the example
mentioned above for MA and MB, under this criterion we normalize the
confidence scores of each model by multiplying them by the number of labels the
model predicts over. In this example, MA predicts over three labels
and MB predicts over two labels. Here, the normalized scores for
MA will be 0.3, 0.6 and 2.1 for the labels l1, l2, and l3 respectively.
Similarly, the normalized scores for MB will be 0.1 and 1.9 for the
labels l4 and l5. Then the final label will be the label l3 with a
confidence score of 2.1.</p>
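        <p>Both criteria can be reproduced on the worked example with a short Python sketch. The function name and the dictionary layout are hypothetical; the scaling by the number of labels per model follows the numbers in the example above.</p>
        <preformat><![CDATA[
```python
def highest_confidence_label(model_scores, relative=False):
    """Select one label from the outputs of several models or heads.

    model_scores maps a model name to a {label: confidence} dictionary.
    With relative=True, each score is multiplied by the number of labels
    of its model (RHCL); otherwise raw scores are compared (HCL).
    """
    best_label, best_score = None, float("-inf")
    for scores in model_scores.values():
        scale = len(scores) if relative else 1
        for label, score in scores.items():
            scaled = score * scale
            if scaled > best_score:
                best_label, best_score = label, scaled
    return best_label, best_score


model_scores = {
    "MA": {"l1": 0.1, "l2": 0.2, "l3": 0.7},
    "MB": {"l4": 0.05, "l5": 0.95},
}
hcl_label, hcl_score = highest_confidence_label(model_scores)
rhcl_label, rhcl_score = highest_confidence_label(model_scores, relative=True)
```
]]></preformat>
        <p>On this input, HCL selects l5 and RHCL selects l3, matching the example in the text.</p>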
        <p>
          Recall that the experimental setup includes multiple models,
each having a different label set. Existing classifier integration
strategies [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], such as the sum rule or majority voting, are therefore not suitable
in this setup. For these evaluation schemes, we use the evaluation
metrics described in the following section.
        </p>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>4.6 Evaluation metrics</title>
      <p>In the evaluation schemes, there are cases where the predicted label
is not part of the gold dataset label set. For example, our proposed
model or the ensemble model might predict the label city for a test
instance whose gold label is annotated as geopolitical entity.
Here, the models are predicting a fine-grained label; however, the
dataset from which the test instance came only had annotations
at the coarse level. Thus, without manual verification, it is not
possible to know whether the model’s prediction was correct or
not. To overcome this issue, we propose two evaluation metrics,
which allow us to compare learning models making predictions in
different label sets with minimum re-annotation effort.</p>
      <p>In the first metric, we compute an aggregate micro-averaged F1
score on a best-effort basis. It is based on the intuition that if the labels
are only annotated at a coarse level in the gold test annotations,
then even if a model predicts a fine label within that coarse label,
this metric should not penalize such cases<sup>6</sup>. To find the fine-coarse
subtype information, we use the UHLS and the label mapping. We
map both the prediction and the gold label to the UHLS and evaluate in
that space. We compute this metric in both the idealistic and the realistic
scheme. By design, this metric will not capture errors made at a
finer level, which the next metric will capture.</p>
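      <p>One possible reading of this best-effort metric is sketched below in Python. It is not the authors' evaluation script: the exact matching rule, the function names and the toy hierarchy are assumptions. A predicted UHLS node is accepted if it equals a gold node or, when the source dataset is only coarsely annotated, if a gold node is one of its ancestors.</p>
      <preformat><![CDATA[
```python
def ancestors_of(node, parent):
    """Return all ancestors of `node`, following child -> parent links."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def best_effort_micro_f1(instances, parent):
    """Micro-averaged F1 in UHLS space.

    instances: iterable of (predicted, gold, allow_finer) triples, where
    predicted and gold are sets of UHLS nodes and allow_finer is False
    when the source dataset itself has fine-grained labels.
    """
    correct_pred = total_pred = found_gold = total_gold = 0
    for predicted, gold, allow_finer in instances:
        def matches(p, g):
            return p == g or (allow_finer and g in ancestors_of(p, parent))
        correct_pred += sum(any(matches(p, g) for g in gold) for p in predicted)
        found_gold += sum(any(matches(p, g) for p in predicted) for g in gold)
        total_pred += len(predicted)
        total_gold += len(gold)
    precision = correct_pred / total_pred if total_pred else 0.0
    recall = found_gold / total_gold if total_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical hierarchy fragment and three test instances.
parent = {"City": "GPE", "GPE": "root", "athlete": "person", "person": "root"}
instances = [
    ({"City"}, {"GPE"}, True),       # finer than a coarse gold label: accepted
    ({"City"}, {"GPE"}, False),      # source dataset has fine labels: rejected
    ({"person"}, {"person"}, False), # exact match
]
f1 = best_effort_micro_f1(instances, parent)
```
]]></preformat>
      <p>In this toy case two of the three predictions are accepted, so precision, recall and F1 all equal two thirds.</p>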
      <p>In the second metric, we measure how good the fine-grained
predictions are on examples where the gold dataset has only coarse
labels. We re-annotate a representative sample of a coarse-grained
dataset and evaluate the model’s performance on this sample.</p>
      <p>Result and Analysis
4.7.1 Analysis of the idealistic scheme results. In Figure 6, we can
observe that the multi-head ensemble model outperforms the silo
ensemble model (95.19% vs. 94.12%). The primary reason could be
that the multi-head model learns better representations through
the multi-task framework and also has an independent head for
each dataset to learn dataset-specific idiosyncrasies. The performance
of our single model (UHLS) lies between the silo ensemble model and
the multi-head ensemble model. Note that this comparison
is made in the best possible setting for the ensemble models,
in which they have complete information about the
test instance domain and label set. Despite this, the UHLS model, which
does not require any information about the test instance domain or
candidate labels, performs competitively (94.29%), even better than
the silo ensemble model. Moreover, the ensemble models do not
always predict the finest possible label, whereas UHLS can (§ 4.7.3).
</p>
      <p>4.7.2 Analysis of the realistic scheme results. In Figure 6, we can
observe that both the silo ensemble and the multi-head ensemble models
perform poorly in this scheme. The best result for the ensemble
models (73.08%) is obtained by the silo ensemble model when the labels
are assigned using the HCL criterion. We analyzed some of the
outputs of the ensemble models and found several cases
where a narrowly focused model predicts out-of-scope labels with very high
confidence (0.99 probability or above). For example, a silo model trained on
the CADEC dataset predicts the label ADR with confidence 0.999
for a sports event test instance from the Wiki domain.
The performance of our UHLS model is 94.29%, an absolute
improvement of 21.21% over the next best model, Silo (HCL),
in the realistic scheme of evaluation.
</p>
      <p>4.7.3 Analysis of the fine-grained predictions. For this analysis, we
re-annotate the examples of type MISC from the CoNLL test set into
nationality (support of 351), sports event (support of 117) and others
(support of 234). We analyzed the predictions of different models for the
nationality and sports event labels. Note that this is an interesting
evaluation because the domain of the test instances is Reuters News, and
the in-domain dataset does not have the labels nationality and sports
event. The nationality label is contributed by the BBN dataset, whose
domain is the Wall Street Journal. The sports event label is contributed
by the Wiki dataset, whose domain is Wikipedia. The results (Figure
7) are categorized into three parts as described below:
In-domain results: The bottom two rows, Silo (CoNLL) and MH
(CoNLL), represent these results. We can observe that in this case,
since the train and test datasets are from the same domain, these models
can accurately predict the label MISC for both the nationality and
sports event instances. However, MISC is not a fine-grained label.
These results are from the idealistic scheme, where the characteristics
of the test instance are known.</p>
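      <p>The failure mode observed for the ensemble models in the realistic scheme can be illustrated with a minimal Python sketch. It assumes that the HCL criterion selects, across all silo models, the label with the highest confidence. The model outputs below are made-up values for a single sports event mention (only the label ADR and its confidence of 0.999 mirror the case reported above), and the function name highest_confidence_label is hypothetical.</p>
      <preformat>
```python
# Hypothetical per-model outputs for one test mention:
# model name -> {label: confidence}. All values are illustrative.
silo_outputs = {
    "CoNLL": {"MISC": 0.91, "ORG": 0.05},
    "Wiki": {"sports_event": 0.97, "location": 0.02},
    "CADEC": {"ADR": 0.999, "drug": 0.001},
}


def highest_confidence_label(outputs):
    """Pick the (model, label, confidence) triple with the largest confidence."""
    best = None
    for model, scores in outputs.items():
        for label, confidence in scores.items():
            if best is None or confidence > best[2]:
                best = (model, label, confidence)
    return best


model, label, confidence = highest_confidence_label(silo_outputs)
print(model, label, confidence)
```
      </preformat>
      <p>In this toy case the CADEC silo wins with the label ADR, although neither its domain nor its label set fits the mention. Confidence scores of independently trained, narrowly focused models are not directly comparable.</p>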
      <sec id="sec-15-1">
        <title>Out of domain but with known candidate label</title>
        <p>The middle four rows, Silo (BBN), MH (BBN), Silo (Wiki) and MH (Wiki),
represent these results. In this case, we assume that the candidate
labels are known, and pick the models which can predict that label.
However, there is not a single silo/head model in the ensemble
models which can predict both nationality and sports event labels. For
example, model/head with the BBN label set can predict the label
nationality but not the sports event label. For sports event instances,
it assigns a coarse label events other, which also subsumes other
events such as elections. Similarly, model/head with the Wiki label
set can predict the label sports event but not the label nationality.
For nationality instances, it assigns completely out-of-scope labels
such as location and organizations. The out-of-scope predictions are
due to the domain mismatch.</p>
        <p>No information about domain or candidate label: The top two
rows, Silo (HCL) and UHLS, represent these results. Silo (HCL) is
the silo ensemble model under the realistic evaluation scheme. We can
observe that this model makes out-of-scope predictions, such as
predicting ADR for sports event instances. The UHLS model is trained
using our proposed framework. It predicts fine-grained labels for
both nationality and sports event test instances, even though two
different datasets contributed these labels. Also, it does not use any
information about the test instance domain or candidate labels.
</p>
        <p>4.7.4 Example output on different datasets. In Figure 8, we show
the labels assigned by the model trained using the proposed
framework on sentences from the CoNLL, BBN and BC5CDR datasets.
We can observe that, even though the BBN dataset is fine-grained,
its labels are complementary to those of the Wiki dataset. For
example, the entity mention Magellan is assigned the label
spacecraft, which is present only in the Wiki dataset.
Additionally, even in sentences from clinical abstracts, the proposed
approach assigns fine-grained types that come from a dataset in
the medical forum domain. For example, the ADR label is present only
in the CADEC dataset, whose domain is medical forums. The
proposed approach can aggregate fine-grained labels across datasets and make
unified fine-grained predictions.
</p>
        <p>4.7.5 Result and analysis summary. The collective learning framework
allows a limitation of one dataset to be covered by other
dataset(s). Our results indicate that a model trained using CLF on
an amalgam of diverse datasets generalizes better for the ET task
as a whole. Thus, the framework is suitable for the ET in the wild
problem.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>RELATED WORK</title>
      <p>
        To the best of our knowledge, the work of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in the visual object
recognition task is closest to our work. They consider two datasets:
the first coarse-grained and the second fine-grained. The label set of the
first dataset is assumed to be subsumed by the label set of the second
dataset. Thus, coarse-grained labels can be mapped to fine-grained
dataset labels in a one-to-one mapping. Additionally, they do not
propagate the coarse labels to the finer labels. As demonstrated by
our experiments, when several real-world datasets are merged, a
one-to-one mapping is not possible. In our work, we provide a principled
approach where multiple datasets can contribute to fine-grained
labels. In our framework, a partial loss function enables fine-label
propagation on datasets with coarse labels.
      </p>
      <p>
        In the area of cross-lingual syntactic parsing, there is the notion
of a universal POS tagset [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. This tagset is a collection of coarse tags
that exist in similar form across languages. Utilizing this tagset and
a mapping from language-specific fine tags, it becomes possible
to train a single model in a cross-lingual setting. In this case, the
mapping is many-to-one, i.e., from a fine category to a coarse category;
thus, the models are limited to predicting coarse-grained labels.
      </p>
      <p>
        Related to the use of a partial label loss function in the context of
the ET problem, there exist other notable works, including [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In our work, we use the current state-of-the-art hierarchical
partial loss function proposed in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
    </sec>
    <sec id="sec-17">
      <title>CONCLUSION</title>
      <p>
        In this paper, we propose building learning models that generalize
better on the ET task as a whole, rather than on a specific dataset. We
comprehensively studied the ET in the wild task, including its
problem definition, a collective learning framework, and an evaluation setup.
We demonstrated that by using a UHLS, one-to-many
label mappings, and a partial hierarchical loss function in conjunction, we can train
a single classifier on several diverse datasets together. The single
classifier collectively learns from diverse datasets and predicts the
best possible fine-grained label across all datasets, performing
competitively with, and in the silo case better than, ensembles of
narrowly focused models in their best possible
case. Also, during collective learning there is a multi-directional
knowledge flow, i.e., there is no single source or target dataset. This
knowledge flow is different from the well-studied multi-task and
transfer learning approaches [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] where the prime objective is to
transfer knowledge from a source dataset to a target dataset.
      </p>
      <p>
        In NLP there are several tasks such as entity linking [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
relation classification [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and named entity recognition [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where
the current focus is on excelling at a particular dataset, not at a
particular task. We expect that collective learning approaches will
open up a new research direction for each of these tasks.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abhishek</surname>
            <given-names>Abhishek</given-names>
          </string-name>
          , Ashish Anand, and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Awekar</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Fine-grained entity type classification by jointly learning representations and label embeddings</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>797</fpage>
          -
          <lpage>807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Bollacker</surname>
          </string-name>
          , Colin Evans, Praveen Paritosh, Tim Sturge, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data</source>
          . ACM
          ,
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Multitask learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>28</volume>
          ,
          <issue>1</issue>
          (
          <year>1997</year>
          ),
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Timothee</given-names>
            <surname>Cour</surname>
          </string-name>
          , Ben Sapp, and
          <string-name>
            <given-names>Ben</given-names>
            <surname>Taskar</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Learning from partial labels</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <month>May</month>
          (
          <year>2011</year>
          ),
          <fpage>1501</fpage>
          -
          <lpage>1536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Craven</surname>
          </string-name>
          and
          <string-name>
            <given-names>Johan</given-names>
            <surname>Kumlien</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Constructing Biological Knowledge Bases by Extracting Information from Text Sources</article-title>
          .
          <source>In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology</source>
          . AAAI Press,
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Li</given-names>
            <surname>Dong</surname>
          </string-name>
          , Furu Wei, Hong Sun, Ming Zhou, and
          <string-name>
            <given-names>Ke</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A hybrid neural model for type classification of entity mentions</article-title>
          .
          <source>In Twenty-Fourth International Joint Conference on Artificial Intelligence</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Dong</surname>
          </string-name>
          , Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion</article-title>
          .
          <source>In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14)</source>
          . ACM, New York, NY, USA,
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Abdel-rahman Mohamed, and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE
          ,
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Iris</given-names>
            <surname>Hendrickx</surname>
          </string-name>
          , Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Szpakowicz</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals</article-title>
          .
          <source>In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics</source>
          ,
          <fpage>94</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sarvnaz</given-names>
            <surname>Karimi</surname>
          </string-name>
          , Alejandro Metke-Jimenez,
          <string-name>
            <given-names>Madonna</given-names>
            <surname>Kemp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chen</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Cadec: A corpus of adverse drug event annotations</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>55</volume>
          (
          <year>2015</year>
          ),
          <fpage>73</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Collier</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Introduction to the bio-entity recognition task at JNLPBA</article-title>
          .
          <source>In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <fpage>70</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Jiao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yueping</given-names>
            <surname>Sun</surname>
          </string-name>
          , Robin J Johnson, Daniela Sciaky,
          <string-name>
            <given-names>Chih-Hsuan</given-names>
            <surname>Wei</surname>
          </string-name>
          , Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>BioCreative V CDR task corpus: a resource for chemical disease relation extraction</article-title>
          .
          <source>Database</source>
          <volume>2016</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Ling</surname>
          </string-name>
          and Daniel S Weld.
          <year>2012</year>
          .
          <article-title>Fine-Grained Entity Recognition</article-title>
          .
          <source>In AAAI</source>
          , Vol.
          <volume>12</volume>
          .
          <fpage>94</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Push</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>ConceptNet—a practical commonsense reasoning tool-kit</article-title>
          .
          <source>BT Technology Journal</source>
          <volume>22</volume>
          ,
          <issue>4</issue>
          (
          <year>2004</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>George A</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>WordNet: a lexical database for English</article-title>
          .
          <source>Commun. ACM</source>
          <volume>38</volume>
          ,
          <issue>11</issue>
          (
          <year>1995</year>
          ),
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>David</given-names>
            <surname>Nadeau</surname>
          </string-name>
          and
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          <volume>30</volume>
          ,
          <issue>1</issue>
          (
          <year>2007</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Nam</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rich</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Classification with partial labels</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM</source>
          ,
          <fpage>551</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sinno Jialin</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.
          <year>2010</year>
          .
          <article-title>A survey on transfer learning</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          ,
          <issue>10</issue>
          (
          <year>2010</year>
          ),
          <fpage>1345</fpage>
          -
          <lpage>1359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Slav</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ryan</given-names>
            <surname>McDonald</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Universal Part-of-Speech Tagset</article-title>
          .
          <source>In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>YOLO9000: Better, Faster, Stronger</article-title>
          . In
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . IEEE,
          <fpage>6517</fpage>
          -
          <lpage>6525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Ren</surname>
          </string-name>
          , Wenqi He, Meng Qu, Lifu Huang,
          <string-name>
            <given-names>Heng</given-names>
            <surname>Ji</surname>
          </string-name>
          , and Jiawei Han.
          <year>2016</year>
          .
          <article-title>AFET: Automatic fine-grained entity typing by hierarchical partial-label embedding</article-title>
          .
          <source>In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          .
          <fpage>1369</fpage>
          -
          <lpage>1378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianyong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and Jiawei Han.
          <year>2014</year>
          .
          <article-title>Entity linking with a knowledge base: Issues, techniques, and solutions</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>27</volume>
          ,
          <issue>2</issue>
          (
          <year>2014</year>
          ),
          <fpage>443</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Sonse</given-names>
            <surname>Shimaoka</surname>
          </string-name>
          , Pontus Stenetorp, Kentaro Inui, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural Architectures for Fine-grained Entity Type Classification</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>1271</fpage>
          -
          <lpage>1280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
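          <string-name>
            <given-names>Erik F</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fien</given-names>
            <surname>De Meulder</surname>
          </string-name>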
          .
          <year>2003</year>
          .
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. Association for Computational Linguistics</source>
          ,
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Denny</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          ,
          <issue>10</issue>
          (
          <year>2014</year>
          ),
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Weischedel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ada</given-names>
            <surname>Brunstein</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>BBN pronoun coreference and entity type corpus</article-title>
          .
          <source>Linguistic Data Consortium, Philadelphia</source>
          <volume>112</volume>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Weischedel</surname>
          </string-name>
          , Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al.
          <source>OntoNotes Release 5.0 LDC2013T19</source>
          . Linguistic Data Consortium
          , Philadelphia, PA (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Peng</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Denilson</given-names>
            <surname>Barbosa</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Neural Fine-Grained Entity Type Classification with Hierarchy-Aware Loss</article-title>
          .
          <source>arXiv preprint arXiv:1803.03378</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Min-Ling</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Fei Yu, and
          <string-name>
            <given-names>Cai-Zhi</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Disambiguation-free partial label learning</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>29</volume>
          ,
          <issue>10</issue>
          (
          <year>2017</year>
          ),
          <fpage>2155</fpage>
          -
          <lpage>2167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Zhi-Hua</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Ensemble methods: foundations and algorithms</article-title>
          .
          <publisher-name>Chapman and Hall/CRC</publisher-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>