                 Label Embedding for Transfer Learning
                                            Rasha Obeidat, Xiaoli Fern, Prasad Tadepalli
                                       School of Electrical Engineering and Computer Science
                                                       Oregon State University
                                          {obeidatr, xfern, tadepall}@eecs.oregonstate.edu


Abstract. Automatically tagging textual mentions with the concepts, types, and entities that they represent is an important task for which supervised learning has been found to be very effective. In this paper, we consider the problem of exploiting multiple sources of training data with variant ontologies. We present a new transfer learning approach based on embedding multiple label sets in a shared space, and using it to augment the training data.

Keywords—transfer learning, label embedding.
I. INTRODUCTION

Automatically tagging textual mentions with the ontological concepts, types, and entities that they represent is useful in many knowledge-intensive fields such as biology and medicine. This problem is studied under the names of Named Entity Recognition, Entity Linking, and Wikification. Supervised learning from annotated training data has been found to be an effective method to tackle this task. However, in most fields in general, and biology in particular, there are often multiple ontologies. For example, different ontologies such as the Cell Type Ontology, the Protein Ontology, the Sequence Ontology, and the Gene Ontology might overlap, but use different vocabulary and provide complementary information [11]. Each ontology comes with its own annotated training data, which presents the problem of reconciling the different ontologies and effectively using the training data for the old (source) ontologies in training for a new (target) ontology.

The above problem is an instance of transfer learning, which aims to leverage the training data from one or more source domains to improve the sample efficiency in a related target domain. Domain adaptation is transfer learning where the source and the target domains use the same label set but have different distributions [1]. Transfer learning where the label sets vary across domains is far less studied. In many real-world applications, the ontologies or label sets of different tasks can be (implicitly) overlapping and/or intricately related. For example, one biological application of natural language processing is to tag natural texts with proteins from a given protein ontology. In a related task, we might need to tag the text with genes based on a specific gene ontology. The two ontologies are clearly related and may provide useful information toward one another. For such tasks, we need a transfer learning approach that can be applied with variant ontologies/label sets, which will learn simultaneously from both domains and thus enhance the efficiency of learning.

Standard domain adaptation techniques [3,4] are not directly applicable to this problem because they assume that the label sets are invariant. Recent work proposed a solution based on finding a mapping between the labels using Canonical Correlation Analysis (CCA), and then reducing the problem to the standard domain adaptation setting [5].

We develop a method that embeds the source and target labels in a shared space and takes advantage of that shared space to transfer the knowledge. Instead of using the label embeddings to produce a mapping between the source and target labels, we directly employ them to augment the feature representation of the target examples with the predicted source label embeddings. After that, a model is trained on the target side. We conducted a preliminary study on the task of Named Entity Recognition in which we used two datasets with different but related annotation schemes. We show that our approach significantly outperforms several baselines.

II. PROBLEM SETUP

A domain Di = (Xi, P(Xi)) consists of two components: the feature space Xi and the corresponding marginal distribution P(Xi). Let Ti = (Yi, fi(·)) be task i, where Yi is the label set of domain i and fi(·): Xi → Yi is a function that maps Xi to Yi. The goal of transfer learning is to use the knowledge of fs, learned from the source domain-task pair (Ds, Ts), to improve the learning of ft on the target side (Dt, Tt).

In standard domain adaptation (a.k.a. transductive transfer learning [3,4,6]), the source and the target tasks are the same, i.e., Ts = Tt, while the domains differ (either Xs ≠ Xt or P(Xs) ≠ P(Xt)). On the other hand, in the inductive transfer learning setting [7,5], which includes our work, the domains are the same or closely related, but the tasks differ, i.e., Ts ≠ Tt.
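The two settings can be summarized in display form (this is only a restatement of the definitions in this section, using the same notation):

\[
\text{transductive: } T_s = T_t,\ \ X_s \neq X_t \ \text{or}\ P(X_s) \neq P(X_t); \qquad
\text{inductive: } D_s \text{ and } D_t \text{ the same or similar},\ \ T_s \neq T_t .
\]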
III. TRANSFER LEARNING VIA LABEL EMBEDDING

In this section, we describe our approach to learning label embeddings and using them to transfer the learning across domains. We follow the method presented in Kim et al. [5] to induce the label embeddings. Specifically, we use Canonical Correlation Analysis (CCA) [8] to project both source and target labels into a shared space where the correlation between the projected vectors is maximized. Then, we employ these embeddings to transfer the knowledge from the source domain to the target domain. The projection vectors can then be used to reduce the dimensionality of the variables by projecting them into a k-dimensional space, where k is a parameter to be tuned.
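To make the embedding step concrete, the following Python sketch derives k-dimensional label embeddings from a source-label-by-target-label co-occurrence matrix via the count-based CCA/SVD recipe. This is a simplification, not the exact procedure of [5]; the matrix C and all names here are illustrative assumptions.

import numpy as np

def cca_label_embeddings(C, k):
    # C: co-occurrence counts between source labels (rows) and target
    # labels (columns); k: embedding size (at most min(C.shape)).
    C = np.asarray(C, dtype=float)
    total = C.sum()
    p_src = C.sum(axis=1) / total        # marginal prob. of each source label
    p_tgt = C.sum(axis=0) / total        # marginal prob. of each target label
    eps = 1e-12                          # guards against empty rows/columns
    # Correlation-normalized matrix; its top singular vectors give the
    # maximally correlated projections of the two label sets.
    O = (C / total) / np.sqrt(np.outer(p_src + eps, p_tgt + eps))
    U, S, Vt = np.linalg.svd(O, full_matrices=False)
    return U[:, :k], Vt[:k, :].T         # source / target label embeddings

# Toy usage: 4 source types x 6 target types. The experiments below use
# embeddings of size 5; this toy matrix only supports k <= 4.
rng = np.random.default_rng(0)
C = rng.integers(1, 50, size=(4, 6))
src_emb, tgt_emb = cca_label_embeddings(C, k=4)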
To use the extracted embeddings in transferring the knowledge, we propose a method that works as follows: first, we train a model on the source domain and use it to make predictions on the target domain. Then, we augment the feature space of each instance in the target domain with the label embedding corresponding to the predicted source label. Finally, a model is trained on the target domain (see the sketch below).

A nice property of this method is that it can be applied regardless of the type of relationship between the source and the target labels. It works with 1-to-1, n-to-1, and 1-to-n relationships. It is also applicable if the label types overlap.
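A minimal sketch of this pipeline, using a generic scikit-learn classifier as a stand-in for the CRF tagger used in our experiments (all function and variable names are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

def augment_and_train(X_src, y_src, X_tgt, y_tgt, src_emb, src_label_index):
    # 1. Train a model on the source domain.
    src_model = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    # 2. Predict a source label for every target instance.
    pred = src_model.predict(X_tgt)
    # 3. Augment each target instance with the embedding of its
    #    predicted source label.
    emb = np.vstack([src_emb[src_label_index[y]] for y in pred])
    X_aug = np.hstack([X_tgt, emb])
    # 4. Train the final model on the augmented target data.
    tgt_model = LogisticRegression(max_iter=1000).fit(X_aug, y_tgt)
    return src_model, tgt_model

Note that at test time a target instance must be augmented the same way (predict its source label with src_model, append that label's embedding) before querying tgt_model.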

IV. EXPERIMENTAL SETUP

In this section, we describe our experimental setup and results on the task of Named Entity Recognition (NER).

Dataset. We used the CoNLL 2003^1 NER benchmark dataset as the source domain and a small dataset, the TAC-KBP2015^2 NER dataset, as the target. CoNLL 2003 defines four entity types: Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC). TAC-KBP2015 defines six entity types: Person (PER), Title (TTL), Organization (ORG), Geopolitical Entities (GPE), Location (LOC), and Facilities (FAC). Our approach does not need any prior knowledge of the matching types between CoNLL 2003 and TAC-KBP2015.
Evaluation. We follow the CoNLL exact-match evaluation protocol for the NER task [9]. In particular, we calculate the recall, the precision, and the F1-score for each entity type, and then micro-average the recalls, the precisions, and the F1-scores.
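As a sketch of the aggregation step (the exact-match span comparison is omitted, and the counts in the usage line are hypothetical), micro-averaging pools the per-type counts before computing the scores:

def micro_prf(counts):
    # counts: entity type -> (tp, n_predicted, n_gold), where tp counts
    # predicted spans that exactly match a gold span of the same type.
    tp = sum(c[0] for c in counts.values())
    n_pred = sum(c[1] for c in counts.values())
    n_gold = sum(c[2] for c in counts.values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

print(micro_prf({"PER": (80, 95, 100), "ORG": (50, 70, 90), "GPE": (40, 50, 60)}))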
Features and Training. We employ the standard set of features used by the Stanford NLP group to train their NER system^3. The feature set includes word features, orthographic features, feature conjunctions, and others. We also train our model using the Stanford NER system^4, which provides a general implementation of Conditional Random Fields [10]. We use label embeddings of size 5 in all of our experiments.
Baselines. To investigate the effectiveness of our method, AugmntTr, we compare it to two baselines:

• TargetOnly: train a model on the target dataset only.
• Pred: use the output of the source predictor as an additional feature to train a model on the target dataset.
V. RESULTS AND DISCUSSION

In this section, we present the experimental results of all approaches under study. The results are summarized in Table I, which shows that our method AugmntTr produces about 7% and 9% F1-score improvements over the TargetOnly and Pred methods, respectively. This illustrates the ability of CCA to discover the relationships between label types in the CoNLL 2003 and TAC-KBP2015 datasets. Augmenting the feature space of the TAC-KBP2015 dataset with the label embeddings of CoNLL 2003 labels transfers the knowledge from CoNLL 2003 to TAC-KBP2015 via these embeddings.

TABLE I. Micro-averaged recall, precision, and F1-scores of the methods TargetOnly, Pred, and AugmntTr on the task of Named Entity Recognition.

    Method        Avg-R    Avg-P    Avg-F1
    TargetOnly    0.618    0.753    0.679
    Pred          0.576    0.756    0.654
    AugmntTr      0.745    0.746    0.745

VI. CONCLUSION

We present an approach to transfer learning with different label sets between the source and the target domains. Our approach makes use of label embeddings induced by CCA. We augment the feature space of the target data with the embeddings of the predicted source labels, and then train a model on the target domain. We find that CCA is able to produce high-quality label embeddings that are capable of transferring the knowledge across domains, which explains the superiority of our approach over the baselines.

ACKNOWLEDGMENTS

We gratefully acknowledge the support of DARPA and AFRL under contract number FA8750-13-2-0033.

REFERENCES

[1] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[2] G. Schweikert, G. Rätsch, C. Widmer, and B. Schölkopf, “An empirical analysis of domain adaptation algorithms for genomic sequence analysis,” in Advances in Neural Information Processing Systems, 2009, pp. 1433–1440.
[3] H. Daumé III, “Frustratingly easy domain adaptation,” arXiv preprint arXiv:0907.1815, 2009.
[4] J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 120–128.
[5] Y.-B. Kim, K. Stratos, R. Sarikaya, and M. Jeong, “New transfer learning techniques for disparate label sets,” in ACL. Association for Computational Linguistics, 2015.
[6] J. Jiang and C. Zhai, “Instance weighting for domain adaptation in NLP,” in ACL, vol. 7, 2007, pp. 264–271.
[7] S. J. Pan, Z. Toh, and J. Su, “Transfer joint embedding for cross-domain named entity recognition,” ACM Transactions on Information Systems (TOIS), vol. 31, no. 2, p. 7, 2013.
[8] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
[9] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.
[10] R. Leaman, G. Gonzalez et al., “BANNER: an executable survey of advances in biomedical named entity recognition,” in Pacific Symposium on Biocomputing, vol. 13. Citeseer, 2008, pp. 652–663.
[11] C.-T. Tsai and D. Roth, “Concept grounding to multiple knowledge bases via indirect supervision,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 141–154, 2016.

____
1 http://www.cnts.ua.ac.be/conll2003/ner/
2 http://www.nist.gov/tac/2015/KBP/
3 http://nlp.stanford.edu/projects/biNER/en.prop
4 http://nlp.stanford.edu/software/CRF-NER.shtml