<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Taxonomy Embeddings on PubMed Article Subject Headings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alyssa Lees</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacek Korycki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Welty</string-name>
          <email>cawelty@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shubin Zhao</string-name>
          <email>shubing@google.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Research</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine learning approaches for hierarchical partial orders, such as taxonomies, are of increasing interest in the research community, though practical applications have not yet emerged. The basic intuition of hierarchical embeddings is that some signal from taxonomic knowledge can be harnessed in broader machine learning problems; when we learn similarity of words using word embeddings, the similarity of lion and tiger is indistinguishable from the similarity of lion and animal. The ability to tease apart these two kinds of similarity in a machine learning setting improves quality and enables the exploitation of the numerous human-curated taxonomies available across domains, while also addressing known problems of taxonomic organization, such as partial or conditional membership. We explore some of the practical problems in learning taxonomies using Bayesian Networks, partial order embeddings, and box lattice embeddings, where box containment represents category containment. Using open data from PubMed articles with human-assigned MeSH labels, we investigate the impact of taxonomic information, negative sampling, instance sampling, and objective functions on performance on the taxonomy learning problem. We discovered a particular problem in learning box embeddings for taxonomies, which we call the box crossing problem, and developed strategies to overcome it. Finally, we make some initial contributions to using taxonomy embeddings to improve another learning problem: inferring the anatomical locations of diseases from their use as subject labels in journal articles. In most experiments, after our improvements to the box models, they outperformed the simpler Bayes Net approach as well as order embeddings.</p>
      </abstract>
      <kwd-group>
        <kwd>box embeddings</kwd>
        <kwd>taxonomies</kwd>
        <kwd>semantic relations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Recent work in machine learning on representational structures such as taxonomies promises to blend the value of human-curated taxonomies with the power and flexibility of machine learning systems. Human-curated taxonomies are common in libraries, medicine, and popular culture, and they are a widely used organizational tool for inventories and information in general. Yet they rely on an assumption of discreteness (category membership is either true or false) and this assumption does not generally play well with continuous-valued systems or spatial embeddings.</p>
<p>Much of the recent machine learning work on taxonomies has focused on the toy problem of reconstructing an existing taxonomy, which is of little practical value except as a stepping stone to a broader interaction. Rather, the key question is whether taxonomic similarity can be teased apart from other kinds of similarity. For example, we want to say that while lion, tiger, and animal are all similar, animal is more general. In many cases, this explicit knowledge can be derived from existing taxonomies, but without some way of understanding this knowledge in the embedding space, it is just another similarity signal. Recently, order embeddings addressed this issue by orienting similarity along two axes and adding a constraint to similarity embeddings that ensured more general terms had to be closer to the origin.</p>
<p>More recently we have seen the introduction of box lattice embeddings, which treat categories in a taxonomy as n-dimensional boxes, with a constraint that boxes representing more general terms should contain boxes representing more specific terms.</p>
<p>In this work, we are primarily interested in a typical inference problem in which there are two classes of entities and the target is to learn the relationships between them. We have chosen an open dataset that exemplifies the problem: PubMed articles with human-labeled subject headings. We want to learn the relationship between diseases and parts of the body using the co-occurrence of these labels as subjects, further informed by the inclusion of taxonomic information. It stands to reason that if asthma (disease) and bronchi (anatomy) co-occur in data, and bronchi is a subcategory of lung in the taxonomy, then asthma and lung are related. Our primary research question is whether these two types of knowledge (co-occurrence and taxonomy) can be integrated in a way that captures softer constraints than, for example, a set of logic rules. To answer this question, however, we first had to explore the practicalities of learning embeddings for categories in a taxonomic structure.</p>
<p>The contributions of this paper center on rigorous analysis of taxonomy embeddings learned from data, in particular the training regimes that impact the faithfulness of the embeddings, and the degree to which taxonomic information can be incorporated in a learning problem. Our experiments focus on Box Lattice models, though we also include some analysis of Bayes Nets and Partial Order Embeddings. Our first contribution is a simple fix to the non-differentiable hinge property of the original box model objective function. We then examine the influence of negative sampling methods and weighting on learning the taxonomy. We argue that, while sensitivity to negatives is not a particularly new machine learning problem, the peculiar nature of learning boxes with volumes (as opposed to vectors) makes it increasingly important. Next we consider various options for using instance data to support the taxonomy embedding task, and contribute methods and analysis for summarizing instances to avoid an explosion of the parameter space. We also analyze evaluation metrics, noting that the generalization implicit in taxonomies makes some categories easier and others harder.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        MeSH, the NLM Medical Subject Headings, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a taxonomy of subject headings for categorizing medical writing. It is organized in a traditional library subject classification style, with the slight difference that subject headings in the taxonomy can have multiple parents (it is a DAG, not a tree). PubMed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a very large collection of metadata for over 30 million medical journal articles, each with human-labeled subject categories from MeSH. The medical journal community of authors, editors, publishers, and the NLM's librarians works together to keep the process scalable and of reasonable quality. That is not to say there are no errors, but no taxonomy is free of them.
      </p>
      <p>MeSH is organized into 16 top-level categories, such as A: Anatomy, B: Organisms, C: Diseases, which themselves cannot be the subjects of articles. The taxonomy is on average 8 levels deep, with a generalization or broader-term relationship from child to parent nodes, e.g. &lt;Respiratory System, Anatomy&gt;, &lt;Larynx, Respiratory System&gt;, &lt;Asthma, Diseases&gt;. For simplicity we will adopt this notation, where the taxonomy is represented linearly as a collection of &lt;child, parent&gt; edges in the taxonomy graph (noting again that it is not strictly a tree); a minimal sketch of this representation follows.</p>
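      <p>Concretely, this edge notation corresponds to a simple set-of-pairs representation. The following Python sketch is our own illustration (the variable and helper names are not from the paper):</p>
      <preformat>
# A taxonomy as a set of (child, parent) edges; since MeSH is a DAG,
# a child may appear with more than one parent.
mesh_edges = {
    ("Respiratory System", "Anatomy"),
    ("Larynx", "Respiratory System"),
    ("Asthma", "Diseases"),
}

def parents_of(node, edges):
    # All direct parents of a node under the broader-term relation.
    return {p for c, p in edges if c == node}
      </preformat>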
      <p>
        The MeSH anatomy hierarchy mostly follows a mereological generalization (subparts to parts), while diseases follow a slightly more causal generalization (&lt;Pneumonia, Bacterial Infection&gt;). This kind of semantic promiscuity is extremely common in taxonomies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and causes an imprecision that begs for an approach with soft, continuous-valued constraints, as opposed to the traditional discrete reading of the taxonomic relationship (all members of the subcategory are members of the supercategory). It was this observation that led us to taxonomy embeddings.
      </p>
      <p>Articles listed in PubMed can have any number of subject headings, on average 8-10, and it is fully expected by the published methodology that assigning a particular subject heading to an article also assigns its MeSH parents and ancestors; they expect the transitive closure to hold.</p>
      <p>Most taxonomies take a universal vs. particular (i.e., category vs. instance) view of the world, in which instances are members of set-like categories. The primary semantic distinction is that the relationship between instances and categories (instance-of) is not transitive, whereas the relationship between categories (subcategory) is transitive. However, like the subcategory relations shown above, the instance-of relation is very promiscuous in practice, and we continue that practice by treating the relationship between articles in PubMed and MeSH categories as instance-of: the articles are members of their subject categories.</p>
      <p>We side-step whatever philosophical fallout these choices may entail by defining our research question to be a data-mining one:</p>
      <p>
        RQ1: Can we use the co-occurrence of subject labels on PubMed articles to learn associations between those labels, similar to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]? As a simplifying example, can we learn the association between diseases and parts of the body from the co-occurrence of MeSH labels on PubMed articles in the Anatomy and Diseases branches of the MeSH taxonomy?
RQ2: Can the taxonomy be used to improve the learning of these associations? Intuitively, knowing &lt;asthma, lung disease&gt; and &lt;bronchi, lung&gt; should help us associate asthma with lung, and even lung disease with lung.
      </p>
      <p>The principal contributions of this paper are an empirical analysis of these embedding techniques, primarily box embeddings, and of how the basics of the taxonomy problem interact with learning.</p>
    </sec>
    <sec id="sec-3">
      <title>Taxonomy Embeddings</title>
      <sec id="sec-3-1">
        <title>Partial Order Embeddings</title>
        <p>
          We applied Partial Order Embeddings to our datasets, an established technique for modeling taxonomy data, as described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The model assigns each entity u an embedding f(u) ∈ ℝⁿ in such a way that the order among entities, defined by edges (u, v), is maintained in the embedding space by requiring that f(u) ⪰ f(v) in all dimensions. This is accomplished by defining a continuous score function on embeddings that measures compliance with the order constraint:
        <p>E(x, y) = ‖max(0, y − x)‖²</p>
        <p>This score should be zero, or close to it, for a pair forming an edge (from a set of positive examples P) and large and positive for a pair that is not an edge (from a set of negative examples N). This requirement is enforced by minimizing the following max-margin loss function over embeddings f:</p>
        <p>L(f) = Σ_{(u,v) ∈ P} w(u, v) · E(f(u), f(v)) + W · Σ_{(u′,v′) ∈ N} max(0, α − E(f(u′), f(v′)))</p>
        <p>It uses a margin α &gt; 0 to express the minimum desired score for a negative example. The relative importance of positive examples may be controlled with edge weights w, if such values are available in the dataset (for example, as conditional probabilities). The hyper-parameter W controls the relative importance of the positive and negative parts of the loss.</p>
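        <p>As a concrete rendering of this loss, the following NumPy sketch is our own illustration; the function names, the embedding dictionary, and the default margin are assumptions, not the paper's code:</p>
        <preformat>
import numpy as np

def order_penalty(fu, fv):
    # E(f(u), f(v)) = ||max(0, f(v) - f(u))||^2: zero exactly when the
    # parent embedding f(v) is dominated coordinate-wise by the child f(u).
    return float(np.sum(np.maximum(0.0, fv - fu) ** 2))

def poe_loss(f, positives, negatives, w=None, W=1.0, alpha=1.0):
    # Max-margin loss over positive edges P and negative edges N;
    # f maps each entity to its embedding vector.
    pos = sum((w.get((u, v), 1.0) if w else 1.0) * order_penalty(f[u], f[v])
              for u, v in positives)
    neg = sum(max(0.0, alpha - order_penalty(f[u], f[v]))
              for u, v in negatives)
    return pos + W * neg
        </preformat>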
      </sec>
      <sec id="sec-3-2">
        <title>Box Lattice Embeddings</title>
        <p>
          Box lattice embeddings associate each category with two vectors in [0, 1], (x_m, x_M), the
minimum and maximum value of the box at each dimension [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For numerical reasons these are stored as a minimum and a positive offset, plus an ε term to prevent boxes from becoming too small. This representation of boxes in Cartesian space can define a partial ordering by containment of boxes, and a lattice structure as:
x ∧ y = ⊥ if x and y are disjoint, else
x ∧ y = ∏ᵢ [max(x_{m,i}, y_{m,i}), min(x_{M,i}, y_{M,i})]
x ∨ y = ∏ᵢ [min(x_{m,i}, y_{m,i}), max(x_{M,i}, y_{M,i})]
if A = (x_m, x_M) then p(A) = ∏ᵢ (x_{M,i} − x_{m,i})
and the objective function for a positive edge follows as
log( p(a ∨ b) / (p(a) · p(b)) )
        </p>
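        <p>The following NumPy sketch is our own illustration of the box operations and the positive-edge objective above; the class and function names are assumptions, and a real trainer would use the minimum/offset/ε parameterization the text describes:</p>
        <preformat>
import numpy as np

class Box:
    # A category box: per-dimension minimum and maximum in [0, 1].
    def __init__(self, x_min, x_max):
        self.m, self.M = np.asarray(x_min), np.asarray(x_max)

def meet(x, y):
    # x ∧ y: the intersection box, or None (⊥) when the boxes are disjoint.
    m, M = np.maximum(x.m, y.m), np.minimum(x.M, y.M)
    return None if np.any(m >= M) else Box(m, M)

def join(x, y):
    # x ∨ y: the smallest box enclosing both x and y.
    return Box(np.minimum(x.m, y.m), np.maximum(x.M, y.M))

def volume(box):
    # p(A) = product over dimensions of (x_{M,i} - x_{m,i}).
    return float(np.prod(box.M - box.m))

def positive_objective(a, b):
    # log( p(a ∨ b) / (p(a) · p(b)) ): driving this down shrinks the
    # enclosing box relative to a and b, i.e. increases their overlap.
    return np.log(volume(join(a, b))) - np.log(volume(a)) - np.log(volume(b))
        </preformat>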
        <p>In other words, maximize the overlap of boxes with a positive edge. The original boxes paper showed results for reconstructing a taxonomy derived from WordNet using the transitive closure of edges, and the team made the code available on GitHub, which we reused and modified for our experiments.</p>
        <p>For comparison, we used a Bayesian model to predict whether an instance belongs to a category. Given the training data, the model computes the conditional probability p(C_x | C_y) for any category pair (C_x, C_y) that exists in the training data. Negative training examples are ignored: because those category pairs are randomly generated, there is no correlation between the categories. Given a test example (C, C′), the model needs to predict whether there is a positive relationship. (C, C′) should not appear in the training data, but C and C′ must each appear in the training data; otherwise we would have no basis for predicting the correlation. Assuming the seen parent categories of C in the training data are (C₁, …, C_m), the model makes the estimate</p>
        <p>p(C′ | C) = Σₖ₌₁ᵐ p(C′ | Cₖ, C) · p(Cₖ | C) ≈ Σₖ₌₁ᵐ p(C′ | Cₖ) · p(Cₖ | C)</p>
        <p>Given a score threshold T, if p(C′ | C) &gt; T the model predicts a positive relationship for (C, C′); otherwise it predicts a negative relationship. Precision and recall were taken from the best point on the precision-recall curve over different values of T.</p>
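        <p>A minimal Python sketch of this Bayesian predictor follows; the data structures and the example threshold are our own assumptions:</p>
        <preformat>
from collections import Counter, defaultdict

def fit(pairs):
    # pairs: observed positive category pairs (C_x, C_y) from training data.
    # Estimates the conditional probability p(C_y | C_x) for each seen pair.
    joint, marginal = Counter(), Counter()
    parents = defaultdict(set)
    for cx, cy in pairs:
        joint[(cx, cy)] += 1
        marginal[cx] += 1
        parents[cx].add(cy)   # C_y is a seen parent of C_x
    cond = {(cx, cy): n / marginal[cx] for (cx, cy), n in joint.items()}
    return cond, parents

def predict(cond, parents, c, c_prime, threshold=0.5):
    # p(C'|C) ~ sum over seen parents C_k of C of p(C'|C_k) * p(C_k|C).
    score = sum(cond.get((ck, c_prime), 0.0) * cond.get((c, ck), 0.0)
                for ck in parents.get(c, ()))
    return score > threshold
        </preformat>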
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Learning Taxonomies</title>
      <p>There are many taxonomies for organizing many kinds of information, and they can be differentiated in many ways. For our purposes, we focus on a few distinctions that became important in our experiments.</p>
      <p>Many taxonomies are extensional: the categories organize sets of entities in some problem domain, such as movies, stores, restaurants, books, or songs. There is a fairly clear instance/category distinction in these cases, and the number of instances vastly outweighs the number of categories. The set of instances of a category is its extension. Viewed in its entirety, such a taxonomy has edges from instances to categories, which we call inst-cat edges (e.g. &lt;Star Wars, Science-Fiction Movie&gt;), and edges between categories, which we call cat-cat edges (e.g. &lt;Cult Science-Fiction Movie, Science-Fiction Movie&gt;).</p>
      <p>Some taxonomies are intensional: they have no instances, at least none represented in data. WordNet synsets, for example, are arranged in a taxonomy, and there are very few instance-like synsets in WordNet, referring e.g. to specific people and organizations; for the most part, synsets like "amount of matter" have no clear extension. Such taxonomies have only cat-cat edges.</p>
      <p>Most previous experiments on learning taxonomy embeddings were done on intensional taxonomies, with no instances, and the objective (for training and evaluation) was simply the number of correct cat-cat edges learned above some confidence threshold.</p>
      <p>Our datasets are extensional and contain millions of instances: PubMed 2018 has over 30 million. Surely this wealth of data could be used to help focus the taxonomy learning. At the very least, inst-cat edges could tell us the relative importance of getting certain cat-cat edges correct.</p>
      <sec id="sec-4-1">
        <title>Learning taxonomies from instances</title>
        <p>We tried many approaches to utilizing the rich extension of MeSH in PubMed articles. The most obvious was to treat each inst-cat edge as part of the graph and ignore the semantic differences with cat-cat edges. However, in our embedding techniques, edges are training examples and the vertices (each category or instance) are the learnable parameters, so this results in an explosion of parameters, almost to the point of having fewer examples than parameters. Further, representing instances as embeddings causes many spurious inst-inst edges to be inferred from the dense encodings, generating tremendous noise. One solution may be to use different forms of optimization, such as annealing or Brownian Motion; we save this for future work.</p>
        <p>Another approach is to reuse the embedding techniques designed for cat-cat edges and summarize the inst-cat edges into the categories. For every pair of co-occurring inst-cat edges &lt;c, p1&gt;, &lt;c, p2&gt;, we emit a cat-cat edge &lt;p1, p2&gt;. This leads to a repetition of edges in the training data that reflects the magnitude of the co-occurrence (see the sketch after the list below). We compared these two approaches:</p>
        <p>Taxo: Use the edges from the taxonomy, with no consideration for the instances.</p>
        <p>Summary: Include the co-occurrence of categories in instances as a weight on the taxonomy edges.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Transitive Closure</title>
        <p>In addition to summarizing instances, we experimented with the use of deterministic reasoning on the training data, specifically the transitive closure of cat-cat and inst-cat edges. In previous work, this is the only approach taken. However, it seems possible that, given only direct cat-cat edges (e.g. if we have &lt;a, b&gt; and &lt;b, c&gt;, then we do not have &lt;a, c&gt;), we can learn embeddings for the taxonomy that approximate it.</p>
        <p>For instance edges, we can similarly compute the transitive closure in the usual way (e.g. if we have &lt;i, a&gt; and &lt;a, b&gt; then we add &lt;i, b&gt;), and then summarize from the inferred inst-cat edges as described above; a sketch follows.</p>
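        <p>A naive fixed-point computation of the closure, in Python, is sketched below; the function names are ours, and the quadratic inner join is fine at taxonomy scale but would need indexing for bigger graphs:</p>
        <preformat>
def transitive_closure(cat_edges, inst_edges):
    # cat_edges: (child, parent) pairs; inst_edges: (instance, category).
    closed = set(cat_edges)
    while True:
        # Compose every pair of edges (a, b), (b, c) into (a, c).
        new = {(a, c)
               for (a, b) in closed
               for (b2, c) in closed if b == b2} - closed
        if not new:
            break
        closed |= new
    # Propagate instance edges upward: (i, a) and (a, b) imply (i, b).
    ancestors = {}
    for a, b in closed:
        ancestors.setdefault(a, set()).add(b)
    inst_closed = set(inst_edges)
    for i, a in inst_edges:
        for b in ancestors.get(a, ()):
            inst_closed.add((i, b))
    return closed, inst_closed
        </preformat>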
        <p>This gives us two more approaches for each data set:</p>
        <p>Direct: Do not compute the transitive closure of the category edges. For PubMed articles, only the human-labelled subject headings are used. For the most part, this means no instance will have edges to both a category and its ancestor, though there are a few exceptions. Reconstructing a taxonomy from only direct edges is expected to be an extremely hard problem to solve with embedding methods, but it serves the purpose of informing us how these particular methods can be used to learn other relations, and how to isolate the signal coming from the taxonomy itself. See more discussion below.</p>
        <p>Closure: Compute the transitive closure of the category edges. PubMed articles will have edges to each human-supplied subject heading and all its ancestors. Clearly the closure multiplies the positive training data by a large amount, and this is expected to be the simpler of the two approaches for learning the taxonomy.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Negative sampling</title>
        <p>In the original box and order embedding papers, negatives were sampled uniformly from
the complement of positive edges with a ratio of 1:1.</p>
        <p>
          We found this uninformed sampling created conflicting constraints and could generate negatives that prevented desirable generalizations. Some improvements to sampling were proposed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], but these were more specific to order embeddings.
        </p>
        <p>Clearly negative sampling is a general problem in machine learning, but again a bit of analysis and understanding of the taxonomy setting can help reduce its impact.</p>
        <p>When we tried to replicate the original box embedding results on our taxonomy, we found that manual inspection of the untrained positives (i.e. the model generalizations) showed many bad edges, such as &lt;heart, lung&gt; or &lt;bacteria, virus&gt;. As we added more test negatives, we found the false positive rate was simply a function of the neg:pos ratio. For boxes it is somewhat trivial, with a low neg:pos ratio, to draw large boxes that overlap nearly everything, or long thin boxes that overlap exactly the set of boxes they need to while avoiding the few negatives. In Fig. 1, the idealized box embedding for a simple taxonomy is shown on the left; category C should be contained in both A and B, therefore the latter should overlap. Without negatives, the simplest zero-loss solution is to make all boxes overlap, as shown on the right of the figure. With a low negative rate (and many more boxes), the learned solutions ended up looking similar.</p>
        <p>Raising the neg:pos ratio seems like an obvious solution; however, at 10:1 box embeddings failed to learn anything on all our datasets, achieving 90% accuracy by making all boxes disjoint. We compared the performance of the various approaches using two neg:pos ratios:
neg1: a 1:1 neg:pos ratio
neg10: a 10:1 neg:pos ratio</p>
        <p>In our error analysis we found negative edges that seemed to cause learning problems. First, the transitive closure of positive edges should not be in the negative set. This is not a problem when the transitive closure is in the positive set, as in previous work, but the direct approach (Sec. 4.2) does not assume this, so we must guard against it.</p>
        <p>If negatives are generated on-the-fly (as they were in the original box embeddings implementation), negatives from the transitive closure can leak into training after a test/train split. We must generate negatives in batch and then split into test/train.</p>
        <p>Self edges should never be negative.</p>
        <p>Inverse edges should never be negative; e.g. if &lt;c, p&gt; is a positive edge, &lt;p, c&gt; should not be negative. This is an interesting divergence between the box objective and the set semantics of categories: in a set semantics, a pair and its inverse can only both be true if c = p, thus negating the inverse is equivalent to asserting strict subset. The box objective, however, pushes negative edges apart, and a sub-category must clearly overlap with its parent.</p>
        <p>Edges between ancestors of the same category should never be negative; e.g. if &lt;c, p1&gt; and &lt;c, p2&gt; are positive, then neither &lt;p1, p2&gt; nor its inverse should be negative since, in a strict reading of the taxonomy, their boxes should overlap. Fig. 1 also shows the effect of a poorly chosen negative edge between A and B: it forces the two boxes apart, and C must absorb the loss. Again, in a set semantics, one may want to negate the subcategory edge between two overlapping categories, which would simply be interpreted as a constraint against one being a subset of the other; with boxes, doing so would again conflict with the objective to make them overlap.</p>
        <p>We introduce informed sampling (Alg. 1) as a way of avoiding these problems. The algorithm incrementally builds NN, the set of non-negative edges, by processing the constraints discussed above, and subtracts it from the edge set formed by the cross product of all categories in the taxonomy, C.</p>
        <p>Algorithm 1: Informed sampling of taxonomy negatives</p>
        <p>Input: Set of positive edges P = {&lt;c, p&gt;, ...}; c, p ∈ C</p>
        <p>Output: Set of negative edges to sample from
1: T ← Closure(P)
2: NN ← T
3: for &lt;c, p&gt; ∈ T do
4:   NN ← NN ∪ {&lt;c, c&gt;, &lt;p, p&gt;, &lt;p, c&gt;}
5:   map(c) ← map(c) ∪ {p}
6: for c ∈ map do
7:   for p1 ∈ map(c) do
8:     for p2 ∈ map(c) do
9:       NN ← NN ∪ {&lt;p1, p2&gt;}
10: return (C × C) \ NN</p>
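        <p>A direct Python rendering of Algorithm 1 is sketched below; it assumes the transitive closure has been computed beforehand, and the names are our own:</p>
        <preformat>
from collections import defaultdict

def informed_negatives(closure, categories):
    # closure: the transitively closed set of positive (child, parent)
    # edges, i.e. T = Closure(P). Returns the pool of allowed negatives.
    non_negative = set(closure)          # closure edges are never negative
    parent_map = defaultdict(set)
    for c, p in closure:
        # Self edges and inverse edges are never negative.
        non_negative |= {(c, c), (p, p), (p, c)}
        parent_map[c].add(p)
    for ps in parent_map.values():
        # Co-ancestors of a shared child must remain free to overlap,
        # so edges between them (in either direction) are never negative.
        for p1 in ps:
            for p2 in ps:
                non_negative.add((p1, p2))
    all_edges = {(a, b) for a in categories for b in categories}
    return all_edges - non_negative
        </preformat>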
        <p>In experiments we compare these two approaches to generating negatives:
Naive sampling: Sample negatives from the complement of the transitive closure of positive edges.</p>
        <p>Informed sampling: Sample negatives using informed sampling.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Box crossings</title>
        <p>Despite discovering and addressing these problems in negative sampling, we were still unable to get any experiments with boxes to learn anything at higher negative ratios.</p>
        <p>In further analysis we found the cause to be what we named the box crossing problem, illustrated in Figs. 3, 4, and 5. When two boxes representing categories that have a positive edge sit on either side of a box with which they each have a negative edge, the cost for the two boxes to cross the negative one is too large. As the neg:pos ratio increases, the chance of a box crossing standing in the way increases dramatically. While in some sense this is no more than a gradient descent problem (getting stuck in a local neighborhood), the specific learning problem, rather than some generic approach to optimization, gives us insight into solutions that smooth the objective.</p>
        <p>In previous work, boxes were initialized with random positions. This feeds the box crossing problem; as the negative ratio increases, boxes are more likely to start off surrounded by negatives. To overcome this problem, we add two more experimental approaches:</p>
        <p>Rand: Boxes have random initial size and position.</p>
        <p>Center: Boxes all start out in a small sphere at the center of the space. With all boxes equally overlapping positives and negatives, box crossing should not prevent them from moving in more optimal directions at the start.</p>
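        <p>A small sketch of the two initializations; the box sizes and the jitter are illustrative values we are assuming, not the paper's settings:</p>
        <preformat>
import numpy as np

def init_boxes(n_categories, dim, mode="center", rng=None):
    # rand: random position and size. center: every box starts as a small,
    # slightly jittered box around the middle of the unit cube, so all
    # boxes initially overlap each other roughly equally.
    rng = rng or np.random.default_rng(0)
    if mode == "rand":
        lo = rng.uniform(0.0, 0.9, size=(n_categories, dim))
        hi = lo + rng.uniform(0.01, 0.1, size=(n_categories, dim))
    else:
        jitter = rng.uniform(-0.01, 0.01, size=(n_categories, dim))
        lo, hi = 0.45 + jitter, 0.55 + jitter
    return lo, hi
        </preformat>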
        <p>In addition, we explored weighting the negative edges in training much lower than positives, to allow the positive signal to draw boxes together, enough to overcome box crossings; once the positive constraints are satisfied, the boxes should then move and re-size to obey the negatives. This gives us another set of experimental approaches to explore:
nloss-n: The negative loss is weighted by n compared to the positive loss.</p>
        <p>
          Although the failure of the original box model to generalize in the presence of more negatives forced us to discover and propose fixes to negative sampling and the box crossing problem, ultimately we found the primary problem was a hinge property of the loss function. When the box crossing problem arises, there is a non-differentiable hinge in the negative loss at the step where the negative boxes first overlap. An approach to smoothing the loss has appeared very recently [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]; we used a far simpler fix, developed before that result came out, which in combination with a lower negative weight is capable of overcoming the box crossing problem, as shown in Fig. 5.
        </p>
        <p>The hinge fix ensures a smooth transition when boxes with a positive edge meet, and when boxes with a negative edge move apart. When positive boxes are disjoint, there is loss proportional to their distance; when they meet, the loss is inversely proportional to the amount of overlap. At the point where they meet, these two kinds of loss must also meet, otherwise an unbounded gradient occurs.</p>
        <p>For negative edges, when two overlapping boxes come apart, the negative loss is zero. The original box model clipped this value at a small ε in order to avoid a zero value, and added simple smoothing. This was not working in practice, as we saw sharp gradient spikes during optimization. To solve this, we merely reversed the approach for the positive loss, splitting the loss into overlapping and disjoint parts and flipping the sense.</p>
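        <p>The paper does not give the exact functional forms, so the sketch below is one plausible construction satisfying the stated requirements: each per-dimension loss is continuous, with matching one-sided slopes, at the point where the boxes meet. The constants and shapes are our assumptions:</p>
        <preformat>
import numpy as np

def signed_gap(a_lo, a_hi, b_lo, b_hi):
    # Per-dimension signed separation of two intervals: positive values
    # mean a gap between disjoint boxes, negative values mean overlap depth.
    return np.maximum(a_lo - b_hi, b_lo - a_hi)

def positive_loss(gap):
    # Disjoint (gap > 0): loss grows with distance. Overlapping (gap <= 0):
    # loss shrinks as overlap deepens. Both branches equal 1 at gap == 0
    # and have slope 1 there, so there is no hinge when the boxes meet.
    g = np.asarray(gap, dtype=float)
    # Clamping keeps the unused branch finite for np.where.
    return np.where(g > 0, 1.0 + g, 1.0 / (1.0 - np.minimum(g, 0.0)))

def negative_loss(gap):
    # The same construction with the sense flipped: penalize overlap and
    # decay smoothly, without a hinge, once the boxes have come apart.
    g = np.asarray(gap, dtype=float)
    return np.where(g > 0, 1.0 / (1.0 + np.maximum(g, 0.0)), 1.0 - g)
        </preformat>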
        <p>We do not present many experiments with the initial box models, as they were unable to learn anything with more than a 1:1 neg:pos ratio. For clarity we refer to the two versions as:</p>
        <p>Box orig: the original box code without the hinge fix</p>
        <p>Box: the box model with the hinge fix</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Inter-taxonomy Relations</title>
      <p>
        Our final set of experiments begins to touch on our central research questions (q.v. RQ1 and RQ2, Sec. 2). In our experiments, we used several slices of data from MeSH [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
PubMed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Instances are summarized from the first ten shards of the PubMed 2018 database, consisting of 300k articles. We use the subject descriptors, which are associated with one or more coded entries in the strict MeSH taxonomy. Descriptors give MeSH a DAG structure, whereas the coded taxonomy is a tree.
      </p>
      <p>Our datasets are grouped in three ways:
single taxonomy: we use one branch of the MeSH taxonomy, C: Diseases, containing 4800 categories. In these experiments, like the previous literature, we attempt to reconstruct the taxonomy.</p>
      <p>dual taxonomy: we use two branches of the MeSH taxonomy, A: Anatomy and C: Diseases, for a total of 6610 categories. In these experiments, we attempt to learn the cross-taxonomy relations between diseases and their anatomical locations, while at the same time reconstructing each taxonomy. The intention is for the two taxonomies to overlay each other to represent the location relation. To measure the additional signal added by the taxonomy, we compare the direct and closure versions of the dual taxonomy datasets.</p>
      <p>multi taxonomy: we use all branches of the MeSH taxonomy, a total of 28954
categories, summarized from 300k articles. In this case we simply try to reconstruct all
edges, within and across taxonomies. We do not expect any technique to perform well
here.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>In our first set of experiments, we attempt to reproduce the results of the original box embedding paper on a different taxonomy. We identified the following characteristics, described in Section 4, that we tested:</p>
      <p>taxo vs. summary: edges come straight from the taxonomy or are summarized from instances
neg1 vs. neg10: the neg:pos ratio is 1:1 or 10:1
direct vs. closure: edges include the transitive closure of the taxonomy or only direct edges
rand vs. center: boxes are initialized at random positions or at the center
naive vs. informed: negatives are generated randomly or informed by the taxonomy
nloss1 vs. nloss.1: the negative loss is weighted equally with the positive loss or down-weighted to 0.1</p>
      <p>The results are shown in Table 1. Experiment 1 replicates the original box embeddings paper using the code made available by that team. With 1:1 neg:pos sampling, as discussed above, the model shows non-zero performance on the test set, but on inspection this was due to over-generalization: many boxes overlapped many others on average. In experiment 2, we added 10x as many negatives and the original code converges to an accuracy of 0.9 by making no boxes overlap. This is the baseline for the subsequent experiments.</p>
      <p>Experiment 3 shows the impact of informed sampling of negatives. This does not affect the POE or Bayes approaches, but has a large effect on boxes, dropping from .90 to .83. We illustrated the problem of naive negatives on boxes in Fig. 2.</p>
      <p>Experiment 4 (compared to 2) is specific to boxes and shows a small increase in F-measure when boxes are initialized in the center of the space. While the increase is not significant, it does show a marked increase in the speed of convergence of the models.</p>
      <p>Experiment 5 shows that down-weighting negative instances does not have a positive effect on F-measure. In our early experiments this did have a positive effect, but it appears to be overwhelmed by the other improvements. We intend to further vet this particular result, as it contradicts the earlier results on the basis of which we ran the larger inter-taxonomy experiments with nloss.1.</p>
      <p>In Experiment 6, we replaced the taxonomy edges with edges resulting from the instance summarization process described in Sec. 4.1. This is the best performing overall set of characteristics and seems to work well in combination with low negative weighting. Most likely this is due to the strategic repetition of edges, as opposed to simply repeating edges by cycling through a training set multiple times: instance summarization identifies the important edges to re-sample. The technique helps POE and Bayes as well, giving POE the top single-taxonomy score. In the multi-taxonomy experiments below, instance summarization gives us the inter-taxonomy links as well, making this a powerful technique for taxonomy learning. It is worth noting that the Bayes Net approach is incredibly simple, and that with instance summarization it provides quite good performance.</p>
      <p>Experiments 7 and 8 show the negative impact of dropping the transitive closure edges. As discussed in Sec. 4.2, it seems that simply having the direct links, e.g. &lt;c, b&gt; and &lt;b, a&gt; but not &lt;c, a&gt;, should make it possible to learn that transitive edge; however, our results show this simply does not happen in any approach. For boxes, the transitive edges provide a powerful amplification of the containment constraint, ensuring that when a higher-level box moves, all its children move with it, as opposed to cascading that movement across the next several steps.</p>
      <p>From manual inspection of the single taxonomy experiments, the results seem to hold up, and we do not find a lot of spurious edges. A deeper evaluation would be productive; however, we will save that expense for future work on inter-taxonomy results.</p>
      <p>Table 2 shows the results of our inter-taxonomy experiments. All were run with the winning set of summary, neg10, closure, rand, and informed, and we included nloss.1 as that showed better performance in previous tests. These are bigger experiments that take longer to validate, and the final paper will include a comparison of that feature. For each set we compared direct and closure edges, continuing to confirm that the closure edges are important for maintaining the taxonomy.</p>
      <p>As mentioned above, we have not analyzed these results very deeply; the primary contribution of the paper is the analysis required to get box embeddings working on a single taxonomy. Like the original box embedding results, the inter-taxonomy numbers look outrageously good, but manual inspection does not tell the same story. It is likely we are still not covering the negative space adequately.</p>
      <p>For example, while we do see edges associating diseases to parts of the body that make surprising sense, such as &lt;Heart Septal Defects - Ventricular, Sarcoplasmic Reticulum&gt;, which is not a terrible association, and a good generalization of it to &lt;Heart Valve Diseases, Sarcoplasmic Reticulum&gt;, we also see too many edges like &lt;Heart Failure, Temporal Bone&gt;. More work is needed.</p>
      <p>
        Finally, we explored depth-based scoring [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], on the intuition that higher-level categories have more incoming edges (edges from their children) and are easier to classify as they are larger. However, we saw no difference when evaluating edges by their depth in the taxonomy, except in the cases where the closure edges were not in training.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We presented practical explorations of three types of embeddings for taxonomies: Bayes Nets, Partial Order Embeddings, and Box Lattice Embeddings. We explored many characteristics of the learning problem of reconstructing a single taxonomy, with a particular focus on the problem of generating appropriate negative examples, using signal from large instance-based taxonomies such as PubMed and MeSH. We discovered a particular problem in learning box embeddings for taxonomies, which we call the box crossing problem, and developed strategies to overcome it. We showed impressive F-measure scores for the problem of reconstructing the MeSH Diseases taxonomy from edges, as well as for finding inter-taxonomy relations from associations in PubMed article subject headings, though these latter results have yet to be analyzed thoroughly. In most experiments, after our improvements to the box models, the box models outperformed the simpler Bayes Net approach as well as Order Embeddings.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>MeSH medical subject headings (</article-title>
          <year>2018</year>
          ), https://www.nlm.nih.gov/mesh/meshhome.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. PubMed medical abstracts (
          <year>2018</year>
          ), https://www.ncbi.nlm.nih.gov/pubmed/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Athiwaratkun</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          :
          <article-title>On modeling hierarchical data via probabilistic order embeddings</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Globerson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chechik</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tishby</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Euclidean embedding of co-occurrence data</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>8</volume>
          ,
          <fpage>2265</fpage>
          –
          <lpage>2295</lpage>
          (Dec
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Guarino</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>An Overview of OntoClean</article-title>
          , pp.
          <fpage>201</fpage>
          –
          <lpage>220</lpage>
          (05
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hoxha</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automated learning of domain taxonomies from text using background knowledge</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>63</volume>
          ,
          <fpage>295</fpage>
          –
          <lpage>306</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Crafting papers on machine learning</article-title>
          . In: Langley,
          <string-name>
            <surname>P.</surname>
          </string-name>
          (ed.)
          <source>Proceedings of the 17th International Conference on Machine Learning (ICML</source>
          <year>2000</year>
          ). pp.
          <fpage>1207</fpage>
          –
          <lpage>1216</lpage>
          . Morgan Kaufmann, Stanford, CA (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilnis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boratko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Smoothing the geometry of probabilistic box embeddings</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Automatic taxonomy construction from keywords</article-title>
          .
          <source>In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>1433</fpage>
          –
          <lpage>1441</lpage>
          . KDD '12
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siddiqi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A latent space approach to dynamic embedding of co-occurrence data</article-title>
          .
          <source>In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AI-STATS)</source>
          (
          <year>January 2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vendrov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Order-embeddings of images and language</article-title>
          .
          <source>ICLR</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Vilnis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murty</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Probabilistic embedding of knowledge graphs with box lattice measures</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callan</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>A metric-based framework for automatic taxonomy induction</article-title>
          .
          <source>In: ACL/IJCNLP</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>