<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can a Convolutional Neural Network Support Auditing of NCI Thesaurus Neoplasm Concepts?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao Liu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ling Zheng</string-name>
          <email>zdzhengling@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehoshua Perl</string-name>
          <email>perl@njit.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Geller</string-name>
          <email>geller@njit.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gai Elhanan</string-name>
          <email>gelhanan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Applied Innovation Center Desert Research Institute Reno</institution>
          ,
          <addr-line>NV</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CSSE Department Monmouth University</institution>
          ,
          <addr-line>West Long Branch, NJ</addr-line>
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science New Jersey Institute of Technology</institution>
          ,
          <addr-line>Newark, NJ</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a Machine Learning methodology using a Convolutional Neural Network (CNN) to perform a specific case of ontology quality assurance, namely the discovery of missing IS-A relationships for Neoplasm concepts in the National Cancer Institute Thesaurus (NCIt). The training step of checking all “uncles” of a concept is computationally intensive. To shorten the time and improve the accuracy, we define a restricted methodology that checks only uncles that are similar to each current concept. The restricted technique yields higher classification recall (compared to the unrestricted one) when tested against known errors found by domain experts who manually reviewed Neoplasm concepts in a prior study. The results are encouraging and provide impetus for further improvements to our technique.</p>
      </abstract>
      <kwd-group>
        <kwd>CNN</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Neoplasm</kwd>
        <kwd>National Cancer Institute Thesaurus</kwd>
        <kwd>Quality Assurance</kwd>
        <kwd>Abstraction Network</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Ontologies play a major role in enabling precise
communications and in support of healthcare applications, e.g.
EHR systems. Many ontologies are large and complex. For
example, the National Cancer Institute Thesaurus (NCIt) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
serving cancer researchers inside and outside NIH, contains
135,243 concepts interrelated by 480,141 links in the April
2018 release. Due to their size and complexity, errors in
ontologies are unavoidable. Users of ontologies such as the
SNOMED ontology are concerned about errors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Thus,
quality assurance (QA) is essential in the lifecycle of
ontologies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For a summary of auditing (QA) techniques for
ontologies and in particular for SNOMED and NCIt, see [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>However, QA resources for ontologies are typically scarce,
while QA tasks are labor-intensive and time-consuming.
Therefore, automated or semi-automated techniques that can
either help in auditing an ontology or narrow down the places
to look for errors are highly desired. Missing
parent/child errors are particularly interesting to ontology
curators, as the IS-A links are the backbone structure of an
ontology, facilitating the inheritance of lateral relationships
(called roles in NCIt).</p>
      <p>
        Machine Learning (ML) has been proven successful in
many fields, e.g., knowledge mining. ML was previously used
in knowledge enrichment for ontologies [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6-8</xref>
        ]. However, can
ML be used for quality assurance of ontologies in spite of the
major difference between knowledge enrichment and quality
assurance? Knowledge enrichment mines external sources for
new knowledge that does not exist in the ontology. However,
QA discovers incorrect or missing knowledge. Consider
missing IS-A relationships from an existing concept A to a
concept B. If the concept B is already in the ontology, then
adding an IS-A link between A and B is considered correcting
an omission error. If concept B is not in the ontology and is
added together with adding an IS-A from concept A to it, then
this is knowledge enrichment. We note that curators of some
ontologies, e.g., NCIt, are less interested in knowledge
enrichment, unless required by users, than in quality assurance.
In this paper, we attempt to use ML to address the task of
detecting missing IS-A links between two existing concepts.
This task is more challenging than knowledge enrichment,
since it requires a judgement that concept A is a specification
of concept B. For knowledge enrichment we only recognize
that a concept is missing in the ontology and then insert it into
the proper place.
      </p>
      <p>
        In an unpublished study, we trained a Convolutional Neural
Network (CNN) deep learning model to insert new concepts
into the SNOMED CT ontology, i.e., an enrichment problem.
In the present work, we train a CNN deep learning model to
find missing parent/child errors in the Neoplasm subhierarchy
of NCIt. The vector representations of concepts are obtained
from an unsupervised neural network language model. The
model is evaluated by its classification recall on an unseen
dataset. We check the model’s classification recall by testing
against 18 missing parent/child errors found by domain experts
in a prior study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Due to the size of the Neoplasm
subhierarchy, the application of the training methodology is
computationally intensive and time-consuming.
      </p>
      <p>
        In previous research we have introduced Abstraction
Networks (AbNs) [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. An AbN provides a compact
summarization and visual simplification of an ontology. The
SABOC (Structural Analysis of Biomedical Ontologies Center)
team at NJIT has demonstrated that Abstraction Networks are
an effective tool to support quality assurance of ontologies [
        <xref ref-type="bibr" rid="ref12 ref9">9,
12</xref>
        ]. An area taxonomy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a type of Abstraction Network, is
composed of meta-concepts called areas, connected by
child-of links. An area (see the Background section) represents a group
of concepts with the same structure.
      </p>
      <p>To accelerate the processing and improve recall, we modify
the CNN methodology to limit its consideration, for each
concept, to the similar concepts of its area (in our formal sense
of area). The modified, restricted methodology achieves 0.81
recall on the unseen testing data. It performs 50% better than
the unrestricted methodology on the 18 known errors in terms
of recall. The results for detecting missing IS-A links are not
yet strong enough. However, the performance in recognition of
known errors is encouraging and supports further improvement
of our methodology in respect to CNNs and the use of AbNs.</p>
      <p>II. BACKGROUND</p>
      <p>A. Doc2vec</p>
      <p>
        Numeric representation of variable-length texts, ranging
from sentences to documents is a challenging task. Doc2vec, or
Paragraph Vectors [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], an extension of word2vec (word
embedding) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], maps variable-length texts to fixed-length
vectors. It is an unsupervised framework that learns continuous
distributed vector representations from unlabeled text data of a
paragraph/document, while preserving the inter-relationships
of the text in the numeric format. In such vector
representations, similar pieces of text are close to each other in
Euclidean or cosine distance in lower dimensional vector
spaces. Doc2vec inherits the semantics of the words in the
context, and takes the word order into consideration when
constructing the representation. The latter advantage is
important to our problem, as word order in our setup carries the
concepts’ topological/hierarchical order in the ontology. This is
useful information for feature learning. To the best of our
knowledge, this is the first study to derive vector
representations for biomedical ontology classes via Doc2vec.
      </p>
      <p>B. CNN</p>
      <p>
        Convolutional Neural Networks (CNNs), initially invented
for image recognition, have been widely used for various
applications, including vision, speech recognition, and
language translation. CNN models have also been successfully
applied to solve various Natural Language Processing (NLP)
problems such as search query retrieval [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], semantic parsing
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and sentence modeling [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. CNN utilizes convolving
filters to automatically learn and extract local features from
various layers, regardless of the input size. This makes CNN a
very powerful tool for classification or prediction tasks, e.g.
text classification [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and relation extraction [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], even if the
data or features have not been manually labeled for learning
purposes. To the best of our knowledge, this is the first effort to
adopt the CNN model for ontology quality assurance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>C. Neoplasms of NCIt</title>
      <p>NCIt is published monthly by the National Cancer Institute
(NCI) in OWL and flat file formats. It is a cancer reference
terminology that is widely used. It covers cancer-related
terminology in various fields, e.g., clinical care and
translational and basic research. Concepts are linked to other
concepts (parent concepts) in the same hierarchy by IS-A
relationships. A concept may have multiple parent concepts.
The semantics of concepts are defined by lateral relationships
(called “roles”), e.g. Disease Has Associated Anatomic Site.</p>
      <p>Due to the NCI’s cancer focus, the Neoplasm subhierarchy
of NCIt is composed of 9,955 concepts. It is a core component
of the Disease, Disorder or Finding hierarchy, the largest with
35,081 concepts, and is modeled in more detail than the
non-neoplasm concepts in that hierarchy.</p>
    </sec>
    <sec id="sec-3">
      <title>D. Areas and Area Taxonomy</title>
      <p>
        An area taxonomy [
        <xref ref-type="bibr" rid="ref20 ref21 ref3">3, 20, 21</xref>
        ] is a compact Abstraction
Network summarizing the structure (roles) of an ontology. It is
composed of areas and child-of relationships connecting areas.
An area is a group of concepts having the same set of role
types. A concept can be in only one area, i.e., areas are disjoint.
A concept that has no parent in its area, is called a root of the
area. An area may have multiple roots. If a root concept of area
B has a parent concept in area A, then there is a child-of
relationship from area B to area A.
      </p>
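      <p>The derivation of areas, area roots, and child-of links described above can be sketched in a few lines. The following is a minimal illustration with invented concepts and role types, not actual NCIt content:</p>

```python
from collections import defaultdict

# Toy fragment (invented names): each concept maps to its set of
# role types and to its IS-A parents.
roles = {
    "Benign Neoplasm": {"Disease Excludes Abnormal Cell", "Disease Has Abnormal Cell"},
    "Tumorlet": {"Disease Excludes Abnormal Cell", "Disease Has Abnormal Cell"},
    "Neoplasm by Site": set(),
    "Skin Neoplasm": {"Disease Has Primary Anatomic Site"},
}
parents = {
    "Benign Neoplasm": ["Neoplasm by Site"],
    "Tumorlet": ["Benign Neoplasm"],
    "Skin Neoplasm": ["Neoplasm by Site"],
    "Neoplasm by Site": [],
}

# An area groups all concepts with exactly the same role-type set,
# so areas are disjoint by construction.
areas = defaultdict(set)
for concept, role_set in roles.items():
    areas[frozenset(role_set)].add(concept)

def area_of(concept):
    return frozenset(roles[concept])

# A root has no parent inside its own area; each parent of a root
# induces a child-of link from the root's area to the parent's area.
child_of = set()
for concept, concept_parents in parents.items():
    if all(area_of(p) != area_of(concept) for p in concept_parents):
        for p in concept_parents:
            child_of.add((area_of(concept), area_of(p)))
```

On this fragment, Tumorlet is not a root (its parent Benign Neoplasm shares its area), while Benign Neoplasm and Skin Neoplasm are roots whose parent areas receive child-of links.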
      <p>Fig. 1(a) is an excerpt of 12 Neoplasm concepts from NCIt.
Concepts are represented as rounded-corner boxes and the
arrows denote IS-A relationships. Concepts with the same set
of role types are enclosed within a colored dashed rectangle.
For example, both Benign Neoplasm and Tumorlet have the
two role types Disease Excludes Abnormal Cell and Disease
Has Abnormal Cell, so they reside in the left green dashed
rectangle. Fig. 1(b) shows the area taxonomy for Fig. 1(a).
Each colored, dashed rectangle in Fig. 1(a) becomes an area
with the same color in Fig. 1(b). An area is labeled by its role
type set and the number of concepts it summarizes. Areas with
the same number of role types have the same color. For
example, there are two areas colored in green, since both have
two role types. Skin Neoplasm is the root concept of the red
area and its parent Neoplasm by Site is in the grey area. Hence,
there is a child-of relationship from the red area to the grey
area, denoted as a bold arrow in Fig. 1(b).</p>
      <p>III. METHODS</p>
      <p>We describe two methodologies: the unrestricted
methodology and the refined restricted methodology. The ML
training problem is viewed as a binary classification task: given
a concept pair, we classify it into a positive category (there is
an IS-A link) or a negative category (there is no IS-A link). We
train a Convolutional Neural Network (CNN) model to solve
this classification problem.</p>
      <p>Unrestricted Methodology: To train a CNN model to yield
high precision, we must carefully choose the training data for
both categories. The source of training samples for the positive
category is in the IS-A hierarchy of the ontology. The challenge
is in the choice for the negative samples as we cannot use the
full set of unconnected pairs. For the Neoplasm subhierarchy
of NCIt with 9,955 concepts, the size of this set is 99,075,537
(= 9,955 × 9,954 − 16,533); we subtract the 16,533 existing IS-A
link pairs to obtain the potential missing parent/child errors. Training
pairs should not be chosen randomly. We need to choose pairs
where there is a reasonable likelihood for an IS-A link, not
pairs that obviously have no taxonomic relation. For example,
a body part concept and a drug concept are not related by an
IS-A relationship and would be a bad training pair. Negative
training samples should be near misses, close to the
“hyperplane of separation” in SVM terms.</p>
      <p>To address these two problems of magnitude and recall in
the ML training, we limit the negative samples for the
unrestricted methodology to “uncle – nephew” pairs, which
are near misses. That is, connections between a concept and a
sibling of its parent. By only choosing “uncle – nephew”
pairs, we guide the model to learn the underlying features used
to distinguish IS-A-connected concept pairs from similarly
positioned concept pairs that have a high potential to have
secondary IS-A links but are not connected by IS-A links. There
are in total 37,147 such “uncle – nephew” pairs in the Neoplasm
subhierarchy of NCIt.</p>
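      <p>The enumeration of “uncle – nephew” pairs can be sketched as follows; the mini-hierarchy below uses invented labels purely for illustration:</p>

```python
from collections import defaultdict

# Hypothetical mini-hierarchy: concept -> list of IS-A parents.
parents = {
    "Root": [],
    "B": ["Root"],
    "C": ["Root"],
    "D": ["B"],
    "E": ["B"],
}

def uncle_nephew_pairs(parents):
    """Negative candidates: (nephew, uncle) where the uncle is a sibling
    of one of the nephew's parents and not itself a parent of the nephew."""
    children = defaultdict(list)
    for concept, ps in parents.items():
        for p in ps:
            children[p].append(concept)
    pairs = set()
    for nephew, ps in parents.items():
        for parent in ps:
            for grandparent in parents[parent]:
                for uncle in children[grandparent]:
                    if uncle != parent and uncle not in ps:
                        pairs.add((nephew, uncle))
    return pairs

negatives = uncle_nephew_pairs(parents)
```

Here D and E (children of B) each have C, the sibling of their parent, as an uncle.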
      <p>The unrestricted model must consider all uncles of a
concept (that are not connected to that concept by an IS-A
link). Due to the size of the Neoplasm subhierarchy, the
number of uncles is large. For example, there are 24, 15 and 15
concepts with 10, 11 and 12 children, respectively and the
maximum number of children of a concept is 60. Each
grandchild of a concept with 15 children has 14 uncles. If the
average number of children of each sibling is 5, then there are
75 grandchildren with 14 uncles each. Applying the proposed
technique to select negative samples from the whole hierarchy
results in a large number of “uncle - nephew” pairs. This is
computationally expensive for training the model and leads to
low accuracy by distracting the model from learning subtle
features that distinguish between IS-A and “uncle - nephew”
links within similar groups of concepts. In other words, not all
“uncle – nephew” pairs are equally useful for training
purposes. We need pairs that are similar to existing
IS-A-connected pairs, yet are not themselves IS-A-connected.</p>
      <p>Additionally, many machine learning models work best
with balanced training sets. The number of positive and
negative samples should be approximately equal. The number
of positive training instances in our problem domain is given
and fixed. The number of potential negative training samples is
much larger. Thus, a principled selection method is needed to
bring the number of negative training samples closer to the
number of positive training samples.</p>
      <p>
        The Restricted Methodology: To cope with these problems,
we introduced the restricted approach. The restricted approach
limits the number of negative samples by only choosing a
subset of a concept’s uncles, which are structurally similar to
the investigated concept. To provide such a “closely related”
subset we partition the Neoplasm subhierarchy into sets of
concepts of similar structure. In doing this we are availing
ourselves of a powerful mechanism that derives an area
taxonomy from an ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An area taxonomy is an
Abstraction Network that clusters together groups of concepts
according to their roles. All concepts in one area have exactly
the same roles. Concepts in different areas differ in at least one
role from each other.
      </p>
      <p>
        In the area taxonomy each cluster of concepts constitutes
an area. Due to the high average number of roles per concept in
the Neoplasm subhierarchy, the number of areas is large, and
the average size of an area is small. Note that AbNs are
automatically derived from the ontology so the limitation
process is automatic [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. By selecting pairs only within areas,
we narrow down 37,147 negative samples to 10,574 more
closely related “uncle – nephew” pairs, of the same magnitude
as the 16,533 positive samples.
      </p>
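      <p>The area-based restriction then amounts to a simple filter over the unrestricted candidate pairs: keep only pairs whose two concepts have identical role-type sets, i.e., fall in the same area. The pair and area labels below are illustrative, not actual NCIt data:</p>

```python
# Unrestricted uncle-nephew candidates (invented labels).
pairs = {("D", "C"), ("E", "C"), ("E", "F")}

# Area label per concept; concepts in one area share the same roles.
area = {"D": "area-1", "E": "area-1", "C": "area-2", "F": "area-1"}

# Restricted negatives: nephew and uncle must belong to the same area.
restricted = {(nephew, uncle) for (nephew, uncle) in pairs
              if area[nephew] == area[uncle]}
```

Only (E, F) survives the filter, since E and F share an area while C lies in a different one.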
      <p>Since all uncles in an area will have exactly the same roles
as the current concept, they will be similar to it. Thus the recall
of the CNN training model is expected to be higher than for the
unrestricted model trained with all uncles of a concept, many
of which are not similar to the current concept.</p>
      <p>The following description applies to both methodologies.
Overall, our methodology comprises the following four steps:</p>
    </sec>
    <sec id="sec-4">
      <title>A. Document Embedding</title>
      <p>The CNN model requires its input in the format of
fixed-length feature vectors. Thus, before sending concept pairs for
training, we need to transform each concept into its
corresponding vector representation with fixed length.</p>
      <p>
        The Paragraph Vector (Doc2vec) framework introduced by
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] generates fixed-length feature vectors from
variable-length pieces of text, as it was designed for text corpus
processing. Thus, the problem to overcome in applying
Doc2vec to ontologies is to find the vector representation of
single concepts. However, an IS-A link is defined by a pair of
concepts, thus a joint representation of pairs is needed that is
also compatible with the input required by CNN.
      </p>
      <p>To derive the vector representation of a single concept, we
need text “descriptions” of the concept. We recast a concept
into a document such that it preserves hierarchical and partial
semantic information of the concept:</p>
      <p>The document of a concept contains the concept ID, the
name(s) of its ancestor(s), the name(s) of the concept itself, the
name(s) of its child(ren) and the names of its grandchild(ren),
if they exist. In this way, the document implicitly maintains the
hierarchical relationships of the ontology.</p>
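      <p>This linearization can be sketched as follows; the “→” separator and the ordering of the parts follow the Malignant Nipple Neoplasm example given below, and the helper name is our own illustration:</p>

```python
def concept_document(concept_id, ancestors, name, children, grandchildren):
    """Linearize a concept into a text 'document': its ID, ancestor names,
    its own name, then child and grandchild names, if they exist."""
    parts = list(ancestors) + [name] + list(children) + list(grandchildren)
    return concept_id + ": " + " → ".join(parts)

# Reproduces the Malignant Nipple Neoplasm example from the paper.
doc = concept_document(
    "c5213",
    ["Neoplasm", "Neoplasm by site", "Breast neoplasm",
     "Malignant Breast Neoplasm", "Nipple Neoplasm"],
    "Malignant Nipple Neoplasm",
    ["Female Malignant Nipple Neoplasm", "Male Malignant Nipple Neoplasm"],
    ["Nipple Carcinoma"],
)
```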
      <p>For example, the document representation of the concept
Malignant Nipple Neoplasm (Fig. 2) is “c5213: Neoplasm →
Neoplasm by site → Breast neoplasm → Malignant Breast
Neoplasm → Nipple Neoplasm → Malignant Nipple Neoplasm
→ Female Malignant Nipple Neoplasm → Male Malignant
Nipple Neoplasm → Nipple Carcinoma.” Thus, the generated
distributed vector representation maintains the most important
hierarchical relationship semantics.</p>
      <p>
        The document vectors are derived using the Distributed
Memory version of Paragraph Vector (PV-DM) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] via the
Gensim [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] Doc2Vec implementation. Each vector has a
dimensionality of 128. Pairs of concepts connected by an IS-A
link are represented by the concatenation of the document
vectors of the two concepts, with the child concept first.
      </p>
      <p>The positive samples are directly extracted from the
hierarchy as all the concept pairs connected via an IS-A link.</p>
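      <p>A sketch of the pair representation follows; the 128-dimensional Doc2vec vectors are replaced here by tiny hand-made vectors purely for illustration:</p>

```python
# Toy document vectors (4-dimensional stand-ins for the 128-dim
# Doc2vec vectors used in the study).
vectors = {
    "Nipple Neoplasm": [0.1, 0.2, 0.3, 0.4],
    "Malignant Nipple Neoplasm": [0.5, 0.6, 0.7, 0.8],
}

def pair_vector(child, parent, vectors):
    # Concatenate the two document vectors, child concept first.
    return vectors[child] + vectors[parent]

x = pair_vector("Malignant Nipple Neoplasm", "Nipple Neoplasm", vectors)
```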
    </sec>
    <sec id="sec-5">
      <p>For example, (Malignant Nipple Neoplasm, Nipple Neoplasm)
is a positive sample, because Malignant Nipple Neoplasm is a
Nipple Neoplasm (Fig. 2). As mentioned above, there are
16,533 positive samples in the Neoplasm subhierarchy. We
randomly picked 2,000 positive samples for testing. The
remaining 14,533 (=16,533-2,000) samples are used in a ratio
of 80% for training and 20% validation. The positive samples
are treated equally for both models.</p>
      <p>For the restricted model, there are 10,574 potential pairs
where both concepts of each pair are from the same area. We
randomly picked 2,000 for testing. As noted above, there are
37,147 “uncle – nephew” negative sample pairs in the
hierarchy. However, for the unrestricted model, we use the
same 2,000 negative test pairs used for the restricted model. This is
done to enable a performance comparison between the two
models. Similar to the way we handle the positive samples, the
remaining 8,574 (=10,574-2,000) samples for the restricted
model and the remaining 35,147 (=37,147-2,000) for the
unrestricted model are divided in an 80%/20% ratio for
training and validation, respectively.</p>
      <p>In addition, we down-sampled the negative samples for the
unrestricted model and up-sampled for the restricted model to
14,533 samples, in order to balance the number of samples for
both categories, as customary in Machine Learning.</p>
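      <p>The balancing step can be sketched with a small down-/up-sampling helper; the function name and the use of Python’s random module are our own illustration, not the implementation used in the study:</p>

```python
import random

def balance(samples, target, rng):
    """Down-sample without replacement when there are too many samples,
    up-sample with replacement when there are too few."""
    if len(samples) >= target:
        return rng.sample(samples, target)
    return list(samples) + rng.choices(samples, k=target - len(samples))

rng = random.Random(0)
down = balance(list(range(35147)), 14533, rng)  # unrestricted negatives
up = balance(list(range(8574)), 14533, rng)     # restricted negatives
```

Note that random.sample draws without replacement while random.choices draws with replacement, matching down- and up-sampling, respectively.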
      <p>We trained a CNN model with 4 convolution layers on top
of vectors derived from the Neoplasm subhierarchy of NCIt via
Doc2vec. The CNN model architecture is shown in Fig. 3. The
input to the CNN model is two 128x1 dimension vectors and
the output is a 2x1 dimension vector.</p>
      <p>Some structural details of this model are summarized as
follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>There are four convolution layers, each followed by a
max pooling layer. This choice was informed by previous
research. The first convolution layer has 18 filters with kernel
size 1. The number of filters doubles with each additional
convolution layer. We use stride 1, meaning we slide the filters
one position at a time over the input. The pooling size is 2 for
all max pooling layers.</p>
        </list-item>
        <list-item>
          <p>The Adam [<xref ref-type="bibr" rid="ref24">24</xref>] optimization algorithm for stochastic
gradient descent is used for training, with the learning rate set
to 0.001.</p>
        </list-item>
        <list-item>
          <p>The ReLU (Rectified Linear Unit) activation function is
used in every convolution layer, because it has proven
successful in recent research projects. It corresponds to a
“rectifier function” from electrical engineering, blocking the
negative half-wave and letting the positive half-wave pass
through one-to-one.</p>
        </list-item>
      </list>
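      <p>As a sanity check of the feature-map sizes implied by these choices, assuming (our assumption, not stated explicitly in the paper) that the two 128-dimensional vectors are concatenated into a single length-256 input:</p>

```python
# Trace the feature-map shape through 4 convolution layers (kernel
# size 1, stride 1 leaves the length unchanged), each followed by
# max pooling of size 2, which halves the length.
length, filters = 256, 18
shapes = []
for _ in range(4):
    shapes.append((length, filters))  # shape after the convolution layer
    length //= 2                      # max pooling halves the length
    filters *= 2                      # the next layer doubles the filters
```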
    </sec>
    <sec id="sec-6">
      <title>D. Test against Reviewed Data</title>
      <p>Traditionally, machine learning models are tested with
k-fold cross-validation. Thus, all known data is partitioned into k
folds, the model is trained with k-1 folds and tested with the
remaining fold. This process is repeated k times, with resulting
precision, recall and F1 values averaged. We have augmented
this testing by human expert quality assurance results.</p>
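      <p>A minimal sketch of k-fold splitting (a round-robin partition, for illustration only):</p>

```python
def k_fold_splits(indices, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each fold serves as the held-out test set exactly once."""
    folds = [indices[i::k] for i in range(k)]  # round-robin partition
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(list(range(10)), 5))
```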
      <p>
        In a previous study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], domain experts reviewed 190
concepts from the Neoplasm subhierarchy and reported 18
missing parent errors. This data was used as ground truth in
this study to check the sensitivity/recall of our model’s
performance.
      </p>
      <p>IV. RESULTS</p>
      <p>We report our CNN model’s performance in the following
three aspects:</p>
    </sec>
    <sec id="sec-7">
      <title>A. Testing recall and AUC</title>
      <p>The testing recall is 0.75 and 0.81 for the unrestricted and
restricted models, respectively. Fig. 4 (a) and (b) show the
Receiver Operating Characteristic (ROC) curves of the testing
performance for the unrestricted and restricted models,
respectively. The AUC (area under the curve) scores, as the
measure of test accuracy, are 0.84 and 0.90, respectively.</p>
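      <p>For reference, recall is computed from true-positive and false-negative counts; a minimal sketch:</p>

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# E.g., detecting 10 of the 18 expert-confirmed errors corresponds to
# a recall of 10/18 ≈ 0.56 on that evaluation set.
expert_recall = recall([1] * 18, [1] * 10 + [0] * 8)
```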
    </sec>
    <sec id="sec-8">
      <title>B. Confirmed errors found by domain experts</title>
      <p>
        The restricted model detected 10 out of the 18 errors that
domain experts found [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], while the unrestricted model
detected only five errors, all contained in the above 10 errors.
Table I shows two missing parent examples confirmed by both
models, and two examples confirmed only by the restricted
model.
      </p>
    </sec>
    <sec id="sec-9">
      <title>C. Training time efficiency</title>
      <p>Each model is trained with 2000 epochs, with batch size =
2000. We recorded the duration of the training. With the same
computer hardware configuration, training the unrestricted and
restricted models took 1116 and 1110 seconds, respectively.</p>
      <p>V. DISCUSSION</p>
      <p>To the best of our knowledge, this is the first published
attempt to use Machine Learning (ML) for QA of ontologies.
Such a technique can prepare a subset of concept pairs as
candidates for missing IS-A link omission errors, optimizing
the use of scarce QA resources.</p>
      <p>In this paper, we discussed two ML approaches. The
unrestricted model utilizes all uncle concepts of the processed
concept. The restricted model further takes advantage of the
area taxonomy of the ontology to utilize only uncles in the area
of the processed concept. Both ideas save processing time
while improving the accuracy, by training on unconnected pairs of
concepts similar to the original IS-A links of the processed
concept. The uncles are similarly positioned as siblings of the
parent of the processed concept. The uncles within the area
share the same roles as the processed concept.</p>
      <p>
        Table II compares the performance of the two models. As
can be expected, the restricted model utilizing training with
similar concepts achieves higher performance. We evaluated
the results of the two ML models based on a list of errors that
domain experts found in a previous study [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Such an
evaluation is usually not available for ML studies. An
interesting observation is that the set of five errors confirmed
by the unrestricted model is a subset of the 10 errors found by
the restricted model. The reason for this may be that the two
models use the same negative test pairs where the uncles are
from the same area as the nephew. This choice was made in
order to be able to compare performance of the two models,
but it tends to unnecessarily limit the unrestricted model to finding
error pairs of similar concepts. In the future, we will perform
experiments where the unrestricted and restricted models will
use disjoint negative training data in an effort to optimize the
results of each model rather than to compare them.
      </p>
      <p>
        From Table II we can see that the restricted model performs
6% better than the unrestricted model in recall. The area under
the curve in Fig. 4(b) is 6% larger than in Fig. 4(a), reflecting a
better classification of the restricted model than the unrestricted
model. The document vectors used in this study were derived
using the Distributed Memory version of Paragraph Vector
(PV-DM) that works well for most tasks, as stated in the
original paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, it is also recommended in the
paper to combine Paragraph Vector with Distributed Bag of
Words (PV-DBOW) to obtain more consistent results. The more accurate
the vector representations of concepts are, the better recall
should be expected. This is left for future work.
      </p>
      <p>The recall obtained is not high enough for reliable QA for
missing IS-A links. For example, out of a random subset of 20
suggested errors, a domain expert (GE) confirmed only one
error pair, Hair Follicle Neoplasm missing the parent Dermal
Neoplasm, since hair follicles reside in the dermal layer of the
skin. In the NCIt, Dermal Neoplasm has 13 children,
representing a mix, based on cell origin as well as malignancy
status. Currently, Hair Follicle Neoplasm has Skin Appendage
Neoplasm as a parent. Dermal Neoplasm and Skin Appendage
Neoplasm are siblings. However, Hair Follicle Neoplasm has
only a very indirect relationship to the dermis, through the
Disease Has Primary Anatomic Site role with Hair Follicle as
the target value and the Anatomic Structure Is Physical Part Of
role of Hair Follicle with the target value of Dermis. Adding
Dermal Neoplasm as a parent, or even possibly replacing Skin
Appendage Neoplasm with it, might be better.</p>
    </sec>
    <sec id="sec-10">
      <p>A higher recall will imply fewer suggested errors with a
higher percentage of confirmed errors by domain experts. In
future research, we will further explore the properties of the
two models as well as properties of ML processing in an effort
to best fine-tune and utilize both models, to increase the recall.</p>
      <p>VI. LIMITATIONS</p>
      <p>We presented a supervised training technique; therefore,
“annotated corpora” are required to apply it beyond
finding missing IS-A relationships in the NCIt Neoplasm
subhierarchy. The results are also limited, because the models are
not trained on more general data for more general problems.</p>
      <p>VII. CONCLUSION</p>
      <p>We explored whether ML methods can be applied to the
task of ontological QA, in particular whether ML can help in
detecting missing IS-A links in an ontology. Two models were
presented. Our application of ML to the Neoplasm
subhierarchy of the NCIt demonstrated that the restricted model
performs better than the unrestricted one. However, the
performance of the restricted model is not yet sufficient for
QA. In future research, we will explore improvements in ML
processing and more accurate restrictions for concepts in the
training stage, to improve the performance of ontology QA.</p>
      <sec id="sec-12-1">
        <title>ACKNOWLEDGMENT</title>
        <p>Research reported in this publication was partially
supported by the National Cancer Institute of the National
Institutes of Health under Award Number R01CA190779. The
content is solely the responsibility of the authors and does not
necessarily represent the official views of the National
Institutes of Health.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>de Coronado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Haber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sioutos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tuttle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <article-title>"NCI Thesaurus: using science-based terminology to integrate cancer research results,"</article-title>
          <source>Stud Health Technol Inform</source>
          , vol.
          <volume>107</volume>
          , no.
          <issue>Pt 1</issue>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elhanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          ,
          <article-title>"A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality,"</article-title>
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>18</volume>
          <issue>Suppl 1</issue>
          , pp.
          <fpage>i36</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>"Auditing as Part of the Terminology Design Life Cycle,"</article-title>
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>676</fpage>
          -
          <lpage>690</lpage>
          , Nov-Dec
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Baorto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <article-title>"A review of auditing methods applied to the content of controlled biomedical terminologies,"</article-title>
          <source>J Biomed Inform</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>25</lpage>
          , Jun
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornet</surname>
          </string-name>
          ,
          <article-title>"Special issue on auditing of terminologies,"</article-title>
          <source>J Biomed Inform</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>407</fpage>
          -
          <lpage>11</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arguello Casteleiro</surname>
          </string-name>
          et al.,
          <article-title>"A case study on sepsis using PubMed and Deep Learning for ontology learning,"</article-title>
          <source>Informatics for Health: Connected Citizen-Led Wellness and Population Health</source>
          , vol.
          <volume>235</volume>
          , pp.
          <fpage>516</fpage>
          -
          <lpage>520</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Minarro-Giménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marin-Alonso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Samwald</surname>
          </string-name>
          ,
          <article-title>"Exploring the application of deep learning techniques on medical text corpora,"</article-title>
          <source>Stud Health Technol Inform</source>
          , vol.
          <volume>205</volume>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>588</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>İ.</given-names>
            <surname>Pembeci</surname>
          </string-name>
          ,
          <article-title>"Using Word Embeddings for Ontology Enrichment,"</article-title>
          <source>International Journal of Intelligent Systems and Applications in Engineering</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <article-title>"Auditing National Cancer Institute thesaurus neoplasm concepts in groups of high error concentration,"</article-title>
          <source>Appl Ontol</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>130</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ochs</surname>
          </string-name>
          ,
          <article-title>"Abstraction networks for terminologies: Supporting management of "big knowledge","</article-title>
          <source>Artif Intell Med</source>
          , vol.
          <volume>64</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          , May
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          ,
          <article-title>"Abstraction of complex concepts with a refined partial-area taxonomy of SNOMED,"</article-title>
          <source>J Biomed Inform</source>
          , vol.
          <volume>45</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>29</lpage>
          , Feb
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          et al.,
          <article-title>"Auditing Complex Concepts of SNOMED using a Refined Hierarchical Abstraction Network,"</article-title>
          <source>Journal of Biomedical Informatics</source>
          , vol.
          <volume>45</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>"Distributed Representations of Sentences and Documents,"</article-title>
          <source>CoRR</source>
          , vol.
          <volume>abs/1405.4053</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>"Efficient estimation of word representations in vector space,"</article-title>
          <source>arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Mesnil</surname>
          </string-name>
          ,
          <article-title>"Learning semantic representations using convolutional neural networks for web search,"</article-title>
          <source>in Proc. of the 23rd Int. Conf. World Wide Web</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>374</lpage>
          : ACM.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Meek</surname>
          </string-name>
          ,
          <article-title>"Semantic parsing for single-relation question answering,"</article-title>
          <source>in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          ,
          <year>2014</year>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>643</fpage>
          -
          <lpage>648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>"A convolutional neural network for modelling sentences,"</article-title>
          <source>arXiv preprint arXiv:1404.2188</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>"Convolutional neural networks for sentence classification,"</article-title>
          <source>arXiv preprint arXiv:1408.5882</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>"Neural relation extraction with selective attention over instances,"</article-title>
          <source>in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2016</year>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>2124</fpage>
          -
          <lpage>2133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Spackman</surname>
          </string-name>
          ,
          <article-title>"Structural methodologies for auditing SNOMED,"</article-title>
          <source>Journal of Biomedical Informatics</source>
          , vol.
          <volume>40</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>561</fpage>
          -
          <lpage>581</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>de Coronado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ochs</surname>
          </string-name>
          ,
          <article-title>"Relating Complexity and Error Rates of Ontology Concepts,"</article-title>
          <source>Methods of Information in Medicine</source>
          , vol.
          <volume>56</volume>
          , no.
          <issue>03</issue>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ochs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Perl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <article-title>"A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies,"</article-title>
          <source>JBI</source>
          , vol.
          <volume>62</volume>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rehurek</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <article-title>"Software framework for topic modelling with large corpora,"</article-title>
          <source>in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          ,
          <year>2010</year>
          : Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>"Adam: A method for stochastic optimization,"</article-title>
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>