<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Can a Convolutional Neural Network Support Auditing of NCI Thesaurus Neoplasm Concepts?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hao</forename><surname>Liu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">New Jersey Institute of Technology</orgName>
								<address>
									<settlement>Newark</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ling</forename><surname>Zheng</surname></persName>
							<email>zdzhengling@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">CSSE Department Monmouth University</orgName>
								<address>
									<settlement>West Long Branch</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yehoshua</forename><surname>Perl</surname></persName>
							<email>perl@njit.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">New Jersey Institute of Technology</orgName>
								<address>
									<settlement>Newark</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">James</forename><surname>Geller</surname></persName>
							<email>geller@njit.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">New Jersey Institute of Technology</orgName>
								<address>
									<settlement>Newark</settlement>
									<region>NJ</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gai</forename><surname>Elhanan</surname></persName>
							<email>gelhanan@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="department">Applied Innovation Center Desert Research Institute Reno</orgName>
								<address>
									<region>NV</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Can a Convolutional Neural Network Support Auditing of NCI Thesaurus Neoplasm Concepts?</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">85867A7AC54478185E4B0571F5169FE6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:06+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>CNN</term>
					<term>Deep Learning</term>
					<term>Neoplasm Hierarchy</term>
					<term>National Cancer Institute Thesaurus</term>
					<term>Quality Assurance</term>
					<term>Abstraction Network</term>
					<term>Machine Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a Machine Learning methodology using a Convolutional Neural Network to perform a specific case of an ontology Quality Assurance, namely discovery of missing IS-A relationships for Neoplasm concepts in the National Cancer Institute Thesaurus (NCIt). The training step checking all "uncles" of a concept is computationally intensive. To shorten the time and to improve the accuracy, we define a restricted methodology to check only uncles that are similar to each current concept. The restricted technique yields higher classification recall (compared to the unrestricted one) when testing against known errors found by domain experts who manually reviewed Neoplasm concepts in a prior study. The results are encouraging and provide impetus for further improvements to our technique.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Ontologies play a major role in enabling precise communication and in supporting healthcare applications, e.g., EHR systems. Many ontologies are large and complex. For example, the National Cancer Institute Thesaurus (NCIt) <ref type="bibr" target="#b0">[1]</ref>, serving cancer researchers inside and outside NIH, contains 135,243 concepts interrelated by 480,141 links in the April 2018 release. Due to their size and complexity, errors in ontologies are unavoidable. Users of ontologies such as SNOMED are concerned about errors <ref type="bibr" target="#b1">[2]</ref>. Thus, quality assurance (QA) is essential in the lifecycle of ontologies <ref type="bibr" target="#b2">[3]</ref>. For a summary of auditing (QA) techniques for ontologies, and in particular for SNOMED and NCIt, see <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. However, QA resources for ontologies are typically scarce, while QA tasks are labor-intensive and time-consuming. Therefore, automated or semi-automated techniques that can either help in auditing an ontology or narrow down the places to look for errors are highly desirable. Missing parent/child errors are particularly interesting to ontology curators, as IS-A links are the backbone structure of an ontology, facilitating the inheritance of lateral relationships (called roles in NCIt).</p><p>Machine Learning (ML) has proven successful in many fields, e.g., knowledge mining. ML was previously used in knowledge enrichment for ontologies <ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref>. However, can ML be used for quality assurance of ontologies, in spite of the major difference between knowledge enrichment and quality assurance? Knowledge enrichment mines external sources for new knowledge that does not exist in the ontology.
QA, in contrast, discovers incorrect or missing knowledge. Consider a missing IS-A relationship from an existing concept A to a concept B. If concept B is already in the ontology, then adding an IS-A link between A and B corrects an omission error. If concept B is not in the ontology and is added together with an IS-A link from concept A to it, then this is knowledge enrichment. We note that curators of some ontologies, e.g., NCIt, are less interested in knowledge enrichment, unless required by users, than in quality assurance. In this paper, we attempt to use ML to address the task of detecting missing IS-A links between two existing concepts. This task is more challenging than knowledge enrichment, since it requires a judgement that concept A is a specialization of concept B. For knowledge enrichment, we only need to recognize that a concept is missing from the ontology and then insert it in the proper place.</p><p>In an unpublished study, we trained a Convolutional Neural Network (CNN) deep learning model to insert new concepts into the SNOMED CT ontology, i.e., an enrichment problem. In the present work, we train a CNN deep learning model to find missing parent/child errors in the Neoplasm subhierarchy of NCIt. The vector representations of concepts are obtained from an unsupervised neural network language model. The model is evaluated by its classification recall on an unseen dataset. We check the model's classification recall by testing against 18 missing parent/child errors found by domain experts in a prior study <ref type="bibr" target="#b8">[9]</ref>. Due to the size of the Neoplasm subhierarchy, the application of the training methodology is computation-intensive and time-consuming.</p><p>In previous research we have introduced Abstraction Networks (AbNs) <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. An AbN provides a compact summarization and visual simplification of an ontology.
The SABOC (Structural Analysis of Biomedical Ontologies Center) team at NJIT has demonstrated that Abstraction Networks are an effective tool to support quality assurance of ontologies <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b11">12]</ref>. An area taxonomy <ref type="bibr" target="#b2">[3]</ref>, a type of Abstraction Network, is composed of meta-concepts called areas, connected by child-of links. An area (see Background section) represents a group of concepts with the same structure.</p><p>To accelerate the processing and improve recall, we modify the CNN methodology to limit its consideration, for each concept, to the similar concepts of its area (in our formal sense of area). The modified, restricted methodology achieves 0.81 recall on the unseen testing data. It performs 50% better than the unrestricted methodology on the 18 known errors in terms of recall. The results for detecting missing IS-A links are not yet strong enough. However, the performance in recognition of known errors is encouraging and supports further improvement of our methodology with respect to CNNs and the use of AbNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Doc2vec</head><p>Numeric representation of variable-length texts, ranging from sentences to documents, is a challenging task. Doc2vec, or Paragraph Vectors <ref type="bibr" target="#b12">[13]</ref>, an extension of word2vec (word embedding) <ref type="bibr" target="#b13">[14]</ref>, maps variable-length texts to fixed-length vectors. It is an unsupervised framework that learns continuous distributed vector representations from the unlabeled text data of a paragraph/document, while preserving the inter-relationships of the text in numeric format. In such vector representations, similar pieces of text are close to each other in Euclidean or cosine distance in lower-dimensional vector spaces. Doc2vec inherits the semantics of the words in context and takes the word order into consideration when constructing the representation. The latter advantage is important to our problem, as word order in our setup carries the concepts' topological/hierarchical order in the ontology. This is useful information for feature learning. To the best of our knowledge, this is the first study to derive vector representations for biomedical ontology classes via Doc2vec.</p></div>
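The closeness property described above can be made concrete with a short sketch. This is an illustrative stdlib computation on toy vectors, not actual Doc2vec output (the real embeddings in this work are 128-dimensional):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two fixed-length vectors; values near 1
    mean the underlying texts are close in the embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Illustrative 4-dimensional vectors (assumed values for demonstration only).
a = [0.9, 0.1, 0.0, 0.2]
b = [0.8, 0.2, 0.1, 0.2]   # a similar text yields high similarity to a
c = [0.0, 0.9, 0.8, 0.0]   # a dissimilar text yields low similarity to a
```
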
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. CNN</head><p>Convolutional Neural Networks (CNNs), initially invented for image recognition, have been widely used for various applications, including vision, speech recognition, and language translation. CNN models have also been successfully applied to various Natural Language Processing (NLP) problems such as search query retrieval <ref type="bibr" target="#b14">[15]</ref>, semantic parsing <ref type="bibr" target="#b15">[16]</ref> and sentence modeling <ref type="bibr" target="#b16">[17]</ref>. A CNN utilizes convolutional filters to automatically learn and extract local features from its various layers, regardless of the input size. This makes CNNs a very powerful tool for classification or prediction tasks, e.g., text classification <ref type="bibr" target="#b17">[18]</ref> and relation extraction <ref type="bibr" target="#b18">[19]</ref>, even if the data or features have not been manually labeled for learning purposes. To the best of our knowledge, this is the first effort to adopt the CNN model for ontology quality assurance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Neoplasms of NCIt</head><p>NCIt, a widely used cancer reference terminology, is published monthly by the National Cancer Institute (NCI) in OWL and flat-file formats. It covers cancer-related terminology in various fields, e.g., clinical care and translational and basic research. Concepts are linked to other concepts (parent concepts) in the same hierarchy by IS-A relationships. A concept may have multiple parent concepts. The semantics of concepts are defined by lateral relationships (called "roles"), e.g., Disease Has Associated Anatomic Site.</p><p>Due to the NCI's cancer focus, the Neoplasm subhierarchy of NCIt comprises 9,955 concepts. It is a core component of Disease, Disorder or Finding, the largest NCIt hierarchy with 35,081 concepts, and is modeled in more detail than the non-neoplasm concepts of that hierarchy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Areas and Area Taxonomy</head><p>An area taxonomy <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref> is a compact Abstraction Network summarizing the structure (roles) of an ontology. It is composed of areas and child-of relationships connecting areas. An area is a group of concepts having the same set of role types. A concept can be in only one area, i.e., areas are disjoint. A concept that has no parent in its area is called a root of the area. An area may have multiple roots. If a root concept of area B has a parent concept in area A, then there is a child-of relationship from area B to area A. Each colored, dashed rectangle in Fig. <ref type="figure" target="#fig_0">1</ref>(a) becomes an area with the same color in Fig. <ref type="figure" target="#fig_1">1(b</ref>). An area is labeled by its role type set and the number of concepts it summarizes. Areas with the same number of role types have the same color. For example, there are two areas colored in green, since both have two role types. Skin Neoplasm is the root concept of the red area and its parent Neoplasm by Site is in the grey area. Hence, there is a child-of relationship from the red area to the grey area, denoted as a bold arrow in Fig. <ref type="figure" target="#fig_1">1(b)</ref>. </p></div>
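A minimal sketch of the area-taxonomy construction may help. The fragment below reuses concept names from Fig. 1, but the parent links and role sets are simplified assumptions for illustration, not the actual NCIt content:

```python
from collections import defaultdict

# Toy ontology fragment: concept -> list of parents, concept -> role-type set.
parents = {
    "Benign Neoplasm": ["Neoplasm"],
    "Tumorlet": ["Neoplasm"],
    "Neoplasm by Site": ["Neoplasm"],
    "Skin Neoplasm": ["Neoplasm by Site"],
}
roles = {
    "Neoplasm": frozenset({"Disease Has Abnormal Cell"}),
    "Benign Neoplasm": frozenset({"Disease Has Abnormal Cell",
                                  "Disease Excludes Abnormal Cell"}),
    "Tumorlet": frozenset({"Disease Has Abnormal Cell",
                           "Disease Excludes Abnormal Cell"}),
    "Neoplasm by Site": frozenset({"Disease Has Abnormal Cell"}),
    "Skin Neoplasm": frozenset({"Disease Has Abnormal Cell",
                                "Disease Has Associated Anatomic Site"}),
}

# 1. Partition concepts into areas: all concepts sharing a role-type set.
areas = defaultdict(set)
for concept, role_set in roles.items():
    areas[role_set].add(concept)

# 2. A root of an area is a concept with no parent inside its own area.
def area_roots(area):
    return {c for c in area if not any(p in area for p in parents.get(c, []))}

# 3. Child-of link from area B to area A if a root of B has a parent in A.
def child_of_links(areas):
    links = set()
    for rb, b in areas.items():
        for root in area_roots(b):
            for p in parents.get(root, []):
                if roles[p] != rb:
                    links.add((rb, roles[p]))
    return links
```

In this toy fragment, Neoplasm and Neoplasm by Site share one role type and fall into the same area, mirroring how areas cluster structurally identical concepts.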
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. METHOD</head><p>We describe two methodologies, the unrestricted methodology and the refined restricted methodology. The ML training problem is viewed as a binary classification task: given a concept pair, we classify it into a positive category (there is an IS-A link) or a negative category (there is no IS-A link). We train a Convolutional Neural Network (CNN) model to solve this classification problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Unrestricted Methodology:</head><p>To train a CNN model to yield high precision, we must carefully choose the training data for both categories. The source of training samples for the positive category is the IS-A hierarchy of the ontology. The challenge is in the choice of the negative samples, as we cannot use the full set of unconnected pairs. For the Neoplasm subhierarchy of NCIt with 9,955 concepts, the size of this set is 99,075,537 (= 9,955 * 9,954 - 16,533): we subtract the 16,533 existing IS-A link pairs to obtain the set of potential missing parent/child errors. Training pairs should not be chosen randomly. We need to choose pairs where there is a reasonable likelihood for an IS-A link, not pairs that obviously have no taxonomic relation. For example, a body part concept and a drug concept are not related by an IS-A relationship and would be a bad training pair. Negative training samples should be near misses, close to the "hyperplane of separation" in SVM terms.</p><p>To address these two problems of magnitude and recall in the ML training, we limit the negative samples of the unrestricted methodology to "uncle-nephew" pairs, which are near misses. That is, we consider connections between a concept and a sibling of its parent. By only choosing "uncle-nephew" pairs, we guide the model to learn the underlying features used to distinguish IS-A-connected concept pairs from similarly positioned concept pairs that have a high potential to have secondary IS-A links but are not connected by IS-A links. There are a total of 37,147 such "uncle-nephew" pairs in the Neoplasm subhierarchy of NCIt.</p><p>The unrestricted model must consider all uncles of a concept (that are not connected to that concept by an IS-A link). Due to the size of the Neoplasm subhierarchy, the number of uncles is large. For example, there are 24, 15 and 15 concepts with 10, 11 and 12 children, respectively, and the maximum number of children of a concept is 60.
Each grandchild of a concept with 15 children has 14 uncles. If the average number of children of each sibling is 5, then there are 70 concepts with 14 uncles each. Applying the proposed technique to select negative samples from the whole hierarchy results in a large number of "uncle-nephew" pairs. This is computationally expensive for training the model and leads to low accuracy, by distracting the model from learning the subtle features that distinguish between IS-A and "uncle-nephew" links within similar groups of concepts. In other words, not all "uncle-nephew" pairs are equally useful for training purposes. We need pairs that are similar to existing IS-A-connected pairs, yet are not themselves IS-A-connected.</p><p>Additionally, many machine learning models work best with balanced training sets. The number of positive and negative samples should be approximately equal. The number of positive training instances in our problem domain is given and fixed. The number of potential negative training samples is much larger. Thus, a principled method is needed to select a number of negative training samples that is close to the number of positive training samples.</p><p>The Restricted Methodology: To cope with these problems, we introduced the restricted approach. The restricted approach limits the number of negative samples by only choosing a subset of a concept's uncles, namely those that are structurally similar to the investigated concept. To provide such a "closely related" subset, we partition the Neoplasm subhierarchy into sets of concepts of similar structure. In doing so, we avail ourselves of a powerful mechanism that derives an area taxonomy from an ontology <ref type="bibr" target="#b2">[3]</ref>. An area taxonomy is an Abstraction Network that clusters together groups of concepts according to their roles. All concepts in one area have exactly the same roles.
Concepts in different areas differ from each other in at least one role.</p><p>In the area taxonomy, each cluster of concepts constitutes an area. Due to the high average number of roles per concept in the Neoplasm subhierarchy, the number of areas is large, and the average size of an area is small. Note that AbNs are automatically derived from the ontology, so the restriction process is automatic <ref type="bibr" target="#b21">[22]</ref>. By selecting pairs only within areas, we narrow the 37,147 negative samples down to 10,574 more closely related "uncle-nephew" pairs, of the same magnitude as the 16,533 positive samples. Since all uncles in an area have exactly the same roles as the current concept, they are similar to it. Thus, the recall of the CNN training model is expected to be higher than for the unrestricted model trained with all uncles of a concept, many of which are not similar to the current concept.</p><p>The following description applies to both methodologies. Overall, our methodology comprises the following four steps:</p></div>
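The negative-sample selection of both methodologies can be sketched on a toy hierarchy. The `parents` map and the `area` assignment below are illustrative assumptions; in the actual methodology, areas are derived automatically from role-type sets:

```python
# Toy hierarchy: child -> list of parents (multiple parents allowed).
parents = {
    "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["B"], "F": ["C"],
}
# Illustrative area assignment (concept -> area id), an assumption here.
area = {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}

def uncle_nephew_pairs(parents):
    """All (nephew, uncle) pairs: an uncle is a sibling of a parent
    that is not itself a parent of the nephew."""
    children = {}
    for c, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(c)
    pairs = set()
    for nephew, ps in parents.items():
        for p in ps:
            for gp in parents.get(p, []):
                for uncle in children.get(gp, set()) - {p}:
                    if uncle not in ps:  # exclude existing IS-A links
                        pairs.add((nephew, uncle))
    return pairs

all_pairs = uncle_nephew_pairs(parents)       # unrestricted negatives
# Restricted methodology: keep only pairs whose concepts share an area.
restricted = {(n, u) for (n, u) in all_pairs if area[n] == area[u]}
```

Here the unrestricted set contains three pairs, while the area restriction keeps only the pair whose uncle is structurally similar to the nephew.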
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Document Embedding</head><p>The CNN model requires its input in the format of fixed-length feature vectors. Thus, before sending concept pairs for training, we need to transform each concept into a corresponding fixed-length vector representation.</p><p>The Paragraph Vector (Doc2vec) framework introduced by <ref type="bibr" target="#b12">[13]</ref> generates fixed-length feature vectors from variable-length pieces of text, as it was designed for text corpus processing. Thus, the problem to overcome in applying Doc2vec to ontologies is to find the vector representation of single concepts. Moreover, an IS-A link is defined by a pair of concepts; thus, a joint representation of pairs is needed that is also compatible with the input required by the CNN.</p><p>To derive the vector representation of a single concept, we need a text "description" of the concept. We recast a concept into a document such that it preserves the hierarchical and partial semantic information of the concept:</p><p>The document of a concept contains the concept ID, the name(s) of its ancestor(s), the name(s) of the concept itself, the name(s) of its child(ren) and the names of its grandchild(ren), if they exist. In this way, the document implicitly maintains the hierarchical relationships of the ontology.  </p></div>
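The document construction described above, following the Malignant Nipple Neoplasm example of Fig. 2, might look like the sketch below. The function name and the exact split between children and grandchildren are assumptions for illustration:

```python
def concept_document(concept_id, name, ancestors, children, grandchildren):
    """Recast a concept as a 'document': concept ID, ancestor names
    (root first), the concept's own name, then child and grandchild
    names, joined in hierarchical order."""
    parts = list(ancestors) + [name] + list(children) + list(grandchildren)
    return f"{concept_id}: " + " \u2192 ".join(parts)

# Example mirroring Fig. 2 (child/grandchild placement is an assumption).
doc = concept_document(
    "c5213",
    "Malignant Nipple Neoplasm",
    ["Neoplasm", "Neoplasm by Site", "Breast Neoplasm",
     "Malignant Breast Neoplasm", "Nipple Neoplasm"],
    ["Female Malignant Nipple Neoplasm", "Male Malignant Nipple Neoplasm"],
    ["Nipple Carcinoma"],
)
```

Such strings are then fed to Doc2vec (the paper uses the Gensim implementation) to obtain one 128-dimensional vector per concept.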
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Training and Testing Data</head><p>The positive samples are directly extracted from the hierarchy as all the concept pairs connected via an IS-A link. For example, (Malignant Nipple Neoplasm, Nipple Neoplasm) is a positive sample, because Malignant Nipple Neoplasm is a Nipple Neoplasm (Fig. <ref type="figure" target="#fig_3">2</ref>). As mentioned above, there are 16,533 positive samples in the Neoplasm subhierarchy. We randomly picked 2,000 positive samples for testing. The remaining 14,533 (= 16,533 - 2,000) samples are split 80% for training and 20% for validation. The positive samples are handled identically for both models.</p><p>For the restricted model, there are 10,574 potential pairs where both concepts of each pair are from the same area. We randomly picked 2,000 for testing. As noted above, there are 37,147 "uncle-nephew" negative sample pairs in the hierarchy. For the unrestricted model, we use the same 2,000 negative test pairs as for the restricted model. This is done to enable performance comparison between the two models. Similar to the way we handle the positive samples, the remaining 8,574 (= 10,574 - 2,000) samples for the restricted model and the remaining 35,147 for the unrestricted model are divided in an 80%/20% ratio between training and validation, respectively.</p><p>In addition, we down-sampled the negative samples for the unrestricted model and up-sampled for the restricted model to 14,533 samples, in order to balance the number of samples for both categories, as is customary in Machine Learning.</p></div>
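The splitting and balancing steps can be sketched as follows. The function names are ours, and integer stand-ins replace the actual concept pairs:

```python
import random

random.seed(42)  # for reproducibility of the sketch

def split_samples(samples, n_test=2000, train_frac=0.8):
    """Hold out n_test samples for testing, then split the rest
    80%/20% into training and validation."""
    pool = samples[:]
    random.shuffle(pool)
    test_set, rest = pool[:n_test], pool[n_test:]
    cut = int(len(rest) * train_frac)
    return rest[:cut], rest[cut:], test_set

def rebalance(samples, target):
    """Down-sample, or up-sample with replacement, to `target` samples."""
    if len(samples) >= target:
        return random.sample(samples, target)
    return samples + random.choices(samples, k=target - len(samples))

positives = list(range(16533))  # stand-ins for the 16,533 IS-A pairs
train, val, test = split_samples(positives)
```

With 16,533 positives, this yields 2,000 test, 11,626 training, and 2,907 validation samples; `rebalance` brings either negative set to 14,533 samples.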
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. CNN Model</head><p>We trained a CNN model with 4 convolution layers on top of vectors derived from the Neoplasm subhierarchy of NCIt via Doc2vec. The CNN model architecture is shown in Fig. <ref type="figure" target="#fig_4">3</ref>. The input to the CNN model is two 128×1 vectors, and the output is a 2×1 vector. • There are four convolution layers, each followed by a max pooling layer. This choice was informed by previous research. The first convolution layer has 18 filters with kernel size = 1. The number of filters doubles with each successive convolution layer. We use stride = 1, meaning we slide the filters one position at a time over the input. The pooling size is 2 for all max pooling layers.</p><p>• The Adam <ref type="bibr" target="#b23">[24]</ref> optimization algorithm for stochastic gradient descent is used for training, with the learning rate set to 0.001.</p><p>• The ReLU (Rectified Linear Unit) activation function is used in every convolution layer, because it has proven successful in recent research projects. It corresponds to a "rectifier function" from electrical engineering, blocking the negative half-wave and letting the positive half-wave pass through one-to-one.</p></div>
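Given the stated hyperparameters (kernel size 1, stride 1, pooling size 2, 18 filters doubling per layer), the feature-map shapes through the four convolution/pooling stages can be traced with a short sketch. The assumption that the two 128-dimensional input vectors are concatenated into one sequence of length 256 follows the concatenated-pair representation described for Fig. 2:

```python
def trace_shapes(length=256, filters=18, n_layers=4, pool=2):
    """Trace (sequence length, channels) through conv+pool stages.
    Kernel size 1 with stride 1 preserves the sequence length,
    and each max pooling layer of size 2 halves it."""
    shapes = [(length, 1)]                 # concatenated input pair
    for _ in range(n_layers):
        shapes.append((length, filters))   # after convolution
        length //= pool
        shapes.append((length, filters))   # after max pooling
        filters *= 2                       # filter count doubles per layer
    return shapes
```

Tracing gives (256, 18) after the first convolution and (16, 144) after the fourth pooling stage, the flattened size feeding the final 2×1 output.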
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Test against Reviewed Data</head><p>Traditionally, machine learning models are tested with k-fold cross-validation: all known data is partitioned into k folds, the model is trained with k-1 folds and tested with the remaining fold. This process is repeated k times, and the resulting precision, recall and F1 values are averaged. We have augmented this testing with human-expert quality assurance results.</p><p>In a previous study <ref type="bibr" target="#b8">[9]</ref>, domain experts reviewed 190 concepts from the Neoplasm subhierarchy and reported 18 missing parent errors. These data were used as ground truth in this study to check the sensitivity (recall) of our models' performance.</p></div>
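The k-fold procedure described above can be sketched in a few lines (a plain partition without shuffling or stratification, which real experiments would typically add):

```python
def k_fold_splits(samples, k=5):
    """Partition data into k folds and yield (train, test) pairs,
    each fold serving once as the held-out test set."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test_fold = folds[i]
        train_folds = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train_folds, test_fold
```

The metrics from the k iterations are then averaged, as described in the text.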
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. RESULTS</head><p>We report our CNN models' performance in the following three aspects: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Testing recall and AUC</head><p>The testing recall is 0.75 and 0.81 for the unrestricted and restricted models, respectively. Fig. <ref type="figure" target="#fig_5">4</ref> (a) and (b) show the Receiver Operating Characteristic (ROC) curves of the testing performance for the unrestricted and restricted models, respectively. The AUC (area under the curve) scores, as the measure of test accuracy, are 0.84 and 0.90, respectively.</p></div>
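For reference, the two reported metrics can be computed from their standard definitions: recall over the positive class, and AUC as the Mann-Whitney probability that a positive outscores a negative. The labels and scores below are toy values, not the study's data:

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that the classifier labels positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    random positive is scored higher than a random negative (ties 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields AUC 1.0 and a random one 0.5, which frames the reported 0.84 and 0.90 scores.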
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Confirmed errors found by domain experts</head><p>The restricted model detected 10 out of the 18 errors that domain experts found <ref type="bibr" target="#b8">[9]</ref>, while the unrestricted model detected only five errors, all contained in the above 10 errors. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Training time efficiency</head><p>Each model is trained for 2,000 epochs with a batch size of 2,000. We recorded the duration of the training. With the same computer hardware configuration, training the unrestricted and restricted models took 1,116 and 1,110 seconds, respectively. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. DISCUSSION</head><p>To the best of our knowledge, this is the first published attempt to use Machine Learning (ML) for QA of ontologies. Such a technique can prepare a subset of pairs of concepts as candidates for missing IS-A link omission errors, optimizing the use of scarce QA resources.</p><p>In this paper, we discussed two ML approaches. The unrestricted model utilizes all uncle concepts of the processed concept. The restricted model further takes advantage of the area taxonomy of the ontology to utilize only uncles in the area of the processed concept. Both ideas save processing time while improving accuracy, by training on unconnected pairs of concepts that are similar to the original IS-A links of the processed concept. The uncles are positioned similarly, as siblings of the parent of the processed concept. The uncles within the area share the same roles as the processed concept.</p><p>Table II compares the performance of the two models. As can be expected, the restricted model, which trains with similar concepts, achieves higher performance. We evaluated the results of the two ML models based on a list of errors that domain experts found in a previous study <ref type="bibr" target="#b8">[9]</ref>. Such an evaluation is usually not available for ML studies. An interesting observation is that the set of five errors confirmed by the unrestricted model is a subset of the 10 errors found by the restricted model. The reason for this may be that the two models use the same negative test pairs, where the uncles are from the same area as the nephew. This choice was made in order to be able to compare the performance of the two models, but it tends to unnecessarily limit the unrestricted model to finding error pairs of similar concepts.
In the future, we will perform experiments where the unrestricted and restricted models will use disjoint negative training data, in an effort to optimize the results of each model rather than to compare them. From Table <ref type="table" target="#tab_2">II</ref> we can see that the restricted model performs 6% better than the unrestricted model in recall. The area under the curve in Fig. <ref type="figure" target="#fig_5">4(b</ref>) is 6% larger than in Fig. <ref type="figure" target="#fig_5">4</ref>(a), reflecting better classification by the restricted model than by the unrestricted model. The document vectors used in this study were derived using the Distributed Memory version of Paragraph Vector (PV-DM), which works well for most tasks, as stated in the original paper <ref type="bibr" target="#b12">[13]</ref>. However, that paper also recommends combining Paragraph Vector with Distributed Bag of Words (PV-DBOW) to obtain more consistent results. The more accurate the vector representations of concepts are, the better the recall that should be expected. This is left for future work.</p><p>The recall obtained is not high enough for reliable QA of missing IS-A links. For example, out of a random subset of 20 suggested errors, a domain expert (GE) confirmed only one error pair, Hair Follicle Neoplasm missing the parent Dermal Neoplasm, since hair follicles reside in the dermal layer of the skin. In the NCIt, Dermal Neoplasm has 13 children, representing a mix based on cell origin as well as malignancy status. Currently, Hair Follicle Neoplasm has Skin Appendage Neoplasm as a parent. Dermal Neoplasm and Skin Appendage Neoplasm are siblings. However, Hair Follicle Neoplasm has only a very indirect relationship to the dermis, through the Disease Has Primary Anatomic Site role with Hair Follicle as the target value and the Anatomic Structure Is Physical Part Of role of Hair Follicle with the target value of Dermis.
Adding Dermal Neoplasm as a parent, or even possibly replacing Skin Appendage Neoplasm with it, might be better.</p><p>A higher recall, combined with higher precision, would imply fewer suggested errors and a higher percentage of errors confirmed by domain experts. In future research, we will further explore the properties of the two models, as well as properties of ML processing, in an effort to best fine-tune and utilize both models and to increase the recall.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. LIMITATIONS</head><p>We presented a supervised training technique; therefore, "annotated corpora" are required for its applicability beyond finding missing IS-A relationships in the NCIt Neoplasm subhierarchy. The results are also limited because the models were not trained on more general data for more general problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSION</head><p>We explored whether ML methods can be applied to the task of ontological QA, in particular whether ML can help in detecting missing IS-A links in an ontology. Two models are presented. Our application of ML to the Neoplasm subhierarchy of NCIt demonstrated that the restricted model performs better than the unrestricted one. However, the performance of the restricted model is not yet sufficient for QA. In future research, we will explore improvements in ML processing and more accurate restrictions for concepts in the training stage to improve performance of QA of ontologies.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 (</head><label>1</label><figDesc>Fig. 1(a) is an excerpt of 12 Neoplasm concepts from NCIt. Concepts are represented as rounded-corner boxes and the arrows denote IS-A relationships. Concepts with the same set of role types are enclosed within a colored dashed rectangle. For example, both Benign Neoplasm and Tumorlet have the two role types Disease Excludes Abnormal Cell and Disease Has Abnormal Cell, so they reside in the left green dashed rectangle. Fig. 1(b) shows the area taxonomy for Fig. 1(a).</figDesc><graphic coords="2,315.00,466.45,252.25,100.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. (a) Excerpt of 12 concepts from the Neoplasm subhierarchy. (b) The area taxonomy for (a) with 4 areas.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>For example, the document representation of the concept Malignant Nipple Neoplasm (Fig. 2) is "c5213: Neoplasm → Neoplasm by site → Breast neoplasm → Malignant Breast Neoplasm → Nipple Neoplasm → Malignant Nipple Neoplasm → Female Malignant Nipple Neoplasm → Male Malignant Nipple Neoplasm → Nipple Carcinoma." Thus, the generated distributed vector representation maintains the most important hierarchical relationship semantics. The document vectors are derived using the Distributed Memory version of Paragraph Vector (PV-DM) [13] via the Gensim [23] Doc2Vec implementation. Each vector has a dimensionality of 128. Pairs of concepts connected by an IS-A link are represented by the concatenation of the document vectors of the two concepts, with the child concept first.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Concept document derivation for Malignant Nipple Neoplasm</figDesc><graphic coords="4,45.35,169.45,251.40,145.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. CNN model architecture</figDesc><graphic coords="4,315.00,94.45,251.50,109.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. ROC curve of the two models</figDesc><graphic coords="5,90.50,54.00,430.90,161.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Table I shows two missing parent examples confirmed by both models, and two examples confirmed only by the restricted model.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>TABLE I .</head><label>I</label><figDesc>MISSING PARENT ERRORS CONFIRMED BY THE TWO MODELS</figDesc><table><row><cell>Child</cell><cell>Model</cell><cell>Missing Parent</cell></row><row><cell>Reproductive Endocrine Neoplasm</cell><cell>Both</cell><cell>Endocrine Neoplasm</cell></row><row><cell>Basophilic Adenocarcinoma</cell><cell>Both</cell><cell>Anterior Pituitary Gland Neoplasm</cell></row><row><cell>Breast Tubular Adenoma</cell><cell>Restricted</cell><cell>Tubular Adenoma</cell></row><row><cell>Cutaneous Glomangioma</cell><cell>Restricted</cell><cell>Benign Skin Neoplasm</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE II</head><label>II</label><figDesc></figDesc><table><row><cell>.</cell><cell cols="3">TWO MODELS PERFORMANCE COMPARISON</cell></row><row><cell></cell><cell>Unrestricted Model</cell><cell>Restricted Model</cell><cell>Difference</cell></row><row><cell>Confirmed Errors</cell><cell>5</cell><cell>10</cell><cell>5</cell></row><row><cell>Corresponding Recall (out of 18)</cell><cell>0.28</cell><cell>0.56</cell><cell>0.28</cell></row><row><cell>Testing Recall</cell><cell>0.75</cell><cell>0.81</cell><cell>0.06</cell></row><row><cell>Testing AUC</cell><cell>0.84</cell><cell>0.90</cell><cell>0.06</cell></row><row><cell>Training Time (sec)</cell><cell>1116</cell><cell>1110</cell><cell>6</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENT</head><p>Research reported in this publication was partially supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA190779. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">NCI Thesaurus: using science-based terminology to integrate cancer research results</title>
		<author>
			<persName><forename type="first">S</forename><surname>Coronado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Haber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sioutos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Tuttle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">W</forename><surname>Wright</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stud Health Technol Inform</title>
		<imprint>
			<biblScope unit="volume">107</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="33" to="37" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of SNOMED CT direct users, 2010: impressions and preferences regarding content and quality</title>
		<author>
			<persName><forename type="first">G</forename><surname>Elhanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Am Med Inform Assoc</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="36" to="44" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>Suppl</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Auditing as Part of the Terminology Design Life Cycle</title>
		<author>
			<persName><forename type="first">H</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Am Med Inform Assoc</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="676" to="690" />
			<date type="published" when="2006-12">Nov-Dec 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A review of auditing methods applied to the content of controlled biomedical terminologies</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Baorto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Cimino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="413" to="425" />
			<date type="published" when="2009-06">Jun 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Special issue on auditing of terminologies</title>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cornet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="407" to="411" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A case study on sepsis using PubMed and Deep Learning for ontology learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Arguello Casteleiro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Informatics for Health: Connected Citizen-Led Wellness and Population Health</title>
		<imprint>
			<biblScope unit="volume">235</biblScope>
			<biblScope unit="page" from="516" to="520" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Exploring the application of deep learning techniques on medical text corpora</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Miñarro-Giménez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Marin-Alonso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Samwald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stud Health Technol Inform</title>
		<imprint>
			<biblScope unit="volume">205</biblScope>
			<biblScope unit="page" from="584" to="588" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Using Word Embeddings for Ontology Enrichment</title>
		<author>
			<persName><forename type="first">İ</forename><surname>Pembeci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Intelligent Systems and Applications in Engineering</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="49" to="56" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Auditing National Cancer Institute thesaurus neoplasm concepts in groups of high error concentration</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Appl Ontol</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="113" to="130" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Abstraction networks for terminologies: Supporting management of &quot;big knowledge</title>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ochs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artif Intell Med</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2015-05">May 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Abstraction of complex concepts with a refined partial-area taxonomy of SNOMED</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="15" to="29" />
			<date type="published" when="2012-02">Feb 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Auditing Complex Concepts of SNOMED using a Refined Hierarchical Abstraction Network</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="14" />
			<date type="published" when="2012-01">Jan 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Distributed Representations of Sentences and Documents</title>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1405.4053</idno>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Learning semantic representations using convolutional neural networks for web search</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mesnil</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 23rd Int. Conf. World Wide Web</title>
				<meeting>of the 23rd Int. Conf. World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="373" to="374" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Semantic parsing for single-relation question answering</title>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Short Papers</title>
		<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="643" to="648" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">A convolutional neural network for modelling sentences</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.2188</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Convolutional neural networks for sentence classification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1408.5882</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Neural relation extraction with selective attention over instances</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 54th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2124" to="2133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Structural methodologies for auditing SNOMED</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Spackman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="561" to="581" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Relating Complexity and Error Rates of Ontology Concepts</title>
		<author>
			<persName><forename type="first">H</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Halper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Coronado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ochs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Methods of Information in Medicine</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="issue">03</biblScope>
			<biblScope unit="page" from="200" to="208" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies</title>
		<author>
			<persName><forename type="first">C</forename><surname>Ochs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Geller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Perl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Musen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J Biomed Inform</title>
		<imprint>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="page" from="90" to="105" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Software framework for topic modelling with large corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rehurek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sojka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</title>
				<meeting>the LREC 2010 Workshop on New Challenges for NLP Frameworks</meeting>
		<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
