<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Saminda</forename><surname>Abeyruwan</surname></persName>
							<email>saminda@cs.miami.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Miami</orgName>
								<address>
									<settlement>Florida</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ubbo</forename><surname>Visser</surname></persName>
							<email>visser@cs.miami.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Miami</orgName>
								<address>
									<settlement>Florida</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vance</forename><surname>Lemmon</surname></persName>
							<email>vlemmon@miami.edu</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">The Miami Project to Cure Paralysis</orgName>
								<orgName type="department" key="dep2">School of Medicine</orgName>
								<orgName type="institution">University of Miami Miller</orgName>
								<address>
									<settlement>Florida</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stephan</forename><surname>Schürer</surname></persName>
							<email>sschuerer@med.miami.edu</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Molecular and Cellular Pharmacology</orgName>
								<orgName type="institution">University of Miami Miller</orgName>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="department">School of Medicine</orgName>
								<address>
									<settlement>Florida</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8CBC3300343C101DD66C87640BFF85F9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Ontology Modeling</term>
					<term>Ontology Learning</term>
					<term>Probabilistic Methods</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Formalizing an ontology for a domain manually is well known to be a tedious and cumbersome process, constrained by the knowledge acquisition bottleneck. Researchers have therefore developed algorithms and systems that help to automate the process, among them systems that use text corpora for the acquisition. Our idea is likewise based on vast amounts of text. Here, we provide a novel unsupervised bottom-up ontology generation method based on lexico-semantic structures and Bayesian reasoning to expedite the ontology generation process. We provide one quantitative and two qualitative results illustrating our approach, using a high-throughput screening assay corpus and two custom text corpora. This process could also provide evidence for domain experts building ontologies with top-down approaches.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>An ontology is a formal, explicit specification of a shared conceptualization <ref type="bibr" target="#b9">[10]</ref>, <ref type="bibr" target="#b21">[22]</ref>. Formalizing an ontology for a given domain under the supervision of domain experts is a tedious and cumbersome process. Identifying the structures and the characteristics of the domain knowledge through an ontology is a demanding task. This problem is known as the knowledge acquisition bottleneck (KAB), and a suitable solution does not presently exist.</p><p>A large number of text corpora are available from different domains (e.g., the BioAssay high-throughput screening assays<ref type="foot" target="#foot_0">4</ref>) that need to be classified into ontologies to facilitate the discovery of new knowledge. A domain of discourse (i.e., a sequence of sentences) exhibits characteristics such as 1) redundancy, 2) structured and unstructured text, 3) noisy and uncertain data that provide a degree of belief, 4) lexical ambiguity, and 5) semantic heterogeneity. We discuss the importance of these characteristics in depth in section 3. Our goal in this research is to provide a novel method to construct an ontology from the evidence collected from a corpus. To achieve this goal, we use the lexico-semantic features of the lexicon and probabilistic reasoning to handle the uncertainty of those features. Since our method builds an ontology for a corpus without domain experts, it can be seen as an unsupervised learning technique; since it starts from the evidence present in the corpus, it can also be seen as a reverse engineering technique. We use WordNet<ref type="foot" target="#foot_1">5</ref> to handle lexico-semantic structures and Bayesian reasoning to handle the degree of belief in an uncertain event. We implement a Java-based application to serialize the learned conceptualization to the OWL DL<ref type="foot" target="#foot_2">6</ref> format.</p><p>The rest of the paper is organized as follows: section 2 provides a broad survey of related work, section 3 details our research approach, section 4 gives a detailed description of the experiments on three different text corpora and the discussion, and section 5 provides the summary and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>The problem of learning a conceptualization from a corpus has been studied in many disciplines, such as machine learning, text mining, information retrieval, natural language processing, and the Semantic Web. Table <ref type="table">1</ref> shows the pros and cons of different techniques for the problem of ontology learning. Each method covers some portion of the problem, learns the conceptualization from terms, and presents it as taxonomies and axioms in an ontology. Most of the methods use a top-down approach, i.e., an initial classification of an ontology is given. The uncertainty inherited from the domain is usually dealt with by a domain expert, and the conceptualization is normally defined using predefined rules or templates. These methods show the characteristics of a semi-supervised and semi-automated learning paradigm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Approach</head><p>Our research focuses on an unsupervised method to quantify the degree of belief that a grouping of words in the corpus provides a substantial conceptualization of the domain of interest. The degree of belief in world states influences the uncertainty of the conceptualization. The uncertainty arises from partial observability, non-determinism, laziness, and theoretical and practical ignorance <ref type="bibr" target="#b18">[19]</ref>. Partial observability arises from the size of the corpus: even though a corpus may be large, it might not contain all the necessary evidence of an event of interest. A corpus contains ambiguous statements about an event, which leads to non-determinism of the state of the event. Laziness arises because learning exceptionless rules requires too much work, and such rules are too hard to learn. Theoretical and practical ignorance arise from the lack of complete evidence: it is not possible to conduct all the necessary tests to learn a particular event. Hence, the domain knowledge, and in our case the domain conceptualization, can at best provide only a degree of belief in the relevant groups of words. We use probability theory to deal with these degrees of belief. As mentioned in <ref type="bibr" target="#b18">[19]</ref>, probability theory has the same ontological commitment as formal logic, though the epistemological commitment differs. The process of learning and presenting a probabilistic conceptualization is divided into four phases, as shown in Figure <ref type="figure" target="#fig_0">1</ref>: 1) pre-processing, 2) syntactic analysis, 3) semantic analysis, and 4) representation.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 1.</head><label>1</label><figDesc>The summary of the related work. Probabilistic learning (PR), never ending language learning (NELL), discovery and aggregation of relations in text (DART), recognizing textual entailment (RTE), automated theorem proving (ATP), natural language understanding (NLU), formal concept analysis (FCA), and ontology population (OP).</figDesc><table><row><cell>Work</cell><cell>Purpose</cell><cell>T-Box</cell><cell>A-Box</cell><cell>Method</cell></row><row><cell>PR <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b11">[12]</ref>, <ref type="bibr" target="#b13">[14]</ref> and <ref type="bibr" target="#b16">[17]</ref></cell><cell>reasoning</cell><cell>available</cell><cell>available</cell><cell>prob. theory</cell></row><row><cell>NELL <ref type="bibr" target="#b2">[3]</ref></cell><cell>24 × 7 learning</cell><cell>fixed</cell><cell>dynamic</cell><cell>ML techniques</cell></row><row><cell>DART <ref type="bibr" target="#b6">[7]</ref></cell><cell>world knowledge</cell><cell>×</cell><cell>×</cell><cell>semi-automated</cell></row><row><cell>RTE <ref type="bibr" target="#b1">[2]</ref> and <ref type="bibr" target="#b12">[13]</ref></cell><cell>entailment</cell><cell>×</cell><cell>×</cell><cell>ATP</cell></row><row><cell>NLU <ref type="bibr" target="#b19">[20]</ref></cell><cell>commonsense rules</cell><cell>×</cell><cell>×</cell><cell>semi-supervised</cell></row><row><cell>Text2Onto <ref type="bibr" target="#b5">[6]</ref></cell><cell>ontology learning</cell><cell>√</cell><cell>√</cell><cell>semi-supervised</cell></row><row><cell>LexO <ref type="bibr" target="#b23">[24]</ref></cell><cell>complex classes</cell><cell>√</cell><cell>×</cell><cell>semi-supervised</cell></row><row><cell>FCA <ref type="bibr" target="#b4">[5]</ref></cell><cell>taxonomy</cell><cell>√</cell><cell>×</cell><cell>FCA</cell></row><row><cell>OP <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b22">[23]</ref></cell><cell>ontology population</cell><cell>available</cell><cell>available</cell><cell>semi-/supervised</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Pre-processing</head><p>A corpus contains a plethora of structured and unstructured sentences. A lexicon of a language is its vocabulary built from lexemes <ref type="bibr" target="#b10">[11]</ref>, <ref type="bibr" target="#b14">[15]</ref>. A lexicon contains words belonging to a language; in our work, it contains individual words from the corpus. In its pure form, the lexicon may contain words that appear frequently in the corpus but have little value in formalizing a meaningful criterion. These words are called stop words, or in our terminology the negated lexicon, and they are excluded from the vocabulary. We first part-of-speech tag the corpus with the Penn Treebank English POS tag set <ref type="bibr" target="#b15">[16]</ref> and use the subset of tags NN, NNP, NNS, NNPS, JJ, JJR, JJS, VB, VBD, VBG, VBN, VBP, and VBZ. We also consider only words whose length W L is above some threshold W LT , where the length of a word, with respect to its POS context, is the number of characters or symbols that make up the word. By default, we consider that a word with W L &gt; 2 sufficiently conforms to some meaningful criterion. The pure form of the lexicon might contain words that need to be further purified according to some criterion; we use regular expressions for this task. We then normalize and case-fold the words <ref type="bibr" target="#b14">[15]</ref>. In addition, there are families of derivationally related words with similar meanings. We use stemming and lemmatization to reduce the inflectional and derivational forms of a word to a common base form <ref type="bibr" target="#b14">[15]</ref>, with the aid of WordNet's stemming algorithms. We couple this with the POS tag of the word to obtain the correct context when finding the common base form.</p></div>
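As an illustration, the filtering steps above can be sketched as follows. The POS tag subset and the purification regular expression are from the paper; the stop-word list, threshold name, and sample sentence are illustrative, and the WordNet-based lemmatization step is omitted for brevity.

```python
import re

# Tag subset named in Section 3.1; NEGATED_LEXICON and WLT are illustrative.
KEPT_TAGS = {"NN", "NNP", "NNS", "NNPS", "JJ", "JJR", "JJS",
             "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
NEGATED_LEXICON = {"the", "of", "and", "a", "an", "is", "are"}  # stop words
WLT = 2                                   # keep only words with WL > WLT
WORD_RE = re.compile(r"[a-zA-Z]+[-]?\w*") # purification pattern from Sect. 4

def build_lexicon(tagged_sentence):
    """Filter (word, POS) pairs into a normalized, case-folded lexicon."""
    lexicon = []
    for word, tag in tagged_sentence:
        word = word.lower()               # normalize and case-fold
        if tag not in KEPT_TAGS:          # keep only the POS subset
            continue
        if word in NEGATED_LEXICON:       # drop the negated lexicon
            continue
        if len(word) <= WLT:              # enforce the length threshold
            continue
        if not WORD_RE.fullmatch(word):   # purify with the regex
            continue
        lexicon.append((word, tag))
    return lexicon

tagged = [("The", "DT"), ("luciferase", "NN"), ("assay", "NN"),
          ("measures", "VBZ"), ("activity", "NN"), ("of", "IN"), ("it", "PRP")]
print(build_lexicon(tagged))
```

Running this keeps only the four content words of the sample sentence; a full pipeline would additionally pass each survivor through the WordNet lemmatizer with its POS tag.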
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Syntactic Analysis</head><p>The primary focus of this phase is to look at the structure of the sentences and learn the associations among the vocabulary. We assume that each sentence of the corpus follows the POS pattern given in expression (1),</p><formula xml:id="formula_0">(Subject N ounP hrase +)(V erb+)(Object N ounP hrase +)<label>(1)</label></formula><p>We hypothesize that the associations learned in this phase provide the potential candidates for concepts and relations of the ontology. The vocabulary itself, however, does not provide sufficient ontology concepts. We use a notion of grouping consecutive sequences of words to form an OWL concept. This grouping is done using an appropriate N-gram model <ref type="bibr" target="#b0">[1]</ref>. We illustrate this idea in Figure <ref type="figure" target="#fig_1">2</ref>. The group w 1 • w 2 forms a potential concept in the conceptualization; we use the notation x • y to show that the word y is appended to the word x. The groups w 2 •w 3 , w 3 •w 4 , etc. form other potential concepts in the conceptualization. Word w 3 comes after group w 1 •w 2 . Following the Bayesian viewpoint, we collect information to estimate the probability P (w 3 |{w 1 • w 2 }), which is used to form IS-A relationships, w 1 • w 2 w 3 , using an independent Bayesian network with conditional probability P ({w 1 • w 2 }|w 3 ). In addition, we count the groups appearing on the left-hand side and the right-hand side of expression (1) and the association of these groups given the verbs. These counts are used in the third phase to create the relations among concepts.</p></div>
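A minimal sketch of this grouping step, assuming bigram groups and maximum-likelihood counts; the corpus fragment, the function names, and the use of "·" as a separator are illustrative choices, not taken from the paper's implementation.

```python
from collections import Counter, defaultdict

# Form N-gram groups w_i·w_{i+1} and count how often a word follows a group,
# giving the estimate of P(w3 | w1·w2) used later as IS-A evidence.
def count_groups(sentences, n=2):
    group_counts = Counter()
    follow_counts = defaultdict(Counter)   # follow_counts[group][next_word]
    for words in sentences:
        for i in range(len(words) - n + 1):
            group = "·".join(words[i:i + n])
            group_counts[group] += 1
            if i + n < len(words):         # a word follows this group
                follow_counts[group][words[i + n]] += 1
    return group_counts, follow_counts

def p_next_given_group(word, group, group_counts, follow_counts):
    """Maximum-likelihood estimate of P(word | group)."""
    if group_counts[group] == 0:
        return 0.0
    return follow_counts[group][word] / group_counts[group]

sents = [["protein", "kinase", "assay"],
         ["protein", "kinase", "inhibitor"],
         ["protein", "kinase", "assay"]]
gc, fc = count_groups(sents)
print(p_next_given_group("assay", "protein·kinase", gc, fc))  # 2/3
```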
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Semantic Analysis</head><p>This phase conducts the semantic analysis with probabilistic reasoning, which constitutes the most important operation of our work. It determines the conceptualization of the domain using a probability distribution for IS-A relations and for relations among the concepts. Our main definition of concept learning is given in Definition 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1.</head><p>The set W = {w 1 , . . . , w n } represents the words of the vocabulary, and each w i has a prior probability θ i &gt; τ , where τ is a prior threshold known as the knowledge factor. The set G = {g 1 , . . . , g m } represents the N-gram groups learned from the corpus, and each g j has a prior probability η j . When w ∈ W and g ∈ G, P (w|g) is the likelihood probability π learned from the corpus. The entities w and g represent the potential concepts of the conceptualization, and the set W provides the potential super-concepts of the conceptualization. Within this environment, an IS-A relationship between w and g is given by the posterior probability P (g|w); it is represented with a Bayesian network having two nodes w and g and is modeled by the equation</p><formula xml:id="formula_1">P (g|w) = (π × η) / (Σ i P (w|g i ) × P (g i ))<label>(2)</label></formula><p>Using Definition 1, the probabilistic conceptualization of a domain is defined as follows.</p><p>Definition 2. The probabilistic conceptualization of the domain is represented by an n-number of independent Bayesian networks sharing groups.</p></div>
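Equation (2) is Bayes' rule with likelihood π = P(w|g) and prior η = P(g), normalized over all groups. It can be illustrated with a small worked example; the priors and likelihoods below are invented for illustration, not learned from a corpus.

```python
# Posterior P(g|w) for the two-node Bayesian network of Definition 1.
def posterior(w, g, likelihood, prior):
    """P(g|w) = P(w|g)P(g) / sum_i P(w|g_i)P(g_i)."""
    evidence = sum(likelihood[(w, gi)] * prior[gi] for gi in prior)
    return likelihood[(w, g)] * prior[g] / evidence

prior = {"g1": 0.5, "g2": 0.5}                            # η for each group
likelihood = {("assay", "g1"): 0.8, ("assay", "g2"): 0.2} # π = P(w|g)
print(round(posterior("assay", "g1", likelihood, prior), 2))  # 0.8
```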
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure" target="#fig_2">3</ref> shows a simple example of Definition 2. The interpretation of Definition 2 is as follows. Let a set G contain an n-number of finite random variables {g 1 , . . . , g n }, and let a group g i be shared by m words {w 1 , . . . , w m }. Then, with respect to the Bayesian framework, BN i of P (g i |w i ) is calculated and max(P (g i |w i )) is selected for the construction of the ontology. This means that if there exist two Bayesian networks, where network one is given by the pair {w 1 , g 1 } and network two is given by the pair {w 2 , g 1 }, then the Bayesian network that has the most substantial IS-A relationship is obtained through max BNi (P (g 1 |w i )); this network is retained, and the other Bayesian networks are ignored when building the ontology. If all P (g 1 |w i ) remain equal, the Bayesian network with the highest super-concept probability is retained. These two conditions resolve any naming issues.</p><p>The next step is to induce the relationships to complete the conceptualization. To do this, we need to find the semantics associated with each verb. We hypothesize that relations are generated by the verbs, and the definition is as follows.</p></div>
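This resolution rule can be sketched as follows: when several Bayesian networks share a group g, keep only the network with the largest posterior P(g|w), breaking ties by the higher super-concept prior. The candidate tuples and their values are illustrative.

```python
# Resolve competing networks that share a group: highest posterior wins,
# ties are broken by the super-concept probability.
def resolve_shared_group(candidates):
    """candidates: list of (word, posterior, super_concept_prior) for one g."""
    return max(candidates, key=lambda c: (c[1], c[2]))[0]

# Networks {w1, g1} and {w2, g1} compete for the shared group g1.
candidates = [("assay", 0.6, 0.3), ("screen", 0.4, 0.5)]
print(resolve_shared_group(candidates))  # assay
```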
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3.</head><p>The relationships of the conceptualization are learned from the syntactic structure modeled by expression (1) and the semantic structure modeled by the lambda expression λobj.λsub.V erb(sub, obj), where β-reduction is applied for the obj and sub of expression (1). If there exists a verb V between two groups of concepts C 1 and C 2 , the relationship of the triple (V, C 1 , C 2 ) is written as V (C 1 , C 2 ) and modeled with the conditional probability P (C 1 , C 2 |V ). The Bayesian network for a relationship is shown in Figure <ref type="figure" target="#fig_4">4</ref>, and the semantic relationship is modeled by</p><formula xml:id="formula_3">P (C 1 , C 2 |V ) = p(C 1 |V )p(C 2 |V ) → V (C 1 , C 2 )</formula><p>The relations learned from Definition 3 need to be subjected to a lower bound, known as the relations factor. When the corpus is substantially large, the number of relations is proportional to the number of verbs; not all relations may be relevant, and the factor is used as a threshold. A verb may have antonyms. If a verb is associated with some concepts, and these concepts happen to be associated with an antonym, the verb with the highest Bayesian probability value is selected for the relations map and the other relationship is removed. Finally, the probabilistic conceptualization is serialized as an OWL DL ontology in the representation phase.</p><p>Our implementation of the above phases is based on Java 6 and is named PrOntoLearn (Probabilistic Ontology Learning).</p></div>
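Definition 3 can be sketched as follows, assuming (as the factorization above suggests) that P(C1|V) and P(C2|V) are estimated from subject/object counts per verb and that the triple V(C1, C2) is scored by their product. The relations-factor value, the triples, and the function names are illustrative.

```python
from collections import Counter

# Score candidate relations V(C1, C2) by P(C1|V) * P(C2|V) and drop those
# below the relations factor (rf), used here as the lower-bound threshold.
def learn_relations(triples, rf=0.1):
    verb_count = Counter(v for v, _, _ in triples)
    subj_count = Counter((v, s) for v, s, _ in triples)
    obj_count = Counter((v, o) for v, _, o in triples)
    relations = {}
    for v, s, o in triples:
        score = (subj_count[(v, s)] / verb_count[v]) * \
                (obj_count[(v, o)] / verb_count[v])
        if score >= rf:
            relations[(v, s, o)] = score
    return relations

triples = [("measures", "assay", "activity"),
           ("measures", "assay", "activity"),
           ("measures", "screen", "binding")]
rels = learn_relations(triples)
print(sorted(rels))
```

With rf raised above 1/9, the rarer triple ("measures", "screen", "binding") would be pruned, which is the intended effect of the relations factor on large corpora.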
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>We have conducted experiments on three main corpora: 1) the PCAssay corpus of the BioAssay Ontology (BAO) project, Department of Molecular and Cellular Pharmacology, University of Miami, School of Medicine, 2) a sample collection of 38 PDF files from the ISWC 2009 proceedings, and 3) a substantial portion of the web pages extracted from the University of Miami, Department of Computer Science<ref type="foot" target="#foot_3">7</ref> domain. We have constructed ontologies for all three corpora with different parameter settings.</p><p>The first corpus contains high-throughput screening assays performed at various screening centers. This corpus grows rapidly each month; we limited our dataset to the assays available on the 1 st of January 2010. Table <ref type="table" target="#tab_0">2</ref> provides the statistics of the corpus. We extract the vocabulary matched by the [a-zA-Z]+[-]?\w* regular expression and normalize the words to create the vocabulary. The average file size of the corpus is approximately 6 KB. We conducted these experiments on a Toshiba laptop with a Genuine Intel(R) CPU 585 @ 2.16 GHz (32-bit) and 2 GB of memory. We found that the time required to build the conceptualization grows linearly. We use precision, recall, and F1 measures to evaluate the ontology, along with recommendations from domain experts, especially comments on the generated bioassay ontology. The generated ontology is too large to show here. Instead, we provide a few distinct snapshots of the ontology with the help of the Protégé OWLViz plugin. Figures <ref type="figure">5 and 6</ref> show snapshots of the ontology created from the BioAssay Ontology corpus for input parameters KF = 0.5, N-gram = 3, and RF = 0.9.
Figure <ref type="figure">5</ref> shows the IS-A relationships and Figure <ref type="figure">6</ref> shows the binary relationships.</p><p>According to the experts, the ontology contains a rich vocabulary, which is very useful for top-down ontology construction; they also noted that the ontology has a sufficiently good structure. The www.cs.miami.edu corpus is used to calculate quantitative measurements. Gold-standard-based measures such as precision (P ), recall (R), and F-measure (F 1 ) are used to evaluate ontologies <ref type="bibr" target="#b7">[8]</ref>. We use a slightly modified version of <ref type="bibr" target="#b20">[21]</ref> as our reference ontology. Table <ref type="table">3</ref> shows the results. The average precision of the constructed ontology is approximately 42%. Note that we use only one reference ontology; with a different reference ontology, the precision values would vary, i.e., the precision depends on the available ground truth.</p><p>The results show that our method creates an ontology for any given domain with acceptable results, as reflected in the precision value when ground truth is available. If the domain has no ground truth, the results are subject to domain-expert evaluation of the ontology. One potential problem we have seen in our approach is the search space: since our method is unsupervised, it tends to search the entire space for results, which is computationally costly. We thus need a better method to prune the search space so that our method provides better results. According to the domain experts, our method extracts a good vocabulary but provides a flat structure. They have proposed a semi-supervised approach to correct this problem by combining the knowledge from domain experts with the results produced by our system. We leave the detailed investigation for future work.
Since our method is based on Bayesian reasoning (which uses N-gram probabilities), it is paramount that the corpus contain enough evidence in the form of redundant information. This requires the corpus to be large enough that we can hypothesize it provides enough evidence to build the ontology.</p><p>We hypothesize that a sentence of the corpus is generally subject to the grammar rule given in expression (1). This constituent is the main factor used to build the relationships among concepts. In NLP, there are many finer-grained grammar rules that fit specific sentences; if these rules were used, we believe we could build a better relationship model. We have left this for future work.</p><p>At the moment, our system does not distinguish between concepts and the individuals of the concepts. The learned A-Box primarily consists of the probabilities of each concept. This is one area we are eager to work on; using state-of-the-art NLP techniques, we plan to fill this gap in future work.</p><p>Since our method has the potential to be used on any corpus, the lemmatizing and stemming algorithms available in WordNet may not recognize some of the words. Especially in the BioAssay corpus, we observed that some of the domain-specific words are not recognized by WordNet. We used the Porter stemming algorithm <ref type="bibr" target="#b17">[18]</ref> to obtain the word form, but found that this algorithm constructs peculiar word forms; therefore, we deliberately removed it from the processing pipeline.</p><p>The complexity of our algorithms is as follows. The bootstrapping algorithm in the syntactic layer has a worst-case running time of O(M ×max(s j )× max(w k )), where M is the number of documents, s j is the number of sentences in a document, and w k is the number of words in a sentence.
The probabilistic reasoning algorithm has a worst-case running time of O(|L|×|SuperConcepts|), where |L| is the size of the lexicon and |SuperConcepts| is the size of the super-concept set. The ontologies generated by the system are consistent under the Pellet<ref type="foot" target="#foot_4">8</ref> and FaCT++<ref type="foot" target="#foot_5">9</ref> reasoners.</p><p>Finally, our method provides a process to create a lexico-semantic ontology for any domain. To our knowledge, this is the first research along this line of work, and we will continue it to provide better results for future use.</p></div>
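The gold-standard evaluation used in the experiments (precision, recall, and F1) can be sketched as a concept-overlap comparison between the learned ontology and the reference ontology; the concept sets below are illustrative, not from the evaluated corpora.

```python
# Lexical precision/recall/F1 over the overlap of concept names between the
# learned ontology and a gold-standard reference ontology.
def evaluate(learned, reference):
    tp = len(learned & reference)                 # concepts found in both
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

learned = {"course", "faculty", "student", "assay"}
reference = {"course", "faculty", "student", "department"}
p, r, f1 = evaluate(learned, reference)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.75 0.75
```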
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We have introduced a novel process to generate an ontology for an arbitrary text corpus and have shown that our process constructs a flexible ontology. We have also shown that, to achieve high precision, it is paramount that the corpus be large enough to extract important evidence. Our research has further shown that probabilistic reasoning on lexico-semantic structures is a powerful solution to overcome, or at least mitigate, the knowledge acquisition bottleneck.</p><p>Our method also provides evidence for domain experts to build ontologies using a top-down approach. Though we have introduced a powerful technique to construct ontologies, we believe there is much work that can be done to improve the performance of our system. One area where our method falls short is the separation between concepts and individuals. We would like to use the generated ontology as a seed ontology to generate instances for the concepts and to extract the individuals currently classified as concepts. Finally, we would like to increase the lexicon of the system with more tags from the Penn Treebank tag set. We believe that if we introduce more tags into the system, it can be trained to construct human-readable (friendly) concept and relation names.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Overall process: the process is categorized into four phases; pre-processing, syntactic analysis, semantic analysis &amp; representation</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. An example three-gram model</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. w1, w2, w3, w4 and w5 are super-concepts. g1, g2, g3 and g4 are candidate subconcepts. There are 5 independent Bayesian networks. Bayesian networks 2 and 5 share the group g2 when representing the concepts of the conceptualization</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Bayesian networks for relations modeling. C1 and C2 are groups and V is a verb</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 5 .Fig. 6 .</head><label>56</label><figDesc>Fig. 5. An example snapshot of the BioAssay Ontology corpus with IS-A relations</figDesc><graphic coords="9,134.86,120.06,345.25,164.14" type="bitmap" /></figure>
			<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2 .</head><label>2</label><figDesc>The PCAssay (the BioAssay Ontology project) corpus statistics</figDesc><table><row><cell>Title</cell><cell>Statistics</cell><cell>Description</cell></row><row><cell>Documents</cell><cell>1,759</cell><cell>All documents are XHTML formatted with a given template</cell></row><row><cell>Unique ConceptWords</cell><cell>13,017</cell><cell>Normalized candidate concept words from NN, NNP, NNS, JJ, JJR &amp; JJS using [a-zA-Z]+[-]?\w*</cell></row><row><cell>Unique Verbs</cell><cell>1,337</cell><cell>Normalized verbs from VB, VBD, VBG, VBN, VBP &amp; VBZ using [a-zA-Z]+[-]?\w*</cell></row><row><cell>Total ConceptWords</cell><cell>631,623</cell><cell></cell></row><row><cell>Total Verbs</cell><cell>109,421</cell><cell></cell></row><row><cell>Total Lexicon</cell><cell>741,044</cell><cell>Lexicon = ConceptWords ∪ Verbs</cell></row><row><cell>Total Groups</cell><cell>631,623</cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://bioassayontology.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">http://wordnet.princeton.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">http://www.w3.org/TR/owl-guide/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">http://www.cs.miami.edu</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">http://clarkparsia.com/pellet</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">http://owl.man.ac.uk/factplusplus/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was partially funded by the NIH grant RC2 HG005668.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The design, implementation and use of the n-gram statistics package</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pedersen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics</title>
				<meeting>the Fourth International Conference on Intelligent Text Processing and Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="370" to="381" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Recognising textual entailment with logical inference</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Markert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">HLT &apos;05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing</title>
				<meeting><address><addrLine>Morristown, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="628" to="635" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Coupled semi-supervised learning for information extraction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Betteridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">R</forename><surname>Hruschka</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Mitchell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WSDM &apos;10: Proceedings of the third ACM international conference on Web search and data mining</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="101" to="110" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Modeling documents by combining semantic concepts with unsupervised statistical learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chemudugunta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holloway</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Steyvers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC &apos;08: Proceedings of the 7th International Conference on The Semantic Web</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="229" to="244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Learning concept hierarchies from text corpora using formal concept analysis</title>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hotho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="305" to="339" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Völker</surname></persName>
		</author>
		<title level="m">Text2Onto - A Framework for Ontology Learning and Data-driven Change Discovery</title>
				<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Large-scale extraction and use of knowledge from text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Harrison</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth international conference on Knowledge capture</title>
				<meeting>the fifth international conference on Knowledge capture<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="153" to="160" />
		</imprint>
	</monogr>
	<note>K-CAP &apos;09</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Strategies for the evaluation of ontology learning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Dellschaft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge</title>
				<meeting>Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge<address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="253" to="272" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Probabilistic Extension to Ontology Language OWL</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">HICSS &apos;04: Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS&apos;04) - Track 4</title>
				<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page">40111</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A translation approach to portable ontology specifications</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Gruber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge Acquisition</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="199" to="220" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Martin</surname></persName>
		</author>
		<imprint>
			<publisher>Pearson Education International</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
	<note>2nd edn</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">P-CLASSIC: A tractable probabilistic description logic</title>
		<author>
			<persName><forename type="first">D</forename><surname>Koller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pfeffer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AAAI-97</title>
				<meeting>AAAI-97</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="390" to="397" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Discovery of inference rules for question-answering</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pantel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="343" to="360" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Probabilistic description logics for the Semantic Web</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lukasiewicz</surname></persName>
		</author>
		<idno>Nr. 1843-06-05</idno>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
		<respStmt>
			<orgName>Institut für Informationssysteme, Technische Universität Wien</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. rep.</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Introduction to Information Retrieval</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Cambridge University Press</publisher>
			<pubPlace>New York, NY, USA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Building a large annotated corpus of English: The Penn Treebank</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Marcinkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Unsupervised ontology induction from text</title>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the Forty-Eighth Annual Meeting of the Association for Computational Linguistics<address><addrLine>Uppsala, Sweden</addrLine></address></meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An algorithm for suffix stripping</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Program</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="130" to="137" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Artificial Intelligence: A Modern Approach</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Russell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Prentice Hall</publisher>
		</imprint>
	</monogr>
	<note>3rd edn</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A question answering system based on conceptual graph formalism</title>
		<author>
			<persName><forename type="first">W</forename><surname>Salloum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 Second International Symposium on Knowledge Acquisition and Modeling</title>
				<meeting>the 2009 Second International Symposium on Knowledge Acquisition and Modeling<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="383" to="386" />
		</imprint>
	</monogr>
	<note>KAM &apos;09</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><surname>Shoe</surname></persName>
		</author>
		<ptr target="http://www.cs.umd.edu/projects/plus/SHOE/cs.html" />
		<title level="m">Example computer science department ontology</title>
				<imprint>
			<date type="published" when="2010-06-14">June 14, 2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Knowledge engineering: Principles and methods</title>
		<author>
			<persName><forename type="first">R</forename><surname>Studer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Benjamins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fensel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data and Knowledge Engineering</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="161" to="197" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Weakly supervised approaches for ontology population</title>
		<author>
			<persName><forename type="first">H</forename><surname>Tanev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge</title>
				<meeting>Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge<address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="129" to="143" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Acquisition of OWL DL axioms from lexical resources</title>
		<author>
			<persName><forename type="first">J</forename><surname>Völker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC &apos;07: Proceedings of the 4th European conference on The Semantic Web</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="670" to="685" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
