<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Semi-Automated Data-Driven Methods to Support Ontology Development: A Case Study on a Rehabilitation Therapy Ontology</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mohammad</forename><forename type="middle">K</forename><surname>Halawani</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Newcastle University</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Department of Information Systems</orgName>
								<orgName type="institution">Umm Al-Qura University</orgName>
								<address>
									<country key="SA">Saudi Arabia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rob</forename><surname>Forsyth</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Institute of Neuroscience</orgName>
								<orgName type="institution">Newcastle University</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Phillip</forename><surname>Lord</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Newcastle University</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Semi-Automated Data-Driven Methods to Support Ontology Development: A Case Study on a Rehabilitation Therapy Ontology</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">975172D0E46EC38E496C223D4DD470CD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Ontology development is expensive and requires significant effort from both domain experts and ontologists. Automating the process involves knowledge acquisition, which is intrinsically hard, and usually produces unsatisfactory results. In this abstract, we investigate semi-automated techniques for bootstrapping and supporting data-driven ontology development.</p><p>Rehabilitation therapies are hard to describe, measure and compare; unlike pharmacologic therapies, they are not precisely defined. This poses an interesting ontological challenge, because rehabilitation treatments are practice-based, diverse and involve interactions between a therapist, a patient and their environment. We therefore use rehabilitation as a case study and build a rehabilitation therapy ontology (RTO).</p><p>Here, we propose a pipeline for building semantic knowledge structures that support the development of ontologies from biomedical literature. The pipeline starts with a small initial set of articles provided by domain experts. This requires relatively little from the experts beyond a set of references to appropriate papers, something that most researchers will already have through their normal bibliography management tools. The initial set of articles does not cover the whole domain; therefore, we expand it to a corpus of PubMed records that are relevant and cover the scope of the initial set, using PubMed's live similar articles functionality and our relative similarity measure <ref type="bibr" target="#b0">[1]</ref>, which retrieves articles related to the initial set as a whole. In our case study, we were able to expand from an initial set of 200 references, provided by two experts in the domain of rehabilitation, to around 28,000 references using this technique.</p><p>Full texts of the identified records in the corpus are then retrieved and passed through several text pre-processing and cleaning steps. 
For phrase detection, we then apply word2phrase, which is based on word co-occurrences. The words and phrases in the text are the terms of the corpus, but they are not necessarily representative of the domain. To determine semantically meaningful, domain-related representative terminology, we apply the term frequency-inverse document frequency (tf-idf) technique. The result is a list of terms and phrases ranked according to how well they represent the domain. Domain experts can then choose a threshold on the tf-idf scores to identify and extract the top-ranked representative terms.</p><p>The list of extracted terms represents neither the semantics of the terms nor the relationships among them. Therefore, we develop a semantic knowledge structure that represents both. To develop the knowledge structure, we use the list of extracted terms, their word embeddings from a trained word2vec <ref type="bibr" target="#b1">[2]</ref> model, and a Directed Acyclic Graph (DAG) based on their lexical similarities, i.e. string-substring relationships. Semantic "subclass" relationships were found among the terms using the word2vec analogy technique. These were confirmed via the lexical DAG. Thus, we have a taxonomy-like knowledge structure based on word2vec semantic relationships. To add relationships other than "subclass" to the structure, we can modify the word2vec analogy questions.</p><p>We hope that domain experts and curators can use the final structure to bootstrap an ontology rather than starting from scratch. This is similar to scaffolding the mitochondrial disease ontology <ref type="bibr" target="#b2">[3]</ref>; however, rather than using scaffolds from existing knowledge sources, we have generated the scaffolds in a data-driven manner. These scaffolds are initially linked through easily discovered semantic relations and come with a "todo" list ranked by importance (i.e. the ranked list of terms), so that curators can bootstrap the ontology in order.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1:</head><label>1</label><figDesc>Fig. 1: Pipeline to support ontology development from literature.</figDesc><graphic coords="1,206.80,481.26,201.75,159.75" type="bitmap" /></figure>
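<div xmlns="http://www.tei-c.org/ns/1.0"><p>The lexical DAG over string-substring relationships can be illustrated with a minimal sketch. It is not the implementation used in the case study. It assumes that phrases are underscore-joined, that a "substring" means a contiguous word sequence contained in a longer term, and that only direct edges are kept. The function names and any example terms are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
from collections import defaultdict


def tokens(term):
    """Split an underscore-joined phrase into a tuple of words (assumed format)."""
    return tuple(term.split("_"))


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def build_lexical_dag(terms):
    """Map each term to the longer terms that directly contain it.

    Edges always point from a shorter term to a strictly longer one, so the
    graph cannot contain a cycle. Indirect edges are dropped so that each
    term is linked only to its nearest containing terms.
    """
    toks = {t: tokens(t) for t in set(terms)}
    # For every term, collect all longer terms that contain it.
    supers = {t: {u for u in toks if contains(toks[u], toks[t])} for t in toks}
    dag = defaultdict(set)
    for t, candidates in supers.items():
        for u in candidates:
            # Keep t -> u only if no other candidate sits between t and u.
            if not any(u in supers[v] for v in candidates if v != u):
                dag[t].add(u)
    return dict(dag)
```
]]></code></p>
<p>For example, given the hypothetical terms therapy, physical_therapy and pediatric_physical_therapy, the sketch links therapy to physical_therapy and physical_therapy to pediatric_physical_therapy, but adds no direct edge from therapy to pediatric_physical_therapy. Each edge is only a candidate "subclass" link; in the pipeline described here, such links are cross-checked against the word2vec analogy results.</p></div>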
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">A literature based approach to define the scope of biomedical ontologies: A case study on a rehabilitation therapy ontology</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Halawani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Forsyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1709.09450</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Scaffolding the mitochondrial disease ontology from extant knowledge sources</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Warrender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1505.04114</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
