
A Literature Based Approach to Define the Scope of Biomedical Ontologies: A Case Study on a Rehabilitation Therapy Ontology

Mohammad K. Halawani

M.K.H.Halawani2@newcastle.ac.uk (affiliations 0 and 2)

Rob Forsyth

Phillip Lord

0 Department of Information Systems, Umm Al-Qura University, Saudi Arabia
1 Institute of Neuroscience, Newcastle University, UK
2 School of Computing Science, Newcastle University, UK

In this article, we investigate our early attempts at building an ontology describing rehabilitation therapies following brain injury. These therapies are wide-ranging, involving interventions of many different kinds. As a result, these therapies are hard to describe. As well as restricting actual practice, this is also a major impediment to evidence-based medicine as it is hard to meaningfully compare two treatment plans. Ontology development requires significant effort from both ontologists and domain experts. Knowledge elicited from domain experts forms the scope of the ontology. The process of knowledge elicitation is expensive, consumes experts' time and might have biases depending on the selection of the experts. Various methodologies and techniques exist for enabling this knowledge elicitation, including community groups and open development practices. A related problem is that of defining scope. By defining the scope, we can decide whether a concept (i.e. term) should be represented in the ontology. This is the opposite of knowledge elicitation, in the sense that it defines what should not be in the ontology. This can be addressed by pre-defining a set of competency questions. These approaches are, however, expensive and time-consuming. Here, we describe our work toward an alternative approach, bootstrapping the ontology from an initially small corpus of literature that will define the scope of the ontology, expanding this to a set covering the domain, then using information extraction to define an initial terminology to provide the basis and the competencies for the ontology. Here, we discuss four approaches to building a suitable corpus that is both sufficiently covering and precise.

1 INTRODUCTION

Rehabilitation therapies, unlike pharmacologic therapies, are difficult to define precisely, both qualitatively and quantitatively (van Heugten et al., 2012), and many approaches have been taken to parsing them. It is recognised that traditional approaches to designation (e.g. “dressing practice”) are flawed, as two professionals’ rehabilitation sessions both targeting difficulties in dressing could differ in pertinent active ingredients (e.g. actions, chemicals, devices, or forms of energy) as experienced by the patient. Assumptions that rehabilitation content can be inferred from the targeted impairment (e.g. “balance training”) are equally flawed: no one would consider it appropriate to treat bariatric surgery, calorie-restricted diets and exercise programmes as equivalent forms of “obesity therapy” (Whyte et al., 2014). This lack of a shared terminology makes it difficult to describe, measure and meaningfully compare rehabilitation therapies and treatments.

Building a taxonomy for rehabilitation treatments could lead to a better shared understanding of rehabilitation interventions (Dijkers, 2014). Hence, a rehabilitation treatment ontology (RTO) of rehabilitation terms, in which the terms represent the concepts and knowledge of the domain (Sowa, 2000), should make it easier to disseminate treatments and to communicate about them clearly and effectively through a shared understanding.

To enable building the RTO, we need to define both the terms that we wish to be in the ontology and those that should not be. Some ontologies have extremely well-defined scopes, such as the Karyotype ontology (Warrender and Lord, 2013), which is an ontological representation of a previously defined informal specification. Others, such as the mitochondrial disease ontology (Warrender, 2015), relate to a specific area of knowledge or, like the Gene Ontology (GO) (Ashburner et al., 2000), to a broad area but at a specific granularity. For the RTO, unfortunately, the breadth of the area means that we lack this clear statement of scope.

Of course, there has been significant research on ontology learning, enabling either automation or semi-automation of the ontology construction process (Buitelaar et al., 2005). For the RTO, we aim to use a semi-automated approach, combined with a highly programmatic, pattern-driven ontology construction methodology that we have pioneered previously with the mitochondrial disease ontology (Warrender and Lord, 2015). This methodology separates terms out into a scaffold that is generated automatically, often from a pre-existing structured source such as a database. This is followed by manual refinement using the vocabulary provided by this scaffold.

With the RTO, we plan to extend this ontology construction methodology: first, we will build a corpus of appropriate literature that will define the scope of the ontology; then we can use this to extract a set of representative terms and phrases; finally, we will use these terms and phrases as the basis for our ontological scaffold (Warrender and Lord, 2015). This should provide both coverage and scope for our ontology, which we can then refine and build further either manually or through the addition of further scaffolded terms, identified during the first phase of development. We have previously used a similar methodology to ensure good coverage and define the scope of MITAP, a minimum information model (Lord et al., 2016).

This leaves us with the problem of defining an appropriate corpus of literature for the RTO. This corpus needs to cover the domain adequately; at the same time, we would like this corpus to reflect the opinions of a wider community than the experts involved in its construction. This is a common problem with ontology development: if the scope is too narrow, the ontology will fulfil the needs of only a few; if it is too broad, the ontology will either grow large or contain only general terms.

The aim of this article is to investigate different semi-automated methods and search strategies to retrieve a corpus with a high level of accuracy and coverage with respect to the community’s needs for the RTO. The accuracy and coverage of a corpus are its precision and recall, respectively, in relation to the scope of rehabilitation. We describe four different techniques that we have used, all based on PubMed, and describe their advantages and disadvantages.

2 METHODS

For this work, we have used PubMed exclusively to define our corpus. As a corpus, PubMed is far from ideal. While it contains many papers about rehabilitation, they are mostly written from an academic perspective and may make a different use of vocabulary from the clinicians. A significant percentage of the papers in PubMed have only abstracts accessible (although, under UK law, we may be able to access full text by other means (gre, 2014)). However, it has other significant advantages: it is freely available; there are no patient confidentiality restrictions as there would be with medical records; finally, it has a good API and is easy to access computationally.

We use two additional features of PubMed in this paper. First, papers are annotated with Medical Subject Headings (MeSH). MeSH is a thesaurus organised into a hierarchy; a search with a single term also searches the transitive closure of that term. Curators can also define a MeSH annotation as the “major term” or MAJR. Secondly, PubMed provides a similar articles functionality (PMSA), based on text similarity (U.S. National Library of Medicine (NLM), 2017). Currently, this functionality only allows retrieving MEDLINE records (i.e. PubMed citations) similar to a single user-selected record. We discuss this limitation later.

Additional search functionality described in this paper was implemented using Python, exploiting the Entrez module of BioPython (Cock et al., 2009).

2.1 Forming a Corpus

The simplest approach to generating a suitable corpus is a keyword search. We tried this for RTO, searching with the term “rehabilitation”. This naive approach does not work well, as it misses many papers which contain the same stem but with a different ending (such as “rehabilitate” or “rehabilitator”). Moreover, it retrieves many less relevant results (for example, those relating to drug rehabilitation).

Our next approach is to use MeSH or MAJR terms. PubMed’s search engine automatically searches the transitive closure of any MeSH term given; therefore, a search with “Rehabilitation” will also search “Physical Therapy Modalities”, as can be seen in Figure 1.

Clearly, searching for “Rehabilitation” as the MAJR term will produce a result which is a subset of the result of searching for the equivalent MeSH term. In fact, the simple search approach automatically incorporates MeSH search, as PubMed’s search engine translates a search term to its equivalent MeSH term if one exists. For example, the term “Physiotherapy” is translated to the “Physical Therapy Modalities” MeSH term.

The MeSH search approach also runs the risk of missing papers which have not been annotated at all, or have been annotated with alternative terms from MeSH.
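The keyword, MeSH and MAJR searches described above could be expressed as PubMed query strings roughly as follows. This is a minimal sketch and not the code used for this study: `[MeSH Terms]` and `[MeSH Major Topic]` are PubMed's standard field tags, whereas the function names, the `email` parameter and the choice of `retmax=0` (count only) are our own assumptions. The Biopython import is deferred so that the query builders can be used without Biopython or a network connection.

```python
def keyword_query(term):
    """Plain keyword search; PubMed may map the term to MeSH automatically."""
    return term


def mesh_query(term, major_only=False):
    """Restrict the search to a MeSH heading, or to MeSH major topics (MAJR)."""
    tag = "MeSH Major Topic" if major_only else "MeSH Terms"
    return f'"{term}"[{tag}]'


def count_records(query, email):
    """Return the number of PubMed records matching `query`.

    Needs Biopython and network access; NCBI asks callers to supply an
    e-mail address. Not called anywhere in this sketch.
    """
    from Bio import Entrez  # lazy import: only required for live searches

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return int(result["Count"])


# The three query strings corresponding to the first two approaches.
queries = [
    keyword_query("rehabilitation"),
    mesh_query("Rehabilitation"),
    mesh_query("Rehabilitation", major_only=True),
]
```

Passing each string in `queries` to `count_records` would give the record counts for the respective strategy; the counts themselves depend on the state of PubMed at query time.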

To address this latter problem, we have tried query expansion. Here, we expand the transitive closure of the MeSH term, then add alternative endings manually. Sub-terms, more specifically narrower terms, of “rehabilitation” were extracted using the “MeSH SPARQL” tool (see footnote 1). The following SPARQL query was used:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
    PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
    SELECT ?label
    FROM <http://id.nlm.nih.gov/mesh>
    WHERE {
      ?term meshv:broaderDescriptor+ mesh:D012046 .
      ?term rdfs:label ?label .
    }

We collected and filtered general terms used in other domains, such as “Yoga” and “Massage”, by inspection. These are mostly the ones without medical words such as rehabilitation or therapy. Synonyms of the term “rehabilitation” were defined by consultation with a domain expert: specifically, “restoration” and “recovery”. Multiple variations of these words were determined manually using a dictionary. Variations of the words “therapy” and “rehabilitation” include “therapies”, “therapist”, “rehabilitant” and “rehabilitated”; these were injected into the query. Finally, the collected general terms were combined into a MeSH approach query; the rest of the terms were combined into a query that is disjunctive between noun phrases and their variants. For example, the term “physical therapy” was converted to:

    Physical therapy OR Physical AND (therapy OR therapies OR therapist OR therapists OR therapeutic OR ...
        OR rehabilitation OR rehabilitate OR rehabilitator OR ...
        OR restoration OR restore OR ...
        OR recovery OR ...)

The two queries were combined to form the expanded query. The result of this approach subsumes the results of the two previous approaches. Thus, this approach provides the most coverage. In fact, we retrieved around 2.9 million MEDLINE records using the query expansion approach. Table 1 shows the search terms for each approach along with the number of retrieved records.
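The construction of the expanded query can be illustrated with a short sketch. It follows the pattern of the “physical therapy” example above; the function names and the variant list are hypothetical and deliberately incomplete, and joining the two sub-queries with OR is our assumption, based on the statement that the expanded result subsumes the earlier ones.

```python
def expand_phrase(modifier, head, variants):
    """Build a disjunctive query for a two-word noun phrase.

    Pattern: "<modifier> <head> OR <modifier> AND (<v1> OR <v2> OR ...)".
    """
    alternatives = " OR ".join(variants)
    return f"{modifier} {head} OR {modifier} AND ({alternatives})"


def combine_queries(queries):
    """Join sub-queries into one disjunction, parenthesising each of them."""
    return " OR ".join(f"({q})" for q in queries)


# Illustrative subset of the manually collected word variants.
VARIANTS = ["therapy", "therapies", "therapist", "rehabilitation",
            "rehabilitate", "restoration", "recovery"]

expanded = expand_phrase("Physical", "therapy", VARIANTS)

# A general term such as "Yoga" is searched as a MeSH term instead.
full_query = combine_queries(['"Yoga"[MeSH Terms]', expanded])
```

Keeping the variant list as plain data makes it straightforward for a domain expert to review or extend it without touching the query-building code.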

The query expansion search approach provides a significant increase in the number of records. We tested the accuracy of the approach by random selection of papers, followed by expert analysis to determine whether the papers were in scope. Unfortunately, the accuracy of this approach appears fairly low, with around 5% of the papers considered in scope.

1 MeSH SPARQL is available at https://id.nlm.nih.gov/mesh/query

[Table 1. Columns: Search Strategy; Query Search Term(s)]
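The accuracy check described above, random selection followed by expert judgement, might be sketched as follows. The function names, the fixed random seed and the dictionary of judgements are assumptions made for illustration; the sample size used in the study is not stated and is therefore left as a parameter.

```python
import random


def sample_for_review(pmids, sample_size, seed=0):
    """Draw a reproducible random sample of PMIDs for expert review."""
    rng = random.Random(seed)
    # Sorting first makes the sample independent of set iteration order.
    return rng.sample(sorted(pmids), sample_size)


def estimated_precision(judgements):
    """Fraction of reviewed records that the expert marked as in scope.

    `judgements` maps each reviewed PMID to True (in scope) or False.
    """
    if not judgements:
        raise ValueError("no judgements supplied")
    return sum(judgements.values()) / len(judgements)
```

The estimate is only as reliable as the sample is large; a small sample gives a rough indication rather than a precise figure.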

Finally, we have pioneered a relative similarity measure. This builds on PubMed’s existing article similarity score, and allows us to define similarity to a set of articles. Retrieved records are ranked with a relativity score, which is calculated as follows:

\[
\text{relativity score}(a) = \frac{\#\bigl(\text{similar articles}(a) \cap s\bigr)}{\max\bigl(\#s,\ \#\text{similar articles}(a)\bigr)}
\]

where s is the seed set and a is an article (i.e. a MEDLINE record). From this equation, for a record to have a relativity score of 1.0, all of its similar records need to cover all of the records in the seed set. In other words, a record can only have a relativity score of 1.0 if its set of similar records is equivalent to the seed set. If it has a similar record that is not in the seed set, or if there is a record in the seed set that is not similar to it, the relativity score will be less than 1.0. Thus, for higher scores, a record not only must be similar to more records in the seed set, but also needs to have fewer similar records outside the seed set.

Figure 2 shows an example of this approach. There are 3 seed MEDLINE records. The relativity score for the node D is 1.0, as all of its similar records are in the seed set. Below are some of the other records’ scores:

relativity score(E) = 1/3, relativity score(G) = 2/3, relativity score(K) = 3/8

Although K, like D, is similar to all the records in the seed set, its score, unlike that of D, is lower than that of G, as it is also similar to more records outside the seed set. Records with higher scores can be considered more relatively similar to the seed set. A significant advantage of this approach is that the result is continuous and can be thresholded to include more or fewer papers as required.
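The relativity score is simple to compute once the similar-article set of each record is known. The sketch below is a minimal illustration, not the implementation used in this work: the similar-article sets are hypothetical and were chosen only so that the scores match those quoted for D, E, G and K; they are not read off Figure 2.

```python
from fractions import Fraction


def relativity_score(similar, seed):
    """Size of (similar & seed) divided by max(size of seed, size of similar)."""
    similar, seed = set(similar), set(seed)
    if not seed:
        raise ValueError("seed set must not be empty")
    return Fraction(len(similar & seed), max(len(seed), len(similar)))


SEED = {"A", "B", "C"}

# Hypothetical similar-article sets (assumption, for illustration only).
SIMILAR = {
    "D": {"A", "B", "C"},                           # identical to the seed set
    "E": {"A", "F"},                                # one seed record
    "G": {"A", "B", "H"},                           # two seed records
    "K": {"A", "B", "C", "F", "H", "I", "J", "L"},  # all seeds plus five others
}

scores = {node: relativity_score(sim, SEED) for node, sim in SIMILAR.items()}

# Rank records from most to least relatively similar to the seed set.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Using exact fractions rather than floats keeps the scores directly comparable with the values given in the text.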


After achieving a maximal set of citations covering the topic, a minimal accurate set was provided by a domain expert. The expert set of articles was provided as an EndNote library file. We converted the articles in the library file to PMIDs. We can test the coverage of the maximal set by checking whether it subsumes the minimal set. In fact, all of the articles provided by the expert were included in the maximal set.

Now, we can use this approach to retrieve relatively similar articles from the expert’s seed set, i.e. the minimal set. Retrieved articles that are not included in the maximal set are filtered out, which excludes similar articles that fall outside the maximal set’s scope. The expert can then set a threshold score to select the most related articles. The articles above the threshold, or ones chosen by the expert, can then be added to the seed set to perform the process again. This process can be repeated iteratively with the help of the expert until the results are satisfactory or until they converge. The choice of the threshold might partly depend on the required number of retrieved articles, especially in the final stages. This process is depicted in Figure 3.

In this article, we described four complementary search strategies to retrieve an accurate and covering corpus of PubMed records for the topic of rehabilitation. We use this approach to ensure that we have a covering and unbiased corpus. Of the approaches tried, the simple search and MeSH based strategies were too restrictive, and the expanded query too broad. To address these issues, we have developed a new measure for paper similarity which enables us to select papers similar to a group of papers. This approach enables us to threshold arbitrarily and define for ourselves the “Goldilocks” zone.
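The thresholding and iterative seed expansion can be sketched as a simple loop. This is an illustration under stated assumptions rather than the procedure as run with the expert: `fetch_similar` is a hypothetical callable returning the set of PMIDs that PubMed lists as similar to a given PMID, the expert's judgement is replaced by a fixed `threshold`, and `max_rounds` is an arbitrary safety limit.

```python
def relativity_score(similar, seed):
    """Overlap with the seed set, normalised by the larger of the two sets."""
    return len(similar & seed) / max(len(seed), len(similar))


def expand_seed(seed, maximal, fetch_similar, threshold, max_rounds=10):
    """Grow `seed` until no new article reaches `threshold`.

    Candidates are the similar articles of the current seed records;
    those outside the `maximal` set are discarded before scoring.
    """
    seed = set(seed)
    maximal = set(maximal)
    for _ in range(max_rounds):
        candidates = set()
        for pmid in seed:
            candidates |= fetch_similar(pmid)
        candidates = (candidates & maximal) - seed
        accepted = {
            c for c in candidates
            if relativity_score(fetch_similar(c), seed) >= threshold
        }
        if not accepted:  # converged: nothing new to add
            break
        seed |= accepted
    return seed


# Toy similarity data standing in for PubMed's similar articles (assumption).
TOY = {
    "1": {"2", "3", "4"},
    "2": {"1", "3", "4"},
    "3": {"1", "2", "9"},
    "4": {"1", "2"},
    "9": {"3"},
}

result = expand_seed(
    seed={"1", "2"},
    maximal={"1", "2", "3", "4"},
    fetch_similar=lambda pmid: TOY.get(pmid, set()),
    threshold=0.6,
)
```

In the toy data, record "9" is never added because it lies outside the maximal set, which mirrors the filtering step described in the text.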

The key advantage of this technique is that it requires relatively little from the domain expert, beyond a set of references to appropriate papers, something that most researchers will have through their normal bibliography management facilities. Operationally, this technique is also straight-forward as it works on PubMed similarity (although it generalises to any similarity measure), and can operate directly over PubMed’s normal search facilities. This avoids the necessity of performing bespoke analysis over the whole of PubMed locally.

We can envisage richer techniques that generalise PubMed’s current similar articles approach. However, until and unless these are directly supported by PubMed, they would require warehousing PubMed locally. For the next step, we plan to use this corpus to define a covering set of terms for the Rehabilitation Therapy Ontology, using inverse document frequency statistics that we have previously used to define the scope of a minimum information model (Lord et al., 2016).

We note that this approach is largely independent of domain. We do not require a suitable MeSH term, or a pre-existing set of keywords that can be used for querying. It raises the possibility of moving the initial knowledge capture stage of ontology development away from expert user groups and competency questions, toward an approach which is more data-driven, embedding ontology development in the explosion of interest in big data analytics that has characterised the last few years.

REFERENCES

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1), 25.

Buitelaar, P., Cimiano, P., and Magnini, B. (2005). Ontology learning from text: methods, evaluation and applications, volume 123. IOS press.

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.

Dijkers, M. P. (2014). Rehabilitation treatment taxonomy: establishing common ground. Archives of physical medicine and rehabilitation, 95(1), S1-S5.

Lord, P., Spiering, R., Aguillon, J. C., Anderson, A. E., Appel, S., Benitez-Ribas, D., ten Brinke, A., Broere, F., Cools, N., Cuturi, M. C., et al. (2016). Minimum information about tolerogenic antigen-presenting cells (mitap): a first step towards reproducibility and standardisation of cellular therapies. PeerJ, 4, e2300.

Sowa, J. F. (2000). Ontology, metadata, and semiotics. In International Conference on Conceptual Structures, pages 55-81. Springer.

U.S. National Library of Medicine (NLM) (2017). Pubmed tutorial - similar articles.

van Heugten, C., Wolters Gregório, G., and Wade, D. (2012). Evidence-based cognitive rehabilitation after acquired brain injury: a systematic review of content of treatment. Neuropsychological rehabilitation, 22(5), 653-673.

Warrender, J. D. (2015). The consistent representation of scientific knowledge: investigations into the ontology of karyotypes and mitochondria.

Warrender, J. D. and Lord, P. (2013). The karyotype ontology: a computational representation for human cytogenetic patterns. arXiv preprint arXiv:1305.3758.

Warrender, J. D. and Lord, P. (2015). Scaffolding the mitochondrial disease ontology from extant knowledge sources. arXiv preprint arXiv:1505.04114.

Whyte, J., Dijkers, M. P., Hart, T., Zanca, J. M., Packel, A., Ferraro, M., and Tsaousides, T. (2014). Development of a theory-driven rehabilitation treatment taxonomy: conceptual issues. Archives of physical medicine and rehabilitation, 95(1), S24-S32.