=Paper=
{{Paper
|id=None
|storemode=property
|title=Developing an Application Ontology for Mining Free Text Clinical Reports: The Extended Syndromic Surveillance Ontology
|pdfUrl=https://ceur-ws.org/Vol-744/paper10.pdf
|volume=Vol-744
}}
==Developing an Application Ontology for Mining Free Text Clinical Reports: The Extended Syndromic Surveillance Ontology==
Mike Conway¹, John Dowling², and Wendy Chapman¹
¹ University of California, San Diego, Division of Biomedical Informatics
La Jolla, California 92093, USA
http://dbmi.ucsd.edu
{mconway@ucsd.edu|wwchapman@ucsd.edu}
² University of Pittsburgh, Department of Biomedical Informatics
Pittsburgh, PA 15260, USA
http://www.dbmi.pitt.edu
dowling@pitt.edu
Abstract. In an increasingly globalised world, where infectious disease
outbreaks can rapidly circulate through the international transport system,
and the threat of bioterrorism is constant, there is a need to develop reusable
resources to support early-stage disease outbreak detection. This paper
presents the Extended Syndromic Surveillance Ontology (ESSO), an open source
terminological ontology designed to facilitate the mining of free-text clinical
documents in English to support timely disease outbreak surveillance. ESSO
consists of 279 clinical concepts (Fever, Slurred Speech, Diplopia, and so on)
across eight syndromes (respiratory syndrome, constitutional syndrome, and so
on) and is enriched with regular expressions to support concept identification
in text. The ontology is shown to have good coverage in the target domain.
Keywords: syndromic surveillance, biosurveillance, terminology, ontology,
natural language processing
1 Introduction & Motivation
Effective syndromic surveillance is essential if we are to detect and contain
infectious disease outbreaks at an early stage [1, 2]. The United States Centers
for Disease Control and Prevention (CDC) defines syndromic surveillance as
“surveillance using health-related data that precede diagnosis and signal a
sufficient probability of a case or outbreak to warrant further public health
response.”³ That is, the focus of syndromic surveillance is the identification
of disease outbreaks before the traditional public health apparatus of
confirmatory diagnostic testing and official diagnosis can be used. Data sources
for syndromic surveillance have included over-the-counter pharmacy sales [3],
school absenteeism records [4], calls to NHS Direct (a nurse-led information and
advice service in the United Kingdom) [5], and search engine queries [6].
³ www.webcitation.org/5pxhlyaxX
Grouping cases into syndromes (for example, respiratory syndrome) rather than
into specific diagnoses (for example, pneumonia) may provide earlier evidence
of infections of public health interest, because, in their early stages, many
diseases have overlapping symptoms that may not initially alarm physicians
[7, 8]. Typically, clinical interactions between health workers and patients
generate substantial amounts of textual data in the form of radiography reports,
Emergency Room⁴ reports, chief complaints and so on, which provide an obvious
source of pre-diagnostic information for syndromic surveillance. However,
developing methods and resources that allow public health experts to gain
maximum use from these data sources has been challenging.
This paper presents an application ontology, the Extended Syndromic
Surveillance Ontology (ESSO) [9], designed to support syndromic surveillance
from clinical text, building on previous work in this area, in particular the
Syndromic Surveillance Ontology [10]. The remainder of the paper consists of
four sections. First, we briefly review related work, before going on to
describe the ontology development process. We then set forth a short evaluation
section before concluding with an outline of future work.
2 Related Work
Our work has focussed on the representation of concepts (and their lexical
instantiations) as they occur in clinical text (in particular Emergency Room
reports). While the widely used biomedical taxonomies, for example, the Unified
Medical Language System⁵ (UMLS) and the Systematized Nomenclature of Medicine
Clinical Terms⁶ (SNOMED-CT), contain many of the syndromic surveillance related
terms found in clinical texts, these general resources do not have the specific
relations (and lexical information) relevant to syndromic surveillance from
clinical reports. Currently, there are at least four major terminological
resources available that focus on the public health domain: PHSkb, SSO, ILI-SSO,
and the BioCaster ontology.
The Public Health Surveillance knowledge base (PHSkb) [11] developed by the CDC
is a coding system for the communication of notifiable disease findings for
public health officials in the United States. PHSkb is not suitable as a
resource for syndromic surveillance as its focus is on diagnosed diseases rather
than pre-diagnostic surveillance. Additionally, PHSkb is no longer under active
development.
The Syndromic Surveillance Ontology (SSO) [10] was developed to provide a set of
common syndrome definitions for public health professionals in order to
facilitate data sharing. A working group of eighteen researchers, representing
ten syndromic surveillance systems in the United States, convened to develop
standard definitions for four syndromes of interest [12] (respiratory,
gastrointestinal, influenza-like-illness and constitutional) and constructed an
OWL⁷ ontology based on these definitions. While the SSO provides a useful
starting point, there are two main reasons why, on its own, it is insufficient
for clinical report processing. First, SSO is centred on chief complaints. Chief
complaints (or “presenting complaints” in British English) are phrases that
briefly describe a patient’s presenting condition on first contact with a
medical facility. They usually describe symptoms, refrain from diagnostic
speculation and employ frequent abbreviations and misspellings (for example
“vom + naus” for “vomiting and nausea”). Clinical texts, the focus of attention
in this paper, are full length documents that describe not only symptoms, but
patient history and diagnoses. Second, the number of syndromes in SSO is limited
to four, whereas comprehensive syndromic surveillance requires the
representation of further syndromes (for example, hemorrhagic syndrome and
neurological syndrome).
⁴ Also known as Casualty Departments or Accident & Emergency Departments
⁵ www.nlm.nih.gov/research/umls
⁶ www.ihtsdo.org/snomed-ct
The Influenza-Like-Illness Syndromic Ontology (ILI-SSO) [13] is an extension
of the SSO designed to supplement the limited consensus definitions found in the
SSO, with the goal of providing a general NLP-oriented terminological resource
for identifying Influenza-Like-Illness syndrome in clinical texts. The ILI-SSO is
subsumed by the current work.
The BioCaster application ontology was built to facilitate text mining of news
articles for disease outbreaks in several different Pacific Rim languages
(Japanese, Thai, Vietnamese, Simplified Chinese, and so on) in addition to
English [14]. It is used to power a real time, multi-lingual, publicly
accessible online biosurveillance text mining system⁸ that classifies news
stories of epidemiological interest and populates a Google Map with
geographically coded new cases. However, as the BioCaster system concentrates on
news reports, representing the concepts, relations and lexical instantiations
found in clinical reports is beyond the scope of the BioCaster ontology.
In addition to the application ontologies described above, the Infectious
Disease Ontology⁹ provides coverage of symptoms and diagnoses relevant to
syndromic surveillance.
3 Developing the Ontology
Work began with the construction of a term list by author JD (a board-certified
infectious disease physician with thirty years of experience in clinical
practice). The term identification process involved the domain expert reading
multiple clinical reports, searching through textbooks and utilising
professional knowledge. Terms were then consolidated into a list of concepts.
Next, the concept list was compared to the Syndromic Surveillance Ontology, and
concepts from the SSO were reused where available. ESSO consists of 279 concepts
(compared to 94 in SSO) spread across eight syndromes important to syndromic
surveillance (see Table 1 for a list of syndromes and example concepts).
⁷ The Web Ontology Language (OWL) is a World Wide Web Consortium standard for
representing ontologies: http://www.w3.org/TR/owl-ref/
⁸ http://born.nii.ac.jp
⁹ http://infectiousdiseaseontology.org
Table 1. ESSO Syndromes and Example Concepts
Syndrome               No. Concepts*   Example concepts
Rash                   33              Hives, Itching, Sores
Hemorrhagic            21              Hemoptysis, Melena, Epistaxis
Botulism               16              Botulism, BellsPalsy, SlurredSpeech
Neurological           52              Coma, Confusion, Headache
Constitutional         40              Fever†, Lethargy, Myalgia
InfluenzaLikeIllness   55              Fever†, Chill, Malaise
Respiratory‡           84              Plague, Rales, QFever
Gastrointestinal‡      30              AbdominalPain, Nausea, Rotavirus
* Number of concepts in each syndromic category
† Note that the SKOS data model allows “polyhierarchies” (for example, the
concept Fever has skos:broader syndrome InfluenzaLikeIllness and Constitutional)
‡ Respiratory and Gastrointestinal syndromes are subdivided into specific and
sensitive syndromes
The ontology is encoded in SKOS (Simple Knowledge Organisation System¹⁰, a World
Wide Web Consortium data standard for encoding thesauri and terminologies), with
the syndromic hierarchical backbone of the ontology represented using
skos:narrower and skos:broader (see Figure 1 for a screenshot of the Fever
concept within the Protégé editor). Note that the Extended SSO subsumes all the
concepts and relations present in the SSO, with all SSO concepts and relations
reorganised to conform with the SKOS standard.
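As a rough illustration of this hierarchical backbone, the sketch below stores skos:broader links in a plain Python dictionary and derives skos:narrower as their inverse. The concept and syndrome names are taken from Table 1; the dictionary layout and the function name `narrower_index` are assumptions made for this example and are not part of the ESSO distribution.

```python
from collections import defaultdict

# skos:broader links for a few concepts from Table 1 (illustrative subset).
# A concept may list several broader syndromes (a "polyhierarchy").
BROADER = {
    "Fever": ["InfluenzaLikeIllness", "Constitutional"],
    "Chill": ["InfluenzaLikeIllness"],
    "Malaise": ["InfluenzaLikeIllness"],
    "Lethargy": ["Constitutional"],
    "Myalgia": ["Constitutional"],
}


def narrower_index(broader):
    """Invert broader links: syndrome -> sorted list of narrower concepts."""
    index = defaultdict(list)
    for concept, parents in broader.items():
        for parent in parents:
            index[parent].append(concept)
    return {parent: sorted(children) for parent, children in index.items()}


NARROWER = narrower_index(BROADER)
print(NARROWER["Constitutional"])        # ['Fever', 'Lethargy', 'Myalgia']
print(NARROWER["InfluenzaLikeIllness"])  # ['Chill', 'Fever', 'Malaise']
```

Because Fever lists two broader syndromes, it appears under both after inversion, which is the polyhierarchy behaviour noted under Table 1.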
In addition to the standard thesaurus apparatus of preferred labels,
alternative labels and hidden labels provided by SKOS, in order to facilitate
“off the shelf” concept recognition, for each concept we include both regular
expressions and links to external vocabularies. Table 2 provides a description
of SKOS data relations for the concept Fever, while Figure 2 shows a simplified
graph representation of the same concept.
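To show how such regular expressions might be used for concept recognition, the following sketch applies the three English patterns attached to Fever in Figure 2 to short pieces of text. The `CONCEPT_PATTERNS` table, the `find_concepts` function, the case-insensitive matching, and the sample sentences are all assumptions made for illustration; the paper does not describe how ESSO's expressions are compiled or applied.

```python
import re

# Patterns for the Fever concept as shown in Figure 2; other ESSO concepts
# would carry their own patterns. Case-insensitive matching is an assumption.
CONCEPT_PATTERNS = {
    "Fever": [r"\bfever\b", r"\bfevers\b", r"\bfebrile\b"],
}

COMPILED = {
    concept: [re.compile(p, re.IGNORECASE) for p in patterns]
    for concept, patterns in CONCEPT_PATTERNS.items()
}


def find_concepts(text):
    """Return the sorted names of concepts with at least one matching pattern."""
    return sorted(
        concept
        for concept, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    )


print(find_concepts("Patient is febrile with a productive cough."))  # ['Fever']
print(find_concepts("No acute complaints today."))                   # []
```

A pattern match of this kind only signals that a concept is mentioned; it does not by itself distinguish "fever" from "no fever", so negation handling would need a separate step.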
The ontology is freely available under an open source licence.¹¹
4 Evaluation
In recent years, significant research effort has focussed on evaluation methods
for ontologies and terminologies [15, 16], yet no single “best practice”
approach to ontology evaluation has emerged. We have adopted a “triangulation”
strategy to audit the ESSO, concentrating on coverage (does the ontology contain
¹⁰ http://www.w3.org/2004/02/skos/
¹¹ http://code.google.com/p/ss-ontology/
Fig. 1. Example of Fever concept within the Protégé 4 Editor (SKOS-plugin)
[Figure 2: graph centred on esso:fever, showing the following relations]
esso:fever
  skos:broader       esso:influenzaLikeIllnessSyndrome, esso:constitutionalSyndrome
  skos:prefLabel     "fever"@en
  skos:altLabel      "fevers"@en, "feels hot"@en, "febrile"@en
  skos:notation      \bfever\b, \bfevers\b, \bfebrile\b (^^englishRegExp)
  skos:notation      "C23.888.119.344:Fever"^^meshPrefLabel
  skos:notation      "C0015967:Fever"^^umlsPrefLabel
  skos:notation      "780.60: Fever"^^icd9PrefLabel
  esso:hasDiagnosis  esso:epidemicTyphus, esso:influenza, esso:pleurisy,
                     esso:anthrax, esso:smallpox
  dc:source          "sso"
  dc:creator         "MC"
  dc:modified        "2011-03-31"
  dc:definition      "Elevated body temperature"
Fig. 2. Extended SSO Relations for the Concept Fever
Table 2. Selected Relations for the Extended SSO Concept Fever
Relation                         Example
skos:inSchemeᵃ                   Fever inScheme ExtendedSSO
skos:broaderᵇ                    Fever broader ConstitutionalSyndrome
skos:prefLabel                   Fever prefLabel “fever”
skos:altLabel                    Fever altLabel “febrile”
skos:notation^^umlsPrefLabelᶜ    Fever umlsPrefLabel “C0015967”
skos:notation^^meshPrefLabel     Fever meshPrefLabel “C23.888.119.344”
skos:notation^^englishRegExp     Fever englishRegExp “\bfever\b”
esso:hasDiagnosis                Fever hasDiagnosis ChickenPox
esso:dataCategoryᵈ               Fever dataCategory “sign”
dc:creatorᵉ                      Fever creator “MC”
dc:source                        Fever source “sso”
dc:created                       Fever created “2011-03-31”
dc:modified                      Fever modified “2011-03-31”
dc:definition                    Fever definition “Elevated body temperature”
ᵃ The skos:inScheme relation places a SKOS concept in a named Knowledge
Organisation System
ᵇ skos:broader is read as “has broader category”
ᶜ skos:notation provides a mechanism for creating links to external vocabularies
ᵈ Clinical concept types are: diagnosis, syndrome, sign, chest radiography, and
bioterrorism disease
ᵉ “dc” (Dublin Core) is a widely used metadata standard that can be used to
augment SKOS with editorial information
the concepts we need for syndromic surveillance?), relation quality (are the
relations in the ontology correct?) and classification accuracy (how well do the
terms and regular expressions in ESSO perform at classifying clinical texts?).
Currently, we have completed a preliminary evaluation of ESSO’s coverage of the
target domain using a technique derived from terminology extraction and corpus
linguistics [17]. First, we extracted terms from 300 Emergency Room reports¹²
using the TerMine¹³ term extraction tool [18]. We then examined the twenty most
statistically significant terms generated by TerMine (after filtering out terms
not relevant to the infectious disease domain) and found that only two of them,
“acute distress” and “apparent distress”, were not represented in ESSO. In other
words, 18 of the 20 terms (90%) were covered, which suggests that our domain
coverage is adequate for this sample. Examples of significant terms extracted by
TerMine which are contained in ESSO include “chest pain”, “sore throat”, “night
sweat”, and “vaginal bleeding”.
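The coverage check described above amounts to a set comparison between extracted terms and ontology labels. The sketch below is a minimal version of that idea, assuming exact, case-insensitive string matching; the `coverage` function is hypothetical, and the two lists contain only the six terms named in the text, not the full top-twenty list or the real ESSO label set.

```python
def coverage(extracted_terms, ontology_terms):
    """Return (fraction covered, extracted terms missing from the ontology)."""
    if not extracted_terms:
        raise ValueError("extracted_terms must not be empty")
    known = {t.lower() for t in ontology_terms}
    missing = [t for t in extracted_terms if t.lower() not in known]
    covered = len(extracted_terms) - len(missing)
    return covered / len(extracted_terms), missing


# Only the six terms named in the text; the full top-twenty list and the
# ESSO label set are not reproduced here.
extracted = ["chest pain", "sore throat", "night sweat", "vaginal bleeding",
             "acute distress", "apparent distress"]
esso_labels = ["chest pain", "sore throat", "night sweat", "vaginal bleeding"]

fraction, missing = coverage(extracted, esso_labels)
print(missing)   # ['acute distress', 'apparent distress']
print(18 / 20)   # 0.9, the proportion implied by two of twenty terms missing
```

The fraction computed for this six-term subset is not a result from the evaluation; only the two missing terms and the two-in-twenty figure come from the text.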
¹² Deidentified Emergency Room reports were sourced from the University of
Pittsburgh Medical Center.
¹³ TerMine uses a combination of linguistic and statistical techniques to
identify all terms in a document set, and then ranks these extracted terms
according to their “termness”. A web accessible version of the tool is hosted
at: http://www.nactem.ac.uk/software/termine/
5 Conclusion
In conclusion, we have presented the Extended Syndromic Surveillance Ontology,
an open source terminological resource designed to facilitate English language
clinical text mining for syndromic surveillance. Our next task is to extend our
preliminary evaluation to assessing relation quality and classification accuracy,
with the medium term goal of using the ESSO as a gold standard against which
we can evaluate new synonym extraction algorithms.
References
1. Henning, K.: What is Syndromic Surveillance? MMWR Morb Mortal Wkly Rep
53 Suppl, 5–11 (2004)
2. Wagner, M., Gresham, L., Dato, V.: Case Detection, Outbreak Detection, and
Outbreak Characterization. In: Wagner, M., Moore, A., Aryel, R. (eds.) Handbook
of Biosurveillance, pp. 27–50. Elsevier Academic Press (2006)
3. Tsui, F., Espino, J., Dato, V., Gesteland, P., Hutman, J., Wagner, M.: Technical
Description of RODS: A Real-time Public Health Surveillance System. J Am Med
Inform Assoc 10(5), 399–408 (2003)
4. Lombardo, J., Burkom, H., Elbert, E., Magruder, S., Lewis, S.H., Loschen, W., Sari,
J., Sniegoski, C., Wojcik, R., Pavlin, J.: A Systems Overview of the Electronic
Surveillance System for the Early Notification of Community-Based Epidemics
(ESSENCE II). J Urban Health 80(2 Suppl 1), 32–42 (2003)
5. Cooper, D.: Case Study: Use of Tele-health Data for Syndromic Surveillance in
England and Wales. In: Lombardo, J., Buckeridge, D. (eds.) Disease Surveillance:
A Public Health Informatics Approach, pp. 335–365. Wiley, New York (2007)
6. Eysenbach, G.: Infodemiology: Tracking Flu-Related Searches on the Web for
Syndromic Surveillance. In: American Medical Informatics Association Annual
Symposium Proceedings (AMIA 2006). pp. 244–248 (2006)
7. Centers for Disease Control: Recognition of Illness Associated with the Intentional
Release of a Biologic Agent. MMWR Morb Mortal Wkly Rep 50(41), 893–7 (2001)
8. Kuehnert, M.J., Doyle, T.J., Hill, H.A., Bridges, C.B., Jernigan, J.A., Dull, P.M.,
Reissman, D.B., Ashford, D.A., Jernigan, D.B.: Clinical Features that Discriminate
Inhalation Anthrax from Other Acute Respiratory Illnesses. Clin Infect Dis 36(3),
328–36 (2003)
9. Conway, M., Dowling, J., Tsui, R., Chapman, W.: Developing an Application
Ontology for Mining Clinical Reports: The Extended Syndromic Surveillance
Ontology. In: International Society for Disease Surveillance. Abstract (2010)
10. Okhmatovskaia, A., Chapman, W., Collier, N., Espino, J., Buckeridge, D.: SSO:
The Syndromic Surveillance Ontology. In: Proceedings of the International Society
for Disease Surveillance (2009)
11. Doyle, T., Ma, H., Groseclose, S., Hopkins, R.: PHSkb: A Knowledgebase to
Support Notifiable Disease Surveillance. BMC Med Inform Decis Mak 5, 27 (2005)
12. Chapman, W., Dowling, J., Baer, A., Buckeridge, D., Cochrane, D., Conway, M.,
Elkin, P., Espino, J., Gunn, J., Hales, C., Hutwagner, L., Keller, M., Larson, C.,
Noe, R., Okhmatovskaia, A., Olson, K., Paladini, M., Scholer, M., Sniegoski, C.,
Thompson, D., Lober, B.: Developing Syndrome Definitions Based on Consensus
and Current Use. J Am Med Inform Assoc 17, 595–601 (2010)
13. Conway, M., Dowling, J., Chapman, W.: Developing a Biosurveillance
Application Ontology for Influenza-Like-Illness. In: Proceedings of the 6th
Workshop on Ontologies and Lexical Resources. pp. 58–66. Coling 2010 Organizing
Committee, Beijing, China (2010)
14. Collier, N., Matsuda Goodwin, R., McCrae, J., Doan, S., Kawazoe, A., Conway, M.,
Kawtrakul, A., Takeuchi, K., Dien, D.: An Ontology-Driven System for Detecting
Global Health Events. In: Proceedings of the 23rd International Conference on
Computational Linguistics (Coling 2010). pp. 215–222. Coling 2010 Organizing
Committee, Beijing, China (2010)
15. Zhu, X., Fan, J.W., Baorto, D., Weng, C., Cimino, J.: A Review of Auditing
Methods Applied to the Content of Controlled Biomedical Terminologies. Journal
of Biomedical Informatics 42(3), 413–425 (2009)
16. Brank, J., Grobelnik, M., Mladenić, D.: A Survey of Ontology Evaluation
Techniques. In: Proceedings of the Conference on Data Mining and Data Warehouses
(SiKDD 2005). pp. 166–170 (2005)
17. Grigonyte, G., Brochhausen, M., Martin, L., Tsiknakis, M., Haller, J.: Evaluating
Ontologies with NLP-Based Terminologies - A Case Study on ACGT and its Master
Ontology. In: Formal Ontology in Information Systems: Proceedings of the Sixth
International Conference (FOIS 2010). pp. 331–344 (2010)
18. Frantzi, K., Ananiadou, S., Mima, H.: Automatic Recognition for Multi-word
Terms. International Journal of Digital Libraries 3(2), 117–132 (2000)