<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Re ning Health Outcomes of Interest using Formal Concept Analysis and Semantic Query Expansion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olivier Cure</string-name>
          <email>ocure@univ-mlv.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henri Maurer</string-name>
          <email>henri.maurer@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nigma H. Shah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paea Le Pendu</string-name>
          <email>plependug@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>Stanford, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universite Paris-Est, LIGM - UMR CNRS 8049</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Edinburgh</institution>
          ,
          <addr-line>Scotland</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Clinicians and researchers using Electronic Health Records (EHRs) often search for, extract, and analyze groups of patients by de ning a health outcome of interest, which may include a set of diseases, conditions, signs, or symptoms. In our work on pharmacovigilance using clinical notes, for example, we use a method that operates over many (potentially hundreds) of ontologies at once, expands the input query, and increases the search space over clinical text as well as structured data. This method requires specifying an initial set of seed concepts, based on concept unique identi ers from the UMLS Metathesaurus. In some cases, such as for progressive multifocal leukoencephalopathy, the seed query is easy to specify, but in other cases this task can be more subtle, such as for chronic obstructive pulmonary disease. We have developed a method using formal concept analysis and semantic query expansion that assists the user in de ning their seed query and in re ning the expanded search space that it encompasses. These methods can be used for text-mining clinical notes from EHRs (this paper) and cohort building tools.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In applications that use Electronic Health Records (EHRs), such as in drug safety
surveillance or observational studies, groups of patients are selected, extracted,
compared, and analyzed based on de nitions of certain health outcomes of
interest (HOIs) [
        <xref ref-type="bibr" rid="ref1 ref5">1,5</xref>
        ]. Common examples of HOIs include myocardial infarction (MI),
juvenile idiopathic arthritis, acute renal failure, chronic obstructive pulmonary
disease (COPD), or peripheral artery disease. While the entry points for a search
may at times be obvious, their full de nitions are anything but easy to obtain.
In work that also incorporates clinical text as data inputs, such as discharge
summaries or nurses notes, these de nitions should also capture variations of
terms that are likely to appear in written notes so that the text-mining process
has good recall [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, there are at least 47 distinct ways that we
have seen so far for saying `myocardial infarction' that appears with frequency
in real clinical notes. The challenge in de ning an HOI arises because medical
and health terminologies are numerous and complex. There are currently over
160 terminologies in the Uni ed Medical Language System (UMLS)
Metathesaurus with over 2-million distinct Concept Unique Identi ers (CUIs) and over
6-million distinct strings related to health and medicine. Some of these, like the
International Classi cation of Diseases, Ninth Revision, Clinical Modi cation
(ICD-9-CM), are well-known and often used to de ne the selection criteria for
health outcomes of interest. We have found that most clinicians and researchers
will say that choosing from lists of ICD-9 or SNOMED CT codes, is not a good
way to perform this task. The problem is that the organization and presentation
of concepts in any one of these terminologies, let alone 160 of them, might be
intuitive for one person yet completely obtuse to another.
      </p>
      <p>
        This work addresses these challenges by enabling the user to specify their
HOI using plain terms, much like an ordinary search query. We use Semantic
Query Expansion (SQE) and Formal Concept Analysis (FCA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to induce a
lattice of concepts over hundreds of ontologies and terminologies at once and to
nd the best-matching concepts. Brie y, a lattice is a partially ordered set where
every two elements have a least and greatest upper bounds. FCA is a machine
learning technique that uses a lattice to reveal associations between elements
of some pre-de ned structure { in this case, a set of ontologies with hierarchies
and mappings. SQE leverages the hierarchies and mappings to expand concepts
to include all subsumed ones. The system aims for broad coverage initially and
asks for feedback on the furthest matches at the highest points of the lattice so
that the search space can be rapidly re ned with minimal input from the user.
At the same time the user is issuing queries and con rming or denying concept
matches, we display search results that highlight snippets from clinical notes
matching their HOI, which provides instant feedback for the user as they re ne
their HOI. We have implemented these tools as REST service WEB APIs. Used
together, FCA and SQE make it possible to order concepts semantically and
search for matches that best re ect the intension of the user, i:e:, the medical
concepts she wants to express through a set of proposed terms. Intuitively, the
system aims initially for greater coverage and relies on user feedback to improve
relevance. The user's search query and feedback essentially helps to pinpoint
and prune sections of the lattice until only those concepts that t their intension
remain. The amount of feedback can be minimized by measuring coverage of
concepts at each major branch.
      </p>
      <p>The paper is organized as follows. In the next section, background and
methods are presented. In Section 3, implementation details are provided. Section 4
presents preliminary results conducted on the i2b2 Obesity NLP reference set.
A discussion and some future works are provided in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Methods</title>
      <sec id="sec-2-1">
        <title>A previous method for myocardial infarction HOI</title>
        <p>
          In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a query-driven approach called \2-hop" was used to create a set of HOIs
for detecting drug-adverse event associations and adverse events associated with
drug-drug interactions. The algorithm takes a set of concepts C and derives all
subconcepts C0 in each ontology O available in an ontology repository. This
process is repeated one more time for all derived concepts C0 to obtain another
set of concepts denoted C00. Because concepts are mapped across ontologies, the
process traverses simultaneously all ontologies that contain C (and C0), thereby
\hopping" across ontologies twice. With this approach, C00 can capture more
concepts from the adjacent ontologies that would have otherwise been missed
with a single iteration. In principle, recursion with a least xed-point semantics
would apply; however, recursion does not work well in practice because of
differing abstraction levels among ontologies, which induce cycles. We have found
that two hops achieve an adequate balance between soundness and completeness
for the current use case.
        </p>
        <p>This method was able to recognize events and exposures with enough
accuracy for the drug safety use case. This accuracy was determined using a
goldstandard corpus from the 2008 i2b2 Obesity Challenge. This corpus has been
manually annotated by two annotators for 16 conditions and was designed to
evaluate the ability of NLP systems to identify a condition present for a patient
given a textual note. Moreover, this corpus was extended by manually
annotating each of the events. Using the set of terms corresponding to the de nition of
the event of interest and the set of terms recognized by our annotation work ow
in the i2b2 notes, we evaluate the sensitivity and speci city of identifying each
of the events.</p>
        <p>Although e cient, this method requires a signi cant amount of manual e ort,
i:e:, some medical experts have to provide concept identi ers, implying a deep
knowledge of some ontologies. The goal of this work is to increase automation
by only requiring from the medical experts to provide terms associated to their
serach. Our goal is also to retain/improve previous results.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Semantic Query Expansion Theory</title>
        <p>One of the major assets of ontologies is the set of hierarchical relationships they
often include. For every concept, a set of parent or super-concepts will usually
induce a directed acyclic graph (DAG) structure for most biomedical ontologies.
We can use standard graph traversal algorithms to compute the transitive closure
and store the set of all ancestors and descendants of every concept.4 SQE is the
process of taking a set of concepts as a query and utilizing the transitive closure
to expand that set to include all descendants or ancestors depending on whether
the goal is to generalize or specialize the query.
4 We should note that ontologies in general can be more structurally complex than a
DAG, in which case inference engines should be used to compute the subsumptions
hierarchy in place of graph traversal algorithms.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Formal Concept Analysis Theory</title>
        <p>FCA is the process of abstracting conceptual descriptions from a set of objects
described by some attributes and nds practical application in elds such as data
and text mining, machine learning, and knowledge management. Given a set of
ontologies, we represent all objects and their attributes { including hierarchical
relations { as a binary matrix. Using the standard machinery of FCA, a concept
lattice can be generated from this matrix. Because every lattice is a partial
order, FCA will group similar objects according to their attributes in an ordered
manner. We utilize these properties to identify improbable relations that are not
explicitly stated in the ontology as a means of ranking questions to the user that
will cover the greatest bene t in narrowing the SQE search space.</p>
        <p>More formally, FCA is based on the notion of a formal context, which is
a triple K = (G; M; I), where G is a set of objects, M is a set of attributes
and I is a binary relation between G and M, i.e., I G M . For an object g
and an attribute m, (g; m) 2 I is read as \object g has attribute m." Given a
formal context, we can de ne the notion of formal concepts, where, for A G,
we de ne A0 = fm 2 M j8g 2 A : (g; m) 2 Ig and for B M , we de ne
B0 = fg 2 Gj8m 2 B : (g; m) 2 Ig. A formal concept of K is de ned as a pair
(A; B) with A G, B M , A0 = B and B0 = A. The hierarchy of formal
concepts is formalized by (A1; B1) (A2; B2) () A1 A2 and B2 B1. The
concept lattice of K is the set of all its formal concepts with the partial order .</p>
        <p>
          This hierarchy of formal concepts obeys the mathematical axioms de ning a
lattice, and is called a concept lattice since the relation between the sets of objects
and attributes is a Galois connection. A Galois connection plays an important
role in lattice theory, universal algebras, and recently in computer science [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
Let (P; ) and (Q; ) be two partially ordered sets (poset). A Galois connection
between P and Q is a pair of mappings ( ; ) such that : P ! Q, : Q ! P
and: (i) x x0 implies (x) (x0), (ii) y y0 implies (y) (y0) and (iii)
x ( (x)) and y ( (y)), for x,x' 2 P and y,y' 2 Q.
3
3.1
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation</title>
      <sec id="sec-3-1">
        <title>Expansion using SQE</title>
        <p>The user inputs a free-text query representing their HOI, such as 'pituitary
cancer'. The query is tokenized and matched against all synonyms of every concept
from our set of ontologies. For UMLS, this results in a set of matching CUIs. This
step is purely lexical and does not use any semantics. We utilize the structure
of the ontologies to expand the search and identify matches such as 'neoplasm'
or 'tumor' that are closely related to the search. We perform this expansion by
navigating to all super-concepts in each ontology incrementally until we achieve
a minimal cover of lexical matches. Broadening the query using SQE in this
way will identify many closely related terms (i.e., increase coverage), but it may
also introduce many unrelated ones (i.e., decrease relevance). Thus, we have to
prune the expanded set aggressively to improve relevance, which we facilitate
using FCA (described in more detail below).</p>
        <p>In Figure 1, for example, the m concepts at the bottom represent the initial
lexical matches, and the p concepts are the set of super-concepts that cover them.
Based on the `poor' coverage of d1 (perhaps because c1 and c2 seem too unrelated
to m1 and m2) the user may be asked whether d1 is a t, and if not, it is pruned
(which automatically eliminates c1 and c2 and everything else subsumed by d1).
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Pruning using FCA</title>
        <p>One advantage of the FCA lattice is a compact representation of the hierarchy
that enables us to e ciently nd the highest cut among concepts in our ontologies
that provide maximal coverage of terms identi ed via lexical matching. It also
enables us to identify concepts worth pruning in the following manner:</p>
        <p>Let Fi = hOi; Aii be a formal concept traversed with Oi and Ai respectively
the set of objects and attributes. For each Fi traversed during our top-down
navigation of the lattice, we create the two following lists: one denoted LiA,
corresponding to the transitive closure of the sub-concepts for each element of
Ai and another one denoted LO, corresponding to the transitive closure of the
sub-concepts of all Oi. We compute the intersection of LiO with each LiA and we
use the ratio jLiA\LiOj=jLiAj. If this value is below a prede ned threshold (denoted
pruning threshold), e.g., 75%, then we consider that the considered Ai concept is
not relevant to the search, i.e., it has too many sub-concepts not corresponding
to sub-concepts of the matched concepts. Otherwise it is relevant and we store
it in a candidate list which will be proposed later on to the physician.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Example (hypercholesterolemia): Using hypercholesterolemia as an</title>
        <p>
          example search query and a set of 18 ontologies, we identify 20 objects and 102
attributes initially, as in Figure 1. This induces a lattice of 67 formal concepts.
Its top most formal concept contains all 67 objects with an empty attribute
set. The lattice's second level has 2 formal concepts, one with 23 objects (and
one attribute) and another one with 17 objects (and one attribute). Both of
these concepts have too many sub-concepts not corresponding to the set of
subconcepts of our 20 original objects, hence they are pruned. At the fourth level of
the lattice, we discover a rst potential concept contained in a formal concept
containing 9 objects and 9 attributes one of which has a 75% ratio, i.e., satisfying
our pruning threshold. It has 16 sub-concepts out of which only 4 are not covered
by the sub-concepts of the 9 objects sub-concepts. Some of these 4 concepts could
be unrelated, so we drill down futher, identify the speci c area of the lattice with
the smallest ratio, and ask the user whether this concept is a t, if not, we prune
the lattice above and work at this lower level instead until the user is satis ed. In
the example, the labels of the 4 non-covered concepts are: hypercholesterolemia,
cholesterolosis, secondary hypercholesterolemia and hyperlipidemia.
In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the online Supplementary Data S3 reports that the \2-hop" method
identi es HOIs with a sensitivity of 74% and a speci city of 96% for the i2b2 Obesity
NLP reference set, henceforth i2b2 obesity. Our evaluation is performed on this
same dataset in order to enable a comparison of our method. As such, we
consider the set of terms associated to the 16 di erent conditions surveyed in i2b2
Obesity and retrieve from our database the corresponding concept identi ers.
Given this setting, we are able to compute the sensitivity and speci city of our
approach. This is presented in Table 1. The experiments have been conducted on
commodity hardware running a Java implementation using a MySQL database
instance, we thus consider that there rooms for performance gains.
        </p>
        <p>The analysis of our method results is given in 3 dimensions: sensitivity,
speci city and processing duration (not including end-user interactions). The
obtained statistical measures are encouraging with averages of sensitivity and
speci city of respectively 77.3 and 99.1%, hence improving on the results of the
less automatized \2-hop" solution. In the rst hand, these values have to be
considered in the context of a fast processing of potential terms, i.e., duration
in seconds range from 0.6 to 67.3. The 2 order of magnitude between the
slowest and fastest executions are explained by the size of the matching concepts
retrieved from the query patterns, the size of their subsumption relationship
transitive closure and the structure of the associated FCA lattice. The slowest,
i.e. Diabetes mellitus, involves the computation of matrix of more than 330
objects and 7000 attributes resulting in an FCA lattice of more than 3000 formal
concepts out of which most candidates concepts are pruned. In the second hand,
the matching and candidate concepts are proposed to the physician for
acceptance and rejection. Hence, through an intuitive and user-friendly interface, she
is able to easily improve the sensitivity measure. An evaluation on the e ciency
and interactivity of the Web interface has yet to be conducted on with physicians
on real case scenarios.</p>
        <p>Finally, we consider for some of the false negative concepts discovered by our
method may end up being positive propositions. Moreover, these propositions
come from both the matching (e.g., "Sitosterolemia for hypercholesterolemia"
for hypercholesterolemia) and potential (e.g., "h/o: raised blood, familial
hyperlipoproteinemia", "fh: raised blood lipids" for hypercholesterolemia while the
gold standard contains concepts such as "hyperlipoproteinemia type ii") concepts
which con rms the relevance of using a semantic approach. Note that among our
true positive, depending on the use case, a signi cant number of items have been
retrieved from the potential concept set, i.e., using our FCA statistical approach.
5</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have presented a novel, semi-automatic solution for de ning health outcomes
of interest. Our work is inspired by previous, manually-intensive work done for
the purpose of text-mining clinical notes from EHRs. Our approach is composed
of a cooperation between Semantic Query Expansion, to leverage the hierarchical
structure of ontologies, and Formal Concept Analysis, to organize, reason, and
prune discovered concepts. We implemented a RESTful API and a graphical
Web-based interface to illustrate the process that users would follow to browse
data and re ne their HOI query rapidly. A preliminary evaluation of this work
emphasizes positive results for sensitivity and speci city measures. The most
promising aspect of this approach is the discovery of positive results not present
our i2b2 Obesity NLP reference set. Thus, this approach provides better recall
with high precision of the obtained results.</p>
      <p>Some of our future goals include (i) conducting user driven evaluation of the
Web interface, (ii) analyzing the acceptance/rejection of physicians in several
practical scenarios, (iii) using active learning over past query re nements to
improve future queries, and (iv) studying our method impact's on mining EHRs
clinical notes or cohort building tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Carnahan</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Moores</surname>
          </string-name>
          .
          <article-title>Mini-sentinel systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned</article-title>
          .
          <source>Pharmacoepidemiol Drug Saf</source>
          ,
          <volume>21</volume>
          <issue>Suppl 1</issue>
          :
          <issue>82</issue>
          {
          <fpage>9</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Davey</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Priestley</surname>
          </string-name>
          .
          <article-title>Introduction to Lattices and Order (2</article-title>
          . ed.). Cambridge University Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Formal concept analysis - mathematical foundations</source>
          . Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Lependu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bauer-Mehren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harpaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Mortensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Podchiyska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Ferris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Pharmacovigilance using clinical notes</article-title>
          .
          <source>Clin Pharmacol Ther</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Stang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Racoosin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Overhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartzema</surname>
          </string-name>
          , C. Reich, E. Welebob,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scarnecchia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Woodcock</surname>
          </string-name>
          .
          <article-title>Advancing the science for active surveillance: rationale and design for the observational medical outcomes partnership</article-title>
          .
          <source>Ann Intern Med</source>
          ,
          <volume>153</volume>
          (
          <issue>9</issue>
          ):
          <volume>600</volume>
          {
          <fpage>6</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>