<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting process graphs from medical text data: An approach towards a systematic framework to extract and mine medical sequential process descriptions from large text sources</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Leipzig, Department of Computer Science, Natural Language Group</institution>
          ,
          <addr-line>Augustusplatz 10, 04109 Leipzig</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, a natural language processing workflow is developed to extract sequential activities from large collections of medical text documents. The approach uses graph structures to process, link and assess the activities found in the documents.</p>
        <p>Contact: aniekler@informatik.uni-leipzig.de</p>
        <p>Keywords: relation extraction, natural language processing, graph processing, process models</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Medical publications, surgical procedure reports and medical records
typically contain procedural descriptions. For example, all activities
included in a medical study must be documented for reproducibility,
surgical reports give a stepwise description of the procedures
performed, and medical records list a history of medical
treatment. Additionally, related studies or reports describe
similar activities with some alterations, or rely on preceding activities
that may be described in other documents. Consider, for example,
the preparation steps required before DNA can be sequenced, which are
described in scientific papers. They are often the same but need to
be documented for each study. Such redundant activity descriptions
can be found in many documents describing research within
the same domain or field of research. Nevertheless, differences
between the activities in related documents also exist. A complete
overview of the activities in a defined document collection
provides easy insight into the workflows and paradigms of a domain
or field of study. Thus, finding the links between documents and
aligning their redundant activities helps to structure
and analyze procedural knowledge from topically related or domain-related
medical texts. For example, early stages or parts of a larger process
might be documented separately from other parts or later stages.</p>
      <p>In this work, a general natural language processing workflow is
developed to extract sequential activities from large collections of medical text
documents. A graph-based data structure is
introduced to merge extracted sequences that contain similar activities in
order to build a global graph of the procedures described in
documents on similar topics or tasks.</p>
      <p>In the following, we describe a methodology that extracts and links
activities from medical text documents. The described system
follows a sequence of steps to create an activity graph
as its result. First, the text sources have to be processed in order to
access the entities in the text. Different entities in a sentence
are related and together express an activity. Therefore, the extraction
of valid relations that form activities is part of the text
processing step. The second step of the proposed methodology is the
creation of a directed graph structure that represents
the activities contained in a text collection.</p>
      <sec id="sec-1-1">
        <title>Text processing and classification for activity extraction</title>
        <p>The text sources must first be separated into sentences and tokens
using state-of-the-art tools. Additionally, part-of-speech (POS) tagging must
be applied to the text sources. To extract the procedural
knowledge from the texts, named entity recognition (NER) is required
as a pre-processing step. In the separated and preprocessed
sentences, multiple entities may form an activity. Consider the
POS-tagged sentence
“Real-time JJ PCR NNP was VBD done VBN using VBG the DT
fluorescent-labelled JJ oligonucleotide NN probes NNS”.
“Real-time PCR” and “fluorescent-labelled oligonucleotide probes” are the
identified entities, which together form the activity “done”. This activity can
be part of a chain of activities that spans multiple documents.</p>
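        <p>The tagged example sentence can be read back into (token, tag) pairs with a few lines of Python. The sketch below is only an illustration: it assumes that tokens and Penn Treebank tags strictly alternate, and it uses a simple heuristic (maximal runs of adjective and noun tags) as a stand-in for the NER step, which the text does not specify. The function names <monospace>parse_tagged</monospace> and <monospace>entity_candidates</monospace> are hypothetical.</p>
        <preformat>
```python
from typing import List, Tuple

# Example sentence from the text; tokens and POS tags are assumed to alternate.
TAGGED = ("Real-time JJ PCR NNP was VBD done VBN using VBG the DT "
          "fluorescent-labelled JJ oligonucleotide NN probes NNS")


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split an alternating 'word TAG word TAG ...' string into (word, tag) pairs."""
    parts = text.split()
    return list(zip(parts[0::2], parts[1::2]))


def entity_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group maximal runs of adjective/noun tags (JJ*, NN*) into candidate entities.

    This is a heuristic stand-in for a real NER component, not the method
    used in the paper.
    """
    candidates, current = [], []
    for word, tag in tagged:
        if tag.startswith(("JJ", "NN")):
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


tokens = parse_tagged(TAGGED)
entities = entity_candidates(tokens)
verbs = [word for word, tag in tokens if tag.startswith("VB")]
print(entities)
print(verbs)
```
        </preformat>
        <p>For the example sentence, the heuristic yields the two entities named above, while the verb list contains the candidate activity word “done” alongside “was” and “using”.</p>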
        <p>The characteristics of activities, or of relations between entities,
differ between domains and described procedures. Thus, the
process for identifying entities and connecting them to activities within
sentences should not be fixed or static. To account for this, the
identification of relations or activities is defined as a classification task
using a Support Vector Machine (SVM) with word-level and
POS-tag-level features (GuoDong et al. [2005]). If a sentence contains the
entities E1 and E2, the two words before E1, the two words after E2
and all words between E1 and E2 are extracted as features.
Furthermore, the POS tags of the extracted words are used as features for
the SVM.</p>
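        <p>The feature window described above can be sketched as follows. The representation is an assumption made for illustration: entities are given as half-open (start, end) token spans with E1 preceding E2, and features are encoded as a sparse dictionary of binary indicators. The name <monospace>relation_features</monospace> and the feature-key format are hypothetical.</p>
        <preformat>
```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)
Span = Tuple[int, int]   # half-open token span (start, end); an assumed representation


def relation_features(tokens: List[Token], e1: Span, e2: Span,
                      window: int = 2) -> Dict[str, int]:
    """Collect word and POS-tag features around an entity pair (E1 before E2).

    Uses up to `window` tokens before E1, up to `window` tokens after E2,
    and all tokens between E1 and E2.
    """
    contexts = {
        "before": tokens[max(0, e1[0] - window):e1[0]],
        "between": tokens[e1[1]:e2[0]],
        "after": tokens[e2[1]:e2[1] + window],
    }
    features: Dict[str, int] = {}
    for name, span in contexts.items():
        for word, tag in span:
            features[f"{name}_word={word.lower()}"] = 1
            features[f"{name}_pos={tag}"] = 1
    return features


tokens = [("Real-time", "JJ"), ("PCR", "NNP"), ("was", "VBD"), ("done", "VBN"),
          ("using", "VBG"), ("the", "DT"), ("fluorescent-labelled", "JJ"),
          ("oligonucleotide", "NN"), ("probes", "NNS")]

# E1 = "Real-time PCR" (tokens 0-1), E2 = the probes phrase (tokens 6-8).
features = relation_features(tokens, e1=(0, 2), e2=(6, 9))
print(sorted(features))
```
        </preformat>
        <p>In the example sentence, E1 starts the sentence and E2 ends it, so only the “between” context contributes features. A dictionary of this kind could then be vectorised, for instance with scikit-learn's <monospace>DictVectorizer</monospace>, before an SVM is trained; the text does not state which toolkit was used.</p>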
        <p>Before training, the user must define the
type and form of the desired relation. On the basis of this
definition, training examples are collected from the data. For this purpose,
an active learning procedure is introduced in which the user iteratively
collects training data with the support of an automatic classifier.
First, an initial search is conducted for sentences that include a minimum number of entities
and verbs indicating an activity. The set of
sentences matching this custom pattern is presented to the
user. The user selects correct entities from the proposed sentences
and states whether or not a relation holds between them.
The features are extracted automatically, and the set of positive and
negative examples is used to train an initial SVM model. The trained
model is then used to identify additional examples in the data. The user
judges those examples, and with every batch of new examples
the classifier is refined. Once the training quality of the SVM no
longer changes with new examples, a final model is trained and applied
to all documents. The result is a set of sentences from the document
collection in which each sentence contains an activity or valid relation
between entities.</p>
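        <p>The control flow of this active learning procedure can be outlined as a loop that stops once the quality measure no longer changes. The sketch below is not the authors' implementation: the training routine, the quality measure, the example proposal strategy and the user interaction are all passed in as callables, and their names, the tolerance and the batch size are assumptions. The toy run at the end uses trivial stand-ins instead of an SVM.</p>
        <preformat>
```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, bool]  # (sentence, user judgement: relation present?)


def active_learning(
    seed: List[Example],
    pool: Sequence[str],
    train: Callable[[List[Example]], object],
    quality: Callable[[object], float],
    propose: Callable[[object, Sequence[str], int], List[str]],
    ask_user: Callable[[str], bool],
    batch_size: int = 10,
    tolerance: float = 0.01,
    max_rounds: int = 20,
):
    """Refine a classifier batch by batch until its quality stops changing."""
    labeled = list(seed)
    remaining = list(pool)
    model = train(labeled)          # initial model from the seed examples
    score = quality(model)
    for _ in range(max_rounds):
        batch = propose(model, remaining, batch_size)
        if not batch:               # nothing left to show to the user
            break
        labeled.extend((sentence, ask_user(sentence)) for sentence in batch)
        remaining = [s for s in remaining if s not in batch]
        model = train(labeled)      # retrain on all judged examples
        new_score = quality(model)
        converged = tolerance >= abs(new_score - score)
        score = new_score
        if converged:
            break
    return model, labeled           # the last model is trained on all examples


# Toy run with stand-in callables (purely illustrative, no SVM involved).
pool = [f"sentence {i}" for i in range(5)]
model, labeled = active_learning(
    seed=[("seed sentence", True)],
    pool=pool,
    train=lambda examples: len(examples),        # "model" = number of examples
    quality=lambda m: min(1.0, m / 4),           # saturates at four examples
    propose=lambda m, remaining, k: list(remaining[:k]),
    ask_user=lambda sentence: sentence.endswith(("0", "2", "4")),
    batch_size=2,
)
print(model, len(labeled))
```
        </preformat>
        <p>Passing the components in as callables keeps the loop independent of the concrete classifier, so an SVM training routine and a held-out quality estimate could be substituted for the stand-ins.</p>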
      </sec>
      <sec id="sec-1-2">
        <title>Process graphs for activity representation and processing</title>
        <p>can be said, A0 is a disconnected graph in which a set N of connected
components can be identified. This set represents different graphs
that encode the interaction and coherence of related processes described
in different documents.</p>
        <p>In order to judge the quality of the extraction
process quantitatively, an evaluation dataset and an evaluation strategy need to be
developed as a prerequisite for future work. More research on suitable
similarity functions for relations, which can also handle semantic
similarities, should improve the quality of the graph merging process.
Future work will also include the integration of domain knowledge
from knowledge graphs. These have been described as very helpful
resources for adapting to a domain in Roberts and
Harabagiu [2011]. The links and dependencies between entities and their
possible representations in the data can be encoded in those data
structures by domain experts. This will add supervision and control
to the graph creation process and thus allow for a higher
precision of the graph. Additionally, anaphora resolution can be modeled
with knowledge bases to connect graph structures in which the
relations represent processes that produce other entities as results. Such
edges cannot be established by character-based or semantic comparison of
the relations. At the moment, a connection can only be established
if the producing process is encoded within a single document.
Furthermore, the introduction of manual correction steps to refine the
graph and the optimization of the relation extraction classifier
may further improve the quality.</p>
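        <p>The decomposition of the activity graph A0 into its set N of connected components, mentioned at the beginning of this section, can be sketched with a breadth-first search. Because the graph is directed, the sketch assumes that weakly connected components are meant, that is, edge direction is ignored when grouping nodes. The function name and the activity labels in the example are invented for illustration.</p>
        <preformat>
```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, List, Set, Tuple


def weakly_connected_components(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> List[Set[Hashable]]:
    """Group the nodes of a directed graph into weakly connected components.

    `nodes` should list every node of the graph; edge direction is ignored.
    """
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:                       # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Invented example: two unrelated procedures form two components.
nodes = ["extract DNA", "amplify DNA", "sequence DNA",
         "stain tissue", "image tissue"]
edges = [("extract DNA", "amplify DNA"), ("amplify DNA", "sequence DNA"),
         ("stain tissue", "image tissue")]
components = weakly_connected_components(nodes, edges)
print(len(components))
```
        </preformat>
        <p>Each returned set corresponds to one process graph that could be inspected or evaluated separately.</p>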
      </sec>
    </sec>
    <sec id="sec-2">
      <title>ACKNOWLEDGEMENT</title>
      <p>ExB Labs GmbH kindly helped to compile and preprocess the
corpus for this work.</p>
      <p>Funding: The project is funded by the European Regional
Development Fund (ERDF/EFRE) and the European Social Fund (ESF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Bordag</surname>
          </string-name>
          .
          <article-title>A comparison of co-occurrence and similarity measures as simulations of context</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing</source>
          , pages
          <fpage>52</fpage>
          -
          <lpage>63</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            <given-names>GuoDong</given-names>
          </string-name>
          , Su Jian, Zhang Jie, and Zhang Min.
          <article-title>Exploring various knowledge in relation extraction</article-title>
          .
          <source>In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics</source>
          , pages
          <fpage>427</fpage>
          -
          <lpage>434</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR, abs/1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kirk</given-names>
            <surname>Roberts</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sanda M</given-names>
            <surname>Harabagiu</surname>
          </string-name>
          .
          <article-title>A flexible framework for deriving assertions from electronic medical records</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>568</fpage>
          -
          <lpage>573</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>