Extracting process graphs from medical text data

An approach towards a systematic framework for extracting and mining descriptions of sequential medical processes from large text sources.

Andreas Niekler 1,*
1 University of Leipzig, Department of Computer Science, Natural Language Group, Augustusplatz 10, 04109 Leipzig
* Contact: aniekler@informatik.uni-leipzig.de

ABSTRACT
In this work a natural language processing workflow for extracting sequential activities from large collections of medical text documents is developed. The approach utilizes graph structures to process, link and assess the activities found in the documents.

Keywords: relation extraction, natural language processing, graph processing, process models

1 INTRODUCTION
Medical publications, surgical procedure reports and medical records typically contain procedural descriptions. For example, all activities included in a medical study must be documented for reproducibility purposes, surgical reports give a stepwise description of the procedures involved, and medical records list a history of medical treatment. Additionally, related studies or reports describe similar activities with some alterations, or rely on preceding activities that may be described in other documents. Consider, for example, the preparation steps required before DNA can be sequenced, which are described in scientific papers. They are often the same, but need to be documented for each study. Such redundant activity descriptions can be found across many documents describing research within the same domain or field of research. Nevertheless, differences among the activities in related documents also exist. A complete overview of the activities in a defined document collection provides easy insight into the workflows and paradigms of a domain or field of study. Thus, finding these links between documents and aligning their redundant activities helps to structure and analyze procedural knowledge from topically or domain-related medical texts. For example, early stages or parts of a larger process might be documented separately from other parts or later stages.

In this work a general natural language processing workflow for extracting sequential activities from large collections of medical text documents is developed. A graph-based data structure is introduced to merge extracted sequences which contain similar activities, in order to build a global graph of the procedures described in documents on similar topics or tasks.

2 TEXT MINING METHODOLOGY FOR PROCESS EXTRACTION
In this section we describe a methodology which extracts and links activities from medical text documents. The described system follows a sequence of procedures in order to create an activity graph as a result. First, the text sources have to be processed in order to access the entity items in the text. Different entities in a sentence are related and form an expressed activity. Therefore, the extraction of valid relations that form activities is introduced in the text processing step. The second step in our proposed methodology is the creation of a directed graph structure which can be used further as a representation of the activities contained within a text collection.

2.1 Text processing and classification for activity extraction
The text sources must first be separated into sentences and tokens using state-of-the-art tools. Additionally, POS tagging must be applied to the text sources. To extract the procedural knowledge from the texts, named entity recognition (NER) is required as a pre-processing step. In the separated and preprocessed sentences, multiple entities may form an activity. Consider the sentence "Real-time/JJ PCR/NNP was/VBD done/VBN using/VBG the/DT fluorescent-labelled/JJ oligonucleotide/NN probes/NNS". "Real-time PCR" and "fluorescent-labelled oligonucleotide probes" are the identified entities which form the activity "done". This activity can be part of a chain of activities spanning multiple documents.

The characteristics of activities, and of the relations between entities, change across domains and described procedures. Thus, the process for identifying and connecting entities to activities within the sentences should not be fixed or static. To account for this, the identification of relations or activities is defined as a classification task using a Support Vector Machine (SVM) along with word- and POS-tag-level features as proposed by GuoDong et al. [2005]. If a sentence contains entities E1 and E2, the two words before E1, the two words after E2 and all words between E1 and E2 are extracted as features. Furthermore, the POS tags of the extracted words are used as features for the SVM.
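The following sketch illustrates this feature extraction. It assumes that sentences are already tokenized and POS-tagged and that the token spans of E1 and E2 are known; the function and variable names are illustrative and not part of the described system.

# Illustrative sketch of the described feature extraction (hypothetical names).
# A sentence is assumed to be a list of (token, POS tag) pairs; the entity spans
# of E1 and E2 are given as (start, end) token indices, with E1 preceding E2.

def extract_features(sentence, e1_span, e2_span, window=2):
    tokens = [tok for tok, _ in sentence]
    tags = [tag for _, tag in sentence]
    e1_start, e1_end = e1_span
    e2_start, e2_end = e2_span

    # Two words before E1, all words between E1 and E2, two words after E2 ...
    words = (tokens[max(0, e1_start - window):e1_start]
             + tokens[e1_end:e2_start]
             + tokens[e2_end:e2_end + window])
    # ... and the POS tags of the same positions as additional features.
    pos = (tags[max(0, e1_start - window):e1_start]
           + tags[e1_end:e2_start]
           + tags[e2_end:e2_end + window])
    return {"words": words, "pos": pos}

# Example with the sentence from above (entity spans given manually):
sentence = [("Real-time", "JJ"), ("PCR", "NNP"), ("was", "VBD"), ("done", "VBN"),
            ("using", "VBG"), ("the", "DT"), ("fluorescent-labelled", "JJ"),
            ("oligonucleotide", "NN"), ("probes", "NNS")]
features = extract_features(sentence, e1_span=(0, 2), e2_span=(6, 9))
# features["words"] == ["was", "done", "using", "the"]; no tokens precede E1 or follow E2 here.

Such feature dictionaries would then be vectorized (for example as binary bag-of-features vectors) before training the SVM; the exact encoding is not specified in the text and is left open here.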
Before the training process is applied, the user must define the type and the form of the desired relation. On the basis of this definition, training examples are collected from the data. For this purpose an active learning procedure is introduced in which the user iteratively collects training data with the support of an automatic classification. An initial search for sentences that include a minimum number of entities and verbs indicating an activity is conducted. The set of matching sentences which contain this custom pattern is presented to the user. Correct entities are selected from the proposed sentences, along with the decision whether there is a relation between them or not. The features are extracted automatically and the set of positive and negative examples is used to train an initial SVM model. The trained model is used to identify additional examples in the data. The user judges these examples, and with every batch of new examples the classifier can be refined. Once the training quality of the SVM no longer changes with new examples, a final model is trained and applied to all documents. The result is a set of sentences from a document collection where each sentence contains an activity or valid relation between entities.

2.2 Process graphs for activity representation and processing
In the next processing step, a data structure is constructed on the basis of the set of activities that were identified by the classification process. For each activity, the two entities E1 and E2, the verb V (past participle) between them, a document identifier and a sentence identifier are stored. A directed graph structure A is introduced in which all identified activities are represented as vertices. All vertices that build a sequence of activities within a document are connected with directed edges, i.e. consecutive activities are connected as a chain of activities within the graph structure. This procedure creates a chain of connected vertices for every document represented in A. The main target for the further processing of A is the linking of different activity chains from multiple documents. This produces a graph structure which represents networks of activities that supplement each other. In A, the connected components can be understood as a summary of activities which come from, or lead to, similar activities. For example, multiple surgical reports contain many redundant descriptions of a certain type of surgical procedure. In some cases there might have been complications which the surgeon had to react to. A graph which merges the different sequential activities from such documents should introduce a cycle of activities describing these additional complications. In a later review of the graph, such cycles represent deviations from the standard procedure described in the document collection.
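A minimal sketch of this chain construction is given below. It assumes that the classified activities are available as records with entity, verb, document and sentence fields and uses networkx as the graph library; the example records are placeholders and not taken from the corpus.

# Illustrative sketch of building the activity graph A from classified activities.
import networkx as nx

# Hypothetical activity records: (E1, V, E2) plus document and sentence identifiers.
activities = [
    {"e1": "Real-time PCR", "verb": "done",
     "e2": "fluorescent-labelled oligonucleotide probes", "doc": "doc1", "sent": 3},
    {"e1": "PCR products", "verb": "purified", "e2": "spin columns", "doc": "doc1", "sent": 7},
    {"e1": "DNA", "verb": "sequenced", "e2": "capillary sequencer", "doc": "doc2", "sent": 2},
]

A = nx.DiGraph()

# Every identified activity becomes a vertex, keeping the stored attributes.
for idx, act in enumerate(activities):
    A.add_node(idx, **act)

# Within each document, consecutive activities (ordered by sentence identifier)
# are linked by directed edges, yielding one activity chain per document.
chains = {}
for idx, act in enumerate(activities):
    chains.setdefault(act["doc"], []).append(idx)

for doc, vertices in chains.items():
    vertices.sort(key=lambda idx: activities[idx]["sent"])
    for u, v in zip(vertices, vertices[1:]):
        A.add_edge(u, v)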
To detect similar relations across different documents, a similarity operation sim(R_D1, R_D2) is defined. This similarity operation can be constructed on the basis of word-level similarity or semantic similarity. With a preprocessing of the corpus such as word2vec or a co-occurrence analysis, each of the relation components can be augmented with semantic vectors representing the associated vocabulary, e.g. the semantic embeddings of Mikolov et al. [2013] or Bordag [2008]. This allows entities to be compared semantically, and conceptual similarities between entities can be used to find alike relations. Note that the similarity function is an exchangeable component of the described information extraction approach. The similarity between all activities is calculated for E1, E2 and V separately, which results in three similarity matrices that are transformed into adjacency matrices by applying a threshold to the similarity values. It is also conceivable to set a different threshold for each similarity matrix or to weight the matrices for further processing. All resulting adjacency matrices are multiplied element-wise in order to create a single adjacency matrix S of similar activities, i.e. two activities whose E1, E2 and V are all similar are represented by the value 1 in the final matrix.
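A sketch of this thresholding and combination step is shown below. It assumes that each activity component has already been mapped to a semantic vector (for example an averaged word2vec embedding) and uses cosine similarity; the concrete similarity measure, the thresholds and the random placeholder vectors are assumptions for illustration.

# Illustrative sketch: similarity matrices for E1, E2 and V, thresholded and
# combined element-wise into the adjacency matrix S of similar activities.
import numpy as np

def cosine_similarity_matrix(vectors):
    # vectors: array of shape (n_activities, dim), one semantic vector per activity component.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def similarity_adjacency(e1_vecs, e2_vecs, v_vecs, thresholds=(0.8, 0.8, 0.7)):
    sims = [cosine_similarity_matrix(m) for m in (e1_vecs, e2_vecs, v_vecs)]
    # Threshold each similarity matrix separately (possibly with different thresholds) ...
    adjacency = [(s >= t).astype(int) for s, t in zip(sims, thresholds)]
    # ... and multiply element-wise: S[i, j] = 1 only if E1, E2 and V are all similar.
    return adjacency[0] * adjacency[1] * adjacency[2]

# Usage with random placeholder vectors for three activities of dimension 50:
rng = np.random.default_rng(0)
S = similarity_adjacency(rng.normal(size=(3, 50)),
                         rng.normal(size=(3, 50)),
                         rng.normal(size=(3, 50)))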
In the following step, the activities considered to be similar are collapsed using the adjacency matrix S, resulting in a graph A′. Starting from graph A, all edges of similar vertices are taken over to a single vertex, and the vertices from which the edges were taken are deleted. That means similar vertices are collapsed into a single vertex and the associated ingoing and outgoing edges of those relations are merged. The resulting graph connects the sequences of different documents, since similar activities form single vertices with more than one incoming or outgoing edge (d+_G(v) > 1 or d-_G(v) > 1). Activities with this property are identified more frequently than other activities in the data and are thus of some importance for the overall activity summarization. In summary, it can be said that A′ is an unconnected graph in which a set N of connected components can be identified. This set represents different graphs in which the interaction and coherence of related processes, described in different documents, is encoded.

2.3 Future work
In order to quantitatively judge the quality of the extraction process, an evaluation dataset and evaluation strategy need to be developed as a prerequisite for future work. More research on suitable similarity functions for relations, which can also handle semantic similarities, will optimize the quality of the graph merging process. Future work will also include the adoption of domain knowledge from knowledge graphs. These have been described as very helpful resources for adapting to a domain by Roberts and Harabagiu [2011]. The links and dependencies between entities and their possible representations in the data can be encoded in such data structures by domain experts. This will add supervision and control to the graph creation process and thus allow for a higher precision of the graph. Additionally, anaphora resolution can be modeled with knowledge bases to connect graph structures where the relations represent processes which produce other entities as results. Such edges cannot be established with character-level or semantic comparison of the relations. At the moment, a connection can only be established if the producing process is encoded within a single document. Furthermore, the introduction of manual correction steps to refine the graph and the optimization of the relation extraction classification may also improve the quality.

ACKNOWLEDGEMENT
ExB Labs GmbH kindly helped to compile and preprocess the corpus for this work.

Funding: The project is funded by the European Regional Development Fund (ERDF/EFRE) and the European Social Fund (ESF).

REFERENCES
Stefan Bordag. A comparison of co-occurrence and similarity measures as simulations of context. In Computational Linguistics and Intelligent Text Processing, pages 52–63. Springer, 2008.
Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 427–434, 2005.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
Kirk Roberts and Sanda M Harabagiu. A flexible framework for deriving assertions from electronic medical records. Journal of the American Medical Informatics Association, 18(5):568–573, 2011.