Extracting process graphs from medical text data

An approach towards a systematic framework for extracting and mining descriptions of sequential medical processes from large text sources.

Andreas Niekler 1,*
1 University of Leipzig, Department of Computer Science, Natural Language Group, Augustusplatz 10, 04109 Leipzig
* Contact: aniekler@informatik.uni-leipzig.de

ABSTRACT
In this work a natural language processing workflow for extracting sequential activities from large collections of medical text documents is developed. The approach utilizes graph structures to process, link and assess the activities found in the documents.

Keywords: relation extraction, natural language processing, graph processing, process models

1 INTRODUCTION
Medical publications, surgical procedure reports and medical records typically contain procedural descriptions. For example, all activities included in a medical study must be documented for reproducibility purposes, surgical reports give a stepwise description of the procedures involved, and medical records list a history of medical treatment. Additionally, related studies or reports describe similar activities with some alterations, or rely on preceding activities that may be described in other documents. Consider, for example, the preparation steps required before DNA can be sequenced, which are described in scientific papers. They are often the same, but need to be documented for each study. Such redundant activity descriptions can be found across many documents describing research within the same domain or field of research. Nevertheless, differences among the activities in related documents also exist. A complete overview of the activities in a defined document collection provides easy insight into the workflows and paradigms of a domain or field of study. Thus, finding these links between documents and aligning their redundant activities helps to structure and analyze procedural knowledge from topically or domain-related medical texts. For example, early stages or parts of a larger process might be documented separately from other parts or later stages.

In this work a general natural language processing workflow for extracting sequential activities from large collections of medical text documents is developed. A graph-based data structure is introduced to merge extracted sequences which contain similar activities, in order to build a global graph of the procedures described in documents on similar topics or tasks.

2 TEXT MINING METHODOLOGY FOR PROCESS EXTRACTION
In this section we describe a methodology which extracts and links activities from medical text documents. The described system follows a sequence of procedures in order to create an activity graph as a result. First, the text sources have to be processed in order to access the entity items in the text. Different entities in a sentence are related and form an expressed activity. Therefore, the extraction of valid relations that form activities is introduced in the text processing step. The second step in our proposed methodology is the creation of a directed graph structure which can be used further as a representation of the activities contained within a text collection.

2.1 Text processing and classification for activity extraction
The text sources must first be separated into sentences and tokens using state-of-the-art tools. Additionally, POS tagging must be applied to the text sources. To extract the procedural knowledge from the texts, named entity recognition (NER) is required as a pre-processing step. In the separated and preprocessed sentences, multiple entities may form an activity. Consider the sentence "Real-time/JJ PCR/NNP was/VBD done/VBN using/VBG the/DT fluorescent-labelled/JJ oligonucleotide/NN probes/NNS". "Real-time PCR" and "fluorescent-labelled oligonucleotide probes" are the identified entities which form the activity "done". This activity can be part of a chain of activities spanning multiple documents.

The characteristics of activities, and of the relations between entities, change across domains and described procedures. Thus, the process for identifying and connecting entities to activities within the sentences should not be fixed or static. To account for this, the identification of relations or activities is defined as a classification task using a Support Vector Machine (SVM) along with word- and POS-tag-level features as proposed by GuoDong et al. [2005]. If a sentence contains entities E1 and E2, the two words before E1, the two words after E2 and all words between E1 and E2 are extracted as features. Furthermore, the POS tags of the extracted words are used as features for the SVM.
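The following sketch illustrates this feature extraction. It assumes that sentences are already tokenized and POS-tagged and that the token spans of E1 and E2 are known; the function and variable names are illustrative and not part of the described system.

# Illustrative sketch of the described feature extraction (hypothetical names).
# A sentence is assumed to be a list of (token, POS tag) pairs; the entity spans
# of E1 and E2 are given as (start, end) token indices, with E1 preceding E2.

def extract_features(sentence, e1_span, e2_span, window=2):
    tokens = [tok for tok, _ in sentence]
    tags = [tag for _, tag in sentence]
    e1_start, e1_end = e1_span
    e2_start, e2_end = e2_span

    # Two words before E1, all words between E1 and E2, two words after E2 ...
    words = (tokens[max(0, e1_start - window):e1_start]
             + tokens[e1_end:e2_start]
             + tokens[e2_end:e2_end + window])
    # ... and the POS tags of the same positions as additional features.
    pos = (tags[max(0, e1_start - window):e1_start]
           + tags[e1_end:e2_start]
           + tags[e2_end:e2_end + window])
    return {"words": words, "pos": pos}

# Example with the sentence from above (entity spans given manually):
sentence = [("Real-time", "JJ"), ("PCR", "NNP"), ("was", "VBD"), ("done", "VBN"),
            ("using", "VBG"), ("the", "DT"), ("fluorescent-labelled", "JJ"),
            ("oligonucleotide", "NN"), ("probes", "NNS")]
features = extract_features(sentence, e1_span=(0, 2), e2_span=(6, 9))
# features["words"] == ["was", "done", "using", "the"]; no tokens precede E1 or follow E2 here.

Such feature dictionaries would then be vectorized (for example as binary bag-of-features vectors) before training the SVM; the exact encoding is not specified in the text and is left open here.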
Before the training process is applied, the user must define the type and the form of the desired relation. On the basis of this definition, training examples are collected from the data. For this purpose an active learning procedure is introduced in which the user iteratively collects training data with the support of an automatic classification. An initial search for sentences that include a minimum number of entities and verbs indicating an activity is conducted. The set of matching sentences which contain this custom pattern is presented to the user. Correct entities are selected from the proposed sentences, along with the decision whether there is a relation between them or not. The features are extracted automatically and the set of positive and negative examples is used to train an initial SVM model. The trained model is used to identify additional examples in the data. The user judges these examples, and with every batch of new examples the classifier can be refined. Once the training quality of the SVM no longer changes with new examples, a final model is trained and applied to all documents. The result is a set of sentences from a document collection where each sentence contains an activity or valid relation between entities.

2.2 Process graphs for activity representation and processing
In the next processing step, a data structure is constructed on the basis of the set of activities that were identified by the classification process. For each activity, the two entities E1 and E2, the verb V (past participle) between them, a document identifier and a sentence identifier are stored. A directed graph structure A is introduced in which all identified activities are represented as vertices. All vertices that build a sequence of activities within a document are connected with directed edges, i.e. consecutive activities are connected as a chain of activities within the graph structure. This procedure creates a chain of connected vertices for every document represented in A. The main target for the further processing of A is the linking of different activity chains from multiple documents. This produces a graph structure which represents networks of activities that supplement each other. In A, the connected components can be understood as a summary of activities which come from, or lead to, similar activities. For example, multiple surgical reports contain many redundant descriptions of a certain type of surgical procedure. In some cases there might have been complications which the surgeon had to react to. A graph which merges the different sequential activities from such documents should introduce a cycle of activities describing these additional complications. In a later review of the graph, such cycles represent deviations from the standard procedure described in the document collection.
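A minimal sketch of this chain construction is given below. It assumes that the classified activities are available as records with entity, verb, document and sentence fields and uses networkx as the graph library; the example records are placeholders and not taken from the corpus.

# Illustrative sketch of building the activity graph A from classified activities.
import networkx as nx

# Hypothetical activity records: (E1, V, E2) plus document and sentence identifiers.
activities = [
    {"e1": "Real-time PCR", "verb": "done",
     "e2": "fluorescent-labelled oligonucleotide probes", "doc": "doc1", "sent": 3},
    {"e1": "PCR products", "verb": "purified", "e2": "spin columns", "doc": "doc1", "sent": 7},
    {"e1": "DNA", "verb": "sequenced", "e2": "capillary sequencer", "doc": "doc2", "sent": 2},
]

A = nx.DiGraph()

# Every identified activity becomes a vertex, keeping the stored attributes.
for idx, act in enumerate(activities):
    A.add_node(idx, **act)

# Within each document, consecutive activities (ordered by sentence identifier)
# are linked by directed edges, yielding one activity chain per document.
chains = {}
for idx, act in enumerate(activities):
    chains.setdefault(act["doc"], []).append(idx)

for doc, vertices in chains.items():
    vertices.sort(key=lambda idx: activities[idx]["sent"])
    for u, v in zip(vertices, vertices[1:]):
        A.add_edge(u, v)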
To detect similar relations across different documents, a similarity operation sim(R_D1, R_D2) is defined. This similarity operation can be constructed on the basis of word-level similarity or semantic similarity. With a preprocessing of the corpus such as word2vec or a co-occurrence analysis, each of the relation components can be augmented with semantic vectors representing the associated vocabulary, e.g. the semantic embeddings of Mikolov et al. [2013] or Bordag [2008]. This allows entities to be compared semantically, and conceptual similarities between entities can be used to find alike relations. Note that the similarity function is an exchangeable component of the described information extraction approach. The similarity between all activities is calculated for E1, E2 and V separately, which results in three similarity matrices that are transformed into adjacency matrices by applying a threshold to the similarity values. It is also conceivable to set a different threshold for each similarity matrix or to weight the matrices for further processing. All resulting adjacency matrices are multiplied element-wise in order to create a single adjacency matrix S of similar activities, i.e. two activities whose E1, E2 and V are all similar are represented by the value 1 in the final matrix.
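A sketch of this thresholding and combination step is shown below. It assumes that each activity component has already been mapped to a semantic vector (for example an averaged word2vec embedding) and uses cosine similarity; the concrete similarity measure, the thresholds and the random placeholder vectors are assumptions for illustration.

# Illustrative sketch: similarity matrices for E1, E2 and V, thresholded and
# combined element-wise into the adjacency matrix S of similar activities.
import numpy as np

def cosine_similarity_matrix(vectors):
    # vectors: array of shape (n_activities, dim), one semantic vector per activity component.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def similarity_adjacency(e1_vecs, e2_vecs, v_vecs, thresholds=(0.8, 0.8, 0.7)):
    sims = [cosine_similarity_matrix(m) for m in (e1_vecs, e2_vecs, v_vecs)]
    # Threshold each similarity matrix separately (possibly with different thresholds) ...
    adjacency = [(s >= t).astype(int) for s, t in zip(sims, thresholds)]
    # ... and multiply element-wise: S[i, j] = 1 only if E1, E2 and V are all similar.
    return adjacency[0] * adjacency[1] * adjacency[2]

# Usage with random placeholder vectors for three activities of dimension 50:
rng = np.random.default_rng(0)
S = similarity_adjacency(rng.normal(size=(3, 50)),
                         rng.normal(size=(3, 50)),
                         rng.normal(size=(3, 50)))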
In the following step, the activities considered to be similar are collapsed using the adjacency matrix S, resulting in a graph A′. Starting from graph A, all edges of similar vertices are taken over to a single vertex, and the vertices from which the edges were taken are deleted. That means similar vertices are collapsed into a single vertex and the associated ingoing and outgoing edges of those relations are merged. The resulting graph connects the sequences of different documents, since similar activities form single vertices with more than one incoming or outgoing edge (d+_G(v) > 1 or d-_G(v) > 1). Activities with this property are identified more frequently than other activities in the data and are thus of some importance for the overall activity summarization. In summary, it can be said that A′ is an unconnected graph in which a set N of connected components can be identified. This set represents different graphs in which the interaction and coherence of related processes, described in different documents, is encoded.

2.3 Future work
In order to quantitatively judge the quality of the extraction process, an evaluation dataset and evaluation strategy need to be developed as a prerequisite for future work. More research on suitable similarity functions for relations, which can also handle semantic similarities, will optimize the quality of the graph merging process. Future work will also include the adoption of domain knowledge from knowledge graphs. These have been described as very helpful resources for adapting to a domain by Roberts and Harabagiu [2011]. The links and dependencies between entities and their possible representations in the data can be encoded in such data structures by domain experts. This will add supervision and control to the graph creation process and thus allow for a higher precision of the graph. Additionally, anaphora resolution can be modeled with knowledge bases to connect graph structures where the relations represent processes which produce other entities as results. Such edges cannot be established with character-level or semantic comparison of the relations. At the moment, a connection can only be established if the producing process is encoded within a single document. Furthermore, the introduction of manual correction steps to refine the graph and the optimization of the relation extraction classification may also improve the quality.

ACKNOWLEDGEMENT
ExB Labs GmbH kindly helped to compile and preprocess the corpus for this work.

Funding: The project is funded by the European Regional Development Fund (ERDF/EFRE) and the European Social Fund (ESF).

REFERENCES
Stefan Bordag. A comparison of co-occurrence and similarity measures as simulations of context. In Computational Linguistics and Intelligent Text Processing, pages 52–63. Springer, 2008.
Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 427–434, 2005.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
Kirk Roberts and Sanda M Harabagiu. A flexible framework for deriving assertions from electronic medical records. Journal of the American Medical Informatics Association, 18(5):568–573, 2011.