VALIDATION OF MIXED-STRUCTURED DATA USING PATTERN MINING AND INFORMATION EXTRACTION

Martin Atzmueller                              Stephanie Beer
University of Kassel                           University-Hospital of Würzburg
Knowledge and Data Engineering                 Gastroontologics Research Group
Kassel, Germany                                Würzburg, Germany
atzmueller@cs.uni-kassel.de                    beer_s@klinik.uni-wuerzburg.de

ABSTRACT

For large-scale data mining utilizing data from ubiquitous and mixed-structured data sources, the appropriate extraction and integration into a comprehensive data-warehouse is of prime importance. Then, appropriate methods for validation and potential refinement are essential. This paper presents an approach applying data mining and information extraction methods for data validation: We apply subgroup discovery and (rule-based) information extraction for data integration and validation. The methods are integrated into an incremental process for continuous validation. The results of a medical application demonstrate that subgroup discovery and the applied information extraction methods are well suited for mining, extracting and validating clinically relevant knowledge.

1. INTRODUCTION

Whenever data is continuously collected, for example, using intelligent documentation systems [1], data mining and data analysis provide a broad range of options for scientific purposes. The mining and analysis step is often implemented using a data-warehouse [2, 3, 4]. For the preprocessing and integration of data from several heterogeneous sources, there exist standardized extract-transform-load (ETL) procedures that need to incorporate suitable data schemas and integration rules. Additionally, for unstructured or semi-structured textual data sources, the integration requires effective information extraction methods. For clinical discharge letters, for example, the structure of the letter is usually non-standardized and thus depends on the writing styles of the different authors.

However, a prerequisite of data mining is the validation and quality assurance of the integrated data. Especially with unreliable extraction and integration methods, the quality of the obtained data can vary significantly. If the data has been successfully validated, then the trust in the data mining results and their acceptance can be increased.

In this paper, we propose an approach for the validation of mixed-structured data using data mining and information extraction, and we propose appropriate refinement options. We focus on a data mining technique for mining local patterns, i.e., subgroup discovery, e.g., [5, 6, 7], which is especially suitable for this task: Local patterns capture local regularities (and irregularities) of the data and are therefore useful for spotting unexpected, contradicting, and otherwise unusual patterns potentially indicating problems and errors in the data.

Concerning the information extraction techniques, we consider popular methods implemented in the UIMA [8] and ClearTK [9] frameworks, and especially focus on the TextMarker system, e.g., [10, 11], for rule-based information extraction. Rules are especially suitable for the proposed information extraction task since they allow a concise and declarative formalization of the relevant domain knowledge that is especially easy to acquire, to comprehend and to maintain. Furthermore, in the case of errors, the cause can easily be identified by tracing the application of the individual rules.

The combined approach enables data mining from heterogeneous sources. The user can specify simple rules that consider features of the text, e.g., structural or syntactic features of the textual content. We focus on an incremental level-wise approach, such that both methods can complement each other in the validation and refinement setting. Furthermore, validation knowledge can be formalized in a knowledge base for assessing known and expected relations in the data.

The approach has been implemented in a clinical application for mining data from clinical information systems, documentation systems, and clinical discharge letters. This application scenario concerns the data integration from heterogeneous databases and the information extraction from textual documents. The experiences and results so far demonstrate the flexibility and effectiveness of the presented approach, making the data mining and information extraction methods suitable components in the mining, validation and refinement process.

2. BACKGROUND

In the following, we briefly summarize the applied methods for data mining and information extraction: subgroup discovery, and rule-based information extraction using TextMarker.

2.1. Subgroup Discovery

Subgroup discovery is a flexible data mining method for discovering local patterns that can be utilized for global modeling in the context of exploratory data analysis, description, characterization and classification.

Subgroup discovery is applied for identifying relations between a (dependent) target concept and a set of explaining (independent) variables. The goal is to describe subsets of the data that have the most unusual characteristics with respect to the concept of interest given by the target variable [6]. For example, the risk of coronary heart disease (target variable) is significantly higher in the subgroup of smokers with a positive family history than in the general population.

In the context of the proposed validation approach, we consider certain gold-standard concepts as targets, as well as target concepts that are true if and only if equivalent concepts from two different sources match. Then, we can identify combinations of factors that cause a mismatch between the concepts. These combinations can then indicate candidates for refinement.

2.2. Rule-based Information Extraction

Information extraction aims at extracting a set of concepts, entities and relations from a set of documents. TextMarker [10, 11] is a robust system for rule-based information extraction. It can be applied very intuitively, since the used rules are especially easy to acquire and to comprehend. Using the extracted information, data records can easily be created in a post-processing step. During 'manual' information extraction, humans often apply a strategy according to a highlighter metaphor: First, top-level text blocks are considered and classified according to their content by coloring them with different highlighters. The contained elements of the annotated text segments are then considered further. The TextMarker system tries to imitate this manual extraction method by formalizing the appropriate actions using matching rules: The rules mark sequences of words, extract text segments or modify the input document depending on textual features.

TextMarker aims at supporting the knowledge engineer in the rapid prototyping of information extraction applications. The default input for the system is semi-structured text, but it can also process structured or free text. Technically, HTML is often the input format, since most word processing documents can be obtained in HTML format or converted appropriately.

3. THE MINING AND VALIDATION PROCESS

Figure 1 depicts the process of validation and refinement of mixed-structured data using pattern mining and information extraction methods. The input of the process is given by data from heterogeneous data sources and by textual documents. The former are processed by appropriate data integration methods adapted to the different sources. The latter are handled by information extraction techniques, e.g., rule-based methods that utilize appropriate extraction rules for the extraction of concepts and relations from the documents. In general, a variety of methods can be applied.

[Fig. 1 diagram: Data Sources feed Data Integration (refine rules, schema); Documents feed Information Extraction (refine model/rules); both feed the Pattern Mining System, whose Pattern Set is checked in Validation & Quality Assurance against Background Knowledge.]
Fig. 1. Process Model: Validation of Mixed-Structured Data using Pattern Mining and Information Extraction

The process supports arbitrary information extraction methods, e.g., automatic techniques like support vector machines or conditional random fields as implemented in the ClearTK [9] toolkit for statistical natural language processing. However, the refinement capabilities vary for the different extraction approaches: While black-box methods like support vector machines or conditional random fields only allow an indirect refinement and adaptation of the model, i.e., based on adapting the input data and/or the method parameters for constructing the model, a white-box approach implemented using rules provides for a direct modification of its model, namely the provided rules. Therefore, we especially focus on rule-based methods due to their rich refinement capabilities.

After the integration and extraction of the data, the result is provided to the pattern mining system, which obtains a set of validation patterns as output. This set is then checked both for internal consistency and against formalized background knowledge. In the case of discrepancies and errors, refinements are proposed for the data integration and/or the information extraction steps. After the rules have been refined, the process iterates with the updated schemas and models.

In the following, we discuss exemplary results obtained from a medical project. We applied data collected by the SonoConsult system, a multifunctional knowledge system for sonography, which has been in routine use since 2002, documenting more than 12000 patients in two clinics. The system covers the entire field of abdominal ultrasound (liver, portal tract, gallbladder, spleen, kidneys, adrenal glands, pancreas, intestine, lymph nodes, abdominal aorta, cava inferior, prostate, and urinary bladder). The data was integrated with the SAP-based i.s.h.med system, and the information extraction techniques were applied to the textual discharge letters of the respective patients; SonoConsult was used for documentation. By integrating different data sources into the warehouse, it is possible to measure the conformity of sonographic results with other methods or inputs. In our evaluations, we applied computer-tomography diagnoses and additional billing diagnoses (from the hospital information system) as a gold-standard.

Table 1 shows the correlation of SonoConsult-based diagnoses with CT/MR diagnoses, diagnoses listed in the discharge letter, and diagnoses contained in the hospital information system for a selection of cases from a certain examiner. It was quite interesting that the conformity between SonoConsult-based diagnoses and the diagnoses contained in the hospital information system was relatively low. Evaluating this issue, it was obvious that various diagnoses were not listed in the hospital information system because they were not revenue enhancing and not relevant for all clinical situations. Therefore, we looked at the accordance with the discharge letters, which were found to be highly concordant at least for the diagnosis of liver metastasis. Liver cirrhosis is more awkward to detect using ultrasound and has to be in a more advanced stage. Therefore, some of the discharge diagnoses "liver cirrhosis" were only detected using histology or other methods.

In some cases, there are discrepancies with respect to the formalized background knowledge that still persist after refinement of the rules and checking of the data sources. In such cases, explanation-aware mining and analysis components provide appropriate solutions for resolving conflicts and inconsistencies. By supporting the user with appropriate justifications and explanations, misleading patterns can be identified, and the background knowledge can be adapted. The decision whether the background knowledge needs to be adapted is made by the domain specialist. As we have described in [12], there are several continuous explanation dimensions in the context of data mining and analysis that can be utilized for improving the explanation capabilities. In the medical domain, for example, patterns are usually first assessed on the abstract level, before they are checked and verified on concrete patient records, i.e., on a very detailed level of abstraction. Then, discrepancies are modeled in the background knowledge, for example, as exception conditions for certain subgroups of patients.

The validation phase is performed on several levels: On the first level, we can use a (partial) gold-standard both for checking the data integration and the information extraction tasks. We only require a partial gold-standard, i.e., a sample of the correct relations, because we need to test the functional requirements of the data integration and extraction phases. On the next level, we can incrementally validate the integrated data using the extracted information, or vice versa, using the mined patterns. In the case of discrepancies, we can rely on the partial gold-standard data for verification, or we can identify potential causes and verify these on concrete cases. Therefore, the final decision on the refinements relies on the user, who reviews all proposed refinements in a semi-automatic approach.

For the refinement steps, we can either extend the (partial) gold-standard, or we perform a boot-strapping approach, using a small gold-standard sample of target concepts for validation, e.g., for validating and refining the information extraction approach, which is in turn used for the validation of the data sources. In the next step, the validation targets can be extended and the process for refinement is applied inversely. The boot-strapping approach for validation and refinement is thus similar to the idea of co-training, e.g., [13], in machine learning, which also starts with a small labeled (correct) dataset and iteratively adapts the models using another co-trained dataset.

4. CONCLUSIONS

This paper presented an approach for the validation of mixed-structured data using information extraction and pattern mining methods. In an incremental approach, data can both be validated and refined with an increasing level of accuracy. The presented approach has been successfully implemented in a medical project targeted at integrating data from clinical information systems, documentation systems, and textual discharge letters. The experiences and results so far demonstrate the flexibility and effectiveness of the pattern mining and information extraction methods for the presented validation and refinement approach.

                    Total Case  SonoConsult  SAP        % Conformity     CT/MR      % Conformity     Discharge Letter  % Conformity
                    Number      Diagnoses    Diagnoses  with SonoConsult Diagnoses  with SonoConsult Diagnoses         with SonoConsult
Liver cirrhosis     16          12           6          20               1          33               9                 50
Liver metastasis    28          16           11         65               15         87               17                94

Table 1. Exemplary study for a selection of cases concerning liver examinations performed by a certain examiner: Conformity of SonoConsult system diagnoses with various sources of diagnosis input. The columns indicate the degree of correlation of the different sources with the SonoConsult diagnoses, measured by the number of covered cases.
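To make the mismatch-target construction of Section 2.1 concrete, the following Python sketch (an illustration only, not the implementation used in the project; the toy data, attribute names, and the choice of a weighted-relative-accuracy-style quality function are our own) scores conjunctive attribute-value descriptions against the binary target "the two sources disagree":

```python
from itertools import combinations

def wracc(n_sg, pos_sg, n, pos):
    """Weighted relative accuracy: coverage * (subgroup rate - default rate)."""
    if n_sg == 0:
        return 0.0
    return (n_sg / n) * (pos_sg / n_sg - pos / n)

def discover_mismatch_subgroups(records, source_a, source_b, attrs, max_depth=2):
    """Score conjunctive attribute=value descriptions against the target
    'the two sources disagree on the concept' and rank them by quality."""
    target = [r[source_a] != r[source_b] for r in records]
    n, pos = len(records), sum(target)
    # candidate selectors: one per observed attribute/value pair
    selectors = sorted({(a, r[a]) for r in records for a in attrs})
    results = []
    for depth in range(1, max_depth + 1):
        for desc in combinations(selectors, depth):
            if len({a for a, _ in desc}) < depth:  # skip attr=x AND attr=y
                continue
            covered = [all(r[a] == v for a, v in desc) for r in records]
            n_sg = sum(covered)
            pos_sg = sum(t for t, c in zip(target, covered) if c)
            results.append((wracc(n_sg, pos_sg, n, pos), desc))
    return sorted(results, reverse=True)

# toy example: letters by author "B" in department "onc" tend to disagree
records = [
    {"author": "A", "dept": "gastro", "dx_letter": "cirrhosis",  "dx_his": "cirrhosis"},
    {"author": "A", "dept": "onc",    "dx_letter": "metastasis", "dx_his": "metastasis"},
    {"author": "B", "dept": "onc",    "dx_letter": "metastasis", "dx_his": "none"},
    {"author": "B", "dept": "onc",    "dx_letter": "cirrhosis",  "dx_his": "none"},
    {"author": "B", "dept": "gastro", "dx_letter": "cirrhosis",  "dx_his": "cirrhosis"},
]
results = discover_mismatch_subgroups(records, "dx_letter", "dx_his", ["author", "dept"])
print(results[0])  # the description most associated with source mismatches
```

The top-ranked description is exactly the kind of "combination of factors causing a mismatch" that the approach flags as a refinement candidate; a real system would of course use a pruned search instead of exhaustive enumeration.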
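The TextMarker rule language itself is described in [10, 11]; as a rough, regex-based illustration of the highlighter metaphor (first classify top-level blocks, then extract only inside matching blocks), consider the following Python sketch, in which all rules, headings and names are our own:

```python
import re

# stage 1 ("highlighting"): tag top-level blocks by their heading
BLOCK_RULES = [
    ("diagnoses", re.compile(r"^diagnos", re.I)),
    ("findings",  re.compile(r"^(findings|sonography)", re.I)),
]

# stage 2: term rules applied only inside blocks with a matching tag
TERM_RULES = {
    "diagnoses": re.compile(r"liver (?:cirrhosis|metastasis)", re.I),
}

def extract(letter):
    """Two-level rule pass: split the letter at 'Heading:' lines, classify
    each block, then run the term rules of its tag inside the block body."""
    records = []
    for block in re.split(r"\n(?=\w[\w ]*:)", letter):
        heading, _, body = block.partition(":")
        for tag, rule in BLOCK_RULES:
            if rule.search(heading.strip()) and tag in TERM_RULES:
                for m in TERM_RULES[tag].finditer(body):
                    records.append({"block": tag, "concept": m.group(0).lower()})
    return records

letter = ("History: known alcohol abuse.\n"
          "Findings: coarse liver texture.\n"
          "Diagnoses: Liver cirrhosis, suspected liver metastasis.")
print(extract(letter))
```

Because both levels are plain rules, a wrong extraction can be traced to the block rule or term rule that fired, which is the refinement property the paper attributes to the rule-based approach.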
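The consistency check against formalized background knowledge described in Section 3 can be sketched as follows; this is a minimal illustration in which the knowledge-base format, the conformity measure, and the thresholds are our own assumptions, not the project's actual knowledge base:

```python
def conformity(records, concept, source_a, source_b):
    """Fraction of cases where source_b confirms a concept documented in source_a."""
    documented = [r for r in records if concept in r[source_a]]
    if not documented:
        return None
    confirmed = [r for r in documented if concept in r[source_b]]
    return len(confirmed) / len(documented)

def check_against_background(records, knowledge_base):
    """Flag every formalized expectation whose measured conformity falls
    below the required minimum; each flag is a candidate for refinement."""
    flags = []
    for concept, source_a, source_b, min_conf in knowledge_base:
        measured = conformity(records, concept, source_a, source_b)
        if measured is not None and measured < min_conf:
            flags.append((concept, source_a, source_b, round(measured, 2)))
    return flags

# toy integrated records: sets of diagnoses per case from two sources
records = [
    {"sono": {"cirrhosis"},  "letter": {"cirrhosis"}},
    {"sono": {"metastasis"}, "letter": {"metastasis"}},
    {"sono": {"cirrhosis"},  "letter": set()},
    {"sono": {"cirrhosis"},  "letter": set()},
]
# expectation: sonography diagnoses should reappear in the discharge letter
kb = [("cirrhosis", "sono", "letter", 0.8), ("metastasis", "sono", "letter", 0.8)]
print(check_against_background(records, kb))
# cirrhosis conformity is 1/3, below the 0.8 expectation -> flagged for review
```

As in the semi-automatic process above, such flags would be presented to the user, who decides whether to refine the extraction rules, the integration schema, or the background knowledge itself.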
In one case, liver cirrhosis was listed in the hospital information system but was neither found with ultrasound nor in the discharge letter; it came out that the input was performed by another department (neurology). Within the limited number of examined cases, we found only one case of pancreatic mass, which was found in the ultrasound examination and listed in the discharge letter; however, it was not included in the hospital information system. The first results of the correlations of diagnoses input by various sources show that there is a promisingly high conformity between SonoConsult and discharge letters, but for further quality improvement the correlation with other imaging techniques is very important.

5. REFERENCES

[1] Frank Puppe, Martin Atzmueller, Georg Buscher, Matthias Huettig, Hardi Lührs, and Hans-Peter Buscher, "Application and Evaluation of a Medical Knowledge-System in Sonography (SonoConsult)," in Proc. 18th Europ. Conf. on Artificial Intelligence (ECAI 2008), 2008, pp. 683-687.

[2] Jonathan C. Prather, David F. Lobach, Linda K. Goodwin, Joseph W. Hales, Marvin L. Hage, and W. Edward Hammond, "Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse," in Proc. AMIA Annual Fall Symposium (AMIA-1997), 1997, pp. 101-105.

[3] Rüdiger Wirth and Jochen Hipp, "CRISP-DM: Towards a Standard Process Model for Data Mining," in Proc. 4th Intl. Conf. on the Practical Application of Knowledge Discovery and Data Mining, 2000, pp. 29-39, Morgan Kaufmann.

[4] Martin Atzmueller, Stephanie Beer, and Frank Puppe, "A Data Warehouse-Based Approach for Quality Management, Evaluation and Analysis of Intelligent Systems using Subgroup Mining," in Proc. 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2009, pp. 372-377, AAAI Press.

[5] Martin Atzmueller, Frank Puppe, and Hans-Peter Buscher, "Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery," in Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05), Edinburgh, Scotland, 2005, pp. 647-652.

[6] Stefan Wrobel, "An Algorithm for Multi-Relational Discovery of Subgroups," in Proc. 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), Berlin, 1997, pp. 78-87, Springer Verlag.

[7] Willi Klösgen, "Explora: A Multipattern and Multistrategy Discovery Assistant," in Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, Eds., pp. 249-271. AAAI Press, 1996.

[8] David Ferrucci and Adam Lally, "UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment," Nat. Lang. Eng., vol. 10, no. 3-4, pp. 327-348, 2004.

[9] P. V. Ogren, P. G. Wetzler, and S. Bethard, "ClearTK: A UIMA Toolkit for Statistical Natural Language Processing," in UIMA for NLP Workshop at Language Resources and Evaluation Conference (LREC), 2008.

[10] Martin Atzmueller, Peter Kluegl, and Frank Puppe, "Rule-Based Information Extraction for Structured Data Acquisition using TextMarker," in Proc. of the LWA-2008, Special Track on Knowledge Discovery and Machine Learning, 2008, pp. 1-7.

[11] Peter Kluegl, Martin Atzmueller, and Frank Puppe, "TextMarker: A Tool for Rule-Based Information Extraction," in Proc. Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop, 2009, pp. 233-240, Gunter Narr Verlag.

[12] Martin Atzmueller and Thomas Roth-Berghofer, "Ready for the MACE? The Mining and Analysis Continuum of Explaining Uncovered," in AI-2010: 30th SGAI International Conference on Artificial Intelligence. Accepted.

[13] Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," in COLT: Proceedings of the Workshop on Computational Learning Theory, 1998, pp. 92-100, Morgan Kaufmann.