<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">VALIDATION OF MIXED-STRUCTURED DATA USING PATTERN MINING AND INFORMATION EXTRACTION</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
							<email>atzmueller@cs.uni-kassel.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Kassel Knowledge and Data Engineering</orgName>
								<address>
									<settlement>Kassel</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stephanie</forename><surname>Beer</surname></persName>
							<email>beer_s@klinik.uni-wuerzburg.de</email>
							<affiliation key="aff1">
								<orgName type="institution">University-Hospital of Würzburg Gastroenterology Research Group</orgName>
								<address>
									<settlement>Würzburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">VALIDATION OF MIXED-STRUCTURED DATA USING PATTERN MINING AND INFORMATION EXTRACTION</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">33E52D1751A7DA074E9AA1EE7586DBD5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For large-scale data mining utilizing data from ubiquitous and mixed-structured data sources, the appropriate extraction and integration into a comprehensive data-warehouse is of prime importance. Then, appropriate methods for validation and potential refinement are essential. This paper presents an approach applying data mining and information extraction methods for data validation: We apply subgroup discovery and (rule-based) information extraction for data integration and validation. The methods are integrated into an incremental process for continuous validation options. The results of a medical application demonstrate that subgroup discovery and the applied information extraction methods are well suited for mining, extracting and validating clinically relevant knowledge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Whenever data is continuously collected, for example, using intelligent documentation systems <ref type="bibr" target="#b1">[1]</ref>, data mining and data analysis provide a broad range of options for scientific purposes. The mining and analysis step is often implemented using a data-warehouse <ref type="bibr" target="#b2">[2,</ref><ref type="bibr" target="#b3">3,</ref><ref type="bibr" target="#b4">4]</ref>. For the data preprocessing and the integration of several heterogeneous sources, there exist standardized extract-transform-load (ETL) procedures, which need to incorporate suitable data schemas and integration rules. Additionally, for unstructured or semi-structured textual data sources, the integration requires effective information extraction methods. For clinical discharge letters, for example, the structure of the letters is usually not standardized and thus depends on the writing styles of the individual authors.</p><p>However, a prerequisite of data mining is the validation and quality assurance of the integrated data. Especially with unreliable extraction and integration methods, the quality of the obtained data can vary significantly. Successfully validating the data increases the trust in the data mining results and their acceptance.</p><p>In this paper, we propose an approach for the validation of mixed-structured data using data mining and information extraction, and discuss appropriate refinement options. 
We focus on a data mining technique for mining local patterns, i.e., subgroup discovery, e.g., <ref type="bibr">[5,</ref><ref type="bibr" target="#b6">6,</ref><ref type="bibr" target="#b8">7]</ref>, which is especially suitable for the task: Local patterns capture local regularities (and irregularities) of the data and are therefore useful for spotting unexpected, contradicting, and otherwise unusual patterns potentially indicating problems and errors in the data.</p><p>Concerning the information extraction techniques, we consider popular methods implemented in the UIMA <ref type="bibr" target="#b9">[8]</ref> and ClearTK <ref type="bibr" target="#b10">[9]</ref> frameworks, and especially focus on the TEXTMARKER system, e.g., <ref type="bibr" target="#b11">[10,</ref><ref type="bibr" target="#b12">11]</ref>, for rule-based information extraction. Rules are especially suitable for the proposed information extraction task since they allow a concise and declarative formalization of the relevant domain knowledge that is easy to acquire, to comprehend, and to maintain. Furthermore, in the case of errors, the cause can easily be identified by tracing the application of the individual rules.</p><p>The combined approach enables data mining from heterogeneous sources. The user can specify simple rules that consider features of the text, e.g., structural or syntactic features of the textual content. We focus on an incremental, level-wise approach, such that both methods can complement each other in the validation and refinement setting. Furthermore, validation knowledge can be formalized in a knowledge base for assessing known and expected relations in the data.</p><p>The approach has been implemented in a clinical application for mining data from clinical information systems, documentation systems, and clinical discharge letters. This application scenario concerns the data integration from heterogeneous databases and the information extraction from textual documents. 
The experiences and results so far demonstrate the flexibility and effectiveness of the presented approach, which make the data mining and information extraction methods suitable components in the mining, validation, and refinement process.</p><p>In the following, we briefly summarize the applied methods: subgroup discovery for data mining, and rule-based information extraction using TEXTMARKER.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Subgroup Discovery</head><p>Subgroup discovery is a flexible data mining method for discovering local patterns that can be utilized for global modeling in the context of exploratory data analysis, description, characterization, and classification.</p><p>Subgroup discovery is applied for identifying relations between a (dependent) target concept and a set of explaining (independent) variables. The goal is to describe subsets of the data that have the most unusual characteristics with respect to the concept of interest given by the target variable <ref type="bibr" target="#b6">[6]</ref>. For example, the risk of coronary heart disease (target variable) is significantly higher in the subgroup of smokers with a positive family history than in the general population.</p><p>In the context of the proposed validation approach, we consider certain gold-standard concepts as targets, as well as target concepts that are true if and only if equivalent concepts from two different sources match. Then, we can identify combinations of factors that cause a mismatch between the concepts. These combinations indicate candidates for refinement.</p></div>
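As a concrete illustration of the subgroup discovery step (a minimal sketch, not the system's actual implementation), the following Python snippet exhaustively scores conjunctions of attribute-value conditions with the weighted relative accuracy quality function; the toy patient data and all names are hypothetical:

```python
from itertools import combinations

def wracc(subgroup, data, target):
    """Weighted relative accuracy: coverage * (subgroup target rate - overall rate)."""
    n = len(data)
    covered = [r for r in data if all(r[a] == v for a, v in subgroup)]
    if not covered:
        return 0.0
    p0 = sum(r[target] for r in data) / n          # overall target share
    p = sum(r[target] for r in covered) / len(covered)  # share within subgroup
    return (len(covered) / n) * (p - p0)

def discover(data, target, attrs, max_len=2, k=3):
    """Exhaustively score all conjunctions of attribute=value conditions."""
    conditions = {(a, r[a]) for r in data for a in attrs}
    candidates = []
    for length in range(1, max_len + 1):
        for sg in combinations(sorted(conditions), length):
            candidates.append((wracc(sg, data, target), sg))
    return sorted(candidates, reverse=True)[:k]

# Hypothetical toy data; target = coronary heart disease (chd)
data = [
    {"smoker": True,  "fam_hist": True,  "chd": 1},
    {"smoker": True,  "fam_hist": True,  "chd": 1},
    {"smoker": True,  "fam_hist": False, "chd": 0},
    {"smoker": False, "fam_hist": True,  "chd": 0},
    {"smoker": False, "fam_hist": False, "chd": 0},
]
top = discover(data, "chd", ["smoker", "fam_hist"])
# the top-ranked subgroup is smoker=True AND fam_hist=True
```

Real subgroup discovery systems use refined quality functions and pruning strategies instead of this brute-force enumeration, but the ranking principle is the same.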
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Rule-based Information Extraction</head><p>Information extraction aims at extracting a set of concepts, entities, and relations from a set of documents. TEXTMARKER <ref type="bibr" target="#b11">[10,</ref><ref type="bibr" target="#b12">11]</ref> is a robust system for rule-based information extraction. It can be applied intuitively, since its rules are especially easy to acquire and to comprehend. Using the extracted information, data records can easily be created in a postprocessing step. Humans often apply a strategy according to a highlighter metaphor during 'manual' information extraction: First, top-level text blocks are considered and classified according to their content by coloring them with different highlighters. The contained elements of the annotated text segments are then considered further. The TEXTMARKER <ref type="bibr" target="#b11">[10,</ref><ref type="bibr" target="#b12">11]</ref> system tries to imitate this manual extraction method by formalizing the appropriate actions using matching rules:</p><p>The rules mark sequences of words, extract text segments, or modify the input document depending on textual features.</p><p>TEXTMARKER aims at supporting the knowledge engineer in the rapid prototyping of information extraction applications. The default input for the system is semi-structured text, but it can also process structured or free text. Technically, HTML is often the input format, since most word processing documents can be obtained in HTML format or converted appropriately.</p></div>
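The paper does not reproduce any TEXTMARKER rules; as a rough, hypothetical approximation of the highlighter metaphor, the following Python sketch uses regular expressions in two stages: block-level rules first "color" top-level segments, and finer rules then extract concepts only within the annotated segments (all rule patterns and the sample letter are invented for illustration):

```python
import re

# Stage 1: "highlighter" rules classify top-level blocks by their content.
BLOCK_RULES = [
    ("diagnosis", re.compile(r"^Diagnos[ei]s?:", re.IGNORECASE)),
    ("findings",  re.compile(r"^Findings?:", re.IGNORECASE)),
]

# Stage 2: finer extraction rules, applied only inside matching blocks.
CONCEPT_RULES = {
    "diagnosis": re.compile(r"liver (cirrhosis|metastas[ei]s)", re.IGNORECASE),
}

def annotate(letter):
    """Return (block_type, concept) pairs extracted from a discharge letter."""
    results = []
    for block in letter.split("\n\n"):          # top-level text blocks
        for block_type, rule in BLOCK_RULES:
            if rule.search(block):
                finder = CONCEPT_RULES.get(block_type)
                if finder:
                    for m in finder.finditer(block):
                        results.append((block_type, m.group(0).lower()))
    return results

letter = "Findings: enlarged spleen.\n\nDiagnoses: liver cirrhosis, suspected."
print(annotate(letter))  # [('diagnosis', 'liver cirrhosis')]
```

Declarative rules of this kind are easy to trace: when an extraction is wrong, one can inspect exactly which block rule and which concept rule fired.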
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">THE MINING AND VALIDATION PROCESS</head><p>Figure <ref type="figure" target="#fig_0">1</ref> depicts the process of validation and refinement of mixed-structured data using pattern mining and information extraction methods. The input of the process is given by data from heterogeneous data sources and by textual documents. The former are processed by appropriate data integration methods adapted to the different sources. The latter are handled by information extraction techniques, e.g., rule-based methods that utilize appropriate extraction rules for the extraction of concepts and relations from the documents. In general, a variety of methods can be applied.</p><p>The process supports arbitrary information extraction methods, e.g., automatic techniques like support vector machines or conditional random fields as implemented in the ClearTK <ref type="bibr" target="#b10">[9]</ref> toolkit for statistical natural language processing. However, the refinement capabilities vary for the different extraction approaches: While black-box methods like support vector machines or conditional random fields only allow an indirect refinement and adaptation of the model, i.e., based on adapting the input data and/or the method parameters for constructing the model, a white-box approach implemented using rules provides for a direct modification of its model, namely the provided rules. Therefore, we especially focus on rule-based methods due to their rich refinement capabilities.</p><p>After the integration and extraction of the data, the result is provided to the pattern mining system, which yields a set of validation patterns as output. This set is then checked for internal consistency and compared to formalized background knowledge. In the case of discrepancies and errors, refinements are proposed for the data integration and/or the information extraction steps. 
After the rules have been refined, the process iterates with the updated schemas and models.</p><p>In the following, we discuss exemplary results obtained from a medical project. We applied data collected by the SONOCONSULT system, a multifunctional knowledge system for sonography, which has been in routine use since 2002, documenting more than 12,000 patients in two clinics. The system covers the entire field of abdominal ultrasound (liver, portal tract, gallbladder, spleen, kidneys, adrenal glands, pancreas, intestine, lymph nodes, abdominal aorta, cava inferior, prostate, and urinary bladder). The data was integrated with the SAP-based i.s.h.med system, and the information extraction techniques were applied to the textual discharge letters of the respective patients; SONOCONSULT was used for documentation. By integrating different data sources into the warehouse it is possible to measure the conformity of sonographic results with other methods or inputs. In our evaluations, we applied computer-tomography diagnoses and additional billing diagnoses (from the hospital information system) as a gold-standard. Table <ref type="table" target="#tab_1">1</ref> shows the correlation of the SONOCONSULT-based diagnoses with CT/MR diagnoses, diagnoses listed in the discharge letter, and diagnoses contained in the hospital information system for a selection of cases from a certain examiner. Interestingly, the conformity between the SONOCONSULT-based diagnoses and the diagnoses contained in the hospital information system was relatively low. Investigating this issue, it became obvious that various diagnoses were not listed in the hospital information system because they were not revenue enhancing and not relevant in all clinical situations. Therefore, we looked at the accordance with the discharge letters, which were found to be highly concordant, at least for the diagnosis of liver metastasis. 
Liver cirrhosis is more difficult to detect using ultrasound and has to be in a more advanced stage to be detectable. Therefore, some of the discharge diagnoses "liver cirrhosis" were only detected using histology or other methods.</p><p>In some cases, discrepancies with respect to the formalized background knowledge still persist after refining the rules and checking the data sources. In such cases, explanation-aware mining and analysis components provide appropriate solutions for resolving conflicts and inconsistencies. By supporting the user with appropriate justifications and explanations, misleading patterns can be identified and the background knowledge can be adapted. The decision whether the background knowledge needs to be adapted is made by the domain specialist. As we have described in <ref type="bibr" target="#b13">[12]</ref>, there are several continuous explanation dimensions in the context of data mining and analysis that can be utilized for improving the explanation capabilities. In the medical domain, for example, patterns are usually first assessed on the abstract level, before they are checked and verified on concrete patient records, i.e., on a very detailed level of abstraction. Then, discrepancies are modeled in the background knowledge, for example, as exception conditions for certain subgroups of patients.</p><p>The validation phase is performed on several levels: On the first level, we can use a (partial) gold-standard for checking both the data integration and the information extraction tasks. We only require a partial gold-standard, i.e., a sample of the correct relations, since we only need to test the functional requirements of the data integration and extraction phases. On the next level, we can incrementally validate the integrated data using the extracted information, or vice versa, using the mined patterns. 
In the case of discrepancies, we can rely on the partial gold-standard data for verification, or we can identify potential causes and verify these on concrete cases. The final decision on the refinements therefore rests with the user, who reviews all proposed refinements in a semi-automatic approach.</p><p>For the refinement steps, we can either extend the (partial) gold-standard, or we perform a bootstrapping approach, using a small gold-standard sample of target concepts for validation, e.g., for validating and refining the information extraction approach, which is in turn used for the validation of the data sources. In the next step, the validation targets can be extended and the refinement process is applied inversely. The bootstrapping approach for validation and refinement is thus similar to the idea of co-training in machine learning, e.g., <ref type="bibr" target="#b14">[13]</ref>, which also starts with a small labeled (correct) dataset and iteratively adapts the models using another co-trained dataset.</p></div>
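The conformity measurement between two diagnosis sources, and the derivation of a binary mismatch target for subsequent subgroup discovery, can be sketched as follows (a minimal illustration with hypothetical field names and toy cases, not the project's actual schema):

```python
def conformity(records, src_a, src_b):
    """Fraction of cases where two diagnosis sources agree."""
    agree = sum(1 for r in records if r[src_a] == r[src_b])
    return agree / len(records)

def mismatch_targets(records, src_a, src_b):
    """Derive a binary 'mismatch' target, so that subgroup discovery can
    search for factors (examiner, clinic, ...) that explain disagreement."""
    return [dict(r, mismatch=int(r[src_a] != r[src_b])) for r in records]

# Hypothetical cases comparing system diagnoses with discharge letters
cases = [
    {"sono": "liver metastasis", "letter": "liver metastasis", "examiner": "A"},
    {"sono": "liver metastasis", "letter": "liver metastasis", "examiner": "A"},
    {"sono": "none",             "letter": "liver cirrhosis",  "examiner": "B"},
]
print(conformity(cases, "sono", "letter"))  # 2 of 3 cases agree
```

Running subgroup discovery with `mismatch` as the target variable then surfaces exactly the case subsets where the sources diverge, which are the candidates for refining either the integration rules or the extraction rules.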
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>This paper presented an approach for the validation of mixed-structured data using information extraction and pattern mining methods. In an incremental approach, data can both be validated and refined with an increasing level of accuracy. The presented approach has been successfully implemented in a medical project targeted at integrating data from clinical information systems, documentation systems, and textual discharge letters.</p><p>The experiences and results so far demonstrate the flexibility and effectiveness of the pattern mining and information extraction methods for the presented validation and refinement approach. With different data sources in the warehouse, it is possible to measure the conformity of sonographic results with other methods or inputs. Table <ref type="table" target="#tab_1">1</ref> shows the correlation of the SONOCONSULT-based diagnoses with CT/MR diagnoses, diagnoses listed in the discharge letter, and diagnoses contained in the hospital information system for a first number of cases. Interestingly, the conformity between the SONOCONSULT-based diagnoses and the diagnoses listed in the hospital information system was quite low. Investigating this issue, it became obvious that various diagnoses were not listed in the hospital information system because they were not revenue enhancing. Therefore, we looked at the accordance with the discharge letters, which were found to be highly concordant, at least for the diagnosis of liver metastasis.</p><p>Liver cirrhosis is more difficult to diagnose with ultrasound and has to be in a more advanced stage to be detectable. Therefore, some of the discharge diagnoses "liver cirrhosis" were only found using histology or other methods. In one case liver cirrhosis was listed in the hospital information system but was neither found with ultrasound nor in the discharge letter. 
It turned out that the entry had been made by another department (neurology).</p><p>Within the limited number of examined cases, we found only one case of a pancreatic mass, which was detected in the ultrasound examination and listed in the discharge letter. However, it was not included in the hospital information system.</p><p>The first results on the correlations of diagnoses entered from various sources show a promisingly high conformity between SONOCONSULT and the discharge letters, but for further quality improvement the correlation with other imaging techniques is very important. With a higher number of cases it </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Process Model: Validation of Mixed-Structured Data using Pattern Mining and Information Extraction</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Exemplary study for a selection of cases concerning liver examinations performed by a certain examiner: Conformity of system diagnoses with various sources of diagnosis input. The columns indicate the degree of correlation of the different sources with SONOCONSULT diagnoses measured by the number of covered cases.</figDesc><table /></figure>
		</body>
		<back>


			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Application and Evaluation of a Medical Knowledge-System in Sonography (Sono-Consult)</title>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Buscher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Huettig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hardi</forename><surname>Lührs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans-Peter</forename><surname>Buscher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 18th Europ. Conf. on Artificial Intelligence (ECAI 2008)</title>
				<meeting>18th Europ. Conf. on Artificial Intelligence (ECAI 2008)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="683" to="687" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse</title>
		<author>
			<persName><forename type="first">Jonathan</forename><forename type="middle">C</forename><surname>Prather</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">F</forename><surname>Lobach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Linda</forename><forename type="middle">K</forename><surname>Goodwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joseph</forename><forename type="middle">W</forename><surname>Hales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marvin</forename><forename type="middle">L</forename><surname>Hage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Edward</forename><surname>Hammond</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. AMIA Annual Fall Symposium (AIMA-1997)</title>
				<meeting>AMIA Annual Fall Symposium (AIMA-1997)</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="101" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">CRISP-DM: Towards a Standard Process Model for Data Mining</title>
		<author>
			<persName><forename type="first">Rüdiger</forename><surname>Wirth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jochen</forename><surname>Hipp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 4th Intl. Conf. on the Practical Application of Knowledge Discovery and Data Mining. 2000</title>
				<meeting>4th Intl. Conf. on the Practical Application of Knowledge Discovery and Data Mining. 2000</meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<biblScope unit="page" from="29" to="39" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Data Warehouse-Based Approach for Quality Management, Evaluation and Analysis of Intelligent Systems using Subgroup Mining</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephanie</forename><surname>Beer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS), accepted</title>
				<meeting>22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS), accepted</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="372" to="377" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans-Peter</forename><surname>Buscher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05)</title>
				<meeting>19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05)<address><addrLine>Edinburgh, Scotland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="647" to="652" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An Algorithm for Multi-Relational Discovery of Subgroups</title>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Wrobel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97)</title>
				<meeting>European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97)<address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="78" to="87" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Explora: A Multipattern and Multistrategy Discovery Assistant</title>
		<author>
			<persName><forename type="first">Willi</forename><surname>Klösgen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Knowledge Discovery and Data Mining</title>
				<editor>
			<persName><forename type="first">Usama</forename><forename type="middle">M</forename><surname>Fayyad</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Gregory</forename><surname>Piatetsky-Shapiro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Padraic</forename><surname>Smyth</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ramasamy</forename><surname>Uthurusamy</surname></persName>
		</editor>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="249" to="271" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment</title>
		<author>
			<persName><forename type="first">David</forename><surname>Ferrucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Lally</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat. Lang. Eng</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">3-4</biblScope>
			<biblScope unit="page" from="327" to="348" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">ClearTK: A UIMA Toolkit for Statistical Natural Language Processing</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">V</forename><surname>Ogren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Wetzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC)</title>
				<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Rule-Based Information Extraction for Structured Data Acquisition using TextMarker</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Kluegl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the LWA-2008, Special Track on Knowledge Discovery and Machine Learning</title>
				<meeting>of the LWA-2008, Special Track on Knowledge Discovery and Machine Learning</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Textmarker: A tool for rule-based information extraction</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Kluegl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Puppe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop</title>
				<meeting>Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop</meeting>
		<imprint>
			<publisher>Gunter Narr Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="233" to="240" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Ready for the MACE? The Mining and Analysis Continuum of Explaining Uncovered</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Atzmueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Roth-Berghofer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th SGAI International Conference on Artificial Intelligence. Accepted</title>
				<imprint/>
	</monogr>
	<note>AI-2010</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Combining Labeled and Unlabeled Data with Co-Training</title>
		<author>
			<persName><forename type="first">Avrim</forename><surname>Blum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tom</forename><surname>Mitchell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">COLT: Proceedings of the Workshop on Computational Learning Theory</title>
				<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="92" to="100" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
