VALIDATION OF MIXED-STRUCTURED DATA USING PATTERN MINING AND INFORMATION EXTRACTION

Martin Atzmueller                              Stephanie Beer
University of Kassel                           University-Hospital of Würzburg
Knowledge and Data Engineering                 Gastroontologics Research Group
Kassel, Germany                                Würzburg, Germany
atzmueller@cs.uni-kassel.de                    beer_s@klinik.uni-wuerzburg.de

ABSTRACT

For large-scale data mining utilizing data from ubiquitous and mixed-structured data sources, the appropriate extraction and integration into a comprehensive data-warehouse is of prime importance. Then, appropriate methods for validation and potential refinement are essential. This paper presents an approach applying data mining and information extraction methods for data validation: We apply subgroup discovery and (rule-based) information extraction for data integration and validation. The methods are integrated into an incremental process for continuous validation. The results of a medical application demonstrate that subgroup discovery and the applied information extraction methods are well suited for mining, extracting and validating clinically relevant knowledge.

1. INTRODUCTION

Whenever data is continuously collected, for example, using intelligent documentation systems [1], data mining and data analysis provide a broad range of options for scientific purposes. The mining and analysis step is often implemented using a data-warehouse [2, 3, 4]. For the preprocessing and integration of data from several heterogeneous sources, there exist standardized extract-transform-load (ETL) procedures that need to incorporate suitable data schemas and integration rules. Additionally, for unstructured or semi-structured textual data sources, the integration requires effective information extraction methods. For clinical discharge letters, for example, the structure of the letter is usually non-standardized and thus depends on the writing styles of the different authors.

However, a prerequisite of data mining is the validation and quality assurance of the integrated data. Especially with unreliable extraction and integration methods, the quality of the obtained data can vary significantly. If the data has been successfully validated, then the trust in the data mining results and their acceptance can be increased.

In this paper, we propose an approach for the validation of mixed-structured data using data mining and information extraction, and we propose appropriate refinement options. We focus on a data mining technique for mining local patterns, i.e., subgroup discovery, e.g., [5, 6, 7], which is especially suitable for this task: Local patterns capture local regularities (and irregularities) of the data and are therefore useful for spotting unexpected, contradicting, and otherwise unusual patterns potentially indicating problems and errors in the data.

Concerning the information extraction techniques, we consider popular methods implemented in the UIMA [8] and ClearTK [9] frameworks, and especially focus on the TextMarker system, e.g., [10, 11], for rule-based information extraction. Rules are especially suitable for the proposed information extraction task since they allow a concise and declarative formalization of the relevant domain knowledge that is especially easy to acquire, to comprehend and to maintain. Furthermore, in the case of errors, the cause can easily be identified by tracing the application of the individual rules.

The combined approach enables data mining from heterogeneous sources. The user can specify simple rules that consider features of the text, e.g., structural or syntactic features of the textual content. We focus on an incremental level-wise approach, such that both methods can complement each other in the validation and refinement setting. Furthermore, validation knowledge can be formalized in a knowledge base for assessing known and expected relations in the data.

The approach has been implemented in a clinical application for mining data from clinical information systems, documentation systems, and clinical discharge letters. This application scenario concerns the data integration from heterogeneous databases and the information extraction from textual documents. The experiences and results so far demonstrate the flexibility and effectiveness of the presented approach, making the data mining and information extraction methods suitable components in the mining, validation and refinement process.

2. BACKGROUND

In the following, we briefly summarize the applied methods for data mining and information extraction: subgroup discovery, and rule-based information extraction using TextMarker.

2.1. Subgroup Discovery

Subgroup discovery is a flexible data mining method for discovering local patterns that can be utilized for global modeling in the context of exploratory data analysis, description, characterization and classification.

Subgroup discovery is applied for identifying relations between a (dependent) target concept and a set of explaining (independent) variables. The goal is to describe subsets of the data that have the most unusual characteristics with respect to the concept of interest given by the target variable [6]. For example, the risk of coronary heart disease (target variable) is significantly higher in the subgroup of smokers with a positive family history than in the general population.

In the context of the proposed validation approach, we consider certain gold-standard concepts as targets, as well as target concepts that are true if and only if equivalent concepts from two different sources match. Then, we can identify combinations of factors that cause a mismatch between the concepts. These combinations can then indicate candidates for refinement.

2.2. Rule-based Information Extraction

Information extraction aims at extracting a set of concepts, entities and relations from a set of documents. TextMarker [10, 11] is a robust system for rule-based information extraction. It can be applied very intuitively, since the used rules are especially easy to acquire and to comprehend. Using the extracted information, data records can easily be created in a post-processing step. During 'manual' information extraction, humans often apply a strategy according to a highlighter metaphor: First, top-level text blocks are considered and classified according to their content by coloring them with different highlighters. The contained elements of the annotated text segments are then considered further. The TextMarker system tries to imitate this manual extraction method by formalizing the appropriate actions using matching rules: The rules mark sequences of words, extract text segments or modify the input document depending on textual features.

TextMarker aims at supporting the knowledge engineer in the rapid prototyping of information extraction applications. The default input for the system is semi-structured text, but it can also process structured or free text. Technically, HTML is often the input format, since most word processing documents can be obtained in HTML format or converted appropriately.

3. THE MINING AND VALIDATION PROCESS

Figure 1 depicts the process of validation and refinement of mixed-structured data using pattern mining and information extraction methods. The input of the process is given by data from heterogeneous data sources and by textual documents. The former are processed by appropriate data integration methods adapted to the different sources. The latter are handled by information extraction techniques, e.g., rule-based methods that utilize appropriate extraction rules for the extraction of concepts and relations from the documents. In general, a variety of methods can be applied.

[Fig. 1 diagram: Data Sources feed Data Integration (refine rules, schema); Documents feed Information Extraction (refine model/rules); both feed the Pattern Mining System, whose Pattern Set is checked in Validation & Quality Assurance against Background Knowledge.]
Fig. 1. Process Model: Validation of Mixed-Structured Data using Pattern Mining and Information Extraction

The process supports arbitrary information extraction methods, e.g., automatic techniques like support vector machines or conditional random fields as implemented in the ClearTK [9] toolkit for statistical natural language processing. However, the refinement capabilities vary for the different extraction approaches: While black-box methods like support vector machines or conditional random fields only allow an indirect refinement and adaptation of the model, i.e., based on adapting the input data and/or the method parameters for constructing the model, a white-box approach implemented using rules provides for a direct modification of its model, namely the provided rules. Therefore, we especially focus on rule-based methods due to their rich refinement capabilities.

After the integration and extraction of the data, the result is provided to the pattern mining system, which obtains a set of validation patterns as output. This set is then checked both for internal consistency and against formalized background knowledge. In the case of discrepancies and errors, refinements are proposed for the data integration and/or the information extraction steps. After the rules have been refined, the process iterates with the updated schemas and models.

In the following, we discuss exemplary results obtained from a medical project. We applied data collected by the SonoConsult system, a multifunctional knowledge system for sonography, which has been in routine use since 2002, documenting more than 12000 patients in two clinics. The system covers the entire field of abdominal ultrasound (liver, portal tract, gallbladder, spleen, kidneys, adrenal glands, pancreas, intestine, lymph nodes, abdominal aorta, cava inferior, prostate, and urinary bladder). The data was integrated with the SAP-based i.s.h.med system, and the information extraction techniques were applied to the textual discharge letters of the respective patients; SonoConsult was used for documentation. By integrating different data sources into the warehouse, it is possible to measure the conformity of sonographic results with other methods or inputs. In our evaluations, we applied computer-tomography diagnoses and additional billing diagnoses (from the hospital information system) as a gold-standard.

Table 1 shows the correlation of SonoConsult-based diagnoses with CT/MR diagnoses, diagnoses listed in the discharge letter, and diagnoses contained in the hospital information system for a selection of cases from a certain examiner. It was quite interesting that the conformity between SonoConsult-based diagnoses and the diagnoses contained in the hospital information system was relatively low. Evaluating this issue, it was obvious that various diagnoses were not listed in the hospital information system because they were not revenue enhancing and not relevant for all clinical situations. Therefore, we looked at the accordance with the discharge letters, which were found to be highly concordant at least for the diagnosis of liver metastasis. Liver cirrhosis is more awkward to detect using ultrasound and has to be in a more advanced stage. Therefore, some of the discharge diagnoses "liver cirrhosis" were only detected using histology or other methods.

In some cases, there are discrepancies with respect to the formalized background knowledge that still persist after refinement of the rules and checking of the data sources. In such cases, explanation-aware mining and analysis components provide appropriate solutions for resolving conflicts and inconsistencies. By supporting the user with appropriate justifications and explanations, misleading patterns can be identified, and the background knowledge can be adapted. The decision whether the background knowledge needs to be adapted is made by the domain specialist. As we have described in [12], there are several continuous explanation dimensions in the context of data mining and analysis that can be utilized for improving the explanation capabilities. In the medical domain, for example, patterns are usually first assessed on the abstract level, before they are checked and verified on concrete patient records, i.e., on a very detailed level of abstraction. Then, discrepancies are modeled in the background knowledge, for example, as exception conditions for certain subgroups of patients.

The validation phase is performed on several levels: On the first level, we can use a (partial) gold-standard both for checking the data integration and the information extraction tasks. We only require a partial gold-standard, i.e., a sample of the correct relations, because we need to test the functional requirements of the data integration and extraction phases. On the next level, we can incrementally validate the integrated data using the extracted information, or vice versa, using the mined patterns. In the case of discrepancies, we can rely on the partial gold-standard data for verification, or we can identify potential causes and verify these on concrete cases. Therefore, the final decision on the refinements relies on the user, who reviews all proposed refinements in a semi-automatic approach.

For the refinement steps, we can either extend the (partial) gold-standard, or we perform a boot-strapping approach, using a small gold-standard sample of target concepts for validation, e.g., for validating and refining the information extraction approach, which is in turn used for the validation of the data sources. In the next step, the validation targets can be extended and the process for refinement is applied inversely. The boot-strapping approach for validation and refinement is thus similar to the idea of co-training, e.g., [13], in machine learning, which also starts with a small labeled (correct) dataset and iteratively adapts the models using another co-trained dataset.

4. CONCLUSIONS

This paper presented an approach for the validation of mixed-structured data using information extraction and pattern mining methods. In an incremental approach, data can both be validated and refined with an increasing level of accuracy. The presented approach has been successfully implemented in a medical project targeted at integrating data from clinical information systems, documentation systems, and textual discharge letters. The experiences and results so far demonstrate the flexibility and effectiveness of the pattern mining and information extraction methods for the presented validation and refinement approach.

                    Total Case  SonoConsult  SAP        % Conformity     CT/MR      % Conformity     Discharge Letter  % Conformity
                    Number      Diagnoses    Diagnoses  with SonoConsult Diagnoses  with SonoConsult Diagnoses         with SonoConsult
Liver cirrhosis     16          12           6          20               1          33               9                 50
Liver metastasis    28          16           11         65               15         87               17                94

Table 1. Exemplary study for a selection of cases concerning liver examinations performed by a certain examiner: Conformity of SonoConsult system diagnoses with various sources of diagnosis input. The columns indicate the degree of correlation of the different sources with the SonoConsult diagnoses, measured by the number of covered cases.
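To make the mismatch-target construction of Section 2.1 concrete, the following Python sketch (an illustration only, not the implementation used in the project; the toy data, attribute names, and the choice of a weighted-relative-accuracy-style quality function are our own) scores conjunctive attribute-value descriptions against the binary target "the two sources disagree":

```python
from itertools import combinations

def wracc(n_sg, pos_sg, n, pos):
    """Weighted relative accuracy: coverage * (subgroup rate - default rate)."""
    if n_sg == 0:
        return 0.0
    return (n_sg / n) * (pos_sg / n_sg - pos / n)

def discover_mismatch_subgroups(records, source_a, source_b, attrs, max_depth=2):
    """Score conjunctive attribute=value descriptions against the target
    'the two sources disagree on the concept' and rank them by quality."""
    target = [r[source_a] != r[source_b] for r in records]
    n, pos = len(records), sum(target)
    # candidate selectors: one per observed attribute/value pair
    selectors = sorted({(a, r[a]) for r in records for a in attrs})
    results = []
    for depth in range(1, max_depth + 1):
        for desc in combinations(selectors, depth):
            if len({a for a, _ in desc}) < depth:  # skip attr=x AND attr=y
                continue
            covered = [all(r[a] == v for a, v in desc) for r in records]
            n_sg = sum(covered)
            pos_sg = sum(t for t, c in zip(target, covered) if c)
            results.append((wracc(n_sg, pos_sg, n, pos), desc))
    return sorted(results, reverse=True)

# toy example: letters by author "B" in department "onc" tend to disagree
records = [
    {"author": "A", "dept": "gastro", "dx_letter": "cirrhosis",  "dx_his": "cirrhosis"},
    {"author": "A", "dept": "onc",    "dx_letter": "metastasis", "dx_his": "metastasis"},
    {"author": "B", "dept": "onc",    "dx_letter": "metastasis", "dx_his": "none"},
    {"author": "B", "dept": "onc",    "dx_letter": "cirrhosis",  "dx_his": "none"},
    {"author": "B", "dept": "gastro", "dx_letter": "cirrhosis",  "dx_his": "cirrhosis"},
]
results = discover_mismatch_subgroups(records, "dx_letter", "dx_his", ["author", "dept"])
print(results[0])  # the description most associated with source mismatches
```

The top-ranked description is exactly the kind of "combination of factors causing a mismatch" that the approach flags as a refinement candidate; a real system would of course use a pruned search instead of exhaustive enumeration.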
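The TextMarker rule language itself is described in [10, 11]; as a rough, regex-based illustration of the highlighter metaphor (first classify top-level blocks, then extract only inside matching blocks), consider the following Python sketch, in which all rules, headings and names are our own:

```python
import re

# stage 1 ("highlighting"): tag top-level blocks by their heading
BLOCK_RULES = [
    ("diagnoses", re.compile(r"^diagnos", re.I)),
    ("findings",  re.compile(r"^(findings|sonography)", re.I)),
]

# stage 2: term rules applied only inside blocks with a matching tag
TERM_RULES = {
    "diagnoses": re.compile(r"liver (?:cirrhosis|metastasis)", re.I),
}

def extract(letter):
    """Two-level rule pass: split the letter at 'Heading:' lines, classify
    each block, then run the term rules of its tag inside the block body."""
    records = []
    for block in re.split(r"\n(?=\w[\w ]*:)", letter):
        heading, _, body = block.partition(":")
        for tag, rule in BLOCK_RULES:
            if rule.search(heading.strip()) and tag in TERM_RULES:
                for m in TERM_RULES[tag].finditer(body):
                    records.append({"block": tag, "concept": m.group(0).lower()})
    return records

letter = ("History: known alcohol abuse.\n"
          "Findings: coarse liver texture.\n"
          "Diagnoses: Liver cirrhosis, suspected liver metastasis.")
print(extract(letter))
```

Because both levels are plain rules, a wrong extraction can be traced to the block rule or term rule that fired, which is the refinement property the paper attributes to the rule-based approach.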
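The consistency check against formalized background knowledge described in Section 3 can be sketched as follows; this is a minimal illustration in which the knowledge-base format, the conformity measure, and the thresholds are our own assumptions, not the project's actual knowledge base:

```python
def conformity(records, concept, source_a, source_b):
    """Fraction of cases where source_b confirms a concept documented in source_a."""
    documented = [r for r in records if concept in r[source_a]]
    if not documented:
        return None
    confirmed = [r for r in documented if concept in r[source_b]]
    return len(confirmed) / len(documented)

def check_against_background(records, knowledge_base):
    """Flag every formalized expectation whose measured conformity falls
    below the required minimum; each flag is a candidate for refinement."""
    flags = []
    for concept, source_a, source_b, min_conf in knowledge_base:
        measured = conformity(records, concept, source_a, source_b)
        if measured is not None and measured < min_conf:
            flags.append((concept, source_a, source_b, round(measured, 2)))
    return flags

# toy integrated records: sets of diagnoses per case from two sources
records = [
    {"sono": {"cirrhosis"},  "letter": {"cirrhosis"}},
    {"sono": {"metastasis"}, "letter": {"metastasis"}},
    {"sono": {"cirrhosis"},  "letter": set()},
    {"sono": {"cirrhosis"},  "letter": set()},
]
# expectation: sonography diagnoses should reappear in the discharge letter
kb = [("cirrhosis", "sono", "letter", 0.8), ("metastasis", "sono", "letter", 0.8)]
print(check_against_background(records, kb))
# cirrhosis conformity is 1/3, below the 0.8 expectation -> flagged for review
```

As in the semi-automatic process above, such flags would be presented to the user, who decides whether to refine the extraction rules, the integration schema, or the background knowledge itself.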
In one case, liver cirrhosis was listed in the hospital information system but was neither found with ultrasound nor in the discharge letter; it came out that the input was performed by another department (neurology). Within the limited number of examined cases, we found only one case of pancreatic mass, which was found in the ultrasound examination and listed in the discharge letter; however, it was not included in the hospital information system. The first results of the correlations of diagnoses input by various sources show that there is a promisingly high conformity between SonoConsult and discharge letters, but for further quality improvement the correlation with other imaging techniques is very important.

5. REFERENCES

[1] Frank Puppe, Martin Atzmueller, Georg Buscher, Matthias Huettig, Hardi Lührs, and Hans-Peter Buscher, "Application and Evaluation of a Medical Knowledge-System in Sonography (SonoConsult)," in Proc. 18th Europ. Conf. on Artificial Intelligence (ECAI 2008), 2008, pp. 683-687.

[2] Jonathan C. Prather, David F. Lobach, Linda K. Goodwin, Joseph W. Hales, Marvin L. Hage, and W. Edward Hammond, "Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse," in Proc. AMIA Annual Fall Symposium (AMIA-1997), 1997, pp. 101-105.

[3] Rüdiger Wirth and Jochen Hipp, "CRISP-DM: Towards a Standard Process Model for Data Mining," in Proc. 4th Intl. Conf. on the Practical Application of Knowledge Discovery and Data Mining, 2000, pp. 29-39, Morgan Kaufmann.

[4] Martin Atzmueller, Stephanie Beer, and Frank Puppe, "A Data Warehouse-Based Approach for Quality Management, Evaluation and Analysis of Intelligent Systems using Subgroup Mining," in Proc. 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2009, pp. 372-377, AAAI Press.

[5] Martin Atzmueller, Frank Puppe, and Hans-Peter Buscher, "Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery," in Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05), Edinburgh, Scotland, 2005, pp. 647-652.

[6] Stefan Wrobel, "An Algorithm for Multi-Relational Discovery of Subgroups," in Proc. 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), Berlin, 1997, pp. 78-87, Springer Verlag.

[7] Willi Klösgen, "Explora: A Multipattern and Multistrategy Discovery Assistant," in Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, Eds., pp. 249-271. AAAI Press, 1996.

[8] David Ferrucci and Adam Lally, "UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment," Nat. Lang. Eng., vol. 10, no. 3-4, pp. 327-348, 2004.

[9] P. V. Ogren, P. G. Wetzler, and S. Bethard, "ClearTK: A UIMA Toolkit for Statistical Natural Language Processing," in UIMA for NLP Workshop at Language Resources and Evaluation Conference (LREC), 2008.

[10] Martin Atzmueller, Peter Kluegl, and Frank Puppe, "Rule-Based Information Extraction for Structured Data Acquisition using TextMarker," in Proc. of the LWA-2008, Special Track on Knowledge Discovery and Machine Learning, 2008, pp. 1-7.

[11] Peter Kluegl, Martin Atzmueller, and Frank Puppe, "TextMarker: A Tool for Rule-Based Information Extraction," in Proc. Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop, 2009, pp. 233-240, Gunter Narr Verlag.

[12] Martin Atzmueller and Thomas Roth-Berghofer, "Ready for the MACE? The Mining and Analysis Continuum of Explaining Uncovered," in AI-2010: 30th SGAI International Conference on Artificial Intelligence. Accepted.

[13] Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," in COLT: Proceedings of the Workshop on Computational Learning Theory, 1998, pp. 92-100, Morgan Kaufmann.