<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VALIDATION OF MIXED-STRUCTURED DATA USING PATTERN MINING AND INFORMATION EXTRACTION</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Kassel, Knowledge and Data Engineering, Kassel</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For large-scale data mining utilizing data from ubiquitous and mixed-structured data sources, the appropriate extraction and integration into a comprehensive data-warehouse is of prime importance. Furthermore, appropriate methods for validation and potential refinement are essential. This paper presents an approach applying data mining and information extraction methods for data validation: We apply subgroup discovery and (rule-based) information extraction for data integration and validation. The methods are integrated into an incremental process, providing options for continuous validation. The results of a medical application demonstrate that subgroup discovery and the applied information extraction methods are well suited for mining, extracting, and validating clinically relevant knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Whenever data is continuously collected, for example,
using intelligent documentation systems [1], data
mining and data analysis provide a broad range of options
for scientific purposes. The mining and analysis step
is often implemented using a data-warehouse [
        <xref ref-type="bibr" rid="ref1">2, 3, 4</xref>
        ].
For the preprocessing and integration of several
heterogeneous sources, there exist standardized
extract-transform-load (ETL) procedures that need to
incorporate suitable data schemas and integration rules.
Additionally, for unstructured or semi-structured textual
data sources, the integration requires effective
information extraction methods. For clinical discharge letters,
for example, the structure of the letter is usually
non-standardized and thus depends on the individual writing
styles of different authors.
      </p>
      <p>However, a prerequisite of data mining is the
validation and the quality assurance of the integrated data.
Especially concerning unreliable extraction and
integration methods, the quality of the obtained data can
vary significantly. Successful validation, in turn,
increases the trust in the data mining results and their
acceptance.</p>
      <p>The presented approach has been implemented in a clinical
application for mining data from clinical information
systems, documentation systems, and clinical discharge
letters. This application scenario concerns the data
integration from heterogeneous databases and the
information extraction from textual documents. The
experiences and results so far demonstrate the flexibility and
effectiveness of the approach, which makes the
data mining and information extraction methods
suitable components in the mining, validation and
refinement process.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>3. THE MINING AND VALIDATION PROCESS</title>
      <p>In the following, we shortly summarize the methods
for data mining and information extraction, subgroup
discovery, and rule-based information extraction using
TEXTMARKER.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1. Subgroup Discovery</title>
      <p>Subgroup discovery is a flexible data mining method
for discovering local patterns that can be utilized for
global modeling in the context of exploratory data
analysis, description, characterization and classification.</p>
      <p>
        Subgroup discovery is applied for identifying
relations between a (dependent) target concept and a set
of explaining (independent) variables. The goal
is to describe subsets of the data that have the most
unusual characteristics with respect to the concept of
interest given by the target variable [
        <xref ref-type="bibr" rid="ref3">6</xref>
        ]. For example,
the risk of coronary heart disease (target variable) is
significantly higher in the subgroup of smokers with a
positive family history than in the general population.
      </p>
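      <p>To make this concrete, the interestingness of a candidate subgroup is typically scored by a quality function that trades off the size of the subgroup against the deviation of its target share from the population. The following Python sketch computes the common weighted relative accuracy (WRAcc); it is only an illustration of this class of quality functions, not the specific implementation used in our system, and the data is invented to mirror the example above:</p>
      <preformat><![CDATA[
import pandas as pd

def wracc(df, subgroup_mask, target):
    """Weighted relative accuracy: coverage times the difference between
    the target share in the subgroup and in the whole population."""
    sub = df[subgroup_mask]
    if len(sub) == 0:
        return 0.0
    coverage = len(sub) / len(df)
    return coverage * (sub[target].mean() - df[target].mean())

# Hypothetical data mirroring the coronary-heart-disease example:
df = pd.DataFrame({
    "smoker":         [1, 1, 0, 0, 1, 0, 1, 0],
    "family_history": [1, 0, 1, 0, 1, 0, 1, 0],
    "chd":            [1, 0, 0, 0, 1, 0, 1, 0],
})
mask = (df["smoker"] == 1) & (df["family_history"] == 1)
print(wracc(df, mask, "chd"))  # positive: CHD is over-represented here
]]></preformat>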
      <p>In the context of the proposed validation approach,
we consider certain gold-standard concepts as targets,
as well as target concepts that are true if and only if
equivalent concepts from two different sources match.
Then, we can identify combinations of factors that cause
a mismatch between the concepts. These combinations
can then indicate candidates for refinement.</p>
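      <p>As a minimal sketch of this validation setting, the following Python fragment derives such a match target from two invented columns representing the same concept in two sources; subgroup discovery with "mismatch" as target variable then surfaces factor combinations that indicate refinement candidates:</p>
      <preformat><![CDATA[
import pandas as pd

# Hypothetical records: one concept captured by two different sources,
# e.g., the structured finding and the extracted discharge-letter diagnosis.
records = pd.DataFrame({
    "sono_liver_cirrhosis":   [1, 1, 0, 0, 1, 0],
    "letter_liver_cirrhosis": [1, 0, 0, 1, 1, 0],
    "examiner":               ["A", "A", "B", "B", "A", "B"],
})

# Target concept: true iff the equivalent concepts from both sources mismatch.
records["mismatch"] = (
    records["sono_liver_cirrhosis"] != records["letter_liver_cirrhosis"]
).astype(int)

# A coarse first look: mismatch rate per describing factor.
print(records.groupby("examiner")["mismatch"].mean())
]]></preformat>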
    </sec>
    <sec id="sec-5">
      <title>2.2. Rule-based Information Extraction</title>
      <p>
        Information extraction aims at extracting a set of
concepts, entities, and relations from a set of documents.
TEXTMARKER [
        <xref ref-type="bibr" rid="ref4">10, 11</xref>
        ] is a robust system for
rule-based information extraction. It can be applied very
intuitively, since its rules are especially easy to
acquire and to comprehend. Using the extracted
information, data records can easily be created in a
postprocessing step. Humans often apply a strategy
according to a highlighter metaphor during 'manual'
information extraction: First, top-level text blocks are
considered and classified according to their content by
coloring them with different highlighters. The contained
elements of the annotated text segments are then
considered further. The TEXTMARKER system
tries to imitate this manual extraction method by
formalizing the appropriate actions using matching rules:
The rules mark sequences of words, extract text
segments, or modify the input document depending on
textual features.
      </p>
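      <p>TEXTMARKER rules are written in its own rule language; purely to illustrate the two-stage highlighter metaphor described above, the following Python sketch first "colors" top-level blocks by content and then extracts the contained elements of the annotated segments (the patterns and the sample letter are invented):</p>
      <preformat><![CDATA[
import re

def classify_blocks(text):
    """Stage 1: classify top-level text blocks by their content,
    imitating the coloring with different highlighters."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return [("DiagnosisSection", b) if re.match(r"(?i)diagnos", b)
            else ("Other", b) for b in blocks]

def extract_diagnoses(annotated):
    """Stage 2: consider the contained elements of the annotated segments."""
    for label, block in annotated:
        if label == "DiagnosisSection":
            for line in block.splitlines()[1:]:
                yield line.strip(" -")

letter = "Diagnoses:\n- Liver cirrhosis\n- Liver metastasis\n\nFindings: ..."
print(list(extract_diagnoses(classify_blocks(letter))))
# ['Liver cirrhosis', 'Liver metastasis']
]]></preformat>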
      <p>TEXTMARKER aims at supporting the knowledge
engineer in the rapid prototyping of information
extraction applications. The default input for the system is
semi-structured text, but it can also process structured
or free text. Technically, HTML is often the input
format, since most word processing documents can be
obtained in HTML format or converted appropriately.</p>
    </sec>
    <sec id="sec-3">
      <title>3. THE MINING AND VALIDATION PROCESS</title>
      <p>Figure 1 depicts the process of validation and
refinement of mixed-structured data using pattern mining and
information extraction methods. The input of the
process is given by data from heterogeneous data sources
and by textual documents. The former are processed
by appropriate data integration methods adapted to the
different sources. The latter are handled by information
extraction techniques, e.g., rule-based methods that
utilize appropriate extraction rules for the extraction of
concepts and relations from the documents. In general,
a variety of methods can be applied.</p>
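      <p>The overall dataflow can be summarized by the following skeleton; all components are pluggable callables, and the names are placeholders rather than the interfaces of a concrete implementation:</p>
      <preformat><![CDATA[
def mining_and_validation_process(integrate, extract, mine, validate, refine,
                                  sources, documents, rules, schema,
                                  background, max_iterations=10):
    """One possible skeleton of the iterative process: integrate and
    extract, mine validation patterns, check them, refine, and repeat."""
    for _ in range(max_iterations):
        dataset = integrate(sources, schema) + extract(documents, rules)
        patterns = mine(dataset)
        issues = validate(patterns, background)
        if not issues:                      # data validated successfully
            return patterns
        rules, schema = refine(issues, rules, schema)
    return patterns
]]></preformat>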
      <p>The process supports arbitrary information
extraction methods, e.g., automatic techniques like
support vector machines or conditional random fields as
implemented in the ClearTK [9] toolkit for statistical
natural language processing. However, the refinement
capabilities vary for the different extraction approaches:
While black-box methods like support vector machines
or conditional random fields only allow an indirect
refinement and adaptation of the model, i.e., based on
adapting the input data and/or the method parameters
for constructing the model, a white-box approach
implemented using rules provides for a direct
modification of its model, namely the provided rules. Therefore,
we especially focus on rule-based methods due to their
rich refinement capabilities.</p>
      <p>After the integration and extraction of the data, the
result is provided to the pattern mining system, which
produces a set of validation patterns as output. This set
is then checked both for internal consistency and
against formalized background knowledge. In the case
of discrepancies and errors, refinements are proposed
for the data integration and/or the information
extraction steps. After the rules have been refined, the
process iterates with the updated schemas and models.</p>
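      <p>As a simple illustration of the check against formalized background knowledge (the representation below is invented; knowledge-intensive subgroup discovery is discussed in [<xref ref-type="bibr" rid="ref2">5</xref>]), a mined pattern that contradicts an expected direction of influence is flagged as a discrepancy:</p>
      <preformat><![CDATA[
# Expected direction of influence per factor on the target share:
# +1 should increase it, -1 should decrease it.
background = {"smoker": +1, "family_history": +1}

# Mined patterns: (factors, target share in subgroup, share in population).
patterns = [
    (("smoker",), 0.30, 0.10),
    (("family_history",), 0.04, 0.10),   # contradicts the expectation
]

def check_against_background(patterns, background):
    discrepancies = []
    for factors, p_sub, p_all in patterns:
        for factor in factors:
            expected = background.get(factor)
            observed = +1 if p_sub > p_all else -1
            if expected is not None and expected != observed:
                discrepancies.append((factors, factor, expected, observed))
    return discrepancies

print(check_against_background(patterns, background))
]]></preformat>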
      <p>In the following, we discuss exemplary results
obtained from a medical project. We used data
collected by the SONOCONSULT system, a multifunctional
knowledge system for sonography, which has been in
routine use since 2002, documenting more than 12,000
patients in two clinics. The system covers the entire
field of abdominal ultrasound (liver, portal tract,
gallbladder, spleen, kidneys, adrenal glands, pancreas,
intestine, lymph nodes, abdominal aorta, vena cava inferior,
prostate, and urinary bladder). The data was integrated
with the SAP-based i.s.h.med system, and the
information extraction techniques were applied to the textual
discharge letters of the respective patients;
SONOCONSULT was used for documentation. By integrating the
different data sources into the warehouse, it is possible
to measure the conformity of sonographic results with
other methods or inputs. In our evaluations, we applied
computed tomography diagnoses and additional billing
diagnoses (from the hospital information system) as a
gold-standard.</p>
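      <p>A simple way to quantify such conformity is the percentage of cases on which two sources agree on a diagnosis; a minimal sketch with invented per-case flags (other agreement measures are of course possible):</p>
      <preformat><![CDATA[
def conformity(reference, other):
    """Percentage of cases where both sources agree on the presence
    or absence of a diagnosis."""
    matches = sum(r == o for r, o in zip(reference, other))
    return 100.0 * matches / len(reference)

# Hypothetical per-case flags for one diagnosis from two sources:
sono = [1, 1, 0, 1, 0, 1]   # SONOCONSULT-based diagnosis present?
ct   = [1, 0, 0, 1, 0, 1]   # CT/MR diagnosis present?
print("%.0f%% conformity" % conformity(sono, ct))
]]></preformat>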
      <fig id="fig1">
        <label>Fig. 1</label>
        <caption>
          <p>The mining and validation process: heterogeneous data sources and textual documents are processed by data integration and information extraction; the pattern mining system produces a pattern set, which is checked in a validation and quality assurance step against background knowledge, and refinements of the rules, schemas, and models are fed back into the process.</p>
        </caption>
      </fig>
      <p>Table 1 shows the conformity of the SONOCONSULT-based
diagnoses with the CT/MR diagnoses, the diagnoses listed in the
discharge letters, and the diagnoses contained in the
hospital information system for a selection of cases from
a certain examiner. Interestingly, the
conformity between the SONOCONSULT-based diagnoses
and the diagnoses contained in the hospital
information system was relatively low. Evaluating this issue, it
became obvious that various diagnoses were not listed in
the hospital information system because they were not
revenue-enhancing and not relevant in all clinical
situations. Therefore, we looked at the accordance with the
discharge letters, which was found to be high, at
least for the diagnosis of liver metastasis. Liver
cirrhosis is more awkward to detect using ultrasound
and has to be in a more advanced stage. Therefore,
some of the discharge diagnoses "liver cirrhosis" were
only detected using histology or other methods.</p>
      <p>
        In some cases, there are discrepancies with respect
to the formalized background knowledge that still
persist after refinement of the rules and checking the data
sources. In such cases, explanation-aware mining and
analysis components provide appropriate solutions for
resolving conflicts and inconsistencies. By
supporting the user with appropriate justifications and
explanations, misleading patterns can be identified, and the
background knowledge can be adapted. The decision
whether the background knowledge needs to be adapted
is made by the domain specialist. As we have
described in [
        <xref ref-type="bibr" rid="ref5">12</xref>
        ], there are several explanation
dimensions along the mining and analysis continuum
that can be utilized for improving the explanation
capabilities. In the medical domain, for example,
patterns are usually first assessed on the abstract level,
before they are checked and verified on concrete
patient records, i.e., on a very detailed level of
abstraction. Then, discrepancies are modeled in the
background knowledge, for example, as exception
conditions for certain subgroups of patients.
      </p>
      <p>The validation phase is performed on several levels:
On the first level, we can use a (partial) gold-standard
both for checking the data integration and information
extraction tasks. We only require a partial gold-standard,
i.e., a sample of the correct relations, because we need
to test the functional requirements of the data
integration and extraction phases. On the next level, we can
incrementally validate the integrated data using the
extracted information, or vice versa, using the mined
patterns. In the case of discrepancies, we can rely on the
partial gold-standard data for verification, or we can
identify potential causes and verify these on concrete
cases. Therefore, the final decision on the refinements
rests with the user, who reviews all proposed
refinements in a semi-automatic approach.</p>
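      <p>Against such a partial gold-standard, i.e., a sample of correct relations, the extraction quality can be estimated in the usual way; a minimal sketch with invented relations:</p>
      <preformat><![CDATA[
def precision_recall(extracted, gold):
    """Evaluate extracted relations against a (partial) gold-standard
    sample; both arguments are sets of (case_id, concept) tuples."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

extracted = {(1, "liver cirrhosis"), (2, "liver metastasis"), (3, "ascites")}
gold      = {(1, "liver cirrhosis"), (2, "liver metastasis"), (4, "ascites")}
print(precision_recall(extracted, gold))  # roughly (0.67, 0.67)
]]></preformat>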
      <p>
        For the refinement steps, we can either extend the
(partial) gold-standard, or we can perform a bootstrapping
approach, using a small gold-standard sample of
target concepts for validation, e.g., for validating and
refining the information extraction approach, which is in
turn used for the validation of the data sources. In the
next step, the validation targets can be extended and
the process for refinement is applied inversely. The
bootstrapping approach for validation and refinement
is thus similar to the idea of co-training in machine
learning, cf. [
        <xref ref-type="bibr" rid="ref6">13</xref>
        ], which also starts with a small labeled
(correct) dataset and iteratively adapts the models
using another co-trained dataset.
      </p>
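      <p>The bootstrapping idea can be sketched as alternating validation and refinement between the two components, starting from the small gold-standard seed; the component interfaces below are placeholders, not a concrete API:</p>
      <preformat><![CDATA[
def bootstrap(gold_sample, refine_extraction, validate_data, rounds=3):
    """Skeleton of the bootstrapping approach: start from a small set of
    correct target concepts, refine the extraction against it, use the
    refined extraction to validate the data sources, and extend the
    validation targets each round (in the spirit of co-training)."""
    targets = set(gold_sample)
    for _ in range(rounds):
        extraction = refine_extraction(targets)    # validate/refine extractor
        validated = validate_data(extraction, targets)
        targets |= validated                       # extended validation targets
    return targets
]]></preformat>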
    </sec>
    <sec id="sec-6">
      <title>4. CONCLUSIONS</title>
      <p>This paper presented an approach for the validation of
mixed-structured data using information extraction and
pattern mining methods. In an incremental approach,
data can both be validated and refined with an
increasing level of accuracy. The presented approach has been
successfully implemented in a medical project targeted
at integrating data from clinical information systems,
documentation systems, and textual discharge letters.</p>
      <p>The experiences and results so far demonstrate the
flexibility and effectiveness of the pattern mining and
information extraction methods for the presented
validation and refinement approach.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Conformity of SONOCONSULT-based diagnoses with SAP diagnoses, CT/MR diagnoses, and discharge letter diagnoses for a selection of cases from one examiner.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th />
              <th>Total Case Number</th>
              <th>SONOCONSULT Diagnoses</th>
              <th>SAP Diagnoses</th>
              <th>% Conformity with SONOCONSULT</th>
              <th>CT/MR Diagnoses</th>
              <th>% Conformity with SONOCONSULT</th>
              <th>Discharge Letter Diagnoses</th>
              <th>% Conformity with SONOCONSULT</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Liver cirrhosis</td>
              <td>16</td>
              <td>12</td>
              <td>6</td>
              <td>20</td>
              <td>1</td>
              <td>33</td>
              <td>9</td>
              <td>50</td>
            </tr>
            <tr>
              <td>Liver metastasis</td>
              <td>28</td>
              <td>16</td>
              <td>11</td>
              <td>65</td>
              <td>15</td>
              <td>87</td>
              <td>17</td>
              <td>94</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          , Stephanie Beer, and Frank Puppe, “
          <article-title>A Data Warehouse-Based Approach for Quality Management, Evaluation and Analysis of Intelligent Systems using Subgroup Mining,”</article-title>
          <source>in Proc. 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS)</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>377</lpage>
          , AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          , Frank Puppe, and
          <string-name>
            <given-names>Hans-Peter</given-names>
            <surname>Buscher</surname>
          </string-name>
          , “
          <article-title>Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery,”</article-title>
          <source>in Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05)</source>
          , Edinburgh, Scotland,
          <year>2005</year>
          , pp.
          <fpage>647</fpage>
          -
          <lpage>652</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Wrobel</surname>
          </string-name>
          , “
          <article-title>An Algorithm for MultiRelational Discovery of Subgroups,”</article-title>
          <source>in Proc. 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97)</source>
          , Berlin,
          <year>1997</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>87</lpage>
          , Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Kluegl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          , and Frank Puppe, “
          <article-title>Textmarker: A tool for rule-based information extraction</article-title>
          ,”
          <source>in Proc. Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          , Gunter Narr Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Atzmueller</surname>
          </string-name>
          and Thomas Roth-Berghofer, “
          <article-title>Ready for the MACE? The Mining and Analysis Continuum of Explaining Uncovered,”</article-title>
          <source>in AI2010: 30th SGAI International Conference on Artificial Intelligence. Accepted.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Avrim</given-names>
            <surname>Blum</surname>
          </string-name>
          and Tom Mitchell, “
          <article-title>Combining Labeled and Unlabeled Data with Co-Training,”</article-title>
          <source>in COLT: Proceedings of the Workshop on Computational Learning Theory. 1998</source>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>100</lpage>
          , Morgan Kaufmann.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>