=Paper=
{{Paper
|id=None
|storemode=property
|title=Validation of Mixed-Structured Data using Pattern Mining and Information Extraction
|pdfUrl=https://ceur-ws.org/Vol-646/DERIS2010paper1.pdf
|volume=Vol-646
}}
==Validation of Mixed-Structured Data using Pattern Mining and Information Extraction==
Martin Atzmueller
University of Kassel, Knowledge and Data Engineering
Kassel, Germany
atzmueller@cs.uni-kassel.de

Stephanie Beer
University-Hospital of Würzburg, Gastroenterological Research Group
Würzburg, Germany
beer_s@klinik.uni-wuerzburg.de
ABSTRACT

For large-scale data mining utilizing data from ubiquitous and mixed-structured data sources, the appropriate extraction and integration into a comprehensive data warehouse is of prime importance. Then, appropriate methods for validation and potential refinement are essential. This paper presents an approach applying data mining and information extraction methods for data validation: We apply subgroup discovery and (rule-based) information extraction for data integration and validation. The methods are integrated into an incremental process for continuous validation options. The results of a medical application demonstrate that subgroup discovery and the applied information extraction methods are well suited for mining, extracting and validating clinically relevant knowledge.

1. INTRODUCTION

Whenever data is continuously collected, for example, using intelligent documentation systems [1], data mining and data analysis provide a broad range of options for scientific purposes. The mining and analysis step is often implemented using a data warehouse [2, 3, 4]. For the data preprocessing and integration of several heterogeneous sources, there exist standardized extract-transform-load (ETL) procedures that need to incorporate suitable data schemas and integration rules. Additionally, for unstructured or semi-structured textual data sources, the integration requires effective information extraction methods. For clinical discharge letters, for example, the structure of the letter is usually non-standardized, and thus dependent on the different writing styles of different authors.

However, a prerequisite of data mining is the validation and quality assurance of the integrated data. Especially with unreliable extraction and integration methods, the quality of the obtained data can vary significantly. If the data has been successfully validated, then the trust in the data mining results and their acceptance can be increased.

In this paper, we propose an approach for the validation of mixed-structured data using data mining and information extraction, and propose appropriate refinement options. We focus on a data mining technique for mining local patterns, i.e., subgroup discovery, e.g., [5, 6, 7], that is especially suitable for the task: local patterns capture local regularities (and irregularities) of the data and are therefore useful for spotting unexpected, contradicting, and otherwise unusual patterns potentially indicating problems and errors in the data.

Concerning the information extraction techniques, we consider popular methods implemented in the UIMA [8] and ClearTK [9] frameworks, and especially focus on the TextMarker system, e.g., [10, 11], for rule-based information extraction. Rules are especially suitable for the proposed information extraction task since they allow a concise and declarative formalization of the relevant domain knowledge that is especially easy to acquire, to comprehend, and to maintain. Furthermore, in the case of errors, the cause can easily be identified by tracing the application of the individual rules.

The combined approach enables data mining from heterogeneous sources. The user can specify simple rules that consider features of the text, e.g., structural or syntactic features of the textual content. We focus on an incremental, level-wise approach, such that both methods can complement each other in the validation and refinement setting. Furthermore, validation knowledge can be formalized in a knowledge base for assessing known and expected relations in the data.

The approach has been implemented in a clinical application for mining data from clinical information systems, documentation systems, and clinical discharge letters. This application scenario concerns the data integration from heterogeneous databases and the information extraction from textual documents. The experiences and results so far demonstrate the flexibility and effectiveness of the presented approach, which makes the data mining and information extraction methods suitable components in the mining, validation and refinement process.
2. BACKGROUND

In the following, we briefly summarize the applied methods for data mining and information extraction: subgroup discovery, and rule-based information extraction using TextMarker.

2.1. Subgroup Discovery

Subgroup discovery is a flexible data mining method for discovering local patterns that can be utilized for global modeling in the context of exploratory data analysis, description, characterization and classification.

Subgroup discovery is applied for identifying relations between a (dependent) target concept and a set of explaining (independent) variables. Then, the goal is to describe subsets of the data that have the most unusual characteristics with respect to the concept of interest given by the target variable [6]. For example, the risk of coronary heart disease (target variable) is significantly higher in the subgroup of smokers with a positive family history than in the general population.

In the context of the proposed validation approach, we consider certain gold-standard concepts as targets, as well as target concepts that are true if and only if equivalent concepts from two different sources match. Then, we can identify combinations of factors that cause a mismatch between the concepts. These combinations can then indicate candidates for refinement.

2.2. Rule-based Information Extraction

Information extraction aims at extracting a set of concepts, entities and relations from a set of documents. TextMarker [10, 11] is a robust system for rule-based information extraction. It can be applied very intuitively, since the used rules are especially easy to acquire and to comprehend. Using the extracted information, data records can be easily created in a post-processing step. Humans often apply a strategy according to a highlighter metaphor during 'manual' information extraction: First, top-level text blocks are considered and classified according to their content by coloring them with different highlighters. The contained elements of the annotated text segments are then considered further. The TextMarker system tries to imitate this manual extraction method by formalizing the appropriate actions using matching rules: The rules mark sequences of words, extract text segments, or modify the input document depending on textual features.

TextMarker aims at supporting the knowledge engineer in the rapid prototyping of information extraction applications. The default input for the system is semi-structured text, but it can also process structured or free text. Technically, HTML is often the input format, since most word processing documents can be obtained in HTML format, or converted appropriately.

3. THE MINING AND VALIDATION PROCESS

Figure 1 depicts the process of validation and refinement of mixed-structured data using pattern mining and information extraction methods. The input of the process is given by data from heterogeneous data sources and by textual documents. The former are processed by appropriate data integration methods adapted to the different sources. The latter are handled by information extraction techniques, e.g., rule-based methods that utilize appropriate extraction rules for the extraction of concepts and relations from the documents. In general, a variety of methods can be applied.

The process supports arbitrary information extraction methods, e.g., automatic techniques like support vector machines or conditional random fields as implemented in the ClearTK [9] toolkit for statistical natural language processing. However, the refinement capabilities vary for the different extraction approaches: While black-box methods like support vector machines or conditional random fields only allow an indirect refinement and adaptation of the model, i.e., based on adapting the input data and/or the method parameters for constructing the model, a white-box approach implemented using rules provides for a direct modification of its model, namely the provided rules. Therefore, we especially focus on rule-based methods due to their rich refinement capabilities.

After the integration and extraction of the data, the result is provided to the pattern mining system, which obtains a set of validation patterns as output. This set is then checked both for internal consistency and compared to formalized background knowledge. In the case of discrepancies and errors, refinements are proposed for the data integration and/or the information extraction steps. After the rules have been refined, the process iterates with the updated schemas and models.

In the following, we discuss exemplary results obtained from a medical project. We applied data collected by the SonoConsult system, a multifunctional knowledge system for sonography, which has been in routine use since 2002, documenting more than 12000 patients in two clinics. The system covers the entire field of abdominal ultrasound (liver, portal tract, gallbladder, spleen, kidneys, adrenal glands, pancreas, intestine, lymph nodes, abdominal aorta, cava inferior, prostate, and urinary bladder). The data was integrated with the SAP-based i.s.h.med system, and the information extraction techniques were applied to textual discharge letters from the respective patients; SonoConsult was used for documentation. By integrating different data sources into the warehouse it is possible to measure the conformity of sonographic results with other methods or inputs. In our evaluations, we applied computer-tomography diagnoses and additional billing diagnoses (from the hospital information system) as a gold-standard.
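The subgroup discovery setting of Section 2.1, applied to validation, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the exhaustive search over conjunctions of up to two attribute=value selectors, the simple quality function q = n * (p - p0), and the attribute names are all illustrative assumptions. The binary target here is a mismatch flag indicating that two diagnosis sources disagree on a case:

```python
from itertools import combinations

def discover(records, target, attrs, top_k=3):
    """Exhaustive subgroup discovery over conjunctions of up to two
    attribute=value selectors.

    Quality function: q = n * (p - p0), where n is the subgroup size,
    p the target share inside the subgroup, and p0 the target share
    in the whole dataset.
    """
    p0 = sum(r[target] for r in records) / len(records)
    selectors = sorted({(a, r[a]) for r in records for a in attrs})
    scored = []
    for size in (1, 2):
        for conj in combinations(selectors, size):
            sub = [r for r in records if all(r[a] == v for a, v in conj)]
            if not sub:
                continue  # empty subgroups carry no quality
            p = sum(r[target] for r in sub) / len(sub)
            scored.append((len(sub) * (p - p0), conj, len(sub), p))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]

# Toy validation data: the target "mismatch" is 1 iff the two
# diagnosis sources disagree on the case.
records = (
    [{"examiner": "A", "organ": "liver", "mismatch": 0}] * 4
    + [{"examiner": "A", "organ": "kidney", "mismatch": 0}] * 4
    + [{"examiner": "B", "organ": "liver", "mismatch": 1}] * 4
    + [{"examiner": "B", "organ": "kidney", "mismatch": 0}] * 4
)
top = discover(records, "mismatch", ["examiner", "organ"])
# The highest-quality subgroup (examiner=B AND organ=liver) localizes
# where the sources disagree, i.e., a candidate for refinement.
```

In the validation setting described above, such a high-mismatch subgroup would be presented to the user as a refinement candidate for the integration or extraction rules.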
[Figure 1 (diagram): Data Sources feed Data Integration; Documents feed Information Extraction; both results enter the Pattern Mining System, whose Pattern Set is checked in a Validation & Quality Assurance step against Background Knowledge, feeding back "Refine Rules, Schema" to the integration and "Refine Model/Rules" to the extraction.]

Fig. 1. Process Model: Validation of Mixed-Structured Data using Pattern Mining and Information Extraction
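The highlighter metaphor of Section 2.2 (classify top-level blocks first, then extract only within the marked segments) can be illustrated with a small Python sketch. This is not TextMarker's actual rule language; the block labels, the keyword rules, and the concept pattern are purely illustrative assumptions:

```python
import re

# Stage 1 ("highlighting"): classify top-level text blocks by simple
# keyword rules applied to their first line.
BLOCK_RULES = {
    "diagnoses": re.compile(r"^diagnos", re.IGNORECASE),
    "findings": re.compile(r"^finding", re.IGNORECASE),
}

# Stage 2: an extraction rule applied only inside the relevant blocks.
CONCEPT_RULE = re.compile(r"liver (cirrhosis|metastasis)", re.IGNORECASE)

def extract(letter):
    """Imitate the two-stage highlighter strategy: mark blocks first,
    then create data records from the marked segments only."""
    records = []
    for block in letter.split("\n\n"):
        lines = block.splitlines()
        header = lines[0] if lines else ""
        label = next(
            (name for name, rule in BLOCK_RULES.items() if rule.search(header)),
            None,
        )
        if label != "diagnoses":
            continue  # only the diagnosis block is extracted here
        for m in CONCEPT_RULE.finditer(block):
            records.append({"concept": m.group(0).lower(), "section": label})
    return records
```

The resulting records could then be loaded into the warehouse in the post-processing step mentioned in Section 2.2; refining the extraction amounts to editing the declarative rule tables, which is what makes the rule-based (white-box) approach directly adaptable.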
Table 1 shows the correlation of SonoConsult-based diagnoses with CT/MR diagnoses, diagnoses listed in the discharge letter, and diagnoses contained in the hospital information system for a selection of cases from a certain examiner. It was quite interesting that the conformity between SonoConsult-based diagnoses and the diagnoses contained in the hospital information system was relatively low. Evaluating this issue, it was obvious that various diagnoses were not listed in the hospital information system because they were not revenue enhancing and not relevant for all clinical situations. Therefore, we looked at the accordance with the discharge letters, which were found to be highly concordant at least for the diagnosis of liver metastasis. Liver cirrhosis is more awkward to detect using ultrasound and has to be in a more advanced stage. Therefore, some of the discharge diagnoses "liver cirrhosis" were only detected using histology or other methods.

In some cases, there are discrepancies with respect to the formalized background knowledge that still persist after refinement of the rules and checking of the data sources. In such cases, explanation-aware mining and analysis components provide appropriate solutions for resolving conflicts and inconsistencies. By supporting the user with appropriate justifications and explanations, misleading patterns can be identified, and the background knowledge can be adapted. The decision whether the background knowledge needs to be adapted is made by the domain specialist. As we have described in [12], there are several continuous explanation dimensions in the context of data mining and analysis that can be utilized for improving the explanation capabilities. In the medical domain, for example, patterns are usually first assessed on the abstract level, before they are checked and verified on concrete patient records, i.e., on a very detailed level of abstraction. Then, discrepancies are modeled in the background knowledge, for example, as certain exception conditions for certain subgroups of patients.

The validation phase is performed on several levels: On the first level, we can use a (partial) gold-standard both for checking the data integration and the information extraction tasks. We only require a partial gold-standard, i.e., a sample of the correct relations, because we need to test the functional requirements of the data integration and extraction phases. On the next level, we can incrementally validate the integrated data using the extracted information, or vice versa, using the mined patterns. In the case of discrepancies, we can rely on the partial gold-standard data for verification, or we can identify potential causes and verify these on concrete cases. Therefore, the final decision on the refinements relies on the user, who reviews all proposed refinements in a semi-automatic approach.

For the refinement steps, we can either extend the (partial) gold-standard, or we perform a bootstrapping approach, using a small gold-standard sample of target concepts for validation, e.g., for validating and refining the information extraction approach, which is in turn used for the validation of the data sources. In the next step, the validation targets can be extended and the process for refinement is applied inversely. The bootstrapping approach for validation and refinement is thus similar to the idea of co-training, e.g., [13], in machine learning, which also starts with a small labeled (correct) dataset and iteratively adapts the models using another co-trained dataset.

4. CONCLUSIONS

This paper presented an approach for the validation of mixed-structured data using information extraction and pattern mining methods. In an incremental approach, data can both be validated and refined with an increasing level of accuracy. The presented approach has been successfully implemented in a medical project targeted at integrating data from clinical information systems, documentation systems, and textual discharge letters. The experiences and results so far demonstrate the flexibility and effectiveness of the pattern mining and information extraction methods for the presented validation and refinement approach.
                   Total   SonoConsult   SAP         % Conf.   CT/MR       % Conf.   Discharge    % Conf.
                   Cases   Diagnoses     Diagnoses   w/ SC     Diagnoses   w/ SC     Letter Dx.   w/ SC
Liver cirrhosis      16        12            6         20          1         33          9          50
Liver metastasis     28        16           11         65         15         87         17          94

Table 1. Exemplary study for a selection of cases concerning liver examinations performed by a certain examiner: Conformity of system diagnoses with various sources of diagnosis input. The columns indicate the degree of correlation of the different sources with SonoConsult (SC) diagnoses, measured by the number of covered cases.
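A conformity percentage of the kind reported in Table 1 can be computed from diagnosis sets along the following lines. Note that the exact counting scheme used in the study is not specified here, so the overlap-based reading below, as well as the case identifiers, are illustrative assumptions:

```python
def conformity(reference, other):
    """Percentage of cases in `reference` that also appear in `other`
    (one plausible reading of the conformity columns in Table 1)."""
    if not reference:
        return 0.0
    return 100.0 * len(reference & other) / len(reference)

# Hypothetical sets of positive cases per diagnosis source:
sono = {"case1", "case2", "case3", "case4"}  # SonoConsult-positive cases
sap = {"case1", "case2"}                     # hospital-information-system cases
# conformity(sono, sap) -> 50.0
```

In the warehouse, such pairwise conformity scores give a quick screening signal; low scores, as for the hospital information system above, trigger the closer inspection described in the text.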
In one case, liver cirrhosis was listed in the hospital information system but was neither found with ultrasound nor in the discharge letter. It came out that the input was performed by another department (neurology). Within the limited number of examined cases, we found only one case of pancreatic mass, which was found in the ultrasound examination and listed in the discharge letter. However, it was not included in the hospital information system.

The first results of the correlations of diagnoses input by various sources show that there is a promising high conformity between SonoConsult and discharge letters, but for further quality improvement the correlation with other imaging techniques is very important. With a higher number of cases it …

5. REFERENCES

[1] Frank Puppe, Martin Atzmueller, Georg Buscher, Matthias Huettig, Hardi Lührs, and Hans-Peter Buscher, "Application and Evaluation of a Medical Knowledge-System in Sonography (SonoConsult)," in Proc. 18th Europ. Conf. on Artificial Intelligence (ECAI 2008), 2008, pp. 683–687.

[2] Jonathan C. Prather, David F. Lobach, Linda K. Goodwin, Joseph W. Hales, Marvin L. Hage, and W. Edward Hammond, "Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse," in Proc. AMIA Annual Fall Symposium (AMIA-1997), 1997, pp. 101–105.

[3] Rüdiger Wirth and Jochen Hipp, "CRISP-DM: Towards a Standard Process Model for Data Mining," in Proc. 4th Intl. Conf. on the Practical Application of Knowledge Discovery and Data Mining. 2000, pp. 29–39, Morgan Kaufmann.

[4] Martin Atzmueller, Stephanie Beer, and Frank Puppe, "A Data Warehouse-Based Approach for Quality Management, Evaluation and Analysis of Intelligent Systems using Subgroup Mining," in Proc. 22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS). 2009, pp. 372–377, AAAI Press.

[5] Martin Atzmueller, Frank Puppe, and Hans-Peter Buscher, "Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery," in Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05), Edinburgh, Scotland, 2005, pp. 647–652.

[6] Stefan Wrobel, "An Algorithm for Multi-Relational Discovery of Subgroups," in Proc. 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), Berlin, 1997, pp. 78–87, Springer Verlag.

[7] Willi Klösgen, "Explora: A Multipattern and Multistrategy Discovery Assistant," in Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padraic Smyth, and Ramasamy Uthurusamy, Eds., pp. 249–271. AAAI Press, 1996.

[8] David Ferrucci and Adam Lally, "UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment," Nat. Lang. Eng., vol. 10, no. 3-4, pp. 327–348, 2004.

[9] P. V. Ogren, P. G. Wetzler, and S. Bethard, "ClearTK: A UIMA Toolkit for Statistical Natural Language Processing," in UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC), 2008.

[10] Martin Atzmueller, Peter Kluegl, and Frank Puppe, "Rule-Based Information Extraction for Structured Data Acquisition using TextMarker," in Proc. of the LWA-2008, Special Track on Knowledge Discovery and Machine Learning, 2008, pp. 1–7.

[11] Peter Kluegl, Martin Atzmueller, and Frank Puppe, "TextMarker: A Tool for Rule-Based Information Extraction," in Proc. Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop. 2009, pp. 233–240, Gunter Narr Verlag.

[12] Martin Atzmueller and Thomas Roth-Berghofer, "Ready for the MACE? The Mining and Analysis Continuum of Explaining Uncovered," in AI-2010: 30th SGAI International Conference on Artificial Intelligence. Accepted.

[13] Avrim Blum and Tom Mitchell, "Combining Labeled and Unlabeled Data with Co-Training," in COLT: Proceedings of the Workshop on Computational Learning Theory. 1998, pp. 92–100, Morgan Kaufmann.