=Paper=
{{Paper
|id=None
|storemode=property
|title=Decision-Maker-Aware Design of Descriptive Data Mining
|pdfUrl=https://ceur-ws.org/Vol-646/DERIS2010paper6.pdf
|volume=Vol-646
}}
==Decision-Maker-Aware Design of Descriptive Data Mining==
<pdf width="1500px">https://ceur-ws.org/Vol-646/DERIS2010paper6.pdf</pdf>
<pre>
        DECISION-MAKER-AWARE DESIGN OF DESCRIPTIVE DATA MINING

       Benedikt Kaempgen                      Florian Lemmerich                       Martin Atzmueller

          Karlsruhe Institute of               University of Würzburg                   University of Kassel
           Technology (AIFB)            Department of Computer Science VI        Knowledge and Data Engineering
          Karlsruhe, Germany                    Würzburg, Germany                    Group, Kassel, Germany
      benedikt.kaempgen@kit.edu       lemmerich@informatik.uni-wuerzburg.de         atzmueller@cs.uni-kassel.de


                      ABSTRACT                                    In this context, the contribution of this work is three-
                                                              fold: First, we propose a process-oriented design for
This paper presents two real-world case studies focus-        describing and performing projects in the context of
sing on descriptive data mining for decision-makers.          decision-maker-aware descriptive data mining. Second,
For that, we first propose a process-oriented design of       since only few descriptions of successful data mining
descriptive data mining that helps in describing and          projects that concentrate on decision-makers as well
performing such projects. Finally, we discuss impor-          as the development team are available, we present two
tant lessons learned during the implementation of the         such case studies. Third, we discuss specific experi-
respective projects.                                          ences and lessons learned during the implementation of
                                                              the case studies. Altogether, it is our motivation to en-
                                                              able more successful descriptive data mining projects.
                1. INTRODUCTION
                                                                  The rest of the paper is structured as follows: Sec-
                                                              tion 2 discusses related approaches. After that, Sec-
With the implementation and collection of data in rou-        tion 3 presents the process-oriented design for describ-
tine fashion, e.g., in industrial, medical, administrative    ing and performing the case studies. Next, the im-
and social-web-based scenarios, the analysis and min-         plemented case studies are described in detail. Sec-
ing of such accumulated data is of prime importance           tion 4 reports specific experiences and lessons learned
for intelligent decision support. However, currently up       obtained during the implementation of the case studies.
to 60% [1] of data mining projects fail. One problem          Finally, Section 5 concludes the paper with a summary
concerns the integration of the key stakeholders in data      and interesting directions for future work.
mining projects, i.e., the decision-makers. They need to
be tightly integrated into the project, similar to the ac-
tual data mining engineers. Thus, in order to improve                         2. RELATED WORK
the common understanding on goal, approach and out-
come a more transparent data mining process consid-           In the following, we describe related work that deals
ering both developer team and decision-maker is rather        with data mining design and implementations.
important.                                                        Process models provide a high-level overview of
     In this paper, we consider two case studies: The         the input and output of required data mining tasks. Ac-
first one is concerned with the analysis of the success       cording to Kurgan and Musilek [2] CRISP-DM [3] is
and failures of (bachelor) student groups in order to         most prominently used in data mining projects. It con-
help decision support for improving the success rate          sists of six iteratively executed phases: Business Un-
of individual curricula. The second one is concerned          derstanding and Data Understanding make sure that
with the evaluation of a web-based training system and        the developer team has necessary background know-
aims, e.g., at analyzing the outcomes of different study      ledge to deal with the problem of the decision-maker.
groups and their learning differences.                        In Data Preparation the available data is transformed
     We focus on approaches for obtaining descriptive         for analysis, e.g., by selection, cleaning, construction,
reports and descriptive data mining models, e.g., local       transformation and integration. In the Modeling step
patterns and rules as actionable knowledge for decision       data mining techniques (algorithms) are applied to the
support. Descriptive data mining focuses on describing        prepared data to extract information and knowledge.
the data by the discovered patterns and relations: In         In the Evaluation these results are evaluated, validated
contrast to predictive data mining no specialized model       and checked against the data mining objectives. Fi-
is extracted (for later prediction or classification) but a   nally, in the Deployment phase the results are employed
set of patterns and/or relations is mined for characteriz-    for action, i.e., integrated into the respective processes
ing and describing the data and its hidden components.        of the decision-maker.
    Marbán et al. [1] discuss the evolution of data          available and useful data, e.g., the data representation
mining into an engineering discipline. They emphasize        or the data acquisition process, while domain experts
that successful projects take more than CRISP-DM’s           hold knowledge of the application area.
Development Processes: Organizational Processes in-
fluence the whole organization in which data mining          3.1.2. Focused Processes
techniques are being used, e.g., continuous improve-
ment and training or establishing of an appropriate data     We focus on three components (see Figure 1 for an
mining infrastructure. Project Management Processes          overview): First, decision-maker processes are mainly
assure successful project planning, e.g., by continuous      related to the decision-maker, considering his or her
communication with the decision-maker. Furthermore,          specific needs. They include project definition, engi-
Integral Processes support the development, e.g., doc-       neering of data mining requirements and result presen-
umentation or configuration management. Although             tation. Second, developer team processes deal with
process models help developer teams and decision-ma-         techniques and systems that enable the developer team
kers to understand what to do in data mining projects,       to fulfill the requirements and obtain useful results.
they do not describe how it can be done.                     Third, organization processes cover functions shared
    In contrast, methodologies, e.g., Catalyst [4] fea-      by different projects.
ture step-by-step guidance to data mining. However,
as methodologies are more dependent on current tech-
niques and systems, they are difficult to keep up to date.
    Most case studies describe how techniques and sys-
tems can be applied in a specific project and concrete
application domain. However, while many case studies
of data mining projects have been presented (e.g., [5]),
they are primarily used for demonstration of specific
tools, results or techniques and therefore are seldom
more generally applicable.

                 3. CASE STUDIES

In this section, we present two case studies. After pre-        Fig. 1. Case Study Design w/ Information Flow
senting the process-oriented design, we discuss each
one in detail.

3.1. Process-Oriented Design                                 Decision Maker Processes Based on interviews with
                                                             the decision-maker and possible feasibility studies, the
Following Yin’s [6] recommendations for well-designed        developer team proposes a data mining approach to the
case studies the purpose of the covered case studies is      decision-maker’s problem in a Business Case document
thoroughly describing how descriptive data mining can        written “in management terms” [4, p. 205] and asks for
be successfully applied. As such the case studies are        his approval. The Business Case is a central document
aimed at readers with both some technical background         for any data mining project. It should include the back-
and business interest that consider data mining tech-        ground and motivation of the project, an explicit state-
niques in a project.                                         ment of the problem tackled by the project, a detailed
                                                             description of the current situation and available data,
3.1.1. Focused Roles                                         recommended and alternative solutions, a project plan
                                                             with time and cost estimations and a glossary.
On the one hand the decision-maker intends to bene-               As decision-maker and developer team mostly have
fit from data mining techniques. More precisely, the         different backgrounds, exact specification of suitable
decision-maker has access to raw data and expects de-        project requirements is a tedious but essential
scriptive data mining techniques to extract information      task in descriptive data mining [8].
suitable to support his decision(s). The needs of the             For that, the problem is restated in single “report-
decision-maker are formalized as requirements.               ing type questions” [9] asking for attribute-value-pairs
     On the other hand, the team of developers intends to    in tabular form describing instances of an object. These
fulfill the specified requirements by applying descrip-      single Data Reports are then possibly analyzed further
tive data mining tasks. The team usually consists of         by “deeper analytic questions” [9] asking for hidden
three kinds of experts [7]: Data mining experts are fa-      Data Patterns retrieved by techniques ranging from sim-
miliar with data mining techniques and the respective        ple visualizations with diagrams or charts up to cluster-
tools. Data experts offer thorough understanding of          ing or classification by machine learning algorithms.
To improve the decision-maker’s understanding of the           implementing an entity relationship model or multi-
requirements both Data Reports and Patterns may be             dimensional model and/or effective querying through
illustrated by (fictional) examples. Additionally, possi-      SQL or MDX 1 is supported by specialized data ware-
bilities for evaluation might be given, e.g., background       house components. A data reporting component makes
information and other (secondary) data.                        it possible to customize data exports (CSV, ARFF) and
     A Business Case is not a static document. In fact,        to create reports with flexible layout information in var-
especially requirements will be exposed to constant            ious formats (e.g., PDF, XLS). A data mining com-
changes. These are mainly due to results from devel-           ponent is able to read such exports and use data min-
opment processes and have strong influence on the life         ing techniques (e.g., diagrams, correlation coefficients,
cycle of a data mining project. In a successful project        subgroup discovery) on their data in order to make data
each requirement is fulfilled and documented in a Busi-        patterns accessible. Finally, a documentation compo-
ness Story [4, p. 509].                                        nent supports web-based content management of ob-
                                                               jects, attributes and relationships.
                                                                   The utilized documentation structure also provided
Developer Team Processes By preparing a Data As-               the necessary information for an extensive description
say [4, p. 278] Business Understanding, Data Under-            of the case studies.
standing and Data Preparation from CRISP-DM are im-
plemented. It involves a concise description of the raw
data, that is made available in a precisely specified tab-     3.2. Case Study I: Student Performance Evaluation
ular form. Additionally, quality issues, for example
                                                               In the following, we describe the decision-maker pro-
missing values, should be mentioned explicitly.
                                                               cesses, the developer team processes, and the organiza-
     Data Preparation is done by making all necessary          tional aspects of the bachelor project.
data available in a Data Warehouse. The team identifies
objects, attributes and relationships within the raw data
and integrates them in an entity relationship model. Fur-      3.2.1. Decision Maker Processes
thermore, data cubes are developed as a more subject-
                                                               In Germany, the introduction of standardized bachelor
oriented view, if required. Each cell within a data cube
                                                               degrees has been exposed to much criticism lately.
can be described by shared attributes (dimensions) and
                                                                   Therefore, for objective assessment on university
aggregated attributes (measures). From these data
                                                               level an in depth analysis is needed. Basic analytic
cubes, a multidimensional model [10] is developed.
                                                               questions to justify changes in the curriculum are for
     Next, the team creates Data Reports, which consist
                                                               example: “How do important measures of bachelor de-
of a query from the data warehouse and additional lay-
                                                               grees evolve?”, “How do important measures of exams
out information, e.g., a title or content explaining notes.
                                                               evolve?” or “What performance do current students
Additional information can also be included as seman-
                                                               achieve?”.
tic annotations [11, 12], providing additional presenta-
                                                                   The raw data for this project was provided by uni-
tion possibilities and extended exchangeability. Based
                                                               versity administration. Since this data includes private
on these reports the team applies data mining algorithms
                                                               student data, it was very carefully selected and pre-
to acquire Data Patterns specified in the requirements.
                                                               cautiously pseudonymized. The legal process for get-
Both data reports and mined patterns are evaluated and
                                                               ting permission to access the sensitive data took several
attached to the business story.
                                                               months in total. The data includes information on:

Organization Processes To support knowledge man-                  1. Enrollment information, with the actual semester,
agement between projects a standardized way of doc-                  number of past semesters and degree of all bach-
umentation is necessary. Instead of using single docu-               elor students.
ments, we utilize a Knowledge Base, cf., [13], that sup-
ports references and more efficient searching. Based              2. Exam information, with subject, number of achiev-
upon these approaches, we have designed an object-                   able credits, number of lecture hours per week
oriented documentation structure, that keeps track of                and the type of exam, e.g., module or submod-
various objects, e.g., goals, tasks, results, tools and doc-         ule.
uments, and their relationships, and makes these crucial
                                                                  3. Information about student performance in an exam,
experiences also available across different projects.
                                                                     with pass/fail status, achieved credits and mark.
    Also, a project can only be executed if an appro-
priate Infrastructure of hardware and software is avail-          4. Curricula information, that for each student sep-
able. For the different steps of our case study design               arately defines categories to exams, e.g., obliga-
highly specialized software components are available.                tory or compulsory.
For the Data Assay, for example, an ETL (Extraction,
Transformation, Loading) component can be used, while            1 http://msdn.microsoft.com/en-us/library/aa216767(SQL.80).aspx
     Exemplary requirements, on which the head of the       relationships. Due to the complexity of SQL queries
university faculty of (for example) biology, as a rele-     required for the data mining tasks, the ER-model was
vant decision-maker and the developer team might have       transformed into a multidimensional model. It con-
agreed, are described as follows: As a Data Report, for     tained two data cubes, one of enrollments and one of
each current student of biology the starting semester,      single performances.
number of past semesters, number of university seme-            Both an enrollment and a single performance are
sters, sum of credits, average credits per semester and     described by the student, the semester, the number of
overall average grade should be presented. Addition-        past semesters, the bachelor degree and an indica-
ally, the last two measures should be provided for each     tion of whether that student is still enrolled in the current
category of exam separately. As Data Patterns, for a        semester. Each single performance is further described
better overview the reports were to be sorted on the        by the status, the exam and the type and category of the
number of past semesters and the sum of credits. Also,      exam. For a data cell in the enrollment cube the number
the histogram of credit points acquired by all students     of individual students and both the minimal and maxi-
should be provided. This diagram was expected to re-        mal number of past semesters can be calculated. For a
veal the number of very unsuccessful (and therefore         data cell of single performances the sum, number and
likely to fail) and very successful (e.g., students al-     average of marks and the sum of credits can be calculated.
ready going to university before the end of college) stu-       Now the team created reports based on data queries
dents. Finally, student groups with low/high numbers        in MDX and specified layout information according to
of semesters and particularly bad/low marks were to be      the requirements. Additionally, exports for tools spe-
discovered. This might extract information such as “students  cialized in advanced pattern discovery were created.
in their fifth semester have an average mark of 2.0, stu-   In this case distribution diagrams were created and sub-
dents in their second semester have an average mark of      group discovery tasks were performed.
3.1, whereas all students have an average mark of 2.6”.
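
The kind of statement quoted above (average mark per semester group versus the overall average)
can be illustrated with a minimal sketch. It is not the subgroup discovery algorithm used in the
project; it only compares group means. The records, the function names and the thresholds
min_size and min_delta are hypothetical and chosen so that the numbers match the quoted example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (semester, mark) records; German marks, where lower is better.
records = [(5, 1.7), (5, 2.3), (2, 3.0), (2, 3.2), (3, 2.8)]


def subgroup_means(rows):
    """Return the overall mean mark and the mean mark per semester."""
    by_semester = defaultdict(list)
    for semester, mark in rows:
        by_semester[semester].append(mark)
    overall = mean(mark for _, mark in rows)
    return overall, {s: mean(marks) for s, marks in by_semester.items()}


def deviating_subgroups(rows, min_size=2, min_delta=0.3):
    """List (semester, mean, delta) for groups that differ from the overall mean.

    min_size and min_delta are illustrative thresholds, not values from the paper.
    """
    overall, per_semester = subgroup_means(rows)
    sizes = defaultdict(int)
    for semester, _ in rows:
        sizes[semester] += 1
    return [
        (s, m, m - overall)
        for s, m in sorted(per_semester.items())
        if sizes[s] >= min_size and abs(m - overall) >= min_delta
    ]


if __name__ == "__main__":
    for semester, group_mean, delta in deviating_subgroups(records):
        print(f"semester {semester}: mean {group_mean:.1f} ({delta:+.1f} vs. all students)")
```

A dedicated tool would additionally weigh subgroup size against deviation with a quality function;
the fixed thresholds here only stand in for that step.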
     During the project life cycle these requirements were
adapted several times. E.g., the formula for the compu-     3.2.3. Organization Processes
tation of the overall average grade was not sufficiently    As infrastructure three separate computer systems (each
specified at the project start. Furthermore, highly de-     common 32-bit machines, 2 GHz, 2 GB RAM) were
tailed requirements on the layout of result representa-     used: On one workstation the team mainly used Pen-
tions evolved. Since the utilized open source reporting     taho Data Integration2 for the ETL processes and both
software could not sufficiently support these require-      VIKAMINE3 and Weka4 for data mining. On a server,
ments, tailored project-specific Java programs were ad-     MySQL and Pentaho Mondrian OLAP5 were used for
ditionally developed.                                       the data warehouse and Pentaho Business Intelligence
     As part of the resulting business story the data re-   Platform6 was used for creating the data reports. As
port was given to the heads of faculties and provided       knowledge base the team used Semantic MediaWiki7
insight into the students’ overall performance. The cre-    on another server (for an overview, see Figure 2).
dit distribution indicated a credit threshold for likely-
to-fail students suitable for an automatic warning sys-
tem that proposes these students for an additional men-
toring program. The analysis of influences on student
performance indicators will be extended in the future
with more information, e.g., survey answers, nationality,
gender or age. Such influences might suggest actions towards
a more adequate degree program. However, interpreta-
tions should be made carefully. Students studying
two-subject bachelor degrees need fewer credits in each
subject and may therefore appear to perform poorly in
comparison to others. Separating these student groups is                  Fig. 2. Bachelor Infrastructure
left to a follow-up project.
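
The credit histogram and the threshold-based warning described above can be sketched as follows.
This is an illustration only: the student tuples, the bin width and the threshold of 15 credits
per semester are hypothetical, since the paper does not state the threshold that was found.

```python
from collections import Counter

# Hypothetical (pseudonym, sum_of_credits, past_semesters) tuples.
students = [("s01", 12, 2), ("s02", 58, 2), ("s03", 20, 3), ("s04", 90, 3)]


def credit_histogram(credits, bin_width=10):
    """Count students per credit bin; keys are the lower bin edges."""
    return Counter((c // bin_width) * bin_width for c in credits)


def flag_for_mentoring(rows, min_credits_per_semester=15):
    """Return pseudonyms whose average credits per semester fall below the threshold.

    The default threshold is an assumption made for this sketch.
    """
    return [
        pseudonym
        for pseudonym, credits, semesters in rows
        if semesters > 0 and credits / semesters < min_credits_per_semester
    ]


if __name__ == "__main__":
    print(sorted(credit_histogram(c for _, c, _ in students).items()))
    print(flag_for_mentoring(students))
```

Because students of two-subject degrees collect fewer credits per subject, such a rule would need
a separate threshold per degree type before it is used for actual warnings.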

                                                                The results of the project provided valuable in-
3.2.2. Developer Team Processes                             sights on the performance of the students, on an au-
                                                            tomated and on-demand basis.
The developer team first imported several CSV file ex-
                                                              2 http://kettle.pentaho.org/
ports from the university information system into the
                                                              3 http://www.vikamine.org/
data warehouse system. Based on that data, the team           4 http://www.cs.waikato.ac.nz/ml/weka/
developed an entity relationship model made of five           5 http://mondrian.pentaho.org/
entities: Enrollment, person, exam, performance and           6 http://community.pentaho.com/projects/bi_platform/

exam category, each further described by attributes and       7 http://www.semantic-mediawiki.org/
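
The enrollment cube of Case Study I (cells described by shared dimensions, with the number of
individual students and the minimal and maximal number of past semesters as measures) can be
sketched in a few lines. This is a simplified in-memory illustration, not the Mondrian/MDX setup
used in the project; the fact tuples and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical enrollment facts: (student, semester, degree, past_semesters).
FIELDS = ("student", "semester", "degree", "past_semesters")
enrollments = [
    ("s01", "2009W", "Biology", 0),
    ("s02", "2009W", "Biology", 2),
    ("s01", "2010S", "Biology", 1),
    ("s03", "2010S", "Physics", 4),
]


def enrollment_cube(facts, dims=("semester", "degree")):
    """Aggregate facts into cells keyed by the chosen dimensions.

    Measures per cell: number of distinct students, minimal and maximal
    number of past semesters.
    """
    cells = defaultdict(lambda: {"students": set(), "past": []})
    for fact in facts:
        row = dict(zip(FIELDS, fact))
        key = tuple(row[d] for d in dims)
        cells[key]["students"].add(row["student"])
        cells[key]["past"].append(row["past_semesters"])
    return {
        key: {
            "n_students": len(cell["students"]),
            "min_past": min(cell["past"]),
            "max_past": max(cell["past"]),
        }
        for key, cell in cells.items()
    }


if __name__ == "__main__":
    print(enrollment_cube(enrollments))                    # one cell per semester and degree
    print(enrollment_cube(enrollments, dims=("degree",)))  # roll-up over semesters
```

Passing fewer dimensions corresponds to a roll-up: the same facts are aggregated into coarser cells.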
3.3. Case Study II: E-Learning system evaluation             form. Then, the team developed an entity-relationship
                                                             model made of eight entities: student, case, case ex-
Again, the processes centric to the decision-maker, the      ecution, evaluation, exam result, score, score action
developer team and the organization are discussed.           and case action. A multidimensional model consist-
                                                             ing of three cubes was added for better querying. Each
3.3.1. Decision Maker Processes                              cube is described by several partially shared dimen-
                                                             sions, e.g., student, case and date of execution. A case
Students at the University of Würzburg are offered
                                                             action is further described by the time of action (be-
exam-relevant case-based training courses. The ben-
                                                             ginning and end of case execution) and the kind of
efits of such a learning system need to be evaluated
                                                             action (e.g., pause, case summary, link). A case exe-
regularly. Exemplary questions include: “What influ-
                                                             cution is further described by the exam that execution
ence does learning with the system have on exam per-
                                                             was relevant to. For a data cell of case execution ac-
formances?” or “How satisfied are users of the learning
                                                             tions the number and overall time of the actions can
system?”. User logs can provide useful data to answer
                                                             be calculated. For a data cell of case executions, the
such questions:
                                                             measures include, e.g., the number of case executions, the av-
   1. Log data tracks information about users learning       erage overall score, the overall time and the average
      with single cases. Each case execution consists        performance of corresponding exams. For a data cell
      of questions each offering a single score that is      of scores the number of scores, the average score and
      accumulated to a total score. The log data also        the average/overall time taken for viewing the question
      contains information on the usage of help func-        and answer hints can be calculated. Similar to the bach-
      tions, e.g., asking for background information,        elor case study, the developer team now designed data
      reading hints or taking a break. Furthermore, at       reports and exports as stated in the requirements, e.g.,
      the end of most cases the user is asked for sys-       correlation mining.
      tem evaluation: A mark about the case and the
      system and some textual feedback.                      3.3.3. Organization Processes
   2. Meta information contains additional facts about       The Organization processes were executed similarly to
      cases: The form of case evaluation and the time        the bachelor case study. Both projects could not only
      the author expects a user to finish a case.            use the same knowledge base but basically rely on the
                                                             same infrastructure.
   3. Exam results are available for some courses sup-           For examining the learning behavior of the students
      ported by case-based training.                         using the CaseTrain system, the performed reports and
                                                             descriptive data mining results proved promising. There-
     Exemplary requirements can be described as fol-         fore, similar data mining approaches will be implemen-
lows: As a Data Report, for each exam result of a stu-       ted as routine mechanisms within the CaseTrain system
dent the number of processed cases, the overall time         in the near future.
used for learning with the system, the average overall
practice score and the mark and percentage of correct
answers in the exam are presented in tabular form. As                      4. LESSONS LEARNED
Data Patterns, correlations between the engagement of
the students with the system and their performances at       From the case studies we could obtain several lessons
the exam should be discovered, e.g., using a scatter plot    learned: The proposed methodology appears to be gen-
and correlation coefficients. This requirement was ini-      erally applicable: Both projects – though substantially
tially expected to show a high influence of a student’s      different in domain and requirements – were success-
effort with the system and his exam results, showing the     fully finished; Data Reports in tabular form are flexible
effectiveness of the system. While providing promising       enough to contain most kinds of information; from sim-
results, however, no statistically significant correlation   ple diagrams to sophisticated machine learning algo-
was discovered, in contrast to expectations: This is pos-    rithms – Data Patterns include the whole range of tech-
sibly due to influences on student performance that          niques to retrieve knowledge from this preprocessed
were not considered, e.g., prior knowledge of students,      raw data. Moreover, for most necessary components
and due to a limited availability of (external) exam re-     open source software is available.
sults in the considered sample of data.                          More than 70% of development time was used for
                                                             the Data Assay and Data Warehouse. Changes to the
3.3.2. Developer Team processes                              data structure, e.g., when adding new features, result in
                                                             significant additional work. Versioning and refactor-
The developer team first imported the provided data          ing of raw data description and preprocessing steps that
into the data warehouse system. This was a non-trivial       get repeated several times would have been useful and
task, since some data was available in a semi-structured     seem essential in bigger projects.
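
The correlation requirement of Case Study II (engagement with the system versus exam performance)
could be checked along the following lines. This is a sketch under stated assumptions: the paper
does not say which significance test was used, so a permutation test is swapped in here, and the
sample values, function names and parameters are hypothetical.

```python
import math
import random

# Hypothetical per-student values: processed cases and percentage of correct exam answers.
cases_done = [3, 10, 7, 1, 12, 5]
exam_percent = [55, 70, 62, 60, 80, 58]


def pearson_r(xs, ys):
    """Pearson correlation coefficient; both inputs must vary (non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    print("r =", pearson_r(cases_done, exam_percent))
    print("p =", permutation_p_value(cases_done, exam_percent))
```

With only a few exam results available, as reported for this case study, even a sizeable
coefficient can fail to reach significance, which is consistent with the outcome described above.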
     Intensive documentation obviously is crucial for        [2] Lukasz A. Kurgan and Petr Musilek, “A Survey
long-running data mining projects, especially if team            of Knowledge Discovery and Data Mining Pro-
members change. By documenting not only the project              cess Models,” Knowl. Eng. Rev., vol. 21, no. 1,
itself, but also sharing experiences and best practices,         pp. 1–24, 2006.
e.g., on applied tools and techniques, the documenta-
                                                             [3] Pete Chapman, Julian Clinton, Randy Ker-
tion of one project proved to be extremely helpful for
                                                                 ber, Thomas Khabaza, Thomas Reinartz, Colin
the other. Further cross-project benefits were achieved,
                                                                 Shearer, and Rudiger Wirth, “CRISP-DM 1.0
since both projects shared a common infrastructure of
                                                                 Step-by-step Data Mining Guide,” Tech. Rep.,
hardware and software.
                                                                 The CRISP-DM consortium, August 2000.
     Legal aspects of a project should be addressed very
early in a project, since the reviewing of data privacy      [4] Dorian Pyle, Business Modeling and Data Min-
issues and the integration of additional data can require        ing, Morgan Kaufmann Publishers Inc., San Fran-
a substantial amount of time. For several long-running           cisco, CA, USA, 2003.
projects, a framework of tools as used here seems
                                                             [5] Michael Brydon and Andrew Gemino, “Classifi-
crucial due to synergistic effects. The projects
                                                                 cation Trees and Decision-Analytic Feedforward
could be executed exclusively using open source sys-
                                                                 Control: A Case Study from the Video Game In-
tems. However, some components of current open-
                                                                 dustry,” Data Min. Knowl. Discov., vol. 17, no. 2,
source systems proved insufficient to meet project
                                                                 pp. 317–342, 2008.
requirements, e.g., highly specialized layouting of the
results. Specifically tailored scripts were suitable to      [6] Robert K. Yin, Case Study Research, Number 5
fill this gap. This combination of a tool suite for gen-         in Applied social research methods series. Sage,
eral purpose tasks and additional project specific imple-        Thousand Oaks, Calif. [u.a.], 4. ed. edition, 2009.
mentations seems to be well suitable to handle highly
specialized requirements.                                    [7] Sarabot S. Anand and Alex G. Buchner, Deci-
                                                                 sion Support Using Data Mining, Trans-Atlantic
                                                                 Publications, 1998.
                 5. CONCLUSIONS
                                                             [8] Paola Britos, Oscar Dieste, and Ramón García-
This paper presented two case studies of successful de-          Martínez, “Requirements Elicitation in Data Min-
scriptive data mining projects in two different contexts,        ing for Business Intelligence Projects,” in Ad-
i.e., the context of the analysis of university students         vances in Information Systems Research, Educa-
performance and in usage data evaluation of an e-learn-          tion and Practice. 2008, pp. 139–150, Springer
ing system. We proposed a decision-maker-aware ap-               Boston.
proach for descriptive data mining, and discussed im-        [9] Ron Kohavi, Llew Mason, Rajesh Parekh, and Zi-
portant lessons learned. In the future, in order to fully        jian Zheng, “Lessons and Challenges from Min-
evaluate the decision-maker-awareness, retrieve general          ing Retail E-Commerce Data,” Mach. Learn., vol.
best practices and finally develop a full-scale method-          57, no. 1-2, pp. 83–113, 2004.
ology for descriptive data mining we aim to apply our
design to further case studies in various domains.          [10] Sergio Luján-Mora, Juan Trujillo, and Il-Yeol
                                                                 Song, “A UML profile for Multidimensional
                                                                 Modeling in Data Warehouses,” Data Knowl.
           6. ACKNOWLEDGEMENTS                                   Eng., vol. 59, no. 3, pp. 725–769, 2006.

Part of this work has been funded by the EU IST FP7         [11] Martin Atzmueller, Fabian Haupt, Stephanie
project ACTIVE under grant 215040, and by the Ger-               Beer, and Frank Puppe, “Knowta: Wiki-Enabled
man Research Council (DFG) under grant Pu 129/8-2.               Social Tagging for Collaborative Knowledge and
Furthermore, this work has been partially supported by           Experience Management,” in Proc. Intl. Work-
the VENUS research cluster at the interdisciplinary Re-          shop on Design, Evaluation and Refinement of In-
search Center for Information System Design (ITeG) at            telligent Systems (DERIS), 2009, vol. CEUR-WS.
Kassel University.                                          [12] Martin Atzmueller, Florian Lemmerich, Jochen
                                                                 Reutelshoefer, and Frank Puppe, “Wiki-Enabled
                 7. REFERENCES                                   Semantic Data Mining - Task Design, Evaluation
                                                                 and Refinement,” in CEUR-WS 545, 2009.
 [1] Oscar Marbán, Javier Segovia, Ernestina                [13] Karin Becker and Cinara Ghedini, “A Documen-
     Menasalvas, and Covadonga Fernández-Baizán,                 tation Infrastructure for the Management of Data
     “Toward Data Mining Engineering: A Software                 Mining Projects,” Information & Software Tech-
     Engineering Approach,” Information Systems,                 nology, vol. 47, no. 2, pp. 95–111, 2005.
     vol. 34, no. 1, pp. 87 – 107, 2009.

</pre>