Fuzzy Expert Systems for Automated Data
Quality Assessment and Improvement Processes

    Corinna Cichy¹,² [0000-0002-1745-0126] and Stefan Rass¹ [0000-0003-2821-2489] *
            ¹ Alpen-Adria University Klagenfurt, 9020 Klagenfurt, Austria
            ² Volkswagen Bank GmbH, 38112 Braunschweig, Germany


        Abstract. Due to its importance for decision-making processes, data
        quality plays a crucial role in modern data management. However, as-
        sessing data quality still involves a number of manual steps. Moreover,
        these tasks are characterized by subjective decisions performed by do-
        main experts. The goal of this research is to reduce the time spent on
        these activities by investigating the possibility of imitating an expert’s
        quality judgement via an approximation of the expert reasoning by fuzzy
        logic. We lay out the steps to (automatically) set up such a system and
        introduce possible applications. The approach allows us to benefit from
        combining established data management methods with machine learn-
        ing as well as knowledge engineering techniques in order to handle the
        complexity and uncertainty of the presented process in a transparent
        way.

        Keywords: Fuzzy expert systems · Explainable AI · Data quality


1     Introduction

Human capabilities are often insufficient to handle today’s data management
tasks, which can lead to increased costs for the business [4,6]. Consequently, the
importance of improving data quality processes is widely recognized [5].
Nevertheless, the prioritization of data quality issues and the assessment of the
overall value of a data set are often performed manually. To some extent, this
can be attributed to the decisions that experts make when evaluating
data quality from a business perspective. Yet existing data quality frameworks
give little advice on how to handle such decisions based on the available
metadata, and the context-dependent aspects of data quality are mostly handled via
survey questionnaires [3]. Especially within the domain of regulatory compliance,
this poses a high risk. This motivates our research, whose goal is to imitate an
expert’s decision-making about data quality. In particular, we propose a
combination of fuzzy logic reasoning and regression techniques to “machine-learn” the
expert’s judgment and ultimately automate this manual labor. The overall aim
of the proposed approach is to save time on critical data quality processes
while providing consistency throughout personnel changes. The choice of
fuzzy logic is premised on the assumption that domain experts are more
likely to express their knowledge-based decisions in terms of natural language
than in mathematical concepts. Moreover, next to topics such as fairness
and accountability, transparency within machine learning is becoming
increasingly important as one of the key AI principles [1,8]. The provided
degree of explainability enables the domain experts to play a defined role during
the development process and, rather than being confronted with plain results,
to understand the actions recommended by the system.

    * Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).


2     Fuzzy Reasoning about Data Quality
We propose a framework based on fuzzy logic that aims at simulating an
expert’s behavior throughout his tasks in data quality assessment. In particular,
we describe the development of a knowledge-based system that combines the
domain knowledge of an expert with existing measurement metrics. This
approach presents the recommended actions for a decision process to the expert
in a comprehensible way, mainly due to the natural interpretation of the fuzzy logic
involved. In this section, the necessary steps for developing and implementing
such a support system are explained. Moreover, we outline the proposed
procedure for two applications: data quality assessment and data quality improvement.

2.1    Overview of the Method
For the method to be applicable, the following preliminary conditions have to
be met: The process should
  (i) include objective measurements such as data quality metrics or other key
      performance indicators (KPIs) that provide an informative indication for
      the subsequent decision process,
 (ii) involve an expert evaluation that rests upon the measurements from (i) com-
      bined with his implicit expert knowledge of the process, and
(iii) be a recurring process, i.e. the output represents a regular activity of the
      expert and historic data on previous decisions are available.
   Once these preliminary conditions are met, the approach consists of three
main phases and proceeds as follows:
1.) Capturing the Expert’s Knowledge Using Fuzzy Logic: The expert
    expresses his decision-making in terms of natural language, which is then
    transferred and expressed within a number of candidate fuzzy systems.
2.) Training the Fuzzy Approximation Model: The relevant fuzzy rules
    are selected, e.g. by performing a linear regression on historical data, with
    the regression base functions being fuzzy if-then rules.
3.) Applying the Fuzzy Approximation Model: This represents the practical
    usage of the model within the expert’s recurring decision process.
   The actions of the data scientist and the input from the data quality ex-
pert are distinguished in Figure 1 to emphasize the practical realization of the
method. In particular, the domain expert provides his knowledge (textual input)
which is captured within the model and combined with the objective measure-
ments (numerical input). These steps are demonstrated in Sections 2.2 to 2.4.


                            Fig. 1. Illustration of the Method


2.2    Capturing the Expert’s Knowledge Using Fuzzy Logic


The first step of our proposed method is to develop the relevant fuzzy
components, i.e. fuzzy variables along with a set of (potentially significant) fuzzy rules.
An advantage of this choice is that fuzzy variables allow for different degrees
of membership in various categories. This is especially helpful if the experts are
uncertain about assigning clear thresholds (e.g., a record has “good quality”
if it is at least 60% complete). As a starting point, the expert names relevant
categories and associations (in natural language terms), which form the basis
for the fuzzy variables and their membership functions. Specifically, the expert
states his interpretation of existing objective measurements and his subsequent
decision process.
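
As an illustration of such a fuzzy variable, the following minimal Python sketch models a completeness metric with three linguistic categories. The trapezoidal shape, the category names and all breakpoints are assumptions chosen for this example; in practice they would be elicited from the expert.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge


# Hypothetical fuzzy variable "completeness" (share of filled fields, 0..1).
# Each tuple holds the breakpoints (a, b, c, d) of one category.
COMPLETENESS = {
    "low":    (0.0, 0.0, 0.4, 0.6),
    "medium": (0.4, 0.6, 0.7, 0.9),
    "high":   (0.7, 0.9, 1.0, 1.0),
}


def fuzzify(x, variable=COMPLETENESS):
    """Return the degree of membership of x in every category."""
    return {name: trapezoid(x, *params) for name, params in variable.items()}


memberships = fuzzify(0.5)
print(memberships)
```

With these assumed breakpoints, a record that is 50% complete belongs partly to "low" and partly to "medium" instead of falling on one side of a hard threshold.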
     Following the principles of constructing a fuzzy expert system, we proceed by
defining fuzzy rules for inference towards the quality assessment. Here, expert
knowledge is again incorporated since the expert expresses aspects about his
decision-making. Based on natural language explanations, rules of the form: IF
condition THEN conclusion are extracted.
     The result of this step is a set of candidate (basis) fuzzy if-then rules $R_1, \ldots, R_n$,
which can be converted into a set of continuous functions $g_1, \ldots, g_n : \mathbb{R} \to \mathbb{R}$
representing the basis fuzzy systems. The individual importance of the rules is
determined by regression techniques in the next step.
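
One simple way to read a rule as a continuous function is sketched below. It is an assumption of this example, and not necessarily the construction used in [2], that the function value is the degree to which the rule's condition holds for a measurement x. The ramp-shaped memberships, their breakpoints and the helper names are likewise hypothetical.

```python
def ramp_down(x, full, zero):
    """Membership that is 1 up to `full`, 0 from `zero` on, linear in between."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def ramp_up(x, zero, full):
    """Membership that is 0 up to `zero`, 1 from `full` on, linear in between."""
    return 1.0 - ramp_down(x, zero, full)


def make_rule(text, membership):
    """Bundle a natural-language rule with its basis function g: R -> R."""
    def g(x):
        return membership(x)  # degree to which the IF-part holds (assumption)
    g.text = text
    return g


# Two candidate rules on a single metric (hypothetical breakpoints).
g1 = make_rule("IF completeness is low THEN quality is poor",
               lambda x: ramp_down(x, 0.4, 0.6))
g2 = make_rule("IF completeness is high THEN quality is good",
               lambda x: ramp_up(x, 0.7, 0.9))

for g in (g1, g2):
    print(g.text, "->", [round(g(x), 2) for x in (0.2, 0.5, 0.8, 1.0)])
```

Keeping the rule text attached to each function preserves the link between a regression term and its natural-language meaning.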

2.3   Training the Fuzzy Approximation Model
To refine the fuzzy rule set towards a best reproduction of the given expert’s
rating (supervised learning), we refer to the techniques laid out in [2] and [7]. In
particular, here we use linear regression to fit a linear combination of fuzzy rules.
This has the advantage of assigning weights to the rules that reflect the particular
rule’s importance. Moreover, the overall model is open to a variety of further
statistical analyses and model diagnostics. Fitting this “expert imitation model”
(1) requires training data, which we take from historical (manual) data quality
judgments and objective measurements, represented as $x_i$.

\[
  f(x_i, \alpha_0, \alpha_1, \ldots, \alpha_k) = \alpha_0 + \alpha_1 g_1(x_i) + \ldots + \alpha_k g_k(x_i) + \varepsilon_i, \tag{1}
\]
for $i = 1, \ldots, n$; and with $\varepsilon_i$ being a random error term for $x_i$. Evaluating the
deterministic part of Equation 1 provides the desired output for a given data set.
Moreover, the parameters can be interpreted as weights and more specifically,
the importance of the fuzzy rules.
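
A minimal sketch of this fitting step, using ordinary least squares in Python with NumPy. The two basis functions, the synthetic "historical" ratings and the coefficient values used to generate them are all made up for illustration; real training data would come from past expert judgments.

```python
import numpy as np

# Hypothetical basis functions g_1, g_2 (vectorized ramp memberships).
def g1(x):  # "completeness is low"
    return np.clip((0.6 - x) / 0.2, 0.0, 1.0)

def g2(x):  # "completeness is high"
    return np.clip((x - 0.7) / 0.2, 0.0, 1.0)

# Synthetic stand-in for historical data: metric values x_i, expert scores y_i.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 101)
true_alpha = np.array([0.5, -0.4, 0.45])        # assumed, for the demo only
y = true_alpha[0] + true_alpha[1] * g1(x) + true_alpha[2] * g2(x)
y = y + rng.normal(0.0, 0.02, size=x.shape)     # random error term

# Design matrix with columns [1, g_1(x_i), g_2(x_i)]; least-squares solution.
G = np.column_stack([np.ones_like(x), g1(x), g2(x)])
alpha_hat, *_ = np.linalg.lstsq(G, y, rcond=None)

def predict(x_new):
    """Deterministic part of model (1) with the fitted coefficients."""
    x_new = np.asarray(x_new, dtype=float)
    return alpha_hat[0] + alpha_hat[1] * g1(x_new) + alpha_hat[2] * g2(x_new)

print("estimated coefficients:", np.round(alpha_hat, 3))
```

Because the model is linear in the coefficients, standard regression diagnostics (residual plots, confidence intervals, significance tests for individual rules) carry over directly.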

2.4   Application to Data Quality Assessment and Improvement
Due to the context-dependency of data quality, it is often unavoidable to consult
domain experts for its assessment with regard to data quality requirements (e.g.
from the business side or driven by regulatory guidelines). They incorporate the
severity of potential consequences for the business when assessing the overall
value of the data. Since the experts have to consider a large amount of metadata
for each data set in the form of objective measurements, this can be a highly
time-consuming process.
    To use the proposed method within the context of data quality assessment, the
people involved follow the steps illustrated in Figure 1. The metadata stems from
data quality measurements, which are commonly obtained via data quality tools
that provide automated calculations of key metrics. In this context, a fuzzy rule
can have the form “If metric 1 is low then quality is poor”. Applying the model (1)
then amounts to evaluating the deterministic part of the formula on the data set
to be assessed, whose quality metadata enter (1) as the variables $x_i$ for
$i = 1, 2, \ldots$. For this practical setting, the model produces a data quality
score representing the quality indicator of the data record. How this assessment
was obtained is visible from the linear model, which simply aggregates the rules
about quality as a weighted sum. The judgment is thus explainable, since each
$g_j$ (for $j = 1, 2, \ldots, k$) in (1) corresponds to a natural-language if-then
rule and affects the overall assessment in the direction and to the extent given
by the sign and magnitude of its coefficient $\alpha_j$.
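
The following sketch shows how such a score and its explanation could be computed for one data record. The metric names, rules, breakpoints and coefficient values are hypothetical; in a real system the coefficients would come from the fitted model (1). Letting each rule read its own metric from a record is also an assumption of this example.

```python
def ramp_down(x, full, zero):
    """Membership that is 1 up to `full`, 0 from `zero` on, linear in between."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def ramp_up(x, zero, full):
    """Membership that is 0 up to `zero`, 1 from `full` on, linear in between."""
    return 1.0 - ramp_down(x, zero, full)


# Hypothetical fitted model: intercept plus (rule text, coefficient, basis function).
INTERCEPT = 0.5
RULES = [
    ("IF completeness is low THEN quality is poor", -0.4,
     lambda r: ramp_down(r["completeness"], 0.4, 0.6)),
    ("IF accuracy is high THEN quality is good", 0.3,
     lambda r: ramp_up(r["accuracy"], 0.7, 0.9)),
]


def assess(record):
    """Return the quality score and each rule's contribution (coefficient * g)."""
    contributions = [(text, alpha * g(record)) for text, alpha, g in RULES]
    score = INTERCEPT + sum(value for _, value in contributions)
    return score, contributions


score, explanation = assess({"completeness": 0.5, "accuracy": 0.95})
print(f"score = {score:.2f}")
for text, value in explanation:
    print(f"{value:+.2f}  {text}")
```

The per-rule contributions are what make the score explainable: each line states which rule moved the assessment, in which direction, and by how much.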
    For data quality improvement, we consider the prioritization of identified data
quality issues. Here, rather than evaluating the formula, we use the coefficients
therein as indicators of how much impact an improvement would make to the
overall quality. If the coefficient of a fuzzy rule is large, then the priorities on
what criteria to improve can be set accordingly.
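
A minimal sketch of this prioritization, with hypothetical rules and coefficient values. It assumes that all basis functions take values in [0, 1], so that coefficient magnitudes are comparable across rules.

```python
# Hypothetical fitted coefficients, keyed by the text of their fuzzy rule.
coefficients = {
    "IF completeness is low THEN quality is poor": -0.40,
    "IF accuracy is high THEN quality is good": 0.30,
    "IF timeliness is low THEN quality is poor": -0.10,
}


def prioritize(coeffs):
    """Order rules by absolute coefficient, largest potential impact first."""
    return sorted(coeffs.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = prioritize(coefficients)
for rank, (rule, alpha) in enumerate(ranking, start=1):
    print(f"{rank}. |alpha| = {abs(alpha):.2f}  {rule}")
```

With these made-up values, issues affecting completeness would be addressed first, since the corresponding coefficient is largest in magnitude.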

3    Conclusion

The challenging tasks within the data quality management of a business are
characterized by subjective and time-consuming processes. However, a suitable
adaptation of machine learning and knowledge management techniques can provide
assistance for such tasks. We propose an approximation by fuzzy rules to support
the assessment and improvement of data quality. The expert system can provide
assistance throughout the decision-making process of data quality experts and
data consumers. We further contribute a set of guidelines to set up such imitation
systems in a business. This provides the basis for identifying further processes
that are suitable for an implementation. Next steps of our research include the
validation of the method in comparison with other machine learning techniques
for the assessment and improvement phases in the context of risk data, including
a test of scalability with respect to the number of fuzzy variables, rules, and outcomes.


References
1. Adadi, A., Berrada, M.: Peeking inside the black box: A survey on explainable
   artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018)
2. Cichy, C., Rass, S.: A fuzzy-approximation approach to explainable information
   quality assessment. In: Proc. of the 34th International Business Information Man-
   agement Association Conference (IBIMA’19). pp. 3919–3931. Madrid, Spain (2019)
3. Cichy, C., Rass, S.: An overview of data quality frameworks. IEEE Access 7, 24634–
   24648 (2019)
4. Hippold, S.: Watch these data and analytics challenges and trends (2018),
   https://www.gartner.com/smarterwithgartner/watch-these-data-and-analytics-challenges-and-trends/.
   Accessed 2 Aug 2020
5. Nagle, T., Redman, T., Sammon, D.: Assessing data quality: A managerial call to
   action. Business Horizons 63(3), 325–337 (2020)
6. Redman, T.C.: Getting in Front on Data: Who Does What. Technics Publications,
   Basking Ridge, NJ, USA (2016)
7. Riza, L.S., Bergmeir, C., Herrera, F., Benitez, J.M.: frbs (2015),
   https://cran.r-project.org/web/packages/frbs/frbs.pdf. Accessed 2 Aug 2020
8. Zeng, Y., Lu, E., Huangfu, C.: Linking artificial intelligence principles. arXiv (2018),
   https://arxiv.org/pdf/1812.04814v1.pdf. Accessed 2 Aug 2020