=Paper= {{Paper |id=Vol-2399/paper09 |storemode=property |title=Make Informed Decisions: Understanding Query Results from Incomplete Databases |pdfUrl=https://ceur-ws.org/Vol-2399/paper09.pdf |volume=Vol-2399 |authors=Poonam Kumari |dblpUrl=https://dblp.org/rec/conf/vldb/Kumari19 }} ==Make Informed Decisions: Understanding Query Results from Incomplete Databases== https://ceur-ws.org/Vol-2399/paper09.pdf




               Make Informed Decisions:
 Understanding Query Results from Incomplete Databases

                                                             Poonam Kumari
                                                  Supervised by Dr. Oliver Kennedy
                                                State University of New York at Buffalo
                                                 New York, United States of America
                                                       poonamku@buffalo.edu

ABSTRACT
Analyzing data has been central to making decisions, whether it be a decision to buy stock or to detect the chances of diabetes based on family history. Datasets used for analysis might include incomplete, inconsistent, or missing data, or might involve integrating two or more sources. Data quality management has been studied extensively with a focus on tabular data. A lot of work has been done on data curation and imputation, although the visualization aspect of data quality management remains fairly unexplored. The aim of this PhD research is to focus on visualizing the imperfections in these datasets in order to help users analyze and interpret data and guide them to make informed decisions. We explore how different visualization techniques affect perceived data quality, accuracy and decision confidence.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

1.    INTRODUCTION
   With growing data sizes and different ways of obtaining data, datasets being analyzed are prone to incomplete, inconsistent, and missing data. These errors must be detected and corrected in order to maintain the quality and usability of data. This takes 30-80 percent of an analyst's time and resources [14]. Different systems have been designed to help analysts curate the data using a wide variety of methods to deal with dirty data.
   For example, simple imputation techniques like hot-deck imputation substitute values from the current sample, whereas cold-deck imputation makes use of related datasets [8] or domain heuristics [10]. Methods like linear interpolation, regression, and adaptive interpolation [7] infer missing values by using a weighted combination of available data. More complex imputation techniques estimate missing values using machine learning and related techniques [3] or integrate information about the processes used to generate the dataset [13].
   Historically, uncertain data could not be queried using classical databases. Although incomplete [9] and probabilistic databases [17] let the user query data with uncertainty, the query results might be difficult to understand.
   Probabilistic databases: Probabilistic Databases (PDs) make use of a user-specified probability distribution function for the uncertain data. For instance, in [5] a parameter p on each tuple specifies the probability distribution for tuple existence. In [15] a user-specified joint probability is used by the PD to determine resulting output tuples and their associated probabilities. No explanation about query results is provided in PDs ("Why is a tuple present in the query result, or why is a high probability associated with the tuple?"). Although PDs help manage uncertain data successfully, the probability distributions in query results might be difficult to understand.
   Incomplete Databases: To deal with uncertain data, incomplete databases work on a set of all deterministic instances known as possible worlds. A typical query result on these databases might consist of certain answers, possible answers or both (depending on the type of incomplete database system). The database instances in Figure 1 represent two possible worlds, i.e., the Ceiling Mart database is one possible world and Aimpoint is another possible world. The ratings for Dell i7 and Lenovo i7 are consistent across both possible worlds. If a user issues a query to get the ratings for these two products, the result set would consist of certain answers (answers in all possible worlds). In contrast, the query result for the ratings of Asus i5 and HP AMD would contain possible answers (due to the missing value for HP AMD in one possible world and the inconsistent rating for Asus i5).

          Ceiling Mart                      Aimpoint
       Name         Rating              Name         Rating
       Dell i7      4                   Dell i7      4
       HP AMD       2                   HP AMD
       Asus i5      3.5                 Asus i5      4.5
       Lenovo i7    3                   Lenovo i7    3

Figure 1: Product rating data from Ceiling Mart and Aimpoint

   Different approaches have been used to represent possible and certain answers. Conservative approaches [1] consider only the certain answers. For instance, a query on the possible worlds in Figure 1 will result in two tuples, Dell i7 and Lenovo i7, since their ratings are consistent in both worlds. Best guess query processing uses the best possible world by making an educated guess and works exclusively with the guessed world. Suppose the best guess approach chooses the first instance from Figure 1. A query to get the ratings of products will present all of its answers as certain, ignoring the missing value and the inconsistency in the rating for Asus in the second instance. (1) The conservative approach ignores the uncertainty altogether, missing out on valuable information. (2) The best guess approach takes uncertainty into account, but the valuable information about interpreting the uncertainty is lost [6].
   To summarize, various imputation methods are used to deal with uncertain data, each of which makes a guess based on existing values, domain heuristics or machine learning techniques. These guesses can be in the form of certain and possible answers. Incomplete and probabilistic databases help query these datasets and provide query results as tabular data. These query results might or might not contain uncertain data (possible answers), which hinders users' ability to make an informed decision. Uncertainty annotated databases (UA-DBs [6]) help overcome the limitations of earlier systems caused by ignoring uncertainty or missing out on information about interpreting uncertainty. UA-DBs also help represent uncertain data effectively and distinguish between certain answers and merely possible answers.

2.   MOTIVATING EXAMPLE
   ABC corp. is a sales company which helps users select products like laptops based on a large database of crowd-sourced and/or web-scraped reviews of those products. Alice is the customer service representative. Bob is an analyst who maintains the database. Bob is working on integrating the instances shown in Figure 1, which contain laptop ratings from different vendors. Bob needs to clean the data first (missing value and inconsistent ratings) and load it, which will enable Alice to query the database and make a suggestion to the customer.
   During data imputation the system decides to ignore the missing value in the case of HP AMD and to take an average of the ratings in the case of Asus i5. The integrated dataset (Table 1) is passed on to Alice for analysis.

          Name                 Rating
          Dell i7              4
          HP AMD               2
          Asus i5              4
          Lenovo i7            3

Table 1: Integrated product rating data from Ceiling Mart and Aimpoint

   In the above example, Table 1 represents an incomplete database. Ratings for Dell i7 and Lenovo i7 are certain answers (answers in all possible worlds), whereas ratings for HP AMD and Asus i5 are possible answers (uncertain) due to the system making a guess about the missing and inconsistent values.

3.   RESEARCH QUESTIONS
   Why is a distinction between certain and uncertain answers required? And how would this distinction help users assess relevant information and make an informed decision based on it?
   In the earlier example, Bob had completed the data cleaning task and the database was queried by Alice to obtain a tabular query result (Table 1) containing both certain and uncertain answers.

   • If the conservative approach is used then Alice is left with just two products and the user might choose Dell i7. In this method the user misses out on comparing Dell i7 and Asus i5, which has a higher rating, although uncertain.

   • If the best guess approach is considered, Alice has all four ratings to choose from. Since the distinction between certain and possible answers is not clear and valuable information about the possible answer is lost, the user might end up with Asus i5.

   A lot of time and effort is put into cleaning the data, making guesses and calculating the best possible world. Data cleaning forms a large chunk of the data management life cycle. After all this effort, what if the query results are not understood by the user? For instance, classical probabilistic databases represent query results in the form of certain answers or probability distributions, which might overwhelm a naive user. Just having a probability distribution or possible answers for query results is insufficient: the uncertainty must be communicated to the users who will ultimately decide the relevant information (in the results) pertaining to their task and make an informed decision [6].
   Incomplete databases cannot decide whether the data presented as part of query results are relevant for the user's decision task. Alice is helping a user make a decision in choosing a laptop based on the ratings presented in Table 1. Uncertain answers in the query result pose an important question: are uncertain answers reliable, given that they are ultimately a guess made by the system? Wang and Wang [18] conducted a case study with real world data to demonstrate the usefulness of discovering knowledge about the patterns of missing values through classification. The data mining task was to find how important a role the race factor played in the home loan assessment process. The classifier for the data without the race factor had 64.1% accuracy for the training data set and 64.2% accuracy for the test data set, producing an overall 64.2% accuracy. In the medical domain, Zaffalon et al. [20] use the naive credal classifier, which extends the discrete naive Bayes classifier to imprecise probabilities. The diagnostic tool delivers up to 95% correct predictions and also proves to be effective in discriminating between Alzheimer's disease and dementia with Lewy bodies. Although different imputation methods are used and the system only makes an educated guess, these studies suggest that the guesses about possible answers can be reliable. Excluding possible answers from the query result might therefore result in losing valuable information.
   Since uncertain answers can be reliable, what should the user do when they see an uncertain value? Users can take the conservative approach and ignore the uncertain values. Limitations of this strategy are well known [6]. The second approach is to consider uncertain answers for the decision task. In Table 1 the uncertain rating for HP AMD might not be relevant to the user since there are higher rated products. But the uncertain rating in the case of Asus might be relevant to the user, since the user has to choose between a certain 4 for Dell i7 and an uncertain 4.5 for Asus i5. The system cannot decide whether the values are relevant to the user task; the user has to understand and make this decision. We believe
that providing additional information about the uncertain data will guide users to make an informed decision.
   The focus of this research is to provide guidelines and best practices to visualize uncertainty in incomplete databases. For example, we would like to help users to visually distinguish between certain and uncertain answers in query results for incomplete databases. As another example, simply knowing that an answer is uncertain may not be enough, and we would like to provide additional contextual hints explaining uncertainty.

4.    PRELIMINARY STUDY
   Why is there a need to visualize uncertainty? We have already established that presenting possible answers does aid users in making a decision. Song and Szafir [16] conducted a pair of crowdsourced studies to measure the influence of the methods used to impute and visualize missing data on an analyst's perception of data quality. The methods used also affected conclusions. The study concluded that highlighting imputed values led to higher perceived confidence, credibility and data quality, whereas not visualizing the missing values, downplaying visual encodings, and filling in missing values with zero (zero-filling) led to lower subjective perceived measurements.
   Apart from improving decision-making and increasing perceived confidence, research carried out in several domains such as health, weather prediction, and transportation indicates that displaying uncertainty helps improve the trust placed in the system. A simple feedback mechanism in context-aware systems was evaluated in [4]. The results suggest that human performance in memory-bounded tasks increases when uncertainty information is displayed explicitly.
   To visually distinguish between certain and uncertain answers in query results for incomplete databases. Most imputation methods require the system to make a guess and form certain and uncertain answers. The type of system decides whether uncertainty in the data should be presented to the user or not. We believe that uncertain data should be presented and uncertainty in the data should be effectively communicated in order to help users interpret the results and decide whether and how to act on the results given. Our initial efforts in communicating uncertainty about query results in On Demand Curation Tools are presented in [12]. A preliminary user study was conducted to evaluate the cognitive burden and expressiveness of four representations of "attribute-level" uncertainty. Uncertain data was annotated using simple one-bit representations (asterisk, colored text and color background) and a confidence interval (Figure 2).

     Product     CeilingMart      Aimpoint       Ibibo
     HP          4.5              3.0            3.5±1
     Asus        2.5              2.5            3.0
     Dell        5.0*             3.5            5.0

Figure 2: Example uncertainty representations.

   Participants were presented with a task to rank three different products based on the ratings provided. Product selection, re-ordering the product list, and submitting the participant's final order were logged along with timestamps as part of interactions with the web form. A think-aloud protocol was also used in the experiment in order to transcribe participants' thought processes while making a decision. The study aimed at answering two primary questions: (1) Is the representation effective at communicating uncertainty, and (2) What is the cognitive burden of interpreting the representation? Results showed insignificant differences in the time taken by users to interpret uncertainty. A change in the ways people interpreted and reacted to data based on a change in uncertainty was also observed. Colored text and color coding significantly altered participant behavior, which is consistent with coloring signaling significant errors. Participants requested additional information when an asterisk was used to represent uncertain data.

4.1 Follow up Study
   Through the previous study we have established the need for representing uncertainty in incomplete databases, along with ways to represent it. The next step is to help users understand the reason for data being uncertain, that is, to provide additional contextual hints explaining uncertainty. A follow up study was conducted to explore this task using a lighter-weight, two-level interface for presenting uncertain query results to users. First, a preliminary annotation (the same as in the preliminary study) notifies users about the presence of uncertainty. If they deem it relevant, users can then interactively explore the uncertainty to obtain additional detail.
   Why would users need additional information? One of the limitations of both PDs and incomplete databases is the lack of information about the probability/uncertain answer. Output tuples in existing systems like TRIO [2] do contain lineage/provenance along with output probabilities. Lineage refers to a boolean formula which qualitatively explains the reasons for the occurrence of the output tuple. However, it is not informative in the case of multiple output tuples. The projection of a million tuples onto a single tuple results in a very large lineage formula of size one million, from which it can be difficult for the user to obtain any information. We believe that information regarding uncertain data can be displayed in the form of small contextual hints. The information should be presented to the user on demand. For instance, in Table 1 the user might not need this contextual information for the HP AMD laptop, but it might prove helpful in the case of a choice between Dell and Asus.

5.   RESEARCH PLAN
   The current systems cannot decide whether an uncertain value is relevant to the user task (e.g., a ranking task based on the data in Table 1). Kanagal et al. [11] address the problem of determining the sensitive input tuples for a given query in PDs. Sensitive tuples are the ones that can substantially alter the output when their probabilities are modified. A similar strategy can be used in the case of incomplete databases. Our next steps would be: (1) To help the system identify the relevancy of uncertain answers to the user query. E.g., for the ranking task the HP rating can be considered irrelevant and the Asus rating relevant. This can be done by identifying input tuples which might affect the output. An algorithm can be developed for identifying such tuples for various known queries like sum, count, min and max. (2) Incorporate the results from the user study and relevancy algorithm into an existing system, Mimir [19]. The findings of the preliminary study have been incorporated into Mimir, which uses red text to display uncertain answers.
(3) Visualize the effects of user choice on the query result. Mimir provides feedback on each guessed datapoint, and the user can choose to approve or fix the datapoint manually. Changes in the query result can be visualized based on information from the algorithm in step 1 as the user makes a decision on the feedback provided. This would help the user to visually inspect the effects of their decision before the changes are applied. (4) Visualize uncertain answers in query results using data plots. Visualization of results will aid data analysis by making it easier for the user to identify trends and outliers in the data. These data plots can be presented to domain experts for further inspection of data requiring domain knowledge. (5) Visualize missing values in raw data using data plots. Users can visualize the data and then inspect each data point in the raw data by clicking on the data plot and accepting the feedback provided by the system or fixing the uncertainty manually. Similar uncertain data, e.g., missing values in a column, can be fixed in groups based on the feedback provided.

6.   CONCLUSION
   Increasing data sizes pose the problem of uncertainty in data. Several data curation techniques have been developed along with databases (PDs and incomplete databases) to help query this data. Although data cleaning is studied extensively, we need to focus on visualizing the query results for a better understanding. We described a user study as part of our initial effort, and next steps to help us design guidelines for visualizing uncertainty in incomplete databases.

7.   REFERENCES
 [1] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. Theoretical computer science, 78(1):159–187, 1991.
 [2] C. C. Aggarwal. Trio a system for data uncertainty and lineage. In Managing and Mining Uncertain Data, pages 1–35. Springer, 2009.
 [3] S. Ahuja, M. Roth, R. Gangadharaiah, P. Schwarz, and R. Bastidas. Using machine learning to accelerate data wrangling. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 343–349. IEEE, 2016.
 [4] S. Antifakos, A. Schwaninger, and B. Schiele. Evaluating the effects of displaying uncertainty in context-aware applications. In International Conference on Ubiquitous Computing, pages 54–69. Springer, 2004.
 [5] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal - The International Journal on Very Large Data Bases, 16(4):523–544, 2007.
 [6] S. Feng, A. Huber, B. Glavic, and O. Kennedy. Uncertainty annotated databases-a lightweight approach for approximating certain answers (extended version). arXiv preprint arXiv:1904.00234, 2019.
 [7] J. Gao. Adaptive interpolation algorithms for temporal-oriented datasets. In Thirteenth International Symposium on Temporal Representation and Reasoning (TIME'06), pages 145–151. IEEE, 2006.
 [8] W. Githungo, S. Otengi, J. Wakhungu, and E. Masibayi. Infilling monthly rain gauge data gaps with satellite estimates for asal of kenya. Hydrology, 3(4):40, 2016.
 [9] J. Grant. Incomplete information in a relational database. FUND. INFO., 3(3):363–378, 1980.
[10] K. Gülensoy, C. Gawrilow, and T. von Landesberger. Visual exploration of dirty activity sensor and emotional state data from psychological experiments. In Proceedings of the 14th International Conference on Knowledge Technologies and Data-driven Business, page 19. Citeseer, 2014.
[11] B. Kanagal, J. Li, and A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 841–852. ACM, 2011.
[12] P. Kumari, S. Achmiz, and O. Kennedy. Communicating data quality in on-demand curation. arXiv preprint arXiv:1606.02250, 2016.
[13] S. Rässler. Data fusion: identification problems, validity, and multiple imputation. Austrian Journal of Statistics, 33(1&2):153–171, 2004.
[14] B. Saha and D. Srivastava. Data quality: The other face of big data. In 2014 IEEE 30th International Conference on Data Engineering, pages 1294–1297. IEEE, 2014.
[15] P. Sen, A. Deshpande, and L. Getoor. Prdb: managing and exploiting rich correlations in probabilistic databases. The VLDB Journal - The International Journal on Very Large Data Bases, 18(5):1065–1090, 2009.
[16] H. Song and D. A. Szafir. Where's my data? evaluating visualizations with missing data. IEEE transactions on visualization and computer graphics, 25(1):914–924, 2019.
[17] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic databases, synthesis lectures on data management. Morgan & Claypool, 2011.
[18] H. Wang and S. Wang. Mining incomplete survey data through classification. Knowledge and information systems, 24(2):221–233, 2010.
[19] Y. Yang, N. Meneghetti, R. Fehling, Z. H. Liu, and O. Kennedy. Lenses: An on-demand approach to etl. Proceedings of the VLDB Endowment, 8(12):1578–1589, 2015.
[20] M. Zaffalon, K. Wesnes, and O. Petrini. Reliable diagnoses of dementia by the naive credal classifier inferred from incomplete cognitive data. Artificial intelligence in medicine, 29(1-2):61–79, 2003.
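
The possible-world example of Figure 1, the imputation that yields Table 1, and the relevance idea from the research plan can be illustrated with a minimal Python sketch. It is not taken from the paper or from Mimir: the function names (classify, integrate, relevant_for_top_choice) are hypothetical, and the relevance rule is an assumed simplification for a single "pick the highest rating" query, in which an uncertain product counts as relevant if any of its possible ratings could match or beat the best certain rating.

```python
from typing import Dict, List, Optional, Tuple

# The two possible worlds of Figure 1; None marks the missing HP AMD rating.
World = Dict[str, Optional[float]]
ceiling_mart: World = {"Dell i7": 4.0, "HP AMD": 2.0, "Asus i5": 3.5, "Lenovo i7": 3.0}
aimpoint: World = {"Dell i7": 4.0, "HP AMD": None, "Asus i5": 4.5, "Lenovo i7": 3.0}
worlds: List[World] = [ceiling_mart, aimpoint]


def classify(worlds: List[World]) -> Tuple[dict, dict]:
    """Split products into certain answers (same known rating in every
    world) and possible answers (missing or inconsistent ratings)."""
    certain, possible = {}, {}
    for name in worlds[0]:
        ratings = [w.get(name) for w in worlds]
        if None not in ratings and len(set(ratings)) == 1:
            certain[name] = ratings[0]
        else:
            possible[name] = ratings
    return certain, possible


def integrate(worlds: List[World]) -> Dict[str, Tuple[float, bool]]:
    """Imputation from the motivating example: average the known ratings,
    ignore missing ones, and keep a flag saying whether the value is uncertain."""
    certain, _ = classify(worlds)
    table = {}
    for name in worlds[0]:
        known = [w[name] for w in worlds if w.get(name) is not None]
        table[name] = (sum(known) / len(known), name not in certain)
    return table


def relevant_for_top_choice(worlds: List[World]) -> List[str]:
    """Assumed relevance rule for a max query: an uncertain product matters
    if one of its possible ratings reaches the best certain rating."""
    certain, possible = classify(worlds)
    best_certain = max(certain.values(), default=float("-inf"))
    return sorted(
        name
        for name, ratings in possible.items()
        if any(r is not None and r >= best_certain for r in ratings)
    )


if __name__ == "__main__":
    for name, (rating, uncertain) in integrate(worlds).items():
        # An asterisk plays the role of the one-bit uncertainty annotation.
        print(f"{name:10s} {rating}{'*' if uncertain else ''}")
    print(relevant_for_top_choice(worlds))  # ['Asus i5']
```

Under these assumptions the integrated values match Table 1, HP AMD is flagged as uncertain but not relevant to the top choice, and Asus i5 is flagged as both uncertain and relevant, which mirrors the ranking example in the research plan. Other aggregate queries such as sum or count would need their own relevance rules.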