Generating rule suggestions in PBE data transformation

                                                        Kuldeep Reddy
                                                     Southeast University
                                                        Nanjing, China
                                                     brkuldeep@gmail.com


                                                                 browsing query history and results. There has been
                                                                 recent work on developing query specification tech-
                          Abstract                               niques involving just gestures and voice. Text based
                                                                 interface techniques include keyword search and natu-
     This paper presents an idea for generating
                                                                 ral language search. Miscellaneous techniques include
     rule suggestions in performing programming-
                                                                 spreadsheet based interfaces, query-by-example tech-
     by-example data transformation operations.
                                                                 nique, techniques to handle the empty-result problem
     Data transformation is an important opera-
                                                                 with query relaxation or reformulation etc, personal-
     tion in data integration that is carried out
                                                                 ization and diversification techniques.
     frequently in many application domains, its
                                                                    Data transformations Data transformations are of-
     programming-by-example variant consists of a
                                                                 ten required to address the problem of heterogeneity
     grammar from which a rule is built. It is then
                                                                 in various data sources by converting the data from
     used in creating a new data item from an ex-
                                                                 multiple sources into the same format. One common
     isting data item by searching for occurrence
                                                                 approach to handle data transformation is to define a
     of pattern data item and replacing it with a
                                                                 transformation language and then generate rules based
     new data item defined in the rule. This pa-
                                                                 on the language to perform data transformation on
     per proposes two new components in this sys-
                                                                 a large set of data. However, this approach usually
     tem a) generating suggestions for new rules
                                                                 requires expert users to write individual transforma-
     based on rules which have been given as in-
                                                                 tions for each data source manually. To alleviate this
     put, in particular differences between struc-
                                                                 problem, recently approaches have developed as in [2]
     tures in rule sequences b) suggesting rule mod-
                                                                 that take user examples as input and build data trans-
     ifications that would include an item user is
                                                                 formation rules around it taking into consideration a
     interested in, with the final modified ruleset
                                                                 grammar.
     as a way to provide explanation to user why-
     not question.                                                  Query Suggestions An approach to solve the ambi-
                                                                 guity and inaccuracy in the information retrieval sys-
1    Introduction                                                tem is query suggestion [4]. It is very common for a
                                                                 user to reformulate their query when they didnt receive
Data integration Data integration involves combining             ideal result from their original query. The system can
data residing in different sources and providing users           improve the users searching effort by providing sug-
with a unified view of them [1]. Various compo-                  gestions by guessing the user intention, according to
nents of typical data integration process include - data         either users past behaviour or data-driven techniques.
quality assessment, record matching, schema match-                  Why-not There have been many techniques pro-
ing, data transformation, data provenance assessment,            posed in literature that can explain where a piece of
data warehousing etc.                                            data comes from, this can explain surprises in a result
   Database usability primarily concerns with devel-             set. The idea of whynot [5], on the other hand, seeks
oping techniques to make traditional databases more              to explain why a certain item is missing from the result
accessible to the users [3]. Examples of visual inter-           set. In the setting of traditional relational databases
faces include faceted search, template based search,             where the idea of why-not originated, the user provides
Copyright © CIKM 2018 for the individual papers by the papers'   the SQL query and a missing result item, the system
authors. Copyright © CIKM 2018 for the volume as a collection    then pinpoints where the item was not included in the
by its editors. This volume and its papers are published under
                                                                 resultset.
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
2     System design                                          that just makes use of rules written till now and does
                                                             not require any additional data sources and only uti-
The overall system prototype design involves two new
                                                             lizes the graph structure developed earlier. The basic
components - a) a module to generate rule suggestions
                                                             intuition behind the proposed approach is that when-
from the graph structure constructed to manage rule
                                                             ever the user starts with a partial rule, it searches for
sequences c) a module to answer user why-not ques-
                                                             the occurrences of the partial rules in the rules graph.
tions on data transformations and providing explana-
                                                             Performing such a such on the rules graph is done by
tion in the form of modified graph structure. They key
                                                             string similarity matching utilizing various measures
modules are described in more detail below.
                                                             in literature such as jaro-winkler, sorenson-dice etc.
                                                             The suggestions are ranked according to the popularity
2.1   Generate rule suggestions
                                                             score, which represents the frequency of the elements
In the first part of system, the paper proposes a graph      of new partial rule found in the earlier rules graph.
structure to store PBE data transformation rules. The        Of course, utilizing more sophisticated string similar-
rules we consider have been developed as part of foofah      ity heuristics is important and impact on the quality
system [6], where the focus has been on improving its        of suggestions generated and studying them is part of
usability. By having a graph structure we are storing        future work.
relationships between common components in rule se-
quences and if the new rule structure is similar to var-
ious combination of rules in the graph, it can retrieve
relevant common rules more efficiently from graph in-        2.2   Answering why-not questions
stead of the inefficient repeated sequential searches.
The rule sequences are organized in a graph structure        In the second part of the paper, it proposes applying
described as follows. That is the rule sequence at the       why-not questions in PBE data transformation. Fre-
top of graph is represented as root and can be thought       quently situations arise where the user is interested in
of as a union of all the rules. The rule sequence in         knowing why a particular item is missing from the re-
the next level are the identifiers for individual rule se-   sultset. This concept has been developed extensively
quences, which means if we have to build such a struc-       in relational query systems and of late in graph match-
ture for 10 rules there will be 10 nodes with identifiers    ing and clustering. This paper is the applies con-
rule1, rule2, rule3 etc for each of the rules in level       cept of why-not in this context of graph-based struc-
of the graph structure. The next level of the struc-         ture PBE data transfomation which is used to gener-
ture contain the individual rule components as part of       ate suggestions. The user specifies a data item that
a directed subgraph, which means for instance if the         should have appeared in the transformed dataset but
PBE data transformation program contains subrule se-         instead is missing from the transformed dataset. To
quences of the form split(t,1) and replace(t,2) then the     know why the required dataset is missing, it first iden-
level 3 of the structure will contain nodes denoted by       tifies the rule which generated either the data item
these subrules in the form of directed subgraph with         itself exactly or the rule sequence that could have pro-
edges between them denoting the sequence of their ex-        duced a larger transformation of in which the data
ecution. The next level contain nodes identifying the        item in question is part. Infact, there can be more
individual variables that would eventually point to say      than one rule that produced the data item in question.
column names in a database or constant values or lit-        All the rule sequences which produced the data item
erals. The bottom level of the structure contain nodes       in question exactly and those that produced larger
for the various literal or numeric constant values. So,      transformed dataset of which it is part of are identi-
overall we end up with a structure consisting of 5 lev-      fied and which are then modified(through either insert,
els which can be used to manage the rule sequences,          delete or move) to reflect the changes required by the
at the first level we have a root node, at the second        user. Identifying such rules and modifications are done
level we have identifiers for individual rule sequences,     through string similarity algorithms and simulated an-
at the third level we have nodes for subsequences in         nealing techniques. Designing better heuristics is part
rule sequences with directions between them denoted          of future work. On the other hand, replacing the data
their flow of execution, at the fourth level we have vari-   item in question can affect other rules which have the
able names and fifth level we have literal or numeric        data item relevant rule in question either exactly or as
constant values.                                             part of larger ruleset. Therefore, another round of rule
   In order to alleviate some of burden of writing all       identification is required to find rule sequences which
the rules for the user, this section proposes a way to       have the rules with the data item in question. This
generate rule suggestions using the graph structure de-      again requires string similarity algorithms, of course
signed above. This is an entirely data-driven approach       studying better heuristics is again part of future work.
                                                         4   Future work
                                                         The purpose of the paper is to introduces a new ap-
                                                         proach to generate PBE data transformation rule sug-
                                                         gestions and debug it. Designing more comprehensive
                                                         complex solutions to the aforementioned problems is
                                                         part of future work.

                                                         References
                                                              [1] Halevy, Alon Y. Data Integration: A Status
                                                                  Report. , University of Washington , Seattle,
                                                                  Washington, USA (2003)

                                                              [2] Wu, Bo, Szekely, Pedro A. and Knoblock,
                                                                  Craig A. ”Learning Transformation Rules by
                                                                  Examples.” Paper presented at the meeting
                                                                  of the AAAI, 2012
3   Experiments                                               [3] Jagadish, H. V., Chapman, Adriane, Elkiss,
The experiments were conducted on a 64-bit com-                   Aaron, Jayapandian, Magesh, Li, Yunyao,
puter running windows 7 in Java.            The real-             Nandi, Arnab and Yu, Cong. ”Making
world data used in experiments are obtained from                  database systems usable..” Paper presented
the University of Florida Sparse Matrix Collection                at the meeting of the SIGMOD Conference,
(cise.ufl.edu/research) and the Parasol project and               2007.
KONECT(http://konect.uni-koblenz.de/networks).                [4] Niu, Xi and Kelly, Diane. ”The use of query
   The graph shown below in the figure 1 show the                 suggestions during information search..” Inf.
execution times in seconds to construct graph in PBE              Process. Manage. 50 , no. 1 (2014): 218-234.
data transformation for various parameter configura-
tions for sizes of rule 3 to 1000.                            [5] Islam, Md. Saiful, Liu, Chengfei and Li,
   The graph shown below in the figure 2 show the                 Jianxin. ”Efficient answering of why-not
execution times in seconds for the rule suggestion al-            questions in similar graph matching.” Paper
gorithm in PBE data transformation for various pa-                presented at the meeting of the ICDE, 2016.
rameter configurations for sizes of graph that is the         [6] Zhongjun Jin, Michael R. Anderson, Michael
number of rules varying from 3 to 1000.                           J. Cafarella, H. V. Jagadish: Foofah: A
   The graph shown below in the figure 3 show the ex-             Programming-By-Example System for Syn-
ecution times in seconds for the why-not algorithm in             thesizing Data Transformation Programs.
PBE data transformation for various parameter con-                SIGMOD Conference 2017: 1607-1610
figurations for sizes of data item in question varying
between 3 to 1000.