Generating rule suggestions in PBE data transformation Kuldeep Reddy Southeast University Nanjing, China brkuldeep@gmail.com browsing query history and results. There has been recent work on developing query specification tech- Abstract niques involving just gestures and voice. Text based interface techniques include keyword search and natu- This paper presents an idea for generating ral language search. Miscellaneous techniques include rule suggestions in performing programming- spreadsheet based interfaces, query-by-example tech- by-example data transformation operations. nique, techniques to handle the empty-result problem Data transformation is an important opera- with query relaxation or reformulation etc, personal- tion in data integration that is carried out ization and diversification techniques. frequently in many application domains, its Data transformations Data transformations are of- programming-by-example variant consists of a ten required to address the problem of heterogeneity grammar from which a rule is built. It is then in various data sources by converting the data from used in creating a new data item from an ex- multiple sources into the same format. One common isting data item by searching for occurrence approach to handle data transformation is to define a of pattern data item and replacing it with a transformation language and then generate rules based new data item defined in the rule. This pa- on the language to perform data transformation on per proposes two new components in this sys- a large set of data. However, this approach usually tem a) generating suggestions for new rules requires expert users to write individual transforma- based on rules which have been given as in- tions for each data source manually. To alleviate this put, in particular differences between struc- problem, recently approaches have developed as in [2] tures in rule sequences b) suggesting rule mod- that take user examples as input and build data trans- ifications that would include an item user is formation rules around it taking into consideration a interested in, with the final modified ruleset grammar. as a way to provide explanation to user why- not question. Query Suggestions An approach to solve the ambi- guity and inaccuracy in the information retrieval sys- 1 Introduction tem is query suggestion [4]. It is very common for a user to reformulate their query when they didnt receive Data integration Data integration involves combining ideal result from their original query. The system can data residing in different sources and providing users improve the users searching effort by providing sug- with a unified view of them [1]. Various compo- gestions by guessing the user intention, according to nents of typical data integration process include - data either users past behaviour or data-driven techniques. quality assessment, record matching, schema match- Why-not There have been many techniques pro- ing, data transformation, data provenance assessment, posed in literature that can explain where a piece of data warehousing etc. data comes from, this can explain surprises in a result Database usability primarily concerns with devel- set. The idea of whynot [5], on the other hand, seeks oping techniques to make traditional databases more to explain why a certain item is missing from the result accessible to the users [3]. Examples of visual inter- set. In the setting of traditional relational databases faces include faceted search, template based search, where the idea of why-not originated, the user provides Copyright © CIKM 2018 for the individual papers by the papers' the SQL query and a missing result item, the system authors. Copyright © CIKM 2018 for the volume as a collection then pinpoints where the item was not included in the by its editors. This volume and its papers are published under resultset. the Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 System design that just makes use of rules written till now and does not require any additional data sources and only uti- The overall system prototype design involves two new lizes the graph structure developed earlier. The basic components - a) a module to generate rule suggestions intuition behind the proposed approach is that when- from the graph structure constructed to manage rule ever the user starts with a partial rule, it searches for sequences c) a module to answer user why-not ques- the occurrences of the partial rules in the rules graph. tions on data transformations and providing explana- Performing such a such on the rules graph is done by tion in the form of modified graph structure. They key string similarity matching utilizing various measures modules are described in more detail below. in literature such as jaro-winkler, sorenson-dice etc. The suggestions are ranked according to the popularity 2.1 Generate rule suggestions score, which represents the frequency of the elements In the first part of system, the paper proposes a graph of new partial rule found in the earlier rules graph. structure to store PBE data transformation rules. The Of course, utilizing more sophisticated string similar- rules we consider have been developed as part of foofah ity heuristics is important and impact on the quality system [6], where the focus has been on improving its of suggestions generated and studying them is part of usability. By having a graph structure we are storing future work. relationships between common components in rule se- quences and if the new rule structure is similar to var- ious combination of rules in the graph, it can retrieve relevant common rules more efficiently from graph in- 2.2 Answering why-not questions stead of the inefficient repeated sequential searches. The rule sequences are organized in a graph structure In the second part of the paper, it proposes applying described as follows. That is the rule sequence at the why-not questions in PBE data transformation. Fre- top of graph is represented as root and can be thought quently situations arise where the user is interested in of as a union of all the rules. The rule sequence in knowing why a particular item is missing from the re- the next level are the identifiers for individual rule se- sultset. This concept has been developed extensively quences, which means if we have to build such a struc- in relational query systems and of late in graph match- ture for 10 rules there will be 10 nodes with identifiers ing and clustering. This paper is the applies con- rule1, rule2, rule3 etc for each of the rules in level cept of why-not in this context of graph-based struc- of the graph structure. The next level of the struc- ture PBE data transfomation which is used to gener- ture contain the individual rule components as part of ate suggestions. The user specifies a data item that a directed subgraph, which means for instance if the should have appeared in the transformed dataset but PBE data transformation program contains subrule se- instead is missing from the transformed dataset. To quences of the form split(t,1) and replace(t,2) then the know why the required dataset is missing, it first iden- level 3 of the structure will contain nodes denoted by tifies the rule which generated either the data item these subrules in the form of directed subgraph with itself exactly or the rule sequence that could have pro- edges between them denoting the sequence of their ex- duced a larger transformation of in which the data ecution. The next level contain nodes identifying the item in question is part. Infact, there can be more individual variables that would eventually point to say than one rule that produced the data item in question. column names in a database or constant values or lit- All the rule sequences which produced the data item erals. The bottom level of the structure contain nodes in question exactly and those that produced larger for the various literal or numeric constant values. So, transformed dataset of which it is part of are identi- overall we end up with a structure consisting of 5 lev- fied and which are then modified(through either insert, els which can be used to manage the rule sequences, delete or move) to reflect the changes required by the at the first level we have a root node, at the second user. Identifying such rules and modifications are done level we have identifiers for individual rule sequences, through string similarity algorithms and simulated an- at the third level we have nodes for subsequences in nealing techniques. Designing better heuristics is part rule sequences with directions between them denoted of future work. On the other hand, replacing the data their flow of execution, at the fourth level we have vari- item in question can affect other rules which have the able names and fifth level we have literal or numeric data item relevant rule in question either exactly or as constant values. part of larger ruleset. Therefore, another round of rule In order to alleviate some of burden of writing all identification is required to find rule sequences which the rules for the user, this section proposes a way to have the rules with the data item in question. This generate rule suggestions using the graph structure de- again requires string similarity algorithms, of course signed above. This is an entirely data-driven approach studying better heuristics is again part of future work. 4 Future work The purpose of the paper is to introduces a new ap- proach to generate PBE data transformation rule sug- gestions and debug it. Designing more comprehensive complex solutions to the aforementioned problems is part of future work. References [1] Halevy, Alon Y. Data Integration: A Status Report. , University of Washington , Seattle, Washington, USA (2003) [2] Wu, Bo, Szekely, Pedro A. and Knoblock, Craig A. ”Learning Transformation Rules by Examples.” Paper presented at the meeting of the AAAI, 2012 3 Experiments [3] Jagadish, H. V., Chapman, Adriane, Elkiss, The experiments were conducted on a 64-bit com- Aaron, Jayapandian, Magesh, Li, Yunyao, puter running windows 7 in Java. The real- Nandi, Arnab and Yu, Cong. ”Making world data used in experiments are obtained from database systems usable..” Paper presented the University of Florida Sparse Matrix Collection at the meeting of the SIGMOD Conference, (cise.ufl.edu/research) and the Parasol project and 2007. KONECT(http://konect.uni-koblenz.de/networks). [4] Niu, Xi and Kelly, Diane. ”The use of query The graph shown below in the figure 1 show the suggestions during information search..” Inf. execution times in seconds to construct graph in PBE Process. Manage. 50 , no. 1 (2014): 218-234. data transformation for various parameter configura- tions for sizes of rule 3 to 1000. [5] Islam, Md. Saiful, Liu, Chengfei and Li, The graph shown below in the figure 2 show the Jianxin. ”Efficient answering of why-not execution times in seconds for the rule suggestion al- questions in similar graph matching.” Paper gorithm in PBE data transformation for various pa- presented at the meeting of the ICDE, 2016. rameter configurations for sizes of graph that is the [6] Zhongjun Jin, Michael R. Anderson, Michael number of rules varying from 3 to 1000. J. Cafarella, H. V. Jagadish: Foofah: A The graph shown below in the figure 3 show the ex- Programming-By-Example System for Syn- ecution times in seconds for the why-not algorithm in thesizing Data Transformation Programs. PBE data transformation for various parameter con- SIGMOD Conference 2017: 1607-1610 figurations for sizes of data item in question varying between 3 to 1000.