
Findings from Two Decades of Research on Schema Discovery using a Systematic Literature Review

Silvio Normey

Lorena Etcheverry

Adriana Marotta

Mariano P. Consens

Instituto Federal de Educação, Ciência e Tecnologia Sul-Rio-Grandense; Universidad de la República, Uruguay; University of Toronto

We present a systematic literature review of the last twenty years of research in the area of schema discovery (also known as schema inference or schema extraction) applied to semistructured data. Our survey characterizes the different objectives, methodologies, and evaluations that are described in the literature. We present the preliminary findings of our analysis and make observations that can benefit future research and development efforts in the area.

Introduction

Our approach follows the systematic survey methodology described in [1, 2]. This Section describes the first two phases of the process: planning the review and conducting the review. The next Section reports the results.

2.1 Planning the Review

The first phase, review planning, consists of the following three activities.

Identifying the need for the review. As far as we know, there is no comprehensive literature survey that synthesizes the knowledge developed over the last two decades to address schema discovery in semistructured data. We believe that a systematic literature review can shed light on a variety of issues relevant to future schema discovery research and development efforts.

Formulating the research questions. Formulating one or more research questions (abbreviated RQ) is a critical step in the systematic literature review methodology we follow. Our study starts by focusing on the following research question.

RQ: What are the objectives, methodologies, and evaluations that are present in the schema discovery literature, applied to semistructured data formats (excluding schema discovery from web pages)?

Developing the review protocol. The review protocol defines the methods used during the execution of the systematic review (described in the next Section).

2.2 Conducting the review

The second phase, conducting the review, is composed of two steps (search strategy and study selection), described below.

Search strategy. The objective of the search strategy is to find publications strongly related to the RQ, while completing and recording bibliographic searches that are potentially reproducible. The procedure consists of the following three steps.

Identify the search terms. Search terms are formulated from the RQ, and synonyms are incorporated (using the boolean OR connector). In our study, the search expression corresponds to "schema discovery OR schema extraction OR schema inference".

Identify the literature resources. Based on the authors' judgment, five electronic bibliographic databases were selected: ACM Digital Library, IEEE Xplore, SpringerLink, ScienceDirect, and Scopus. The authors consider that ACM (Digital Library), IEEE (Xplore), Springer (Link), and Elsevier (ScienceDirect) are the main publishers (and corresponding bibliographic portals) of highly ranked journals and conferences in the computer science area. The authors also consider that Scopus, an abstract and citation database that indexes a broad set of sources, can contribute by expanding the search space.

Conduct the search process. The search process consists of submitting the search expression to each of the five selected libraries and storing all the results obtained. This requires adapting the search expression (and choosing appropriate advanced search options) for each portal interface.
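As a minimal sketch of the first and third steps, the search expression can be assembled from the list of synonyms and then adapted for a given portal. The helper name `build_search_expression` and the `quote_phrases` option are illustrative assumptions; the exact syntax accepted by each portal's advanced search is not described in this review.

```python
def build_search_expression(terms, quote_phrases=False):
    """Join synonymous search terms with the boolean OR connector.

    quote_phrases is a hypothetical option for portals that need
    multi-word terms wrapped in double quotes to match exact phrases.
    """
    if quote_phrases:
        terms = [f'"{term}"' for term in terms]
    return " OR ".join(terms)


# The three synonyms used in the study's search expression.
SEARCH_TERMS = ["schema discovery", "schema extraction", "schema inference"]

plain = build_search_expression(SEARCH_TERMS)
quoted = build_search_expression(SEARCH_TERMS, quote_phrases=True)
print(plain)
print(quoted)
```

Keeping the terms in one list and deriving every portal-specific variant from it is one way to keep the searches reproducible across libraries.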

Study selection. The set of references obtained from the searches conducted in all the libraries is filtered in several steps: duplicates are removed, the title and abstract of each paper are judged in order to discard off-topic papers, and then inclusion and exclusion criteria are applied to obtain a refined set of papers. The initial search returned 412 pertinent papers, of which 107 were identified as duplicates and therefore excluded, resulting in a set of 305 papers. Then, off-topic papers were discarded after reading their title and abstract. Finally, inclusion and exclusion criteria were applied to further filter the set of papers. The inclusion criteria consisted of keeping only computer science papers related to the research question that were published between 1997 and 2017. The exclusion criteria consisted of filtering out papers that are not written in English or that focus on HTML-based sources or the Deep Web. We excluded works that deal with schema discovery from structured web pages since they have already been reviewed extensively in the context of web mining tasks [3]. The outcome of this selection process was 76 selected papers and 229 excluded papers.
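The selection steps can be sketched as a small filtering pipeline. This is only an illustration under stated assumptions: the `Paper` fields are hypothetical, duplicates are detected here by normalized title (the review does not say how duplicates were identified), and the manual judgments (topic relevance, focus on HTML or Deep Web sources) are modeled as precomputed flags. The last lines check that the reported counts are mutually consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    language: str
    on_topic: bool           # manual judgment from title and abstract
    html_or_deep_web: bool   # focused on HTML-based sources or the Deep Web


def deduplicate(papers):
    """Keep the first paper for each normalized title (assumed duplicate key)."""
    seen, unique = set(), []
    for paper in papers:
        key = " ".join(paper.title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique


def passes_criteria(paper):
    """Apply the inclusion and exclusion criteria stated in the text."""
    included = paper.on_topic and 1997 <= paper.year <= 2017
    excluded = paper.language != "english" or paper.html_or_deep_web
    return included and not excluded


def select(papers):
    """Remove duplicates, then keep the papers that satisfy the criteria."""
    return [p for p in deduplicate(papers) if passes_criteria(p)]


# Consistency check on the counts reported in the text.
retrieved, duplicates, selected = 412, 107, 76
after_deduplication = retrieved - duplicates   # 305 papers
excluded = after_deduplication - selected      # 229 papers
```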

3 Review results and discussion

In this section we first define the criteria used to analyze the selected papers. Then, we present the results of a preliminary analysis, which consists of applying these criteria to a subset of 31 of the selected papers. Table 1 summarizes the results of this analysis. Finally, we discuss some interesting aspects observed.

The analysis criteria are organized in three aspects: the objectives of the paper, the methodology outlined in the paper, and the evaluation strategy. We further refine these aspects as follows:

- Objectives. We identify the problems and contexts addressed by the work. We define four categories: concrete motivation and applications (OM), semistructured data formats supported (OF), and schema languages supported for the input (OSI) and the output (OSO). For example, observing the row corresponding to [4] in Table 1, we see that the motivation for extracting the schema is to obtain a schema description in order to query data (OM), while the addressed data format is JSON (OF), and JSON appears as the output format used in the proposal (OSO).

- Methodology. This criterion focuses on the main characteristics of the proposed solutions. The defined categories are: internal data representation (MD), inferred elements, i.e., attributes, related entities, constraints, and types (MI), and software environment and availability of an implementation (MS). Continuing with the previous example, in Table 1, row [4], we find that the proposed solution uses a graph as internal representation (MD), that it infers attributes and data types (MI), and that the paper presents information about the implementation (MS).

- Evaluation. This analysis aspect aims to answer how experiments were carried out and how their results were studied and validated. For this purpose the following categories were defined: quality measures for the resulting schema (EQ), experimental input data (ED), experimental measures (EM), comparison with alternative solutions (EC), support for updates, appends, and streaming (EU), support for schema evolution (EE), and scalability of the solution and parallelization (ES). Returning to our example in Table 1, in the row corresponding to [4] we observe that the authors do not present quality measures for the obtained schema (EQ), that they use real data in the experiments (ED), that they measure the execution time of their process (EM), and that they present a comparison with other solutions (EC). However, they do not report experiments on updates, appends, streaming, or schema evolution (EU and EE), nor do they carry out experiments on scalability or parallelization (ES).
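One way to picture a row of Table 1 is as a record with one field per category, grouped by aspect. The sketch below is an assumption about how such a record could be encoded: the class name `PaperAnalysis`, the field types, and the string values are hypothetical, and only the category codes and the description of row [4] come from the text. OSI is left unset because the text does not state it for [4].

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Category codes grouped by analysis aspect, as defined in the text.
ASPECTS = {
    "objectives": ("om", "of", "osi", "oso"),
    "methodology": ("md", "mi", "ms"),
    "evaluation": ("eq", "ed", "em", "ec", "eu", "ee", "es"),
}


@dataclass
class PaperAnalysis:
    reference: str
    om: Optional[str] = None     # motivation and applications
    of: Optional[str] = None     # semistructured data format
    osi: Optional[str] = None    # input schema language
    oso: Optional[str] = None    # output schema language
    md: Optional[str] = None     # internal data representation
    mi: Tuple[str, ...] = ()     # inferred elements
    ms: bool = False             # implementation information available
    eq: bool = False             # quality measures for the schema
    ed: Optional[str] = None     # experimental input data
    em: Optional[str] = None     # experimental measures
    ec: bool = False             # comparison with alternatives
    eu: bool = False             # updates, appends, streaming
    ee: bool = False             # schema evolution
    es: bool = False             # scalability and parallelization


def by_aspect(analysis):
    """Group the category values of one analyzed paper by aspect."""
    return {
        aspect: {code: getattr(analysis, code) for code in codes}
        for aspect, codes in ASPECTS.items()
    }


# Row [4] as described in the text (value encodings are illustrative).
row_4 = PaperAnalysis(
    reference="[4]", om="query data", of="JSON", oso="JSON",
    md="graph", mi=("attributes", "types"), ms=True,
    eq=False, ed="real data", em="execution time", ec=True,
)
print(by_aspect(row_4)["evaluation"])
```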

3.1 Discussion

Most of the selected works do not present a motivation for schema extraction; they focus only on the methodology. In some cases the motivation is the need for a schema to improve data querying, to implement query verification, or to manipulate data. Few works emphasize the need for schema extraction to check constraints.

Regarding data formats, most of the works use XML, JSON, or RDF. We observe that older data formats, such as OEM and XML, were the object of investigation in the 1990s and the beginning of the past decade. In the current decade, JSON and RDF are the main objects of study. Most of the reviewed solutions receive raw data as input (e.g., XML or JSON documents), while the output format varies. In the case of XML data, the extracted schemas are often presented as DTDs and XML Schemas. In the cases of RDF and JSON, the extracted schema often consists of a class structure.

Most of the reviewed works on JSON and XML use trees to internally represent the inferred schema, and also as output. In the case of RDF data, tuples, classes, and graphs are used, with no clear preference.

Regarding what the reviewed works produce, we observe that all the proposals infer the structure of the schema, while 39% of them also infer data types and 26% also infer related entities.

With regard to experimentation, we observe that most of the papers measure the quality of the extracted schema. This evaluation is often carried out on real data, while few works use synthetic data. Two metrics are frequently used to evaluate the solutions: the effectiveness of the schema, to evaluate the accuracy of the proposed methodology, and the execution time, to test its efficiency. Most of the reviewed works (62%) do not compare their approach with others, and in most cases scalability tests are omitted. A small portion of the literature reviewed addresses evaluation. A similar comment applies to the availability of tools and implementations.

Another significant point of analysis is the shortage of solutions that support schema evolution, updates, appends, or streaming. This means that, for most of the proposed algorithms, it is necessary to re-process the entire database and infer a new schema in order to keep it up to date.
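To illustrate what support for appends could look like, the following sketch maintains a simple structural summary of JSON documents that is updated one document at a time, so that new data does not force a full re-processing. It is not the algorithm of any reviewed paper: the class `IncrementalSchema`, the path notation, and the use of Python type names as data types are assumptions made for this example only.

```python
def walk(value, path="$"):
    """Yield (path, type_name) for every node of a parsed JSON value."""
    if isinstance(value, dict):
        yield path, "object"
        for key, child in value.items():
            yield from walk(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield path, "array"
        for child in value:
            yield from walk(child, f"{path}[]")
    else:
        yield path, type(value).__name__


class IncrementalSchema:
    """Structural summary (paths and observed types) updated per document."""

    def __init__(self):
        self.types = {}      # path -> set of observed type names
        self.counts = {}     # path -> number of documents containing it
        self.documents = 0

    def add(self, document):
        """Merge one appended document into the summary."""
        self.documents += 1
        seen = set()
        for path, type_name in walk(document):
            self.types.setdefault(path, set()).add(type_name)
            seen.add(path)
        for path in seen:
            self.counts[path] = self.counts.get(path, 0) + 1

    def optional_paths(self):
        """Paths that are missing from at least one document."""
        return sorted(p for p, c in self.counts.items() if c < self.documents)


schema = IncrementalSchema()
schema.add({"id": 1, "name": "a"})
schema.add({"id": "x", "tags": ["t"]})   # appended later, no re-processing
print(schema.types["$.id"], schema.optional_paths())
```

Because each call to `add` only touches the paths of the new document, the summary after an append equals the one obtained by processing all documents from scratch. Handling deletions or in-place updates would need more bookkeeping than this sketch provides.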

[Table 1. Summary of the analysis of the 31 reviewed papers according to the criteria OM, OF, OSI, OSO, MD, MI, MS, EQ, ED, EM, EC, EU, EE, and ES.]

1. Kitchenham, B.: Procedures for performing systematic reviews. Keele, UK, Keele University 33(2004) (2004) 1-26

2. Brereton, P., Kitchenham, B.A., Budgen, D., Turner, M., Khalil, M.: Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software 80(4) (2007) 571-583

3. Kosala, R., Blockeel, H.: Web mining research: A survey. SIGKDD Explor. Newsl. 2(1) (June 2000) 1-15

4. Klettke, M., Storl, U., Scherzinger, S.: Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. (2015)

5. Canovas Izquierdo, J.L., Cabot, J.: JSONDiscoverer: Visualizing the schema lurking behind JSON documents. Knowledge-Based Systems (2016)

6. Baazizi, M.A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: Proceedings of The 16th International Symposium on Database Programming Languages - DBPL '17. (2017)

7. Gallinucci, E., Golfarelli, M., Rizzi, S.: Schema Profiling of Document Stores. (2017)

8. Ruiz, D.S., Morales, S.F., Molina, J.G.: Inferring versioned schemas from NoSQL databases and its applications. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2015)

9. Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document stores. Proc. VLDB Endow. 8(9) (May 2015) 922-933

10. Canovas Izquierdo, J.L., Cabot, J.: Discovering implicit schemas in JSON data. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2013)

11. Baazizi, M.A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema Inference for Massive JSON Datasets

12. DiScala, M., Abadi, D.J.: Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. In: Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16. (2016)

13. Christodoulou, K., Paton, N.W., Fernandes, A.A.A.: Structure inference for linked data sources using clustering. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX: Special Issue on Big Data and Open Data. (2015) 1-25

14. Kellou-Menouer, K., Kedad, Z.: On-line Versioned Schema Inference for Large Semantic Web Data Sources. (2017)

15. Abedjan, Z., Gruetze, T., Jentzsch, A., Naumann, F.: Profiling and mining RDF data with ProLOD++. In: Proceedings - International Conference on Data Engineering. (2014)

16. Weise, M., Lohmann, S., Haag, F.: LD-VOWL: Extracting and visualizing schema information for linked data. In: CEUR Workshop Proceedings. (2016)

17. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. In: Journal of Web Semantics. (2012)

18. Matono, A., Kojima, I.: Paragraph tables: A storage scheme based on RDF document structure. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2012)

19. Kellou-Menouer, K., Kedad, Z.: Schema Discovery in RDF Data Sources. In: ER. (2015)

20. Mlynkova, I., Necasky, M.: Towards Inference of More Realistic XSDs. (2009)

21. Marciniak, J.: XML schema and data summarization. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2010)

22. Mlynkova, I., Necasky, M.: Heuristic Methods for Inference of XML Schemas: Lessons Learned and Open Issues. 24(4) (2013) 577-602

23. Guen-Hae, K., Sang-Ki, K., Yo-Sub, H.: Inferring a Relax NG Schema from XML Documents. (2016)

24. Xing, G., Parthepan, V.: Efficient schema extraction from a large collection of XML documents. (2011)

25. Klempa, M., Kozak, M., Mikula, M., Smetana, R., Starka, J., Svirec, M., Vitasek, M., Necasky, M., Holubova, I.: JInfer: A framework for XML schema inference. Computer Journal (2013)

26. Janga, P., Davis, K.C.: Mapping Heterogeneous XML Document Collections to Relational Databases. LNCS 8824 (2014) 86-99

27. Peng, F., Chen, H.: Discovering restricted regular expressions with interleaving. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2015)

28. Klempa, M., Starka, J., Mlynkova, I.: Optimization and Refinement of XML Schema Inference Approaches. Procedia Computer Science 10 (2012) 120-127

29. Cao, H., Qi, Y., Candan, K.S., Sapino, M.L.: XML Data Integration: Schema Extraction and Mapping. (2010)

30. Janga, P., Davis, K.C.: Schema extraction and integration of heterogeneous XML document collections. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). (2013)

31. Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: A System for Extracting Document Type Descriptors from XML Documents. (2000)

32. Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data. ACM Transactions on the Web (2010)

33. Hegewald, J., Naumann, F., Weis, M.: XStruct: Efficient schema extraction from multiple and large XML documents. In: ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops. (2006)

34. Nestorov, S., Ullman, J., Wiener, J., Chawathe, S.: Representative Objects: Concise Representations of Semistructured, Hierarchical Data. (1997)