<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Findings from Two Decades of Research on Schema Discovery using a Systematic Literature Review</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvio Normey</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorena Etcheverry</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriana Marotta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariano P. Consens</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Federal de Educação, Ciência e Tecnologia Sul-Rio-Grandense</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de la República</institution>
          ,
          <country country="UY">Uruguay</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Toronto</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a systematic literature review applied to the last twenty years of research in the area of schema discovery (also known as schema inference, or schema extraction) applied to semistructured data. Our survey characterizes the different objectives, methodologies, and evaluations that are described in the literature. We present the preliminary findings of our analysis and make observations that can benefit future research and development efforts in the area.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Our approach follows the systematic survey methodology described in [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This
section describes the first two phases of the process: planning the review and
conducting the review. The next section reports the results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Planning the Review</title>
      <p>The first phase, review planning, consists of the following three activities.</p>
      <p>Identifying the need for the review. As far as we know, there is no
comprehensive literature survey that synthesizes the knowledge developed over
the last two decades to address schema discovery in semistructured data. We
believe that a systematic literature review can shed light on a variety of issues
relevant to future schema discovery research and development efforts.</p>
      <p>Formulating the research questions. Formulating one or more research
questions (abbreviated RQ) is a critical step in the systematic literature review
methodology we follow. Our study starts by focusing on the following research
question.</p>
      <p>RQ: What are the objectives, methodologies, and evaluations that are present
in the schema discovery literature, applied to semistructured data formats
(excluding schema discovery from web pages)?</p>
      <p>Developing the review protocol. The review protocol defines the
methods used during the execution of the systematic review (described in the next
section).</p>
    </sec>
    <sec id="sec-3">
      <title>Conducting the Review</title>
      <p>The second phase, conducting the review, is composed of two steps (search
strategy and study selection), described below.</p>
      <p>Search strategy. The objective of the search strategy is to find publications strongly
related to the RQ, while carrying out and recording bibliographic searches that are
potentially reproducible. The procedure consists of the following three steps.</p>
      <p>Identify the search terms. Search terms are formulated from the RQ, and
synonyms are incorporated (using the Boolean OR connector). In our study, the
search expression corresponds to "schema discovery OR schema extraction OR
schema inference".</p>
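      <p>As a minimal sketch of how such an expression can be assembled, the following Python snippet joins the three synonyms with the OR connector. The function name and the optional phrase quoting are illustrative assumptions; they are not part of the review protocol, and each portal may require its own syntax.</p>
      <preformat>
```python
# Synonyms taken from the search expression given in the text.
SEARCH_TERMS = ["schema discovery", "schema extraction", "schema inference"]


def build_search_expression(terms, quote_phrases=False):
    """Join the synonyms with the Boolean OR connector.

    quote_phrases is a hypothetical option: it wraps each term in double
    quotes, which some search interfaces may need for exact phrases.
    """
    if quote_phrases:
        terms = ['"{}"'.format(term) for term in terms]
    return " OR ".join(terms)


print(build_search_expression(SEARCH_TERMS))
# prints: schema discovery OR schema extraction OR schema inference
```
      </preformat>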
      <p>Identify the literature resources. Based on their judgment, the authors selected five
electronic bibliographic databases: ACM Digital Library, IEEE Xplore,
SpringerLink, ScienceDirect, and Scopus. The authors consider that ACM (Digital Library),
IEEE (Xplore), Springer (Link), and Elsevier (ScienceDirect) are the main
publishers (and corresponding bibliographic portals) of highly ranked journals and
conferences in the computer science area. The authors also consider that
Scopus, an abstract and citation database that indexes a broad set of sources, can
contribute by expanding the search space.</p>
      <p>Conduct the search process. The search process consists of submitting
the search expression to each of the five selected libraries and storing all
the results obtained. This requires adapting the search expression (and choosing
appropriate advanced search options) for each portal interface.</p>
      <p>
        Study selection. The set of references obtained from the searches conducted
in all the libraries is filtered in several steps: duplicates are removed, the title
and the abstract of each paper are judged in order to discard out-of-topic papers,
and then inclusion and exclusion criteria are applied to obtain a refined set of
papers. The initial search returned 412 pertinent papers, of which 107
were identified as duplicates and therefore excluded, resulting in a set of 305
papers. Then, out-of-topic papers were discarded after reading their title and
abstract. Finally, inclusion and exclusion criteria were applied to further filter the
set of papers. The inclusion criteria consisted of keeping only computer science
papers related to the research question that were published between 1997
and 2017. The exclusion criteria consisted of filtering out papers that are not written in
English, or that focus on HTML-based sources or the Deep Web. We excluded works
that deal with schema discovery from structured web pages since they have
already been reviewed extensively in the context of web mining tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The outcome
of this selection process was 76 selected papers and 229 excluded papers.
      </p>
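      <p>The counts reported for the selection process can be checked with a short calculation. The sketch below is only an illustration: the function name and the returned dictionary keys are assumptions introduced here, while the four input numbers are the ones stated in the text.</p>
      <preformat>
```python
def selection_funnel(initial, duplicates, selected, excluded):
    """Return the intermediate counts of the study selection process."""
    unique = initial - duplicates
    # The selected and excluded papers must account for every unique paper.
    if selected + excluded != unique:
        raise ValueError("selected + excluded must equal the de-duplicated total")
    return {
        "unique": unique,
        "selected": selected,
        "excluded": excluded,
        "selected_share": selected / unique,
    }


funnel = selection_funnel(initial=412, duplicates=107, selected=76, excluded=229)
print(funnel["unique"])                    # prints: 305
print(round(funnel["selected_share"], 3))  # prints: 0.249
```
      </preformat>
      <p>In other words, roughly one in four of the de-duplicated papers was retained for analysis.</p>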
      <p>Review results and discussion</p>
      <p>In this section we first define the criteria used to analyze the selected papers.
Then, we present the results of a preliminary analysis, which consists of applying
these criteria to a subset of 31 of the selected papers. Table 1 summarizes the
results of this analysis. Finally, we discuss some interesting aspects that we observed.</p>
      <p>The analysis criteria are organized into three aspects: the objectives of the paper,
the methodology outlined in the paper, and the evaluation strategy. We further
refine these aspects as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>
            Objectives. We identify the problems and contexts addressed by the work.
We define four categories: concrete motivation and applications (OM),
semistructured data formats supported (OF), and schema languages supported for the
input (OSI) and the output (OSO). For example, observing the row
corresponding to [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] in Table 1, we see that the motivation for extracting the
schema is to obtain a schema description in order to query data (OM), while
the addressed data format is JSON (OF), and JSON appears as the output
format used in the proposal (OSO).
          </p>
        </list-item>
        <list-item>
          <p>
            Methodology. This criterion focuses on the main characteristics of the
proposed solutions. The defined categories are: internal data representation
(MD); inference of attributes, related entities, constraints, and types (MI); and software
environment and availability of an implementation (MS). Continuing with
the previous example, in Table 1, row [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], we find that the proposed solution
uses a graph as its internal representation (MD), that it infers attributes and data
types (MI), and that the paper presents information about the implementation
(MS).
          </p>
        </list-item>
        <list-item>
          <p>
            Evaluation. This aspect aims to answer how experiments were
carried out and how their results were studied and validated. For this purpose
the following categories were defined: quality measures for the resulting schema
(EQ), experimental input data (ED), experimental measures (EM),
comparison with alternative solutions (EC), support for updates, appends, and streaming
(EU), support for schema evolution (EE), and scalability of the solution and
parallelization (ES). Returning to our example in Table 1, in the row
corresponding to [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] we observe that the authors do not present quality measures
for the obtained schema (EQ), that they use real data in the experiments
(ED), that they measure the execution time of their process (EM), and that
they present a comparison with other solutions (EC). However, they do not
report experiments on updates, appends, streaming, or schema evolution
(EU and EE), nor do they carry out experiments on scalability
or parallelization (ES).
          </p>
        </list-item>
      </list>
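      <p>To make the criteria concrete, the following sketch encodes one row of the analysis as a Python data structure, using the values that the text reports for [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The class, its field types, and the helper function are assumptions made for illustration; the review itself does not prescribe a machine-readable encoding, and the value strings paraphrase the description above.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperAnalysis:
    """One analyzed paper; field names follow the category codes in the text."""
    reference: str
    om: Optional[str] = None   # concrete motivation and applications
    of: Optional[str] = None   # semistructured data format supported
    osi: Optional[str] = None  # schema language supported for the input
    oso: Optional[str] = None  # schema language supported for the output
    md: Optional[str] = None   # internal data representation
    mi: tuple = ()             # what is inferred (attributes, types, ...)
    ms: bool = False           # implementation information is presented
    eq: bool = False           # quality measures for the resulting schema
    ed: Optional[str] = None   # experimental input data
    em: Optional[str] = None   # experimental measures
    ec: bool = False           # comparison with alternative solutions
    eu: bool = False           # updates, appends, streaming
    ee: bool = False           # schema evolution
    es: bool = False           # scalability and parallelization


def not_covered(row):
    """List the yes/no evaluation categories that the paper does not cover."""
    return [code for code in ("eq", "ec", "eu", "ee", "es") if not getattr(row, code)]


# Values as described in the text for reference [4].
row_4 = PaperAnalysis(
    reference="[4]", om="query data", of="JSON", oso="JSON",
    md="graph", mi=("attributes", "data types"), ms=True,
    ed="real data", em="execution time", ec=True,
)
print(not_covered(row_4))  # prints: ['eq', 'eu', 'ee', 'es']
```
      </preformat>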
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Most of the selected works do not present a motivation for schema extraction;
they focus only on the methodology. In some cases the motivation is the
need for a schema to improve data querying, to implement query verification,
or to manipulate data. Few works emphasize the need for schema extraction
to check constraints.</p>
      <p>Regarding data formats, most of the works use either XML, JSON, or RDF.
We observe that older data formats, such as OEM and XML, were the object of
investigation in the 1990s and the beginning of the past decade. In the current
decade JSON and RDF are the main objects of study. Most of the reviewed
solutions receive raw data as input (e.g., XML or JSON documents), while the
output format varies. In the case of XML data, the extracted schemas are often
presented as DTDs and XML Schemas. In the cases of RDF and JSON, the
extracted schema often consists of a class structure.</p>
      <p>Most of the reviewed works on JSON and XML use trees to internally
represent the inferred schema, and also as output. In the case of RDF data, tuples,
classes, and graphs are used, and there is no clear preference.</p>
      <p>Regarding what the reviewed works produce, we observe that all the
proposals infer the structure of the schema, while 39% of them also infer data
types and 26% also infer related entities.</p>
      <p>With regard to the experimentation, we observe that most of the papers measure
the quality of the extracted schema. This evaluation is often carried out on real
data, while few works use synthetic data. Two metrics are frequently used to
evaluate the solutions: the effectiveness of the schema, to evaluate the accuracy
of the proposed methodology, and the execution time, to test its efficiency. Most
of the reviewed works (62%) do not compare their approach with others, and in
most cases scalability tests are omitted. A small portion of the literature
reviewed addresses evaluation. A similar comment applies to the availability of
tools and implementations.</p>
      <p>Another significant point of analysis is the shortage of solutions that support
schema evolution, updates, appends, or streaming. This means that most of the
proposed algorithms must reprocess the entire database and infer a
new schema in order to keep it up to date.</p>
      <p>Table 1. Results of the preliminary analysis of 31 of the selected papers, organized by the criteria OM, OF, OSI, OSO, MD, MI, MS, EQ, ED, EM, EC, EU, EE, and ES.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kitchenham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Procedures for performing systematic reviews</article-title>
          . Keele, UK, Keele University 33(
          <year>2004</year>
          ) (
          <year>2004</year>
          )
          <volume>1</volume>
          –
          <fpage>26</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brereton</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kitchenham</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Budgen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khalil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Lessons from applying the systematic literature review process within the software engineering domain</article-title>
          .
          <source>Journal of systems and software 80(4)</source>
          (
          <year>2007</year>
          )
          <volume>571</volume>
          –
          <fpage>583</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kosala</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blockeel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Web mining research: A survey</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>2</volume>
          (
          <issue>1</issue>
          ) (
          <year>June 2000</year>
          )
          <volume>1</volume>
          –
          <fpage>15</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Storl, U.,
          <string-name>
            <surname>Darmstadt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Scherzinger</given-names>
            <surname>Othregensburg</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores</article-title>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cánovas Izquierdo</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>JSONDiscoverer: Visualizing the schema lurking behind JSON documents</article-title>
          .
          <source>Knowledge-Based Systems</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Baazizi</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colazzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Counting types for massive JSON datasets</article-title>
          .
          <source>In: Proceedings of The 16th International Symposium on Database Programming Languages - DBPL '17</source>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gallinucci</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golfarelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rizzi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Schema Profiling of Document Stores</article-title>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morales</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molina</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          :
          <article-title>Inferring versioned schemas from NoSQL databases and its applications</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Schema management for document stores</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>8</volume>
          (
          <issue>9</issue>
          ) (May
          <year>2015</year>
          )
          <volume>922</volume>
          –
          <fpage>933</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Cánovas Izquierdo</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Discovering implicit schemas in JSON data</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Baazizi</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lahmar</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colazzo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sartiani</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Schema Inference for Massive JSON Datasets</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>DiScala</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.:</given-names>
          </string-name>
          <article-title>Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data</article-title>
          .
          <source>In: Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16</source>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Christodoulou</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paton</surname>
            ,
            <given-names>N.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>A.A.A.</given-names>
          </string-name>
          :
          <article-title>Structure inference for linked data sources using clustering</article-title>
          .
          <source>In: Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX: Special Issue on Big Data and Open Data</source>
          . (
          <year>2015</year>
          )
          <volume>1</volume>
          –
          <fpage>25</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kellou-Menouer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kedad</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>On-line Versioned Schema Inference for Large Semantic Web Data Sources</article-title>
          . (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruetze</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Profiling and mining RDF data with ProLOD++</article-title>
          .
          <source>In: Proceedings - International Conference on Data Engineering</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Weise</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haag</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>LD-VOWL: Extracting and visualizing schema information for linked data</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Konrath</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scherp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data</article-title>
          .
          <source>In: Journal of Web Semantics</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Matono</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kojima</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Paragraph tables: A storage scheme based on RDF document structure</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kellou-Menouer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kedad</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Schema Discovery in RDF Data Sources</article-title>
          . In: ER. (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Mlynkova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Necasky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards Inference of More Realistic XSDs</article-title>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Marciniak</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>XML schema and data summarization</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mlynkova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Necasky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Heuristic Methods for Inference of XML Schemas: Lessons Learned and Open Issues</article-title>
          .
          <volume>24</volume>
          (
          <issue>4</issue>
          ) (
          <year>2013</year>
          )
          <volume>577</volume>
          –
          <fpage>602</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Guen-Hae</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sang-Ki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yo-Sub</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Inferring a Relax NG Schema from XML Documents</article-title>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parthepan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Efficient schema extraction from a large collection of XML documents</article-title>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Klempa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikula</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smetana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svirec</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitasek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Necasky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holubova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>JInfer: A framework for XML schema inference</article-title>
          .
          <source>Computer Journal</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Janga</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          :
          <article-title>Mapping Heterogeneous XML Document Collections to Relational Databases</article-title>
          .
          <source>LNCS 8824</source>
          (
          <year>2014</year>
          )
          <fpage>86</fpage>
          –
          <lpage>99</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Discovering restricted regular expressions with interleaving</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Klempa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mlynkova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Optimization and Refinement of XML Schema Inference Approaches</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>10</volume>
          (
          <year>2012</year>
          )
          <fpage>120</fpage>
          –
          <lpage>127</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Candan</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sapino</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          :
          <article-title>XML Data Integration: Schema Extraction and Mapping</article-title>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Janga</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          :
          <article-title>Schema extraction and integration of heterogeneous XML document collections</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Garofalakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gionis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rastogi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seshadri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shim</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaist</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>XTRACT: A System for Extracting Document Type Descriptors from XML Documents</article-title>
          . (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Bex</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelade</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neven</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vansummeren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data</article-title>
          .
          <source>ACM Transactions on the Web</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Hegewald</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>XStruct: Efficient schema extraction from multiple and large XML documents</article-title>
          .
          <source>In: ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Nestorov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chawathe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Representative Objects: Concise Representations of Semistructured, Hierarchical Data</article-title>
          . (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>