<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data as a Language: A Novel Approach to Data Integration</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Christos Koutras, supervised by Asterios Katsifodimos, Christoph Lofi and Geert-Jan Houben, Delft University of Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
<year>2019</year>
      </pub-date>
      <abstract>
<p>In modern enterprises, both operational and organizational data is typically spread across multiple heterogeneous systems, databases and file systems. Recognizing the value of their data assets, companies and institutions construct data lakes, storing disparate datasets from different departments and systems. However, for those datasets to become useful, they need to be cleaned and integrated. Data can be well documented, structured and encoded in different schemata, but also unstructured with implicit, human-understandable semantics. Due to the sheer scale of the data itself but also the multitude of representations and schemata, data integration techniques need to scale without relying heavily on human labor. Existing integration approaches fail to address hidden semantics without human input or some form of ontology, making large-scale integration a daunting task. The goal of my doctoral work is to devise scalable data integration methods, employing modern machine learning to exploit semantics and facilitate discovery of novel relationship types. In order to capture semantics with minimal human intervention, we propose a new approach which we call Data as a Language (DaaL). By leveraging embeddings from the Natural Language Processing (NLP) literature, DaaL aims at extracting semantics from structured and semi-structured data, allowing the exploration of relevance and similarity among different data sources. This paper discusses existing data integration mechanisms and elaborates on how NLP techniques can be used in data integration, alongside challenges and research directions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
<p>Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>Enterprises encounter data-related problems, such as the integration of multiple databases, spreadsheet files and logs, as well as semi-structured and unstructured documents; data is stored across multiple storage systems, and is generally dirty and modelled abstractly with respect to the corresponding source. For the most part, data integration has been a manual process. Traditionally, database administrators would create one federated schema integrating various databases, allowing data analysts to query all of them at the same time. To automate this process, administrators relied on Schema Matching, i.e. the process of capturing potential relationships between different data sources and models.</p>
      <p>
Nowadays it is nearly impossible to rely on humans
for such a task, since the number of data sources has increased
dramatically. Knowledge about relationships among data assets
is an essential building block for integrating data to be used
in larger-scale applications. However, most of the existing
integration methods [
        <xref ref-type="bibr" rid="ref15 ref3 ref6">3, 6, 15</xref>
        ] rely on syntax, i.e. the
symbolic representation of data as found in a database, without
considering what the data semantically represent (their
underlying meaning), which constrains the quality and number
of matches. Furthermore, recent work that incorporates
semantics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] does not consider data instances and focuses
only on the schema elements.
      </p>
<p>Departing from existing schema matching methods, we envision a novel approach, which we term Data as a Language (DaaL). Essentially, DaaL designates a method for transforming elements of structured (e.g. rows of relational tables) and semi-structured data (e.g. log entries) into a non-directed graph, whose traversal outputs a number of documents. These become the input to Natural Language Processing (NLP) techniques, which leverage recent advancements in Deep Neural Networks (DNNs). Research on NLP methods has proven that they are suitable for capturing semantics, when the training sample consists of large quantities of text corpora which also provide some context information. In a typical data lake, the abundance of (semi)structured data satisfies the requirement of quantity, but proper context is difficult to guarantee since non-textual data does not provide a standard sequence; in contrast, textual data adheres to a specific syntax. DaaL aims at filling this gap by providing a way to strengthen context when dealing with raw data values, in order to enable leveraging of NLP approaches that capture semantics. DaaL will facilitate relationship discovery between different data schemata and instances, which, in turn, can improve the accuracy of demanding data discovery and integration tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>DATA-AS-A-LANGUAGE</title>
<p>In this section we describe the Data as a Language (DaaL) approach for capturing relationships among elements of structured and semi-structured data of various schemata, from different data sources.</p>
      <p>[Figure 1: Overview of the DaaL pipeline (Transform, Create, Embed, Relate), illustrated with example data such as movie titles (Snatch, Godfather), directors (G. Ritchie, F.F. Coppola), actors (B. Pitt, R. De Niro), users, budgets and IP addresses.]</p>
<sec id="sec-2-3">
        <title>Overview</title>
<p>The DaaL pipeline comprises the following stages:
• Transforming Data to a Graph. Data from
various sources, with complete or partial structure, go
through the transformation stage. In that stage, DaaL
processes data elements, such as tuples of relational
tables, and outputs a non-directed graph connecting
data elements with edges.
• Creating Documents from the Graph. Next, we
create documents consisting of data element tokens by
traversing the graph.
• Producing Embeddings from Documents.
Subsequently, the documents serve as input for training
vector representations, commonly termed
embeddings, of data elements (e.g. entire relational tuples or
individual attribute values, or entries in semi-structured
files) using existing learning and graph-based NLP
techniques. These embeddings are constructed in such a
way that elements which share the same context have
similar vector representations.
• Finding Semantic Relationships. In the final stage,
we use the embeddings to calculate similarity
between different data assets. Since embeddings of
semantically related data elements are close to each
other, these similarities help us capture and
materialize relationships among them.</p>
<p>The output of the DaaL pipeline will consist of our initial datasets enhanced with knowledge about semantically related content among data items (e.g. a relationship between two columns of different schemata). We can then leverage this knowledge to perform data discovery and integration tasks. In what follows, we delve into details about each of the above stages, discussing existing work and challenges.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Transforming Data to a Graph</title>
      <p>As a first step, we need to devise a method for
transforming (semi)structured data into a non-directed graph. We
do so in order to create some reasonable relevance between
schema information and data values sharing some context,
which will later be of high importance for producing
accurate vector representations. In this section, we focus on how
to produce such a graph from either i) relational data, or ii)
semi-structured files.</p>
      <p>Transforming Relational Data. Consider a relation R,
and its set of m attributes {A1, . . . , Am}. For each tuple t
contained in the instance of R we want to create a connected
component of the graph. One alternative would be, for each
tuple, to create nodes representing each data value and edges
between adjacent ones. However, we would relate only data
elements that are adjacent, whereas we would like to relate
them on a row basis.</p>
      <p>Therefore, another approach would be to create a clique
for each relational tuple in the input. More specifically, for
each individual attribute value we create a node and connect
it through an edge with all other values in the same row.
This way, we manage to provide full context for relational
data elements, since we incorporate row information.</p>
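<p>The clique construction described above can be sketched in a few lines; the following Python fragment is a minimal illustration (the function name and the adjacency-map representation are our own choices, not a prescribed implementation), assuming each tuple arrives as a list of stringified attribute values:</p>
      <preformat>
```python
from itertools import combinations

def tuple_to_clique(graph, row):
    """Add one relational tuple to the graph as a clique:
    every attribute value is connected to every other value in the row."""
    for u, v in combinations(row, 2):
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

# The "Movies" tuple from Example 1.
g = {}
tuple_to_clique(g, ["1", "Snatch", "Guy Ritchie", "2000", "$7.000.000"])
# Every value is now adjacent to the 4 others of the same row.
print(sorted(g["Snatch"]))
```
      </preformat>
      <p>For a row with m attributes this materializes m(m-1)/2 edges, which foreshadows the storage challenge noted below.</p>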
<p>Example 1. Consider the following tuple from a “Movies” relation with the attribute set {ID, Title, Director, Production Year, Budget} and the clique created out of it: (1, Snatch, Guy Ritchie, 2000, $7.000.000). [Figure: a clique connecting the nodes 1, Snatch, Guy Ritchie, 2000 and $7.000.000.]</p>
      <p>
Transforming Data from Semi-Structured Files.
Consider a semi-structured file F which contains a variety of
entries e referring to various entities. Each of these entries
contains a number of fields f, each accompanied by its value.
Transforming each such entry e to a connected component
of the graph is similar to the approach used for relational
data; in this case, we treat field values as attribute values.
Challenges. Transforming data into a graph is far from
straightforward. Below we highlight three research
challenges:
• Graph Construction. Creating a clique for each
row or entry will be prohibitively costly with respect
to storage, when having a massive amount of data as
input. Therefore, in the case of relational data, if we
know the primary key of a relation, we can create only
an edge between the value of the primary key and each
attribute in the row. For semi-structured data, we
could identify fields that represent keys (e.g. if the field
name is ID), and proceed in the same way. Hence,
primary key values will act like hubs for relating attribute
values of a single row.
• Incorporating Schema Information. We could
also incorporate schema information into the graph,
such as relation (entry) and attribute (field) names and
their relationships with the corresponding attribute
(field) values. In particular, we could create nodes for
these entities too and connect them with an edge to the
corresponding values and other schema-level elements
they are found together with (e.g. an edge between
attribute names of the same relation). An alternative
approach would be to incorporate schema information
inside the corresponding data values; this would also
help distinguish same data values with different
contextual meaning.
• Capturing Entries. Semi-structured datasets may
also contain comments and system-generated messages,
which we might not want to take into consideration
when transforming to text. In order to extract e.g.,
log entries from the files, we could leverage approaches
such as Datamaran [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], given some essential
assumptions on the format of a file.
      </p>
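<p>The primary-key hub alternative from the first challenge can be sketched analogously; again a hypothetical helper, assuming the position of the key attribute in each row is known:</p>
      <preformat>
```python
def tuple_to_hub(graph, row, key_index=0):
    """Connect only the primary-key value to every other value in the row,
    so the key acts as a hub instead of materializing a full clique."""
    key = row[key_index]
    for i, value in enumerate(row):
        if i != key_index:
            graph.setdefault(key, set()).add(value)
            graph.setdefault(value, set()).add(key)
    return graph

g = {}
tuple_to_hub(g, ["1", "Snatch", "Guy Ritchie", "2000", "$7.000.000"])
# 4 edges instead of the 10 a clique would need for a 5-attribute row.
print(len(g["1"]))
```
      </preformat>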
      <p>Creating Documents from the Graph</p>
      <p>In order to train neural networks to produce vector
representations for the data elements in the input, we need to
create documents that provide some context. Towards this
direction, we are going to use the graph of the previous step
and construct documents through its traversal.</p>
      <p>
        Based on [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we propose for each node in the graph to
perform a specified number of random walks of a given length,
to explore diverse neighborhoods. In this fashion, each such
random walk will represent a sequence of data values and
will provide a different context for each of them. This way,
we are able to provide a lot of useful context for each data
element in the input, and bypass the shortcomings of syntax
absence in non-textual data. The output of this stage
consists of a number of such documents containing tokenized
data values with respect to the random walk sequences.
      </p>
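<p>The random-walk document creation can be sketched as follows; the walk count and length below are illustrative parameters, not values fixed by our approach:</p>
      <preformat>
```python
import random

def random_walk_documents(graph, walks_per_node=2, walk_length=4, seed=42):
    """For each node, perform fixed-length uniform random walks over the
    non-directed graph; each walk becomes one tokenized 'document'."""
    rng = random.Random(seed)
    documents = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) != walk_length:
                neighbors = sorted(graph[walk[-1]])
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            documents.append(walk)
    return documents

# A tiny hub-shaped graph: the key "1" connects "Snatch" and "2000".
g = {"1": {"Snatch", "2000"}, "Snatch": {"1"}, "2000": {"1"}}
docs = random_walk_documents(g)
print(len(docs))  # one document per (node, walk) pair
```
      </preformat>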
<p>Producing Embeddings from Documents</p>
      <p>
        The idea of creating similar representations for words that
appear in the same context has its roots in the
distributional hypothesis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which states that such words tend to
have a similar meaning. The recent progress made in
neural networks facilitated the introduction of distributed
representations called embeddings, which relate words to
vectors of a given dimensionality. Towards this direction,
numerous word embedding methods have been proposed in
the literature, with the most popular ones being Word2Vec
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and fastText [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] which even produces
character embeddings, making it possible to deal with
out-of-vocabulary words. Apart from single word embeddings,
there exist methods that try to produce vector
representations for sentences or even paragraphs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The authors
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose the idea of coherent groups for incorporating
single word embeddings into a similarity measure between
two groups of words.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors introduce two approaches for
composing distributed representations of relational tuples. The
simplest one suggests that a tuple embedding is a concatenation
of the embeddings of its attribute instances. They then
propose using Recurrent Neural Networks (RNNs) with Long
Short Term Memory (LSTM) in order to produce tuple
embeddings out of single word ones, by taking into
consideration the relationship and order between di↵erent attribute
values inside a tuple. The authors also propose a method
for handling unknown words, called vocabulary retrofitting,
and two alternatives for learning embeddings when the data
domain is too technical: i) assuming tuples are documents,
and ii) training on a corpus of text with related semantics.
We avoid these issues by presenting a general framework for
producing word embeddings, by transforming data into a
collection of documents comprising related data values,
which could serve as training input.
      </p>
      <p>
        Training DaaL Embeddings. After transforming data
into documents using the methods described previously, we
make use of this newly crafted knowledge by training
neural networks to produce vector representations that capture
context and semantics. Towards this direction, we feed the
tokenized documents in Word2Vec [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or fastText [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to
receive individual word embeddings. When tuning the
parameters of these methods, we need to pay attention to the
window size around each word, which determines to what
extent we take the surroundings into consideration when
outputting the word embedding. In our case, we do not need a large
window size, since we create many different contexts for
individual data elements by using random walks to create
the documents; thus, a smaller window size will guarantee
accurate and more distinct vector representations.
      </p>
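<p>We do not reproduce Word2Vec training here, but the role of the window size can be illustrated with the windowed co-occurrence counts that such models are trained on; the fragment below is a simplified stand-in for the actual neural training:</p>
      <preformat>
```python
from collections import Counter

def context_pairs(document, window):
    """Count co-occurring token pairs within a sliding window: the raw
    signal that window-based embedding models such as Word2Vec learn from."""
    pairs = Counter()
    for i, token in enumerate(document):
        for j in range(i + 1, min(i + 1 + window, len(document))):
            pairs[frozenset((token, document[j]))] += 1
    return pairs

doc = ["1", "Snatch", "Guy Ritchie", "2000", "$7.000.000"]
wide = context_pairs(doc, window=4)
narrow = context_pairs(doc, window=1)
# A small window keeps only directly adjacent values as context.
print(len(narrow), len(wide))
```
      </preformat>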
    </sec>
    <sec id="sec-4">
      <title>Finding Semantic Relationships</title>
      <p>
        There has been a lot of prior work on capturing
relationships between different datasets, or as previously defined,
Schema Matching [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Data Tamer [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] has a Schema
Integration module which allows each ingested attribute to be
matched against a collection of existing ones, by making use
of a variety of algorithms called experts. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] the authors
focus on building Knowledge Graphs, where different datasets
are correlated with respect to their content or schemata. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
uses a Linkage Graph in order to support data discovery.
All of these methods rely only on syntax, based on
similarity computation (e.g. Jaccard Similarity) between pairs
of column signatures [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, provenance of datasets
that are used within Google is explored in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] by looking
into production logs.
      </p>
      <p>
        In an attempt to avoid considering only syntax, the
authors in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] propose an alternative algorithm for matching
attributes of several relations, by clustering them based on
the distribution of their values, whereas in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] matching is
performed with respect to a corpus of existing schema
mappings, which serve as training samples for different training
modules called base learners. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] tries to build relationships
between relations and columns of different databases with
respect to a given ontology, by making use of both
semantics and syntax; yet they avoid data instances. Therefore,
DaaL is the first method to incorporate schema information
and data instances to capture relationships between data
elements with respect to the underlying meaning they share.
Our Approach. The trained embeddings could facilitate
discovery of novel relationship types, other than just finding
attributes that relate to the same entity. This is due to the
fact that vector representations of attribute (field) values are
affected by their context inside their relation (entry), which
leads to capturing relationships with attributes (fields) of
other relations (entries) that share the same context.
      </p>
      <p>However, in order to materialize a relationship between
two data elements (e.g. relational tuples, columns of
relations) we need to devise a method that discovers it. One
alternative could be calculating similarities between the vector
representations of data, and if they are above a given
threshold, signal a potential relationship between them. In
addition, clustering similar embeddings could facilitate finding
groups of semantically related data elements. Nonetheless,
the challenge of proposing methods that take advantage of
embeddings for accurately capturing semantic relevance is
very demanding and open, since we want embeddings to be
applicable to any given scenario.</p>
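<p>The threshold-based alternative can be sketched directly; the toy 3-dimensional embeddings and the 0.9 threshold below are purely illustrative, standing in for vectors produced by the trained model:</p>
      <preformat>
```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related(embeddings, threshold=0.9):
    """Signal a potential semantic relationship for every pair of data
    elements whose embedding similarity exceeds the threshold."""
    names = sorted(embeddings)
    return [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:]
            if cosine(embeddings[a], embeddings[b]) > threshold]

emb = {
    "Movies.Director": [0.9, 0.1, 0.0],
    "Films.DirectedBy": [0.8, 0.2, 0.1],
    "Users.IP": [0.0, 0.1, 0.9],
}
print(related(emb))
```
      </preformat>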
    </sec>
    <sec id="sec-5">
      <title>APPLICATIONS</title>
      <p>
        The output of the DaaL pipeline comprises the initial
datasets together with numerous relationships among them.
Thus, a Data Integration task could use this valuable
information to explore any meaningful data correlation stemming
from the various data sources. In addition, our proposed
approach is the first to deal with finding correspondences
among semi-structured files, without having to transform
them into some structured format, which is something that
the authors in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] deal with.
      </p>
      <p>Interestingly enough, DaaL gives also the opportunity to
augment information available in a specific schema. For
instance, we can enhance a relational table with extra
attributes, found in other schemata, containing data values
which have similar representations with the respective ones
of the table. Most importantly, this could be done before
getting the results of the time-costly matching module, since
we can immediately take advantage of the information we
get from the embeddings (e.g., the most similar data
elements to a given one in vector space).</p>
    </sec>
    <sec id="sec-6">
      <title>RESEARCH PLAN</title>
      <p>Experimental Evaluation Framework. We are currently
creating a unified evaluation framework for comparing the
schema matching methods proposed in the literature
throughout the years. We aim at developing an open source
framework for experimenting on Schema Matching and comparing
the DaaL approach with previous ones based on a
standardized set of evaluation techniques.</p>
      <p>Implementing DaaL. The DaaL pipeline comprises four
stages, namely i) transformation of data into a graph, ii)
creation of documents that serve as input to the DNNs, iii)
training DNNs and creating embeddings and subsequently
iv) using those embeddings. Each of those stages poses its
own challenges. We aim at building a streamlined system
to enable plugging in and experimenting with different
techniques to generate text from data, to train vector
representations and to create embeddings which we can leverage in
different ways for data integration and discovery.</p>
      <p>Humans in the Loop. Considering that data integration
cannot be fully automated, we aim at developing a system
which can assist and collaborate with human experts to
perform large-scale data integration and discovery with
minimal human intervention. In particular, users will naturally
interact with the system and suggest suitable refinements
to resolve uncertainties. For instance, humans can provide
existing known relationships, or evaluate the quality of
recommended matches detected by the system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>VLDBJ</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>557</fpage>
          -
          <lpage>581</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>TACL</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , et al.
          <article-title>The data civilizer system</article-title>
          .
          <source>In CIDR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
and
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Learning to match the schemas of data sources: A multistrategy approach</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>50</volume>
          (
          <issue>3</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>301</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          , et al.
          <article-title>Distributed representations of tuples for entity resolution</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>11</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1454</fpage>
          -
          <lpage>1467</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , et al.
          <article-title>Aurum: A data discovery system</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>1001</fpage>
          -
          <lpage>1012</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Qahtan</surname>
          </string-name>
          , et al.
          <article-title>Seeping semantics: Linking datasets using word embeddings for data discovery</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>989</fpage>
          -
          <lpage>1000</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
and
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          .
          <article-title>Navigating the data lake with datamaran: Automatically extracting structure from log datasets</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>943</fpage>
          -
          <lpage>958</lpage>
          . ACM,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
.
          <article-title>node2vec: Scalable feature learning for networks</article-title>
          .
          <source>In SIGKDD</source>
          , pages
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Korn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          , et al.
          <article-title>Goods: Organizing google's datasets</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>795</fpage>
          -
          <lpage>806</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <article-title>Distributional structure</article-title>
          .
          <source>Word</source>
          ,
          <volume>10</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
.
          <source>In ICML</source>
          , pages
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
.
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>A survey of approaches to automatic schema matching</article-title>
          .
          <source>VLDBJ</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ):
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bruckner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , et al.
          <article-title>Data curation at scale: The data tamer system</article-title>
          .
          <source>In CIDR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hadjieleftheriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Ooi</surname>
          </string-name>
          , et al.
          <article-title>Automatic discovery of attributes in relational databases</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>120</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>