<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Foreign Key Constraint Identification in Relational Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Motl</string-name>
          <email>jan.motl@fit.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Kordík</string-name>
          <email>pavel.kordik@fit.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Thákurova 9, 160 00 Praha 6</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>106</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>For relational learning, it is important to know the relationships between the tables. In relational databases, the relationships can be described with foreign key constraints. However, the foreign keys may not be explicitly specified. In this article, we present how to automatically and quickly identify primary &amp; foreign key constraints from metadata about the data. Our method was evaluated on 72 databases and has F-measure of 0.87 for foreign key constraint identification. The proposed method significantly outperforms in runtime related methods reported in the literature and is database vendor agnostic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Whenever we want to build a predictive model on relational
data, we have to be able to connect individual tables
together [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In Structured Query Language (SQL) databases,
the relationships (connections) between the tables can be
defined with foreign key (FK) constraints. However, FK
constraints are not always available. This can happen, for
example, whenever we work with legacy databases or data
sources, like comma separated value (CSV) files.
      </p>
      <p>
        Identification of relationships from database belongs to
reverse engineering from databases [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and can be done
manually or by means of handcrafted rules [
        <xref ref-type="bibr" rid="ref17 ref2 ref3 ref7">2, 3, 7, 17</xref>
        ].
Manual FK constraint discovery is very time-consuming
for complex databases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. And handcrafted systems may
overfit to small collections of databases, used for the
training. Therefore we use machine learning techniques for this
task and evaluate them on a collection of 72 databases.
      </p>
      <p>Unfortunately, FK constraint identification is difficult. If
we have n columns in a database, then there can be n2 FK
constraints, as each column can reference any column in
the database, including itself1. Hence, there is n2 candidate
FK constraints.</p>
      <p>Example 1. If we have a medium-sized database with 100
tables, each with 100 columns, then we have to consider
108 candidate FK constraints.</p>
      <p>
        We can evaluate probability p that a single candidate FK
constraint is a FK constraint with a classifier (e.g.
logistic regression) in a constant time. Hence, if we assumed
that the probability pi, j, which denotes a probability that a
column i references column j, is independent of all other
candidate FK constraints in the database, the computational
complexity of FK constraint identification would beO(n2).
However, the probabilities do not appear to be independent.
Example 2. If we had two columns A, B and we had known
that pA,B = 0.9 and pB,A = 0.8 then under assumption of
independence it would be reasonable to predict that column
A references column B and also that column B references
column A. However, directed cyclic references2 do not
generally appear in the databases as it would make updates
inconveniently difficult [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Hence, our example database
most likely contains only one FK constraint with A
referencing B.
      </p>
      <p>If we accepted that the FK constraints are not
independent of each other, we could generate each possible
combination of FK constraints and calculate the probability that
the candidate combination of FK constraints is the true
combination of FK constraints. The computational
complexity of such algorithm is O(2n2 ). Clearly, a practical
algorithm must take simplifying assumptions to scale to
complex databases.</p>
      <p>
        The applications of the FK constraint discovery, besides
relational learning, include data quality assessment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
database refactoring [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The paper is structured as follows: first, we describe
related work, then we describe our method, then we describe
our experiments and their results, discuss the results and
provide a conclusion.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Li et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] formulated a related problem, attribute
correspondence identification, as a classification problem.
      </p>
      <p>
        Rostin et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] formulate FK constraint identification
as a classification problem.
      </p>
      <p>
        Meurice et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] compared different data sources for
the FK constraint identification: database schema,
Hibernate XML files, JPA Java code and SQL code. Based on
their analysis, the database schema data source has four
times higher recall than any other data source. In this
article, we focus solely on the database schema data source.
Furthermore, they introduce 4 rules for filtering the
candidate FK constraints: the “likeliness” of the candidate FK
constraint must be above a threshold, the FK constraints
cannot be bidirectional, the column(s) of the selected FK
1We have not observed any instance of a column referencing itself.
Nevertheless, SQL standard does not forbid it.
      </p>
      <p>2However, undirected cyclic references are commonly used, for
example, to model hierarchies.
constraints can be used only once and there can be only a
single (undirected) path from FK constraints between any
two tables.</p>
      <p>
        Chen et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describe how to significantly
accelerate FK constraint identification by pruning unpromising
candidates at multiple levels. We inspire from them and
use multi-level architecture as well. Furthermore, they
introduce 4 rules for filtering the candidate FK constraints:
explore FK constraints only between the tables selected
by the user, only a single FK constraint can exist between
two tables, directed cycles from FK constraints are
forbidden and there can be only a single (undirected) path from
FK constraints between any two tables. We inspire from
Meurice’s and Chen’s articles and reformulate their rules
as integer linear programming (ILP) problem.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>To make the relationship identification fast, a predictive
model was trained only on the metadata about the data,
which are accessible with Java Database Connectivity
(JDBC) API. This approach has the following properties:
1. It is fast and scalable.
2. It is database vendor agnostic.
3. It is not affected by the data quality.</p>
      <p>The problem of relationship identification was
decomposed into two subproblems: identification ofprimary keys
(PKs) and identification of FK constraints (Figure 1). The
reasoning behind this decomposition is that identification
of PKs is a relatively easy task. And knowledge of PKs
simplifies identification of FK constraints because FK
constraints frequently reference PKs3.</p>
      <p>The identification of the PKs is performed in two stages:
scoring and optimization. During the scoring phase, a
probability that an attribute is a part of a PK (a PK can be
compound — composed of multiple attributes) is predicted
with a classifier. The probability estimates are then passed
into the optimization stage, which delivers a binary
prediction.</p>
      <p>The same approach is taken for FK constraint
identification. During the scoring phase, a probability that a
candidate FK constraint is a FK constraint is estimated with
a classifier. The probabilities are then passed into an
optimizer, which returns the most likely FK constraints.
3.1</p>
      <sec id="sec-3-1">
        <title>Primary Key Scoring</title>
        <p>All metadata that are exposed by JDBC4 about attributes
(as obtained with getColumns method) and tables (as
obtained with getTables) were collected and considered as
features for classification. For brevity, we describe and
justify only features used by the final model.</p>
        <p>3A FK may reference any attribute that is unique, not only PKs.
Nevertheless, all FKs in the analyzed databases reference PKs.</p>
        <p>4See docs.oracle.com for the documentation.</p>
        <p>Primary key</p>
        <p>scoring
Primary key
optimization
Relationship</p>
        <p>scoring
Relationship
optimization</p>
        <p>List of all the columns
in the database</p>
        <p>Gradient boosted trees
Probabilities that the columns
are in some PK</p>
        <p>Integer linear programming
List of PKs,
List of candidate FK constraints</p>
        <p>Gradient boosted trees
Probabilities that the candidates
are FK constraints</p>
        <p>Integer linear programming</p>
        <p>List of foreign key constraints</p>
        <p>Data types like integer or char are generally more likely
to be PKs than, for example, double or text. To promote
portability of the trained model, we do not use database
data types but JDBC data types, which have the advantage
that they are the same regardless of the database vendor.</p>
        <p>Since some databases offer only a single data type for
numerical attributes, we also note whether numerical
attributes can contain decimals, as PKs are unlikely to
contain decimal numbers.</p>
        <p>Doppelgänger is an attribute, which has a name
similar to another attribute in the same table. For example,
atom_id1 is a doppelgänger to atom_id2. Doppelgängers
frequently share properties, i.e. either both of them are in
the PK or neither of them is in the PK.</p>
        <p>Ordinal position defines the position of the attribute in
the table. PKs are frequently at the beginning of the tables.</p>
        <p>
          String distance between the column and table names
are helpful for identification of PKs and FKs. Opinions on
the best measure for PK and FK constraint identification
vary. For example, [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] uses exact match while [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] uses
Jaro-Winkler distance. After testing all string measures
available in stringdist package [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we found that
Levenshtein distance delivers the best discriminative power on
the tested databases.
        </p>
        <p>Keywords like id or pk frequently mark PKs. The
presence of the keywords is analyzed after the attribute/table
name tokenization, which works with camel case and snake
case notation.</p>
        <p>JDBC also provides attributes that leak information
about the presence of the PK, like isNullable,
isAutoIncrement and isGeneratedColumn. Since it is unreasonable to
PK distributions) were not explored as the focus of the
article is on the metadata-based features.
3.4</p>
      </sec>
      <sec id="sec-3-2">
        <title>Foreign Key Optimization</title>
        <p>The optimization can be formulated as an integer linear
optimization problem on a directed graph G = (V, E), where
V is the set of attributes in the database and E is the set
of candidate FK constraints. The pi j is the estimated
probability that the candidate FK constraint referencing FK i
to PK j is a FK constraint. The probabilities are estimated
with a classification model trained on features described
in section 3.3. Compound FKs are modeled as multiple
FK constraints (one FK constraint for each attribute). We
define variablexi j:
(1 if the candidate FK constraint is a FK constraint
0 otherwise
xi j ≤ |S| − 1,
∀S ⊆ V, |S| ≥ 1,
assume that these metadata would be set correctly after
importing data from CSV files, they were excluded from the
model.</p>
        <p>
          For comparison to features extracted from the data
(and not metadata), two additional features were extracted:
whether the attribute contains nulls (containsNull) and
whether the attribute contains unique values (isUnique).
These features are generally expensive to calculate [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Nevertheless, some databases, like PostgreSQL,
automatically generate these statistics for each attribute in the
database and provide a vendor-specific access to these
statistics.
3.2
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Primary Key Optimization</title>
        <p>Since each table in a well-designed database should
contain a PK, a single most likely PK is identified for each
table in the database. If the single most likely PK attribute
in a table is a doppelgänger, all its doppelgängers in the
table are declared to be part of the PK as well, creating a
compound key. The described optimization can be solved
with an ILP solver, which we use mostly because foreign
key optimization (section 3.4) is using ILP formulation as
well.
3.3</p>
      </sec>
      <sec id="sec-3-4">
        <title>Foreign Key Scoring</title>
        <p>Features for FK constraints are a combination of features
calculated for individual attributes from section 3.1
(prefixed withfk and pk respectively) with features unique for
the FK constraints. The description of the unique features
follows.</p>
        <p>Data types between FK and PK attributes should match.
Nevertheless, SQL permits FK constraints between char
and varchar data types.</p>
        <p>Data lengths between FK and PK attributes should
match. Nevertheless, SQL explicitly permits FK
constraints between attributes of different character lengths
as defined in the SQL-92 specification, section 8.2.</p>
        <p>String distance between FK column name and PK
table name should be small because FK column names
frequently embed a part of the PK table name. Similarly, FK
column name should be similar to PK column name
because FK column names frequently embed a part of the PK
column name. On the other end, FK column name should
generally differ from FK table name as they are not directly
related.</p>
        <p>
          Furthermore, to be able to compare metadata-based
features to data-based features, we tested whether all non-null
values in the FK are present in the PK
(satisfiesFKConstraint). This is generally an expensive feature to
calculate [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Nevertheless, some databases, like PostgreSQL,
automatically calculate histograms for each attribute in the
background and offer a vendor specific interface to access
the histograms. And based on the range of the histograms
many candidate FK constraints can be pruned. More
advanced data-based features (e.g. similarity of the FK and
xi j =
max
        </p>
        <p>x
s.t.</p>
        <p>The optimization problem is then:</p>
        <p>∑ xi j − 2 ∑ xi j(1 − pi j)
[i, j]∈E [i, j]∈E
∑ xi j ≤ 1,
j∈V</p>
        <p>∀i,
∑
i∈S, j∈S,[i, j]∈E
xi1 j1 − xi2 j2 = 0,</p>
        <p>xi1 j − xi2 j = 0,
xi j ∈ {0, 1},
∀P, F ⊆ V, i ∈ F, j ∈ P, |P| ≥ 2,</p>
        <p>(2d)
∀D ⊆ V, i ∈ D, j ∈ V,
∀i ∈ V, j ∈ V.</p>
        <p>The objective function defines all FK constraint
candidates xi j with pi j &gt; 0.5 as FK constraints if it does not
violate any of the following constraints.</p>
        <p>Unity constraint 2b enforces that a FK can reference
only a single PK. While a single FK can in theory reference
multiple different PKs, no such occurrence appeared in the
analyzed databases.</p>
        <p>
          Acyclicity constraint 2c ensures that the graph is
(directionally) acyclic. However, this formulation of acyclicity
requires an exponential number of constraints. To deal with
that, we generate acyclicity constraints lazily [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Acyclicity constraint is desirable because if pi j is high, p ji is
generally high as well (particularly for i = j). But directed
cycles (even over intermediate tables) do not appear in the
analyzed databases.
        </p>
        <p>Completeness constraint 2d says that if a PK is
compound, then either all or neither attribute of the PK P is
referenced from the FK table by attributes F. Completeness
constraint ensures that compound FKs are syntactically
correct.
(1)
(2a)
(2b)
(2c)
(2e)
(2f)</p>
        <p>Doppelgänger constraint 2e says that if attributes are
doppelgängers to each other, then either all or neither
attribute from the doppelgänger set D reference the (same)
PK attribute.</p>
        <p>Constraint 2f defines the problem as an integer
programming problem.</p>
        <p>
          It should be noted that if constraints 2d and 2e are
removed, we get an optimization problem similar to the
identification of minimum spanning tree in a graph [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Hence,
the FKs can be efficiently optimized with Dijkstra
algorithm with a modified termination condition (the algorithm
terminates once the objective function starts to increase).
4
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>4.1</p>
      <sec id="sec-4-1">
        <title>Data</title>
        <p>Following paragraphs describe an empirical comparison
of 5 classifies on 3 different sets of features from 72
databases.</p>
        <p>
          We used all 72 relational databases from relational
repository [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The databases range from classical relational
benchmarking databases (like TPC-C or TPC-H) to
realworld databases used in challenges (e.g. from PKDD in
1999 or from Predictive Toxicology Challenge in 2000).
The collection of the databases contains in total 1343 PKs,
1283 FK constraints, 6129 attributes and 788 tables. That
means that on average approximately 1 of 5 attributes is
part of a PK. The count of all possible relationships is
1,232,392 (in theory, a FK can reference any attribute in
the database, including itself). That means that on average
approximately 1 of 960 tested relationships are FK
constraints.
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Algorithm</title>
        <p>Following classification algorithms were tested on the
problem: decision tree, gradient boosted trees, naive Bayes,
neural network and logistic regression as implemented in
RapidMiner 7.5. Since the best results were obtained with
gradient boosted trees, the reported results are for gradient
boosted trees.
4.3</p>
      </sec>
      <sec id="sec-4-3">
        <title>Measure</title>
        <p>
          For evaluation of the classification models, AUC and
Fmeasure [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] were used. Classification accuracy was
omitted due to a significant class imbalance in FK identification
task. AUC evaluates the ability of the model to rank. Hence,
AUC is used to evaluate the quality of scoring. On the other
end, F-measure is suitable for the evaluation of the quality
of thresholding. Hence, F-measure is used to evaluate the
quality of the optimization.
4.4
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Validation</title>
        <p>To measure the generalization of the models to new
unobserved databases, batch cross-validation over databases
[16, section 4.3] was performed. Since 72 databases were
analyzed, it means that each model was trained and
evaluated 72 times. The batch cross-validation has the
advantage, in comparison to 10-fold cross-validation, that the
samples from a single database are either all in the
training set or all in the testing set. Hence, if the samples from
a single database are more similar to each other than to
samples from other databases, we may expect that batch
cross-validation will deliver a less optimistically biased
estimate of the model accuracy on new unobserved databases
than 10-fold cross-validation.
4.5</p>
      </sec>
      <sec id="sec-4-5">
        <title>Feature Importance</title>
        <p>Generally, it is desirable to minimize the count of utilized
features to make the model easier to understand and
deploy. Table 1 depicts feature importance for PK
identification as reported by gradient boosted trees for features that
remained after backward selection.</p>
        <p>The single most important feature for PK identification
is the position of the attribute in the table. This is not so
surprising because all non-compound PKs in the analyzed
databases (with the exception of Hepatitis database) were
the first attribute in the table. Indeed, if we always predicted
that the first attribute in a table is a PK, we get F-measure
equal to 0.934±0.007.</p>
        <p>Table 2 lists feature importance for FK constraint
identification. Interestingly, the knowledge whether the FK
constraint is satisfiable is the least important feature from the
selected features.
The PK optimization improves F-measure of PK
identification from 0.845±0.069 to 0.875±0.057. While FK
optimization improves F-measure of FK constraint
identification from 0.743±0.020 to 0.870±0.022.
The time required to score all 72 databases is 55 seconds in
total, where 95% of the runtime is due to the fact that JDBC
collects metadata about the attributes for each table
individually, causing many round trips between the algorithm and
the database server. When we replaced JDBC calls with
a single query to information_schema, which provides all
the data at the database level, the total runtime decreased
to 5 seconds.</p>
        <p>
          Furthermore, the algorithm was tested on our university
database with 909 tables. The runtime was 18 minutes, due
to the quadratic growth of candidate FK constraints with
the count of attributes in the database [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. To keep the
memory requirements manageable, FK candidates were scored
on the fly and only the top n FK candidates with the highest
probability were kept in a heap for FK optimization.
5
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Table 3 depicts a comparison of our approach to
different approaches in the literature. Since the
implementations of the referenced approaches are not available, we
take and report the measurements for the biggest
common denominator of the evaluated databases — the TPC-H
database. The approaches differ in the utilized features
(e.g. Kruse et al. utilize SQL scripts, while our approach
does not) and objectives (e.g. Chen et al. aim to
maximize precision at the expense of recall). The results of our
method for all 72 databases are available for download at
https://github.com/janmotl/linkifier.</p>
      <p>
        Empirical comparison of our metadata-based approach
to other metadata-based approaches is in Table 4. Oracle
Data Modeler [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] estimates FK constraints based on the
knowledge of PKs (it is assumed that a FK must reference
a PK), equality of column names between the FK and the
PK and equality of the data types between the FK and the
PK. SchemaCrawler [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is using an extended version of
these three filters. SchemaCrawler assumes that a FK must
reference either a PK or a column with a unique constraint.
The column names must equal but differences in the
presence/absence of id keyword and differences between
singular and plural forms are ignored, improving the recall.
And datatypes must equal including their length (except of
varchar datatype), improving the precission.
5.1
      </p>
      <sec id="sec-5-1">
        <title>Limitations</title>
        <p>The metadata-based identification of PK and FK
constraints is limited by the quality of the metadata. For
example, if all the columns in the database had non-informative
names and all the columns were typed as text, the accuracy
of the predictions would suffer.</p>
        <p>But even if the metadata are of hight quality, our
metadata-base approach is not able to reliably reconstruct
a hierarchy of subclasses. The problem is illustrated in
Figure 2. Based on the table and PK names, we can correctly
infer that Person and Vendor are subclasses of
BusinessEntity. However, our metadata-based method has no means
how to infer that Customer and Employee are subclasses of
Person and not directly of BusinessEntity.</p>
        <p>Both these limitations can be addressed by extending the
metadata-based approach by appropriate data-based
features. For example, whenever a subclass can reference
multiple superclasses, the superclass with the lowest row count,
which still satisfies the FK constraint, should be selected.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>
        We described a method for foreign key constraint
identification, which does not put any assumptions on the
schema normalization, data quality, availability of
vendor specific metadata or human interaction. The code for
primary &amp; foreign key constraint identification was
designed to be database vendor agnostic and was successfully
tested against Microsoft SQL Server, MySQL, PostgreSQL
and Oracle. The code with a graphical user interface is
Zhang et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
Chen et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
Rostin et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
Data
Data, Metadata
Data, Metadata
Metadata
published at GitHub (https://github.com/janmotl/
linkifier) under BSD license. The runtime is dominated
by the connection lag to the server and if the requirement
on the code portability is lifted, we are able to predict
primary &amp; foreign key constraints for all 72 tested databases
in 5 seconds.
7
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Aleš Fišer, Jan Kukacˇka, Jirˇí
Kukacˇka, Manuel Muñoz and Batal Thibaut for their
help. We furthermore thank the anonymous reviewers,
their comments helped to improve this paper. This
research was partially supported by the Grant Agency
of the Czech Technical University in Prague, grant No.
SGS17/210/OHK3/3T/18.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ziawasch</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , Lukasz Golab, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>VLDB J</source>
          .,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>557</fpage>
          -
          <lpage>581</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>John</surname>
            <given-names>G. Bennett</given-names>
          </string-name>
          , Perry A.
          <string-name>
            <surname>Gee</surname>
            , and
            <given-names>Charles E.</given-names>
          </string-name>
          <string-name>
            <surname>Gayraud</surname>
          </string-name>
          .
          <article-title>System and Methods Including Automatic Linking of Tables for Improved Relational Database Modeling with Interface</article-title>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zhimin</given-names>
            <surname>Chen</surname>
          </string-name>
          , Vivek Narasayya, and
          <string-name>
            <given-names>Surajit</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel</article-title>
          .
          <source>VLDB Endow</source>
          .,
          <volume>7</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1417</fpage>
          -
          <lpage>1428</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sualeh</given-names>
            <surname>Fatehi. SchemaCrawler</surname>
          </string-name>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Fawcett</surname>
          </string-name>
          .
          <article-title>An introduction to ROC analysis</article-title>
          .
          <source>Pattern Recognit. Lett.</source>
          ,
          <volume>27</volume>
          (
          <issue>8</issue>
          ):
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          , jun
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Dorit</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hochbaum</surname>
          </string-name>
          .
          <source>Integer Programming and Combinatorial Optimization. IEOR269 notes</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Kruse</surname>
          </string-name>
          , Thorsten Papenbrock, Hazar Harmouch, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data Anamnesis: Admitting Raw Data into an Organization</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          , pages
          <fpage>8</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Wen</given-names>
            <surname>Syan Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Clifton</surname>
          </string-name>
          .
          <article-title>SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks</article-title>
          .
          <source>Data Knowl. Eng.</source>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Nick</given-names>
            <surname>Logan</surname>
          </string-name>
          .
          <article-title>Package stringdist</article-title>
          .
          <source>Technical report, CRAN</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Victor</given-names>
            <surname>Markowitz</surname>
          </string-name>
          .
          <article-title>Safe referential integrity and null constraint structures in relational databases</article-title>
          .
          <source>Inf. Syst.</source>
          ,
          <volume>19</volume>
          (
          <issue>4</issue>
          ):
          <fpage>359</fpage>
          -
          <lpage>378</lpage>
          , jun
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Loup</surname>
            <given-names>Meurice</given-names>
          </string-name>
          , Fco Javier Bermúdez Ruiz,
          <string-name>
            <surname>Jens H. Weber</surname>
            , and
            <given-names>Anthony</given-names>
          </string-name>
          <string-name>
            <surname>Cleve</surname>
          </string-name>
          .
          <article-title>Establishing referential integrity in legacy information systems -</article-title>
          <source>Reality bites! Proc. - 30th Int. Conf. Softw. Maint. Evol. ICSME</source>
          <year>2014</year>
          , pages
          <fpage>461</fpage>
          -
          <lpage>465</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Motl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Oliver</given-names>
            <surname>Schulte</surname>
          </string-name>
          .
          <source>The CTU Prague Relational Learning Repository</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Oracle. Oracle SQL Developer Data</surname>
            <given-names>Modeler</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lurdes Pedro de Jesus and Pedro Sousa</surname>
          </string-name>
          .
          <article-title>Selection of Reverse Engineering Methods for Relational Databases</article-title>
          .
          <source>Proc. Third Eur. Conf. Softw. Maint.</source>
          , pages
          <fpage>194</fpage>
          -
          <lpage>197</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Pferschy</surname>
          </string-name>
          and
          <article-title>Rostislav Staneˇk</article-title>
          .
          <article-title>Generating subtour elimination constraints for the TSP from pure integer solutions</article-title>
          .
          <source>Cent. Eur. J. Oper. Res.</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Alexandra</surname>
            <given-names>Rostin</given-names>
          </string-name>
          , Oliver Albrecht, Jana Bauckmann, Felix Naumann, and
          <string-name>
            <surname>Ulf Leser.</surname>
          </string-name>
          <article-title>A machine learning approach to foreign key discovery</article-title>
          .
          <source>12th Int. Work. Web Databases (WebDB)</source>
          , Provid. Rhode Isl., (WebDB):
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Meihui</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Marios Hadjieleftheriou, Beng Chin Ooi,
          <string-name>
            <surname>Cecilia M. Procopiuc</surname>
            , and
            <given-names>Divesh</given-names>
          </string-name>
          <string-name>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>On multi-column foreign key discovery</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>3</volume>
          (
          <issue>1</issue>
          -2):
          <fpage>805</fpage>
          -
          <lpage>814</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>