Evaluating Approaches for Supervised Semantic Labeling∗

Nataliia Rümmele (Siemens, Germany, nataliia.ruemmele@siemens.com)
Yuriy Tyshetskiy (Data61, CSIRO, Australia, yuriy.tyshetskiy@data61.csiro.au)
Alex Collins (Data61, CSIRO, Australia, alex.collins@data61.csiro.au)

∗ Work accomplished at Data61, CSIRO.
TheWebConf Workshop: Linked Data on the Web (LDOW) 2018, Lyon, France. © 2018 Copyright held by the owner/author(s).

ABSTRACT

Relational data sources are still one of the most popular ways to store enterprise or Web data; however, the relational schema lacks a well-defined semantic description. A common ontology provides a way to represent the meaning of a relational schema and can facilitate the integration of heterogeneous data sources within a domain. Semantic labeling is achieved by mapping attributes from the data sources to the classes and properties in the ontology. We formulate this problem as a multi-class classification problem where previously labeled data sources are used to learn rules for labeling new data sources. The majority of existing approaches for semantic labeling have focused on data integration challenges such as naming conflicts and semantic heterogeneity. In addition, machine learning approaches typically have issues around class imbalance, lack of labeled instances and the relative importance of attributes. To address these issues, we develop a new machine learning model with engineered features as well as two deep learning models which do not require extensive feature engineering. We evaluate our new approaches against the state-of-the-art.

CCS Concepts

• Information systems → Data extraction and integration; Mediators and data integration; • Computing methodologies → Supervised learning by classification;

Keywords

data integration, schema matching, semantic labeling, ontology, relational schema, bagging
1. INTRODUCTION

An important problem in database research is determining how to combine multiple data sources that are described by different (heterogeneous) schemata [6]. The expected outcome of such a process is a uniform integrated view across these data sources. Relational data sources are still one of the most popular ways to store enterprise or Web data [20]. However, the relational schema lacks a well-defined semantic description. To define the semantics of data, we can introduce an ontology [20]. Our goal is then to map attributes from relational data sources to classes and properties in an ontology. We refer to this problem as semantic labeling.

Semantic labeling plays an important role in data integration [6, 14], augmenting existing knowledge bases [9, 17, 18, 23] and mapping relational sources to ontologies [15, 22]. Various approaches to automate semantic labeling have been developed, including DSL [14] and T2K [17]. Automated semantic labeling techniques typically encounter several problems. Firstly, there can be naming conflicts [15], including cases where users represent the same data in different ways. Secondly, semantically different attributes might have syntactically similar content, for example, birth date versus date of death. Thirdly, a considerable number of attributes do not have any corresponding property in the ontology, either by accident or on purpose. The majority of existing systems focus on the first two problems, but do not consider the third problem during evaluation [18, 14].

To address the challenges of automated semantic labeling, we formulate this task as a supervised classification problem. A set of semantic labels known to the classifier is specified at training time, e.g., from the provided domain ontology. We also introduce a special class of attributes, called unknown. The purpose of the unknown class is to capture attributes which will not be mapped to the ontology. The training data for the classifier will thus consist of source attributes (name and content) and their semantic labels provided by the user, including the unknown labels. Since manually assigning labels to attributes is a costly operation, a lack of training data is a common problem for semantic labeling systems. Existing systems [14, 17, 23] use knowledge transfer techniques to overcome this issue. Instead, we introduce a sampling method similar to bagging for ensemble models [3]. The bagging technique allows us to generate multiple training instances from the user-labeled attributes, thus overcoming the lack of labeled training data. It also allows us to overcome the common issue of class imbalance, when some semantic labels have more support than others among the attributes. We achieve this by re-balancing the training data via preferential bagging from minority class attributes.

The main contributions of this paper are:

1. We introduce a bagging approach to handle class imbalance and the lack of training data by drawing random subsamples from the values of an attribute. This approach can achieve meaningful diversity in the training data and can increase the number of training instances for under-represented semantic labels.

2. We address the issue of "unwanted" attributes, i.e., attributes which do not get mapped to any element in the ontology. In cases where we have a sufficient amount of training data, our models can achieve over 80% Mean Reciprocal Rank (MRR) on two sets of data sources from our benchmark.

3. We construct a classification model DINT with hand-engineered semantic labeling features to implement the above. In addition, we design two deep learning models, CNN and MLP, which use very simple features, such as normalized character frequencies and padded character sequences extracted from the raw values of data attributes.

4. We construct a benchmark with a common evaluation strategy to compare different approaches for supervised semantic labeling. Our benchmark includes the models DINT, CNN, MLP and the state-of-the-art DSL [14], and 5 sets of data sources from different domains. We show that each approach has its strengths and shortcomings, and that choosing a particular semantic labeling system depends on the use case. We have released the implementation of the benchmark under an open source license¹. This benchmark can be easily extended to include other models and datasets, and can be used to choose the most appropriate model for a given use case.

¹ http://github.com/NICTA/serene-benchmark
2. PROBLEM

[Figure 1: Example ontology, containing classes such as Person, Organization, City, State and Place (with subclass links between them), properties such as name, birthDate, email, phone and postalCode, and relations such as worksFor, ceo, livesIn, bornIn, operatesIn and state.]

We illustrate the semantic labeling problem using a simple domain ontology shown in Fig. 1. Assume we have three data sources "personal-info", "businessInfo" and "Employees" (see Tab. 1) whose attributes we choose to label according to the example ontology (Fig. 1). We define a semantic label as a tuple consisting of a domain class and its property. For example, attribute name in the source "personal-info" (see Tab. 1a) is labeled with (Person,name). Note that semantic labels are fixed by the ontology.

The task of semantic labeling is defined as automatically assigning semantic labels to attributes in a data source. In the case of supervised semantic labeling, we use existing known semantic labels for data sources to improve the performance when assigning semantic labels to new sources. For example, assume we are given the sources "personal-info" and "businessInfo" with the correct semantic labels; the system should then automatically assign labels to attributes in the source "Employees".

Table 1: Example relational data sources with semantic labels.

(a) personal-info
name  | birthDate  | city     | state | workplace
Neil  | 21-05-1916 | Waterloo | NSW   | CSIRO
Mary  | 07-12-1990 | Eveleigh | NSW   | CSIRO
Henry | 15-03-2000 | Redfern  | NSW   | Data61
Semantic labels: (Person,name) | (Person,birthDate) | (City,name) | (State,name) | (Organization,name)

(b) Employees
employer | employee | DOB
CSIRO    | Neil     | 05/21/1916
Data61   | Mary     | 12/07/1990
NICTA    | Henry    | 03/15/2000
Semantic labels: (Organization,name) | (Person,name) | (Person,birthDate)

(c) businessInfo
company | ceo            | state                        | founded
CSIRO   | Larry Marshall | Australian Capital Territory | 21-05-1916
Data61  | Adrian Turner  | New South Wales              | 12-07-2016
NICTA   | Hugh Durrant   | New South Wales              | 15-03-2002
Semantic labels: (Organization,name) | (Person,name) | (State,name) | unknown

To build such a system, we cannot rely just on the names of the columns. For example, the columns name in (1a), ceo in (1c) and employee in (1b) all refer to the same property (Person,name). Using just the values of the columns is also problematic. For example, in (1a) acronyms are used for states, while in (1c) state names are fully written out. Furthermore, values can overlap for semantically heterogeneous columns, such as founded in (1c) and birthDate in (1a).

We can also have attributes that are not mapped to any property in the ontology. There might be two reasons for their existence: (1) we are not interested in the content of an attribute and want to discard it from any future analysis; (2) we might have overlooked an attribute when designing the ontology. We do not differentiate between these two cases and mark all such attributes as the unknown class, for example, founded in (1c). The presence of the unknown class makes the task of semantic labeling more complicated. Establishing approaches to efficiently handle such attributes is crucial, since in many real-world scenarios relational data sources (either HTML tables [17] or domain-specific data [14]) contain a considerable number of such attributes.

Machine learning techniques have proved to be efficient for building predictive models on noisy and messy data. Yet to apply these techniques we need to represent source attributes as feature vectors, with semantic labels (classes) attached to these vectors. In Table 2 we show such a representation for the source Employees. For simplicity, we explicitly show only 4 possible features. For example, mean string length is the mean length of the cell values of an attribute. However, the actual feature vector can be arbitrarily long, and the process of designing its components is known as feature engineering. In the next section we discuss the features used in the semantic labeling system.

Table 2: Feature vectors for data source Employees.

attribute | entropy | mean string length | ... | ratio alpha chars | ratio unique values | class
employer  | 1.001   | 5.333              | ... | 0.875             | 1                   | (Organization, name)
employee  | 1.461   | 13.333             | ... | 0.925             | 1                   | (Person, name)
DOB       | 0.69    | 10                 | ... | 0                 | 1                   | (Person, birthDate)
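To make the representation in Table 2 concrete, the following is a minimal Python sketch of how such simple features could be derived from one column. The feature names mirror Table 2, but the exact definitions used in our benchmark (and the entropy feature, discussed in Section 3.0.1) may differ, so the numbers are not guaranteed to match the table.

```python
def attribute_features(values):
    """Turn one column (a list of cell values) into a small feature vector
    in the spirit of Table 2: mean string length, ratio of alphabetic
    characters and ratio of unique values."""
    cells = [str(v) for v in values]
    text = "".join(cells)
    return {
        "mean_string_length": sum(len(c) for c in cells) / len(cells),
        "ratio_alpha_chars": sum(ch.isalpha() for ch in text) / len(text) if text else 0.0,
        "ratio_unique_values": len(set(cells)) / len(cells),
    }

# Example: the 'DOB' column of the Employees source (Table 1b)
print(attribute_features(["05/21/1916", "12/07/1990", "03/15/2000"]))
# {'mean_string_length': 10.0, 'ratio_alpha_chars': 0.0, 'ratio_unique_values': 1.0}
```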
3. APPROACHES

In this section we describe the classifiers for the semantic labeling problem used in our evaluation. We also discuss approaches to the problem of unknown attributes and the lack of training data.

Once we have a set of labeled data sources, we construct feature vectors for all attributes in this set and mark them as representatives of the class corresponding to their semantic labels. The constructed set of (feature vector, class label) pairs is then used to train a classifier. We consider several approaches, which we divide into 3 major groups: DINT, Deep Learning and the state-of-the-art DSL. Each approach trains a multi-class classification model that produces, at the prediction stage, a list of class probabilities for an attribute in a new source. The class with the highest predicted probability is then assigned to the attribute at the decision stage.

3.0.1 DINT

In our first approach DINT (Data INTegrator) we hand-engineer 26 features, which include characteristics such as the number of whitespaces and other special characters, statistics of the values in the column (e.g., mean/max/min string length and numeric statistics) and many more. The complete list of features is available in the open source benchmark repository². One of the important features characterising the information content of an attribute is Shannon's entropy of the attribute's concatenated rows. Shannon's entropy (or information entropy [11]) of a string X is defined as $H(X) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability that the character with index i in the character vocabulary appears in X, and the summation ranges over all characters in the vocabulary. To evaluate $p_i$ in Shannon's entropy, we compute the normalized character frequency distribution chardist of an attribute, i.e., the counts of each character in the concatenated rows of the attribute, normalized by the total length of the concatenated rows. The vocabulary consists of 100 printable characters (including \n). Finally, we also add the 100-dimensional vector of $p_i$ to the attribute feature vector.

In addition to the above features, which can be directly calculated from attribute values, we compute the mean cosine similarity of the attribute's character distribution with the character distributions of all class instances. This adds as many additional scalar features to the full attribute feature vector as there are classes in the training data. In our case we have as many classes as there are semantic labels.

One can expect that the names of the attributes should also contain useful information for determining their semantic types, in addition to the information provided by attribute values. To extract features from attribute names, we compute string similarity metrics: minimum edit distance, two WordNet-based similarity measures, JCN [7] and LIN [10], and k-nearest neighbors using the Needleman-Wunsch distance [13]. The minimum edit distance between two strings s1 and s2 is the minimum number of edit operations, such as insertion, deletion and substitution, required to transform one string into the other [11]. We compute the similarity between the attribute name and all class instances in the training data. The number of features extracted this way depends on the number of semantic labels in the training data.

We choose to train a Random Forest [4] (RF) on this set of features. RF is quite robust to noisy data, works well even with correlated features, and easily captures complex nonlinear relationships between features and target. Additionally, RF classifiers require little hyperparameter tuning, and hence they usually work straight "out of the box", which makes them a convenient yet versatile classifier to use.

² serene-benchmark
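As an illustration, the chardist and entropy features described above could be computed as in the sketch below. This is a simplified reading of the description, assuming Python's string.printable as the 100-character vocabulary and ignoring characters outside it; the benchmark implementation may differ in such details.

```python
import math
import string

# string.printable contains exactly 100 printable characters (including '\n')
VOCAB = {ch: i for i, ch in enumerate(string.printable)}

def chardist(values):
    """Normalized character frequency distribution of an attribute:
    counts of each vocabulary character in the concatenated rows,
    divided by the total length of the concatenated rows."""
    text = "".join(str(v) for v in values)
    dist = [0.0] * len(VOCAB)
    for ch in text:
        if ch in VOCAB:            # characters outside the vocabulary are ignored here
            dist[VOCAB[ch]] += 1.0
    total = sum(dist)
    return [c / total for c in dist] if total else dist

def shannon_entropy(dist):
    """H(X) = -sum_i p_i log2(p_i), summed over characters with p_i > 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def chardist_plus_entropy(values):
    """The 101-dimensional vector (chardist + entropy) used by the simpler models below."""
    dist = chardist(values)
    return dist + [shannon_entropy(dist)]
```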
3.0.2 Deep Learning

Deep learning has gained much popularity due to its tremendous impact in areas such as speech recognition, object recognition and machine translation [8]. One of the biggest advantages of deep learning is the ability to process data in its raw form and to discover the representation needed for classification, assisting with the feature engineering step.

Broadly speaking, deep learning is an overarching term for artificial neural networks, where the word "deep" refers to the depth of the network. At the basic level, neural networks are composed of perceptrons, or neural nodes. There can be several layers of interconnected neural nodes; the first layer is the input layer, the last one is the output layer, and the layers in between are called hidden. Neural nodes in each layer take as input the output of the nodes from the previous layer, perform some computation with a nonlinear activation function (e.g., tanh or ReLU) and pass the result to the next layer. There are generally no connections between nodes in the same layer. Overall, deep learning models improve in their performance the more data they are trained on. The exact architecture of a deep learning model, i.e., the number of layers, the number of nodes in each layer, the activation functions of the neurons and the interconnectedness between layers, all influence the performance of the trained model.
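In the standard notation (not specific to our models), a single fully connected layer computes $\mathbf{h} = \phi(W\mathbf{x} + \mathbf{b})$, where $\mathbf{x}$ is the layer input, $W$ and $\mathbf{b}$ are learned weights and biases, and $\phi$ is a nonlinear activation such as tanh or ReLU; stacking several such layers and ending with a softmax output layer yields the multi-class classifiers described below.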
We choose two different architectures for our deep learning classifiers: (i) Multi-Layer Perceptron (MLP) [19] and (ii) Convolutional Neural Network (CNN) [8]. We have experimented with different designs of the MLP and CNN networks, varying the hyperparameters that control the number of hidden layers, the number of nodes/filters per layer, the dropout probability, etc., and found that the designs described briefly below work well for all the datasets in the benchmark.

The input layer of the MLP architecture takes the 101-dimensional feature vector of character frequencies $p_i$ (chardist) and Shannon entropy. Following the input layer, the MLP has 3 fully connected hidden layers with 100 nodes per layer and tanh activations. After the 1st hidden layer, we introduce a stochastic dropout layer with dropout probability 0.5 to prevent overfitting. Finally, the output layer of the MLP (the actual classifier) is a softmax layer with the number of nodes equal to the number of semantic types (including the 'unknown' type).

The CNN model takes as input the one-hot representation of an attribute's concatenated rows in character space, embeds it into a dense 64-dimensional embedding, and passes this embedded "image" of the attribute through two consecutive 1-d convolution layers with 100 filters per layer, followed by a 1-d max-pooling layer, a flattening layer, a dropout layer with dropout probability 0.5, then a fully connected layer with 100 nodes, and finally a fully connected softmax output layer (the classifier) with the number of nodes equal to the number of semantic types (including the 'unknown' type).

Though we cannot be sure that our final choice of architectures is optimal, it appears to be a good trade-off between the complexity of the models, the computational resources required for their training, and their overall performance on the semantic labeling task. We have implemented both models using the Keras library with a GPU-based TensorFlow backend [1].
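For concreteness, the two architectures could be written in Keras roughly as follows. This is a sketch based on the description above, not the benchmark code: the optimizer, loss, convolution kernel sizes and activations, maximum sequence length and number of classes are placeholders or assumptions, and the CNN here feeds integer-encoded characters to an Embedding layer, which is equivalent to the one-hot-plus-embedding formulation in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 21   # number of semantic labels, incl. 'unknown' (placeholder)
VOCAB_SIZE = 100   # printable-character vocabulary
MAX_LEN = 1000     # padding/truncation length of the concatenated rows (placeholder)

def build_mlp():
    """MLP over the 101-dimensional chardist + entropy feature vector."""
    model = keras.Sequential([
        keras.Input(shape=(101,)),
        layers.Dense(100, activation="tanh"),
        layers.Dropout(0.5),                 # dropout after the 1st hidden layer
        layers.Dense(100, activation="tanh"),
        layers.Dense(100, activation="tanh"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_cnn():
    """CNN over integer-encoded character sequences of an attribute's concatenated rows."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # dense character embedding
        layers.Conv1D(100, kernel_size=3, activation="relu"),
        layers.Conv1D(100, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(100, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```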
3.0.3 DSL

The Domain-independent Semantic Labeler (DSL) was proposed by Pham et al. [14]; it constructs 6 feature groups based on similarity metrics. These metrics measure how similar the name and values of an attribute are to the characteristics of other attributes. This means that given 5 attributes in the training data (i.e., already labeled instances) with distinct semantic labels, a new attribute will be compared to representatives of each semantic label and 30 features will be calculated in total. The considered similarity metrics are: attribute name similarity, standard Jaccard similarity for textual data and a modified version for numerical data, TF-IDF cosine similarity, and distribution and histogram similarity.

Instead of building one multi-class classifier, the authors train binary classifiers separately for each semantic label. A binary classifier for a particular semantic label is a Logistic Regression model trained on a set of similarity metrics with representatives of this label. When predicting semantic labels for a new attribute, they combine the predictions of each classifier to produce the final vector of probabilities. One of the distinctive properties of this approach is the ability to transfer a classification model trained in one domain to predicting semantic labels for attributes in another domain. We denote this enhanced approach as DSL+.

3.1 Bagging

To train a classifier for semantic labeling, we need data sources with many labeled attributes. However, the costly operation of manually assigning labels to attributes, and the relatively small number of columns compared to the data set size, imply that a lack of training data is a common problem for semantic labeling systems. Existing systems [14, 17, 23] use knowledge transfer techniques to overcome this issue. We introduce a method for increasing the training sample size based on a machine learning approach known as bagging [3].

Breiman [3] introduced the concept of bootstrap aggregating, also known as bagging, to construct ensembles of models that improve prediction accuracy. The method consists of training different classifiers on bootstrapped replicas of the original dataset; diversity is obtained through the resampling procedure, which uses different data subsets. At the prediction stage each individual classifier evaluates an unknown instance, and a majority or weighted vote is used to infer the class.

We modify the idea of bagging for our problem. It is clear that the semantics of the columns in the table "Employees" (Table 1b) will not change whether we have 3 or 1000 rows. So, we can create several training instances for an attribute, where each instance (called a bag) contains a random sample (with replacement) of its content. This procedure is governed by two parameters, numBags and bagSize: the first controls how many bags are generated per attribute, while the second indicates how many rows are sampled per bag. In this way we address the issue of noise by increasing the diversity of the training data, as well as the issue of insufficient training data.

Another common problem encountered in a wide range of data mining and machine learning initiatives is class imbalance. Class imbalance occurs when the class instances in a dataset are not equally represented. In such a situation, standard machine learning models will deliver poor results, since they will favor classes with large populations over classes with small populations. To address this issue, we have tried several resampling strategies to equalize the number of instances per class.
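A minimal sketch of the bag-generation step, assuming each labeled attribute is simply a list of cell values; the function names are ours, but numBags and bagSize follow the definitions above.

```python
import random

def make_bags(columns, labels, num_bags=100, bag_size=100, seed=0):
    """Expand each labeled attribute into num_bags training instances (bags),
    each containing bag_size rows sampled with replacement from the attribute."""
    rng = random.Random(seed)
    bags, bag_labels = [], []
    for column, label in zip(columns, labels):
        for _ in range(num_bags):
            bags.append([rng.choice(column) for _ in range(bag_size)])
            bag_labels.append(label)
    return bags, bag_labels

# Example: 2 labeled attributes expanded into 2 * 5 = 10 bags of 10 rows each
columns = [["Neil", "Mary", "Henry"], ["NSW", "NSW", "NSW"]]
labels = [("Person", "name"), ("State", "name")]
bags, bag_labels = make_bags(columns, labels, num_bags=5, bag_size=10)
```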
3.2 Unknown class

As mentioned previously, some attributes are not mapped to any property in the ontology. To handle this issue, we introduce one more class called unknown. For example, attributes which get discarded from the integration process can be marked as unknown. This way we can help the classifier recognize such attributes in new sources. In addition, there is another advantage of having the unknown class defined explicitly. Consider a new attribute with an unseen semantic label, that is, a label which is not present in the training data. Instead of picking the closest match among the known semantic labels, the classifier will mark it as unknown. The user will then need to validate the attributes that are classified as unknown. This ensures that the unknown class consists only of unwanted attributes. We do not introduce another class to differentiate between unwanted attributes and unseen labels, since we cannot guarantee that there is no overlap between them. Only our DINT and Deep Learning approaches support the unknown class.

4. EXPERIMENTS

We have run all our experiments on a Dell server with 252 GiB of memory, 2 CPUs with 4 cores each, 1 Titan GPU and 1 GeForce 1080 Ti GPU. The deep learning models have been optimized for GPUs using TensorFlow. The benchmark for semantic labeling systems is implemented in Python and is available under an open source license³.

³ http://github.com/NICTA/serene-benchmark

4.1 Datasets

We use 5 different sets of data sources in our evaluation, labeled as museum, city, weather, soccer [14] and weapons [21]. Each set of data sources has been manually mapped to a domain with a specific set of semantic labels. Descriptive statistics of each domain set are shown in Table 3. As we can see, these sets differ substantially. This provides us with an opportunity to evaluate how different approaches behave in various scenarios. The museum and soccer domains are the only domains which have unknown attributes. The city domain has many semantic labels and attributes, while the museum domain contains more data sources. The number of instances per semantic label varies across these domains.

Table 3: Description of data sources.

Domain  | # sources | # semantic labels | # attributes | # unknown attributes | avg # rows per source | avg # attributes per source
weather | 4         | 12                | 44           | 0                    | 108.5                 | 11
weapons | 15        | 28                | 175          | 0                    | 54.46                 | 11.66
museum  | 29        | 20                | 443          | 159                  | 6978.89               | 15.27
soccer  | 12        | 18                | 138          | 42                   | 2120.16               | 11.5
city    | 10        | 52                | 520          | 0                    | 2251                  | 52

To estimate class imbalance within each domain, we plot the class distribution in Figure 2. The museum domain has the highest imbalance among classes, the soccer and weapons domains also have imbalanced classes, whereas the weather and city domains have equally represented classes.

[Figure 2: Distribution of attributes according to semantic labels, including the unknown class, in different domains; panels (a) weather, (b) weapons, (c) soccer, (d) museum, (e) city. The x-axis shows numeric identifiers of semantic labels sorted by the number of attributes in each class; the y-axis shows the number of attributes. We can see class imbalance in the museum, soccer and weapons domains.]

4.2 Experimental setting

We establish a common evaluation framework for the approaches described in Section 3. As a performance metric we use Mean Reciprocal Rank (MRR) [5]. To derive a comprehensive estimate of performance within domains, we implement two cross-validation techniques: leave one out and repeated holdout.

The leave one out strategy uses one source as the testing sample and the rest of the sources in the domain as the training sample. This procedure is repeated as many times as there are sources in the domain. We calculate MRR on the testing sample in each iteration and report the average MRR as the final performance metric. For example, for the museum domain we obtain 29 models in total, where each model is trained on a different set of 28 sources, and MRR is calculated on the prediction outcome for the remaining single source. This strategy allows us to estimate the performance of the different models given that there are enough instances per semantic label.

In the repeated holdout strategy, we randomly sample a ratio p of sources to place in the training sample and use the remaining sources as the testing sample; this procedure is repeated n times. The final MRR score is the average of the MRR scores over the iterations. We use this technique to simulate the scenario where there is a shortage of labeled sources. We set the ratio p = 0.2 and the number of iterations n = 10.
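For reference, MRR over a test source could be computed as in the sketch below, where each attribute's candidate labels are ordered by decreasing predicted probability. This is a simplified illustration; the benchmark implementation (e.g., its handling of ties or of labels missing from the ranking) may differ.

```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """ranked_predictions: for each attribute, a list of candidate semantic labels
    ordered by decreasing predicted probability; true_labels: the correct label of
    each attribute. The reciprocal rank is 1/position of the correct label
    (0 if it is absent), averaged over all attributes."""
    scores = []
    for ranking, truth in zip(ranked_predictions, true_labels):
        scores.append(1.0 / (ranking.index(truth) + 1) if truth in ranking else 0.0)
    return sum(scores) / len(scores)

# Example: correct label ranked 1st and 2nd -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank(
    [[("Person", "name"), ("Organization", "name")],
     [("State", "name"), ("City", "name")]],
    [("Person", "name"), ("City", "name")]))
```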
4.3 Results

In this section we report the results of our experiments. In total we evaluate 13 models, and we also report the run times required to train them.

To train the MLP and CNN models, we need many training instances, so we use bagging (presented in Section 3.1) with parameters numBags=150 and bagSize=100 to increase the size of the initial training set. The semantic labeling system DINT can be trained with different sampling strategies. In particular, we report results when we apply no resampling, and bagging with parameters bagSize=100 and numBags=100. We also experiment with various class imbalance resampling strategies, including resampling to the mean or maximum of instance counts per class. For brevity and without loss of generality we report results only for the resampling-to-mean strategy, denoted ResampleToMean. By design, DSL and DSL+ use no resampling.

As mentioned in Section 3.0.1, the DINT model is built on a set of elaborately engineered features. The MLP model, on the other hand, uses only chardist and entropy. To better compare the performance of MLP and DINT, we create a new model DINT base, reducing the feature set to just chardist and entropy. In addition, we create another model DINT base+ by using chardist and entropy plus the minimum edit distance feature. We choose this feature because the feature importance scores produced by the random forest algorithm rank edit distance higher than the other features extracted from names.

Table 4 reports the MRR scores for the leave one out strategy. Surprisingly, models built on just normalized character distributions of attribute values perform very well in many cases. The deep learning models MLP and CNN are often comparable with the DINT models, but usually come at a higher computational cost. Run times for training each model are shown in Table 5.

Table 4: MRR scores for leave one out strategy when unknown attributes are not considered.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 0.711 | 0.848  | 0.865  | 0.731   | 0.904
None           | DINT all   | 0.971 | 0.902  | 0.801  | 0.826   | 0.964
None           | DINT base  | 0.925 | 0.86   | 0.794  | 0.810   | 0.93
None           | DINT base+ | 0.925 | 0.89   | 0.792  | 0.807   | 0.93
None           | MLP        | 0.873 | 0.886  | 0.862  | 0.799   | 0.965
None           | CNN        | 0.877 | 0.893  | 0.823  | 0.813   | 0.939
Bagging        | DINT all   | 0.956 | 0.913  | 0.804  | 0.833   | 0.979
Bagging        | DINT base  | 0.928 | 0.894  | 0.887  | 0.825   | 0.941
Bagging        | DINT base+ | 0.928 | 0.911  | 0.79   | 0.813   | 0.956
ResampleToMean | DINT all   | 0.969 | 0.907  | 0.814  | 0.792   | 0.956
ResampleToMean | DINT base  | 0.929 | 0.888  | 0.802  | 0.788   | 0.911
ResampleToMean | DINT base+ | 0.929 | 0.901  | 0.835  | 0.765   | 0.926

Table 5: Model training times (s) for leave one out.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 295.6 | 164.3  | 36.6   | 269.5   | 8.2
None           | DINT all   | 10.8  | 74.8   | 8.0    | 6.2     | 2.0
None           | DINT base  | 10.2  | 20.4   | 3.9    | 5.0     | 3.6
None           | DINT base+ | 10.2  | 20.7   | 4.0    | 4.1     | 2.0
None           | MLP        | 184.2 | 216.5  | 26.8   | 85      | 11
None           | CNN        | 184.8 | 276.2  | 29.7   | 71.3    | 12.5
Bagging        | DINT all   | 212   | 310.3  | 47.3   | 127.9   | 11.6
Bagging        | DINT base  | 165.5 | 83.1   | 26.8   | 35.1    | 8.6
Bagging        | DINT base+ | 165.5 | 80.0   | 27.8   | 44.0    | 7.1
ResampleToMean | DINT all   | 10.9  | 58.6   | 12.2   | 8.3     | 2.1
ResampleToMean | DINT base  | 10.8  | 20.5   | 2.3    | 4.3     | 2.1
ResampleToMean | DINT base+ | 10.9  | 18.4   | 2.8    | 4.3     | 2.1

As we can see, DINT models that use bagging to sample more training instances achieve the best results in four domains. Remarkably, these are also the domains with higher class imbalance and more variety among data sources in terms of the number of rows and columns; data sources in the city domain all have the same number of attributes. We have also discovered that bagging needs to be performed both at the training and prediction stages to achieve the best performance. This setting makes a noticeable difference in domains where the number of rows varies substantially among data sources. For example, in the museum domain the number of rows ranges from 6 to 85235, and in the soccer domain the range is from 500 to 9443.

In terms of computation time, the best performing model for the museum domain, DINT all, requires a lot of time for training. The most computationally expensive features are the four different edit distances: minimum edit distance, JCN, LIN and k-nearest neighbors. This suggests that the DINT model with all possible features does not scale well with an increasing number of attributes in the training set. Considering the similarity metrics used in other approaches like DSL and T2K [17], computing TF-IDF and Jaccard scores may help resolve this runtime issue for DINT all.

For class imbalance, although the ResampleToMean strategy improves the performance of DINT models without sampling in the domains with the highest class imbalance (i.e., museum and soccer), it leads to decreased performance in the domains with less prominent imbalance (i.e., weapons and weather). This leads us to the idea that the class resampling strategy needs to be improved. One potential improvement may be to combine the bagging and resampling strategies. Instead of fixing numBags for all attributes, the parameter could be set to either the mean or the maximum of instance counts per class. In this way we can perform resampling without producing exact replicas of the attributes.
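As a speculative illustration of this idea (not something evaluated in this paper), numBags could be chosen per attribute from the class counts, for example:

```python
from collections import Counter

def bags_per_attribute(labels, base_bags=100):
    """Assign a per-attribute numBags so that every semantic label receives
    roughly the same total number of bags: attributes of under-represented
    classes get more bags, attributes of over-represented classes get fewer."""
    counts = Counter(labels)
    target_per_class = int(sum(counts.values()) / len(counts)) * base_bags  # mean count * base
    return [max(1, target_per_class // counts[label]) for label in labels]
```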
Apart from the city and weapons domains, our newly designed models have a performance similar to DSL. However, the computational complexity of these models varies. For the museum domain DINT base+ has a higher MRR than DSL, yet requires about half the training time. It appears that attributes which contain a mixture of textual and numeric values are a bottleneck for DSL, since data sources in the city and weapons domains have multiple such mixed columns.

In cases where there are few labeled instances (the repeated holdout strategy, Table 6), we observe that DSL performs well, especially DSL+, which leverages labeled instances from other domains. We should be aware that in this scenario there are many unseen labels, which makes MRR ill-defined. Comparing the DINT models in this scenario suggests that bagging is advantageous in situations where there are few labeled attributes. Overall, enhancing our DINT model, which uses simple features and bagging, with the knowledge transfer capability of DSL+ might result in a more stable semantic labeling system. Another enhancement may be to introduce resampling strategies into the DSL system.

Table 6: MRR scores for repeated holdout strategy when unknown attributes are not considered.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 0.719 | 0.889  | 0.614  | 0.611   | 0.805
None           | DSL+       | 0.782 | 0.927  | 0.813  | 0.872   | 1
None           | DINT all   | 0.949 | 0.798  | 0.553  | 0.688   | 0.583
None           | DINT base  | 0.888 | 0.763  | 0.516  | 0.684   | 0.621
None           | DINT base+ | 0.888 | 0.778  | 0.542  | 0.686   | 0.621
None           | MLP        | 0.797 | 0.77   | 0.663  | 0.695   | 0.887
None           | CNN        | 0.723 | 0.774  | 0.606  | 0.664   | 0.882
Bagging        | DINT all   | 0.945 | 0.791  | 0.656  | 0.682   | 0.854
Bagging        | DINT base  | 0.919 | 0.788  | 0.634  | 0.701   | 0.867
Bagging        | DINT base+ | 0.919 | 0.790  | 0.628  | 0.688   | 0.852
ResampleToMean | DINT all   | 0.949 | 0.789  | 0.455  | 0.588   | 0.557
ResampleToMean | DINT base  | 0.89  | 0.749  | 0.451  | 0.578   | 0.611
ResampleToMean | DINT base+ | 0.89  | 0.758  | 0.445  | 0.564   | 0.611

In addition, we perform experiments for the two domains museum and soccer, where unmapped attributes cause skewed class distributions. Here we want to establish how well the different approaches can recognize such attributes. In Tables 7 and 8 we can see that the performance of the semantic labeling systems changes considerably. The performance of both DSL and DSL+ is affected by their inability to differentiate "unwanted" attributes.

Table 7: Performance for leave one out strategy when the unknown class is considered.

Sampling       | Model      | MRR museum | MRR soccer | Train time (s) museum | Train time (s) soccer
None           | DSL        | 0.56       | 0.618      | 156.6                 | 36.3
None           | DINT all   | 0.866      | 0.827      | 100.6                 | 6.8
None           | DINT base  | 0.838      | 0.809      | 28.4                  | 5.9
None           | DINT base+ | 0.849      | 0.824      | 33.4                  | 6.2
None           | MLP        | 0.802      | 0.784      | 417.2                 | 37.6
None           | CNN        | 0.831      | 0.785      | 394.5                 | 39.6
Bagging        | DINT all   | 0.854      | 0.795      | 395.2                 | 64.5
Bagging        | DINT base  | 0.839      | 0.863      | 112.5                 | 26.7
Bagging        | DINT base+ | 0.867      | 0.793      | 114.4                 | 30.6
ResampleToMean | DINT all   | 0.776      | 0.730      | 100.5                 | 6.8
ResampleToMean | DINT base  | 0.721      | 0.69       | 26.2                  | 4.2
ResampleToMean | DINT base+ | 0.759      | 0.753      | 26.7                  | 5.2

Table 8: Performance for repeated holdout strategy when the unknown class is considered.

Sampling       | Model      | MRR museum | MRR soccer | Train time (s) museum | Train time (s) soccer
None           | DSL        | 0.544      | 0.355      | 15.5                  | 4.2
None           | DSL+       | 0.303      | 0.43       | 215.9                 | 241.3
None           | DINT all   | 0.769      | 0.549      | 16.2                  | 2
None           | DINT base  | 0.743      | 0.608      | 10.2                  | 2
None           | DINT base+ | 0.742      | 0.613      | 10.1                  | 2
None           | MLP        | 0.675      | 0.572      | 94.4                  | 9.4
None           | CNN        | 0.683      | 0.534      | 87.7                  | 10
Bagging        | DINT all   | 0.827      | 0.551      | 101.9                 | 13.6
Bagging        | DINT base  | 0.76       | 0.593      | 55.6                  | 13.5
Bagging        | DINT base+ | 0.721      | 0.59       | 53.1                  | 13.6
ResampleToMean | DINT all   | 0.637      | 0.428      | 35.5                  | 2.6
ResampleToMean | DINT base  | 0.607      | 0.475      | 12.2                  | 2
ResampleToMean | DINT base+ | 0.633      | 0.479      | 11.5                  | 2

When performing bagging on attributes in the training data, we introduce diversity by drawing many samples of attribute values. However, we do not apply any perturbation technique to the names of the attributes and instead use their exact replicas. In Table 8 we observe that DINT base performs better than DINT base+ when bagging is used. In datasets with scarce labeled instances our DINT models tend to overfit the attribute names that are present in the training data. This suggests that introducing a technique similar to bagging for column headers might lead to much better performance. On the other hand, our results are consistent with the observations in the work of Ritze et al. [17]: their results indicate that comparing attribute values is crucial for this task, while attribute names might introduce additional noise.

Clearly, the performance of our approach DINT varies depending on the chosen bagging parameters numBags and bagSize. To explore this dependence, we evaluate the performance of DINT with only the chardist and entropy features, varying one of the bagging parameters while fixing the other. We report the results of this evaluation in Figure 3. Here we do not consider unknown attributes, and we choose the repeated holdout strategy to analyze the behavior of bagging when there is a shortage of training data. Interestingly, increasing the values of the bagging parameters does not always lead to improved performance, though the computational time required for both the training and prediction stages increases. The city domain is the most sensitive to the bagging parameters. We assume this is because the city domain is the only domain with an equal distribution of semantic labels and equal numbers of columns and rows across data sources. It appears that in the other domains bagging makes the models more robust towards variance in these characteristics.

[Figure 3: Dependence of MRR scores for DINT base on the bagging parameters, using the repeated holdout strategy; unknown attributes are not considered. Panel (a) fixes bagSize=100 and varies numBags; panel (b) fixes numBags=50 and varies bagSize. MRR curves are shown for the weather, weapons, soccer, museum and city domains.]
5. RELATED WORK

The problem of semantic labeling, as addressed in this work, can be regarded as the problem of schema matching in the field of data integration [2]. In the schema matching problem we match elements between the source and target schemata. In our case the elements of the source schema are attributes, and we want to map these attributes to properties in the ontology. The semantic labeling problem is also known in the literature as attribute-to-property matching [18, 17]. Indicating semantic correspondences manually might be appropriate if only a few data sources need to be integrated; however, it becomes tedious with a growing number of heterogeneous schemata. Hence, automatic or semi-automatic approaches for schema matching are being actively developed.

From a machine learning perspective, we can categorize these approaches into unsupervised techniques, which compute various similarity metrics, and supervised techniques, which build a multi-class classification model. Unsupervised approaches are used in SemanticTyper [16], T2K [18] and its extended version [17]. In all these approaches the authors design similarity metrics for attribute names and attribute values, yet one substantial difference is whether additional knowledge is used in the computation. For example, the authors in [18] and [17] leverage contextual information from DBpedia.

Among supervised approaches, probabilistic graphical models are used in the work of Limaye et al. [9] to annotate web tables with entities for cell values, types for attributes and relationships for binary combinations of attributes. Mulwad et al. [12] extend this approach by leveraging information from the Wikitology Knowledge Base (KB). The problem with probabilistic graphical models, though, is that they do not scale with the number of semantic labels in the domain. Also, Mulwad et al., as well as Venetis et al. [23], who used the isA database KB, extract additional data from knowledge bases to assign a semantic label to an attribute. Hence, these approaches are limited to domains well represented in those knowledge bases. Our approach, on the other hand, is not domain specific and allows a model to be trained on any data. However, we cannot apply a model learnt on one domain to another, which is possible with the DSL approach [14].

To the best of our knowledge, DSL, introduced by Pham et al. [14], is among the top semantic labeling systems. Pham et al. compare DSL to their previous approach SemanticTyper [16] and to the T2K system [18], and achieve higher MRR scores on a variety of datasets. Therefore, we use DSL as the state-of-the-art model in our benchmark to evaluate our new approaches.

Ritze et al. [17] and Pham et al. [14] mention the problem of the unknown class. In the first work the authors discuss "unwanted" attributes, while in the second the authors reflect on how to handle "unseen" attributes. In our work we do not differentiate between these two cases and show that we can successfully identify such attributes when sufficient training data is available.
6. CONCLUSION

In this paper we have studied the problem of supervised semantic labeling and have conducted experiments to evaluate how different approaches perform at this task. Our main finding is that our bagging sampling technique can provide meaningful diversity in the training data and thereby improve performance. Additionally, this technique can overcome the lack of labeled attributes in a domain and can increase the number of instances for under-represented semantic labels. We find that given scarce training data, bagging leads to a noticeable improvement in performance, though the state-of-the-art system DSL [14] achieves better precision by leveraging information about labeled instances from other domains. However, if we are to consider unwanted attributes and unseen semantic labels, our new system DINT demonstrates the best performance. Among the semantic labeling systems in our benchmark we have observed that the performance results are highly dependent on the use case.

We have also shown that deep learning models, such as CNN and MLP, can be applied to solve this problem. Though these models do not excel in performance in the majority of cases, their advantage is the simplicity of the features extracted from attributes. For example, the CNN is built on raw sequences of attribute values. Surprisingly, we have discovered that even random forests constructed just on the character distributions of values and the entropy of attributes provide remarkable results in many cases. This supports the observations in the literature that attribute values are crucial for the semantic labeling task [18, 17].

Future work may involve exploring a combination of bagging and class imbalance resampling strategies. We have observed that where the domain data has high imbalance among the representatives of different semantic labels, resampling can lead to improved performance, but a more sophisticated approach is required in domains which do not exhibit these characteristics. Another possible direction for improvement is to introduce an equivalent of bagging for attribute names. In addition, our experiments indicate that the performance of the systems is often affected by the variance in the sizes of data sources and by how well each semantic label is represented in the training data. To this end, we consider including T2KMatch [17] in our benchmark, as well as domain sets from the RODI benchmark [15].
7. REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, et al. TensorFlow: A system for large-scale machine learning. In Proc. of OSDI, pages 265–283, 2016.
[2] Z. Bellahsene, A. Bonifati, and E. Rahm, editors. Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, 2011.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] N. Craswell. Mean reciprocal rank. In Encyclopedia of Database Systems, pages 1703–1703. 2009.
[6] A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
[7] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008, 1997.
[8] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[9] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. of the VLDB Endowment, 3(1-2):1338–1347, 2010.
[10] D. Lin et al. An information-theoretic definition of similarity. In Proc. of ICML, volume 98, pages 296–304, 1998.
[11] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. 2008.
[12] V. Mulwad, T. Finin, and A. Joshi. Semantic message passing for generating linked data from tables. In Proc. of ISWC, pages 363–378, 2013.
[13] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[14] M. Pham, S. Alse, C. A. Knoblock, and P. Szekely. Semantic labeling: a domain-independent approach. In Proc. of ISWC, pages 446–462. Springer, 2016.
[15] C. Pinkel, C. Binnig, E. Jiménez-Ruiz, E. Kharlamov, W. May, et al. RODI: Benchmarking relational-to-ontology mapping generation quality. Semantic Web, (Preprint):1–28, 2016.
[16] S. Ramnandan, A. Mittal, C. A. Knoblock, and P. Szekely. Assigning semantic labels to data sources. In Proc. of ESWC, pages 403–417, 2015.
[17] D. Ritze and C. Bizer. Matching web tables to DBpedia - a feature utility study. In Proc. of EDBT, pages 210–221, 2017.
[18] D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML tables to DBpedia. In Proc. of WIMS, page 10, 2015.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[20] D. Spanos, P. Stavrou, and N. Mitrou. Bringing relational databases into the semantic web: A survey. Semantic Web, 3(2):169–209, 2012.
[21] M. Taheriyan, C. A. Knoblock, P. Szekely, and J. L. Ambite. Leveraging linked data to discover semantic relations within data sources. In Proc. of ISWC, pages 549–565. Springer, 2016.
[22] M. Taheriyan, C. A. Knoblock, P. A. Szekely, and J. L. Ambite. Learning the semantics of structured data sources. J. Web Sem., 37:152–169, 2016.
[23] P. Venetis, A. Halevy, J. Madhavan, M. Paşca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. Proc. of the VLDB Endowment, 4(9):528–538, 2011.