Evaluating Approaches for Supervised Semantic Labeling∗

Nataliia Rümmele (Siemens, Germany, nataliia.ruemmele@siemens.com)
Yuriy Tyshetskiy (Data61, CSIRO, Australia, yuriy.tyshetskiy@data61.csiro.au)
Alex Collins (Data61, CSIRO, Australia, alex.collins@data61.csiro.au)

∗ Work accomplished at Data61, CSIRO.
TheWebConf Workshop: Linked Data on the Web (LDOW) 2018, Lyon, France. © 2018 Copyright held by the owner/author(s).

ABSTRACT

Relational data sources are still one of the most popular ways to store enterprise or Web data; however, the relational schema lacks a well-defined semantic description. A common ontology provides a way to represent the meaning of a relational schema and can facilitate the integration of heterogeneous data sources within a domain. Semantic labeling is achieved by mapping attributes from the data sources to the classes and properties in the ontology. We formulate this problem as a multi-class classification problem where previously labeled data sources are used to learn rules for labeling new data sources. The majority of existing approaches for semantic labeling have focused on data integration challenges such as naming conflicts and semantic heterogeneity. In addition, machine learning approaches typically have issues around class imbalance, lack of labeled instances and the relative importance of attributes. To address these issues, we develop a new machine learning model with engineered features as well as two deep learning models which do not require extensive feature engineering. We evaluate our new approaches against the state-of-the-art.

CCS Concepts

• Information systems → Data extraction and integration; Mediators and data integration; • Computing methodologies → Supervised learning by classification;

Keywords

data integration, schema matching, semantic labeling, ontology, relational schema, bagging
1. INTRODUCTION

An important problem in database research is determining how to combine multiple data sources that are described by different (heterogeneous) schemata [6]. The expected outcome of such a process is a uniform integrated view across these data sources. Relational data sources are still one of the most popular ways to store enterprise or Web data [20]. However, the relational schema lacks a well-defined semantic description. To define the semantics of data, we can introduce an ontology [20]. Our goal is then to map attributes from relational data sources to classes and properties in an ontology. We refer to this problem as semantic labeling.

Semantic labeling plays an important role in data integration [6, 14], augmenting existing knowledge bases [9, 17, 18, 23] and mapping relational sources to ontologies [15, 22]. Various approaches to automate semantic labeling have been developed, including DSL [14] and T2K [17]. Automated semantic labeling techniques typically encounter several problems. Firstly, there can be naming conflicts [15], including cases where users represent the same data in different ways. Secondly, semantically different attributes might have syntactically similar content, for example, birth date versus date of death. Thirdly, a considerable number of attributes do not have any corresponding property in the ontology, either by accident or on purpose. The majority of existing systems focus on the first two problems, but do not consider the third problem during evaluation [18, 14].

To address the challenges of automated semantic labeling, we formulate this task as a supervised classification problem. A set of semantic labels known to the classifier is specified at training time, e.g., from the provided domain ontology. We also introduce a special class of attributes, called unknown. The purpose of the unknown class is to capture attributes which will not be mapped to the ontology. The training data for the classifier will thus consist of source attributes (name and content) and their semantic labels provided by the user, including the unknown labels. Since manually assigning labels to attributes is a costly operation, a lack of training data is a common problem for semantic labeling systems. Existing systems [14, 17, 23] use knowledge transfer techniques to overcome this issue. Instead, we introduce a sampling method similar to bagging for ensemble models [3]. The bagging technique allows us to generate multiple training instances from the user-labeled attributes, thus overcoming the lack of labeled training data. It also allows us to overcome the common issue of class imbalance, when some semantic labels have more support than others among the attributes. We achieve this by re-balancing the training data via preferential bagging from minority class attributes.

The main contributions of this paper are:

1. We introduce a bagging approach to handle class imbalance and the lack of training data by drawing random subsamples from the values of an attribute. This approach can achieve meaningful diversity in the training data and can increase the number of training instances for under-represented semantic labels.

2. We address the issue of "unwanted" attributes, i.e., attributes which do not get mapped to any element in the ontology. In cases where we have a sufficient amount of training data, our models can achieve over 80% Mean Reciprocal Rank (MRR) on two sets of data sources from our benchmark.

3. We construct a classification model DINT with hand-engineered semantic labeling features to implement the above. In addition, we design two deep learning models, CNN and MLP, which use very simple features, such as normalized character frequencies and padded character sequences extracted from the raw values of data attributes.

4. We construct a benchmark with a common evaluation strategy to compare different approaches for supervised semantic labeling. Our benchmark includes the models DINT, CNN, MLP and the state-of-the-art DSL [14], and 5 sets of data sources from different domains. We show that each approach has its strengths and shortcomings, and that choosing a particular semantic labeling system depends on the use case. We have released the implementation of the benchmark under an open source license¹. This benchmark can be easily extended to include other models and datasets, and can be used to choose the most appropriate model for a given use case.

¹ http://github.com/NICTA/serene-benchmark
2. PROBLEM

[Figure 1: Example ontology, containing classes such as Person, Organization, City, State and Place (with subclass links between them), properties such as name, birthDate, email, phone and postalCode, and relations such as worksFor, ceo, livesIn, bornIn, operatesIn and state.]

We illustrate the semantic labeling problem using a simple domain ontology shown in Fig. 1. Assume we have three data sources "personal-info", "businessInfo" and "Employees" (see Tab. 1) whose attributes we choose to label according to the example ontology (Fig. 1). We define a semantic label as a tuple consisting of a domain class and its property. For example, attribute name in the source "personal-info" (see Tab. 1a) is labeled with (Person,name). Note that semantic labels are fixed by the ontology.

The task of semantic labeling is defined as automatically assigning semantic labels to attributes in a data source. In the case of supervised semantic labeling, we use existing known semantic labels for data sources to improve the performance when assigning semantic labels to new sources. For example, assume we are given the sources "personal-info" and "businessInfo" with the correct semantic labels; the system should then automatically assign labels to attributes in the source "Employees".

Table 1: Example relational data sources with semantic labels.

(a) personal-info
name  | birthDate  | city     | state | workplace
Neil  | 21-05-1916 | Waterloo | NSW   | CSIRO
Mary  | 07-12-1990 | Eveleigh | NSW   | CSIRO
Henry | 15-03-2000 | Redfern  | NSW   | Data61
Semantic labels: (Person,name) | (Person,birthDate) | (City,name) | (State,name) | (Organization,name)

(b) Employees
employer | employee | DOB
CSIRO    | Neil     | 05/21/1916
Data61   | Mary     | 12/07/1990
NICTA    | Henry    | 03/15/2000
Semantic labels: (Organization,name) | (Person,name) | (Person,birthDate)

(c) businessInfo
company | ceo            | state                        | founded
CSIRO   | Larry Marshall | Australian Capital Territory | 21-05-1916
Data61  | Adrian Turner  | New South Wales              | 12-07-2016
NICTA   | Hugh Durrant   | New South Wales              | 15-03-2002
Semantic labels: (Organization,name) | (Person,name) | (State,name) | unknown

To build such a system, we cannot rely just on the names of the columns. For example, the columns name in (1a), ceo in (1c) and employee in (1b) all refer to the same property (Person,name). Using just the values of the columns is also problematic. For example, in (1a) acronyms are used for states, while in (1c) state names are fully written out. Furthermore, values can overlap for semantically heterogeneous columns, such as founded in (1c) and birthDate in (1a).

We can also have attributes that are not mapped to any property in the ontology. There might be two reasons for their existence: (1) we are not interested in the content of an attribute and want to discard it from any future analysis; (2) we might have overlooked an attribute when designing the ontology. We do not differentiate between these two cases and mark all such attributes as the unknown class, for example, founded in (1c). The presence of the unknown class makes the task of semantic labeling more complicated. Establishing approaches to efficiently handle such attributes is crucial, since in many real-world scenarios relational data sources (either HTML tables [17] or domain-specific data [14]) contain a considerable number of such attributes.

Machine learning techniques have proved to be efficient for building predictive models on noisy and messy data. Yet to apply these techniques we need to represent source attributes as feature vectors, with semantic labels (classes) attached to these vectors. In Table 2 we show such a representation for the source Employees. For simplicity, we explicitly show only 4 possible features. For example, mean string length is the mean length of the cell values of an attribute. However, the actual feature vector can be arbitrarily long, and the process of designing its components is known as feature engineering. In the next section we discuss the features used in the semantic labeling system.

Table 2: Feature vectors for data source Employees.

attribute | entropy | mean string length | ... | ratio alpha chars | ratio unique values | class
employer  | 1.001   | 5.333              | ... | 0.875             | 1                   | (Organization, name)
employee  | 1.461   | 13.333             | ... | 0.925             | 1                   | (Person, name)
DOB       | 0.69    | 10                 | ... | 0                 | 1                   | (Person, birthDate)
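To make the representation in Table 2 concrete, the following is a minimal Python sketch of how such simple features could be derived from one column. The feature names mirror Table 2, but the exact definitions used in our benchmark (and the entropy feature, discussed in Section 3.0.1) may differ, so the numbers are not guaranteed to match the table.

```python
def attribute_features(values):
    """Turn one column (a list of cell values) into a small feature vector
    in the spirit of Table 2: mean string length, ratio of alphabetic
    characters and ratio of unique values."""
    cells = [str(v) for v in values]
    text = "".join(cells)
    return {
        "mean_string_length": sum(len(c) for c in cells) / len(cells),
        "ratio_alpha_chars": sum(ch.isalpha() for ch in text) / len(text) if text else 0.0,
        "ratio_unique_values": len(set(cells)) / len(cells),
    }

# Example: the 'DOB' column of the Employees source (Table 1b)
print(attribute_features(["05/21/1916", "12/07/1990", "03/15/2000"]))
# {'mean_string_length': 10.0, 'ratio_alpha_chars': 0.0, 'ratio_unique_values': 1.0}
```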
3. APPROACHES

In this section we describe the classifiers for the semantic labeling problem used in our evaluation. We also discuss approaches to the problem of unknown attributes and the lack of training data.

Once we have a set of labeled data sources, we construct feature vectors for all attributes in this set and mark them as representatives of the class corresponding to their semantic labels. The constructed set of (feature vector, class label) pairs is then used to train a classifier. We consider several approaches, which we divide into 3 major groups: DINT, Deep Learning and the state-of-the-art DSL. Each approach trains a multi-class classification model that produces, at the prediction stage, a list of class probabilities for an attribute in a new source. The class with the highest predicted probability is then assigned to the attribute at the decision stage.

3.0.1 DINT

In our first approach DINT (Data INTegrator) we hand-engineer 26 features, which include characteristics such as the number of whitespaces and other special characters, statistics of the values in the column (e.g., mean/max/min string length and numeric statistics) and many more. The complete list of features is available in the open source benchmark repository². One of the important features characterising the information content of an attribute is Shannon's entropy of the attribute's concatenated rows. Shannon's entropy (or information entropy [11]) of a string X is defined as $H(X) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability that the character with index i in the character vocabulary appears in X, and the summation ranges over all characters in the vocabulary. To evaluate $p_i$ in Shannon's entropy, we compute the normalized character frequency distribution chardist of an attribute, i.e., the counts of each character in the concatenated rows of the attribute, normalized by the total length of the concatenated rows. The vocabulary consists of 100 printable characters (including \n). Finally, we also add the 100-dimensional vector of $p_i$ to the attribute feature vector.

In addition to the above features, which can be directly calculated from attribute values, we compute the mean cosine similarity of the attribute's character distribution with the character distributions of all class instances. This adds as many additional scalar features to the full attribute feature vector as there are classes in the training data. In our case we have as many classes as there are semantic labels.

One can expect that the names of the attributes should also contain useful information for determining their semantic types, in addition to the information provided by attribute values. To extract features from attribute names, we compute string similarity metrics: minimum edit distance, two WordNet-based similarity measures, JCN [7] and LIN [10], and k-nearest neighbors using the Needleman-Wunsch distance [13]. The minimum edit distance between two strings s1 and s2 is the minimum number of edit operations, such as insertion, deletion and substitution, required to transform one string into the other [11]. We compute the similarity between the attribute name and all class instances in the training data. The number of features extracted this way depends on the number of semantic labels in the training data.

We choose to train a Random Forest [4] (RF) on this set of features. RF is quite robust to noisy data, works well even with correlated features, and easily captures complex nonlinear relationships between features and target. Additionally, RF classifiers require little hyperparameter tuning, and hence they usually work straight "out of the box", which makes them a convenient yet versatile classifier to use.

² serene-benchmark
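As an illustration, the chardist and entropy features described above could be computed as in the sketch below. This is a simplified reading of the description, assuming Python's string.printable as the 100-character vocabulary and ignoring characters outside it; the benchmark implementation may differ in such details.

```python
import math
import string

# string.printable contains exactly 100 printable characters (including '\n')
VOCAB = {ch: i for i, ch in enumerate(string.printable)}

def chardist(values):
    """Normalized character frequency distribution of an attribute:
    counts of each vocabulary character in the concatenated rows,
    divided by the total length of the concatenated rows."""
    text = "".join(str(v) for v in values)
    dist = [0.0] * len(VOCAB)
    for ch in text:
        if ch in VOCAB:            # characters outside the vocabulary are ignored here
            dist[VOCAB[ch]] += 1.0
    total = sum(dist)
    return [c / total for c in dist] if total else dist

def shannon_entropy(dist):
    """H(X) = -sum_i p_i log2(p_i), summed over characters with p_i > 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def chardist_plus_entropy(values):
    """The 101-dimensional vector (chardist + entropy) used by the simpler models below."""
    dist = chardist(values)
    return dist + [shannon_entropy(dist)]
```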
3.0.2 Deep Learning

Deep learning has gained much popularity due to its tremendous impact in areas such as speech recognition, object recognition and machine translation [8]. One of the biggest advantages of deep learning is the ability to process data in its raw form and to discover the representation needed for classification, assisting with the feature engineering step.

Broadly speaking, deep learning is an overarching term for artificial neural networks, where the word "deep" refers to the depth of the network. At the basic level, neural networks are composed of perceptrons, or neural nodes. There can be several layers of interconnected neural nodes; the first layer is the input layer, the last one is the output layer, and the layers in between are called hidden. Neural nodes in each layer take as input the output of the nodes from the previous layer, perform some computation with a nonlinear activation function (e.g., tanh or ReLU) and pass the result to the next layer. There are generally no connections between nodes in the same layer. Overall, deep learning models improve in their performance the more data they are trained on. The exact architecture of a deep learning model, i.e., the number of layers, the number of nodes in each layer, the activation functions of the neurons and the interconnectedness between layers, all influence the performance of the trained model.
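In the standard notation (not specific to our models), a single fully connected layer computes $\mathbf{h} = \phi(W\mathbf{x} + \mathbf{b})$, where $\mathbf{x}$ is the layer input, $W$ and $\mathbf{b}$ are learned weights and biases, and $\phi$ is a nonlinear activation such as tanh or ReLU; stacking several such layers and ending with a softmax output layer yields the multi-class classifiers described below.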
We choose two different architectures for our deep learning classifiers: (i) Multi-Layer Perceptron (MLP) [19] and (ii) Convolutional Neural Network (CNN) [8]. We have experimented with different designs of the MLP and CNN networks, varying the hyperparameters that control the number of hidden layers, the number of nodes/filters per layer, the dropout probability, etc., and found that the designs described briefly below work well for all the datasets in the benchmark.

The input layer of the MLP architecture takes the 101-dimensional feature vector of character frequencies $p_i$ (chardist) and Shannon entropy. Following the input layer, the MLP has 3 fully connected hidden layers with 100 nodes per layer and tanh activations. After the 1st hidden layer, we introduce a stochastic dropout layer with dropout probability 0.5 to prevent overfitting. Finally, the output layer of the MLP (the actual classifier) is a softmax layer with the number of nodes equal to the number of semantic types (including the 'unknown' type).

The CNN model takes as input the one-hot representation of an attribute's concatenated rows in character space, embeds it into a dense 64-dimensional embedding, and passes this embedded "image" of the attribute through two consecutive 1-d convolution layers with 100 filters per layer, followed by a 1-d max-pooling layer, a flattening layer, a dropout layer with dropout probability 0.5, then a fully connected layer with 100 nodes, and finally a fully connected softmax output layer (the classifier) with the number of nodes equal to the number of semantic types (including the 'unknown' type).

Though we cannot be sure that our final choice of architectures is optimal, it appears to be a good trade-off between the complexity of the models, the computational resources required for their training, and their overall performance on the semantic labeling task. We have implemented both models using the Keras library with a GPU-based TensorFlow backend [1].
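For concreteness, the two architectures could be written in Keras roughly as follows. This is a sketch based on the description above, not the benchmark code: the optimizer, loss, convolution kernel sizes and activations, maximum sequence length and number of classes are placeholders or assumptions, and the CNN here feeds integer-encoded characters to an Embedding layer, which is equivalent to the one-hot-plus-embedding formulation in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 21   # number of semantic labels, incl. 'unknown' (placeholder)
VOCAB_SIZE = 100   # printable-character vocabulary
MAX_LEN = 1000     # padding/truncation length of the concatenated rows (placeholder)

def build_mlp():
    """MLP over the 101-dimensional chardist + entropy feature vector."""
    model = keras.Sequential([
        keras.Input(shape=(101,)),
        layers.Dense(100, activation="tanh"),
        layers.Dropout(0.5),                 # dropout after the 1st hidden layer
        layers.Dense(100, activation="tanh"),
        layers.Dense(100, activation="tanh"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_cnn():
    """CNN over integer-encoded character sequences of an attribute's concatenated rows."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # dense character embedding
        layers.Conv1D(100, kernel_size=3, activation="relu"),
        layers.Conv1D(100, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(100, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```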
3.0.3 DSL

The Domain-independent Semantic Labeler (DSL) was proposed by Pham et al. [14]; it constructs 6 feature groups based on similarity metrics. These metrics measure how similar the name and values of an attribute are to the characteristics of other attributes. This means that given 5 attributes in the training data (i.e., already labeled instances) with distinct semantic labels, a new attribute will be compared to representatives of each semantic label and 30 features will be calculated in total. The considered similarity metrics are: attribute name similarity, standard Jaccard similarity for textual data and a modified version for numerical data, TF-IDF cosine similarity, and distribution and histogram similarity.

Instead of building one multi-class classifier, the authors train binary classifiers separately for each semantic label. A binary classifier for a particular semantic label is a Logistic Regression model trained on a set of similarity metrics with representatives of this label. When predicting semantic labels for a new attribute, they combine the predictions of each classifier to produce the final vector of probabilities. One of the distinctive properties of this approach is the ability to transfer a classification model trained in one domain to predicting semantic labels for attributes in another domain. We denote this enhanced approach as DSL+.

3.1 Bagging

To train a classifier for semantic labeling, we need data sources with many labeled attributes. However, the costly operation of manually assigning labels to attributes, and the relatively small number of columns compared to the data set size, imply that a lack of training data is a common problem for semantic labeling systems. Existing systems [14, 17, 23] use knowledge transfer techniques to overcome this issue. We introduce a method for increasing the training sample size based on a machine learning approach known as bagging [3].

Breiman [3] introduced the concept of bootstrap aggregating, also known as bagging, to construct ensembles of models that improve prediction accuracy. The method consists of training different classifiers on bootstrapped replicas of the original dataset; diversity is obtained through the resampling procedure, which uses different data subsets. At the prediction stage each individual classifier evaluates an unknown instance, and a majority or weighted vote is used to infer the class.

We modify the idea of bagging for our problem. It is clear that the semantics of the columns in the table "Employees" (Table 1b) will not change whether we have 3 or 1000 rows. So, we can create several training instances for an attribute, where each instance (called a bag) contains a random sample (with replacement) of its content. This procedure is governed by two parameters, numBags and bagSize: the first controls how many bags are generated per attribute, while the second indicates how many rows are sampled per bag. In this way we address the issue of noise by increasing the diversity of the training data, as well as the issue of insufficient training data.

Another common problem encountered in a wide range of data mining and machine learning initiatives is class imbalance. Class imbalance occurs when the class instances in a dataset are not equally represented. In such a situation, standard machine learning models will deliver poor results, since they will favor classes with large populations over classes with small populations. To address this issue, we have tried several resampling strategies to equalize the number of instances per class.
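A minimal sketch of the bag-generation step, assuming each labeled attribute is simply a list of cell values; the function names are ours, but numBags and bagSize follow the definitions above.

```python
import random

def make_bags(columns, labels, num_bags=100, bag_size=100, seed=0):
    """Expand each labeled attribute into num_bags training instances (bags),
    each containing bag_size rows sampled with replacement from the attribute."""
    rng = random.Random(seed)
    bags, bag_labels = [], []
    for column, label in zip(columns, labels):
        for _ in range(num_bags):
            bags.append([rng.choice(column) for _ in range(bag_size)])
            bag_labels.append(label)
    return bags, bag_labels

# Example: 2 labeled attributes expanded into 2 * 5 = 10 bags of 10 rows each
columns = [["Neil", "Mary", "Henry"], ["NSW", "NSW", "NSW"]]
labels = [("Person", "name"), ("State", "name")]
bags, bag_labels = make_bags(columns, labels, num_bags=5, bag_size=10)
```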
3.2 Unknown class

As mentioned previously, some attributes are not mapped to any property in the ontology. To handle this issue, we introduce one more class called unknown. For example, attributes which get discarded from the integration process can be marked as unknown. This way we can help the classifier recognize such attributes in new sources. In addition, there is another advantage of having the unknown class defined explicitly. Consider a new attribute with an unseen semantic label, that is, a label which is not present in the training data. Instead of picking the closest match among the known semantic labels, the classifier will mark it as unknown. The user will then need to validate the attributes that are classified as unknown. This ensures that the unknown class consists only of unwanted attributes. We do not introduce another class to differentiate between unwanted attributes and unseen labels, since we cannot guarantee that there is no overlap between them. Only our DINT and Deep Learning approaches support the unknown class.

4. EXPERIMENTS

We have run all our experiments on a Dell server with 252 GiB of memory, 2 CPUs with 4 cores each, 1 Titan GPU and 1 GeForce 1080 Ti GPU. The deep learning models have been optimized for GPUs using TensorFlow. The benchmark for semantic labeling systems is implemented in Python and is available under an open source license³.

³ http://github.com/NICTA/serene-benchmark

4.1 Datasets

We use 5 different sets of data sources in our evaluation, labeled as museum, city, weather, soccer [14] and weapons [21]. Each set of data sources has been manually mapped to a domain with a specific set of semantic labels. Descriptive statistics of each domain set are shown in Table 3. As we can see, these sets differ substantially. This provides us with an opportunity to evaluate how different approaches behave in various scenarios. The museum and soccer domains are the only domains which have unknown attributes. The city domain has many semantic labels and attributes, while the museum domain contains more data sources. The number of instances per semantic label varies across these domains.

Table 3: Description of data sources.

Domain  | # sources | # semantic labels | # attributes | # unknown attributes | avg # rows per source | avg # attributes per source
weather | 4         | 12                | 44           | 0                    | 108.5                 | 11
weapons | 15        | 28                | 175          | 0                    | 54.46                 | 11.66
museum  | 29        | 20                | 443          | 159                  | 6978.89               | 15.27
soccer  | 12        | 18                | 138          | 42                   | 2120.16               | 11.5
city    | 10        | 52                | 520          | 0                    | 2251                  | 52

To estimate class imbalance within each domain, we plot the class distribution in Figure 2. The museum domain has the highest imbalance among classes, the soccer and weapons domains also have imbalanced classes, whereas the weather and city domains have equally represented classes.

[Figure 2: Distribution of attributes according to semantic labels, including the unknown class, in different domains; panels (a) weather, (b) weapons, (c) soccer, (d) museum, (e) city. The x-axis shows numeric identifiers of semantic labels sorted by the number of attributes in each class; the y-axis shows the number of attributes. We can see class imbalance in the museum, soccer and weapons domains.]

4.2 Experimental setting

We establish a common evaluation framework for the approaches described in Section 3. As a performance metric we use Mean Reciprocal Rank (MRR) [5]. To derive a comprehensive estimate of performance within domains, we implement two cross-validation techniques: leave one out and repeated holdout.

The leave one out strategy uses one source as the testing sample and the rest of the sources in the domain as the training sample. This procedure is repeated as many times as there are sources in the domain. We calculate MRR on the testing sample in each iteration and report the average MRR as the final performance metric. For example, for the museum domain we obtain 29 models in total, where each model is trained on a different set of 28 sources, and MRR is calculated on the prediction outcome for the remaining single source. This strategy allows us to estimate the performance of the different models given that there are enough instances per semantic label.

In the repeated holdout strategy, we randomly sample a ratio p of sources to place in the training sample and use the remaining sources as the testing sample; this procedure is repeated n times. The final MRR score is the average of the MRR scores over the iterations. We use this technique to simulate the scenario where there is a shortage of labeled sources. We set the ratio p = 0.2 and the number of iterations n = 10.
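For reference, MRR over a test source could be computed as in the sketch below, where each attribute's candidate labels are ordered by decreasing predicted probability. This is a simplified illustration; the benchmark implementation (e.g., its handling of ties or of labels missing from the ranking) may differ.

```python
def mean_reciprocal_rank(ranked_predictions, true_labels):
    """ranked_predictions: for each attribute, a list of candidate semantic labels
    ordered by decreasing predicted probability; true_labels: the correct label of
    each attribute. The reciprocal rank is 1/position of the correct label
    (0 if it is absent), averaged over all attributes."""
    scores = []
    for ranking, truth in zip(ranked_predictions, true_labels):
        scores.append(1.0 / (ranking.index(truth) + 1) if truth in ranking else 0.0)
    return sum(scores) / len(scores)

# Example: correct label ranked 1st and 2nd -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank(
    [[("Person", "name"), ("Organization", "name")],
     [("State", "name"), ("City", "name")]],
    [("Person", "name"), ("City", "name")]))
```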
4.3 Results

In this section we report the results of our experiments. In total we evaluate 13 models, and we also report the run times required to train them.

To train the MLP and CNN models, we need many training instances, so we use bagging (presented in Section 3.1) with parameters numBags=150 and bagSize=100 to increase the size of the initial training set. The semantic labeling system DINT can be trained with different sampling strategies. In particular, we report results when we apply no resampling, and bagging with parameters bagSize=100 and numBags=100. We also experiment with various class imbalance resampling strategies, including resampling to the mean or maximum of instance counts per class. For brevity and without loss of generality we report results only for the resampling-to-mean strategy, denoted ResampleToMean. By design, DSL and DSL+ use no resampling.

As mentioned in Section 3.0.1, the DINT model is built on a set of elaborately engineered features. The MLP model, on the other hand, uses only chardist and entropy. To better compare the performance of MLP and DINT, we create a new model DINT base, reducing the feature set to just chardist and entropy. In addition, we create another model DINT base+ by using chardist and entropy plus the minimum edit distance feature. We choose this feature because the feature importance scores produced by the random forest algorithm rank edit distance higher than the other features extracted from names.

Table 4 reports the MRR scores for the leave one out strategy. Surprisingly, models built on just normalized character distributions of attribute values perform very well in many cases. The deep learning models MLP and CNN are often comparable with the DINT models, but usually come at a higher computational cost. Run times for training each model are shown in Table 5.

Table 4: MRR scores for leave one out strategy when unknown attributes are not considered.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 0.711 | 0.848  | 0.865  | 0.731   | 0.904
None           | DINT all   | 0.971 | 0.902  | 0.801  | 0.826   | 0.964
None           | DINT base  | 0.925 | 0.86   | 0.794  | 0.810   | 0.93
None           | DINT base+ | 0.925 | 0.89   | 0.792  | 0.807   | 0.93
None           | MLP        | 0.873 | 0.886  | 0.862  | 0.799   | 0.965
None           | CNN        | 0.877 | 0.893  | 0.823  | 0.813   | 0.939
Bagging        | DINT all   | 0.956 | 0.913  | 0.804  | 0.833   | 0.979
Bagging        | DINT base  | 0.928 | 0.894  | 0.887  | 0.825   | 0.941
Bagging        | DINT base+ | 0.928 | 0.911  | 0.79   | 0.813   | 0.956
ResampleToMean | DINT all   | 0.969 | 0.907  | 0.814  | 0.792   | 0.956
ResampleToMean | DINT base  | 0.929 | 0.888  | 0.802  | 0.788   | 0.911
ResampleToMean | DINT base+ | 0.929 | 0.901  | 0.835  | 0.765   | 0.926

Table 5: Model training times (s) for leave one out.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 295.6 | 164.3  | 36.6   | 269.5   | 8.2
None           | DINT all   | 10.8  | 74.8   | 8.0    | 6.2     | 2.0
None           | DINT base  | 10.2  | 20.4   | 3.9    | 5.0     | 3.6
None           | DINT base+ | 10.2  | 20.7   | 4.0    | 4.1     | 2.0
None           | MLP        | 184.2 | 216.5  | 26.8   | 85      | 11
None           | CNN        | 184.8 | 276.2  | 29.7   | 71.3    | 12.5
Bagging        | DINT all   | 212   | 310.3  | 47.3   | 127.9   | 11.6
Bagging        | DINT base  | 165.5 | 83.1   | 26.8   | 35.1    | 8.6
Bagging        | DINT base+ | 165.5 | 80.0   | 27.8   | 44.0    | 7.1
ResampleToMean | DINT all   | 10.9  | 58.6   | 12.2   | 8.3     | 2.1
ResampleToMean | DINT base  | 10.8  | 20.5   | 2.3    | 4.3     | 2.1
ResampleToMean | DINT base+ | 10.9  | 18.4   | 2.8    | 4.3     | 2.1

As we can see, DINT models that use bagging to sample more training instances achieve the best results in four domains. Remarkably, these are also the domains with higher class imbalance and more variety among data sources in terms of the number of rows and columns; data sources in the city domain all have the same number of attributes. We have also discovered that bagging needs to be performed both at the training and prediction stages to achieve the best performance. This setting makes a noticeable difference in domains where the number of rows varies substantially among data sources. For example, in the museum domain the number of rows ranges from 6 to 85235, and in the soccer domain the range is from 500 to 9443.

In terms of computation time, the best performing model for the museum domain, DINT all, requires a lot of time for training. The most computationally expensive features are the four different edit distances: minimum edit distance, JCN, LIN and k-nearest neighbors. This suggests that the DINT model with all possible features does not scale well with an increasing number of attributes in the training set. Considering the similarity metrics used in other approaches like DSL and T2K [17], computing TF-IDF and Jaccard scores may help resolve this runtime issue for DINT all.

For class imbalance, although the ResampleToMean strategy improves the performance of DINT models without sampling in the domains with the highest class imbalance (i.e., museum and soccer), it leads to decreased performance in the domains with less prominent imbalance (i.e., weapons and weather). This leads us to the idea that the class resampling strategy needs to be improved. One potential improvement may be to combine the bagging and resampling strategies. Instead of fixing numBags for all attributes, the parameter could be set to either the mean or the maximum of instance counts per class. In this way we can perform resampling without producing exact replicas of the attributes.
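As a speculative illustration of this idea (not something evaluated in this paper), numBags could be chosen per attribute from the class counts, for example:

```python
from collections import Counter

def bags_per_attribute(labels, base_bags=100):
    """Assign a per-attribute numBags so that every semantic label receives
    roughly the same total number of bags: attributes of under-represented
    classes get more bags, attributes of over-represented classes get fewer."""
    counts = Counter(labels)
    target_per_class = int(sum(counts.values()) / len(counts)) * base_bags  # mean count * base
    return [max(1, target_per_class // counts[label]) for label in labels]
```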
Apart from the city and weapons domains, our newly designed models have a performance similar to DSL. However, the computational complexity of these models varies. For the museum domain DINT base+ has a higher MRR than DSL, yet requires about half the training time. It appears that attributes which contain a mixture of textual and numeric values are a bottleneck for DSL, since data sources in the city and weapons domains have multiple such mixed columns.

In cases where there are few labeled instances (the repeated holdout strategy, Table 6), we observe that DSL performs well, especially DSL+, which leverages labeled instances from other domains. We should be aware that in this scenario there are many unseen labels, which makes MRR ill-defined. Comparing the DINT models in this scenario suggests that bagging is advantageous in situations where there are few labeled attributes. Overall, enhancing our DINT model, which uses simple features and bagging, with the knowledge transfer capability of DSL+ might result in a more stable semantic labeling system. Another enhancement may be to introduce resampling strategies into the DSL system.

Table 6: MRR scores for repeated holdout strategy when unknown attributes are not considered.

Sampling       | Model      | city  | museum | soccer | weapons | weather
None           | DSL        | 0.719 | 0.889  | 0.614  | 0.611   | 0.805
None           | DSL+       | 0.782 | 0.927  | 0.813  | 0.872   | 1
None           | DINT all   | 0.949 | 0.798  | 0.553  | 0.688   | 0.583
None           | DINT base  | 0.888 | 0.763  | 0.516  | 0.684   | 0.621
None           | DINT base+ | 0.888 | 0.778  | 0.542  | 0.686   | 0.621
None           | MLP        | 0.797 | 0.77   | 0.663  | 0.695   | 0.887
None           | CNN        | 0.723 | 0.774  | 0.606  | 0.664   | 0.882
Bagging        | DINT all   | 0.945 | 0.791  | 0.656  | 0.682   | 0.854
Bagging        | DINT base  | 0.919 | 0.788  | 0.634  | 0.701   | 0.867
Bagging        | DINT base+ | 0.919 | 0.790  | 0.628  | 0.688   | 0.852
ResampleToMean | DINT all   | 0.949 | 0.789  | 0.455  | 0.588   | 0.557
ResampleToMean | DINT base  | 0.89  | 0.749  | 0.451  | 0.578   | 0.611
ResampleToMean | DINT base+ | 0.89  | 0.758  | 0.445  | 0.564   | 0.611

In addition, we perform experiments for the two domains museum and soccer, where unmapped attributes cause skewed class distributions. Here we want to establish how well the different approaches can recognize such attributes. In Tables 7 and 8 we can see that the performance of the semantic labeling systems changes considerably. The performance of both DSL and DSL+ is affected by their inability to differentiate "unwanted" attributes.

Table 7: Performance for leave one out strategy when the unknown class is considered.

Sampling       | Model      | MRR museum | MRR soccer | Train time (s) museum | Train time (s) soccer
None           | DSL        | 0.56       | 0.618      | 156.6                 | 36.3
None           | DINT all   | 0.866      | 0.827      | 100.6                 | 6.8
None           | DINT base  | 0.838      | 0.809      | 28.4                  | 5.9
None           | DINT base+ | 0.849      | 0.824      | 33.4                  | 6.2
None           | MLP        | 0.802      | 0.784      | 417.2                 | 37.6
None           | CNN        | 0.831      | 0.785      | 394.5                 | 39.6
Bagging        | DINT all   | 0.854      | 0.795      | 395.2                 | 64.5
Bagging        | DINT base  | 0.839      | 0.863      | 112.5                 | 26.7
Bagging        | DINT base+ | 0.867      | 0.793      | 114.4                 | 30.6
ResampleToMean | DINT all   | 0.776      | 0.730      | 100.5                 | 6.8
ResampleToMean | DINT base  | 0.721      | 0.69       | 26.2                  | 4.2
ResampleToMean | DINT base+ | 0.759      | 0.753      | 26.7                  | 5.2

Table 8: Performance for repeated holdout strategy when the unknown class is considered.

Sampling       | Model      | MRR museum | MRR soccer | Train time (s) museum | Train time (s) soccer
None           | DSL        | 0.544      | 0.355      | 15.5                  | 4.2
None           | DSL+       | 0.303      | 0.43       | 215.9                 | 241.3
None           | DINT all   | 0.769      | 0.549      | 16.2                  | 2
None           | DINT base  | 0.743      | 0.608      | 10.2                  | 2
None           | DINT base+ | 0.742      | 0.613      | 10.1                  | 2
None           | MLP        | 0.675      | 0.572      | 94.4                  | 9.4
None           | CNN        | 0.683      | 0.534      | 87.7                  | 10
Bagging        | DINT all   | 0.827      | 0.551      | 101.9                 | 13.6
Bagging        | DINT base  | 0.76       | 0.593      | 55.6                  | 13.5
Bagging        | DINT base+ | 0.721      | 0.59       | 53.1                  | 13.6
ResampleToMean | DINT all   | 0.637      | 0.428      | 35.5                  | 2.6
ResampleToMean | DINT base  | 0.607      | 0.475      | 12.2                  | 2
ResampleToMean | DINT base+ | 0.633      | 0.479      | 11.5                  | 2

When performing bagging on attributes in the training data, we introduce diversity by drawing many samples of attribute values. However, we do not apply any perturbation technique to the names of the attributes and instead use their exact replicas. In Table 8 we observe that DINT base performs better than DINT base+ when bagging is used. In datasets with scarce labeled instances our DINT models tend to overfit the attribute names that are present in the training data. This suggests that introducing a technique similar to bagging for column headers might lead to much better performance. On the other hand, our results are consistent with the observations in the work of Ritze et al. [17]: their results indicate that comparing attribute values is crucial for this task, while attribute names might introduce additional noise.

Clearly, the performance of our approach DINT varies depending on the chosen bagging parameters numBags and bagSize. To explore this dependence, we evaluate the performance of DINT with only the chardist and entropy features, varying one of the bagging parameters while fixing the other. We report the results of this evaluation in Figure 3. Here we do not consider unknown attributes, and we choose the repeated holdout strategy to analyze the behavior of bagging when there is a shortage of training data. Interestingly, increasing the values of the bagging parameters does not always lead to improved performance, though the computational time required for both the training and prediction stages increases. The city domain is the most sensitive to the bagging parameters. We assume this is because the city domain is the only domain with an equal distribution of semantic labels and equal numbers of columns and rows across data sources. It appears that in the other domains bagging makes the models more robust towards variance in these characteristics.

[Figure 3: Dependence of MRR scores for DINT base on the bagging parameters, using the repeated holdout strategy; unknown attributes are not considered. Panel (a) fixes bagSize=100 and varies numBags; panel (b) fixes numBags=50 and varies bagSize. MRR curves are shown for the weather, weapons, soccer, museum and city domains.]
5. RELATED WORK

The problem of semantic labeling, as addressed in this work, can be regarded as the problem of schema matching in the field of data integration [2]. In the schema matching problem we match elements between the source and target schemata. In our case the elements of the source schema are attributes, and we want to map these attributes to properties in the ontology. The semantic labeling problem is also known in the literature as attribute-to-property matching [18, 17]. Indicating semantic correspondences manually might be appropriate if only a few data sources need to be integrated; however, it becomes tedious with a growing number of heterogeneous schemata. Hence, automatic or semi-automatic approaches for schema matching are being actively developed.

From a machine learning perspective, we can categorize these approaches into unsupervised techniques, which compute various similarity metrics, and supervised techniques, which build a multi-class classification model. Unsupervised approaches are used in SemanticTyper [16], T2K [18] and its extended version [17]. In all these approaches the authors design similarity metrics for attribute names and attribute values, yet one substantial difference is whether additional knowledge is used in the computation. For example, the authors in [18] and [17] leverage contextual information from DBpedia.

Among supervised approaches, probabilistic graphical models are used in the work of Limaye et al. [9] to annotate web tables with entities for cell values, types for attributes and relationships for binary combinations of attributes. Mulwad et al. [12] extend this approach by leveraging information from the Wikitology Knowledge Base (KB). The problem with probabilistic graphical models, though, is that they do not scale with the number of semantic labels in the domain. Also, Mulwad et al., as well as Venetis et al. [23], who used the isA database KB, extract additional data from knowledge bases to assign a semantic label to an attribute. Hence, these approaches are limited to domains well represented in those knowledge bases. Our approach, on the other hand, is not domain specific and allows a model to be trained on any data. However, we cannot apply a model learnt on one domain to another, which is possible with the DSL approach [14].

To the best of our knowledge, DSL, introduced by Pham et al. [14], is among the top semantic labeling systems. Pham et al. compare DSL to their previous approach SemanticTyper [16] and to the T2K system [18], and achieve higher MRR scores on a variety of datasets. Therefore, we use DSL as the state-of-the-art model in our benchmark to evaluate our new approaches.

Ritze et al. [17] and Pham et al. [14] mention the problem of the unknown class. In the first work the authors discuss "unwanted" attributes, while in the second the authors reflect on how to handle "unseen" attributes. In our work we do not differentiate between these two cases and show that we can successfully identify such attributes when sufficient training data is available.
6. CONCLUSION

In this paper we have studied the problem of supervised semantic labeling and have conducted experiments to evaluate how different approaches perform at this task. Our main finding is that our bagging sampling technique can provide meaningful diversity in the training data and thereby improve performance. Additionally, this technique can overcome the lack of labeled attributes in a domain and can increase the number of instances for under-represented semantic labels. We find that given scarce training data, bagging leads to a noticeable improvement in performance, though the state-of-the-art system DSL [14] achieves better precision by leveraging information about labeled instances from other domains. However, if we are to consider unwanted attributes and unseen semantic labels, our new system DINT demonstrates the best performance. Among the semantic labeling systems in our benchmark we have observed that the performance results are highly dependent on the use case.

We have also shown that deep learning models, such as CNN and MLP, can be applied to solve this problem. Though these models do not excel in performance in the majority of cases, their advantage is the simplicity of the features extracted from attributes. For example, the CNN is built on raw sequences of attribute values. Surprisingly, we have discovered that even random forests constructed just on the character distributions of values and the entropy of attributes provide remarkable results in many cases. This supports the observations in the literature that attribute values are crucial for the semantic labeling task [18, 17].

Future work may involve exploring a combination of bagging and class imbalance resampling strategies. We have observed that where the domain data has high imbalance among the representatives of different semantic labels, resampling can lead to improved performance, but a more sophisticated approach is required in domains which do not exhibit these characteristics. Another possible direction for improvement is to introduce an equivalent of bagging for attribute names. In addition, our experiments indicate that the performance of the systems is often affected by the variance in the sizes of data sources and by how well each semantic label is represented in the training data. To this end, we consider including T2KMatch [17] in our benchmark, as well as domain sets from the RODI benchmark [15].
7. REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, et al. TensorFlow: A system for large-scale machine learning. In Proc. of OSDI, pages 265–283, 2016.
[2] Z. Bellahsene, A. Bonifati, and E. Rahm, editors. Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, 2011.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[5] N. Craswell. Mean reciprocal rank. In Encyclopedia of Database Systems, pages 1703–1703. 2009.
[6] A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
[7] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008, 1997.
[8] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[9] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proc. of the VLDB Endowment, 3(1-2):1338–1347, 2010.
[10] D. Lin et al. An information-theoretic definition of similarity. In Proc. of ICML, volume 98, pages 296–304, 1998.
[11] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. 2008.
[12] V. Mulwad, T. Finin, and A. Joshi. Semantic message passing for generating linked data from tables. In Proc. of ISWC, pages 363–378, 2013.
[13] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[14] M. Pham, S. Alse, C. A. Knoblock, and P. Szekely. Semantic labeling: a domain-independent approach. In Proc. of ISWC, pages 446–462. Springer, 2016.
[15] C. Pinkel, C. Binnig, E. Jiménez-Ruiz, E. Kharlamov, W. May, et al. RODI: Benchmarking relational-to-ontology mapping generation quality. Semantic Web, (Preprint):1–28, 2016.
[16] S. Ramnandan, A. Mittal, C. A. Knoblock, and P. Szekely. Assigning semantic labels to data sources. In Proc. of ESWC, pages 403–417, 2015.
[17] D. Ritze and C. Bizer. Matching web tables to DBpedia - a feature utility study. In Proc. of EDBT, pages 210–221, 2017.
[18] D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML tables to DBpedia. In Proc. of WIMS, page 10, 2015.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[20] D. Spanos, P. Stavrou, and N. Mitrou. Bringing relational databases into the semantic web: A survey. Semantic Web, 3(2):169–209, 2012.
[21] M. Taheriyan, C. A. Knoblock, P. Szekely, and J. L. Ambite. Leveraging linked data to discover semantic relations within data sources. In Proc. of ISWC, pages 549–565. Springer, 2016.
[22] M. Taheriyan, C. A. Knoblock, P. A. Szekely, and J. L. Ambite. Learning the semantics of structured data sources. J. Web Sem., 37:152–169, 2016.
[23] P. Venetis, A. Halevy, J. Madhavan, M. Paşca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. Proc. of the VLDB Endowment, 4(9):528–538, 2011.