Disambiguating Web Tables using Partial Data Ziqi Zhang Department of Computer Science, University of Sheffield, UK z.zhang@dcs.shef.ac.uk Abstract. This work addresses disambiguating Web tables - annotating content cells with named entities and table columns with semantic type information. Con- trary to state-of-the-art that builds features based on the entire table content, this work uses a method that starts by annotating table columns using automatically selected partial data (i.e., a sample), then using the type information to guide content cell disambiguation. Different sample selection methods are introduced and tested to show that they contribute to higher accuracy in cell disambiguation, comparable accuracy in column type annotation with reduced computation. 1 Introduction Enabling machines to effectively and efficiently access the increasing amount of tab- ular data on the Web remains a major challenge to the Semantic Web, as the classic indexing, search and NLP techniques fail to address the underlying semantics carried by tabular structures [1, 2]. This has sparked increasing interest in research on seman- tic Table Interpretation, which deals with semantically annotating tabular data such as shown in Figure 1. This work focuses specifically on annotating table columns that contain named entity mentions with semantic type information (column classification), and linking content cells in these columns with named entities from knowledge bases (cell disambiguation). Existing work follows a typical workflow involving 1) retrieving candidates (e.g., named entities, concepts) from the knowledge base, 2) constructing features of candidates, and 3) applying inference to choose the best candidates. One key limitation is that they adopt an exhaustive strategy to build the candidate space for inference. In particular, annotating table columns depends on candidate entities from all cells in the column [1, 2]. However, for human cognition this is unnecessary. For ex- ample, one does not need to read the entire table shown in Figure 1 - which may contain over a hundred rows - to label the three columns. Being able to make such inference using partial (as opposed to the entire table) or sample data can improve the efficiency of the task as the first two phases can cost up to 99% of computation time [1]. Sample driven Table Interpretation opens up several challenges. The first is defining a sample with respect to each task. The second is determining the optimal size of the sample with respect to varying sizes of tables. The third is choosing the optimal sample entries, since a skewed sample may damage accuracy. Our previous work in [5] has proposed TableMiner to address the first two challenges. This work adapts TableMiner to explore the third challenge. A number of sample selection techniques are introduced and experiments show that they can further improve cell disambiguation accuracy and in the column type annotation task, contribute to reduction in computation with compa- rable learning accuracy. 2 2 Related Work An increasing number of work has been carried out in semantic Table Interpretation, such as Venetis et al. [3] that uses a maximum likeli- hood model, Limaye et al. [1] that uses a joint inference model, and Mulwad et al. [2] that uses joint inference with semantic message passing. These methods differ in terms of the inference models, fea- tures and background knowledge Fig. 1. Lakes in Central Greece bases used. All these methods are, as discussed earlier, ‘exhaustive’ as they require features built based on all content cells in order to annotate table columns. Zwicklbauer et al. [6] is the first method that annotates a table column using a sample of the column. However, the sample is arbitrarily chosen. 3 Methodology TableMiner is previously described in [5]. It disambiguates named entity columns in a table in two phases. The first phase creates preliminary annotations by using a sample of a column to classify the column in an iterative, incremental algorithm shown in Algo- rithm 1. In each iteration, a content cell Ti,j drawn from a column Tj is disambiguated (output Ei,j ). Then the concepts associated with the winning entity are gathered to cre- ate a set of candidate concepts for the column, Cj . Candidate concepts are scored and their score can change at each iteration due to newly disambiguated content cells adding re-enforcing evidence. At the end of each iteration, Cj from the current iteration is com- pared with the previous. If scores of candidate concepts are little changed (convergence, see [5] for a method for detection), then column classification is considered to be stable and the highest scoring candidates are (Cj+ ) chosen to annotate the column. The second phase begins by disambiguating the remaining cells (part I), this time using the type in- formation for the column to limit candidate entity space to those belonging to the type only. This may revise Cj for the column, either adding new elements, or resetting scores of existing ones and possibly causing the winning concept for the column to change. In this case, the next part of the second phase (part II) repeats the disambiguation and classification operations on the entire column, while using the new Cj+ as constraints to restrict candidate entity space. This procedure repeats until Cj+ and the winning entity in each cell stabilizes (i.e., no change). Modified TableMiner For the purpose of this study, TableMiner is modified (T Mmod ) to contain only the first phase and part I of the second phase. In other words, we do not revise the column classification results obtained from sample data. Therefore T Mmod may only use a fraction of a column’s data to classify the column, which reduces com- putation overhead compared to classic ‘exhaustive’ methods. 3 Sample selection The choice of the sam- Algorithm 1 Sample based classification ple can affect learning in T Mmod in two 1: Input: Tj ; Cj ← ∅ ways. While the size of the sample is 2: for all cell Ti,j in Tj do dealt with by the convergence measure 3: prevC ← C j j described in [5], here we address the is- 4: Ei,j ←disambiguate(Ti,j ) sue of selecting the suitable sample en- 5: Cj ←updateclass(Cj , Ei,j ) tries to ensure learning accuracy. Since 6: if convergence(Cj , prevCj ) then column classification depends on the dis- 7: break ambiguated cells in the sample, we hy- 8: end if pothesize that high accuracy of cell dis- 9: end for ambiguation contributes to high accuracy in column classification. And we further hypothesize that higher accuracy of content cell disambiguation can be achieved by 1) richer feature representation, and 2) less ambiguous names (i.e., if a name is used by only one or very few named entities). Therefore, we propose three methods to compute a score of each content cell in a col- umn, then rank them by the score before running Algorithm 1 (i.e., input Tj will contain content cells the order of which is re-arranged based on the scores). One-sense-per-discourse (ospd) First and foremost, we make the hypothesis of ‘one-sense-per-discourse’ in table context, that if an NE-column is not the subject col- umn of the table (e.g., the first column in Figure 1 is a subject column), then cells containing the same text content are extremely likely to express the same meaning1 . Thus to apply ospd we firstly re-arrange cells in a column by putting those containing duplicate text content adjacent to each other. Next, when disambiguating a content cell, the feature representation of the cell concatenates the row context of the cell, and that of any adjacent cells with the same text content (e.g., in Table 1 we assume ‘Aetolia- Acarnania’ on the three rows to have the same meaning, and build a single feature rep- resentation by concatenating all the three rows). Effectively this creates a richer feature representation for cells whose content re-occur across a table. Feature size (fs) With fs, we firstly apply ospd, then rank cells in a column by the size of their feature representation as determined by the number of tokens in a bag-of- words representation. This would allow T Mmod to start with cells that potentially have the largest - hence ‘richest’ - feature representation in Algorithm 1. Name length (nl) With nl, we count the number of words in the cell text content to be disambiguated and rank cells by this number - name length (e.g., in Table 1 ‘Aetolia- Acarnania’ has two words and will be disambiguated before ‘Boeotia’). nl merely relies on the name length of a cell content and does not apply ospd. The idea is that longer names are less likely to be ambiguous. 4 Evaluation and Conclusion We evaluate the proposed methods of sample selection using two datasets: LimayeAll and Limaye2002 . LimayeAll contains over 6000 tables and is used for evaluating con- tent cell disambiguation. Limaye200 contains a subset 200 tables from LimayeAll with 1 Due to space limitation, details are omitted but can be found in [3, 4] 2 [4], currently under transparent review. 4 Cell disambiguation (LimayeAll) Column classification (Limaye200) ospd fs nl ospd fs nl T Mmod T Mmod T Mmod T Mmod T Mmod T Mmod T Mmod T Mmod 0.809 0.812 0.812 0.813 0.723 0.719 0.721 0.723 Table 1. Cell disambiguation and column classification accuracy in F1. columns manually annotated with Freebase concepts, and used for evaluating column classification. As a baseline, T Mmod without any sample selection techniques is used. It simply chooses cells from a column in their original order in Algorithm 1. This is ospd compared against T Mmod , which applies ospd to non-subject NE-columns, preserves the original order but disambiguates groups of cells containing the same text content; fs T Mmod that applies ospd to non-subject NE-columns then prioritizes cells that po- nl tentially have richer feature representation; and T Mmod that prioritizes cells containing longer text content. Results on both datasets are shown in Table 1. It suggests that, com- pared against T Mmod , the sample selection techniques can enhance the accuracy of cell disambiguation marginally. In the column classification task however, they do not add benefits in terms of accuracy. By analyzing the computation overhead in terms of the automatically determined sample size in each table, it shows that the sample selection techniques have reducing the amount of data to be processed in column classification. As an example, T Mmod converges on average after processing 58% of cells in a table column, i.e., it manages to classify a table column using a sample size of 58% of the ospd fs total number of cells in that column. T Mmod reduces this to 53%, for T Mmod 52% nl and for T Mmod 58% (unchanged). This may contribute to noticeable reduction in CPU time since the construction of feature space (including querying knowledge bases) for each data unit consumes over 90% of computation time [1]. To summarize, it has been shown that, by using sample selection techniques, it is possible to semantically anno- tate Web tables in a more efficient way, achieving comparable or even higher learning accuracy depending on tasks. Acknowledgement: Part of this work is carried out in the LODIE project (Linked Open Data Information Extraction), funded by EPSRC (EP/J019488/1). References 1. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment 3(1-2), 1338–1347 (2010) 2. Mulwad, V., Finin, T., Joshi, A.: Semantic message passing for generating linked data from tables. In: International Semantic Web Conference (1). pp. 363–378. Springer (2013) 3. Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recov- ering semantics of tables on the web. Proc. of VLDB Endowment 4(9), 528–538 (Jun 2011) 4. Zhang, Z.: Start small, build complete: Effective and efficient semantic table interpretation using tableminer. In: The Semantic Web Journal (under reviewer, #668-1878) (2014) 5. Zhang, Z.: Towards efficient and effective semantic table interpretation. In: To appear in: ISWC2014 (2014) 6. Zwicklbauer, S., Einsiedler, C., Granitzer, M., Seifert, C.: Towards disambiguating web tables. In: International Semantic Web Conference (Posters & Demos). pp. 205–208 (2013)