T2LD: Interpreting and Representing Tables as Linked Data*

Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi

University of Maryland, Baltimore County, Baltimore MD USA 21250
{varish1,finin,joshi}@cs.umbc.edu, zareensyed@gmail.com

Abstract. We describe a framework and prototype system for interpreting tables, extracting entities and relations from them, and producing a linked data representation of the table’s contents. This can be used to annotate the table or to add new facts to the linked data collection.

Keywords: linked data, human language technology, entity linking

1 Introduction

Vast amounts of information are available in structured forms like spreadsheets, database relations, and tables in documents found on the Web. We describe a framework for interpreting such tables and extracting entities and relations from them. The results can be used to annotate the table or to contribute its contents to linked open data (LOD) collections. The process assigns column headers to classes from an appropriate ontology, links table cells (as appropriate) to entities in a LOD collection, and identifies relations between columns and links them to ontology properties. The resulting table interpretation can be used to confirm existing facts in a LOD collection and to propose new facts to be added.

Using the table headers and the data in its rows and columns, we query existing knowledge bases (KBs) including Wikitology [1] and DBpedia to select the best class label for each column, which is then used to identify potential entity links for table cells. A classifier selects the best entity from the list and a second classifier decides whether the evidence is strong enough to link the cell value to it. Relations in the table are discovered using the generated column classes, the linked entities, and information from the KBs. Our implemented prototype was evaluated using a collection of tables from Google Squared, Wikipedia and tables found on the Web.

Cafarella et al.
[2] estimated that the Web contains around 14.1 billion HTML tables, over 154 million of which contain high quality relational data. This represents a huge source of knowledge currently unavailable on the Semantic Web. There is a need for systems that can automatically generate LOD from existing sources, be they unstructured (e.g., free text), semi-structured (e.g., text embedded in forms or Wikis) or structured (e.g., data in spreadsheets and databases). Interpreting tables is a problem of interest in many areas such as databases, web systems and the Semantic Web. See [3, 4] for a comprehensive study of related work on interpreting tables and converting databases and spreadsheets into RDF.

* Research supported in part by a gift from Microsoft Research, a Fulbright fellowship, NSF award IIS-0326460 and the Human Language Technology Center of Excellence.

Existing systems for extracting knowledge from tables [5] require human intervention and focus neither on a complete interpretation of the table nor on integrating the table with the linked open data cloud. This poster paper focuses on an automatic framework for generating linked RDF which can be integrated into the LOD cloud. The eventual goal of this work is to enrich the LOD cloud by learning new facts and knowledge from tables and publishing them on the Semantic Web.

To develop an overall interpretation of a table, we assign every column header a class label from an appropriate ontology, e.g., the column with header City is assigned the class label dbpedia-owl:City from the DBpedia ontology. Cell values are linked to entities: for the table in Figure 1, we link “Baltimore” to dbpedia:Baltimore. Numbers can be mapped as values of properties which can be associated with entities in the table. We also identify the relations implicit between columns, e.g., that dbpedia-owl:largestCity seems to hold between the entities denoted by cell values in the first two columns (i.e., city and state). Finally, this information is represented in an N3 serialization of RDF.

    city          state  mayor        population
    Baltimore     MD     S.Dixon      640000
    Philadelphia  PA     M.Nutter     1500000
    Washington    DC     A.Fenty      595000
    New York      NY     M.Bloomberg  8400000
    Boston        MA     T.Menino     610000

Fig. 1: In simple tables, column headers suggest the type of data stored in columns and cell values denote instances of that type.

2 T2LD Framework

Given a table as input, the T2LD framework [6] begins by assigning a class label to every column in the table. For all the cell values in every column of the table, the algorithm for assigning class labels (see Algorithm 1 in [3]) submits a complex query to the Wikitology knowledge base to determine the type of each cell value in the column. Each class label from the set of possible class labels obtained from the query results is scored. The class label with the highest score is chosen as the class label to be associated with the column. We predict class labels from four vocabularies - DBpedia Ontology, Freebase, WordNet, and Yago.

    MAP          columns      Recall       columns
    m = 1        11.53%       r = 1        46.15%
    0 < m < 1    69.23%       0 < r < 1    34.61%
    m = 0        19.24%       r = 0        19.24%

Fig. 2: The percentage of columns with various MAP and recall scores.

Using the class labels as additional evidence, the algorithm for linking table cells to entities (see Algorithm 2 in [3] for details) re-queries the KB for every table cell. For every table cell, the KB returns the top N possible entities. For each of the top N entities, the algorithm generates a feature vector consisting of the entity’s KB score, the entity’s Wikipedia page length, the entity’s page rank, the Levenshtein distance between the entity and the string in the query, and the Dice score between the entity and the string. The set of feature vectors for each table cell is ranked using an SVM-Rank classifier.
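As an illustration of the feature vector described above, the following Python sketch assembles the five features for one candidate entity. It is a minimal sketch under stated assumptions, not the implementation used in the prototype: the Candidate record and its field names are hypothetical, the KB score, page length and page rank are assumed to be supplied by the KB, and the Dice score is assumed to be computed over character bigrams, a detail the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one of the top N entities returned by the KB."""
    name: str           # entity label as returned by the KB
    kb_score: float     # relevance score assigned by the KB query
    page_length: int    # length of the entity's Wikipedia page
    page_rank: float    # page rank of the entity's Wikipedia page


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(s: str) -> set:
    """Set of lower-cased character bigrams (an assumed tokenization)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to form bigrams: fall back to exact match.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def feature_vector(cell: str, cand: Candidate) -> list:
    """Five features for one (table cell string, candidate entity) pair."""
    return [
        float(cand.kb_score),
        float(cand.page_length),
        float(cand.page_rank),
        float(levenshtein(cell, cand.name)),
        dice(cell, cand.name),
    ]


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the evaluation.
    candidate = Candidate("Baltimore", 0.9, 12000, 0.004)
    print(feature_vector("Baltimore", candidate))
```

One such vector would be built per candidate and the resulting set passed to a ranker such as SVM-Rank; the ranking step itself is not shown here.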
To the highest ranked feature vector from SVM-Rank, two more features are added - the SVM-Rank score of the feature vector and the difference in SVM-Rank scores between the top two feature vectors. Based on this new feature vector, a second SVM classifier decides whether or not to link the table cell to the top ranked entity. If the evidence is not strong enough, it is likely that the table cell is a new entity not present in the KB; this step is useful for discovering new entities in a given table. If the evidence is strong enough, the table cell is linked to the top ranked entity returned by SVM-Rank.

We also present a preliminary approach for identifying relations between table columns (see Algorithm 3 in [3]). The algorithm generates a set of candidate relations from the relations that exist between the strings in each row of the two columns. Each candidate relation is scored and the relation with the highest score is selected to represent the relation between the two columns.

We have also developed a preliminary template in N3 (see Figure 3), which is a compact and human readable serialization of RDF, for representing tables as LOD.

    @prefix rdfs: .
    @prefix dbpedia: .
    @prefix dbpedia-owl: .
    @prefix dbpprop: .
    "City"@en is rdfs:label of dbpedia-owl:City .
    "State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
    "Baltimore"@en is rdfs:label of dbpedia:Baltimore .
    dbpedia:Baltimore a dbpedia-owl:City .
    "MD"@en is rdfs:label of dbpedia:Maryland .
    dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
    dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
    dbpprop:LargestCity rdfs:range dbpedia-owl:City .

Fig. 3: An example of the N3 representation of a table as linked data

3 Evaluation and Conclusion

Our implemented prototype was evaluated against 15 tables obtained from Google Squared, Wikipedia and a collection of tables extracted from the Web.
Excluding the columns with numbers, the 15 tables have 52 columns and 611 entities for evaluation of our algorithms. We used a subset of 23 columns for evaluation of relation identification between columns.

In the first evaluation of the algorithm for assigning class labels to columns, we compared the ranked list of possible class labels generated by the system against the list of possible class labels ranked by the evaluators. As shown in Figure 2, for 80.76% of the columns the Mean Average Precision (MAP) between the system’s list and the evaluators’ list is greater than 0, which indicates that there was at least one relevant label in the top three of the system ranked list. Also seen in Figure 2, for 75% of the columns, the recall of the algorithm was greater than or equal to 0.6. We also assessed whether our predicted class labels were reasonable based on the judgment of human subjects (see [3]). 76.92% of the predicted class labels were considered correct by the evaluators. The accuracy in each of the four categories is shown in Figure 4.

66.12% of the table cell strings were correctly linked by our algorithm for linking table cells. The breakdown of accuracy by category is shown in Figure 4. Our dataset had 24 new entities and our algorithm correctly predicted all 24 as new entities not present in the KB. We did not get encouraging results for relationship identification, which had an accuracy of 25% (see [3] for details).

Fig. 4: Category wise accuracy for “column correctness” is shown in (a) and for entity linking in (b)

Our existing system performs reasonably well in selecting appropriate types for columns and linking cell values to LOD entities. We have preliminary results for identifying and encoding the relationships implicit in the columns as well.
Our current work is focused on improving relationship discovery and generating new facts and knowledge from tables that contain entities not present in the LOD knowledge bases.

References

1. Finin, T., Syed, Z.: Creating and Exploiting a Web of Semantic Data. In: Proc. 2nd Int. Conf. on Agents and Artificial Intelligence, Springer (2010)
2. Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: Webtables: exploring the power of tables on the web. PVLDB 1 (2008) 538–549
3. Mulwad, V.: T2LD - An automatic framework for extracting, interpreting and representing tables as Linked Data. Master’s thesis, U. of Maryland, Baltimore County (2010)
4. Sahoo, S.S., Halb, W., Hellmann, S., Idehen, K., Thibodeau Jr, T., Auer, S., Sequeda, J., Ezzat, A.: A survey of current approaches for mapping of relational databases to RDF. Technical report, W3C (2009)
5. Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference, Springer (2008)
6. Syed, Z., Finin, T., Mulwad, V., Joshi, A.: Exploiting a Web of Semantic Data for Interpreting Tables. In: Proceedings of the Second Web Science Conference. (2010)
7. Mulwad, V., Finin, T., Syed, Z., Joshi, A.: Using linked data to interpret tables. In: Proc. First Int. Workshop on Consuming Linked Data. (2010)