=Paper=
{{Paper
|id=Vol-2775/paper5
|storemode=property
|title=Kepler-aSI: Kepler as a Semantic Interpreter
|pdfUrl=https://ceur-ws.org/Vol-2775/paper5.pdf
|volume=Vol-2775
|authors=Wiem Baazouzi,Marouen Kachroudi,Sami Faiz
|dblpUrl=https://dblp.org/rec/conf/semweb/BaazouziKF20
}}
==Kepler-aSI: Kepler as a Semantic Interpreter==
Wiem Baazouzi¹, Marouen Kachroudi², and Sami Faiz³

¹ Laboratoire de Recherche en génie logIciel, Application distribuées, systèmes Décisionnels et Imagerie intelligente, National School of Computer Science, Tunis. wiem.baazouzi@ensi-uma.tn
² Université de Tunis El Manar, Faculté des Sciences de Tunis, Informatique Programmation Algorithmique et Heuristique, LR11ES14, 2092, Tunis, Tunisie. marouen.kachroudi@fst.rnu.tn
³ Institut Superieur des Arts Multimedias de Manouba, UR Téledétection et Systèmes d'Informations à Référence Spatiale. sami.faiz@insat.rnu.tn

Abstract. This paper presents our system Kepler-aSI for the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020). Kepler-aSI analyses tabular data and detects the correct matches in Wikidata, where data and values are annotated with a unique tag. This task is difficult for machines, which must identify the right meaning of a given annotation. Kepler-aSI uses SPARQL queries to semantically annotate tables against Knowledge Graphs (KG), in order to solve one of the critical matching tasks, namely column-type annotation (CTA).

Keywords: Tabular Data - Knowledge Graph - Kepler-aSI - SPARQL - Semantic Web Challenge

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The World Wide Web contains vast quantities of textual information in several forms: unstructured text, template-based semi-structured Web pages (which present data in key-value pairs and lists), and, of course, tables. Methods that aim to extract information from these resources and convert it into a structured form have been the subject of several works. It is clear that a lack of understanding of the semantic structure can hamper the data analysis process. Gaining this semantic understanding is therefore very useful for data integration, data cleaning, data mining, machine learning, and knowledge discovery tasks. For example, understanding the data can help assess the appropriate types of transformation to apply to it.

Tabular data is routinely transferred on the Web in a variety of formats, and most of these data sets are available in tabular formats (e.g., CSV, Excel). The main reason for the popularity of these formats is simplicity: many common office tools (e.g., Excel) are available to facilitate their generation and exploitation. Tables on the Web are thus a source of highly valuable data. The addition of semantic information to Web tables may enhance a wide range of applications, such as Web search, question answering, and Knowledge Base (KB) building. When extracting tabular data from the Web, researchers face largely different problems about the data, such as learning with limited labeled data, defining (or avoiding defining) ontologies, making use of prior knowledge, and scaling solutions to the size of the Web. The task is often difficult in practice because metadata (e.g., table and column names) is missing, incomplete or ambiguous. In recent years, we have identified several works that can mainly be classified as supervised (annotated tables are used to carry out the learning task) [1–5] or unsupervised (the table data is not dedicated to learning) [6, 5].
To solve these problems, we propose a global approach named Kepler-aSI, which addresses the challenge of matching tabular data to knowledge graphs. This method builds on our previous work on ontology alignment [7–9]. In this challenge, the input is a CSV file, and three different tasks had to be met:

1. CTA: a class of the Wikidata ontology had to be assigned to a column (Column-Type Annotation).
2. CEA: a Wikidata entity had to be matched to the individual cells (Cell-Entity Annotation).
3. CPA: a KG property had to be assigned to the relationship between two columns (Columns Property Annotation).

Data annotation is a fundamental process in tabular data analysis: it allows the meaning of the remaining information to be inferred, and from there the meaning of the table as a whole with respect to a knowledge graph. The data we used was based on Wikidata. We would like to mention, in a more general context, that Wikidata is made up of several types of documents, which obey the triple format: a subject (S), a predicate (P) and an object (O). Cell-Entity Annotation (CEA) matches a cell to a KG entity; at this level, we have to annotate each individual element playing the role of subject (S) or object (O). Column Property Annotation (CPA) assigns a KG property to the relationship between two columns; the task is to find out by which Wikidata property the two columns are connected. Column-Type Annotation (CTA) assigns a semantic type to a column, that is, the Wikidata class that the topics of its values have in common.

2 Knowledge Graph & Tabular Data

2.1 Tabular Data

S is a two-dimensional tabular structure made up of an ordered set of N rows and M columns (Fig. 1). n_i is a row of the table (i = 1 ... N) and m_j is a column of the table (j = 1 ... M). The intersection between a row n_i and a column m_j is c_{i,j}, which is the value of the cell S_{i,j}. The table contents can have different types (string, date, float, number, etc.).

- Target Table (S): M × N.
- Subject Cell: S(i,0) (i = 1, 2 ... N).
- Object Cell: S(i,j) (i = 1, 2 ... N), (j = 1, 2 ... M).

Fig. 1. Target Table (a grid of cells S_{i,j}, with the subject cells in column 0)
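To make this notation concrete, here is a minimal sketch, in Python, of the table model of Section 2.1; the class and method names are our own illustration and are not taken from the Kepler-aSI code.

    import csv

    class Table:
        """Target table S: N rows by M columns; column 0 holds the subject cells."""

        def __init__(self, rows):
            self.rows = rows  # rows[i][j] is the cell value S_{i,j}

        @classmethod
        def from_csv(cls, path):
            # The challenge input is a plain CSV file.
            with open(path, newline="", encoding="utf-8") as f:
                return cls([row for row in csv.reader(f)])

        def subject_cell(self, i):
            # S(i, 0): the subject cell of row i
            return self.rows[i][0]

        def object_cells(self, i):
            # S(i, j) for j >= 1: the object cells related to the subject of row i
            return self.rows[i][1:]

        def column(self, j):
            # m_j: the ordered values of column j
            return [row[j] for row in self.rows]

For instance, Table.from_csv("table.csv").column(0) would return the candidate subject cells of a challenge table.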
2.2 Knowledge Graph

Knowledge Graphs have been in the focus of research since 2012, resulting in a wide variety of published descriptions and definitions, yet without a common core, a fact that was also indicated by Paulheim [10] in 2015. In his survey of Knowledge Graph refinement, Paulheim listed the minimum set of characteristics that must be present to distinguish knowledge graphs from other knowledge collections, which basically restricts the term to any graph-based knowledge representation. In the online review [10], the authors agreed that a more precise definition was hard to find at that point. This statement points out the demand for closer investigation and deeper reflection in this area. Färber et al. defined a Knowledge Graph as a Resource Description Framework (RDF) graph and stated that the term KG was coined by Google to describe any graph-based knowledge base (KB) [11]. Although this definition is the only formal one, it contradicts more general definitions, as it explicitly requires the RDF data model.

3 System Description

The proposed system solves only the CTA task. Our system consists of three phases, as flagged by Figure 2. Although a number of methods are available and several ideas have been tried, the most effective with respect to our objective was the one used in the MantisTable approach [5, 12]. We have indeed adopted the following phases:

Phase 1, Data preparation: this phase prepares the data inside the table.
Phase 2, Column analysis: in this phase, we determine the semantic classification of the columns, that is, whether a column is a Named-Entity column (NE-column), a Literal column (L-column) or the Subject column (S-column).
Phase 3, Column-Type Annotation (CTA): this phase deals with the mapping between columns and semantic elements in a knowledge graph (KG) by using a SPARQL query.

The next sections give detailed information about each phase.

Fig. 2. Overview of our approach

3.1 Data Preparation

Data preparation aims to clean and standardize the data within the table; a sketch of this phase and the next one is given at the end of Section 3.2. The transformations applied to the tables are as follows:

1. The deletion of certain characters: for each cell value, we retain only the part that comes before a '(' or '[' and remove certain special characters.
2. The transformation of the text into lowercase, the deletion of text in brackets, the resolution of acronyms and abbreviations, and the normalisation of units of measurement.
3. The normalization of the units of measurement, performed by applying regular expression treatments, as described in [13].

The use of regular expressions allows us to cover a complete set of units, which includes area, currency, density, electric current, energy, flow, strength, frequency, energy efficiency, unit of information, length, linear mass density, mass, numbers, population density, power, pressure, speed, temperature, time, torque, voltage and volume.

3.2 Column Analysis

In this task we classify the columns into several types, namely Named-Entity columns (NE-columns) and Literal columns (L-columns), and then detect the Subject column (S-column). To accomplish this task, we consider 16 regular expressions that identify multiple Regextypes (for example, numbers, geographic coordinates, addresses, hexadecimal color codes, URLs). We then set a threshold (equal to 0.7): if the most frequent Regextype in a column (for example, an address) exceeds this threshold, the column is annotated as an L-column; otherwise, it is annotated as an NE-column. After this detection, we identify the Subject column (S-column), the main column of the table, according to different statistical characteristics, namely [5]:

- aw: the average number of words in each cell.
- emc: the fraction of empty cells in the column.
- uc: the fraction of cells with unique content.
- df: the distance from the first NE-column.

These characteristics are combined to calculate the subject-column score subcol(c_j) of each NE-column c_j as follows [5]:

subcol(c_j) = \frac{2\,uc_{norm}(c_j) + aw_{norm}(c_j) - emc_{norm}(c_j)}{\sqrt{df(c_j) + 1}}   (1)
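The sketch announced in Section 3.1 follows: a minimal Python rendering of the cleaning rules, of the Regextype vote with its 0.7 threshold, and of Equation (1). The specific regular expressions, the max-based normalisation and the helper names are our own illustrative assumptions; the paper does not spell out its 16 Regextypes or its exact normalisation.

    import math
    import re

    # Phase 1 (Section 3.1): keep only the part before '(' or '[',
    # lowercase the text and collapse the whitespace.
    def clean_cell(value):
        value = re.split(r"[([]", value)[0]
        return " ".join(value.lower().split())

    # Phase 2 (Section 3.2): a few illustrative Regextypes; the paper uses 16.
    REGEXTYPES = {
        "number": re.compile(r"^-?\d+([.,]\d+)?$"),
        "geo_coordinates": re.compile(r"^-?\d+\.\d+,\s*-?\d+\.\d+$"),
        "hex_color": re.compile(r"^#?[0-9a-fA-F]{6}$"),
        "url": re.compile(r"^https?://\S+$"),
    }
    THRESHOLD = 0.7  # as set in the paper

    def classify_column(cells):
        """'L-column' if one Regextype covers more than 70% of the
        non-empty cells, 'NE-column' otherwise."""
        cells = [c for c in cells if c]
        if not cells:
            return "NE-column"
        best = max(sum(bool(rx.match(c)) for c in cells)
                   for rx in REGEXTYPES.values()) / len(cells)
        return "L-column" if best > THRESHOLD else "NE-column"

    # Equation (1): subject-column score of each NE-column.
    def subcol_scores(ne_columns):
        """ne_columns: list of (cells, df) pairs, where df is the distance
        from the first NE-column. Returns one score per column."""
        aw = [sum(len(c.split()) for c in cells) / max(len(cells), 1)
              for cells, _ in ne_columns]
        emc = [sum(1 for c in cells if not c) / max(len(cells), 1)
               for cells, _ in ne_columns]
        uc = [len(set(cells)) / max(len(cells), 1) for cells, _ in ne_columns]

        def norm(xs):  # assumed normalisation: divide by the maximum
            hi = max(xs) or 1.0
            return [x / hi for x in xs]

        aw_n, emc_n, uc_n = norm(aw), norm(emc), norm(uc)
        return [(2 * uc_n[j] + aw_n[j] - emc_n[j]) / math.sqrt(df + 1)
                for j, (_, df) in enumerate(ne_columns)]

The NE-column with the highest score is then taken as the S-column.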
Fig. 3. Column-Type Annotation (CTA) steps

3.3 Column-Type Annotation (CTA)

Concept and data type annotation deals with mappings between columns and semantic elements (concepts or data types) in a KG. Figure 3 shows the four stages of our architecture for CTA. In the first step, concept annotation, we start from an entity link found in the KG for a column of a table coming from a common CSV file; in the second step, we inject the contents of the column as the ItemDescription of our SPARQL query against Wikidata, in order to find the label of the winning class. The ItemDescription "%s" (Listing 1.1) of a Wikidata entry is a short phrase designed to disambiguate items with the same or similar labels. A description does not need to be unique: multiple items can have the same description; however, no two items can have both the same label and the same description. If multiple entities are returned for a cell, the one with the highest number of occurrences is taken. To match the columns with KG entities, all the inferred column types are taken into account using a simple SPARQL query:

Listing 1.1. SPARQL query to retrieve the set of entities eligible for the content of a column.

    SELECT ?itemLabel ?class
    WHERE {
      ?item ?itemDescription "%s"@en .
      ?item wdt:P31 ?class .
    }
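The following minimal sketch shows how such a query can be run, assuming the public Wikidata SPARQL endpoint and the SPARQLWrapper package; the function name and the vote over the returned classes are our reading of the "number of occurrences" rule above, not code from Kepler-aSI.

    from collections import Counter
    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    SELECT ?itemLabel ?class WHERE {
      ?item ?itemDescription "%s"@en .
      ?item wdt:P31 ?class .
    }
    """

    def annotate_column(cell_contents):
        """Query Wikidata once per cell and vote for the most frequent
        wdt:P31 class over the whole column (the CTA winner)."""
        endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                                 agent="Kepler-aSI-sketch/0.1")
        endpoint.setReturnFormat(JSON)
        votes = Counter()
        for content in cell_contents:
            # Naive substitution: assumes the cleaned content holds no quotes.
            endpoint.setQuery(QUERY % content)
            results = endpoint.query().convert()
            for binding in results["results"]["bindings"]:
                votes[binding["class"]["value"]] += 1
        return votes.most_common(1)[0][0] if votes else None

Note that wdt: is one of the prefixes predeclared by the Wikidata endpoint, so the query needs no PREFIX clause.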
4 Evaluation Results

In this section, the results of the Kepler-aSI approach during Rounds 1, 2, 3 and 4 of the challenge are presented. The F1-score and Precision values (columns AF1 and APrecision) are listed in Table 1; only the CTA task was addressed.

Table 1. Results of the challenge

    Round     Task   AF1     APrecision
    Round 1   CTA    -       -
              CPA    -       -
              CEA    -       -
    Round 2   CTA    0.293   0.789
              CPA    -       -
              CEA    -       -
    Round 3   CTA    0.381   0.701
              CPA    -       -
              CEA    -       -
    Round 4   CTA    0.253   0.676
              CPA    -       -
              CEA    -       -

The values obtained in the last three rounds are encouraging in relation to the volumes of data and the limits of our machines. This means that there are leads to investigate, in terms of new technologies, that could allow us to get around this kind of problem.

5 Conclusion and Future Work

To sum up, we have developed a simple approach for automatic table annotation. While several techniques are available, we have chosen a simpler one. Our main effort went into the CTA task and into how to proceed with a content-based SPARQL query, but our approach lacks a proper preprocessing and data correction phase. Several techniques were tried during preprocessing, and the most effective was spell checking; this pre-treatment could be improved to increase the precision values, a weak point we will address in future work. We also face performance problems: one run may take 12 hours because of the limitations of our machines. The execution of our system Kepler-aSI requires a large number of read and write operations to consult Wikidata, which demands substantial RAM and processor resources to improve the matching process; for a large data set, running a job locally was therefore not possible in our case. We plan to investigate this lead in the near future, to identify which resource needs to be improved first.

In this article, we presented Kepler-aSI, our contribution to the SemTab 2020 challenge. We tackled one of the posed tasks, CTA, and based our solution solely on SPARQL queries that use the cell contents as the description of a given item. Kepler-aSI is a simple approach, and we will improve its preprocessing phase for two main purposes: first, we will apply a method to correct spelling mistakes and other typos in the source data; second, we will determine the data type of each column, so that the system distinguishes more types of data: OBJECT, DATE, STRING and NUMBER. Finally, given the small size of the machines used during the different evaluation phases (Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz, 2701 MHz, 2 cores, 4 logical processors, 8 GB of RAM), we will try to develop our system by integrating new data processing techniques. Eventually, moving to a data representation using indexes would be a good track to investigate in order to master the search space. In addition, processing parallelism will allow us to circumvent the problem of the data size, which is the major gap for our current machines.

References

1. Pham, M., Alse, S., Knoblock, C.A., Szekely, P.: Semantic labeling: a domain-independent approach. In: International Semantic Web Conference, Springer (2016) 446–462
2. Taheriyan, M., Knoblock, C.A., Szekely, P., Ambite, J.L.: Learning the semantics of structured data sources. Journal of Web Semantics 37 (2016) 152–169
3. Ramnandan, S.K., Mittal, A., Knoblock, C.A., Szekely, P.: Assigning semantic labels to data sources. In: European Semantic Web Conference, Springer (2015) 403–417
4. Knoblock, C.A., Szekely, P., Ambite, J.L., Goel, A., Gupta, S., Lerman, K., Muslea, M., Taheriyan, M., Mallick, P.: Semi-automatically mapping structured sources into the semantic web. In: Extended Semantic Web Conference, Springer (2012) 375–390
5. Cremaschi, M., De Paoli, F., Rula, A., Spahiu, B.: A fully automated approach to a complete semantic table interpretation. Future Generation Computer Systems (2020)
6. Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semantic Web 8(6) (2017) 921–957
7. Kachroudi, M., Diallo, G., Ben Yahia, S.: OAEI 2017 results of KEPLER. In: Proceedings of the 12th International Workshop on Ontology Matching co-located with the 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 21, 2017. Volume 2032 of CEUR Workshop Proceedings, CEUR-WS.org (2017) 138–145
8. Kachroudi, M., Ben Yahia, S.: Dealing with direct and indirect ontology alignment. J. Data Semant. 7(4) (2018) 237–252
9. Kachroudi, M., Diallo, G., Ben Yahia, S.: KEPLER at OAEI 2018. In: Proceedings of the 13th International Workshop on Ontology Matching co-located with the 17th International Semantic Web Conference, OM@ISWC 2018, Monterey, CA, USA, October 8, 2018. Volume 2288 of CEUR Workshop Proceedings, CEUR-WS.org (2018) 173–178
10. Ehrlinger, L., Wöß, W.: Towards a definition of knowledge graphs. SEMANTiCS (Posters, Demos, SuCCESS) 48 (2016) 1–4
11. Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web 9(1) (2018) 77–129
12. Cremaschi, M., Avogadro, R., Chieregato, D.: MantisTable: an automatic approach for the semantic table interpretation. In: SemTab@ISWC (2019) 15–24
13. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (2015) 1–6