Bridging DBpedia Categories and DL-Concept Definitions using Formal Concept Analysis

Mehwish Alam, LORIA (CNRS Inria Nancy Grand Est Université de Lorraine)

BP 239, Vandoeuvre-lès-Nancy F-54506 France

Aleksey Buzmakov, LORIA (CNRS Inria Nancy Grand Est Université de Lorraine)

BP 239, Vandoeuvre-lès-Nancy F-54506 France

Victor Codocedo, LORIA (CNRS Inria Nancy Grand Est Université de Lorraine)

BP 239, Vandoeuvre-lès-Nancy F-54506 France

Amedeo Napoli, LORIA (CNRS Inria Nancy Grand Est Université de Lorraine)

BP 239, Vandoeuvre-lès-Nancy F-54506 France

Keywords: Formal Concept Analysis, Linked Open Data, Data Completion

The popularization and quick growth of Linked Open Data (LOD) has led to challenging aspects regarding quality assessment and data exploration of the RDF triples that shape the LOD cloud. Particularly, we are interested in the completeness of data and its potential to provide concept definitions in terms of necessary and sufficient conditions. In this work we propose a novel technique based on Formal Concept Analysis which organizes RDF data into a concept lattice. This allows the discovery of implications, which are used to automatically detect missing information and then to complete RDF data.

Introduction

The World Wide Web has tried to overcome the barrier of data sharing by converging data publication into Linked Open Data (LOD) [3]. The LOD cloud stores data in the form of subject-predicate-object triples based on the RDF language¹, a standard formalism for describing web resources. In this context, DBpedia is the largest reservoir of linked data in the world, currently containing more than 4 million triples. All of the information stored in DBpedia is obtained by parsing Wikipedia, the largest open encyclopedia, created by the collaborative effort of thousands of people with different levels of knowledge in many diverse domains.

More specifically, DBpedia content is obtained from semi-structured sources of information in Wikipedia, namely infoboxes and categories. Infoboxes are used to standardize entries of a given type in Wikipedia. For example, the infobox for "automobile" has entries for an image depicting the car, the name of the car, the manufacturer, the engine, etc. These attributes are mapped by the DBpedia parser to a set of "properties" defined in an emerging ontology² [2] (infobox dataset) or mapped through a hand-crafted lookup table to what is called the DBpedia Ontology. Categories are another important tool in Wikipedia used to organize information. Users can freely assign a category name to an article, relating it to other articles in the same category. Examples of categories for cars are "Category:2010s automobiles", "Category:Sports cars" or "Category:Flagship vehicles". While we can see categories in Wikipedia as an emerging "folksonomy", the fact that they are curated and "edited" makes them closer to a controlled vocabulary. DBpedia exploits the Wikipedia category system to "annotate"³ objects using a taxonomy-like notation. Thus, it is possible to query DBpedia by using annotations (e.g. all cars annotated as "Sports cars"). While categorical information in DBpedia is very valuable, it is not possible to use a category as one could expect, i.e. as a definition of a class of elements that are instances of the class or, alternatively, that are "described" by the category. In this sense, such a category violates the actual spirit of the Semantic Web.

Let us explain this with an example. The Web site of DBpedia in its section of "Online access" contains some query examples using the SPARQL query language. The first query has the description "People who were born in Berlin before 1900" which actually translates into a graph-based search of entities of the type "Person", which have the property "birthPlace" pointing to the entity representing the "city of Berlin" and another property named "birthDate" with a value less than 1900. We can see here linked data working at "its purest", i.e. the form of the query provides the right-hand side of a definition for "People who were born in Berlin before 1900". Nevertheless, the fourth query named "French films" does not work in the same way. While we could expect also a graph-based search of objects of the type "Film" with maybe a property called "hasCountry" pointing to the entity representing "France", we have a much rougher approach. The actual SPARQL query asks for objects (of any type) annotated as "French films".

In general, categorization systems express "information needs" allowing human entities to quickly access data. French films are annotated as such because there is a need to find them by these keywords. However, for a machine agent this information need is better expressed through a definition, like that provided for the first query (i.e. "People who were born in Berlin before 1900"). Currently, DBpedia mixes these two paradigms of data access in an effort to profit from the structured nature of categories. Nevertheless, further steps have to be developed to ensure coherence and completeness in data.

Accordingly, in this work we describe an approach to bridge the gap between the current syntactic nature of categorical annotations and their semantic counterpart in the form of a concept definition. We achieve this by mining patterns derived from entities annotated by a given category, e.g. all entities annotated as "Lamborghini cars" are of "type automobile" and "manufactured by Lamborghini", or all entities annotated as "French films" are of "type film" and of "French nationality". We describe how these category-pattern equivalences can be described as "definitions" according to implication rules among attributes which can be mined using Formal Concept Analysis (FCA [7]). The method considers the analysis of heterogeneous complex data (not necessarily binary data) through the use of "pattern structures" [6], which is an extension of FCA able to process complex data descriptions. A concept lattice can be built from the data and then used for discovering implication rules (i.e. association rules whose confidence is 100%) which provide a basis for "subject definition" in terms of necessary and sufficient conditions. For more details, see the complete version of this paper [1]. This article is structured as follows: Section 2 gives a brief introduction to the theoretical background necessary to sustain the rest of the paper. Section 3 describes the approach used for data completion in the DBpedia knowledge base. Finally, Section 4 concludes the paper.

Preliminaries

Linked Open Data (LOD) [3] is a formalism for publishing structured data online using the Resource Description Framework (RDF). RDF stores data in the form of statements represented as $\langle subject, predicate, object \rangle$. The profile of an RDF triple $\langle s, p, o \rangle$ is given by $(U \cup B) \times (U \cup B) \times (U \cup B \cup L)$, and a set of RDF triples is an RDF graph, denoted by $\mathcal{G}$. Here, $U$ denotes a set of URI references, $B$ refers to the blank nodes and $L$ to literals. For the sake of simplicity, in the current study we do not take into account blank nodes ($B$). An RDF triple is thus represented as $U \times U \times (U \cup L)$. For convenience, in the following we denote the set of predicate names as $P$ and the set of object names as $O$. LOD can then be queried and accessed through SPARQL⁴, which is a standard query language for RDF data. SPARQL is based on matching graph patterns (present in the WHERE clause of a query) against RDF graphs. For example, let us consider the SPARQL query given in Listing 1.1, which retrieves all the entities of type Automobile manufactured by Lamborghini and annotated as "Sports_cars" and as "Lamborghini_vehicles".

    SELECT ?s WHERE {
      ?s dc:subject dbpc:Sports_cars .
      ?s dc:subject dbpc:Lamborghini_vehicles .
      ?s rdf:type dbo:Automobile .
      ?s dbo:manufacturer dbp:Lamborghini }

Listing 1.1: SPARQL for the formal context in Figure 1. Prefixes are defined in Table 1.
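To make the notion of graph pattern matching concrete, the following minimal Python sketch evaluates the WHERE clause of Listing 1.1 against a small in-memory set of triples. It is only an illustration under stated assumptions: the triples and the resource names (e.g. `dbp:Reventon`) are hypothetical and not an export of DBpedia, and the helper names `unify` and `match` are ours; a real SPARQL engine supports far more than this conjunctive matching.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# ASSUMPTION: hypothetical triples for illustration, not real DBpedia data.
TRIPLES: List[Triple] = [
    ("dbp:Reventon", "dc:subject", "dbpc:Sports_cars"),
    ("dbp:Reventon", "dc:subject", "dbpc:Lamborghini_vehicles"),
    ("dbp:Reventon", "rdf:type", "dbo:Automobile"),
    ("dbp:Reventon", "dbo:manufacturer", "dbp:Lamborghini"),
    ("dbp:Estoque", "rdf:type", "dbo:Automobile"),
    ("dbp:Estoque", "dbo:manufacturer", "dbp:Lamborghini"),
]

# The WHERE clause of Listing 1.1 written as a list of triple patterns.
QUERY: List[Triple] = [
    ("?s", "dc:subject", "dbpc:Sports_cars"),
    ("?s", "dc:subject", "dbpc:Lamborghini_vehicles"),
    ("?s", "rdf:type", "dbo:Automobile"),
    ("?s", "dbo:manufacturer", "dbp:Lamborghini"),
]


def unify(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match(query: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Return every variable binding that satisfies all triple patterns."""
    solutions: List[Binding] = [{}]
    for pattern in query:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for triple in triples:
                extended = unify(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


SELECTED = sorted({b["?s"] for b in match(QUERY, TRIPLES)})
print(SELECTED)
```

In this toy graph only the entity carrying both annotations is returned; the entity that is typed and has a manufacturer but lacks the category annotations is not, which is the kind of incompleteness discussed in the rest of the paper.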

Formal Concept Analysis (FCA) is a mathematical framework introduced in [7]; in the following we assume that the reader already has the necessary background in FCA and we only illustrate it with an example. Consider the formal context in Figure 1, where $G = U$, $M = (P \times O)$ and $(u, (p, o)) \in I \iff \langle u, p, o \rangle \in \mathcal{G}$, i.e. $\langle u, p, o \rangle$ is a triple of the RDF graph $\mathcal{G}$, built from different triples manually extracted from DBpedia about nine different Lamborghini cars (35 RDF triples in total). Given a subject-predicate-object triple, the formal context contains subjects in rows, predicate-object pairs in columns, and a cross in the cell whenever the triple formed by the subject in the row and the predicate-object pair in the column exists. Figure 1 also depicts the concept lattice in reduced notation calculated for this formal context, which contains 12 formal concepts. Consider the first five cars (subjects) in the table, for which the maximal set of attributes they share is given by the first four predicate-object pairs. They form a formal concept depicted by the gray cells in Figure 1 and labelled as "Islero, 400GT" in Figure 1 (actually, the extent of this concept is "Islero, 400GT, 350GT, Reventon"). Given a concept lattice, rules can be extracted from the intents of concepts which are comparable.

Fig. 1: The formal context shown on the left is built after scaling from DBpedia data given in Table 1. Each cross ($\times$) corresponds to a subject-predicate-object triple. On the right the corresponding concept lattice is shown.
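The construction of a formal context from triples and the two derivation operators can be sketched in Python as follows. This is a minimal sketch under explicit assumptions: the rows below are a simplified context that is merely consistent with the prose (five cars sharing the first four predicate-object pairs, two cars carrying only the category annotations, two cars carrying only type and manufacturer); it is not a reproduction of Figure 1, so the number of triples and concepts differs from the 35 triples and 12 concepts reported above. The function names are ours, and the brute-force enumeration is only suitable for tiny contexts.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # an attribute is a (predicate, object) pair

a: Pair = ("dc:subject", "dbpc:Sports_cars")
b: Pair = ("dc:subject", "dbpc:Lamborghini_vehicles")
c: Pair = ("dbo:manufacturer", "dbp:Lamborghini")
d: Pair = ("rdf:type", "dbo:Automobile")

# ASSUMPTION: simplified rows consistent with the text, not a copy of Figure 1.
ROWS: Dict[str, List[Pair]] = {
    "Reventon": [a, b, c, d],
    "Countach": [a, b, c, d],
    "350GT": [a, b, c, d],
    "400GT": [a, b, c, d],
    "Islero": [a, b, c, d],
    "Veneno": [a, b],
    "Aventador": [a, b],
    "Estoque": [c, d],
    "Gallardo": [c, d],
}
TRIPLES = [(s, p, o) for s, pairs in ROWS.items() for p, o in pairs]


def build_context(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[Pair]]:
    """(u, (p, o)) is in I exactly when the triple <u, p, o> is in the graph."""
    context: Dict[str, Set[Pair]] = {}
    for s, p, o in triples:
        context.setdefault(s, set()).add((p, o))
    return context


def intent_of(objects: Iterable[str], context: Dict[str, Set[Pair]]) -> FrozenSet[Pair]:
    """Attributes shared by every object in the given set."""
    shared: Set[Pair] = set().union(*context.values())
    for g in objects:
        shared &= context[g]
    return frozenset(shared)


def extent_of(attributes: Iterable[Pair], context: Dict[str, Set[Pair]]) -> FrozenSet[str]:
    """Objects that have every attribute in the given set."""
    wanted = set(attributes)
    return frozenset(g for g, attrs in context.items() if wanted <= attrs)


def formal_concepts(context: Dict[str, Set[Pair]]):
    """Enumerate all (extent, intent) pairs by closing every attribute subset."""
    attributes = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            extent = extent_of(subset, context)
            concepts.add((extent, intent_of(extent, context)))
    return concepts


CONTEXT = build_context(TRIPLES)
shared_pairs = intent_of({"Islero", "400GT"}, CONTEXT)
print(sorted(extent_of(shared_pairs, CONTEXT)))
print(len(formal_concepts(CONTEXT)))
```

Closing the pair "Islero, 400GT" yields the full set of cars that share the same four predicate-object pairs, which mirrors how a label in reduced notation stands for a larger extent.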

Improving DBpedia with FCA

Problem context

Consider the following fictional scenario. You are a bookkeeper in a library of books written in a language you do not understand. A customer arrives and asks you for a book about "Cars". Since you do not know what the books are about (because you cannot read them), you ask the customer to browse the collection on his own. After he finds a book he is interested in reading, you mark the symbol ‹ on that book for future reference. Then, on an empty page, you write (‹ - Cars). After several cases like this, you will probably end up with a page full of symbols representing different topics or categories of your books, among them (a - Sports), (˛ - Football) and (˝ - History). Now you can even combine symbols when customers ask you for "Sport Cars", which you translate into ‹a. Actually, the demand for books about "Sport Cars" is so high that you create a new symbol : just for it. In doing so, you have created your own categorization system for a collection of books you do not understand.

In general, given a topic, you are able to retrieve books without much trouble. However, since you do not understand the books, you are restricted to the set of symbols you have for doing this. Furthermore, if you are not careful some problems start to arise, such as books marked with ˛ and without a. Finally, people do not get books marked with : when they look for "Cars", since they only search for the symbol ‹.

It is easy to establish an analogy between the above scenario and how DBpedia profits from Wikipedia's categorization system. DBpedia is able to retrieve entities when queried with an annotation (as in the example of "French films"); however, any information need not initially provided as a category is unavailable for retrieval (such as "French films about the Art Nouveau era"). Incoherences in categorical annotations are quite frequent; for example, there are over 200 entities annotated as "French films" which are not typed as "Films". Finally, DBpedia is not able to provide inferencing. For example, in Figure 1, the entities Veneno and Aventador, even though they are annotated as "Lamborghini vehicles", cannot be retrieved when queried simply by "vehicles". In such a way, it is exactly as if they were marked with a symbol such as :.

The completion of DBpedia data

Our main concern lies in two aspects. Firstly, are we able to complete data using logical inferences? For example, can we complete the information in the dataset by indicating that the entities "Estoque" and "Gallardo" should be categorized as "Lamborghini vehicles" and "Sports cars"? Secondly, are we able to complete the descriptions of a given type? For example, DBpedia does not specify that an "Automobile" should have a "manufacturer". In the following, we try to answer these two questions using implications and association rules.

Consider the rules provided in Table 2. Of course, the first three implications are only true in our dataset. This is due to the fact that we use the "closed world" assumption, meaning that our rules only apply in "our world of data" where all cars are of the "Lamborghini" brand, i.e. all other information about cars that we do not know can be assumed to be false [5]. While these implications are trivial, they provide good insight into the capabilities of our model. For instance, including a larger number of triples in our dataset would allow discovering that, while not all automobiles are manufactured by Lamborghini, they are manufactured by either a Company, an Organization or an Agent. These three classes⁵ are types of the entity Lamborghini in DBpedia. Such a rule would allow providing a domain characterization to the otherwise empty description of the predicate "dbo:manufacturer" in the DBpedia schema.

The association rule given in the fourth row in Table 2 shows the fact that 29% of the subjects of type "Automobile" and manufactured by "Lamborghini" should be categorized by "Sports cars" and "Lamborghini vehicles" to complete the data. This actually corresponds to the entities "Estoque" and "Gallardo" in Figure 1. Based on this fact, we can use association rules also to create new triples that allow the completion of the information included in DBpedia.
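The way a rule's confidence is computed and then used to propose new triples can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the context is a simplified, assumed version of Figure 1 that only keeps the four predicate-object pairs involved in the rule, and the function names `confidence` and `completion_triples` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]

a: Pair = ("dc:subject", "dbpc:Sports_cars")
b: Pair = ("dc:subject", "dbpc:Lamborghini_vehicles")
c: Pair = ("dbo:manufacturer", "dbp:Lamborghini")
d: Pair = ("rdf:type", "dbo:Automobile")

# ASSUMPTION: simplified context consistent with the text, not a copy of Figure 1.
CONTEXT: Dict[str, Set[Pair]] = {
    "Reventon": {a, b, c, d},
    "Countach": {a, b, c, d},
    "350GT": {a, b, c, d},
    "400GT": {a, b, c, d},
    "Islero": {a, b, c, d},
    "Veneno": {a, b},
    "Aventador": {a, b},
    "Estoque": {c, d},
    "Gallardo": {c, d},
}


def extent_of(attributes: Set[Pair], context: Dict[str, Set[Pair]]) -> FrozenSet[str]:
    """Objects that have every attribute in `attributes`."""
    return frozenset(g for g, attrs in context.items() if attributes <= attrs)


def confidence(premise: Set[Pair], conclusion: Set[Pair],
               context: Dict[str, Set[Pair]]) -> float:
    """Fraction of objects satisfying the premise that also satisfy the conclusion."""
    covered = extent_of(premise, context)
    if not covered:
        raise ValueError("the premise matches no object")
    return len(extent_of(premise | conclusion, context)) / len(covered)


def completion_triples(premise: Set[Pair], conclusion: Set[Pair],
                       context: Dict[str, Set[Pair]]) -> List[Tuple[str, str, str]]:
    """Triples that would make the rule hold for every object matching the premise."""
    new_triples = []
    for g in sorted(extent_of(premise, context)):
        for p, o in sorted(conclusion - context[g]):
            new_triples.append((g, p, o))
    return new_triples


rule_confidence = confidence({c, d}, {a, b}, CONTEXT)
missing_share = round((1 - rule_confidence) * 100)
proposed = completion_triples({c, d}, {a, b}, CONTEXT)
print(f"confidence = {rule_confidence:.2f}, missing = {missing_share}%")
for triple in proposed:
    print(triple)
```

An implication is the special case where the confidence equals 1; in this assumed context, for instance, every car annotated as a sports car is also annotated as a Lamborghini vehicle. Whether a rule with lower confidence should be trusted enough to generate triples is a modelling decision that this sketch does not address.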

Pattern structures for the completion process

The aforementioned models to support linked data using FCA are adequate for small datasets such as the example provided. Actually, LOD does not always consist of triples of resources (identified by their URIs) but contains a diversity of data types and structures including dates, numbers, collections, strings and others, which makes data processing much more complex. This calls for a formalism able to deal with this diversity of complex and heterogeneous data. Accordingly, pattern structures are an extension of FCA which enables the analysis of complex data, such as numerical values, graphs, partitions, etc. In a nutshell, pattern structures provide the necessary definitions to apply FCA to entities with complex descriptions. The basics of pattern structures are introduced in [6]. Below, we provide a brief introduction using interval pattern structures [8].

Let us consider Table 3, which shows the predicate dbo:productionStartYear for the subjects in Figure 1. In such a case we would like to extract a pattern in the year of production of a subset of cars. In contrast to a formal context as introduced in Section 2, instead of having a set $M$ of attributes, interval pattern structures use a semi-lattice of interval descriptions ordered by a subsumption relation and denoted by $(D, \sqsubseteq)$⁶. Furthermore, instead of having an incidence relation $I$, pattern structures use a mapping function $\delta : G \to D$ which assigns to any $g \in G$ the corresponding interval description $\delta(g) \in D$. For example, the entity "350GT" in Table 3 has the description $\delta(\text{350GT}) = \langle [1963, 1963] \rangle$.

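Let us consider two descriptions $\delta(g_1) = \langle [l^1_i, r^1_i] \rangle$ and $\delta(g_2) = \langle [l^2_i, r^2_i] \rangle$, with $i \in [1..n]$ where $n$ is the number of intervals used for the description of entities. The similarity operation $\sqcap$ and the associated subsumption relation $\sqsubseteq$ between descriptions are defined as the convex hull of two descriptions as follows:

$$\delta(g_1) \sqcap \delta(g_2) = \langle [\min(l^1_i, l^2_i), \max(r^1_i, r^2_i)] \rangle$$

$$\delta(g_1) \sqsubseteq \delta(g_2) \iff \delta(g_1) \sqcap \delta(g_2) = \delta(g_1)$$

For example:

$$\delta(\text{350GT}) \sqcap \delta(\text{Islero}) = \langle [1963, 1967] \rangle$$

$$(\delta(\text{350GT}) \sqcap \delta(\text{Islero})) \sqsubseteq \delta(\text{400GT})$$

Finally, a pattern structure is denoted as $(G, (D, \sqsubseteq), \delta)$ where the operators $(\cdot)^{\square}$ between $\wp(G)$ and $(D, \sqsubseteq)$ are given below:

$$A^{\square} := \bigsqcap_{g \in A} \delta(g) \qquad d^{\square} := \{ g \in G \mid d \sqsubseteq \delta(g) \}$$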

An interval pattern concept $(A, d)$ is such that $A \subseteq G$, $d \in D$, $A = d^{\square}$ and $d = A^{\square}$. Using interval pattern concepts, we can extract and classify the actual patterns (and pattern concepts) representing the years of production of the cars. Some of them are presented in the lower part of Table 3. We can see that the cars can be divided into three main periods of production, given by the intents of the interval pattern concepts.
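The similarity operation, the subsumption test and the two derivation operators of an interval pattern structure can be written down in a few lines of Python. The sketch below is illustrative only: the description of "350GT" and the hull with "Islero" follow the example in the text, whereas the year used for "400GT" (any value inside [1963, 1967] would do) and the year used for "Countach" are assumed values, since Table 3 is not reproduced here. Function names are ours.

```python
from functools import reduce
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[int, int]
Description = Tuple[Interval, ...]  # a vector of n intervals (here n = 1)

# ASSUMPTION: only 350GT and the hull with Islero come from the text;
# the years of 400GT and Countach are hypothetical illustration values.
DELTA: Dict[str, Description] = {
    "350GT": ((1963, 1963),),
    "400GT": ((1966, 1966),),
    "Islero": ((1967, 1967),),
    "Countach": ((1974, 1974),),
}


def meet(d1: Description, d2: Description) -> Description:
    """Similarity operation: component-wise convex hull of two descriptions."""
    return tuple((min(l1, l2), max(r1, r2)) for (l1, r1), (l2, r2) in zip(d1, d2))


def subsumed(d1: Description, d2: Description) -> bool:
    """d1 is subsumed by d2 exactly when meet(d1, d2) == d1."""
    return meet(d1, d2) == d1


def description_of(objects: Iterable[str], delta: Dict[str, Description]) -> Description:
    """Derivation of a non-empty set of objects: the meet of all their descriptions."""
    return reduce(meet, (delta[g] for g in objects))


def objects_of(d: Description, delta: Dict[str, Description]) -> Set[str]:
    """Derivation of a description: all objects whose description subsumes d."""
    return {g for g, dg in delta.items() if subsumed(d, dg)}


hull = meet(DELTA["350GT"], DELTA["Islero"])
extent = objects_of(hull, DELTA)
print(hull, sorted(extent))
# (extent, hull) is an interval pattern concept when the two derivations agree.
print(description_of(extent, DELTA) == hull)
```

Note that the order is reversed with respect to interval inclusion: the wider interval is the more general description, so it is subsumed by every narrower interval it contains.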

Heterogeneous pattern structures

Different instances of the pattern structure framework have been proposed to deal with different kinds of data, e.g. graphs, sequences, intervals, partitions, etc. For linked data we propose to use the "heterogeneous pattern structure" framework introduced in [4] as a way to describe objects in a heterogeneous space, i.e. where there are relational, multi-valued and binary attributes. It is easy to observe that this is actually the case for linked data, where the set of literals $L$ greatly varies in nature depending on the predicate. For the sake of simplicity we provide only the most important details of the model used for working with linked data. When the range of a predicate (hereafter referred to as "relation") $p \in P$ is such that $\mathrm{range}(p) \subseteq U$, we call $p$ an "object relation". Analogously, when the range is such that $\mathrm{range}(p) \subseteq L$, $p$ is a "literal relation". For any given relation (object or literal), we define the pattern structure $K_p = (G, (D_p, \sqcap), \delta_p)$, where $(D_p, \sqsubseteq)$ is an ordered set of descriptions defined for the elements in $\mathrm{range}(p)$, and $\delta_p$ maps entities $g \in G$ to their descriptions in $D_p$. Based on that, the triple $(G, H, \Delta)$ is called a "heterogeneous pattern structure", where $H = \times_{p \in P} D_p$ is the Cartesian product of all the description sets $D_p$, and $\Delta$ maps an entity $g \in G$ to a tuple where each component corresponds to a description in a set $D_p$.

For an "object relation", the order in $(D_p, \sqsubseteq)$ is given by standard set inclusion and thus, the pattern structure $K_p$ is just a formal context. Regarding "literal relations", such as numerical properties, the pattern structure may vary according to what is more appropriate to deal with that specific kind of data. For example, considering the predicate dbo:productionStartYear discussed in the previous section, $K_{\text{dbo:productionStartYear}}$ should be modelled as an interval pattern structure. For the running example, the heterogeneous pattern structure is presented in Table 4. Cells in grey mark a heterogeneous pattern concept the extent of which contains cars "350GT, 400GT, Islero". The intent of this heterogeneous pattern concept is given by the tuple $(\{a, b\}, \{c\}, \{d\}, \langle [1963, 1967] \rangle)$, i.e. "Automobiles manufactured by Lamborghini between 1963 and 1967".
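A heterogeneous description can be modelled as a tuple whose components each carry their own similarity operation: set intersection for object relations and the convex hull for the interval-valued literal relation. The following Python sketch illustrates this under stated assumptions: the entity descriptions are hypothetical stand-ins for Table 4 (only the resulting intent for "350GT, 400GT, Islero" is taken from the text; the individual years other than 1963 for "350GT", and the rows for "Countach" and "Estoque", are assumed), and all function names are ours.

```python
from functools import reduce
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Interval = Tuple[int, int]
# Components: dc:subject, dbo:manufacturer, rdf:type, dbo:productionStartYear
Description = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str], Interval]


def meet_sets(x: FrozenSet[str], y: FrozenSet[str]) -> FrozenSet[str]:
    """Similarity for an object relation: set intersection."""
    return x & y


def meet_intervals(x: Interval, y: Interval) -> Interval:
    """Similarity for the interval-valued literal relation: convex hull."""
    return (min(x[0], y[0]), max(x[1], y[1]))


COMPONENT_MEETS = (meet_sets, meet_sets, meet_sets, meet_intervals)


def year(y: int) -> Interval:
    return (y, y)


# ASSUMPTION: hypothetical descriptions standing in for Table 4.
DELTA: Dict[str, Description] = {
    "350GT": (frozenset("ab"), frozenset("c"), frozenset("d"), year(1963)),
    "400GT": (frozenset("ab"), frozenset("c"), frozenset("d"), year(1966)),
    "Islero": (frozenset("ab"), frozenset("c"), frozenset("d"), year(1967)),
    "Countach": (frozenset("ab"), frozenset("c"), frozenset("d"), year(1974)),
    "Estoque": (frozenset(), frozenset("c"), frozenset("d"), year(2008)),
}


def meet(d1: Description, d2: Description) -> Description:
    """Apply each component's own similarity operation."""
    return tuple(m(x, y) for m, x, y in zip(COMPONENT_MEETS, d1, d2))


def intent_of(objects: Iterable[str], delta: Dict[str, Description]) -> Description:
    """Common description of a non-empty set of entities."""
    return reduce(meet, (delta[g] for g in objects))


def extent_of(d: Description, delta: Dict[str, Description]) -> Set[str]:
    """Entities whose description subsumes d in every component."""
    return {g for g, dg in delta.items() if meet(d, dg) == d}


intent = intent_of({"350GT", "400GT", "Islero"}, DELTA)
extent = extent_of(intent, DELTA)
print(sorted(intent[0]), sorted(intent[1]), sorted(intent[2]), intent[3])
print(sorted(extent))
```

Keeping one similarity function per component makes it straightforward to plug in another pattern structure for a different kind of literal, without touching the components that behave as ordinary formal contexts.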

Conclusion

To conclude, in the current study we introduce a mechanism based on association rule mining for the completion of the RDF dataset. Moreover, we use heterogeneous pattern structures to deal with heterogeneity in LOD. This study shows the capabilities of FCA for completing complex RDF structures.

Table 1: Index of pairs predicate-object and namespaces.

Predicates                    Objects
Index  URI                    Index  URI
A      dc:subject             a      dbpc:Sport_Cars
                              b      dbpc:Lamborghini_vehicles
B      dbp:manufacturer       c      dbp:Lamborghini
C      rdf:type               d      dbo:Automobile
D      dbp:assembly           e      dbp:Italy
E      dbo:layout             f      dbp:Four-wheel_drive
                              g      dbp:Front-engine

Namespaces:
dc:    http://purl.org/dc/terms/
dbo:   http://dbpedia.org/ontology/
rdf:   http://www.w3.org/1999/02/22-rdf-syntax-ns#
dbp:   http://dbpedia.org/resource/
dbpc:  http://dbpedia.org/resource/Category:

[Formal context of Figure 1: rows Reventon, Countach, 350GT, 400GT, Islero, Veneno, Aventador Roadster, Estoque, Gallardo; columns a to g grouped under the predicates A to E; the positions of the crosses are not recoverable.]

Table 2: Association rules extracted from the formal context in Figure 1.

Table 3: Upper table shows values of predicate dbp:productionStartYear for entities in Figure 1. The symbol - indicates that there are no values present in DBpedia for those subjects. Lower table shows the derived interval pattern concepts.

Table 4: Heterogeneous pattern structure for the running example. Indexes for properties are shown in Table 1.

1 Resource Description Framework - http://www.w3.org/RDF/
2 Emerging in the sense of "dynamic" or "in progress".
3 Notice that in DBpedia the property used to link entities and categories is called "subject". We use "annotation" instead of "subject" to avoid confusion with the "subject" in an RDF triple.
4 http://www.w3.org/TR/rdf-sparql-query/
5 In the OWL language sense.
6 It can be noticed that standard FCA uses a semi-lattice of set descriptions ordered by inclusion, i.e. $(M, \subseteq)$.

References

[1] Mehwish Alam, Aleksey Buzmakov, Victor Codocedo, Amedeo Napoli. Mining definitions from RDF annotations using formal concept analysis. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, July 25-31, 2015.

[2] Dominik Benz, Andreas Hotho, Gerd Stumme. Semantics made by you and me: Self-emerging ontologies can capture the diversity of shared knowledge. In: Proceedings of the 2nd Web Science Conference, 2010.

[3] Christian Bizer, Tom Heath, Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 2009.

[4] Víctor Codocedo, Amedeo Napoli. A Proposition for Combining Pattern Structures and Relational Concept Analysis. In: 12th International Conference on Formal Concept Analysis, 2014.

[5] Christian Fürber, Martin Hepp. Swiqa - a semantic web information quality assessment framework. In: 19th European Conference on Information Systems, 2011.

[6] Bernhard Ganter, Sergei O. Kuznetsov. Pattern structures and their projections. In: ICCS, Lecture Notes in Computer Science, vol. 2120, Springer, 2001.

[7] Bernhard Ganter, Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin/Heidelberg, 1999.

[8] Mehdi Kaytoue, Sergei O. Kuznetsov, Amedeo Napoli, Sébastien Duplessis. Mining gene expression data with pattern structures in formal concept analysis. Information Sciences 181(10), 2011.