Nell2RDF: Read the Web, and turn it into RDF Antoine Zimmermann‡ , Christophe Gravier† , Julien Subercaze† , and Quentin Cruzille‡ ‡ École Nationale Supérieure des Mines, FAYOL-ENSMSE, LSTI, F-42023 Saint-Étienne, France † Université de Saint-Étienne, Télécom Saint-Étienne, 23 rue du Dr. Rémy Annino, F-42023 Saint-Etienne, France antoine.zimmermann@emse.fr {christophe.gravier,julien.subercaze}@univ-st-etienne.fr quentin.cruzille@gmail.com Abstract. This paper describes the Nell2RDF platform that provides Linked Data of general knowledge, based on data automatically constructed by a per- manent machine learning process called NELL that reads the Web. As opposed to DBpedia, all facts recorded by NELL can be tracked according to its prove- nance and a degree of confidence. With our platform, we aim at capturing all the data generated by NELL a transform them into state of the art Linked Data, following best practices. We discuss the benefits of the platform in opening new lines of research, while the work is still in progress. Keywords: NELL, RDF-ising, machine learning 1 Introduction Heterogeneity and redundancy are often seen as problematic in information system, es- pecially on the Web. However, in many ways, they can be seen as a strength. On the Web, heterogeneity and redundancy allow one to consider the same information from different angles, thus helping build beliefs and trust in some information, while dis- crediting other. On the Web of Linked Data, most data sources are describing highly specialised information that are not replicated anywhere else. Moreover, Linked Data specialists usually advocate reuse via linking, rather than producing one’s own descrip- tion. Datasets of general knowledge are very scarce, but there is a knowledge base of everything (so to speak) called DBpedia, that is readily available to link to. DBpedia is an excellent project, but however powerful it may be, there is a need to provide alter- native general purpose knowledge sources, if only to allow an objective comparison of generic knowledge bases. We propose a new source of general knowledge in the form of Linked Data, by leveraging a machine learning project called NELL. NELL computes facts automatically to form a knowledge base that constantly evolves. We use the freely available dataset to produce an RDF-ised version, that we in turn make available via a public sparql endpoint and data dumps. This paper is a description of the platform and aims at showing the huge potential that it can be for future research on Linked Data. In the following, we shortly describe NELL (Sect. 2); how we map NELL’s ontology 2 A. Zimmermann, C. Gravier, J. Subercaze, Q. Cruzille to OWL (Sect. 3); how NELL’s believed facts are turned into RDF (Sect. 4); the plat- form for querying the dataset (Sect. 5) and finally the research opportunities we foresee (Sect. 6). 2 The Never Ending Language Learner NELL is the Never-Ending Language Learner [2]. It is a machine learning and text mining programme that reads Web pages, apply NLP analysis to extract formal, logical facts in the form of Entity-Attribute-Value triples annotated with confidence and prove- nance information, then go on reading other Web pages that are linked from the current page. If the same fact is discovered several times from different places, the confidence in the fact grows. Moreover, NELL learns fact according to rules that it is able to learn, which makes its learning capabilities better and better. At the same time, NELL embeds a (mostly) hand-made ontology that is used to detect inconsistencies, which prevents it from getting too much confidence in wrong guesses. During its constant reading of Web pages (now several billions), the confidence of certain facts reaches a threshold that is sufficient to consider the fact to be true. Such facts are published online on the Web site http://rtw.ml.cmu.edu/rtw/ and a Twitter bot publishes those facts1 . In order to currate data, a voting system is present on the web site. Manual edits are made from time to time to avoid false facts to misguide the learning process. Both the internal ontology and the facts that are assumed to be true are publicly available and downloadable on the Web site. 3 Converting NELL’s ontology into OWL NELL’s ontology is described in a tab separated value with three columns. The first column contains a term of the ontology that is being described (i.e., the subject). The second column is an attribute or a relation on the term (i.e., the predicate). The third element corresponds to the value of the attribute or another term to which the subject is related (i.e., the object). The detailed meaning of the predicates is the following: – antireflexive is used with a boolean and indicates that the term is an irrefexive property; – antisymmetric indicates whether the term is an antisymmetric property, but there is no equivalent in OWL (although there is the class owl:AsymmetricProperty), so we record the information in a custom annotation property; – description is always used with an English prose describing the term. We chose to record this in rdfs:comment, but could also use dc:description; – domain has the exact same semantics as rdfs:domain; – domainwithinrange indicates that anything that belongs to the domain of the property must also belong to its range, i.e., in mathematical terms, domainwithinrange is true for predicate p iff ∀x, ∃y.triple(x, p, y) ⇒ ∃z.triple(z, p, x); – generalizations is translating the rdfs:subClassOf relation; 1 NELL’s twitter account. https://twitter.com/cmunell Nell2RDF 3 Table 1. List of NELL’s ontology predicates and their translation in RDF(S) and OWL antireflexive rdf:type owl:IrreflexiveProperty antisymmetric nell:antisymmetric Literal(?object,xsd:boolean) description rdfs:comment Literal(?object,@en) domain rdfs:domain Class(?object) domainwithinrange nell:domainWithinRange Literal(?object,xsd:boolean) generalizations rdfs:subClassOf Class(?object) humanformat nell:humanFormat Literal(?object,xsd:string) instancetype nell:instanceType IRI(?object) inverse owl:inverseOf ?object memberofsets if ?object is rtwcategory then rdf:type rdfs:Class, else ?object is rtwrelation then rdf:type rdf:Property mutexpredicates if ?subject is a class then owl:disjointWith ?object, else ?subject is a property then owl:propertyDisjointWith ?object nrofvalues if ?object is 1 then rdf:type owl:FunctionalProperty, noth- ing otherwise populate nell:populate Literal(?object,xsd:boolean) range rdfs:range ?object rangewithindomain nell:rangeWithinDomain Literal(?object,xsd:boolean) visible nell:visible Literal(?object,xsd:boolean) – humanformat indicates a format code that is used to display facts using the term on the website; – instancetype has only 6 possible values (all, common, proper, regexp, javaDateFormat, url) and applies to classes to indicate that the instances are of a type that must be treated specially (such as dates). We do not use this information at the moment, we simply record the information in a custom annotation property; – inverse has the same meaning as owl:inverseOf; – memberofsets indicates whether the term is a class (rtwcategory) or a property (rtwrelation); – mutexpredicates expresses disjointness of terms, and may apply to classes and properties alike; – nrofvalues is either number 1 or any. The value 1 asserts that the term is a functional property; – populate indicates that the term is not used explicitly in a class membership as- sertion or relation, but serves to group several subclasses or properties; – range has the same semantics as rdfs:range; – rangewithindomain is similar to domainwithinrange and mean, mathemati- cally, that rangewithindomain is true for a property p iff ∀x, ∃y.triple(y, p, x) ⇒ ∃z.triple(x, p, z); – visible indicates that the term is not presented on NELL’s homepage or in NELL’s tweets. 4 A. Zimmermann, C. Gravier, J. Subercaze, Q. Cruzille Table 2. Correspondences between columns and RDF constructs. E Subject of a triple (IRI) R Predicate from the ontology, or rdf:type V Object of a triple (IRI or literal) EL rdfs:label of the subject VL rdfs:label of the object, if not literal BEL skos:prefLabel of the subject BVL skos:prefLabel of the object, if not literal CE rdf:type of the subject CV rdf:type of the object, if not literal 4 Converting facts to RDF In this part, we describe how we turn NELL data to RDF. For this, we use the file that contains all beliefs of NELL2 . Each line contains tab-separated values that we can treat separately to produce triples by extracting the content of the columns Entity, Relations, Value, Entity literalStrings, Value literalStrings, Best Entity literalString, Best Value literalString, Categories for Entity and Categories for Value. In the algorithm below, these columns will be respectively identified by the short names: E, R, V, EL, VL, BEL, BVL, CE and CV. There are a few extra columns that we do not consider for the current work and will omit in this discussion. The first three columns directly map to subject, predicate and object respectively, but they do not form RDF triples as they are because the terms are neither IRIs nor literals. When V is not starting with concept:, it is converted into a literal, otherwise all terms are IRIs. The columns EL and VL provide English phrases that name the entity E and value V (i.e., subject and object) respectively, while CE and CV define classes for E and V. Finally, BEL and BVL provide the most appropriate English phrase that name E and V. Table 2 defines how the columns are translated in their RDF counterpart, and Alg. 1 provides a detailed algorithm. Since EL, VL, CE and CV are lists rather than single values, we use them as sets in the algorithm. In order to simplify the algotihm, we use the following functions: add(s,p,o) adds an RDF triple to an in-memory representation of the RDF graph, geniri(v) (generate an IRI from v), genpropiri(v) (generate a property IRI from v), genclassiri(v) (generate a class IRI from v), genstr(v) (generate a string from v), genlit(v) (generate a literal from v) and isliteral(v) (return true if v is a literal). Function genlit(v) can generate literals of type xsd:integer, xsd:string or xsd:anyURI that can correspond to sport match scores, geolocation, ages or wikipedia URL. The al- gorithm is applied to each line of the CSV file provided by the NELL web site. IRIs are generated by replacing the prefix concept: of NELL’s terms by our IRI prefix http: //nell-ld.telecom-st-etienne.fr/. Properties and classes have their own IRI scheme, with prefix http://nell-ld.telecom-st-etienne.fr/ontology/. Example 1. We give an example of what can be found in NELL’s belief set, followed by the RDF we generate. One line is presented in column for readability: 2 Retrieved January, 2013. http://rtw.ml.cmu.edu/resources/results/08m/ NELL.08m.690.esv.csv.gz Nell2RDF 5 1 foreach c in CE do 2 add(geniri(E), rdf:type, genclassiri(c)); 3 end 4 foreach l in EL do 5 add(geniri(E), rdfs:label, genstr(l)); 6 end 7 add(geniri(E), skos:prefLabel, genstr(BEL)); 8 if isliteral(V) then 9 add(geniri(E), genpropiri(R), genlit(V)); 10 end 11 else 12 if E = generalizations then 13 add(geniri(E), rdf:type, genclassiri(V)); 14 foreach l in VL do 15 add(genclassiri(V), rdfs:label, genstr(l)); 16 end 17 add(genclassiri(V), skos:prefLabel, genstr(BVL)); 18 end 19 else 20 add(geniri(E), genpropiri(R), geniri(V)); 21 foreach l in VL do 22 add(geniri(V), rdfs:label, genstr(l)); 23 end 24 add(geniri(V), skos:prefLabel, genstr(BEL)); 25 foreach c in CV do 26 add(geniri(V), rdf:type, genclassiri(c)); 27 end 28 end 29 end Algorithm 1: Converting a line of NELL’s belief set to RDF E concept:company:prudential001 R concept:hasofficeincity V concept:city:boston EL "prudential" "Prudential" VL "boston" "Boston" "BOSTON" BEL prudential BVL boston CE CV concept:city concept:island Note that CE is empty, while EL, VL and CV have multiple values. This line is translated into the following triples, in Turtle syntax: @prefix : . @prefix no: . :company/prudential001 no:hasofficeincity :city/boston ; rdfs:label "prudential", "Prudential" ; 6 A. Zimmermann, C. Gravier, J. Subercaze, Q. Cruzille skos:prefLabel "prudential" . :city/boston rdfs:label "boston", "Boston", "BOSTON" ; skos:prefLabel "boston" ; rdf:type no:city, no:island . 5 The Nell2RDF dataset and platform We set up a triple store running Sesame with a SPARQL endpoint at http://nell-ld. telecom-st-etienne.fr/sparql. This store contains the ontology and the trans- lated belief set. The ontology has 541 properties and 310 classes, which is relatively few, but has 118,796 triples, mainly because NELL systematically asserts mutual disjoint- ness of classes or properties. The existence of property disjointness, and other features like antireflexive and antisymmetric properties, put NELL in a fragment of OWL for which no efficient reasoners are known. Consequently, we only apply the simple RDFS materialisation that Sesame is able to perform. The belief set has 5,805,102 triples concerning 1,475,390 distinct subjects and hav- ing 2,820,751 distinct objects, among which 2,799,591 are literals. The vast major- ity of triples concern the rdfs:label (2,479,546), skos:prefLabel (1,475,396) and rdf:type (1,244,493). Interestingly, there is a large amount of wikipedia URL (314,525) which could be used to help map NELL’s entities to DBpedia. We publish a dump of the data set, in gzipped ntriples as well as in gzipped HDT [3] available from the website http://nell-ld.telecom-st-etienne.fr/. Meta- data about the dataset is available too, containing a SPARQL service description [4], VoiD metadata [1] and more. 6 Opportunities and Future Directions Nell2RDF is currently a work in progress for which many improvements can be made. Especially, our conversion tool is currently ignoring the annotations regarding confi- dence and provenance of NELL’s beliefs. We expect to develop a new version where named graph will be used for such metadata on triples. The annotation will be used as a real life test bed for annotated RDF reasoning and querying [5]. We also would like to offer an interface for browsing the data in function of the metadata (e.g., filter facts by provenance or confidence). Moreover, the current implementation computes the dataset from the complete data dump, while the NELL website provides iterations in smaller files on a regular basis. We want to provide an automated update mechanism that keeps Nell2RDF in sync with NELL. In the meantime, we will work on comparing NELL to DBpedia in several ways: first, we will map our ontology to DBpedia’s; second, we will experiment with (semi-)automatic instance matching techniques to link to DBpe- dia; third, we will compare coverage and quality of the two knowledge bases. From our preliminary inspection of NELL’s belief, we recognise that NELL is still very far from DBpedia in terms of completeness and reliability. Finally, we hope that this work will also serve other people improve the state of the art in Linked Data production, publishing and evaluation. Nell2RDF 7 References 1. and Richard Cyganiak, Michael Hausenblas, and Yuting Zhao. Describing Linked Datasets with the VoID Vocabulary, W3C Interest Group Note 03 March 2011. W3C IG Note, World Wide Web Consortium (W3C), March 3 2011. 2. Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an Architecture for Never-Ending Language Learning. In Ronan Fox and David Poole, editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, July 2010. 3. Javier D. Fernández, Miguel A. Martı́nez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. Binary RDF Representation for Publication and Exchange (HDT). Journal of Web Semantics, 2013. 4. Gregory Todd Williams. SPARQL 1.1 Service Description, W3C Recommendation 21 March 2013. W3C Recommendation, World Wide Web Consortium (W3C), March 21 2013. 5. Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general frame- work for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics, 11:72–95, 2012.