<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Nell2RDF: Read the Web, and turn it into RDF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoine Zimmermannz</string-name>
          <email>antoine.zimmermann@emse.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Graviery</string-name>
          <email>christophe.gravier@univ-st-etienne.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julien Subercazey</string-name>
          <email>julien.subercaze@univ-st-etienne.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quentin Cruzillez</string-name>
          <email>quentin.cruzille@gmail.com</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper describes the Nell2RDF platform that provides Linked Data of general knowledge, based on data automatically constructed by a permanent machine learning process called NELL that reads the Web. As opposed to DBpedia, all facts recorded by NELL can be tracked according to its provenance and a degree of confidence. With our platform, we aim at capturing all the data generated by NELL a transform them into state of the art Linked Data, following best practices. We discuss the benefits of the platform in opening new lines of research, while the work is still in progress.</p>
      </abstract>
      <kwd-group>
        <kwd>NELL</kwd>
        <kwd>RDF-ising</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Heterogeneity and redundancy are often seen as problematic in information system,
especially on the Web. However, in many ways, they can be seen as a strength. On the
Web, heterogeneity and redundancy allow one to consider the same information from
different angles, thus helping build beliefs and trust in some information, while
discrediting other. On the Web of Linked Data, most data sources are describing highly
specialised information that are not replicated anywhere else. Moreover, Linked Data
specialists usually advocate reuse via linking, rather than producing one’s own
description. Datasets of general knowledge are very scarce, but there is a knowledge base of
everything (so to speak) called DBpedia, that is readily available to link to. DBpedia is
an excellent project, but however powerful it may be, there is a need to provide
alternative general purpose knowledge sources, if only to allow an objective comparison of
generic knowledge bases. We propose a new source of general knowledge in the form of
Linked Data, by leveraging a machine learning project called NELL. NELL computes
facts automatically to form a knowledge base that constantly evolves. We use the freely
available dataset to produce an RDF-ised version, that we in turn make available via a
public sparql endpoint and data dumps. This paper is a description of the platform and
aims at showing the huge potential that it can be for future research on Linked Data.
In the following, we shortly describe NELL (Sect. 2); how we map NELL’s ontology
to OWL (Sect. 3); how NELL’s believed facts are turned into RDF (Sect. 4); the
platform for querying the dataset (Sect. 5) and finally the research opportunities we foresee
(Sect. 6).
2</p>
    </sec>
    <sec id="sec-2">
      <title>The Never Ending Language Learner</title>
      <p>
        NELL is the Never-Ending Language Learner [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is a machine learning and text
mining programme that reads Web pages, apply NLP analysis to extract formal, logical
facts in the form of Entity-Attribute-Value triples annotated with confidence and
provenance information, then go on reading other Web pages that are linked from the current
page. If the same fact is discovered several times from different places, the confidence
in the fact grows. Moreover, NELL learns fact according to rules that it is able to learn,
which makes its learning capabilities better and better.
      </p>
      <p>At the same time, NELL embeds a (mostly) hand-made ontology that is used to
detect inconsistencies, which prevents it from getting too much confidence in wrong
guesses. During its constant reading of Web pages (now several billions), the confidence
of certain facts reaches a threshold that is sufficient to consider the fact to be true. Such
facts are published online on the Web site http://rtw.ml.cmu.edu/rtw/ and a
Twitter bot publishes those facts1. In order to currate data, a voting system is present on
the web site. Manual edits are made from time to time to avoid false facts to misguide
the learning process.</p>
      <p>Both the internal ontology and the facts that are assumed to be true are publicly
available and downloadable on the Web site.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Converting NELL’s ontology into OWL</title>
      <p>NELL’s ontology is described in a tab separated value with three columns. The first
column contains a term of the ontology that is being described (i.e., the subject). The
second column is an attribute or a relation on the term (i.e., the predicate). The third
element corresponds to the value of the attribute or another term to which the subject is
related (i.e., the object). The detailed meaning of the predicates is the following:
– antireflexive is used with a boolean and indicates that the term is an irrefexive
property;
– antisymmetric indicates whether the term is an antisymmetric property, but there
is no equivalent in OWL (although there is the class owl:AsymmetricProperty),
so we record the information in a custom annotation property;
– description is always used with an English prose describing the term. We chose
to record this in rdfs:comment, but could also use dc:description;
– domain has the exact same semantics as rdfs:domain;
– domainwithinrange indicates that anything that belongs to the domain of the
property must also belong to its range, i.e., in mathematical terms, domainwithinrange
is true for predicate p iff 8x; 9y:triple(x; p; y) ) 9z:triple(z; p; x);
– generalizations is translating the rdfs:subClassOf relation;
1 NELL’s twitter account. https://twitter.com/cmunell
– instancetypehasonly6possiblevalues(all,common,proper,regexp,javaDateFormat,
url) and applies to classes to indicate that the instances are of a type that must be
treated specially (such as dates). We do not use this information at the moment, we
simply record the information in a custom annotation property;
– inverse has the same meaning as owl:inverseOf;
– memberofsetsindicateswhetherthetermisaclass(rtwcategory)oraproperty
(rtwrelation);
– mutexpredicates expresses disjointness of terms, and may apply to classes and
properties alike;
– nrofvalues is either number 1 or any. The value 1 asserts that the term is a
functional property;
– populate indicates that the term is not used explicitly in a class membership
assertion or relation, but serves to group several subclasses or properties;
– range has the same semantics as rdfs:range;
– rangewithindomain is similar to domainwithinrange and mean,
mathematically,thatrangewithindomainistrueforapropertypiff8x;9y:triple(y;p;x) )
9z:triple(x;p;z);
– visibleindicatesthatthetermisnotpresentedonNELL’shomepageorinNELL’s
tweets.</p>
      <p>E Subject of a triple (IRI)
R Predicate from the ontology, or rdf:type
V Object of a triple (IRI or literal)
EL rdfs:label of the subject
VL rdfs:label of the object, if not literal
BEL skos:prefLabel of the subject
BVL skos:prefLabel of the object, if not literal
CE rdf:type of the subject</p>
      <p>CV rdf:type of the object, if not literal</p>
    </sec>
    <sec id="sec-4">
      <title>4 Converting facts to RDF</title>
      <p>In this part, we describe how we turn NELL data to RDF. For this, we use the file
that contains all beliefs of NELL2. Each line contains tab-separated values that we can
treat separately to produce triples by extracting the content of the columns Entity,
Relations, Value, Entity literalStrings, Value literalStrings, Best Entity
literalString, Best Value literalString, Categories for Entity and Categories
for Value. In the algorithm below, these columns will be respectively identified by the
short names: E, R, V, EL, VL, BEL, BVL, CE and CV. There are a few extra columns
that we do not consider for the current work and will omit in this discussion.</p>
      <p>The first three columns directly map to subject, predicate and object respectively,
but they do not form RDF triples as they are because the terms are neither IRIs nor
literals. When V is not starting with concept:, it is converted into a literal, otherwise
all terms are IRIs. The columns EL and VL provide English phrases that name the entity
E and value V (i.e., subject and object) respectively, while CE and CV define classes
for E and V. Finally, BEL and BVL provide the most appropriate English phrase that
name E and V. Table 2 defines how the columns are translated in their RDF counterpart,
and Alg. 1 provides a detailed algorithm. Since EL, VL, CE and CV are lists rather than
single values, we use them as sets in the algorithm.</p>
      <p>In order to simplify the algotihm, we use the following functions: add(s,p,o) adds
an RDF triple to an in-memory representation of the RDF graph, geniri(v) (generate
an IRI from v), genpropiri(v) (generate a property IRI from v), genclassiri(v)
(generate a class IRI from v), genstr(v) (generate a string from v), genlit(v)
(generate a literal from v) and isliteral(v) (return true if v is a literal). Function
genlit(v) can generate literals of type xsd:integer, xsd:string or xsd:anyURI
that can correspond to sport match scores, geolocation, ages or wikipedia URL. The
algorithm is applied to each line of the CSV file provided by the NELL web site. IRIs are
generated by replacing the prefix concept: of NELL’s terms by our IRI prefix http:
//nell-ld.telecom-st-etienne.fr/. Properties and classes have their own IRI
scheme, with prefix http://nell-ld.telecom-st-etienne.fr/ontology/.
Example 1. We give an example of what can be found in NELL’s belief set, followed
by the RDF we generate. One line is presented in column for readability:
2 Retrieved January, 2013. http://rtw.ml.cmu.edu/resources/results/08m/
NELL.08m.690.esv.csv.gz</p>
      <p>end
Algorithm 1: Converting a line of NELL’s belief set to RDF</p>
      <p>E concept:company:prudential001
R concept:hasofficeincity
V concept:city:boston
EL "prudential" "Prudential"</p>
      <p>VL "boston" "Boston" "BOSTON"
BEL prudential
BVL boston</p>
      <p>CE</p>
      <p>CV concept:city concept:island</p>
      <p>Note that CE is empty, while EL, VL and CV have multiple values. This line is
translated into the following triples, in Turtle syntax:
@prefix : &lt;http://nell-ld.telecom-st-etienne.fr/&gt; .
@prefix no: &lt;http://nell-ld.telecom-st-etienne.fr/ontology&gt; .
:company/prudential001 no:hasofficeincity :city/boston ;
rdfs:label "prudential", "Prudential" ;
skos:prefLabel "prudential" .
:city/boston rdfs:label "boston", "Boston", "BOSTON" ;
skos:prefLabel "boston" ;
rdf:type no:city, no:island .
5</p>
    </sec>
    <sec id="sec-5">
      <title>The Nell2RDF dataset and platform</title>
      <p>We set up a triple store running Sesame with a SPARQL endpoint at http://nell-ld.
telecom-st-etienne.fr/sparql. This store contains the ontology and the
translated belief set. The ontology has 541 properties and 310 classes, which is relatively few,
but has 118,796 triples, mainly because NELL systematically asserts mutual
disjointness of classes or properties. The existence of property disjointness, and other features
like antireflexive and antisymmetric properties, put NELL in a fragment of OWL for
which no efficient reasoners are known. Consequently, we only apply the simple RDFS
materialisation that Sesame is able to perform.</p>
      <p>The belief set has 5,805,102 triples concerning 1,475,390 distinct subjects and
having 2,820,751 distinct objects, among which 2,799,591 are literals. The vast
majority of triples concern the rdfs:label (2,479,546), skos:prefLabel (1,475,396)
and rdf:type (1,244,493). Interestingly, there is a large amount of wikipedia URL
(314,525) which could be used to help map NELL’s entities to DBpedia.</p>
      <p>
        We publish a dump of the data set, in gzipped ntriples as well as in gzipped HDT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
available from the website http://nell-ld.telecom-st-etienne.fr/.
Metadata about the dataset is available too, containing a SPARQL service description [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
VoiD metadata [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and more.
6
      </p>
    </sec>
    <sec id="sec-6">
      <title>Opportunities and Future Directions</title>
      <p>
        Nell2RDF is currently a work in progress for which many improvements can be made.
Especially, our conversion tool is currently ignoring the annotations regarding
confidence and provenance of NELL’s beliefs. We expect to develop a new version where
named graph will be used for such metadata on triples. The annotation will be used as a
real life test bed for annotated RDF reasoning and querying [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We also would like to
offer an interface for browsing the data in function of the metadata (e.g., filter facts by
provenance or confidence). Moreover, the current implementation computes the dataset
from the complete data dump, while the NELL website provides iterations in smaller
files on a regular basis. We want to provide an automated update mechanism that keeps
Nell2RDF in sync with NELL. In the meantime, we will work on comparing NELL
to DBpedia in several ways: first, we will map our ontology to DBpedia’s; second, we
will experiment with (semi-)automatic instance matching techniques to link to
DBpedia; third, we will compare coverage and quality of the two knowledge bases. From our
preliminary inspection of NELL’s belief, we recognise that NELL is still very far from
DBpedia in terms of completeness and reliability.
      </p>
      <p>Finally, we hope that this work will also serve other people improve the state of the
art in Linked Data production, publishing and evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. and Richard Cyganiak,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yuting</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Describing Linked Datasets with the VoID Vocabulary</article-title>
          , W3C Interest Group Note 03 March
          <year>2011</year>
          .
          <source>W3C IG Note, World Wide Web Consortium (W3C), March 3</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Carlson</surname>
          </string-name>
          , Justin Betteridge, Bryan Kisiel, Burr Settles,
          <string-name>
            <surname>Estevam R. Hruschka</surname>
            Jr., and
            <given-names>Tom M.</given-names>
          </string-name>
          <string-name>
            <surname>Mitchell</surname>
          </string-name>
          .
          <article-title>Toward an Architecture for Never-Ending Language Learning</article-title>
          . In Ronan Fox and David Poole, editors,
          <source>Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI</source>
          <year>2010</year>
          , Atlanta, Georgia, USA, July
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2010</year>
          . AAAI Press,
          <year>July 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Javier</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ferna</surname>
          </string-name>
          <article-title>´ndez, Miguel A. Mart´ınez-</article-title>
          <string-name>
            <surname>Prieto</surname>
            , Claudio Gutie´rrez, Axel Polleres, and
            <given-names>Mario</given-names>
          </string-name>
          <string-name>
            <surname>Arias</surname>
          </string-name>
          .
          <article-title>Binary RDF Representation for Publication and Exchange (HDT)</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Todd</surname>
          </string-name>
          <article-title>Williams</article-title>
          .
          <source>SPARQL 1</source>
          .1 Service Description,
          <source>W3C Recommendation 21 March</source>
          <year>2013</year>
          . W3C Recommendation,
          <source>World Wide Web Consortium (W3C)</source>
          ,
          <source>March 21</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , Nuno Lopes, Axel Polleres, and
          <string-name>
            <given-names>Umberto</given-names>
            <surname>Straccia</surname>
          </string-name>
          .
          <article-title>A general framework for representing, reasoning and querying with annotated Semantic Web data</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>11</volume>
          :
          <fpage>72</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>