<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Co-LOD: Continuous Space Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mayank Kejriwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Open Data (LOD) initiative has been one of the successful manifestations of Semantic Web efforts over the last two decades, with near-exponential growth of LOD datasets in the initial years. Entities and datasets on LOD are naturally discrete, making them amenable to both well-defined reasoning and retrieval procedures that ultimately return lists or sets of resource identifiers fulfilling some criteria (whether stating user intent or using pattern-matching query languages like SPARQL). In recent years, representation learning algorithms have witnessed a powerful ascent in mainstream Artificial Intelligence, fueled in part by the adoption and refinement of neural network architectures like Recurrent Neural Nets and skip-grams, and by empirical successes such as those achieved by word and graph embeddings in the natural language processing and knowledge discovery communities. Large datasets, which are almost always required by such algorithms, make it possible to train and release models openly. In some cases, open models can even be released based on proprietary datasets like Twitter corpora. We propose that the Semantic Web community position itself as a pre-eminent research leader in this space by leveraging the vast and diverse collection of structured datasets currently available on Linked Open Data to build out a corresponding continuous-space equivalent.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Continuous Space</kwd>
        <kwd>Representation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>that semantic agents may need to do for powerful processing of queries posed by
Web users, and that the data can get stale very quickly since many datasets do
not update in real time. A bigger, fruitful (in our opinion) and even provocative
debate that has arisen in our community in recent years2 is whether the growth in
schema.org threatens the premises of LOD. Despite utilizing similar technologies,
many of which have been doubtlessly inspired by work in our own community,
schema.org is based on a completely di erent rationale, with the focus being
on facilitating a better search experience for users, rather than on connectivity
(thereby doing away with the notion of `linked' altogether). That the search
engine providers have pushed hard for embedded schema.org markups, particularly
for websites describing restaurants, movies and other consumer-facing products
and services, in an e ort to ingest more standard datasets into their knowledge
graphs and ranking algorithms has also led to the popularity of the schema.org
movement. Website publishers and service providers have an incentive to provide
clean, up-to-date schema.org data for some of these high-priority categories since
it plays a non-trivial role in whether (and how) they will be found and listed by
the search engine when users search for terms that relate to the business they
are in. In short, publishing good schema.org for a subset of ontological classes
in uences modern-day search optimization, a must for any online provider.</p>
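To make the embedded-markup pattern concrete, here is a minimal sketch of the kind of JSON-LD a publisher might embed for one of the high-priority categories mentioned above. The business name and address are hypothetical; the `@type` and property names come from the schema.org vocabulary.

```python
import json

# Illustrative schema.org markup for a restaurant page. A search engine
# can ingest JSON-LD like this directly into its knowledge graph; the
# specific business here is hypothetical.
markup = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",  # hypothetical business
    "servesCuisine": "French",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Los Angeles",
        "addressCountry": "US",
    },
}

# A publisher would embed this in the page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(markup, indent=2))
```

The incentive structure described above follows directly: a few lines of such markup can determine how the business surfaces in search results.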
      <p>There is no doubt that various ambitious research agendas are already in
place all over the world to address some of the problems we noted above with
LOD, and some have been trying to bridge the gap between LOD and schema.org,
usually by showing how we could possibly extract and LODify schema.org markup
on webpages with high accuracy. But we believe that there is a bigger
opportunity with LOD that will allow us to significantly expand its scope, and make it
a vital resource for the AI and Deep Learning community as a whole.</p>
      <p>To lay the groundwork for this vision, we briefly present the preliminaries
of continuous-space representation learning, a.k.a. `embeddings'. Simply put, an
embedding is a continuous, real-valued vector, usually of relatively low
dimensionality (a common range is 20-100, depending on the application and
dataset), that serves as a distributed representation of a data unit. The
definition of a data unit depends on the algorithm, e.g., a word embedding
algorithm treats words as data units and `embeds' words into continuous, real-valued
and low-dimensional vectors. Graph embeddings generally embed the nodes of
a graph into such spaces, though more advanced knowledge graph embeddings
are also capable of embedding relations. Data units can even be heterogeneous,
e.g., the paragraph2vec algorithm was an example of a `document embedding'
model that jointly embedded words and documents into a single vector space.
Even more recently, the StarSpace package released by Facebook Research is
a general-purpose representation learning package that models data units very
abstractly as a graph-like data structure before embedding them. Because of this
abstraction, it is able to jointly embed all kinds of units, including nodes, text,
documents, users etc., as long as the data is correctly modeled and formatted.
2 Some of these debates have been encouraged by ISWC workshops like HSSUES
(2017).</p>
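The skip-gram idea underlying word embeddings can be sketched in a few dozen lines. The following is a deliberately toy, pure-Python implementation of skip-gram with negative sampling, not any production algorithm; the corpus, hyperparameters and helper names are all illustrative.

```python
import math
import random

def train_skipgram(corpus, dim=20, window=2, epochs=50, lr=0.05, neg=3, seed=42):
    """Toy skip-gram with negative sampling: embeds each word (the 'data
    unit') as a dense, low-dimensional, real-valued vector."""
    rng = random.Random(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    vec = {w: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for w in vocab}
    ctx = {w: [0.0] * dim for w in vocab}  # separate 'context' vectors

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for sent in corpus:
            for i, w in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if i == j:
                        continue
                    # one true context word plus `neg` random negatives
                    targets = [(sent[j], 1.0)] + [
                        (rng.choice(vocab), 0.0) for _ in range(neg)
                    ]
                    for t, label in targets:
                        score = sum(a * b for a, b in zip(vec[w], ctx[t]))
                        g = lr * (label - sigmoid(score))
                        for k in range(dim):
                            vec[w][k], ctx[t][k] = (vec[w][k] + g * ctx[t][k],
                                                    ctx[t][k] + g * vec[w][k])
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Tiny illustrative corpus; real models train on billions of tokens.
vectors = train_skipgram([["paris", "capital", "france"],
                          ["tokyo", "capital", "japan"]])
```

Graph and knowledge-graph embedding models follow the same contract, only with nodes (and possibly relations) as the data units instead of words.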
      <p>We argue that some of the critical features that make these continuous-space
representation learning algorithms so successful are, by a fortuitous coincidence,
exactly in alignment with the features of LOD. First, embedding algorithms rely
on the presence of large datasets that do not necessarily have to be of high
quality3. Second, embedding algorithms perform well when there is enough `context'.
In a graph-theoretic setting, this usually implies connectivity, i.e., the denser the
graph, the more likely it is that `good' embeddings will be learned. Similarly, in a
text-theoretic setting, context means that words are not `sparsely' used, i.e., if a
word is only used once or twice in the corpus, it is unlikely that a model will
generate good embeddings for it. Given corpora like Wikipedia or the Google
News Corpus, this problem rarely arises, since most words are used several times
throughout the corpus. We argue that LOD provides context both because of the
connectivity of resources within each dataset and because the linked data
principles ensure that resources across datasets are connected using agreed-upon
OWL, SKOS or RDFS properties like owl:sameAs.</p>
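One common way graph connectivity is turned into trainable `context' is via truncated random walks, in the style of DeepWalk: each walk acts as a sentence whose co-occurring nodes play the role of co-occurring words. The following is a minimal sketch; the graph is a hypothetical, tiny LOD neighbourhood in which an owl:sameAs link supplies cross-dataset context.

```python
import random

def random_walks(graph, walk_len=5, walks_per_node=10, seed=7):
    """DeepWalk-style walk generation: each walk is a 'sentence' of nodes
    that a skip-gram-style model can then embed. `graph` maps each node
    to its list of neighbours."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = graph.get(node, [])
                if not nbrs:
                    break  # dead end: sparse context, shorter walk
                node = rng.choice(nbrs)
                walk.append(node)
            walks.append(walk)
    return walks

# Hypothetical mini LOD neighbourhood (identifiers are illustrative):
# the owl:sameAs edge between DBpedia and GeoNames is what lets walks
# cross dataset boundaries, giving cross-dataset context.
graph = {
    "dbpedia:Paris": ["dbpedia:France", "geonames:2988507"],
    "geonames:2988507": ["dbpedia:Paris"],
    "dbpedia:France": ["dbpedia:Paris"],
}
walks = random_walks(graph)
```

The denser the graph, the more varied these walks become, which is precisely why connectivity translates into `good' embeddings.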
      <p>With the groundwork and arguments in place, we present the crux of our
vision in Figure 1. The top portion of the figure shows the LOD ecosystem as it
stands today. In essence, it is a `discrete' system in that it can be visualized as a
giant graph of domain-specific (and, in the center, open-world domain) datasets
that have connections between them due to the fourth Linked Data principle.
These datasets are typically accessed as dumps, or via SPARQL endpoints, and
were designed with Semantic Web agents in mind.</p>
      <p>
        The bottom portion of the figure shows our vision of co-LOD, which is a
continuous-space version of Linked Open Data. Our vision can be stated very
simply: embed the entire LOD collection of datasets into a continuous space,
and make the space accessible to the machine learning, data mining and
recommendation services that rely so heavily on general-purpose embeddings (such as of
the Wikipedia corpus) for good performance. Our vision is currently just
theoretical and aspirational, due to several wildcard challenges: how can we make
all of LOD accessible to a representation learning algorithm? Which algorithm
should we use (e.g., PyTorch-BigGraph [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ])? How do we take meta-data, data,
literals (including text, dates and numbers) and ontologies all into account when
embedding? How do we evaluate the quality of the embedding? Where should
such an embedding be hosted? Should there be a single continuous space for all
of Linked Data? How do we access the embeddings for machine learning services?
How do we ensure co-LOD and LOD stay in sync?
      </p>
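As a first, deliberately simplified cut at the accessibility question, LOD triples would at minimum need to be flattened into the integer-indexed edge lists that knowledge graph embedding systems train on. The sketch below shows one possible (lossy) choice, treating literal objects as ordinary nodes; the triples and identifiers are illustrative, and a real system such as PyTorch-BigGraph has its own, richer input and partitioning format.

```python
def triples_to_edges(triples):
    """Flatten (subject, predicate, object) triples into the
    (source, relation, destination) integer edge list that most
    knowledge-graph-embedding systems consume. Literals are kept
    as their own nodes here -- one possible, lossy design choice."""
    entities, relations, edges = {}, {}, []
    for s, p, o in triples:
        for e in (s, o):
            entities.setdefault(e, len(entities))  # assign stable ids
        relations.setdefault(p, len(relations))
        edges.append((entities[s], relations[p], entities[o]))
    return entities, relations, edges

# Hypothetical LOD fragment spanning two datasets:
triples = [
    ("dbpedia:Paris", "dbo:country", "dbpedia:France"),
    ("dbpedia:Paris", "owl:sameAs", "geonames:2988507"),
]
entities, relations, edges = triples_to_edges(triples)
```

Even this toy transformation surfaces the open questions above: the mapping discards datatypes, ontological axioms and metadata, and keeping the integer ids in sync with a continuously evolving LOD is itself nontrivial.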
      <p>We suspect that each of these questions has the potential to spawn a host
of papers in the short, medium, and long-term future, and we hope to have the
opportunity to read, critique and write some of them.
3 It is known that lower quality degrades some embedding algorithms, but recent
algorithms like fastText (also from Facebook Research) have been designed to deal
with many different kinds of noise, including misspellings and out-of-vocabulary
words. A complete study analyzing the dependence of embedding quality on noise
is, to the best of our knowledge, lacking.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>DBpedia - a crystallization point for the Web of Data</article-title>
          .
          <source>Web Semantics: science, services and agents on the world wide web</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. M. Farber,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Menne</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          .
          <article-title>A comparative survey of dbpedia, freebase, opencyc, wikidata, and yago</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>node2vec: Scalable feature learning for networks</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Representation learning on graphs: Methods and applications</article-title>
          .
          <source>arXiv preprint arXiv:1709.05584</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          .
          <article-title>Populating a linked data entity name system: A big data solution to unsupervised instance matching</article-title>
          , volume
          <volume>27</volume>
          . IOS Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          .
          <article-title>Neural embeddings for populated geonames locations</article-title>
          .
          <source>In International Semantic Web Conference</source>
          , pages
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          .
          <article-title>Scalable generation of type embeddings using the abox</article-title>
          .
          <source>Open Journal of Semantic Web (OJSW)</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>20</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          .
          <article-title>Supervised typing of big graphs using semantic embeddings</article-title>
          .
          <source>In Proceedings of The International Workshop on Semantic Big Data, page 3. ACM</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wehrstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Peysakhovich</surname>
          </string-name>
          .
          <article-title>Pytorch-biggraph: A large-scale graph embedding system</article-title>
          .
          <source>arXiv preprint arXiv:1903.12287</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          .
          <article-title>DeepWalk: Online learning of social representations</article-title>
          .
          <source>In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>701</fpage>
          -
          <lpage>710</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmachtenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the linked data best practices in different topical domains</article-title>
          .
          <source>In The Semantic Web - ISWC 2014</source>
          , pages
          <fpage>245</fpage>
          -
          <lpage>260</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          .
          <article-title>LINE: Large-scale information network embedding</article-title>
          .
          <source>In Proceedings of the 24th international conference on world wide web</source>
          , pages
          <fpage>1067</fpage>
          -
          <lpage>1077</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          .
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>29</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>