Piecing the puzzle: Self-publishing queryable research data on the Web
Ruben Verborgh
Ghent University – imec – IDLab
ruben.verborgh@ugent.be
The original article is available at https://ruben.verborgh.org/articles/queryable-research-data/.

ABSTRACT
Publishing research on the Web accompanied by machine-readable data is one of the aims of Linked Research. Merely embedding metadata as RDFa in HTML research articles, however, does not solve the problems of accessing and querying that data. Hence, I created a simple ETL pipeline to extract and enrich Linked Data from my personal website, publishing the result in a queryable way through Triple Pattern Fragments. The pipeline is open source, uses existing ontologies, and can be adapted to other websites. In this article, I discuss this pipeline, the resulting data, and its possibilities for query evaluation on the Web. More than 35,000 RDF triples of my data are queryable, even with federated SPARQL queries, because of links to external datasets. This proves that researchers do not need to depend on centralized repositories for readily accessible (meta-)data, but instead can—and should—take matters into their own hands.

INTRODUCTION
The World Wide Web continues to shape many domains, not in the least research. On the one hand, the Web beautifully fulfills its role as a distribution channel of scientific knowledge, for which it was originally invented. This spurs interesting dialogues concerning Open Access [1] and even piracy [2] of research articles. On the other hand, the advent of social networking creates new interaction opportunities for researchers, but also forces us to consider our online presence [3]. Various social networks dedicated to research have emerged: Mendeley, ResearchGate, Academia, … They attract millions of researchers, and employ various tactics to keep us there.

A major issue of these social research networks is their lack of mutual complementarity. None of them has become a clear winner in terms of adoption. At first sight, the resulting plurality seems a blessing for diversity, compared to the monoculture of Facebook for social networking in general. Yet whereas other generic social networks such as Twitter and LinkedIn serve complementary professional purposes compared to Facebook, social research networks share nearly identical goals. As an example, a researcher could announce a newly accepted paper on Twitter, discuss its review process on Facebook, and share a photograph of an award on LinkedIn. In contrast, one would typically not exclusively list a specific publication on Mendeley and another on Academia, as neither publication list would be complete.

In practice, this results in constant bookkeeping for researchers who want each of their profiles to correctly represent them—a necessity if such profiles are implicitly or explicitly treated as performance indicators [4]. Deliberate absence on any of these networks is not a viable option, as parts of one's publication metadata might be automatically harvested or entered by co-authors, leaving an automatically generated but incomplete profile. Furthermore, the quality of such non-curated metadata records can be questionable. As a result, researchers who do not actively maintain their online research profiles risk ending up with incomplete and inaccurate publication lists on those networks. Such misrepresentation can be significantly worse than not being present at all—but given the public nature of publication metadata, complete absence is not an enforceable choice.

Online representation is not limited to social networks: scientific publishers also make metadata available about their journals and books. For instance, Springer Nature recently released SciGraph, a Linked Open Data platform that includes scholarly metadata. Accuracy is less of an issue in such cases, as data comes directly from the source. However, quality and usability are still influenced by the way data is modeled and by whether or how identifiers are disambiguated. Completeness is not guaranteed, given that authors typically target multiple publishers. Therefore, even such authoritative sources do not provide individual researchers with a correct profile.

In the spirit of decentralized social networking [5] and Linked Data [6], several researchers instead started publishing their own data and metadata. I am one of them, since I believe in practicing what we preach [7] as Linked Data advocates, and because I want my own website to act as the main authority for my data. After all, I can spend more effort on the completeness and accuracy of my publication metadata than most other platforms could reasonably do for me. In general, self-published data typically resides in separate RDF documents [8] (for which the FOAF vocabulary [9] is particularly popular [10]), or inside HTML documents (using RDFa Lite [11] or similar formats).
Despite the controllable quality of personally maintained research data and metadata in individual documents on the Web, they are not as visible, findable, and queryable as those of social research networks. I call a dataset interface "queryable" with respect to a given query when a consumer does not need to download the entire dataset in order to evaluate that query over it with full completeness. Unfortunately, hosting advanced search interfaces on a personal website quickly becomes complex and expensive. To mitigate this, I have implemented a simple Extract/Transform/Load (ETL) pipeline on top of my personal website, which extracts, enriches, and publishes my Linked Data in a queryable way through a Triple Pattern Fragments [12] interface. The resulting data can be browsed and queried live on the Web, with higher quality and flexibility than on my other online profiles, and at only a limited cost for me as data publisher.

This article describes my use case, which resembles that of many other researchers. I detail the design and implementation of the ETL pipeline, and report on its results. At the end, I list open questions regarding self-publication, before concluding with a reflection on the opportunities for the broader research community.

USE CASE

Available Data
Like the websites of many researchers, my personal website contains data about the following types of resources:
- people such as colleagues, collaborators, and fellow researchers
- research articles I have co-authored
- blog posts I have written
- courses I teach

This data is spread across different HTTP resources:
- a single RDF document (my FOAF profile) containing:
  - manually entered data (personal data, affiliations, projects)
  - automatically generated metadata (publications, blog posts)
- an HTML page with RDFa per:
  - publication (publication and author metadata)
  - blog post (post metadata)
  - HTML article (metadata and citations)
- …

Depending on the context, I encode the information with different vocabularies:
- Friend of a Friend (FOAF) (people, documents)
- Schema.org (blog posts, articles, courses)
- Bibliographic Ontology (BIBO) (publications)
- Citation Typing Ontology (CiTO) (citations)
- …

There is a considerable amount of overlap, since much data is available in more than one place, sometimes in different vocabularies. For example, webpages about my publications contain Schema.org markup (to facilitate indexing by search engines), whereas my profile describes the same publications more rigorously using BIBO and FOAF (for more advanced RDF clients). I deliberately reuse the same identifiers for the same resources everywhere, so identification is not an issue.

Data Publication Requirements
While the publication of structured data as RDF and RDFa is conveniently integrated in the webpage creation process, querying information over the entire website is difficult. For instance, starting from the homepage, obtaining a list of all people mentioned on the website would be non-trivial. In general, SPARQL query execution over Linked Data takes a considerable amount of time, and completeness cannot be guaranteed [13]. So while Linked Data documents are excellent for automated exploration of individual resources, and for aggregators such as search engines that can harvest the entire website, the possibilities of individual automated clients remain limited.

Another problem is the heterogeneity of vocabularies: clients without reasoning capabilities would only find subsets of the information, depending on which vocabulary is present in a given representation. Especially in RDFa, it would be cumbersome to combine every single occurrence of schema:name with the semantically equivalent dc:title, rdfs:label, and foaf:name. As such, people might have a foaf:name (because FOAF is common for people), publications a schema:name (because of schema:ScholarlyArticle), and neither an rdfs:label. Depending on the kind of information, queries would thus need different predicates for the concept "label". Similarly, queries for schema:Article or schema:CreativeWork would not return results because these types are not explicitly mentioned, even though their subclasses schema:BlogPosting and schema:ScholarlyArticle appear frequently.

Given the above considerations, the constraints of individual researchers, and the possibilities of social research networks, we formulate the following requirements:
- Automated clients should be able to evaluate queries with full completeness with respect to the data on the website.
- Semantically equivalent expressions should yield the same query results, regardless of vocabulary, with respect to all vocabularies used on the website.
- Queryable data can only involve a limited cost and effort for publishers as well as consumers.
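To make the kind of markup involved concrete, a publication page could carry Schema.org annotations in RDFa Lite along the following lines. This is an illustrative sketch only: the IRIs, names, and property selection are hypothetical, not the actual markup of my pages.

```html
<article vocab="http://schema.org/" typeof="ScholarlyArticle"
         resource="https://example.org/articles/my-paper/">
  <h1 property="name">An Example Paper</h1>
  by <a property="author"
        href="https://example.org/profile#me">Jane Doe</a>
</article>
```

A single vocab attribute keeps the page light, which is one reason the heterogeneity problem above arises: adding equivalent dc:title, rdfs:label, and foaf:name annotations by hand would bloat every page.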
ETL PIPELINE
To automate this process, I have developed a simple ETL pipeline. With the exception of a couple of finer points, the pipeline itself is fairly straightforward. What is surprising, however, is the impact such a simple pipeline can have, as discussed in the Results section. The pipeline consists of the following phases, which are discussed in the following subsections:
1. Extract all triples from the website's RDF and HTML+RDFa documents.
2. Reason over this data and its ontologies to complete gaps.
3. Publish the resulting data in a queryable interface.

The source code for the pipeline is available on GitHub. The pipeline can be run periodically, or triggered on website updates as part of a continuous integration process. In order to adapt it to different websites, the default ontology files can be replaced by others that are relevant for a given website.

Extract
The pipeline loops through all of the website's files (either through the local filesystem or through Web crawling) and makes lists of RDF documents and HTML+RDFa documents. The RDF documents are fed through the Serd parser to verify validity and for conversion into N-Triples [14], so that the rest of the pipeline can assume one triple per line. The RDFa is parsed into N-Triples by the RDFLib library for Python. Surprisingly, this library was the only one I found that correctly parsed RDFa Lite in (valid) HTML5; both Raptor and Apache Any23 seemed to expect a stricter document layout.
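The first part of the Extract phase, collecting the two document lists, can be sketched as follows. The file extensions are an assumption for illustration; the actual pipeline may also detect document types differently (for instance via HTTP Content-Type headers when crawling).

```python
from pathlib import Path

# Extension sets are illustrative assumptions, not the pipeline's own config.
RDF_EXTENSIONS = {".ttl", ".nt", ".rdf"}
HTML_EXTENSIONS = {".html", ".htm"}

def classify_documents(root):
    """Walk a website's files and split them into RDF and HTML+RDFa lists."""
    rdf_docs, html_docs = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in RDF_EXTENSIONS:
            rdf_docs.append(path)
        elif path.suffix in HTML_EXTENSIONS:
            html_docs.append(path)
    return rdf_docs, html_docs
```

The RDF list would then be fed through an N-Triples-producing parser such as Serd, and the HTML list through an RDFa parser.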
Reason
In order to fix gaps caused by implicit properties and classes, the pipeline performs reasoning over the extracted data and its ontologies to compute the deductive closure. The choice of ontologies is based on the data, and currently includes FOAF, DBpedia, CiTO, Schema.org, and the Organizations ontology. Additionally, I specified a limited number of custom OWL triples to indicate equivalences that hold on my website, but not necessarily in other contexts.

The pipeline delegates reasoning to the highly performant EYE reasoner [15], which does not have any RDFS or OWL knowledge built-in. Consequently, relevant RDFS and OWL theories can be selected manually, such that only a practical subset of the entire deductive closure is computed. For instance, my FOAF profile asserts that all resources on my site are different using owl:AllDifferent; a full deductive closure would result in an undesired combinatorial explosion of owl:differentFrom statements.

The website's dataset is enriched through the following steps:
1. The ontologies are skolemized [8] and concatenated into a single ontology file.
2. The deductive closure of the joined ontology is computed by passing it to the EYE reasoner with the RDFS and OWL theories.
3. The deductive closure of the website's data is computed by passing it to the EYE reasoner with the RDFS and OWL theories and the deductive closure of the ontology.
4. Ontological triples are removed from the data by subtracting triples that also occur in the deductive closure of the ontology.
5. Other unnecessary triples are removed, in particular triples with skolemized ontology IRIs, which are meaningless without the ontology.

These steps ensure that only triples directly related to the data are published, without any direct or derived triples from its ontologies, which form different datasets. By separating them, ontologies remain published as independent datasets, and users executing queries can explicitly choose which ontologies or datasets to include.

Passing the deductive closure of the joined ontology from step 2 to step 3 improves performance, as the derived ontology triples are already materialized. Given that ontologies change slowly, the output of steps 1 and 2 could be cached.

For example, when the original data contains

    art:publication schema:author rv:me.

and given that the DBpedia and Schema.org ontologies (before skolemization) contain

    dbo:author owl:equivalentProperty schema:author.
    schema:author rdfs:range [
      owl:unionOf (schema:Organization schema:Person)
    ].

then the raw reasoner output of step 3 (after skolemization) would be

    art:publication dbo:author rv:me.
    art:publication schema:author rv:me.
    rv:me rdf:type skolem:b0.
    dbo:author owl:equivalentProperty schema:author.
    schema:author rdfs:range skolem:b0.
    skolem:b0 owl:unionOf skolem:l1.
    skolem:l1 a rdf:List.
    skolem:l1 rdf:first schema:Organization.
    skolem:l1 rdf:rest skolem:l2.
    skolem:l2 a rdf:List.
    skolem:l2 rdf:first schema:Person.
    skolem:l2 rdf:rest rdf:nil.

The skolemization in step 1 ensures that blank nodes from ontologies have the same identifier before and after the reasoning runs in steps 2 and 3. Counting the data triple as line 1, the ontology excerpt as lines 2–5, and the reasoner output as lines 6–17: step 2 results in triples 9–17 (note the inferred triples 12 and 15), which are also present in the output of step 3, together with the added triples 6–8 derived from data triple 1. Because of the previous skolemization, triples 9–17 can be removed through a simple line-by-line difference, as they have identical N-Triples representations in the outputs of steps 2 and 3. Finally, step 5 removes triple 8, which is not meaningful as it points to an unreferenceable blank node in the Schema.org ontology. The resulting enriched data is:

    art:publication dbo:author rv:me.
    art:publication schema:author rv:me.

Thereby, data that was previously only described with Schema.org in RDFa also becomes available with DBpedia. Note that the example triple yields several more triples in the actual pipeline, which uses the full FOAF, Schema.org, and DBpedia ontologies.

Publish
The resulting triples are then published through a Triple Pattern Fragments (TPF) [12] interface, which allows clients to access a dataset by triple pattern. In essence, the lightweight TPF interface extends Linked Data's subject-based dereferencing by also providing predicate- and object-based lookup. Through this interface, clients can execute SPARQL queries with full completeness at limited server cost. Because of the simplicity of the interface, various back-ends are possible. For instance, the data from the pipeline can be served from memory by loading the generated N-Triples file, or the pipeline can compress it into a Header Dictionary Triples (HDT) [16] file.

Special care is taken to make IRIs dereferenceable [6] during the publication process. While I emphasize IRI reuse, some of my co-authors do not have their own profile, so I had to mint IRIs for them. Resolving such IRIs results in an HTTP 303 redirect to the TPF with data about the concept. For instance, the IRI https://data.verborgh.org/people/sam_coppens redirects to the TPF of triples with this IRI as subject.
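Steps 4 and 5 of the Reason phase can be sketched as a plain line-by-line set difference, which works precisely because skolemization guarantees identical N-Triples serializations across reasoning runs. The skolem IRI prefix below is an illustrative assumption, not the pipeline's actual one.

```python
# Hypothetical skolem IRI marker; the real pipeline's prefix may differ.
SKOLEM_PREFIX = "/.well-known/genid/"

def subtract_triples(data_closure, ontology_closure):
    """Both arguments are lists of N-Triples lines (one triple per line).
    Step 4: drop lines that also occur in the ontology's closure.
    Step 5: drop lines that mention skolemized ontology IRIs."""
    ontology = set(ontology_closure)
    return [line for line in data_closure
            if line not in ontology and SKOLEM_PREFIX not in line]
```

Because both inputs are already normalized to one triple per line, this stays a linear pass with a set lookup, and no RDF-aware comparison is needed.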
RESULTS
I applied the ETL pipeline to my personal website https://ruben.verborgh.org/ to verify its effectiveness. The data is published at https://data.verborgh.org/ruben and can be queried with a TPF client such as http://query.verborgh.org/. The results reflect the status of January 2017, and measurements were executed on a MacBook Pro with a 2.66 GHz Intel Core i7 processor and 8 GB of RAM.

Generated Triples
In total, 35,916 triples were generated in under 5 minutes from 6,307 profile triples and 12,564 unique triples from webpages. The table below shows the number of unique triples at each step and the time it took to obtain them. The main bottleneck is not reasoning (≈3,000 triples per second), but rather RDFa extraction (≈100 triples per second), which can fortunately be parallelized more easily.

    step                                       time (s)   # triples
    RDF(a) extraction                            170.0       17,050
    ontology skolemization                         0.6       44,179
    deductive closure of ontologies               38.8      144,549
    deductive closure of data and ontologies      61.8      183,282
    subtract ontological triples                   0.9       38,745
    subtract other triples                         1.0       35,916
    total                                        273.0       35,916

Table 1: The number of unique triples per phase, and the time it took to extract them.

While dataset size is not an indicator of quality [17], the accessibility of the data improves through the completion of inverse predicates and of equivalent or subordinate predicates and classes between ontologies. The table below lists the frequency of triples with specific predicates and classes before and after executing the pipeline.

    predicate or class         # pre   # post
    dc:title                     657      714
    rdfs:label                   473      714
    foaf:name                    394      714
    schema:name                  439      714
    schema:isPartOf              263      263
    schema:hasPart                 0      263
    cito:citesAsAuthority         14       14
    cito:cites                     0       33
    schema:citation                0       33
    foaf:Person                  196      196
    dbo:Person                     0      196
    schema:ScholarlyArticle      203      203
    schema:Article                 0      243
    schema:CreativeWork            0      478

Table 2: The number of triples with the given predicate or class before and after the execution of the pipeline, grouped by semantic relatedness.
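The pre/post counts in the table above can be obtained by a simple scan over the N-Triples output; a minimal sketch (not the pipeline's own tooling) follows, exploiting once more the one-triple-per-line invariant.

```python
from collections import Counter

def predicate_frequencies(ntriples_lines):
    """Count predicate occurrences in N-Triples lines (one triple per line)."""
    counts = Counter()
    for line in ntriples_lines:
        parts = line.split(None, 2)  # subject, predicate, remainder
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts
```

Counting class frequencies works the same way, restricted to lines whose predicate is rdf:type and keyed on the object instead.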
It is important to note that most improvements are solely the result of reasoning on existing ontologies; only 8 custom OWL triples were added (7 for equivalent properties, 1 for a symmetric property).

Quality
While computing the deductive closure should not introduce any inconsistencies, the quality of the ontologies directly impacts the result. While inspecting the initial output, I found the following conflicting triples, typing me as both a person and a company:

    rv:me rdf:type dbo:Person.
    rv:me rdf:type dbo:Company.

To find the cause of this inconsistency, I ran the reasoner on the website data and ontologies, but instead of asking for the deductive closure, I asked it to prove the second triple. The resulting proof traced the result back to the DBpedia ontology erroneously stating the equivalence of the schema:publisher and dbo:firstPublisher properties. While the former has both people and organisations in its range, the latter is specific to companies—hence the conflicting triple in the output. I reported this issue and manually corrected it in the ontology. Similarly, dbo:Website was deemed equivalent to schema:WebPage, whereas the latter should be schema:WebSite.
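Two of the eight custom OWL triples mentioned above correspond to statements made later in this article: the equivalence of rdfs:label and foaf:name, and the symmetry of foaf:knows. In Turtle, they could look as follows (prefix declarations added; the remaining six triples are not enumerated in this article):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Equivalences that hold on this website, but not necessarily elsewhere:
rdfs:label owl:equivalentProperty foaf:name .
foaf:knows a owl:SymmetricProperty .
```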
Disjointness constraints in the ontologies would help catch these mistakes. Further validation with RDFUnit [18] brought up a list of errors, but all of them turned out to be false positives.

Queries
Finally, I report on the execution time and number of results for a couple of example SPARQL queries. These were evaluated against the live TPF interface by a TPF client, and against the actual webpages and profile by a Linked Data-traversal-based client (SQUIN [19]). The intention is not to compare these query engines, as they use different paradigms and query semantics: TPF guarantees 100% completeness with respect to given datasets, whereas SQUIN considers reachable subwebs. The goal is rather to highlight the limits of querying over RDFa pages as practiced today, and to contrast this with the improved dataset resulting from the ETL pipeline.

To this end, I tested three scenarios on the public Web:
1. a Triple Pattern Fragments client (ldf-client 2.0.4) with the pipeline's TPF interface
2. a Linked Data client (SQUIN 20141016) with my homepage as seed
3. a Linked Data client (SQUIN 20141016) with my FOAF profile as seed

All clients started with an empty cache for every query, and the query timeout was set to 60 seconds. The waiting period between requests for SQUIN was disabled. For the federated query, the TPF client also accessed DBpedia, which the Linked Data client can find through link traversal. To highlight the impact of the seeds, queries avoid IRIs from my domain by using literals for concepts instead.

    query                           TPF (pipeline)    LD (home)     LD (profile)
                                      #     t (s)      #    t (s)     #    t (s)
    people I know (foaf:name)       196      2.1       0     5.6     14    60.0
    people I know (rdfs:label)      196      2.1       0     3.2    200    60.0
    publications I wrote            205      4.0       0    10.8      0    10.5
    my publications                 205      4.1     134    12.6    134    14.4
    my blog posts                    43      1.1      40     6.5     40     6.4
    my articles                     248      4.9       0     6.3      0     3.3
    a colleague's publications       32      1.1      20    13.9     20    16.3
    my first-author publications     46      2.7       0     3.8      6    36.2
    works I cite                     46      0.5       0     4.0      0    60.0
    my interests (federated)          4      0.4       0     4.0      4     1.8

Table 3: Number of results and execution time per query, comparing the TPF client on the enhanced data with Linked Data traversal on my website (starting from my home page or my FOAF profile).

The first two queries show the influence of ontological equivalences. At the time of writing, my website related me to 196 foaf:Persons through the foaf:knows predicate. If the query uses only the FOAF vocabulary, with foaf:name to obtain people's names, Linked Data traversal finds 14 results. If we use rdfs:label instead, it even finds additional results on external websites (because of link-traversal query semantics).

A second group of queries reveals the impact of link unidirectionality and of subclass and subproperty inference in queries for scholarly publications and blog posts. Through traversal, "publications I wrote" (with foaf:made) does not yield any results, whereas "my publications" (with schema:author) yields 134, even though both queries are semantically equivalent. Given that my profile actually contained 205 publications, the 71 missing publications are caused by SQUIN's implementation rather than being an inherent Linked Data limitation. Blog posts are found in all scenarios, even though the traversal client finds 3 fewer posts. Only the TPF client is able to find all articles, because the pipeline generated the inferred type schema:Article for publications and blog posts. Other more constrained queries for publications yield fewer results through traversal as well. Citations (cito:cites) are only identified by the TPF client, as articles solely mention its subproperties.

The final test examines a federated query: when starting from the profile, the Linked Data client also finds all results.

Regarding execution times, the measurements provide positive signals for low-cost infrastructures on the public Web. Note that both clients return results iteratively. With an average arrival rate of 53 results per second for the above queries, the TPF client's pace exceeds the processing capabilities of people, enabling usage in live applications. Even faster performance could be reached with, for instance, a data dump or a SPARQL endpoint; however, these would involve an added cost for either the data publisher or the consumer, and might have difficulties in federated contexts.
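As an indication of the kind of queries tested, the "people I know (foaf:name)" query might be formulated as follows. This is a sketch: the exact queries used in the experiments are not reproduced in this article. Note the literal instead of an IRI for the starting concept, matching the seed-neutral setup described above.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?me foaf:name "Ruben Verborgh".
  ?me foaf:knows ?person.
  ?person foaf:name ?name.
}
```

The rdfs:label variant replaces both foaf:name occurrences; on the enriched dataset both variants return identical results, whereas over raw RDFa pages they do not.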
OPEN QUESTIONS
Publishing RDFa data on my website over the past years—and subsequently creating the above pipeline—has left me with a couple of questions, some of which I discuss below.

A first question is what data should be encoded as Linked Data, and how it should be distributed across resources. In the past, I always had to decide whether to write data directly on the page as HTML+RDFa, whether to place it in my FOAF profile as RDF, whether to do both, or neither. The pipeline partially solves the "where" problem by gathering all data in a single interface. Even though each page explicitly links to the Linked Data-compatible TPF interface using void:inDataset—so traversal-based clients can also consume it—other clients might only extract the triples from an individual page. Furthermore, apart from the notable exception of search engine crawlers, it is hard to predict what data automated clients are looking for.

A closely related question is what ontologies should be used in which places. Given that authors have limited time, and in order not to make HTML pages too heavy, we should probably limit ourselves to a handful of vocabularies. When inter-vocabulary links are present, the pipeline can then materialize equivalent triples automatically. I have chosen Schema.org for most HTML pages, as this is consumed by several search engines. However, this vocabulary is rather loose and might not fit other clients. Perhaps the FOAF profile is the right place to elaborate, as this is a dedicated RDF document that attracts more specific-purpose clients compared to regular HTML pages.

Even after the above choices have been made, the flexibility of some vocabularies leads to additional decisions. For example, in HTML articles I mark up citations with the CiTO ontology. The domain and range of predicates such as cito:cites are open to documents, sections, paragraphs, and other units of information. However, choosing to cite an article from a paragraph influences how queries such as "citations in my articles" need to be written. Fortunately, the pipeline can infer the other triples, such that the section and document containing the paragraph also cite the article.
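The containment inference for citations can be illustrated as follows. The IRIs are hypothetical, and the propagation of citations along schema:hasPart reflects my website's custom semantics rather than anything prescribed by the CiTO ontology itself; only the subproperty step (cito:citesAsAuthority to cito:cites) follows from CiTO directly.

```turtle
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <https://example.org/article#> .

# Stated in the markup: a paragraph cites a work.
ex:paragraph cito:citesAsAuthority <https://example.org/cited-work> .
ex:article   schema:hasPart ex:paragraph .

# Derivable by the pipeline: subproperty and containment inference.
ex:paragraph cito:cites <https://example.org/cited-work> .
ex:article   cito:cites <https://example.org/cited-work> .
```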
When marking up data, I noticed that I sometimes attach stronger meaning to concepts than strictly prescribed by their ontologies. Some of these semantics are encoded in my custom OWL triples, whose contents contribute to the reasoning process (but do not appear directly in the output, as this would leak my semantics globally). For instance, I assume equivalence of rdfs:label and foaf:name for my purposes, and treat the foaf:knows relation as symmetrical (as in its textual—but not formal—definition). Using my own subproperties in these cases would encode more specific semantics, while the other properties could be derived by the pipeline. However, this would require maintaining a custom ontology, to which few queries would refer.

The reuse of identifiers is another source of debate. I opted as much as possible to reuse URLs for people and publications. The advantage is that this enables Linked Data traversal, so additional RDF triples can be picked up from FOAF profiles and other sources. The main drawback, however, is that the URLs do not dereference to my own datasource, which also contains data about their concepts. As a result, my RDF data contains a mix of URLs that dereference externally (such as http://csarven.ca/#i), URLs that dereference to my website (such as https://ruben.verborgh.org/articles/queryable-research-data/), and URLs that dereference to my TPF interface (such as https://data.verborgh.org/people/anastasia_dimou). Fortunately, the TPF interface can be considered an extension of the Linked Data principles [20], such that URLs can be "dereferenced" (or queried) on different domains as well, yet this does not help regular Linked Data crawlers. An alternative is using my own URLs everywhere and connecting them with external URLs through owl:sameAs, but then certain results would only be revealed to more complex SPARQL queries that explicitly consider multiple identifiers.

With regard to publishing, I wondered to what extent we should place RDF triples in the default graph on the Web at large. As noted above, inconsistencies can creep into the data; also, some of the things I state might reflect my beliefs rather than general truths. While RDFa does not have a standardized option to place data in named graphs, other types of RDF documents do. By moving my data to a dedicated graph, as is practiced by several datasets, I could create a separate context for these triples. This would also facilitate provenance and other applications, and it would then be up to the data consumer to decide how to treat that graph.

The above questions highlight the need for guidance and examples in addition to specifications and standards. Usage statistics could act as an additional information source. While HTTP logs from the TPF interface do not contain full SPARQL queries, they show the IRIs and triple patterns clients look for. Such behavioral information would not be available from clients or crawlers visiting HTML+RDFa pages.

Finally, when researchers start self-publishing their data in a queryable way at a large scale, we will need a connecting layer to approach the decentralized ecosystem efficiently through a single user interface. While federated query execution over multiple TPF interfaces on the public Web is feasible, as demonstrated above, this mechanism is impractical for querying hundreds or thousands of such interfaces. On the one hand, this indicates there will still be room for centralized indexes or aggregators, but their added value then shifts from data to services. On the other hand, research into decentralized technologies might make even such indexes obsolete.

CONCLUSION
RDFa makes semantic data publication easy for researchers who want to be in control of their online data and metadata. For those who prefer not to work directly on RDFa, or lack the knowledge to do so, annotation tools and editors can help with its production. In this article, I examined the question of how we can subsequently optimize the queryability of researchers' data on the Web, in order to facilitate its consumption by different kinds of clients.

Simple clients do not possess the capabilities of large-scale aggregators to obtain all Linked Data on a website. They encounter mostly individual HTML+RDFa webpages, which are always incomplete with respect to both the whole of knowledge on a website and the ontological constructs to express it. Furthermore, variations in reasoning capabilities make bridging between different ontologies difficult. The proposed ETL pipeline addresses these challenges by publishing a website's explicit and inferred triples in a queryable interface. The pipeline itself is simple and can be ported to different scenarios. If cost is an issue, the extraction and reasoning steps can run on public infrastructures such as Travis CI, as all involved software is open source. Queryable data need not be expensive either, as proven by free TPF interfaces on GitHub [21] and by the LOD Laundromat [22], which provides more than 600,000 TPF interfaces on a single server.

By publishing queryable research data, we contribute to the Linked Research vision: the proposed pipeline increases reusability and improves linking by completing semantic data through reasoning. The possibility to execute live queries—and in particular federated queries—enables new use cases, offering researchers additional incentives to self-publish their data. Even though I have focused on research data, the principles generalize to other domains. In particular, the Solid project for decentralized social applications could benefit from a similar pipeline to facilitate data querying and exchange across different parties in a scalable way.

Even as a researcher who has been publishing RDFa for years, I have often wondered about the significance of adding markup to individual pages. I doubted to what extent the individual pieces of data I created contributed to the larger puzzle of Linked Data on my site and other websites like it, given that they only existed within the confines of a single page. Building the pipeline enabled the execution of complex queries across pages, without significantly changing the maintenance cost of my website. From now on, every piece of data I mark up directly leads to one or more queryable triples, which provides me with a stronger motivation. If others follow the same path, we no longer need centralized data stores. We could execute federated queries across researchers' websites, using combinations of Linked Data traversal and more complex query interfaces that can guarantee completeness.
Centralized systems can play a Recommendation, World Wide Web Consortium, available at: crucial role by providing indexing and additional services, yet they https://www.w3.org/TR/rdfa-lite/. should act at most as secondary storage. [12] Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., et al. (2016), Unfortunately, exposing my own data in a queryable way does not “Triple Pattern Fragments: a Low-cost Knowledge Graph Interface relieve me yet of my frustration of synchronizing that data on cur- for the Web”, Journal of Web Semantics, Vol. 37–38, pp. 184–206, rent social research networks. It does make my data more search- available at: https://doi.org/doi:10.1016/j.websem.2016.03.003. able and useful though, and I deeply hope that one day, these net- works will synchronize with my interface instead of the other way [13] Hartig, O. (2013), “An Overview on Execution Strategies for round. Most of all, I hope that others will mark up their webpages Linked Data Queries”, Datenbank-Spektrum, Springer, Vol. 13 No. and make them queryable as well, so we can query research data on 2, pp. 89–99, available at: http://olafhartig.de/files/Hartig_LD- the Web instead of in silos. To realize this, we should each contrib- QueryExec_DBSpektrum2013_Preprint.pdf. ute our own pieces of data in a way that makes them fit together [14] Beckett, D. (2014), RDF 1.1 N-Triples, Recommendation, easily, instead of watching third parties mash our data into an en- World Wide Web Consortium, available at: https://www.w3.org/TR/ tirely different puzzle altogether. n-triples/. [15] Verborgh, R. and De Roo, J. (2015), “Drawing Conclusions REFERENCES from Linked Data on the Web”, IEEE Software, Vol. 32 No. 5, pp. [1] Harnad, S. and Brody, T. (2004), “Comparing the Impact of 23–27, available at: http://online.qmags.com/ Open Access (OA) vs. Non-OA Articles in the Same Journals”, ISW0515?cid=3244717&eid=19361&pg=25. 
D-Lib Magazine, June, available at: http://www.dlib.org/dlib/ june04/harnad/06harnad.html. [16] Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A. and Arias, M. (2013), “Binary RDF Representation for [2] Bohannon, J. (2016), “Who’s downloading pirated papers? Ev- Publication and Exchange (HDT)”, Journal of Web Semantics, Else- eryone”, Science, American Association for the Advancement of vier, Vol. 19, pp. 22–41, available at: http://www.websemanticsjour- Science, Vol. 352 No. 6285, pp. 508–512, available at: nal.org/index.php/ps/article/view/328. https://doi.org/10.1126/science.352.6285.508. [17] Vrandecı́c Denny, Krötzsch, M., Rudolph, S. and Lösch, U. [3] Van Noorden, R. (2014), “Online collaboration: Scientists and (2010), “Leveraging non-lexical knowledge for the linked open data the social network”, Nature, Vol. 512 No. 7513, pp. 126–129, avail- web”, Review of April Fool’s Day Transactions, Vol. 5, pp. 18–27, available at: http://km.aifb.kit.edu/projects/numbers/ through Linked Data Fragments”, in Bizer, C., Heath, T., Auer, S. linked_open_numbers.pdf. and Berners-Lee, T. (Eds.), Proceedings of the 7th Workshop on [18] Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Linked Data on the Web, Vol. 1184, CEUR Workshop Proceedings, Lehmann, J., Cornelissen, R. and Zaveri, A. (2014), “Test-driven available at: http://ceur-ws.org/Vol-1184/ldow2014_paper_04.pdf. Evaluation of Linked Data Quality”, in Proceedings of the 23rd In- [21] Matteis, L. and Verborgh, R. (2014), “Hosting Queryable and ternational Conference on World Wide Web, ACM, pp. 747–758, Highly Available Linked Data for Free”, in Proceedings of the available at: https://doi.org/10.1145/2566486.2568002. ISWC Developers Workshop 2014, Vol. 1268, CEUR Workshop Proceedings, pp. 13–18, available at: http://ceur-ws.org/Vol-1268/ [19] Hartig, O. (2011), “Zero-Knowledge Query Planning for an It- paper3.pdf. 
erator Implementation of Link Traversal Based Query Execution”, in Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, [22] Rietveld, L., Verborgh, R., Beek, W., Vander Sande, M. and D., De Leenheer, P. and Pan, J. (Eds.), Proceedings of the 8th Ex- Schlobach, S. (2015), “Linked Data-as-a-Service: The Semantic tended Semantic Web Conference, Vol. 6643, Lecture Notes in Web Redeployed”, in Gandon, F., Sabou, M., Sack, H., d’Amato, Computer Science, Springer, pp. 154–169, available at: C., Cudré-Mauroux, P. and Zimmermann, A. (Eds.), The Semantic https://doi.org/10.1007/978-3-642-21034-1_11. Web. Latest Advances and New Domains, Vol. 9088, Lecture Notes in Computer Science, Springer, pp. 471–487, available at: [20] Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., http://linkeddatafragments.org/publications/eswc2015-lodl.pdf. Mannens, E. and Van de Walle, R. (2014), “Web-Scale Querying
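The custom semantics and the citation-propagation behaviour described in the discussion can be illustrated with a small sketch. This is not the article's actual pipeline, which applies reasoning to the extracted RDF with real OWL machinery; it is a toy fixed-point inference over Python tuples. The ex: identifiers are hypothetical, and cito:cites and schema:hasPart merely stand in for whichever citation and containment properties a website actually uses.

```python
# Toy sketch of the pipeline's reasoning step (NOT the actual implementation).
# It applies the custom semantics from the discussion: rdfs:label and
# foaf:name treated as equivalent, foaf:knows treated as symmetric, and
# citations propagated from a paragraph to its enclosing section and document.
# All "ex:" names are hypothetical; cito:cites and schema:hasPart are
# illustrative stand-ins, not necessarily the vocabulary the website uses.

# Explicit triples, as they might be extracted from HTML+RDFa pages.
explicit = {
    ("ex:paragraph", "cito:cites", "ex:cited-article"),
    ("ex:section", "schema:hasPart", "ex:paragraph"),
    ("ex:document", "schema:hasPart", "ex:section"),
    ("ex:me", "foaf:knows", "ex:colleague"),
    ("ex:me", "rdfs:label", "Ruben"),
}

def infer_once(triples):
    """Apply the custom semantics one round, returning the enlarged set."""
    new = set(triples)
    for s, p, o in triples:
        if p in ("rdfs:label", "foaf:name"):   # equivalence of both properties
            new.add((s, "rdfs:label", o))
            new.add((s, "foaf:name", o))
        elif p == "foaf:knows":                # symmetry
            new.add((o, p, s))
        elif p == "cito:cites":                # containers cite what parts cite
            for container, p2, part in triples:
                if p2 == "schema:hasPart" and part == s:
                    new.add((container, "cito:cites", o))
    return new

# Saturate: repeat until no new triples appear (a fixed point), since an
# inferred triple (section cites) can trigger further ones (document cites).
triples = explicit
while (expanded := infer_once(triples)) != triples:
    triples = expanded
```

Because this inference happens before publication, a query such as "citations in my articles" finds complete answers even when the markup attached the citation only to a single paragraph.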