Piecing the puzzle: Self-publishing queryable research data on the Web
Ruben Verborgh
Ghent University – imec – IDLab
ruben.verborgh@ugent.be
The original article is available at https://ruben.verborgh.org/articles/queryable-research-data/.

ABSTRACT
Publishing research on the Web accompanied by machine-readable data is one of the aims of Linked Research. Merely embedding metadata as RDFa in HTML research articles, however, does not solve the problems of accessing and querying that data. Hence, I created a simple ETL pipeline to extract and enrich Linked Data from my personal website, publishing the result in a queryable way through Triple Pattern Fragments. The pipeline is open source, uses existing ontologies, and can be adapted to other websites. In this article, I discuss this pipeline, the resulting data, and its possibilities for query evaluation on the Web. More than 35,000 RDF triples of my data are queryable, even with federated SPARQL queries, because of links to external datasets. This proves that researchers do not need to depend on centralized repositories for readily accessible (meta-)data, but instead can—and should—take matters into their own hands.

INTRODUCTION
The World Wide Web continues to shape many domains, not in the least research. On the one hand, the Web beautifully fulfills its role as a distribution channel of scientific knowledge, for which it was originally invented. This spurs interesting dialogues concerning Open Access [1] and even piracy [2] of research articles. On the other hand, the advent of social networking creates new interaction opportunities for researchers, but also forces us to consider our online presence [3]. Various social networks dedicated to research have emerged: Mendeley, ResearchGate, Academia, … They attract millions of researchers, and employ various tactics to keep us there.

A major issue of these social research networks is their lack of mutual complementarity. None of them has become a clear winner in terms of adoption. At first sight, the resulting plurality seems a blessing for diversity, compared to the monoculture of Facebook for social networking in general. Yet whereas other generic social networks such as Twitter and LinkedIn serve complementary professional purposes compared to Facebook, social research networks share nearly identical goals. As an example, a researcher could announce a newly accepted paper on Twitter, discuss its review process on Facebook, and share a photograph of an award on LinkedIn. In contrast, one would typically not exclusively list a specific publication on Mendeley and another on Academia, as neither publication list would be complete.

In practice, this results in constant bookkeeping for researchers who want each of their profiles to correctly represent them—a necessity if such profiles are implicitly or explicitly treated as performance indicators [4]. Deliberate absence on any of these networks is not a viable option, as parts of one's publication metadata might be automatically harvested or entered by co-authors, leaving an automatically generated but incomplete profile. Furthermore, the quality of such non-curated metadata records can be questionable. As a result, researchers who do not actively maintain their online research profiles risk ending up with incomplete and inaccurate publication lists on those networks. Such misrepresentation can be significantly worse than not being present at all—but given the public nature of publication metadata, complete absence is not an enforceable choice.

Online representation is not limited to social networks: scientific publishers also make metadata available about their journals and books. For instance, Springer Nature recently released SciGraph, a Linked Open Data platform that includes scholarly metadata. Accuracy is less of an issue in such cases, as data comes directly from the source. However, quality and usability are still influenced by the way data is modeled and by whether or how identifiers are disambiguated. Completeness is not guaranteed, given that authors typically target multiple publishers. Therefore, even such authoritative sources do not provide individual researchers with a correct profile.

In the spirit of decentralized social networking [5] and Linked Data [6], several researchers instead started publishing their own data and metadata. I am one of them, since I believe in practicing what we preach [7] as Linked Data advocates, and because I want my own website to act as the main authority for my data. After all, I can spend more effort on the completeness and accuracy of my publication metadata than most other platforms could reasonably do for me. In general, self-published data typically resides in separate RDF documents [8] (for which the FOAF vocabulary [9] is particularly popular [10]), or inside HTML documents (using RDFa Lite [11] or similar formats).
Despite the controllable quality of personally maintained research data and metadata in individual documents on the Web, they are not as visible, findable, and queryable as those of social research networks. I call a dataset interface "queryable" with respect to a given query when a consumer does not need to download the entire dataset in order to evaluate that query over it with full completeness. Unfortunately, hosting advanced search interfaces on a personal website quickly becomes complex and expensive. To mitigate this, I have implemented a simple Extract/Transform/Load (ETL) pipeline on top of my personal website, which extracts, enriches, and publishes my Linked Data in a queryable way through a Triple Pattern Fragments [12] interface. The resulting data can be browsed and queried live on the Web, with higher quality and flexibility than on my other online profiles, and at only a limited cost for me as data publisher.

This article describes my use case, which resembles that of many other researchers. I detail the design and implementation of the ETL pipeline, and report on its results. At the end, I list open questions regarding self-publication, before concluding with a reflection on the opportunities for the broader research community.

USE CASE

Available Data
Like the websites of many researchers, my personal website contains data about the following types of resources:
- people such as colleagues, collaborators, and fellow researchers
- research articles I have co-authored
- blog posts I have written
- courses I teach

This data is spread across different HTTP resources:
- a single RDF document (my FOAF profile) containing:
  - manually entered data (personal data, affiliations, projects)
  - automatically generated metadata (publications, blog posts)
- an HTML page with RDFa per:
  - publication (publication and author metadata)
  - blog post (post metadata)
  - HTML article (metadata and citations)
- …

Depending on the context, I encode the information with different vocabularies:
- Friend of a Friend (FOAF) (people, documents)
- Schema.org (blog posts, articles, courses)
- Bibliographic Ontology (BIBO) (publications)
- Citation Typing Ontology (CiTO) (citations)
- …

There is a considerable amount of overlap, since much data is available in more than one place, sometimes in different vocabularies. For example, webpages about my publications contain Schema.org markup (to facilitate indexing by search engines), whereas my profile describes the same publications more rigorously using BIBO and FOAF (for more advanced RDF clients). I deliberately reuse the same identifiers for the same resources everywhere, so identification is not an issue.

Data Publication Requirements
While the publication of structured data as RDF and RDFa is conveniently integrated in the webpage creation process, querying information over the entire website is difficult. For instance, starting from the homepage, obtaining a list of all people mentioned on the website would be non-trivial. In general, SPARQL query execution over Linked Data takes a considerable amount of time, and completeness cannot be guaranteed [13]. So while Linked Data documents are excellent for automated exploration of individual resources, and for aggregators such as search engines that can harvest the entire website, the possibilities of individual automated clients remain limited.

Another problem is the heterogeneity of vocabularies: clients without reasoning capabilities would only find subsets of the information, depending on which vocabulary is present in a given representation. Especially in RDFa, it would be cumbersome to combine every single occurrence of schema:name with the semantically equivalent dc:title, rdfs:label, and foaf:name. As such, people might have a foaf:name (because FOAF is common for people), publications a schema:name (because of schema:ScholarlyArticle), and neither an rdfs:label. Depending on the kind of information, queries would thus need different predicates for the concept "label". Similarly, queries for schema:Article or schema:CreativeWork would not return results because these types are not explicitly mentioned, even though their subclasses schema:BlogPosting and schema:ScholarlyArticle appear frequently.

Given the above considerations, the constraints of individual researchers, and the possibilities of social research networks, we formulate the following requirements:
- Automated clients should be able to evaluate queries with full completeness with respect to the data on the website.
- Semantically equivalent expressions should yield the same query results, regardless of vocabulary, with respect to all vocabularies used on the website.
- Queryable data can only involve a limited cost and effort for publishers as well as consumers.
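To make the kind of markup involved concrete, a publication page could carry Schema.org annotations in RDFa Lite along the following lines. This is an illustrative sketch only: the IRIs, names, and property selection are hypothetical, not the actual markup of my pages.

```html
<article vocab="http://schema.org/" typeof="ScholarlyArticle"
         resource="https://example.org/articles/my-paper/">
  <h1 property="name">An Example Paper</h1>
  by <a property="author"
        href="https://example.org/profile#me">Jane Doe</a>
</article>
```

A single vocab attribute keeps the page light, which is one reason the heterogeneity problem above arises: adding equivalent dc:title, rdfs:label, and foaf:name annotations by hand would bloat every page.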
ETL PIPELINE
To automate this process, I have developed a simple ETL pipeline. With the exception of a couple of finer points, the pipeline itself is fairly straightforward. What is surprising, however, is the impact such a simple pipeline can have, as discussed in the Results section. The pipeline consists of the following phases, which are discussed in the following subsections:
1. Extract all triples from the website's RDF and HTML+RDFa documents.
2. Reason over this data and its ontologies to complete gaps.
3. Publish the resulting data in a queryable interface.

The source code for the pipeline is available on GitHub. The pipeline can be run periodically, or triggered on website updates as part of a continuous integration process. In order to adapt it to different websites, the default ontology files can be replaced by others that are relevant for a given website.

Extract
The pipeline loops through all of the website's files (either through the local filesystem or through Web crawling) and makes lists of RDF documents and HTML+RDFa documents. The RDF documents are fed through the Serd parser to verify validity and for conversion into N-Triples [14], so that the rest of the pipeline can assume one triple per line. The RDFa is parsed into N-Triples by the RDFLib library for Python. Surprisingly, this library was the only one I found that correctly parsed RDFa Lite in (valid) HTML5; both Raptor and Apache Any23 seemed to expect a stricter document layout.
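The first part of the Extract phase, collecting the two document lists, can be sketched as follows. The file extensions are an assumption for illustration; the actual pipeline may also detect document types differently (for instance via HTTP Content-Type headers when crawling).

```python
from pathlib import Path

# Extension sets are illustrative assumptions, not the pipeline's own config.
RDF_EXTENSIONS = {".ttl", ".nt", ".rdf"}
HTML_EXTENSIONS = {".html", ".htm"}

def classify_documents(root):
    """Walk a website's files and split them into RDF and HTML+RDFa lists."""
    rdf_docs, html_docs = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in RDF_EXTENSIONS:
            rdf_docs.append(path)
        elif path.suffix in HTML_EXTENSIONS:
            html_docs.append(path)
    return rdf_docs, html_docs
```

The RDF list would then be fed through an N-Triples-producing parser such as Serd, and the HTML list through an RDFa parser.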
Reason
In order to fix gaps caused by implicit properties and classes, the pipeline performs reasoning over the extracted data and its ontologies to compute the deductive closure. The choice of ontologies is based on the data, and currently includes FOAF, DBpedia, CiTO, Schema.org, and the Organizations ontology. Additionally, I specified a limited number of custom OWL triples to indicate equivalences that hold on my website, but not necessarily in other contexts.

The pipeline delegates reasoning to the highly performant EYE reasoner [15], which does not have any RDFS or OWL knowledge built-in. Consequently, relevant RDFS and OWL theories can be selected manually, such that only a practical subset of the entire deductive closure is computed. For instance, my FOAF profile asserts that all resources on my site are different using owl:AllDifferent; a full deductive closure would result in an undesired combinatorial explosion of owl:differentFrom statements.

The website's dataset is enriched through the following steps:
1. The ontologies are skolemized [8] and concatenated into a single ontology file.
2. The deductive closure of the joined ontology is computed by passing it to the EYE reasoner with the RDFS and OWL theories.
3. The deductive closure of the website's data is computed by passing it to the EYE reasoner with the RDFS and OWL theories and the deductive closure of the ontology.
4. Ontological triples are removed from the data by subtracting triples that also occur in the deductive closure of the ontology.
5. Other unnecessary triples are removed, in particular triples with skolemized ontology IRIs, which are meaningless without the ontology.

These steps ensure that only triples directly related to the data are published, without any direct or derived triples from its ontologies, which form different datasets. By separating them, ontologies remain published as independent datasets, and users executing queries can explicitly choose which ontologies or datasets to include.

Passing the deductive closure of the joined ontology from step 2 to step 3 improves performance, as the derived ontology triples are already materialized. Given that ontologies change slowly, the output of steps 1 and 2 could be cached.

For example, when the original data contains

    art:publication schema:author rv:me.

and given that the DBpedia and Schema.org ontologies (before skolemization) contain

    dbo:author owl:equivalentProperty schema:author.
    schema:author rdfs:range [
      owl:unionOf (schema:Organization schema:Person)
    ].

then the raw reasoner output of step 3 (after skolemization) would be

    art:publication dbo:author rv:me.
    art:publication schema:author rv:me.
    rv:me rdf:type skolem:b0.
    dbo:author owl:equivalentProperty schema:author.
    schema:author rdfs:range skolem:b0.
    skolem:b0 owl:unionOf skolem:l1.
    skolem:l1 a rdf:List.
    skolem:l1 rdf:first schema:Organization.
    skolem:l1 rdf:rest skolem:l2.
    skolem:l2 a rdf:List.
    skolem:l2 rdf:first schema:Person.
    skolem:l2 rdf:rest rdf:nil.

The skolemization in step 1 ensures that blank nodes from ontologies have the same identifier before and after the reasoning runs in steps 2 and 3. Counting the data triple as line 1, the ontology excerpt as lines 2–5, and the reasoner output as lines 6–17: step 2 results in triples 9–17 (note the inferred triples 12 and 15), which are also present in the output of step 3, together with the added triples 6–8 derived from data triple 1. Because of the previous skolemization, triples 9–17 can be removed through a simple line-by-line difference, as they have identical N-Triples representations in the outputs of steps 2 and 3. Finally, step 5 removes triple 8, which is not meaningful as it points to an unreferenceable blank node in the Schema.org ontology. The resulting enriched data is:

    art:publication dbo:author rv:me.
    art:publication schema:author rv:me.

Thereby, data that was previously only described with Schema.org in RDFa also becomes available with DBpedia. Note that the example triple yields several more triples in the actual pipeline, which uses the full FOAF, Schema.org, and DBpedia ontologies.

Publish
The resulting triples are then published through a Triple Pattern Fragments (TPF) [12] interface, which allows clients to access a dataset by triple pattern. In essence, the lightweight TPF interface extends Linked Data's subject-based dereferencing by also providing predicate- and object-based lookup. Through this interface, clients can execute SPARQL queries with full completeness at limited server cost. Because of the simplicity of the interface, various back-ends are possible. For instance, the data from the pipeline can be served from memory by loading the generated N-Triples file, or the pipeline can compress it into a Header Dictionary Triples (HDT) [16] file.

Special care is taken to make IRIs dereferenceable [6] during the publication process. While I emphasize IRI reuse, some of my co-authors do not have their own profile, so I had to mint IRIs for them. Resolving such IRIs results in an HTTP 303 redirect to the TPF with data about the concept. For instance, the IRI https://data.verborgh.org/people/sam_coppens redirects to the TPF of triples with this IRI as subject.
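Steps 4 and 5 of the Reason phase can be sketched as a plain line-by-line set difference, which works precisely because skolemization guarantees identical N-Triples serializations across reasoning runs. The skolem IRI prefix below is an illustrative assumption, not the pipeline's actual one.

```python
# Hypothetical skolem IRI marker; the real pipeline's prefix may differ.
SKOLEM_PREFIX = "/.well-known/genid/"

def subtract_triples(data_closure, ontology_closure):
    """Both arguments are lists of N-Triples lines (one triple per line).
    Step 4: drop lines that also occur in the ontology's closure.
    Step 5: drop lines that mention skolemized ontology IRIs."""
    ontology = set(ontology_closure)
    return [line for line in data_closure
            if line not in ontology and SKOLEM_PREFIX not in line]
```

Because both inputs are already normalized to one triple per line, this stays a linear pass with a set lookup, and no RDF-aware comparison is needed.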
RESULTS
I applied the ETL pipeline to my personal website https://ruben.verborgh.org/ to verify its effectiveness. The data is published at https://data.verborgh.org/ruben and can be queried with a TPF client such as http://query.verborgh.org/. The results reflect the status of January 2017, and measurements were executed on a MacBook Pro with a 2.66 GHz Intel Core i7 processor and 8 GB of RAM.

Generated Triples
In total, 35,916 triples were generated in under 5 minutes from 6,307 profile triples and 12,564 unique triples from webpages. The table below shows the number of unique triples at each step and the time it took to obtain them. The main bottleneck is not reasoning (≈3,000 triples per second), but rather RDFa extraction (≈100 triples per second), which can fortunately be parallelized more easily.

    step                                       time (s)   # triples
    RDF(a) extraction                            170.0       17,050
    ontology skolemization                         0.6       44,179
    deductive closure of ontologies               38.8      144,549
    deductive closure of data and ontologies      61.8      183,282
    subtract ontological triples                   0.9       38,745
    subtract other triples                         1.0       35,916
    total                                        273.0       35,916

Table 1: The number of unique triples per phase, and the time it took to extract them.

While dataset size is not an indicator of quality [17], the accessibility of the data improves through the completion of inverse predicates and of equivalent or subordinate predicates and classes between ontologies. The table below lists the frequency of triples with specific predicates and classes before and after executing the pipeline.

    predicate or class         # pre   # post
    dc:title                     657      714
    rdfs:label                   473      714
    foaf:name                    394      714
    schema:name                  439      714
    schema:isPartOf              263      263
    schema:hasPart                 0      263
    cito:citesAsAuthority         14       14
    cito:cites                     0       33
    schema:citation                0       33
    foaf:Person                  196      196
    dbo:Person                     0      196
    schema:ScholarlyArticle      203      203
    schema:Article                 0      243
    schema:CreativeWork            0      478

Table 2: The number of triples with the given predicate or class before and after the execution of the pipeline, grouped by semantic relatedness.
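The pre/post counts in the table above can be obtained by a simple scan over the N-Triples output; a minimal sketch (not the pipeline's own tooling) follows, exploiting once more the one-triple-per-line invariant.

```python
from collections import Counter

def predicate_frequencies(ntriples_lines):
    """Count predicate occurrences in N-Triples lines (one triple per line)."""
    counts = Counter()
    for line in ntriples_lines:
        parts = line.split(None, 2)  # subject, predicate, remainder
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts
```

Counting class frequencies works the same way, restricted to lines whose predicate is rdf:type and keyed on the object instead.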
It is important to note that most improvements are solely the result of reasoning on existing ontologies; only 8 custom OWL triples were added (7 for equivalent properties, 1 for a symmetric property).

Quality
While computing the deductive closure should not introduce any inconsistencies, the quality of the ontologies directly impacts the result. While inspecting the initial output, I found the following conflicting triples, typing me as both a person and a company:

    rv:me rdf:type dbo:Person.
    rv:me rdf:type dbo:Company.

To find the cause of this inconsistency, I ran the reasoner on the website data and ontologies, but instead of asking for the deductive closure, I asked it to prove the second triple. The resulting proof traced the result back to the DBpedia ontology erroneously stating the equivalence of the schema:publisher and dbo:firstPublisher properties. While the former has both people and organisations in its range, the latter is specific to companies—hence the conflicting triple in the output. I reported this issue and manually corrected it in the ontology. Similarly, dbo:Website was deemed equivalent to schema:WebPage, whereas the latter should be schema:WebSite.
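Two of the eight custom OWL triples mentioned above correspond to statements made later in this article: the equivalence of rdfs:label and foaf:name, and the symmetry of foaf:knows. In Turtle, they could look as follows (prefix declarations added; the remaining six triples are not enumerated in this article):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Equivalences that hold on this website, but not necessarily elsewhere:
rdfs:label owl:equivalentProperty foaf:name .
foaf:knows a owl:SymmetricProperty .
```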
Disjointness constraints in the ontologies would help catch these mistakes. Further validation with RDFUnit [18] brought up a list of errors, but all of them turned out to be false positives.

Queries
Finally, I report on the execution time and number of results for a couple of example SPARQL queries. These were evaluated against the live TPF interface by a TPF client, and against the actual webpages and profile by a Linked Data-traversal-based client (SQUIN [19]). The intention is not to compare these query engines, as they use different paradigms and query semantics: TPF guarantees 100% completeness with respect to given datasets, whereas SQUIN considers reachable subwebs. The goal is rather to highlight the limits of querying over RDFa pages as practiced today, and to contrast this with the improved dataset resulting from the ETL pipeline.

To this end, I tested three scenarios on the public Web:
1. a Triple Pattern Fragments client (ldf-client 2.0.4) with the pipeline's TPF interface
2. a Linked Data client (SQUIN 20141016) with my homepage as seed
3. a Linked Data client (SQUIN 20141016) with my FOAF profile as seed

All clients started with an empty cache for every query, and the query timeout was set to 60 seconds. The waiting period between requests for SQUIN was disabled. For the federated query, the TPF client also accessed DBpedia, which the Linked Data client can find through link traversal. To highlight the impact of the seeds, queries avoid IRIs from my domain by using literals for concepts instead.

    query                           TPF (pipeline)    LD (home)     LD (profile)
                                      #     t (s)      #    t (s)     #    t (s)
    people I know (foaf:name)       196      2.1       0     5.6     14    60.0
    people I know (rdfs:label)      196      2.1       0     3.2    200    60.0
    publications I wrote            205      4.0       0    10.8      0    10.5
    my publications                 205      4.1     134    12.6    134    14.4
    my blog posts                    43      1.1      40     6.5     40     6.4
    my articles                     248      4.9       0     6.3      0     3.3
    a colleague's publications       32      1.1      20    13.9     20    16.3
    my first-author publications     46      2.7       0     3.8      6    36.2
    works I cite                     46      0.5       0     4.0      0    60.0
    my interests (federated)          4      0.4       0     4.0      4     1.8

Table 3: Number of results and execution time per query, comparing the TPF client on the enhanced data with Linked Data traversal on my website (starting from my home page or my FOAF profile).

The first two queries show the influence of ontological equivalences. At the time of writing, my website related me to 196 foaf:Persons through the foaf:knows predicate. If the query uses only the FOAF vocabulary, with foaf:name to obtain people's names, Linked Data traversal finds 14 results. If we use rdfs:label instead, it even finds additional results on external websites (because of link-traversal query semantics).

A second group of queries reveals the impact of link unidirectionality and of subclass and subproperty inference in queries for scholarly publications and blog posts. Through traversal, "publications I wrote" (with foaf:made) does not yield any results, whereas "my publications" (with schema:author) yields 134, even though both queries are semantically equivalent. Given that my profile actually contained 205 publications, the 71 missing publications are caused by SQUIN's implementation rather than being an inherent Linked Data limitation. Blog posts are found in all scenarios, even though the traversal client finds 3 fewer posts. Only the TPF client is able to find all articles, because the pipeline generated the inferred type schema:Article for publications and blog posts. Other more constrained queries for publications yield fewer results through traversal as well. Citations (cito:cites) are only identified by the TPF client, as articles solely mention its subproperties.

The final test examines a federated query: when starting from the profile, the Linked Data client also finds all results.

Regarding execution times, the measurements provide positive signals for low-cost infrastructures on the public Web. Note that both clients return results iteratively. With an average arrival rate of 53 results per second for the above queries, the TPF client's pace exceeds the processing capabilities of people, enabling usage in live applications. Even faster performance could be reached with, for instance, a data dump or a SPARQL endpoint; however, these would involve an added cost for either the data publisher or the consumer, and might have difficulties in federated contexts.
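As an indication of the kind of queries tested, the "people I know (foaf:name)" query might be formulated as follows. This is a sketch: the exact queries used in the experiments are not reproduced in this article. Note the literal instead of an IRI for the starting concept, matching the seed-neutral setup described above.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?me foaf:name "Ruben Verborgh".
  ?me foaf:knows ?person.
  ?person foaf:name ?name.
}
```

The rdfs:label variant replaces both foaf:name occurrences; on the enriched dataset both variants return identical results, whereas over raw RDFa pages they do not.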
OPEN QUESTIONS
Publishing RDFa data on my website over the past years—and subsequently creating the above pipeline—has left me with a couple of questions, some of which I discuss below.

A first question is what data should be encoded as Linked Data, and how it should be distributed across resources. In the past, I always had to decide whether to write data directly on the page as HTML+RDFa, whether to place it in my FOAF profile as RDF, whether to do both, or neither. The pipeline partially solves the "where" problem by gathering all data in a single interface. Even though each page explicitly links to the Linked Data-compatible TPF interface using void:inDataset—so traversal-based clients can also consume it—other clients might only extract the triples from an individual page. Furthermore, apart from the notable exception of search engine crawlers, it is hard to predict what data automated clients are looking for.

A closely related question is what ontologies should be used in which places. Given that authors have limited time, and in order not to make HTML pages too heavy, we should probably limit ourselves to a handful of vocabularies. When inter-vocabulary links are present, the pipeline can then materialize equivalent triples automatically. I have chosen Schema.org for most HTML pages, as this is consumed by several search engines. However, this vocabulary is rather loose and might not fit other clients. Perhaps the FOAF profile is the right place to elaborate, as this is a dedicated RDF document that attracts more specific-purpose clients compared to regular HTML pages.

Even after the above choices have been made, the flexibility of some vocabularies leads to additional decisions. For example, in HTML articles I mark up citations with the CiTO ontology. The domain and range of predicates such as cito:cites are open to documents, sections, paragraphs, and other units of information. However, choosing to cite an article from a paragraph influences how queries such as "citations in my articles" need to be written. Fortunately, the pipeline can infer the other triples, such that the section and document containing the paragraph also cite the article.
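The containment inference for citations can be illustrated as follows. The IRIs are hypothetical, and the propagation of citations along schema:hasPart reflects my website's custom semantics rather than anything prescribed by the CiTO ontology itself; only the subproperty step (cito:citesAsAuthority to cito:cites) follows from CiTO directly.

```turtle
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix schema: <http://schema.org/> .
@prefix ex:     <https://example.org/article#> .

# Stated in the markup: a paragraph cites a work.
ex:paragraph cito:citesAsAuthority <https://example.org/cited-work> .
ex:article   schema:hasPart ex:paragraph .

# Derivable by the pipeline: subproperty and containment inference.
ex:paragraph cito:cites <https://example.org/cited-work> .
ex:article   cito:cites <https://example.org/cited-work> .
```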
When marking up data, I noticed that I sometimes attach stronger meaning to concepts than strictly prescribed by their ontologies. Some of these semantics are encoded in my custom OWL triples, whose contents contribute to the reasoning process (but do not appear directly in the output, as this would leak my semantics globally). For instance, I assume equivalence of rdfs:label and foaf:name for my purposes, and treat the foaf:knows relation as symmetrical (as in its textual—but not formal—definition). Using my own subproperties in these cases would encode more specific semantics, while the other properties could be derived by the pipeline. However, this would require maintaining a custom ontology, to which few queries would refer.

The reuse of identifiers is another source of debate. I opted as much as possible to reuse URLs for people and publications. The advantage is that this enables Linked Data traversal, so additional RDF triples can be picked up from FOAF profiles and other sources. The main drawback, however, is that the URLs do not dereference to my own datasource, which also contains data about their concepts. As a result, my RDF data contains a mix of URLs that dereference externally (such as http://csarven.ca/#i), URLs that dereference to my website (such as https://ruben.verborgh.org/articles/queryable-research-data/), and URLs that dereference to my TPF interface (such as https://data.verborgh.org/people/anastasia_dimou). Fortunately, the TPF interface can be considered an extension of the Linked Data principles [20], such that URLs can be "dereferenced" (or queried) on different domains as well, yet this does not help regular Linked Data crawlers. An alternative is using my own URLs everywhere and connecting them with external URLs through owl:sameAs, but then certain results would only be revealed to more complex SPARQL queries that explicitly consider multiple identifiers.

With regard to publishing, I wondered to what extent we should place RDF triples in the default graph on the Web at large. As noted above, inconsistencies can creep into the data; also, some of the things I state might reflect my beliefs rather than general truths. While RDFa does not have a standardized option to place data in named graphs, other types of RDF documents do. By moving my data to a dedicated graph, as is practiced by several datasets, I could create a separate context for these triples. This would also facilitate provenance and other applications, and it would then be up to the data consumer to decide how to treat that graph.

The above questions highlight the need for guidance and examples in addition to specifications and standards. Usage statistics could act as an additional information source. While HTTP logs from the TPF interface do not contain full SPARQL queries, they show the IRIs and triple patterns clients look for. Such behavioral information would not be available from clients or crawlers visiting HTML+RDFa pages.

Finally, when researchers start self-publishing their data in a queryable way at a large scale, we will need a connecting layer to approach the decentralized ecosystem efficiently through a single user interface. While federated query execution over multiple TPF interfaces on the public Web is feasible, as demonstrated above, this mechanism is impractical for querying hundreds or thousands of such interfaces. On the one hand, this indicates there will still be room for centralized indexes or aggregators, but their added value then shifts from data to services. On the other hand, research into decentralized technologies might make even such indexes obsolete.

CONCLUSION
RDFa makes semantic data publication easy for researchers who want to be in control of their online data and metadata. For those who prefer not to work directly on RDFa, or lack the knowledge to do so, annotation tools and editors can help with its production. In this article, I examined the question of how we can subsequently optimize the queryability of researchers' data on the Web, in order to facilitate its consumption by different kinds of clients.

Simple clients do not possess the capabilities of large-scale aggregators to obtain all Linked Data on a website. They encounter mostly individual HTML+RDFa webpages, which are always incomplete with respect to both the whole of knowledge on a website and the ontological constructs to express it. Furthermore, variations in reasoning capabilities make bridging between different ontologies difficult. The proposed ETL pipeline addresses these challenges by publishing a website's explicit and inferred triples in a queryable interface. The pipeline itself is simple and can be ported to different scenarios. If cost is an issue, the extraction and reasoning steps can run on public infrastructures such as Travis CI, as all involved software is open source. Queryable data need not be expensive either, as proven by free TPF interfaces on GitHub [21] and by the LOD Laundromat [22], which provides more than 600,000 TPF interfaces on a single server.

By publishing queryable research data, we contribute to the Linked Research vision: the proposed pipeline increases reusability and improves linking by completing semantic data through reasoning. The possibility to execute live queries—and in particular federated queries—enables new use cases, offering researchers additional incentives to self-publish their data. Even though I have focused on research data, the principles generalize to other domains. In particular, the Solid project for decentralized social applications could benefit from a similar pipeline to facilitate data querying and exchange across different parties in a scalable way.

Even as a researcher who has been publishing RDFa for years, I have often wondered about the significance of adding markup to individual pages. I doubted to what extent the individual pieces of data I created contributed to the larger puzzle of Linked Data on my site and other websites like it, given that they only existed within the confines of a single page. Building the pipeline enabled the execution of complex queries across pages, without significantly changing the maintenance cost of my website. From now on, every piece of data I mark up directly leads to one or more queryable triples, which provides me with a stronger motivation. If others follow the same path, we no longer need centralized data stores. We could execute federated queries across researchers' websites, using combinations of Linked Data traversal and more complex query interfaces that can guarantee completeness.
Centralized systems can play a Recommendation, World Wide Web Consortium, available at: crucial role by providing indexing and additional services, yet they https://www.w3.org/TR/rdfa-lite/. should act at most as secondary storage. [12] Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., et al. (2016), Unfortunately, exposing my own data in a queryable way does not “Triple Pattern Fragments: a Low-cost Knowledge Graph Interface relieve me yet of my frustration of synchronizing that data on cur- for the Web”, Journal of Web Semantics, Vol. 37–38, pp. 184–206, rent social research networks. It does make my data more search- available at: https://doi.org/doi:10.1016/j.websem.2016.03.003. able and useful though, and I deeply hope that one day, these net- works will synchronize with my interface instead of the other way [13] Hartig, O. (2013), “An Overview on Execution Strategies for round. Most of all, I hope that others will mark up their webpages Linked Data Queries”, Datenbank-Spektrum, Springer, Vol. 13 No. and make them queryable as well, so we can query research data on 2, pp. 89–99, available at: http://olafhartig.de/files/Hartig_LD- the Web instead of in silos. To realize this, we should each contrib- QueryExec_DBSpektrum2013_Preprint.pdf. ute our own pieces of data in a way that makes them fit together [14] Beckett, D. (2014), RDF 1.1 N-Triples, Recommendation, easily, instead of watching third parties mash our data into an en- World Wide Web Consortium, available at: https://www.w3.org/TR/ tirely different puzzle altogether. n-triples/. [15] Verborgh, R. and De Roo, J. (2015), “Drawing Conclusions REFERENCES from Linked Data on the Web”, IEEE Software, Vol. 32 No. 5, pp. [1] Harnad, S. and Brody, T. (2004), “Comparing the Impact of 23–27, available at: http://online.qmags.com/ Open Access (OA) vs. Non-OA Articles in the Same Journals”, ISW0515?cid=3244717&eid=19361&pg=25. 
D-Lib Magazine, June, available at: http://www.dlib.org/dlib/ june04/harnad/06harnad.html. [16] Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A. and Arias, M. (2013), “Binary RDF Representation for [2] Bohannon, J. (2016), “Who’s downloading pirated papers? Ev- Publication and Exchange (HDT)”, Journal of Web Semantics, Else- eryone”, Science, American Association for the Advancement of vier, Vol. 19, pp. 22–41, available at: http://www.websemanticsjour- Science, Vol. 352 No. 6285, pp. 508–512, available at: nal.org/index.php/ps/article/view/328. https://doi.org/10.1126/science.352.6285.508. [17] Vrandecı́c Denny, Krötzsch, M., Rudolph, S. and Lösch, U. [3] Van Noorden, R. (2014), “Online collaboration: Scientists and (2010), “Leveraging non-lexical knowledge for the linked open data the social network”, Nature, Vol. 512 No. 7513, pp. 126–129, avail- web”, Review of April Fool’s Day Transactions, Vol. 5, pp. 18–27, available at: http://km.aifb.kit.edu/projects/numbers/ through Linked Data Fragments”, in Bizer, C., Heath, T., Auer, S. linked_open_numbers.pdf. and Berners-Lee, T. (Eds.), Proceedings of the 7th Workshop on [18] Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Linked Data on the Web, Vol. 1184, CEUR Workshop Proceedings, Lehmann, J., Cornelissen, R. and Zaveri, A. (2014), “Test-driven available at: http://ceur-ws.org/Vol-1184/ldow2014_paper_04.pdf. Evaluation of Linked Data Quality”, in Proceedings of the 23rd In- [21] Matteis, L. and Verborgh, R. (2014), “Hosting Queryable and ternational Conference on World Wide Web, ACM, pp. 747–758, Highly Available Linked Data for Free”, in Proceedings of the available at: https://doi.org/10.1145/2566486.2568002. ISWC Developers Workshop 2014, Vol. 1268, CEUR Workshop Proceedings, pp. 13–18, available at: http://ceur-ws.org/Vol-1268/ [19] Hartig, O. (2011), “Zero-Knowledge Query Planning for an It- paper3.pdf. 
erator Implementation of Link Traversal Based Query Execution”, in Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, [22] Rietveld, L., Verborgh, R., Beek, W., Vander Sande, M. and D., De Leenheer, P. and Pan, J. (Eds.), Proceedings of the 8th Ex- Schlobach, S. (2015), “Linked Data-as-a-Service: The Semantic tended Semantic Web Conference, Vol. 6643, Lecture Notes in Web Redeployed”, in Gandon, F., Sabou, M., Sack, H., d’Amato, Computer Science, Springer, pp. 154–169, available at: C., Cudré-Mauroux, P. and Zimmermann, A. (Eds.), The Semantic https://doi.org/10.1007/978-3-642-21034-1_11. Web. Latest Advances and New Domains, Vol. 9088, Lecture Notes in Computer Science, Springer, pp. 471–487, available at: [20] Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., http://linkeddatafragments.org/publications/eswc2015-lodl.pdf. Mannens, E. and Van de Walle, R. (2014), “Web-Scale Querying
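The custom semantics and the citation-propagation behaviour described in the discussion can be illustrated with a small sketch. This is not the article's actual pipeline, which applies reasoning to the extracted RDF with real OWL machinery; it is a toy fixed-point inference over Python tuples. The ex: identifiers are hypothetical, and cito:cites and schema:hasPart merely stand in for whichever citation and containment properties a website actually uses.

```python
# Toy sketch of the pipeline's reasoning step (NOT the actual implementation).
# It applies the custom semantics from the discussion: rdfs:label and
# foaf:name treated as equivalent, foaf:knows treated as symmetric, and
# citations propagated from a paragraph to its enclosing section and document.
# All "ex:" names are hypothetical; cito:cites and schema:hasPart are
# illustrative stand-ins, not necessarily the vocabulary the website uses.

# Explicit triples, as they might be extracted from HTML+RDFa pages.
explicit = {
    ("ex:paragraph", "cito:cites", "ex:cited-article"),
    ("ex:section", "schema:hasPart", "ex:paragraph"),
    ("ex:document", "schema:hasPart", "ex:section"),
    ("ex:me", "foaf:knows", "ex:colleague"),
    ("ex:me", "rdfs:label", "Ruben"),
}

def infer_once(triples):
    """Apply the custom semantics one round, returning the enlarged set."""
    new = set(triples)
    for s, p, o in triples:
        if p in ("rdfs:label", "foaf:name"):   # equivalence of both properties
            new.add((s, "rdfs:label", o))
            new.add((s, "foaf:name", o))
        elif p == "foaf:knows":                # symmetry
            new.add((o, p, s))
        elif p == "cito:cites":                # containers cite what parts cite
            for container, p2, part in triples:
                if p2 == "schema:hasPart" and part == s:
                    new.add((container, "cito:cites", o))
    return new

# Saturate: repeat until no new triples appear (a fixed point), since an
# inferred triple (section cites) can trigger further ones (document cites).
triples = explicit
while (expanded := infer_once(triples)) != triples:
    triples = expanded
```

Because this inference happens before publication, a query such as "citations in my articles" finds complete answers even when the markup attached the citation only to a single paragraph.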