Describing a network of live datasets with the SDS vocabulary

Arthur Vercruysse, Sitt Min Oo and Pieter Colpaert
IDLab, Department of Electronics and Information Systems, Ghent University – imec, Technologiepark-Zwijnaarde 122, 9052 Ghent, Belgium

Abstract
Data publishers can provide multiple interfaces per dataset. Each interface has its own merits and drawbacks: SPARQL endpoints are expensive to host, and clients find it difficult to work with static data dumps. Furthermore, query agents can only select the most fitting interface and dataset if provenance information is provided. In this paper, we introduce the Smart Data Specification for Semantically Describing Streams (SDS) to annotate dataset interfaces with provenance information, describing the consumed stream and the transformations applied to that stream. We focus on Linked Data Event Streams, which can publish the same dataset with different fragmentations, and demonstrate a pipeline that transforms an LDES and publishes the data with a different fragmentation, as described in the accompanying provenance information. The SDS vocabulary builds upon the DCAT-AP, LDES and P-Plan vocabularies. In future work, we will create a source selection strategy for federated query processors that takes this provenance information into account when selecting a dataset and an interface to query it.

Keywords
LDES, Linked Data, Provenance, Dataset selection

1. Introduction

The world is ever-changing, so data portals publish live datasets derived from streams of data (events, deltas, etc.). Streams are often related to other streams through transformations of their data. This leads to problems when datasets and streams are not accompanied by provenance information: query agents use this information to query each dataset only once, while missing or inadequate provenance information forces them to query the same underlying data multiple times.
Once a required dataset is found, the query agent can go further and discover the most fitting interface. Different interfaces determine how data is accessed. For example, in a route planning application that notifies the user about changes in routes due to construction sites, multiple interfaces can be used. A time-based interface makes it easy to track the latest changes, whereas a geospatial interface makes it easy to calculate whether or not a construction site will be encountered on a particular route. A SPARQL endpoint can fulfil both roles, but SPARQL carries high operation costs and may leave users with low availability [1]. Constantly updating a data dump is also an option, but leaves the client with high operation and bandwidth costs [2]. Linked Data Event Streams (LDES) alleviate these high costs by fragmenting the data into different web resources and linking these resources with semantic links in multiple dimensions [2], focusing on easy replication and synchronization. Each stream can be fragmented in a different way to accommodate the needs of its users. We set up a new interface for the Belgian street name registry and demonstrate that a query agent can find the optimal interface to execute particular queries.

Managing the Evolution and Preservation of the Data Web (MEPDaW 2022)
arthur.vercruysse@ugent.be (A. Vercruysse); x.sittminoo@ugent.be (S. M. Oo); pieter.colpaert@ugent.be (P. Colpaert)
https://pietercolpaert.be (P. Colpaert)
ORCID: 0000-0003-1586-5122 (A. Vercruysse); 0000-0001-9157-7507 (S. M. Oo); 0000-0001-6917-2167 (P. Colpaert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related work

DCTerms, DCAT and VoID: Exposing metadata about datasets is long established.
Dublin Core Terms (DCTerms) can be used to provide basic information about resources, offering terms like title, author, description and subject [5]. The Data Catalog Vocabulary (DCAT) is designed to facilitate interoperability between data catalogs published on the web [6]. DCAT also provides terms like license, which makes it possible to define a new license for an interface. The Vocabulary of Interlinked Datasets (VoID) focuses on explicitly linking datasets together on some predicate and on defining subset datasets [7].

LDES: A Linked Data Event Stream is a way of exposing an ever-growing set of immutable objects. These objects can be divided into fragments, published as HTTP resources that are linked together with the TREE specification. Fragmentations provide semantic meaning to the links between HTTP resources: each HTTP resource can, for example, hold all items that start with a particular letter. A view description defines the meaning of the fragments and their links [2].

VoCaLS: The Vocabulary for Cataloging and Linking Streams and streaming services on the web extends the ideas of DCAT with more information about streaming data [8]. That work defines a stream slightly differently than this paper does: VoCaLS focuses on streams that generate high-throughput updates, which requires processors to use a windowing mechanism. In this paper, a stream is seen more broadly as a growing collection of objects, updates or otherwise.

P-Plan and PROV-O: The Ontology for Provenance and Plans (P-Plan) is an extension of the PROV-O ontology [9] created to represent the plans that guide the execution of scientific processes. P-Plan describes how plans are composed; this information can be used to keep track of the provenance tree [10].

3. The Smart Data Specification for Semantically Describing Streams (SDS)

A stream in the context of the Smart Data Specification for Semantically Describing Streams (SDS) is a physical live channel that carries updates or items.
A dataset can be derived from a stream as the collection of all its updates or items. A physical channel can be any medium, such as a Kafka stream, a WebSocket stream or even a file to which updates are appended. A stream can carry any kind of data: CSV rows, mutable or immutable linked data objects, video stream chunks, etc.

Figure 1: Visual representation of the SDS Ontology. The colored boxes have no semantic meaning and are only used for clarity.

A stream can be derived from a transformation applied to the items of a different stream. This transformation is described with P-Plan and resides in the SDS description of the resulting stream. The stream and the transformation correspond with p-plan:Entity and p-plan:Activity respectively, shown as the pink part of Figure 1. The transformation links to the previous stream with the prov:used predicate. With the power of P-Plan, query agents can understand how datasets are linked and which interface fits a specific query best. The SDS description also links to metadata about the resulting dataset with the sds:dataset predicate. This makes it possible to modify the dataset's metadata after a transformation, represented as the green part in Figure 1.

An sds:Stream should describe the type of objects it is carrying. First, the stream can use the sds:carries predicate, which points to the type of the sds:Record objects it is carrying (see below). Second, the stream can describe, with sds:shape, the SHACL shape these members adhere to; this can only be used if the stream carries linked data objects.

Linking specific items to the correct stream is done with sds:Record objects. An sds:Record points to the data (sds:payload) and the corresponding stream (sds:stream). These small objects make it possible for multiple streams to use the same channel.
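As a sketch, two records sharing one physical channel could be described as follows (assuming the https://w3id.org/sds# namespace; all ex: identifiers are illustrative, not taken from the demo):

```turtle
@prefix sds: <https://w3id.org/sds#> .
@prefix ex:  <http://example.org/> .   # illustrative namespace

# Two records on the same channel, each tagged with the logical stream
# it belongs to, so multiple streams can share one channel.
ex:record1 a sds:Record ;
    sds:payload ex:member1 ;         # the actual data item
    sds:stream  ex:LDESStream .      # the stream this record belongs to

ex:record2 a sds:Record ;
    sds:payload ex:member1 ;
    sds:stream  ex:BucketizedStream .
```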
Each transformation can thus push sds:Record objects and leave the original stream intact. A stream of immutable objects can still be transformed, for example, to calculate a hash or to add a fragment id to the sds:Record object. The yellow part of Figure 1 gives a visual overview of sds:Record.

4. Demo

Data published with Linked Data Event Streams can be partitioned or fragmented in a multitude of ways. This helps query agents resolve their queries as fast as possible whilst ingesting as little data as possible. A default fragmentation is a timestamp fragmentation, which allows clients to replicate and synchronize the dataset efficiently. A substring fragmentation, on the other hand, makes autocompletion more efficient [11].

In this demo, we set up a pipeline starting from an existing LDES that exposes the registry of street names with a timestamp fragmentation. The pipeline calculates a substring fragmentation based on the name of each street and exposes a new LDES with the corresponding SDS description and substring fragmentation. The published Views can then be associated with their respective viewDescription as described in Listing 1.

When asking a query agent "What are the 10 latest updated street names?" starting from the newly created LDES, the query agent can derive from the SDS description that the current LDES is not suitable for this query. This query would require the query agent to request the entire LDES tree and manually find the 10 latest updates, whereas, following the links from the SDS description back to the original LDES, the query would only require a few HTTP requests: one request gets the SDS description and another gets the latest updates thanks to the timestamp-based fragmentation.

5. Conclusion

The SDS ontology makes it possible to add a description to a stream and the resulting dataset, which adds provenance information.
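The substring bucketization computed by the demo pipeline above can be sketched as follows. This is a simplified illustration: the function name, the prefix-based bucket rule and the maximum depth are assumptions for exposition, not the demo's exact algorithm.

```python
from collections import defaultdict

def bucketize(street_names, max_depth=2):
    """Assign each street name to the buckets of all lowercase prefixes
    of its name, up to max_depth characters, mimicking the substring
    fragmentation an autocompletion client would traverse."""
    buckets = defaultdict(list)
    for name in street_names:
        key = name.lower()
        for depth in range(1, min(max_depth, len(key)) + 1):
            buckets[key[:depth]].append(name)
    return dict(buckets)

streets = ["Veldstraat", "Veldkant", "Kerkstraat"]
buckets = bucketize(streets)
# the "ve" bucket now holds both Veld* streets; "ke" holds Kerkstraat
```

A real bucketizer would additionally page each bucket (cf. ldes:pageSize in Listing 1) and emit one sds:Record per bucketized member rather than plain strings.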
The provenance information links streams together and records the transformations applied to them. The SDS ontology aligns well with long-established ontologies like DCAT and P-Plan to maximize interoperability. With the SDS description, a query agent can automatically select the right dataset and interface for a given query. This can make an LDES behave more like a query interface, because the best fragmentation for a particular problem can be found. SDS descriptions also make it possible to track applied transformations down to the source. This enables a user who normally only extracts data from an interface to request changes at the data source; these changes will then be propagated through the streams and result in the intended change over all interfaces. Federated query processors that use this provenance information for source selection when choosing a dataset and an interface to query remain future work.

6. Acknowledgments

Funded by the Flemish government's recovery fund VSDS project: the "Vlaamse Smart Data Space".

@prefix ex:     <http://example.org/> .   # illustrative namespace; the original prefix URIs were lost in extraction and are restored here, with placeholders where noted
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix tree:   <https://w3id.org/tree#> .
@prefix ldes:   <https://w3id.org/ldes#> .
@prefix p-plan: <http://purl.org/net/p-plan#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix sds:    <https://w3id.org/sds#> .

ex:MyLDES a ldes:EventStream ;
    dcat:title "An example LDES" .

# One fragmentation is based on an existing LDES
ex:BasicFragmentation a tree:ViewDescription ;
    dcat:endpointURL <http://example.org/timestamp/root> ;  # a root node from which all members can be accessed (placeholder URL)
    dcat:servesDataset ex:MyLDES ;                          # the LDES
    # This viewDescription was created by importing this stream
    ldes:managedBy ex:LDESStream .

# The other fragmentation bucketizes the members before publishing
ex:SubstringFragmentation a tree:ViewDescription ;
    dcat:endpointURL <http://example.org/substring/root> ;  # a root node from which all members can be accessed (placeholder URL)
    dcat:servesDataset ex:MyLDES ;                          # the LDES
    # This viewDescription was created by importing this stream
    ldes:managedBy ex:BucketizedStream .

ex:ImportLDES a p-plan:Activity ;
    rdfs:comment "Reads csv file and converts to rdf members" ;
    prov:used <http://example.org/streets.csv> .  # the consumed source (placeholder URL)
ex:LDESStream a sds:Stream ;
    p-plan:wasGeneratedBy ex:ImportLDES ;
    sds:carries sds:Member .

ex:BucketizeStrategy a ldes:BucketizeStrategy ;
    ldes:bucketType ldes:SubstringFragmentation ;  # tells the client which fragmentation is used
    tree:path rdfs:label ;
    ldes:pageSize 50 .

ex:BucketizeStream a p-plan:Activity ;
    rdfs:comment "Execute a substring bucketization on the incoming stream" ;
    prov:used ex:LDESStream, ex:BucketizeStrategy .

ex:BucketizedStream a sds:Stream ;
    p-plan:wasGeneratedBy ex:BucketizeStream ;
    sds:carries sds:Member .

Listing 1: RDF sample creating two LDES Views from different Streams

References

[1] Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: Ready for action? In: Alani, H., Kagal, L., Fokoue, A., Groth, P., Biemann, C., Parreira, J.X., Aroyo, L., Noy, N., Welty, C., and Janowicz, K. (eds.) The Semantic Web – ISWC 2013, pp. 277–293. Springer Berlin Heidelberg, Berlin, Heidelberg (2013).
[2] Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., Van de Walle, R.: Web-scale querying through Linked Data Fragments. In: Bizer, C., Heath, T., Auer, S., and Berners-Lee, T. (eds.) Proceedings of the 7th Workshop on Linked Data on the Web (2014).
[3] Van Lancker, D., Colpaert, P., Delva, H., Van de Vyvere, B., Meléndez, J.R., Dedecker, R., Michiels, P., Buyle, R., De Craene, A., Verborgh, R.: Publishing base registries as linked data event streams. In: Brambilla, M., Chbeir, R., Frasincar, F., and Manolescu, I. (eds.) Web Engineering, pp. 28–36. Springer International Publishing, Cham (2021).
[4] Ben Ellefi, M., Bellahsene, Z., Breslin, J.G., Demidova, E., Dietze, S., Szymański, J., Todorov, K.: RDF dataset profiling – a survey of features, methods, vocabularies and applications. Semantic Web 9, 677–705 (2018).
[5] Michel, F., Faron-Zucker, C., Corby, O., Gandon, F.: Enabling automatic discovery and querying of web APIs at web scale using linked data standards.
In: Companion Proceedings of the 2019 World Wide Web Conference, pp. 883–892. Association for Computing Machinery, New York, NY, USA (2019).
[6] Baker, T.: Libraries, languages of description, and linked data: A Dublin Core perspective. Library Hi Tech 30, 116–133 (2012).
[7] Beltran, A.G., Cox, S., Browning, D., Perego, A., Albertoni, R., Winstanley, P.: Data Catalog Vocabulary (DCAT) – version 2. W3C (2020).
[8] Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets. LDOW (2009).
[9] Tommasini, R., Sedira, Y.A., Dell'Aglio, D., Balduini, M., Ali, M.I., Le Phuoc, D., Della Valle, E., Calbimonte, J.-P.: VoCaLS: Vocabulary and catalog of linked streams. In: Vrandečić, D., Bontcheva, K., Suárez-Figueroa, M.C., Presutti, V., Celino, I., Sabou, M., Kaffee, L.-A., and Simperl, E. (eds.) The Semantic Web – ISWC 2018, pp. 256–272. Springer International Publishing, Cham (2018).
[10] Lebo, T., Sahoo, S., McGuinness, D.: PROV-O: The PROV ontology. W3C (2013).
[11] Garijo, D., Gil, Y.: The P-Plan ontology (2014).
[12] Van de Vyvere, B., D'Huynslager, O.V., Atauil, A., Segers, M., Van Campe, L., Vandekeybus, N., Teugels, S., Saenko, A., Pauwels, P.-J., Colpaert, P.: Publishing cultural heritage collections of Ghent with linked data event streams. In: Garoufallou, E., Ovalle-Perandones, M.-A., and Vlachidis, A. (eds.) Metadata and Semantic Research, pp. 357–369. Springer International Publishing, Cham (2022).