A Research Agenda for Linked Closed Data (vision paper) Marcus Cobden, Jennifer Black, Nicholas Gibbins, Les Carr, and Nigel Shadbolt {mc08r,jlb08r,nmg,lac,nrs}@ecs.soton.ac.uk University of Southampton, UK Abstract. While it is preferable that Linked Data is published without access or licence restrictions, there will always remain certain datasets which, perhaps due to financial considerations, cannot be published as Linked Open Data. If these valuable datasets do join the Web of Linked Data, it will be as Linked Closed Data – Linked Data with access and li- cense restrictions. In this paper, we outline a research agenda for Linked Closed Data that considers the effects that access and license restrictions may have on the Web of Linked Data. If implemented poorly, access re- strictions have the potential to break URI resolvability, but even when implemented well, we can expect them to affect dataset selection pro- cesses and inter-dataset link creation rates. Additionally, there remains the technical challenge of developing and standardising access restriction and automated payment techniques for the Web of Linked Data. 1 Introduction To date, most research from the Linked Data community has focussed on Linked Open Data and failed to consider its logical counterpart Linked Closed Data. By Linked Closed Data, we refer to datasets which adhere to the principles of Linked Data publishing, but for which access to and use of the data is subject to legal or technical restrictions which go beyond attribution and share-alike obligations. Linked Closed Data may seem at odds with the aims of the Linked Data community, however, this is not necessarily the case. Undoubtedly it would be a shame if existing ‘open’ datasets changed to a ‘closed’ publishing model, but it would still be preferable to the datasets becoming completely unavailable due to financial pressures on their publishers. High quality data has value. If we wish to see more high quality datasets published as Linked Data, we must provide incentives for publishers to do so. Profit is arguably the most significant publishing incentive, and so, revenue mod- els, and how they interact with Linked Data publishing patterns, are of great importance to the future of the Web of Linked Data. While publishing ‘open’ datasets is to be commended, we anticipate that few companies will be prepared, or able, to make their data available without some access or licence restrictions. Ordnance Survey, Great Britain’s national mapping agency, is in this position; they are required to operate as a self-funded organisation through the commercial sale of mapping data. Offering all their data for free would undermine their primary source of income. 2 A Research Agenda for Linked Closed Data The Times and The New York Times newspapers both used to offer free access to articles on their Web site, recently, however, they both adopted a subscription-based access model for their Web content. Crucially, unlike The Times, The New York Times allows non-subscribers free access to view up to 20 articles per calendar month. Despite the difference in markets, the introduction of paid access may be comparable to the addition of restrictions to Linked Data publishing. It is imperative that we fully understand the implications of Linked Closed Data so that we can be sure that desirable properties of the Web of Data will not be lost in the face of access and license restrictions. Against this background, in this paper we identify the key research challenges posed by Linked Closed Data. 2 Access and Licensing While we describe datasets as ‘open’ and ‘closed’, reality is less clear-cut; in prac- tice we observe a spectrum of ‘openness’, which varies with the access restrictions a dataset is published with (who is permitted to ‘see’ the data), and the licence under which access is granted (what they are permitted to do with the data). 2.1 Access restrictions Access to a dataset may be: completely open, restricted only by the resources of the dataset host; restricted, open to users who meet specific access criteria; or private, open only to its owner. If badly realised, access restrictions have the po- tential to undermine the resolvability of Linked Data URIs and to dis-incentivise the creation of inter-dataset links. The resolvability of a URI would be under- mined if, due to access restrictions, it is no longer possible to resolve that URI to a description document. Access restrictions might also remove any useful infor- mation from a URI description document. Creating links to a restricted dataset is less worthwhile if URIs within it cannot be resolved to a useful description. 2.2 Dataset licences Dataset licences range from public domain dedications (where all intellectual property rights are waived)[9], to permissive licences (which may impose any- thing from attribution, to more weighty obligations such as copyleft) to restric- tive licences (which specify permitted uses of a dataset). Currently, 85% of the datasets in the Linked Data Cloud do not declare any license [2]. We are unlikely to see a restrictive license on a dataset without also encoun- tering access restrictions; if access is open and practically anonymous, license breaches would be difficult to detect and punish. Since it is only worthwhile im- posing license restrictions on valuable data, we expect that the publisher would also take the step of imposing access restrictions to protect its investments. 3 Business models The ‘openness’ of a dataset is generally determined by the business model of the publisher. A range of business models have been proposed for Linked Data A Research Agenda for Linked Closed Data 3 publishers, including subsidised publishing, subscription or micropayment-based access, sponsorship/advertising funded publishing, and loss-leader models (to drive interest or sales in other products, or to shape markets) [3]. Broadly, we can classify these business models into those which offer only free access products, only paid access products, or some combination of both. The revenue models of the individual products these business models are build from can also be split between free and paid access. We can then further categorise them by how their costs are recovered: Advertising supported Costs are covered by revenue from advertising within the content. We consider this to cover both per-view advertising payments, and per-sale commission through affiliate links. Loss-leader Costs are written off as an investment. Loss-leader models may attempt to drive interest in other products, or perhaps to shape markets in the hope of future sales. Subsidised Costs are covered by some form of sponsorship or subsidy. Common in public sector undertakings. Free access models should, in theory, employ at least one of these strategies, though our list is not exhaustive. Equally, paid access models might employ any combination of these to complement revenue from access payments. Currently, all Linked Data publishers operate under loss-leader or subsidy-based revenue models – none have adopted paid access or advertising supported revenue models. In the last year, Ordnance Survey have begun to offer free access to some of their data. These free datasets include postcode location data, electoral and administrative boundaries, and gazetteer (at 1:50000 scale), though only some of these are published as Linked Data. Currently, as they offer no paid access Linked Data, this is likely to be a loss-leader exercise, perhaps to demonstrate what is possible with their premium subscriptions, or to shape the market in anticipation of a paid offering. Coupling a premium Linked Data product with a free version may be an effective way of mitigating the effects of access restrictions on phenomena such as link creation, providing they use the same URIs, and URI resolvability is maintained. 4 Research Challenges The addition of access restrictions to Linked Data publishing systems and the emergence of datasets published under proprietary licenses will be an inevitable fact of commercial Linked Data. These changes will alter our expectations of URI resolvability and change the ways in which we use Linked Data. We have identified six key challenges which are important in a Linked Data ecosystem where not all data is free – they are: i) building dataset reputation, ii) developing access, authentication and payment protocols, iii) fostering links between datasets, iv) managing confidential data, v) respecting license terms, and vi) validating business models. We elaborate further on these in the remain- der of this section. 4 A Research Agenda for Linked Closed Data 4.1 Building reputation Access restrictions make it all the more important for a dataset to have a good reputation. Without access, prospective users will be unable to evaluate whether the dataset meets their needs before they commit to a purchase. Reputation and word of mouth are a common means through which users judge whether a product meets their requirements when considering a purchase. Unfortunately there are few incentives to risk purchasing access to a dataset of unknown quality, so new and untried datasets are unlikely to create a rep- utation for themselves. To combat this bootstrapping problem, publishers will need to improve their reputation through other means; perhaps by (temporarily) removing access restrictions, or by seeking endorsements from trusted authori- ties. Freemium revenue models [1] and free time- or extent-limited licences are a common means of providing access to prospective customers. Future research is needed to explore this area in more detail, and to identify other dataset rep- utation bootstrapping techniques. 4.2 Access, authentication and payment The issue of restricting access to closed datasets poses clear technical challenges. HTTP content negotiation and redirection allows different documents to be served in response to a URI resolution attempt. Similar methods are needed for restricted-access datasets in order to field requests between free and paid content. Additionally, a new vocabulary is needed with which one can declare the presence of related documents which require payment to access, otherwise premium content may not be discoverable. Authentication is a requirement of any access restriction techniques. The foundations for authentication are already being laid; the W3C WebID incubator group is standardising TLS client certificate based authentication [4]. Finally, in order for automated access to premium content to succeed beyond isolated individual publishers, we must standardise the means by which we indi- cate that payment is needed, and the methods by which payment can be made. The HTTP ‘402 Payment Required’ response code has long been reserved, but no standards have yet specified how it should be used [6]. 4.3 Fostering links Restricting access to a dataset may also negatively impact how likely external sources are to create links to that dataset. Inter-dataset links are said to add value to datasets, and incoming links provide a form of advertisement and en- dorsement for the target dataset. Publishers of Linked Closed Data will need to take into account the effect that their particular access and licensing schemes may have on incoming link creation rates. Datasets with schemes which permit some degree of free access may maintain a higher rate of link creation than those under more restrictive schemes. The different paid access models adopted by The Times and The New York Times may allow us to examine how access restriction affects link creation on the Web, although whether this translates to a Semantic Web context remains an open research question. Further research is needed to examine this behaviour in detail, and to explore other means of encouraging link creation. A Research Agenda for Linked Closed Data 5 4.4 Managing confidentiality It is not inconceivable that linked datasets might contain sensitive information. While authentication and authorisation schemes can be used to limit access to restricted information, the challenges posed by confidential data go beyond access, particularly in systems containing mixed confidentiality-level data. Care must be taken that sensitive information cannot be inferred from non- sensitive data. For example, unique identification number ranges with unex- plained gaps may suggest hidden information. Further, a Linked Data system, which adheres to current best practices, might inadvertently admit the existence of a sensitive URI by responding with an ‘HTTP 403: Forbidden’ response code instead of an ‘HTTP 404: Not Found’. While this alone means little, if the URI contains embedded information, or can be correlated with other available data, it may constitute a breach of confidentiality. Examining the issues in detail, and publishing amended best practices for publishing Linked Data in the presence of confidential information, remains an opportunity for future research. 4.5 Respecting licenses As we mentioned in Section 2.2, there are a wide range of common licences which datasets might be licensed under. Standardised licences, such as the GNU Public Licences (GPL), Creative Commons (CC) and Open Database Licences (ODBL), often offer variants with additional restrictions such as attribution, non-commercial use, and copyleft sharing requirements. Some of these licences, such as the GPL and CC, may eventually fall out of favour for Linked Data as they were not designed to be used in this context [9]. Governments often have their own licenses for data, for example the United Kingdom’s Open Government Licence (OGL). Dataset licenses present two main challenges: tracking the licences under which data was received, and respecting the license conditions. In order to be able to honour license restrictions, Linked Data systems need to maintain appropriate provenance records of license conditions. Provenance is an active area of research for Linked Data [8], and we expect that once provenance-aware systems reach maturity much of this will be automated. Ensuring adherence to dataset licenses is a more difficult research challenge. Interpreting licenses confidently and accurately requires legal training. Ideally standard licenses would include an approved machine-understandable descrip- tion of the license, describing how a dataset might be used. Existing work has explored the potential of this approach [7]; however, further work is needed to apply this to common license conditions. 4.6 Validating business models While many business models have been proposed for the Semantic Web and Linked Data publishers [3], it remains to be seen how many of them will prove viable. Some, such as advertising-supported Linked Data publishing, have been 6 A Research Agenda for Linked Closed Data called into question somewhat [5]; however, ultimately, the success of any busi- ness model will depend upon the market conditions in which is employed. Nonethe- less, until there exist successful businesses built around these models, we must consider them unsolved research challenges. 5 Conclusions In this paper, we introduced the notion of Linked Closed Data: Semantic Web datasets which are published in accordance with Linked Data principles, but which include access and licence restrictions. While we described Linked Closed Data as the logical counterpart to Linked Open Data, license restrictions are already common among ‘Open’ datasets (although they usually only require attribution as the source of the data); the addition of access restrictions is of greater significance. We argued that Linked Closed Data is likely to be the form of Linked Data publishing ultimately adopted by commercial Linked Data publishers when offering premium, paid access products, as they are likely to require some form of access restriction. Finally, we identified six research challenges which are of new, or increased, significance when considering the effect of access and license restrictions on the Web of Data. These are: building dataset reputation despite access restrictions; developing and standardising access, authentication and payment protocols; fos- tering the creation of links to access restricted datasets; managing confidential data; tracking data licenses and ensuring license adherence; and the validation of Linked Data business models. References 1. Anderson, C.: Free: The Future of a Radical Price: The Economics of Abundance and Why Zero Pricing Is Changing the Face of Business. Random House Books (Aug 2009) 2. Bizer, C., Jentzsch, A., Cyganiak, R.: State of the LOD Cloud (Aug 2011), http: //www4.wiwiss.fu-berlin.de/lodcloud/state/ 3. Brinker, S.: Business models for linked data and web 3.0 (Mar 2010), http://www. chiefmartec.com/2010/03/business-models-for-linked-data-and-web-30. html 4. Corlosquet, S., Sporny, M., Inkster, T., Story, H., Harbulot, B., Bachmann-Gmür, R.: WebID 1.0 - Web Identification and Discovery (Draft) (Feb 2011), http://www. w3.org/2005/Incubator/webid/spec/ 5. Dodds, L.: Thoughts on Linked Data Business Models (Jan 2010), http://www. ldodds.com/blog/2010/01/thoughts-on-linked-data-business-models/ 6. Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., Berners- Lee, T.: Hypertext Transfer Protocol – HTTP/1.1. RFC 2616 (Draft Standard) (Jun 1999), updated by RFC 2817 7. Hanson, C., Kagal, L., Berners-Lee, T., Sussman, G., Weitzner, D.: Data-purpose algebra: Modeling data usage policies. In: POLICY ’07. pp. 173 –177 (Jun 2007) 8. Hartig, O.: Provenance information in the web of data. Proceedings of the 2nd Workshop on Linked Data on the Web (LDOW2009) (Jan 2009) 9. Miller, P., Styles, R., Heath, T.: Open data commons, a license for open data. In: Proceedings of the WWW2008 Workshop on Linked Data on the Web (LDOW) (2008)