Linked Data Utilization along the Content Value Chain – Observations and Implications

Georg Neubauer
University of Applied Sciences St. Poelten
Matthias Corvinus Str. 15, 3100 St. Poelten, Austria
dm131520@fhstp.ac.at

Tassilo Pellegrini
University of Applied Sciences St. Poelten
Matthias Corvinus Str. 15, 3100 St. Poelten, Austria
tassilo.pellegrini@fhstp.ac.at

ABSTRACT
The authors present the results of a longitudinal investigation into the utilization of Linked Data technologies along the content value chain. The authors analyzed 71 papers from the period 2006 to 2014 that used Linked Data technologies in editorial workflows. By coding the primary and secondary research topics addressed in each paper, the authors draw conclusions about the maturity of Linked Data technologies as support systems along the content value chain. The survey indicates that Linked Data technologies are constantly maturing as a support infrastructure for editorial processes. The validity of the survey results for application domains not related to editorial tasks is open to discussion.

Categories and Subject Descriptors
E.0 [General]; K.4.3 [Organizational Impacts]

General Terms
Management, Economics, Human Factors, Standardization

Keywords
Linked Data, Content Value Chain, Semantic Metadata, Semantic Web, Data Journalism, News Production, Editorial Workflows, Media Economics, IPR, Data Licensing

1. INTRODUCTION
The growing recognition of Linked Data among the research community as “Semantic Web done right” [14] motivates a closer look at whether and how Linked Data research has evolved over recent years. Such an investigation provides insights into research trends and their interdependencies, and it allows conclusions to be drawn about whether the research field has reached a significant degree of maturity in terms of technology diffusion and application areas.

As illustrated in Figure 1, a survey of the occurrence of the phrase “Linked Data” in research publications of the ACM Digital Library from the period 2006 to 2014 reveals the growing popularity of this technological concept in the computer sciences until 2013, with a decline in 2014. Linked Data as a generic technology for data management is being applied across various application areas and industries, which makes it very hard to arrive at a general statement concerning its level of maturity and industry adoption. So is the distribution in Figure 1 an indicator of the growing maturity of a research field? And if yes, how can this maturity be operationalized empirically?

[Bar chart. Bar labels: 33, 47, 47, 88, 208, 317, 368, 447, 366; x-axis: 2006 to 2014.]
Figure 1. ACM Publications containing the term “Linked Data” from 2006 – 2014 (N = 1921)

To tackle these questions the authors chose to analyze a subset of research papers from the ACM database that address the application of Linked Data within editorial workflows. This subset allowed us to apply a unified classification scheme, known as the content value chain [1], to the various application areas of Linked Data. The content value chain can be described as a process model comprising several sequential steps that contribute to the content production process. By looking at the application of Linked Data in editorial workflows it was possible to identify primary and secondary areas of utilization, which allowed us to draw conclusions about the diffusion and appropriability of Linked Data for the production of media content.

2. CLASSIFICATION SCHEME & RELATED WORK
The original concept of the value chain as developed by Michael Porter in 1985 is used as an analytical framework for the analysis of value creation processes at the firm level or the industry level [15]. Over recent years the concept of the value chain has also gained popularity in the context of open data in general [4; 6; 16] and Linked Data in particular [3; 5]. Especially research that investigates the organizational and economic impact of Linked Data refers to the concept of the value chain [13].

In this paper we refer to a generic abstraction of the content value chain consisting of five steps: 1) content acquisition, 2) content editing, 3) content bundling, 4) content distribution and 5) content consumption. As illustrated by [1], Linked Data can contribute to each step by supporting its associated intrinsic production function. These are in detail:

Content acquisition is mainly concerned with the collection, storage and integration of relevant information necessary to produce a news item. In the course of this process information and facts are pooled from internal or external sources for further processing.

Content editing entails all necessary steps that deal with the semantic adaptation, interlinking and enrichment of data. Adaptation can be understood as a process in which acquired data is provided in a way that it can be used in the editorial process. Interlinking and enrichment are often performed via processes like tagging and/or referencing to enrich media documents, either by disambiguating existing concepts or by providing background knowledge for deeper insights.

Content bundling is mainly concerned with the contextualization and personalization of information products. It can be used to provide customized access to media files, e.g. by using metadata for the device-sensitive delivery of content, or to compile thematically relevant material into landing pages or dossiers, thus improving the navigability, findability and reuse of information.

In a Linked Data environment the process of content distribution mainly deals with the provision of machine-readable and semantically interoperable (meta)data via Application Programming Interfaces (APIs) or SPARQL endpoints. These can be designed either to serve internal purposes, so that data can be reused within controlled environments (e.g. within or between units), or to serve external purposes, so that data can be shared between unknown users (e.g. as open SPARQL endpoints on the Web).

Content consumption entails any means that enable a human user to search for and interact with content items in a pleasant and purposeful way. According to this view, this level mainly deals with end user applications that make use of Linked Data to provide access to content, e.g. by providing reasonable retrieval tools and/or visualizations.

The five steps of the content value chain comprise the classification scheme.

3. METHODOLOGY
We selected a sample of 71 papers (out of 1921) dealing with the utilization of Linked Data in editorial workflows in the period from 2006 to 2014 from the ACM Digital Library (DL). The selected papers had to comply with the following criteria: 1) the work must analyse the utilization of Linked Data with reference to some sort of editorial workflow; and 2) the work must not be purely theoretical but provide at least a proof of concept. The relevant papers were then analysed and clustered according to the five classes acquisition, editing, bundling, distribution and consumption. As most papers treated more than one of these topics, we weighted each paper according to the primary and secondary topic discussed, thus also gaining a better understanding of how the research topics relate to each other.

Figure 2 illustrates the classification scheme. The black boxes indicate the primary classification of a paper and the number of papers falling into this category. The secondary classifications receive a weighted greyscale value. The number in the grey and black boxes indicates how many papers referred to these classes. Hence, reading the rows horizontally gives an overview of how the primary classification of a paper relates to its secondary classification. Reading the columns vertically and summing up the values from the black boxes gives the number of papers falling into a specific class.

The weighted greyscale values have been calculated as follows. Black corresponds to 100%. The grey value of a secondary classification is 50% divided by the number of papers with the respective main classification (black), multiplied by the number of related classifications for that secondary classification. Figure 4 illustrates the results of our survey.

Figure 2. Legend: time-based categorization into the content value chain

4. RESULTS
4.1 General Findings
Figure 3 illustrates the general findings of our investigation. It aggregates the results of all years, which are discussed individually in Section 4.2, as influence circles on a grid. The black circles on the diagonal represent the number of papers within each main classification, while the other circles, read horizontally, show the related classifications. As mentioned, related classifications result from secondary topics that match the content value chain and were found in a paper in addition to its main topic classification.

Figure 3. Influence circles (result) – time-based categorization into the content value chain

The main application areas of Linked Data in editorial workflows fall into the areas editing (23 papers), bundling (18 papers) and consumption (21 papers).

Crawling and leveraging processes can be subsumed under the acquisition process [1]; they use special indexing methods for entities that are found and aggregated through queries. These indexing methods form the foundation for further processing, called content editing.

Editing that uses algorithmic methods to classify data into separate, semantically enriched lists or ontologies was treated as main topic in 23 papers. All of these editing methods were part of a recognition process used for video, text or graphic analysis, in terms of media analysis and the enrichment of metadata.

18 papers treated content bundling as main topic. Bundling can be defined as fine-grained representations of resource parts used for the personalization and contextualization of content.

Just 4 papers described distribution, for example for the purpose of improved accessibility of information. The main difference to the content bundling process and to the content consumption process explained below is that the data is accessible only through APIs and, as with content bundling, is not presented as visualized graphs of the content. This number of papers is too low to support further conclusions.

21 papers applied Linked Data through a framework that visualizes graph-based relations between links. The common approach among framework developers was to visualize the links of Linked Data for purposes such as content recommendation.

4.2 Longitudinal Perspective
Figure 4 illustrates the results of our analysis from a longitudinal perspective. The visualization scheme corresponds to Figure 3 but additionally lists the number of papers (the black boxes) and their related topics (the grey boxes) for the years from 2006 to 2014. For example, two papers shown under content acquisition in 2014 means that these two papers have their main classification in content acquisition and related topics in all other areas of the value chain.

Figure 4. Primary and secondary topics in Linked Data utilization

2006: We found just one paper in 2006 related to our research focus. This paper addressed content acquisition as main topic and editing issues as secondary topic.

2007: In 2007 one paper was classified as treating content bundling as main topic and content acquisition as secondary topic. Two papers addressed content consumption as primary topic and referred to acquisition, editing and bundling only in the context of content consumption.

2008: In 2008 we identified one paper addressing content distribution and one paper addressing content consumption, both referring to content editing.

2009: We have three papers, classified as content editing, content bundling and content consumption respectively. In the case of content bundling the subrelation is editing, and in the case of content consumption the subrelations refer equally to content bundling and content distribution.

2010: In 2010 the authors detected one paper treating content acquisition, one paper treating content distribution and another one treating content consumption. Two papers treated content editing frameworks. All five papers treated content acquisition as their secondary topic.

2011: In 2011 one paper was about content acquisition, editing, distribution and content consumption. The relations begin in the content editing class, including a single subrelation to content acquisition and content consumption. Four papers all have an equal number of subrelations to content acquisition and editing. Additionally, one paper described a framework for content consumption.

2012: In 2012 the authors found one paper addressing content acquisition as main topic and content editing as secondary topic. Two papers demonstrated the opposite pattern, discussing editing as main topic and acquisition as secondary topic. Four papers refer to content bundling with subrelations to content acquisition and content editing, while one of them also mentioned content distribution or content consumption as tertiary topic. Four papers address content consumption as main topic; all of them show subrelations to content acquisition, and one paper additionally treats editing.

2013: All papers that describe content editing frameworks in 2013 also address acquisitional processes. One of the three papers addressing content editing has a subrelation to content bundling. Two papers are subrelated to content distribution and one to content consumption. Only one paper related to content bundling, with subrelations to content acquisition and content editing. Four papers addressed content consumption frameworks as main topic; of these, three also addressed content editing and two content bundling.

2014: In 2014 the classification scheme of the content value chain seems applicable to a large number of papers. We analysed 25 papers and came to the conclusion that content editing which uses combinations of vocabularies for the preparation of Linked Data is particularly prominent, e.g. the automatic extraction of RDF triples from web sources for the purpose of content enrichment. Accordingly, 11 papers are classified as content editing, in nearly all cases combined with acquisitional preprocessing. Content bundling with 5 papers and content consumption with 6 papers as main classification are distributed similarly to the former years.

5. DISCUSSION, LIMITATIONS & FUTURE WORK
The results show a trend in the utilization of Linked Data technologies towards content editing, content bundling and content consumption. In particular, the increasing number of papers addressing consumption purposes after 2009 is taken as an indicator of the increasing maturity of Linked Data technologies in editorial workflows. We also observe an increasing usage of content acquisition processes beginning in 2008, which we take as a sign that the underlying data infrastructure had reached a level of integrity that allows reuse. Concerning the main result, the intertwinedness of research topics points to a seamless integration of distinct steps in the content value chain. Metadata acquisition systems can minimize the human burden in recording data [12]. Normally the content acquisition process is the first step in processing data. We also claim that there is a structural relation between content distribution and acquisition, given that these two processes are technologically intertwined in interlinked data ecosystems. Content distribution could be treated as a main goal of data storage and supply [13]. The authors assume that well-established Linked Data stores are a precondition for content acquisition, which in turn allows further processing such as content bundling, content distribution and content consumption. Given the comparatively large number of papers in 2014, we conclude that content editing is taking root, although this result should be normalized against the former years before its consistency is judged.

To gain further insights the authors plan to extend the sample size of their survey in future work. The current sample of 71 papers is simply too small to draw precise conclusions on the state of the art and future direction of Linked Data utilization in editorial workflows. Apart from these limitations, the insights generated by the survey indicate that Linked Data technologies are constantly maturing as a support infrastructure for editorial processes. The validity of the survey results for application domains not related to editorial tasks is open to discussion.

6. REFERENCES
[1] Pellegrini, Tassilo. “Integrating Linked Data into the Content Value Chain: A Review of News-Related Standards, Methodologies and Licensing Requirements.” In Proceedings of the 8th International Conference on Semantic Systems, 94–102. ACM, 2012. http://dl.acm.org/citation.cfm?id=2362513.

[2] Auer, Sören, Theodore Dalamagas, Helen Parkinson, François Bancilhon, Giorgos Flouris, Dimitris Sacharidis, Peter Buneman, et al. “Diachronic Linked Data: Towards Long-Term Preservation of Structured Interrelated Information.” In Proceedings of the First International Workshop on Open Data, 31–39. WOD ’12. New York, NY, USA: ACM, 2012. http://doi.acm.org/10.1145/2422604.2422610.

[3] Auer, Sören, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, and Amrapali Zaveri. “Introduction to Linked Data and Its Lifecycle on the Web.” In Reasoning Web. Semantic Technologies for Intelligent Data Access, 1–90. Springer, 2013. http://link.springer.com/chapter/10.1007/978-3-642-39784-4_1.

[4] Davis, Mills. “The Business Value of Semantic Technologies.” Presentation and Report, Semantic Technologies for E-Government, 2004. http://project10x.com/bio_downloads/business_value_of_semantic_technologies_2005.pdf, accessed May 9, 2015.

[5] Latif, Atif, Anwar Us Saeed, Patrick Hoefler, Alexander Stocker, and Claudia Wagner. “The Linked Data Value Chain: A Lightweight Model for Business Engineers.” In I-SEMANTICS, 568–75. Citeseer, 2009. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.181.950&rep=rep1&type=pdf.

[6] Pepe, Alberto, Matthew Mayernik, Christine L. Borgman, and Herbert Van de Sompel. “Technology to Represent Scientific Practice: Data, Life Cycles, and Value Chains.” World Wide Web Internet And Web Information Systems, 2009, 1–22.

[7] Robak, Silva, Bogdan Franczyk, and Marcin Robak. “Research Problems Associated with Big Data Utilization in Logistics and Supply Chains Design and Management,” n.d. https://fedcsis.org/proceedings/2014/pliks/472.pdf.

[8] Solanki, Monika, and Christopher Brewster. “Consuming Linked Data in Supply Chains: Enabling Data Visibility via Linked Pedigrees.” In COLD, 2013. http://windermere.aston.ac.uk/~monika/papers/SolankiCOLD2013.pdf.

[9] Taskar, Benjamin, Eran Segal, and Daphne Koller. “Probabilistic Classification and Clustering in Relational Data.” In International Joint Conference on Artificial Intelligence, 17:870–78. Lawrence Erlbaum Associates Ltd, 2001. http://ai.stanford.edu/users/koller/Papers/Taskar+al:IJCAI01.pdf.

[10] Van Erp, Marieke, Willem Robert van Hage, Laura Hollink, Anthony Jameson, and Raphaël Troncy. “Detection, Representation, and Exploitation of Events in the Semantic Web,” 2013. http://ceur-ws.org/Vol-1123/proceedingsderive2013.pdf.

[11] Villazón-Terrazas, Boris, and Oscar Corcho. “Methodological Guidelines for Publishing Linked Data.” Una Profesión, Un Futuro: Actas de Las XII Jornadas Españolas de Documentación: Málaga 25, no. 26 (2011): 20.

[12] Labrinidis, Alexandros, and H. V. Jagadish. “Challenges and Opportunities with Big Data.” Proc. VLDB Endow. 5, no. 12 (August 2012): 2032–33. doi:10.14778/2367502.2367572.

[13] Curry, Edward, et al. “Big Data. Technical Working Groups White Paper,” 2014. http://bigproject.eu/sites/default/files/BIG_D2_2_2.pdf.

[14] Berners-Lee, Tim (2008). Linked open Data. See also: http://www.w3.org/2008/Talks/0617-lod-tbl/#%281%29, accessed May 9, 2015.

[15] Porter, Michael (1985). Competitive Advantage. New York: Free Press.

[16] Archer, Phil; Dekkers, Max; Goedertier, Stijn; Loutas, Nikolaos (2013). Study on business models for Linked Open Government Data (BM4LOGD - SC6DI06692). Services. See also: http://ec.europa.eu/isa/documents/study-on-business-modelsopen-government_en.pdf, accessed May 10, 2015.
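
Supplementary note on the greyscale weighting described in Section 3: the rule can be read as a simple proportion, and the following Python sketch illustrates one possible reading of it. It assumes that the grey share of a secondary-topic cell equals 50% divided by the number of papers with the main classification, multiplied by the number of related classifications. The function name grey_value, its parameters and the related count in the example are hypothetical and do not stem from the survey data.

```python
def grey_value(main_count: int, related_count: int) -> float:
    """Return the grey share (black = 1.0) of one secondary-topic cell.

    Assumed reading of Section 3: 50% divided by the number of papers
    with the main classification, multiplied by the number of related
    classifications for the secondary topic.
    """
    if main_count <= 0:
        raise ValueError("main_count must be a positive number of papers")
    if related_count < 0:
        raise ValueError("related_count must not be negative")
    return 0.5 * related_count / main_count


# Hypothetical example: a class with 4 main papers (the count reported for
# distribution), of which 1 is assumed to carry a given secondary topic.
print(grey_value(main_count=4, related_count=1))  # 0.125
```

Under this reading a cell reaches at most 50% grey, namely when every paper of a main class also carries the secondary topic, so secondary cells always remain visually weaker than the black main cells.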