Linked Data Analytics for Business Intelligence SMEs: a Pilot Case in the Pharmaceutical Sector

Barbara Kapourani
Critical Publics, 4 Flitcroft St., London, WC2H 8DH, UK
barbara@criticalpublics.com

Eleni Fotopoulou, Anastasios Zafeiropoulos
Ubitech, Thessalias 8 & Etolias 10, 15231 Chalandri, Athens, Greece
{efotopoulou, azafeiropoulos}@ubitech.eu

Dimitris Papaspyros, Spyros Mouzakitis, Sotiris Koussouris
National Technical University of Athens, DSS Lab, 9, Iroon Polytechniou str., Zografou, Athens, 15780, Greece
{dpap, smouzakitis}@epu.ntua.gr, skous@me.com

ABSTRACT
The adoption of linked data concepts by SMEs, combined with sophisticated analytics and visualization techniques within an integrated environment called the LinDA Workbench, appears to reduce the effort needed for specific tasks within the company by almost 50% in terms of time. At the same time, it supports the re-design of existing business processes and introduces increased efficiency and flexibility. In this paper, we briefly present the initial findings of the Business Intelligence Analytics (BIA) pilot operation of the LinDA project, which concerns the Over-The-Counter (OTC) medicines liberalisation in Europe. The pilot aims at examining the association between the usage of OTC medicines and pharmaceutical parameters on the one hand, and other healthcare, socio-economic and political parameters on the other. Focus is given to the added value that emerges through the consumption and production of linked data for analysis purposes, as well as to the challenges faced in the execution of such a pilot.

Keywords
Linked data, business intelligence analytics, SMEs, Over the Counter (OTC) medicines.

1. INTRODUCTION
The information era seems to have given way to the analytics era, in which the only way for SMEs to manipulate and extract intelligence from the huge amount of available data and information is to invest in innovative solutions. The LinDA project [2] is an effort in that direction. It is a co-funded European project under the FP7 framework, aiming to support SMEs in effectively adopting Linked Open Data (LOD) in their pursuit of competitiveness, by providing a complete set of tools for the publication, consumption, analysis and visualization of linked data in an easy, user-friendly way. In this paper we briefly present the LinDA concepts through a Business Intelligence Analytics (BIA) scenario. Section 2 gives a short reference to existing challenges and related work, while Section 3 presents the tools composing the integrated LinDA environment. Section 4 provides a step-by-step presentation of a real-life scenario, which concerns the Over-The-Counter (OTC) medicines liberalization in Europe. Section 5 shows the added value identified for the SME, and Section 6 concludes the paper with plans for future work.

2. CHALLENGES AND RELATED WORK
Challenges for Linked Data provisioning and consumption mainly concern the renovation, compilation, maintenance and update of proper, meaningful and high-quality datasets that can easily be consumed via a set of tools, as well as the need to work, in many cases, with heterogeneous and high-volume data sources [5][6]. The challenges can be split into three distinct categories: (a) reviewing the datasets and preparing them in a proper format, (b) interpreting or extracting knowledge from the data through interlinking, inference and analytics extraction, and (c) maintaining and updating the data regularly.

Several projects have handled parts of these challenges, delivering, however, standalone frameworks that do not provide a holistic workflow. Among such initiatives, the LOD2 project (http://lod2.eu/) aims to contribute high-quality interlinked versions of public Semantic Web data sets, promoting their use in new cross-domain applications by developers across the globe. The DIACHRON project (http://www.diachron-fp7.eu/) takes on the challenges of evolution, archiving, provenance, annotation, citation and data quality in the context of LOD, and intends to automate the collection of metadata, provenance and all forms of contextual information, so that data are accessible and usable at the point of creation and remain so indefinitely. The SDI4APPS project (http://sdi4apps.eu/) handles the uptake of open geographic information through innovative services based on LOD. With regard to enabling Linked Data analytics, no holistic framework exists, to the authors' knowledge, that is able to consume Linked Data for the production of analytics and to produce output data interlinked with the input data [4]. Finally, it should be noted that considerable effort is also devoted to the design of systems for the production of Big Data analytics that take into account the collection of data in a distributed way, but without providing mechanisms for evaluating or improving the quality of the available data, or techniques for producing Linked Data prior to their processing by the analytics tools [5][6].

The distinguishing characteristic of LinDA, compared to existing tools, is that it provides a complete open-source package of Enterprise Linked Data tools to quickly map and publish your data in the Linked Data format, interlink them with other public or private data, analyse them and create visualizations.

3. LINDA WORKBENCH
The LinDA Workbench [3], a result of the LinDA project, is an integrated environment consisting of a set of tools that facilitate the manipulation of linked data for analysis. More specifically, it consists of: (a) the LinDA Transformation engine, a lightweight tool for transforming data to linked data, (b) the LinDA Vocabulary repository, for increasing the semantic interoperability of your data, (c) the LinDA RDF2Any, a tool for converting RDF to conventional data structures so that it can be used by legacy applications, (d) the LinDA Query Builder and Query Designer, to easily navigate and query your data, (e) the LinDA visualization, to perform smart visualizations on linked data out of the box, and (f) the LinDA Analytics package, for running analytic processes against your data. In an easy, straightforward and user-friendly way, the user can follow a simple 3-step procedure in order to transform the data to RDF, link (and query) them with public and private endpoints and, finally, analyse and/or visualize the resulting information. The 3-step procedure is as follows:

Turn Data into RDF: Using the LinDA Transformation engine, users can publish their data as linked data in a few simple steps. They can simply connect to their database, select the data table they want and make mappings to popular and standardized vocabularies. LinDA assists further by providing automatic suggestions for the mapping through its Suggest API (Oracle).

Query / Link your Data: With the LinDA Query Builder and Query Designer, users can perform simple or complex queries through an intuitive graphical environment that eliminates the need to write SPARQL syntax. With simple drag-and-drop functionality, users can perform complex SPARQL queries and filtering, including interlinking with external SPARQL endpoints.

Visualize / Analyze your Data: The LinDA Visualization and Analytics engines can help enterprise users gain insight from the data that the company generates. The added value of LinDA visualizations and analytics, in comparison with traditional tools, is that they take advantage of the enriched metadata contained within the Linked Data format to produce more meaningful visualizations. On top of that, users can link their data with any other private or public data, thereby realizing an ecosystem of data extractions and visualizations that can be bound together in a dynamic and unforeseen way.

4. REAL-LIFE SCENARIO
This section briefly presents a real-life scenario implemented within the LinDA Workbench, aiming at validating the provided functionality and showing the added business value that can be gained. For this purpose, we consider an SME operating in the Business Intelligence (BI) sector that uses the LinDA Workbench for part of its daily business operations. The SME provides consultation services to a wide range of national and multinational enterprises, organizations and governments, helping them to build their communication strategy and support their decision-making processes. This is achieved by employing consultants who gather and assess huge amounts of data on a daily basis, from different sources, in different formats and using a variety of tools. Usually, the volume, variety, velocity, time-sensitivity, heterogeneity and non-interoperability of the data to be handled are a great burden for the SME, in terms of effort, time, resources and complexity.

The pilot scenario relates to the Over-The-Counter (OTC) medicines liberalization in Europe, in terms of price, retail and entry. As stated in [1], the sales of OTC medicines have been increasing in recent years, despite the global financial crisis, while at the same time the role of OTC in the pharmaceutical market is promising and prominent, opening new business opportunities and potential for growth. Many studies have been conducted on the OTC market, examining different aspects of it, but each focuses only on a subset of the correlated parameters that affect or are affected by the OTC liberalization, which reveals the complexity of the domain.

Using the linked data approach, through the LinDA BIA pilot, we have tried to compose a complete, cumulative conceptualization map of all the correlated parameters for the OTC market, trying to fill the gap left by the other studies. The objective is to investigate the effects of the OTC liberalization on various parameters (e.g. healthcare expenditures, OTC revenues) by studying and analyzing the countries where the liberalization has already taken place, and by combining this information with the unique parameters of the remaining European countries of interest, where the liberalization has not yet been implemented. Based on this conceptualization map, and by using the LinDA Workbench, we have tried to identify the indicators that play a significant role in the OTC liberalization, to find their correlations and to analyze their impact, aiming at business intelligence insights that can prove helpful in decision-making processes.

4.1 Scenario Implementation
Figure 1 presents the concrete steps followed for the implementation of the OTC scenario within the LinDA framework.

Figure 1. OTC scenario implementation steps in LinDA

The conceptualization map of the OTC scenario is the result of extensive scientific research and an intelligence-gathering process, conducted manually by the involved actors, ranging from health advisors to business intelligence consultants. It acts as the basis of the data needed for composing the full picture of the correlated parameters for the OTC market. The identified public and private datasets of this map have been classified into five main categories: (a) healthcare indicators, (b) OTC indicators, (c) economic indicators, (d) political indicators and (e) social indicators. Overall, the map contains more than twenty unique indicators, half of which represent private datasets of the SME that have been created for the purpose of the OTC scenario. Furthermore, the indicators span various time ranges and different European countries. Most of these datasets are in Excel or CSV format, while some of the public sector datasets can be retrieved from SPARQL endpoints (e.g. Eurostat, Worldbank, Transparency.org).

Based on the conceptualization map and the acquired datasets, two major interlinking directions have been followed in the BIA pilot's scenario. In the first case (Figure 2), all the private datasets have been interlinked with the World Factbook (http://wifo5-03.informatik.uni-mannheim.de/factbook/), both by country and by year property. In the second case (Figure 3), some of the identified private and public datasets have been interlinked together, so that they can be used to examine open issues of the OTC scenario.

Figure 2. OTC scenario – private datasets interlinking

Figure 3. OTC scenario private & public datasets interlinking

After the preparation and interlinking of the datasets, the SME has used the LinDA tools (e.g. Transformation Engine, Query Designer) for uploading and saving the datasets, transforming them into RDF format and formulating the interlinking queries to be used next, in the analysis phase. The LinDA Analytics tool is then used for drilling down into the study, which is the fundamental step of the intelligence extraction process.

Initially, a set of regression analyses is carried out to examine the relationship between the variation of the OTC expenditures and several parameters. The first regression analysis concerns the relationship between OTC health expenditures and the GDP per capita (GDPpc) of each country. Based on the produced results, it can be claimed that an increase in the GDPpc by 10 US$ leads to a reduction in the percentage of OTC expenditures by 1.06%. This reduction can be attributed to the better social care that can be associated with larger GDPpc values (e.g. through an increase in public expenditure on health) and thus to less need for using OTC medicines.

The next step in the analysis examines the relationship between the OTC expenditures and the GDP growth rate. A regression analysis is carried out, leading, however, to a very small adjusted R-squared value, which means that the variation of the GDP growth rate explains only a very small percentage of the variation of the OTC expenditures.

Next, a set of analyses is carried out to examine the relationship between OTC expenditures and governmental indicators, specifically the political stability and governmental effectiveness indicators. Based on the produced results, it seems that higher political stability and higher governmental effectiveness are associated with a reduction in the OTC expenditures. It could be claimed that in countries with political stability and high-quality governmental institutions, social care is advanced and the budget allocated to healthcare is associated with effective results, reducing the need for OTC medicines. However, the analysis has to be developed further in order to acquire more insights on this aspect.

Furthermore, a clustering analysis is carried out on an interlinked dataset (linking parameters such as government effectiveness, political stability, GDP per capita and OTC expenditures per country and per year). It aims to provide indications regarding the existence of clusters based on the examined parameters and to investigate the composition of such clusters. A k-means clustering algorithm is executed to partition the overall observations into three clusters. The results are depicted in Figure 4.

Figure 4. OTC scenario – K-means clustering results

Three well-defined clusters with no overlap are produced. The first cluster refers to countries with low political stability, low governmental effectiveness and low GDP per capita that tend to have high OTC expenditures. The second cluster refers to countries with medium-to-high political stability, high governmental effectiveness and high GDP per capita that tend to have low OTC expenditures. The third cluster refers to countries with medium political stability, low governmental effectiveness and low GDP per capita that tend to have medium-to-low OTC expenditures.

It should be noted that the aforementioned results are initial insights based on the first phase of the BIA pilot's execution, while further analysis is envisaged for the upcoming period. Furthermore, the analysis performed also includes a set of visualisations produced by the corresponding LinDA tool, which provide initial views and insights with regard to the evolution of specific parameters.

5. SME'S ADDED VALUE WITH LINDA
In order to evaluate the business value introduced to the SME by the usage of the LinDA Workbench, we have introduced a simple example, where four parameters (e.g. GDP per capita, OTC expenditures, unemployment and political stability) have to be acquired, transformed, analyzed and visualized on a weekly basis, with and without the usage of the LinDA tools. The evaluation takes into consideration the execution time needed, the people engaged and the tools employed. Figure 5 presents in detail the SME's four core business phases, as well as the tasks, resources and effort that are needed when the example is implemented "as usual" and when it is implemented with the help of the LinDA Workbench.

Figure 5. SME workflow, with and without LinDA

It has been noticed that the time needed for the completion of the example is reduced with the usage of LinDA by about 20 min to 3,41 hours per realisation of a daily analytics process. Moreover, the tasks executed are reduced from 11 to 9. As for the involved experts, changes have been detected: with LinDA a new actor is introduced, the data analyst, while the designer that is needed in the "as usual" case is no longer required. The latter is quite significant, because it is closely related to the reduction of the outsourcing costs of the report's visual design. Furthermore, what is vital is that with LinDA all the tasks are carried out within the unified environment of the LinDA Workbench. The Workbench can thus play the role of a "one-stop-shop" with a user-friendly and straightforward UI, sidestepping the need for different tools and distributed, autonomous working environments, which has so far been the SME's working practice.

Overall, the added value coming from the usage of linked data and the LinDA Workbench is clear. The linked data concept can assist and improve access to more, and more immediate, information, while the usage of the LinDA Workbench has proven to facilitate the SME's current workflow and to reduce the effort required for specific tasks, in terms of time and resources. More importantly, LinDA has introduced new business value for the SME (e.g. analytic processes which were not implemented before, due to lack of experience or time), which can raise the provided services to new levels of completeness and sophistication in analysis and visualization.

6. CONCLUSION
In this paper we have presented the workflow followed by an SME operating in the business intelligence sector and providing consultation services to its clients, with and without the usage of the LinDA Workbench, so as to derive insights about the usefulness of adopting linked data concepts. The current results, from the first pilot round, are quite promising, showing that the tasks are performed faster, more easily and with fewer human resources, while new business value is also introduced. However, there are challenges that need to be investigated further, such as the need for evaluating the quality of the considered data, the need for optimally transforming business data to linked data through the usage of specific vocabularies, and the shift in mentality within SMEs towards a more "data-scientist" oriented workflow, along with the organization of the corresponding training activities.

7. ACKNOWLEDGMENTS
This work has been co-funded by the LinDA project, a European Commission research program under Contract Number FP7-610565.

8. REFERENCES
[1] Tisman, A. 2010. The Rising Tide of OTC in Europe, IMS Health.
[2] The LinDA project, http://linda-project.eu/
[3] The LinDA Workbench, http://linda.epu.ntua.gr/
[4] Fotopoulou, E., et al. 2015. Exploiting Linked Data Towards the Production of Added-Value Business Analytics and Vice-versa, DATA 2015, Colmar, Alsace, France, July 2015.
[5] Networked and Electronic Media Initiative, Big and Open Data position paper, December 2013.
[6] UN Global Pulse, White Paper: Big Data for Development – Challenges and Opportunities, May 2012. Accessed on 30.01.2015 at http://www.unglobalpulse.org/sites/default/files/BigDataforDevelopment-UNGlobalPulseJune2012.pdf.
[7] Hu, H., Wen, Y., Chua, T. and Li, X. 2014. Toward Scalable Systems for Big Data Analytics: A Technology Tutorial, IEEE Access, vol. 2, pp. 652–687, 2014.
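
To illustrate the "Turn Data into RDF" step described in Section 3, the following is a minimal sketch of how tabular rows might be serialized as N-Triples. It is not the LinDA Transformation engine: the namespace http://example.org/otc/, the property names and the sample values are hypothetical and chosen only for illustration.

```python
import csv
import io

BASE = "http://example.org/otc/"  # hypothetical namespace, not taken from the paper
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def rows_to_ntriples(csv_text):
    """Serialize each CSV row as N-Triples, one subject per (country, year)."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}obs/{row['country']}-{row['year']}>"
        triples.append(f'{subject} <{BASE}country> "{row["country"]}" .')
        triples.append(f'{subject} <{BASE}year> "{row["year"]}" .')
        triples.append(
            f'{subject} <{BASE}otcExpenditure> '
            f'"{row["otc_expenditure"]}"^^<{XSD_DECIMAL}> .'
        )
    return triples


# Made-up sample values, for illustration only.
SAMPLE = "country,year,otc_expenditure\nGR,2012,10.5\nUK,2012,8.2\n"
ntriples = rows_to_ntriples(SAMPLE)
```

Keying each subject on country and year mirrors the two properties used for interlinking in Section 4.1; a real mapping would reuse terms from standardized vocabularies rather than an ad hoc namespace.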
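
The regression analyses in Section 4.1 are interpreted through a slope and an adjusted R-squared value. A minimal sketch of how both quantities can be computed for a single predictor is given below. The numbers are made up and are not the pilot's data, and the function name simple_ols is introduced here only for illustration; the paper does not state which implementation the LinDA Analytics tool uses.

```python
def simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance of y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalises the number of predictors p (here p = 1).
    p = 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return slope, intercept, r2, adj_r2


# Made-up data: GDP per capita (thousand US$) vs. OTC share of health spending (%).
gdp_pc = [10.0, 20.0, 30.0, 40.0, 50.0]
otc_share = [12.0, 11.1, 9.8, 9.1, 8.0]
slope, intercept, r2, adj_r2 = simple_ols(gdp_pc, otc_share)
```

A negative slope corresponds to the reported direction of the GDPpc effect, while an adjusted R-squared close to zero, as reported for the GDP growth rate, would indicate that the predictor explains little of the variation in OTC expenditures.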
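
The clustering step in Section 4.1 partitions country-year observations into three clusters with k-means. The sketch below shows the basic algorithm (Lloyd's iterations) on made-up observations. It assumes that the four features (political stability, governmental effectiveness, GDP per capita, OTC expenditures) have been rescaled to the range 0 to 1 and that the initial centroids are picked by hand; neither assumption is stated in the paper.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, initial_centroids, max_iterations=50):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep the centre of an empty cluster unchanged
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Made-up observations: (stability, effectiveness, GDP per capita, OTC expenditures).
observations = [
    (0.1, 0.2, 0.1, 0.9), (0.2, 0.1, 0.2, 0.8),  # low, low, low, high OTC
    (0.8, 0.9, 0.9, 0.1), (0.7, 0.8, 0.8, 0.2),  # medium-to-high, high, high, low OTC
    (0.5, 0.2, 0.2, 0.4), (0.5, 0.3, 0.1, 0.3),  # medium, low, low, medium-to-low OTC
]
start = [observations[0], observations[2], observations[4]]
labels, centroids = kmeans(observations, start)
```

Fixing the starting centroids makes the run deterministic; library implementations typically use randomized initialization with several restarts instead.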