Supporting open dataset publication decisions based on Open Source Software reuse Alvaro E. Prieto Jose-Norberto Mazón Universidad de Extremadura Universidad de Alicante Cáceres, Spain San Vicente del Raspeig, Alicante, Spain aeprieto@unex.es jnmazon@dlsi.ua.es Adolfo Lozano-Tello Luis-Daniel Ibáñez Universidad de Extremadura University of Southampton Cáceres, Spain Southampton, United Kingdom alozano@unex.es l.d.ibanez@southampton.ac.uk ABSTRACT the other hand, a big city with consolidated open data portals Publishing and maintaining open data is a costly task for public may prefer opening datasets that could be used in complex and institutions, that becomes even more challenging in the context mature software applications that involve big teams, since it is of Smart Cities, where large amounts of varied data are generated more relevant to their specific technological industry context. from different domains. To optimize resources, they should prior- Unfortunately, to the best of our knowledge, Smart Cities itize the publication and maintenance of datasets most likely to lack such decision support system, mainly because the process generate social and economic impact. However, there is currently of calculation of those indicators that would use the system is a lack of decision-support tools to help public sector data publish- not a trivial task. According to Janssen et al. [14] , “there is no ers to evaluate datasets on the light of their particular reuse goals. way to predict and calculate the return of investment (ROI) in In this paper, we propose to suggest to data publishers the dataset advance [. . . ]". The main challenge is that open data has no value categories with most potential impact, based on the impact of in itself; it only becomes valuable when used”. Therefore, the already published datasets of the same category. 
To measure im- main problem is that data owners have limited understanding pact, we propose a set of indicators based on the amount and on how open data is reused, thus lacking knowledge about the quality of Open Source Software projects that use datasets. To impact generated by reusing the published open data. aggregate indicators according to specific reuse goals, we provide More reasonable indicators of the use of open datasets could an Analytic-Hierarchy-Process based tool. help to identify which categories of datasets have more possi- bilities of being reused and, in this way, generate some type of economic impact to people or enterprises. In this sense, good 1 INTRODUCTION indicators could come from the reuse of datasets within the open One of the most important challenges faced by Smart Cities is source community. The Tenth Annual Future of Open Source creating an ecosystem of public and private actors that reuse open Survey [11] reflects the increasing adoption of pen source and data in order to produce IT services and products that both (i) highlights the abundance of organizations participating in the would improve citizens’ quality of life and (ii) would contribute open source community. Concretely, this survey estimates that to economic growth [32]. However, few open data portals in 65% of companies currently participate in open source projects. cities currently track data usage and consider the impact of data Open Source Software (hereon OSS) is considered to encourage on deciding which datasets to maintain or what complementary the creation of SMEs and jobs, by providing a skills development datasets publish. Cities are not even aware of what kinds of environment valued by employers and retaining a greater share apps are developed, using what data, and how many there are. of generated value locally [8]. 
Focusing in Europe, a study esti- Answering these questions is a significant research issue [30] mated that the contribution of OSS to its economy was of 450 that would allow prioritizing which categories of data must be billion euro per year [7]. published and maintained with respect to the applications that Based on these figures, an estimation of the use of the different use them (i.e., impact that a category of open data generates). categories of datasets by the OSS community could be a good To reverse this situation, publishing datasets as open data indicator of their potential impact. Therefore, when Smart Cities requires a decision support system to select those categories of make decisions on which data to publish, they could prioritize datasets that offer higher potential to generate value [12]. Such a publication of data which allows a community of developers system must consider indicators about the impact of the already to generate impact and effectively release benefits of open data published open datasets, as well as the strategy of the Smart through OSS projects. City. E.g., a small town could provide an open data portal with In this paper, we present an approach based on the estimation many high-quality datasets but the portal is rather unknown, of indicators of the use of open datasets in OSS projects. The and the technological fabric of the city is composed of small IT goal of this approach is to provide Smart Cities with a Decision companies. Therefore, the goal of the city could be to extend the Support System which provides an ordered list of categories of use of the open data portal by prioritizing those datasets that datasets most suitable to be published or maintained in their open belong to categories that are likely to generate a large number data portal. To do so, we have carried out a set of actions aimed of projects -though simpler ones that involve fewer people. 
On at estimating useful impact indicators related to the datasets of © 2018 Copyright held by the owner/author(s). Published in the Workshop the same category already published by open data portals of Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, other cities. Concretely, to calculate our proposed indicators we Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted needed two kinds of data sources: (i) already published Smart City under the terms of the Creative Commons license CC-by-nc-nd 4.0. datasets (and their metadata) and (ii) OSS projects (together with 2.1 A proposal of indicators of reuse based on information about them) which referenced the gathered datasets; GitHub i.e., we needed to know which open datasets were being used in Smart Cities should follow a strategy for opening data as de- which OSS projects. To collect already published open datasets, scribed in [17]. This strategy should prioritize publication of we chose Socrata [26] because it is one of the most used open data which allows a community of developers to generate im- data repositories, and notably by some of the most important pact and effectively release benefits of open data through OSS US cities. We also measured the existence of potential reuses projects [37]. A Smart City could in fact prioritize publication within a community in order to measure open data impact. To of open data with more reuse potential depending on the cate- do this, we used GitHub [9], because it is the largest web-based gory to which the data belong to. However, due to “open-data distributed revision control and source code repository in the by default” idiosyncrasy [23], data is usually published without world, and the source of several empirical studies such as in Yu establishing specific goals and without imposing utilization or et al. [33]. authentication restrictions to the infomediaries and end users. 
As Using the indicators obtained from these sources, we provide a result, collecting the usage information and measuring impact an Analytic Hierarchy Process (hereon AHP)-based [24] tool1 generated by open datasets may become very complex. that allows decision makers weigh these indicators, taking into To overcome this situation, our approach is based on consid- account the reuse objectives of the city, to offer an ordered list of ering that the more used an open dataset is by OSS projects, categories of datasets recommended to publish. the more impact is generated. Therefore, we borrowed some This paper is structured as follows: section 2 describes a new well-known indicators that measure the success of OSS projects approach to select the most relevant categories of data to be and we have used as starting point to develop our indicators published in a smart city open data portal. Section 3 presents to measure such success when open data is reused. Then, these toy samples of two different stereotypical smart cities using our indicators allow Smart Cities to measure which categories of approach and, to finish, section 4 summarizes other work related open data have more reuse potential and decide which data must to the publishing of open data in Smart Cities. be released according to the requirements of each city. The fol- lowing indicators from existing research literature on OSS are 2 USING REUSE INDICATORS BASED ON considered [27] [28]. First of all, we included (i) number of people DATA FROM OSS PROJECTS IN GITHUB who agree to receive information about the project because they FOR SELECTING DATASETS TO OPEN find it interesting (subscribers), and (ii) number of people who actually work on the OSS project (developers). 
On the one hand, This section describes the steps that have been carried out to get subscribers to OSS choose to obtain information on the project an AHP process that allows classifying categories of dataset based and thus reveal a deeper interest in the OSS project. The sub- on the preferences of the decision-maker. These preferences are scriber indicator not only measures interest within the project applied to a set of useful indicators obtained from data about but the reputation of the project within the community and the their reuse in OSS Projects of GitHub repositories. Concretely, dissemination of the project through the community. On the these steps2 are detailed in the following subsections and are other hand, the number of developers working on a project is summarized below: critical to its success, since survival of an OSS project depends on (1) From GitHub repositories, studying the characteristics of continued contribution from developers [28]. There is another OSS projects that use open datasets. This information was measure for the success of OSS projects [27] as the (iii) age of analyzed to establish a set of reuse indicators. an active project that is positively related to OSS progress to- (2) Gathering datasets from 32 cities of the United States (such ward completion, as well as the experience of the community of as San Francisco, Chicago or New York) which use Socrata developers. as an open data repository. With respect to this point, it Based on these three indicators described in the literature should also be noted that, although these cities are from about success of OSS projects, we developed a set of three in- the same country, United States, they have different cul- dicators that measure the success of open source projects that tural, social and economic characteristics that make us con- reuse open datasets (they are summarized in Table 1). 
The aim is sider that the results obtained from their data are enough to compare projects that use different categories of datasets and scalable to other Smart Cities located in different coun- how successful they are. First of all, we define the reputation tries. among a community of developers of OSS projects that reuse (3) Classifying the datasets according to a set of categories open data from a category. Some projects that reuse open data specifically designed for Smart Cities. from some specific categories can be perceived by developers (4) Searching for references to the datasets obtained from as being highly appealing projects. Smart Cities are interested Socrata in GitHub to calculate the indicators. in opening data that will be reused in these kinds of projects in (5) With the reuse indicators established in step 1 as crite- view of creating a community around open data, thus allowing an ria, and the values from step 4, we have created a Google open data portal to attract the attention of potential developers. Spreadsheet [w3] based on AHP that allows decision mak- Therefore, the reputation indicator measures how well-known ers to prioritize the most relevant categories of datasets projects reusing data from some specific category are (within the that must be published in a smart city open data portal. community of developers). Furthermore, the size of the com- munity involved in projects that use data from a category is defined in terms of the size of the community of developers that use open data from a given category. A city needs to adapt the 1 https://goo.gl/HcUc1e 2 A repository containing all the scripts and detailed instructions needed to carry out size of the community to the budget and available infrastructure. 
a functional application of our approach is available at GitHub https://goo.gl/TDp1xi Finally, maturity of projects that use an open data category is Table 1: Proposed indicators and their definitions Table 2: G8 Open Data Categories Indicator Description Id Data Category Example Datasets Reputation Average number of subscribers of each 1 Companies Company/business register repository that references datasets of the 2 Crime and Justice Crime statistics, safety category 3 Earth observation Meteorological/weather, agri- Community size Average number of contributors of every culture, forestry, fishing, and repository that references datasets of the hunting category 4 Education List of schools; performance of Maturity Average maturity of every repository ref- schools, digital skills erencing datasets of the category. Matu- 5 Energy and Environ- Pollution levels, energy con- rity is computed using 2 lifetimes, project ment sumption lifetime (PL) and last update lifetime 6 Finance and contracts Transaction spend, contracts let, (LUL). Thus, the resulting formula is: call for tender, future tenders, PL/LUL local budget, national budget Efficiency Proportion of datasets of each category (planned and spent) referenced in GitHub 7 Geospatial Topography, postcodes, na- tional maps, local maps 8 Global Development Aid, food security, extractives, proposed. Maturity means that the community has been working land on the project for some time without the project being aban- 9 Government Account- Government contact points, doned. A Smart City may want to select the datasets that help in ability and Democ- election results, legislation and promoting fewer projects stretching over longer periods of time, racy statutes, salaries (pay scales), rather than promoting a larger number of short-term projects. hospitality/gifts An additional indicator has been developed in order to assess 10 Health Prescription data, performance the impact of a dataset category, i.e. 
the likelihood of datasets data from each category of being used. To do so, we defined efficiency 11 Science and Research Genome data, research activity, of an open data category, as the probability of datasets of one experiment results category to be referenced by an OSS project. This indicator de- 12 Statistics National Statistics, Census, in- termines how relevant a category of datasets is. Smart Cities frastructure, wealth, skills will use this indicator to know which categories of open data 13 Social mobility and Housing, health insurance and are most likely to be reused. Therefore, in a scenario where the welfare unemployment benefits Smart City has the chance of opening a large number of datasets, 14 Transport and Infras- Public transport timetables, ac- the efficiency indicator will become secondary to the publishing tructure cess points broadband penetra- efforts regarding a wide a variety of datasets. tion As aforementioned, these indicators come from well-known indicators from the OSS community, being thus completely gen- eralizable to be used in any OSS repository. It is worth noting containing the identifier of every dataset and useful metadata that our proposal of indicators is not set in stone, consequently about it, such as the theme or the keyword of the dataset. These more indicators could be created and checked to be used by Smart metadata of open datasets are important because they are needed Cities according to their requirements. to facilitate the categorization step that comes next. To collect the data from Socrata, we followed these steps: 2.2 Search of smart city datasets on Socrata (1) Retrieve data from Socrata on institutions which use its Once the impact measuring indicators have been established Open Data Platform. 106 institutions were recovered. and defined, information should be gathered. 
This gathering of (2) Gather and filter the identifier and the minimal metadata information focuses on datasets specifically related to the smart needed to categorize them (theme or keyword) from every cities so as to obtain a more accurate assessment of the collected dataset published by US cities using Socrata. 8960 datasets data. from 32 different US cities met these conditions. Socrata is a software company focused “exclusively on de- mocratizing access to public sector data around the world”. It 2.3 Categorization fo datasets provides an Open Data Platform for allowing local, regional or In this step, we had to choose the taxonomy of dataset categories national governments to release data. Socrata is a partner of the to be analyzed. There is no common agreement on the best way USA National League of Cities [22] for the development of open of classifying Smart City open datasets. However, a 14 high-value data strategies. Nowadays, the Socrata Open Data Platform is data categories is suggested by the G8 Open Data Charter [10]. used by some of the most important US cities such as New York, These categories, together with example datasets for each one, Chicago, San Francisco or Los Angeles. In this respect, Socrata are shown in Table 2. is very useful as a proof-of-concept of our approach, since it is These categories seem to be a good way to classify Smart possible to collect precisely open dataset identifiers and their City datasets, however, some of these categories, such as Global metadata. In this sense, every Socrata dataset has its own end- Development and Science and Research, might not be used in the point and each is designated by a unique dataset identifier. Every Smart City context. Thus, specific domains which can generate Socrata open data portal provides a list of its published datasets data within a Smart City must be taken into account. 
In this sense, Table 3: G8 Open Data Categories Table 4: Proposal of Open Data categories for Smart Cities Id Domain Subdomain Id Data Category Example Datasets A Natural resources and1.-Smart grids 1 Administration & Fi- Audits and Reports, City Fi- energy 2.-Public lighting nance nance and Budget, City Govern- 3.-Green/renewable energies ment, Fees, Liabilities and As- 4.-Waste management sets, Purchasing, Revenue 5.-Water management 2 Business City Businesses, Community & 6.-Food and agriculture Economic Development, Grow- B Transport and mobil- 7.-City logistics ing Economy, Regulated Indus- ity 8.- Info-mobility tries 9.- People mobility 3 Demographics Census, CitiStat, Forecasts, Neighborhoods, Statistics C Buildings 10.-Facility management 4 Education Schools, Youth 11.-Building services 5 Ethics & Democracy City Management and Ethics, 12.-Housing quality Elections, Ethics, Expenditures, General Information, Gover- D Living 13.-Entertainment nance, Government, Human 14.-Hospitality Relations, Human Resources, 15.-Pollution control Legislation, People, Permitting, 16.-Public safety Public Works, Taxes 17.-Healthcare 6 Geospatial Geographic Locations and 18.-Welfare and social inclusion Boundaries, Mapping, Location, 19.-Culture GIS 20.-Public spaces management 7 Health Public Health, Human Services, E Government 21.-E-government Social Services 22.-E-democracy 8 Recreation & Culture Arts and Culture, Events, 23.-Procurement Greenways, Historic Preserva- 24.-Transparency tion, Library, Parks, Recreation, F Economy and people 25.-Innovation and en- Tourism trepreneurship 9 Safety Crime, Emergency, Fire, Police, 26.-Cultural heritage manage- Public Safety ment 10 Services 311 Call Center, City Services, 27.-Digital Education Community, Customer Service, 28.-Human capital management Facilities, Government Build- ings and Structures, Inspec- tional Services, Public Prop- erty, Public Services, Service Re- a survey [21] about Smart City initiatives proposes a 
classification quests divided in domains and subdomains show in Table 3 11 Sustainability Energy and Environment, Nat- Establishing an exhaustive classification of open data cate- ural Resources, Sustainability, gories for Smart Cities is beyond the scope of this paper. How- Waste Management, Food, Agri- ever, this work proposes an initial classification of open data culture categories for Smart Cities aimed to be as close as possible to 12 Transport & Infras- Airports, City Infrastructure, the G8 Open Data Charter but incorporating modifications to tructure Transportation, Parking, Street- encompass the aforementioned domains and subdomains proper car, Traffic to Smart Cities. This proposed classification is given in Table4 13 Urban Planning Area Plans, Buildings, City Fa- together with example datasets for each category. cilities, City Parks and Tree Once the categories were established we had to classify the Data, Construction, Develop- collected datasets according to such categories. Due to its char- ment, Housing, Land Use, Ur- acteristics, this step requires the participation of experts to ex- ban Planning ecute it adequately. The research groups that have developed 14 Welfare Insurance, Life Enrichment, this approach includes researchers working in related fields such Quality of Life, Pension, Re- as open data and knowledge representation. These researchers tirement, Sanitation, Social were responsible for classifying the datasets following the steps Services described below: (1) Extracting different themes from US city datasets. In our case, 215 different themes were extracted. (2) Mapping every theme to one of the available categories. Themes without a clear fit had to be classified as ‘Others’ in order to be discarded later. When we performed this step, 211 themes could be mapped to the established categories After this process, we made an estimation of the indicators in and 4 were classified as ‘Others’. order to be used with AHP. 
We defined a process consisting in (3) Automatically classifying datasets with a theme according the following steps: to the mapping in step 2. In our case, 8299 datasets were (1) Discarding repositories that do not have all the required classified according to the established categories, 11 were data to make an estimation of the indicators. When we categorized as ‘Others’ and 650 were not categorized due performed this step, only 2501 repositories remained. to their lack of theme. (2) Discarding all repeated references to a specific dataset (4) Optionally, trying to categorize datasets that have no from a specific repository. When we performed this step, theme manually, using other metadata such as keywords. 32551 unrepeated references from 2501 repositories re- This step can be carried out when the number of datasets mained. without a theme is considered high enough to distort the (3) Making an estimation of the indicators. When we per- value of the indicators. In our case, although the datasets formed this step, we applied the formulas previously pre- without a theme represented less than 10 sented in Table 1. (5) As a result of this process, 8949 datasets were adequately (4) Normalizing the indicators in order to use the ideal mode categorized and 11 were discarded due to their unclear fit. of AHP. When we applied this step to our case, the indi- cator of each category was divided by the maximal value 2.4 Collecting data from GitHub to calculate obtained by a category in the indicator. Thus, all the indi- indicators cators of each category were normalized to a 0-1 range. In order to calculate the above-described indicators on the suc- 2.5 Use of AHP to weight indicators cess of OSS projects that reuse open data, we decided to collect data from GitHub. 
GitHub, as mentioned previously, is a plat- The method of decision-making, which our model is based on, form for collaborative development of software based on a Git is named Analytic Hierarchy Process, hereinafter referred to as repository. It is used by individuals, communities and businesses AHP [25]. It is a powerful and flexible tool for decision-making alike to develop software projects. GitHub is free to use for public in complex multi-criteria problem situations and is useful for and open source projects, and it is profusely used in studies on comparing several alternatives when several objectives need to Software Engineering. Therefore, it offers useful data about open be borne in mind at the same time. source software projects, including information on whether they Following this method, the evaluator can directly assign a nor- are using open data. malized weight to a criterion that will indicate the importance GitHub has been used for collecting data and calculating indi- which that criterion has with regard to the final objective. Firstly, cators related to OSS success in several works such as [3] [19], the AHP method compares the relative importance that each where GitHub allows researchers to collect several measures criterion has in relation to all the others; this assessment enables regarding open source projects, for example, forks, stars, etc. the relative weights of the criteria to be calculated, and finally the GitHub has an API that is used to collect all required data from method normalizes the weights in order to obtain the measures an open source software project. More specifically, the data can for the existing alternatives; for this reason, AHP constitutes one be acquired from repositories and from users. A repository is of the best options to assist multi-criteria decision making. This a kind of software project folder that contains all the project method allows people to gather knowledge about a particular files. 
Valuable data from a repository that can be collected by problem, to quantify subjective opinions and to force the compar- using the API, apart from the code itself, are as follows: repos- ison of alternatives in relation to established criteria. The method itory_id, user_id, stargazers_count, watchers_count, language, consists in the following steps: forks_count, subscribers_count, network_count, created_at, up- (1) Define the problem and the main objective in making the dated_at, pushed_at, total_contributors, total_contributions. GitHub decision. user data also provide interesting data to be considered, such as (2) If required, build a hierarchy tree in this way: the root node followers_user, following_user, public_repos_user, location_user, is the objective of the problem, the intermediate levels are updated_at_user, created_at_user. The indicators used in our the criteria, and the lowest level contains the alternatives. approach are based on these data. We established a process for (3) At each level, build a pairwise comparison matrix with the identifying which OSS projects were using open datasets from brothers (sons of the same node). The matrix contains the Socrata US Cities. Our process consists in the following steps weights of pairwise comparisons between brother nodes. (it was implemented by using the GitHub API within a Pentaho This provides us with a pairwise comparison matrix (see Data Integration [5] process): a simple example in Table 5) for each parent node. (4) For each comparison matrix, an eigenvector must be cal- (1) Searching every eight-character code from existing Socrata culated, using the equation: |A − λI | = 0, where A is the datasets belonging to USA cities (obtained as described in comparison matrix, I is the identity matrix and λ is the Section 3.3.1) based on code from OSS repositories hosted eigenvector. This calculus must be performed for each on GitHub in order to know which projects are reusing level of the tree. open data. 
When we performed this step, 350644 refer- (5) Rate each alternative (leaf nodes) with a previously calcu- ences were found from 2517 repositories to 5874 of the lated fixed value for every criteria. The scales for rating 8949 categorized datasets. alternatives should be established and described in a pre- (2) Gathering required data from GitHub on the repositories cise way. that reference open datasets to make an estimation of the (6) Determine the value of each alternative using a weighted indicators. In our case we found that 2501 of the 2517 addition formula, with the weights from the previous steps. repositories had all the needed data. These results ascend up the tree to calculate the final value of the objective (root). This final value is used to make a decision about the alternative to choose. Using this method, as final stage, we have created a Google Spreadsheet based on AHP that uses the reuse indicators as cri- teria of the process. Concretely, this spreadsheet is composed of three sheets: (1) ‘Indicators’. This sheet provides the normalized indicators that were calculated from GitHub in the previous step. (2) ‘AHP Criterion Pair Comparison’. This sheet allows assess- ing the relative importance between pairs of indicators using AHP. Thereby, a decision maker could weigh the importance of the indicators set out in the previous steps, taking into account the characteristics and objectives of the city. These weights can be assigned according to the in- stitution’s strategic reuse objectives. Thus, different Smart Cities may have different objectives, strategies and target audiences when deciding which datasets should have pri- ority of publication. Each city has its own idiosyncrasy Figure 1: Simulated weights of a medium-sized town. defining what is most important or of particular interest, and it is unlikely two cities share the same priorities with regard to their respective reuse objectives. 
Cities can be characterized by their size, the importance of the tourism sector, or its residential, commercial or industrial sectors, etc. And also, cities may have different priorities for pub- lishing datasets depending on the type of reuse they want to promote. The result of this step will be the eigenvectors of each matrix, meaning the relative importance of the established indicators. (3) Finally, the ‘AHP Direct Results’ shows a suitability rank- ing list of dataset categories to publish according to the weights introduced in the second sheet and the indicators calculated from GitHub shown in the first sheet. That is, the value used to elaborate such ranking is the result of multiplying the relative importance of each indicator, cal- culated in the second sheet, by the values of the indicators in the corresponding categories shown in the first sheet. Figure 2: Medium-sized town rankings Thus, the use of this tool allows Smart Cities to prioritize datasets in a reasonable way based on the data collected from well-known cities, the indicators taken into account and the open data strategy of the city. 3 SIMULATING THE BEHAVIOUR OF THE TOOL ON STEREOTYPICAL CITIES In order to check our proposal according to different motivations in the weighting process, we have simulated the behavior of the tool taking into account the different prospects of two stereotyp- ical cities. We asked three experts to agree on the importance assignment of the indicators, with the assumptions of the two cities. On one hand, a medium-sized town located in a rural region, with small software companies in its zone rather than big ones, that is starting to develop its own open data portal. On the other hand, a big city with a well-known open data portal and a lot of Figure 3: Default ranking cutting edge software companies in its area of influence. 
In the first case, we have assumed that the town would be interested mainly in getting reuses of its different datasets through the development of simple applications by small local enterprises. Hence, the town would assign high weights to efficiency, whereas reputation, size of the community and maturity would play a secondary role.

The weights applied with this philosophy are shown in Figure 1, and the resulting ranking is shown in Figure 2. The first position of ‘Geospatial’ does not change with respect to the default ranking (same weights for all the indicators) shown in Figure 3, but the rest of the ranking undergoes some variations.

Figure 4: Simulated weights of a big-sized city
Figure 5: Big city ranking

In the second case, we have conjectured that, because its portal is well known, the city does not seek more reuses (that is, efficiency) but mature projects with good reputation and bigger communities behind them. The weights applied with this philosophy are shown in Figure 4.

The ranking obtained with these weights is shown in Figure 5. Here, ‘Geospatial’ drops to third position and ‘Welfare’ takes the first one. As can be seen, the indicators obtained from GitHub cause some categories to hold a stable position in the ranking regardless of the weights assigned with AHP but, even so, different combinations of weights may change this ranking.

4 RELATED WORK
This section gives a description of (i) some relevant studies about the use of GitHub to measure different indicators about Open Source Software projects, (ii) applications of AHP in Smart Cities, and (iii) the most relevant studies about how (local) governments publish open data.

Firstly, GitHub is used by individuals, communities and businesses alike to develop software projects. GitHub is free to use for public and OSS projects, and it is widely used in Software Engineering studies related to OSS success. Thus, Bissyande et al. [3] use GitHub to study a possible relation between programming languages and project success. Marlow et al. [19] analyze metadata of GitHub projects to find how its users decide whom and what to keep track of, or where to contribute next. Sheoran et al. [25] investigate what kind of contributors the “watchers” of GitHub can be. Jarczyk et al. [15] study the relation between the popularity of a project in GitHub and its quality. Muthukumaran et al. [20] use GitHub to propose change metrics that can predict possible bugs. As far as we know, this is the first time GitHub has been used to estimate indicators related to reuse of open data in OSS projects.

Secondly, AHP is a multiple criteria decision making method that has been used in many different applications related to decision making [31]. Some works specifically use AHP in Smart Cities and e-government. In this context, Bartolozzi et al. [2] present a DSS which uses AHP for supporting the decision-making process related to Smart City issues. Sultan et al. [29] suggest the use of AHP to decide the most appropriate technology for the development of e-government projects in Smart Cities. Boselli et al. [4] use AHP to rank the factors for innovating a smart-mobility service in the city of Milan. A very interesting use of AHP to evaluate open data portal quality can be found in Kubler et al. [18]. The authors propose considering different dimensions (completeness, openness, addressability and retrievability) to assess the quality of 146 open data portals. Although there are several applications of AHP to the domains of Smart Cities and e-government, they all aim at assessing Smart City strategies and the quality of open data portals. Instead, our approach proposes AHP to recommend the most appropriate datasets to be published.

Finally, with respect to how (local) governments publish open data, Conradie & Choenni [6] explain that data release by local governments is still a novel task, thus knowledge is lacking as to its benefits and barriers. Therefore, they conduct a participatory action research approach to get a better understanding of how internal processes of local governments influence data release. The authors found that the following indicators needed to be addressed by local governments to overcome barriers to releasing public sector information: (i) Data Storage, i.e., is data stored centrally, or is it decentralized?; (ii) Use of data, i.e., the way data is used by the department; (iii) Source of data, i.e., how is a set of data obtained?; and (iv) Suitability of data for release, i.e., are there rules and regulations that determine whether a dataset may be released or not, such as privacy or copyright.

Notwithstanding, these indicators are related to current data but do not address the actual use of the data and its benefits. For example, Hossain et al. [13] show that benefits associated with opening data are ill-understood. In their systematic review of open government data initiatives, Attard et al. [1] explore open data initiatives of a large number of governments, as well as existing tools and approaches. They found that while efforts have focused on developing tools for helping data publishers to open data, there have been no initiatives related to strategies for supporting decisions on which data to release. This means that public entities may end up publishing data with no value, rather than focusing on the relevance of the data they are publishing. Therefore, success in opening data is not a matter of the amount of data published, but of understanding how data is reused. As highlighted by Zuiderwijk & Janssen [34], since providers of open data are not concerned with the needs of open data users, they do not know how their data are reused, and business related issues (such as creation of added-value services or products based on open data) are not widely used as a decision criterion.

Furthermore, Zuiderwijk et al. [36] argue that the publication of open data is often cumbersome, so standard procedures and processes for opening data are required. They found a series of barriers preventing easy and low-cost publication of open data, leading them to propose a set of five design principles for improving the open data publishing process of public organizations: (i) start thinking about the opening of data at the beginning of the process; (ii) develop guidelines, especially about privacy and policy sensitivity of data; (iii) provide decision support by integrating insights into the activities of other actors involved in the publishing process; (iv) make data publication an integral, well-defined and standardized part of daily procedures and routines; and (v) monitor how the published data are reused. Our approach is related to principle (iii) since we provide a decision support framework based on activities of data consumers. We also contribute to principle (v) since our approach is useful for monitoring how datasets are being reused in OSS applications.

Additionally, Jetzek et al. [16] propose a framework to explain how value is generated from open data. This framework is useful for governments to understand the value of their open data. Their framework is based on assessing the impact of open data along two dimensions: (i) how openness generates value, and (ii) how society as a whole can get value from openness. The authors identify four different archetypical generative mechanisms (cause-effect relationships between open data and value) in their framework: transparency (open data helps to improve visibility to ensure socially responsible resource allocation), participation (open data as a mechanism for engaging stakeholders who help in solving social problems), efficiency (open data to improve how resources are used) and innovation (open data as a cornerstone for generating new ideas, processes, services and products). The authors claim that their framework can help governments in the development of their strategy for opening data by considering factors that can enable the generation of value from open data through the mechanism of innovation.

Furthermore, Zuiderwijk & Janssen [35] state that different types of users of open data are often interested in different types of data; therefore, publication of data can be improved by taking into account the preferences of certain data users for certain types of data.

Therefore, there are several methods that support opening data, but to the best of our knowledge no approaches focus on supporting Smart Cities in selecting and prioritizing which datasets should be opened according to their preferences and the context of the city they work for. To fill this gap, we presented our approach based on obtaining useful indicators from Socrata and GitHub and using them with AHP.

5 CONCLUSIONS
Smart Cities usually have a limited budget and insufficient time to release and maintain all available open data. In this paper, we have presented an approach whose goal is to provide an AHP tool that allows weighting different indicators of reuse, calculated using Socrata and GitHub as sources of information, in order to combine them taking into account objective criteria. This approach is characterized by:
(1) A classification of 14 categories for Smart City open datasets based on the G8 Open Data Charter and the Smart City domain.
(2) A definition of 4 indicators based on the reuse of datasets in OSS projects.
(3) Almost 9000 open datasets located from many of the most important US cities.
(4) A catalogue of these US city datasets classified according to the proposed categories.
(5) Around 32000 distinct references from 2500 different GitHub projects referencing two thirds of the categorized datasets found, based on a search performed over all OSS projects in GitHub.
(6) An estimation of the defined indicators of reuse of every Smart City dataset category.
(7) An AHP-based Decision Support System to recommend Smart City dataset categories to prioritize, taking into account the estimated indicators and the importance of each indicator for the cities.

This approach is completely functional and reproducible. We provide a public repository containing the data obtained from Socrata and GitHub, the scripts to collect and analyze the information, and the AHP tool, so that users can use or modify these processes. Smart Cities or any other public institution can thus reuse and adapt them to their concrete requirements. In this sense, further alternative applications of our approach that can be considered as a continuation of this research may include:
(1) Searching and categorizing open datasets of different cities, regions, countries, companies or any other kind of institutions in order to get more data.
(2) Developing semantic-based software tools for automatic classification of datasets.
(3) Analyzing the reuse of open datasets in proprietary software projects, for instance, by developing a web repository of apps where developers could register their applications that use open data and indicate which particular datasets are reused.
(4) Analyzing the impact of open datasets in mass media, social media, blogs, etc. by searching for references to the open datasets in these sites.
(5) A set of controlled experiments to demonstrate the effectiveness of our approach in different scenarios.

In summary, a successful publication of open datasets should be based on the proper combination of the objectives of the open data portal and the analysis of the impact of already available open datasets. This approach provides a useful method for Smart City decision makers to carry out this task in an objective and analytic way.

6 ACKNOWLEDGEMENTS
We would like to thank GitHub, which allowed us to use its API without limitations, and Socrata, which provides a way to collect precisely all the datasets published using its tools. This work has been developed with the support of (i) TIN2015-69957-R and TIN2016-78103-C2-2-R (MINECO/ERDF, EU) project, (ii) POCTEP 4IE project (0045-4IE-4-P), (iii) Consejería de Economía e Infraestructuras/Junta de Extremadura (Spain) - European Regional Development Fund (ERDF) - GR15098 project and IB16055 project, and (iv) Consejería de Educación y Empleo/Junta de Extremadura (Spain) - Becas de Movilidad al Personal Docente e Investigador Curso 2016/2017.

REFERENCES
[1] Judie Attard, Fabrizio Orlandi, Simon Scerri, and Sören Auer. 2015. A systematic review of open government data initiatives. Government Information Quarterly 32, 4 (2015), 399–418. https://doi.org/10.1016/j.giq.2015.07.006
[2] Marco Bartolozzi, Pierfrancesco Bellini, Paolo Nesi, Gianni Pantaleo, and Luca Santi. 2015. A Smart Decision Support System for Smart City. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). IEEE, 117–122. https://doi.org/10.1109/SmartCity.2015.57
[3] Tegawende F. Bissyande, Ferdian Thung, David Lo, Lingxiao Jiang, and Laurent Reveillere. 2013. Popularity, interoperability, and impact of programming languages in 100,000 open source projects. In Proceedings - International Computer Software and Applications Conference. IEEE, 303–312. https://doi.org/10.1109/COMPSAC.2013.55
[4] Roberto Boselli, Mirko Cesarini, Fabio Mercorio, and Mario Mezzanzanica. 2015. Applying the AHP to Smart Mobility Services: A Case Study. In Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: KomIS. SCITEPRESS, 354–361. https://doi.org/10.5220/0005580003540361
[5] Hitachi Vantara Community. 2018. Data Integration - Kettle. (2018). http://community.pentaho.com/projects/data-integration/
[6] Peter Conradie and Sunil Choenni. 2014. On the barriers for local government releasing open data. Government Information Quarterly 31, SUPPL.1 (2014), S10–S17. https://doi.org/10.1016/j.giq.2014.01.003
[7] Carlo Daffara. 2012. Estimating the Economic Contribution of Open Source Software to the European Economy. In The First Openforum Academy Conference Proceedings. OpenForum Europe LTD, 11–14.
[8] Rishab Aiyer Ghosh. 2006. Economic impact of open source software on innovation and the competitiveness of the Information and Communication Technologies (ICT) sector in the EU. Technical Report. Maastricht: UNU-MERIT. http://stuermer.ch/blog/documents/FLOSSImpactOnEU.pdf
[9] Github. 2018. Github: The world’s leading software development platform. (2018). https://www.github.com/
[10] Group of Eight. 2013. G8 Open Data Charter. (2013). https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/207772/Open_Data_Charter.pdf
[11] Jeffrey Hammond, Paul Santinelli, Jay Jay Billings, and Bill Ledingham. 2016. The Tenth Annual Future of Open Source Survey. Technical Report. Black Duck Software and North Bridge. https://www.blackducksoftware.com/2016-future-of-open-source
[12] Anders Hjalmarsson, Niklas Johansson, and Daniel Rudmark. 2015. Mind the gap: Exploring stakeholders’ value with open data assessment. In Proceedings of the Annual Hawaii International Conference on System Sciences. IEEE, 1314–1323. https://doi.org/10.1109/HICSS.2015.160
[13] Mohammad Alamgir Hossain, Yogesh K Dwivedi, and Nripendra P. Rana. 2016. State of the Art in Open Data Research: Insights from Existing Literature and a Research Agenda. Journal of Organizational Computing and Electronic Commerce 26, 1-2 (apr 2016), 14–40. https://doi.org/10.1080/10919392.2015.1124007
[14] Marijn Janssen, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. Benefits, Adoption Barriers and Myths of Open Data and Open Government. Information Systems Management 29, 4 (sep 2012), 258–268. https://doi.org/10.1080/10580530.2012.716740
[15] Oskar Jarczyk, Blazej Gruszka, Szymon Jaroszewicz, and Leszek Bukowski. 2014. GitHub Projects. Quality Analysis of Open-Source Software. In SocInfo 2014: The 6th International Conference on Social Informatics. Springer, Cham, 80–94. https://doi.org/10.1007/978-3-319-13734-6_6
[16] Thorhildur Jetzek, Michel Avital, and Niels Bjorn-Andersen. 2014. Data-driven innovation through open government data. Journal of Theoretical and Applied Electronic Commerce Research 9, 2 (aug 2014), 100–120. https://doi.org/10.4067/S0718-18762014000200008
[17] Maxat Kassen. 2013. A promising phenomenon of open data: A case study of the Chicago open data project. Government Information Quarterly 30, 4 (2013), 508–513. https://doi.org/10.1016/j.giq.2013.05.012
[18] Sylvain Kubler, Jérémy Robert, Yves Le Traon, Jürgen Umbrich, and Sebastian Neumaier. 2016. Open Data Portal Quality Comparison using AHP. In Proceedings of the 17th International Digital Government Research Conference on Digital Government Research - dg.o ’16. ACM Press, New York, New York, USA, 397–407. https://doi.org/10.1145/2912160.2912167
[19] Jennifer Marlow, Laura Dabbish, and Jim Herbsleb. 2013. Impression Formation in Online Peer Production: Activity Traces and Personal Profiles in GitHub. In 16th ACM Conference on Computer Supported Cooperative Work. ACM Press, New York, New York, USA, 117–128. https://doi.org/10.1145/2441776.2441792
[20] K. Muthukumaran, Abhinav Choudhary, and N.L. Bhanu Murthy. 2015. Mining GitHub for Novel Change Metrics to Predict Buggy Files in Software Systems. In 2015 International Conference on Computational Intelligence and Networks. IEEE, 15–20. https://doi.org/10.1109/CINE.2015.13
[21] Paolo Neirotti, Alberto De Marco, Anna Corinna Cagliano, Giulio Mangano, and Francesco Scorrano. 2014. Current trends in smart city initiatives: Some stylised facts. Cities 38 (2014), 25–36. https://doi.org/10.1016/j.cities.2013.12.010
[22] National League of Cities. 2018. National League of Cities. (2018). https://www.nlc.org/
[23] Monica Palmirani, Michele Martoni, and Dino Girardi. 2014. Beyond Transparency Introduction : OGA Beyond Transparency. Electronic Government and the Information Systems Perspective (EGOVIS 2014) 8650, 2014 (2014), 275–291. https://doi.org/10.1007/978-3-319-10178-1_22
[24] T.L. Saaty. 1980. The Analytic Hierarchy Process. McGraw-Hill, New York.
[25] Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, Daniela Damian, and Jordan Ell. 2014. Understanding "watchers" on GitHub. In MSR 2014: Proceedings of the 11th Working Conference on Mining Software Repositories. ACM Press, New York, New York, USA, 336–339. https://doi.org/10.1145/2597073.2597114
[26] Socrata. 2018. Socrata: Data-driven innovation of government programs. (2018). https://www.socrata.com/
[27] Katherine J. Stewart, Anthony P. Ammeter, and Likoebe M. Maruping. 2006. Impacts of license choice and organizational sponsorship on user interest and development activity in open source software projects. Information Systems Research 17, 2 (jun 2006), 126–144. https://doi.org/10.1287/isre.1060.0082
[28] Chandrasekar Subramaniam, Ravi Sen, and Matthew L. Nelson. 2009. Determinants of open source software project success: A longitudinal study. Decision Support Systems 46, 2 (jan 2009), 576–585. https://doi.org/10.1016/j.dss.2008.10.005
[29] Abobakr Sultan, Khalid A. AlArfaj, and Ghassan A. AlKutbi. 2012. Analytic hierarchy process for the success of e-government. Business Strategy Series 13, 6 (nov 2012), 295–306. https://doi.org/10.1108/17515631211286146
[30] Jeffrey Thorsby, Genie N.L. Stowers, Kristen Wolslegel, and Ellie Tumbuan. 2016. Understanding the content and features of open data portals in American cities. Government Information Quarterly 34, 1 (2016), 53–61. https://doi.org/10.1016/j.giq.2016.07.001
[31] Omkarprasad S. Vaidya and Sushil Kumar. 2006. Analytic hierarchy process: An overview of applications. European Journal of Operational Research 169, 1 (2006), 1–29. https://doi.org/10.1016/j.ejor.2004.04.028
[32] Nils Walravens, Jonas Breuer, and Pieter Ballon. 2014. Open Data as a Catalyst For The Smart City as a Local Innovation Platform. Communications & Strategies 96, 4th quarter 2014 (2014), 15–33. https://ssrn.com/abstract=2636315
[33] Liguo Yu, Alok Mishra, and Deepti Mishra. 2014. An Empirical Study of the Dynamics of GitHub Repository and Its Impact on Distributed Software Development. In Proceedings of the Confederated International Workshops on On the Move to Meaningful Internet Systems: OTM 2014 Workshops - Volume 8842. Springer-Verlag New York, Inc., 457–466. https://doi.org/10.1007/978-3-662-45550-0_46
[34] Anneke Zuiderwijk and Marijn Janssen. 2013. A Coordination Theory Perspective to Improve the Use of Open Data in Policy-Making. In Proceedings of the 12th IFIP WG 8.5 International Conference on Electronic Government - Volume 8074. Springer-Verlag New York, Inc., 38–49. https://doi.org/10.1007/978-3-642-40358-3_4
[35] Anneke Zuiderwijk and Marijn Janssen. 2014. Barriers and Development Directions for the Publication and Usage of Open Data: A Socio-Technical View. In Open Government. Vol. 4. Springer New York, New York, NY, 115–135. https://doi.org/10.1007/978-1-4614-9563-5_8
[36] Anneke Zuiderwijk, Marijn Janssen, Sunil Choenni, and Ronald Meijer. 2014. Design principles for improving the process of publishing open data. Transforming Government: People, Process and Policy 8, 2 (may 2014), 185–204. https://doi.org/10.1108/TG-07-2013-0024
[37] Anneke Zuiderwijk, Iryna Susha, Yannis Charalabidis, Peter Parycek, and Marijn Janssen. 2015. Open data disclosure and use: critical factors from a case study. In CeDEM 2015: Proceedings of the International Conference for E-Democracy and Open Government 2015. Edition Donau-Universität Krems, 197–208.