=Paper= {{Paper |id=Vol-2062/paper3 |storemode=property |title=Supporting Open Dataset Publication Decisions Based on Open Source Software Reuse |pdfUrl=https://ceur-ws.org/Vol-2062/paper03.pdf |volume=Vol-2062 |authors=Alvaro E. Prieto,Jose-Norberto Mazón,Adolfo Lozano-Tello,Luis-Daniel Ibáñez |dblpUrl=https://dblp.org/rec/conf/dolap/PrietoMTI18 }} ==Supporting Open Dataset Publication Decisions Based on Open Source Software Reuse== https://ceur-ws.org/Vol-2062/paper03.pdf
Supporting open dataset publication decisions based on Open
                  Source Software reuse
                              Alvaro E. Prieto                                                      Jose-Norberto Mazón
                       Universidad de Extremadura                                                 Universidad de Alicante
                             Cáceres, Spain                                                San Vicente del Raspeig, Alicante, Spain
                           aeprieto@unex.es                                                         jnmazon@dlsi.ua.es

                          Adolfo Lozano-Tello                                                        Luis-Daniel Ibáñez
                       Universidad de Extremadura                                                 University of Southampton
                              Cáceres, Spain                                                    Southampton, United Kingdom
                            alozano@unex.es                                                     l.d.ibanez@southampton.ac.uk

ABSTRACT
Publishing and maintaining open data is a costly task for public institutions, and it becomes
even more challenging in the context of Smart Cities, where large amounts of varied data are
generated from different domains. To optimize resources, they should prioritize the publication
and maintenance of the datasets most likely to generate social and economic impact. However,
there is currently a lack of decision-support tools to help public sector data publishers
evaluate datasets in the light of their particular reuse goals. In this paper, we propose to
suggest to data publishers the dataset categories with the most potential impact, based on the
impact of already published datasets of the same category. To measure impact, we propose a set
of indicators based on the number and quality of Open Source Software projects that use
datasets. To aggregate indicators according to specific reuse goals, we provide an
Analytic-Hierarchy-Process based tool.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the
EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN
1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons
license CC-by-nc-nd 4.0.

1    INTRODUCTION
One of the most important challenges faced by Smart Cities is creating an ecosystem of public
and private actors that reuse open data in order to produce IT services and products that both
(i) would improve citizens’ quality of life and (ii) would contribute to economic growth [32].
However, few open data portals in cities currently track data usage and consider the impact of
data when deciding which datasets to maintain or which complementary datasets to publish.
Cities are not even aware of what kinds of apps are developed, using what data, and how many
there are. Answering these questions is a significant research issue [30] that would allow
prioritizing which categories of data must be published and maintained with respect to the
applications that use them (i.e., the impact that a category of open data generates).
   To reverse this situation, publishing datasets as open data requires a decision support
system to select those categories of datasets that offer higher potential to generate value
[12]. Such a system must consider indicators about the impact of the already published open
datasets, as well as the strategy of the Smart City. For example, a small town could provide
an open data portal with many high-quality datasets, but the portal is rather unknown and the
technological fabric of the city is composed of small IT companies. Therefore, the goal of the
city could be to extend the use of the open data portal by prioritizing those datasets that
belong to categories that are likely to generate a large number of projects, though simpler
ones that involve fewer people. On the other hand, a big city with consolidated open data
portals may prefer opening datasets that could be used in complex and mature software
applications that involve big teams, since it is more relevant to their specific technological
industry context.
   Unfortunately, to the best of our knowledge, Smart Cities lack such a decision support
system, mainly because calculating the indicators that the system would use is not a trivial
task. According to Janssen et al. [14], “there is no way to predict and calculate the return
of investment (ROI) in advance [...]. The main challenge is that open data has no value in
itself; it only becomes valuable when used”. Therefore, the main problem is that data owners
have limited understanding of how open data is reused, thus lacking knowledge about the impact
generated by reusing the published open data.
   More reasonable indicators of the use of open datasets could help to identify which
categories of datasets are more likely to be reused and, in this way, to generate some type of
economic impact for people or enterprises. In this sense, good indicators could come from the
reuse of datasets within the open source community. The Tenth Annual Future of Open Source
Survey [11] reflects the increasing adoption of open source and highlights the abundance of
organizations participating in the open source community. Concretely, this survey estimates
that 65% of companies currently participate in open source projects. Open Source Software
(hereafter OSS) is considered to encourage the creation of SMEs and jobs, by providing a
skills development environment valued by employers and retaining a greater share of generated
value locally [8]. Focusing on Europe, a study estimated the contribution of OSS to its
economy at 450 billion euro per year [7].
   Based on these figures, an estimation of the use of the different categories of datasets by
the OSS community could be a good indicator of their potential impact. Therefore, when Smart
Cities make decisions on which data to publish, they could prioritize publication of data
which allows a community of developers to generate impact and effectively release benefits of
open data through OSS projects.
   In this paper, we present an approach based on the estimation of indicators of the use of
open datasets in OSS projects. The goal of this approach is to provide Smart Cities with a
Decision Support System which provides an ordered list of categories of datasets most suitable
to be published or maintained in their open data portal. To do so, we have carried out a set
of actions aimed at estimating useful impact indicators related to the datasets of the same
category already published by open data portals of other cities. Concretely, to calculate our
proposed indicators we needed two kinds of data sources: (i) already published Smart City
datasets (and their metadata) and (ii) OSS projects (together with information about them)
which referenced the gathered datasets; i.e., we needed to know which open datasets were being
used in which OSS projects. To collect already published open datasets, we chose Socrata [26]
because it is one of the most widely used open data repositories, notably by some of the most
important US cities. We also measured the existence of potential reuses within a community in
order to measure open data impact. To do this, we used GitHub [9], because it is the largest
web-based distributed revision control and source code repository in the world, and the source
of several empirical studies such as that of Yu et al. [33].
   Using the indicators obtained from these sources, we provide an Analytic Hierarchy Process
(hereafter AHP)-based [24] tool¹ that allows decision makers to weigh these indicators, taking
into account the reuse objectives of the city, to offer an ordered list of categories of
datasets recommended for publication.
   This paper is structured as follows: section 2 describes a new approach to select the most
relevant categories of data to be published in a smart city open data portal. Section 3
presents toy samples of two different stereotypical smart cities using our approach and, to
finish, section 4 summarizes other work related to the publishing of open data in Smart Cities.

2     USING REUSE INDICATORS BASED ON DATA FROM OSS PROJECTS IN GITHUB
      FOR SELECTING DATASETS TO OPEN
This section describes the steps that have been carried out to obtain an AHP process that
allows classifying categories of datasets based on the preferences of the decision-maker.
These preferences are applied to a set of useful indicators obtained from data about their
reuse in OSS projects in GitHub repositories. Concretely, these steps² are detailed in the
following subsections and are summarized below:
    (1) From GitHub repositories, studying the characteristics of OSS projects that use open
        datasets. This information was analyzed to establish a set of reuse indicators.
    (2) Gathering datasets from 32 cities of the United States (such as San Francisco, Chicago
        or New York) which use Socrata as an open data repository. With respect to this point,
        it should also be noted that, although these cities are from the same country, the
        United States, they have different cultural, social and economic characteristics that
        make us consider the results obtained from their data sufficiently scalable to other
        Smart Cities located in different countries.
    (3) Classifying the datasets according to a set of categories specifically designed for
        Smart Cities.
    (4) Searching for references to the datasets obtained from Socrata in GitHub to calculate
        the indicators.
    (5) With the reuse indicators established in step 1 as criteria, and the values from step
        4, we have created a Google Spreadsheet [w3] based on AHP that allows decision makers
        to prioritize the most relevant categories of datasets that must be published in a
        smart city open data portal.

¹ https://goo.gl/HcUc1e
² A repository containing all the scripts and detailed instructions needed to carry out a
  functional application of our approach is available at GitHub https://goo.gl/TDp1xi

2.1    A proposal of indicators of reuse based on GitHub
Smart Cities should follow a strategy for opening data as described in [17]. This strategy
should prioritize publication of data which allows a community of developers to generate
impact and effectively release benefits of open data through OSS projects [37]. A Smart City
could in fact prioritize publication of open data with more reuse potential depending on the
category to which the data belong. However, due to the “open-data by default” idiosyncrasy
[23], data is usually published without establishing specific goals and without imposing
utilization or authentication restrictions on the infomediaries and end users. As a result,
collecting the usage information and measuring impact generated by open datasets may become
very complex.
   To overcome this situation, our approach is based on considering that the more used an open
dataset is by OSS projects, the more impact is generated. Therefore, we borrowed some
well-known indicators that measure the success of OSS projects and used them as a starting
point to develop our indicators to measure such success when open data is reused. Then, these
indicators allow Smart Cities to measure which categories of open data have more reuse
potential and decide which data must be released according to the requirements of each city.
The following indicators from existing research literature on OSS are considered [27] [28].
First of all, we included (i) number of people who agree to receive information about the
project because they find it interesting (subscribers), and (ii) number of people who actually
work on the OSS project (developers). On the one hand, subscribers to OSS choose to obtain
information on the project and thus reveal a deeper interest in the OSS project. The
subscriber indicator not only measures interest within the project but also the reputation of
the project within the community and the dissemination of the project through the community.
On the other hand, the number of developers working on a project is critical to its success,
since survival of an OSS project depends on continued contribution from developers [28].
Another measure of the success of OSS projects [27] is (iii) the age of an active project,
which is positively related to OSS progress toward completion, as well as to the experience of
the community of developers.
   Based on these three indicators described in the literature about success of OSS projects,
we developed a set of three indicators that measure the success of open source projects that
reuse open datasets (they are summarized in Table 1). The aim is to compare projects that use
different categories of datasets and how successful they are. First of all, we define the
reputation among a community of developers of OSS projects that reuse open data from a
category. Some projects that reuse open data from some specific categories can be perceived by
developers as being highly appealing projects. Smart Cities are interested in opening data
that will be reused in these kinds of projects in view of creating a community around open
data, thus allowing an open data portal to attract the attention of potential developers.
Therefore, the reputation indicator measures how well-known projects reusing data from some
specific category are (within the community of developers). Furthermore, the size of the
community involved in projects that use data from a category is defined in terms of the size
of the community of developers that use open data from a given category. A city needs to adapt
the size of the community to the budget and available infrastructure. Finally, maturity of
projects that use an open data category is
proposed. Maturity means that the community has been working on the project for some time
without the project being abandoned. A Smart City may want to select the datasets that help in
promoting fewer projects stretching over longer periods of time, rather than promoting a
larger number of short-term projects.
   An additional indicator has been developed in order to assess the impact of a dataset
category, i.e., the likelihood that datasets from each category are used. To do so, we defined
the efficiency of an open data category as the probability that datasets of that category are
referenced by an OSS project. This indicator determines how relevant a category of datasets
is. Smart Cities will use this indicator to know which categories of open data are most likely
to be reused. Therefore, in a scenario where the Smart City has the chance of opening a large
number of datasets, the efficiency indicator will become secondary to the publishing efforts
regarding a wide variety of datasets.

      Table 1: Proposed indicators and their definitions

  Indicator          Description
  Reputation         Average number of subscribers of each repository that references
                     datasets of the category
  Community size     Average number of contributors of every repository that references
                     datasets of the category
  Maturity           Average maturity of every repository referencing datasets of the
                     category. Maturity is computed using 2 lifetimes, project lifetime (PL)
                     and last update lifetime (LUL). Thus, the resulting formula is: PL/LUL
  Efficiency         Proportion of datasets of each category referenced in GitHub

      Table 2: G8 Open Data Categories

  Id   Data Category                     Example Datasets
  1    Companies                         Company/business register
  2    Crime and Justice                 Crime statistics, safety
  3    Earth observation                 Meteorological/weather, agriculture, forestry, fishing,
                                         and hunting
  4    Education                         List of schools; performance of schools, digital skills
  5    Energy and Environment            Pollution levels, energy consumption
  6    Finance and contracts             Transaction spend, contracts let, call for tender,
                                         future tenders, local budget, national budget (planned
                                         and spent)
  7    Geospatial                        Topography, postcodes, national maps, local maps
  8    Global Development                Aid, food security, extractives, land
  9    Government Accountability         Government contact points, election results, legislation
       and Democracy                     and statutes, salaries (pay scales), hospitality/gifts
  10   Health                            Prescription data, performance data
  11   Science and Research              Genome data, research activity, experiment results
  12   Statistics                        National Statistics, Census, infrastructure, wealth,
                                         skills
  13   Social mobility and welfare       Housing, health insurance and unemployment benefits
  14   Transport and Infrastructure      Public transport timetables, access points broadband
                                         penetration
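
As a rough illustration of how the four indicators of Table 1 could be computed for a single
category, the following Python sketch may help. It is not the authors' script: the Repo record
and the function names are hypothetical, and it assumes that PL is the number of days since
the repository was created and LUL the number of days since its last update (with LUL clamped
to at least one day to avoid a division by zero), which the text does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class Repo:
    """Hypothetical record for one GitHub repository that references a dataset."""
    subscribers: int
    contributors: int
    created: date
    last_update: date


def maturity(repo: Repo, today: date) -> float:
    """PL/LUL for one repository, under the assumptions stated above."""
    pl = (today - repo.created).days                 # assumed: days since creation
    lul = max((today - repo.last_update).days, 1)    # assumed: days since last update
    return pl / lul


def category_indicators(repos, datasets_total, datasets_referenced, today):
    """Return the four Table 1 indicators for one dataset category."""
    if not repos or datasets_total <= 0:
        raise ValueError("need at least one repository and one dataset")
    return {
        "reputation": mean(r.subscribers for r in repos),
        "community_size": mean(r.contributors for r in repos),
        "maturity": mean(maturity(r, today) for r in repos),
        "efficiency": datasets_referenced / datasets_total,
    }


if __name__ == "__main__":
    # Made-up values, only to show the call shape.
    sample = [
        Repo(10, 2, date(2017, 1, 1), date(2017, 12, 27)),
        Repo(30, 4, date(2017, 9, 23), date(2017, 12, 22)),
    ]
    print(category_indicators(sample, 20, 5, today=date(2018, 1, 1)))
```

Averaging per repository mirrors the wording of Table 1; a city that prefers to discount a
few very popular repositories could swap the mean for a median without changing the rest.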
   As aforementioned, these indicators come from well-known indicators from the OSS community,
being thus completely generalizable to any OSS repository. It is worth noting that our
proposal of indicators is not set in stone; consequently, more indicators could be created and
checked to be used by Smart Cities according to their requirements.

2.2     Search of smart city datasets on Socrata
Once the impact measuring indicators have been established and defined, information should be
gathered. This gathering of information focuses on datasets specifically related to smart
cities so as to obtain a more accurate assessment of the collected data.
   Socrata is a software company focused “exclusively on democratizing access to public sector
data around the world”. It provides an Open Data Platform that allows local, regional or
national governments to release data. Socrata is a partner of the USA National League of
Cities [22] for the development of open data strategies. Nowadays, the Socrata Open Data
Platform is used by some of the most important US cities such as New York, Chicago, San
Francisco or Los Angeles. In this respect, Socrata is very useful as a proof-of-concept of our
approach, since it is possible to precisely collect open dataset identifiers and their
metadata. In this sense, every Socrata dataset has its own endpoint and each is designated by
a unique dataset identifier. Every Socrata open data portal provides a list of its published
datasets containing the identifier of every dataset and useful metadata about it, such as the
theme or the keyword of the dataset. These metadata of open datasets are important because
they are needed to facilitate the categorization step that comes next. To collect the data
from Socrata, we followed these steps:
    (1) Retrieve data from Socrata on institutions which use its Open Data Platform. 106
        institutions were recovered.
    (2) Gather and filter the identifier and the minimal metadata needed to categorize them
        (theme or keyword) from every dataset published by US cities using Socrata. 8960
        datasets from 32 different US cities met these conditions.

2.3    Categorization of datasets
In this step, we had to choose the taxonomy of dataset categories to be analyzed. There is no
common agreement on the best way of classifying Smart City open datasets. However, a set of 14
high-value data categories is suggested by the G8 Open Data Charter [10]. These categories,
together with example datasets for each one, are shown in Table 2.
   These categories seem to be a good way to classify Smart City datasets; however, some of
these categories, such as Global Development and Science and Research, might not be used in
the Smart City context. Thus, specific domains which can generate data within a Smart City
must be taken into account. In this sense,
                 Table 3: G8 Open Data Categories                         Table 4: Proposal of Open Data categories for Smart Cities

 Id     Domain                     Subdomain                               Id   Data Category           Example Datasets
 A      Natural resources and1.-Smart grids                                1    Administration & Fi- Audits and Reports, City Fi-
        energy               2.-Public lighting                                 nance                nance and Budget, City Govern-
                             3.-Green/renewable energies                                             ment, Fees, Liabilities and As-
                             4.-Waste management                                                     sets, Purchasing, Revenue
                             5.-Water management                           2    Business             City Businesses, Community &
                             6.-Food and agriculture                                                 Economic Development, Grow-
 B      Transport and mobil- 7.-City logistics                                                       ing Economy, Regulated Indus-
        ity                  8.- Info-mobility                                                       tries
                             9.- People mobility                           3    Demographics         Census, CitiStat, Forecasts,
                                                                                                     Neighborhoods, Statistics
 C      Buildings                  10.-Facility management                 4    Education            Schools, Youth
                                   11.-Building services                   5    Ethics & Democracy   City Management and Ethics,
                                   12.-Housing quality                                               Elections, Ethics, Expenditures,
                                                                                                     General Information, Gover-
 D      Living                     13.-Entertainment                                                 nance, Government, Human
                                   14.-Hospitality                                                   Relations, Human Resources,
                                   15.-Pollution control                                             Legislation, People, Permitting,
                                   16.-Public safety                                                 Public Works, Taxes
                                   17.-Healthcare                          6    Geospatial           Geographic Locations and
                                   18.-Welfare and social inclusion                                  Boundaries, Mapping, Location,
                                   19.-Culture                                                       GIS
                                   20.-Public spaces management            7    Health               Public Health, Human Services,
(Table 4, continued)
                                     Social Services
 8    Recreation & Culture           Arts and Culture, Events, Greenways, Historic Preservation,
                                     Library, Parks, Recreation, Tourism
 9    Safety                         Crime, Emergency, Fire, Police, Public Safety
 10   Services                       311 Call Center, City Services, Community, Customer Service,
                                     Facilities, Government Buildings and Structures, Inspectional
                                     Services, Public Property, Public Services, Service Requests
 11   Sustainability                 Energy and Environment, Natural Resources, Sustainability,
                                     Waste Management, Food, Agriculture
 12   Transport & Infrastructure     Airports, City Infrastructure, Transportation, Parking,
                                     Streetcar, Traffic
 13   Urban Planning                 Area Plans, Buildings, City Facilities, City Parks and Tree
                                     Data, Construction, Development, Housing, Land Use,
                                     Urban Planning
 14   Welfare                        Insurance, Life Enrichment, Quality of Life, Pension,
                                     Retirement, Sanitation, Social Services

(Table 3, continued)
 E      Government                 21.-E-government
                                   22.-E-democracy
                                   23.-Procurement
                                   24.-Transparency
 F      Economy and people         25.-Innovation and entrepreneurship
                                   26.-Cultural heritage management
                                   27.-Digital Education
                                   28.-Human capital management

a survey [21] about Smart City initiatives proposes a classification divided into domains and
subdomains, shown in Table 3.
   Establishing an exhaustive classification of open data categories for Smart Cities is beyond
the scope of this paper. However, this work proposes an initial classification of open data
categories for Smart Cities that aims to be as close as possible to the G8 Open Data Charter
while incorporating modifications to encompass the aforementioned domains and subdomains
specific to Smart Cities. This proposed classification is given in Table 4 together with example
datasets for each category.
   Once the categories were established, we had to classify the collected datasets according to
these categories. Due to its characteristics, this step requires the participation of experts to
be executed adequately. The research groups that have developed this approach include
researchers working in related fields such as open data and knowledge representation. These
researchers were responsible for classifying the datasets following the steps described below:
     (1) Extracting different themes from US city datasets. In our case, 215 different themes
         were extracted.
     (2) Mapping every theme to one of the available categories. Themes without a clear fit had
         to be classified as ‘Others’ in order to be discarded later. When we performed this
         step, 211 themes could be mapped to the established categories and 4 were classified
         as ‘Others’.
     (3) Automatically classifying datasets with a theme according to the mapping in step 2. In
         our case, 8299 datasets were classified according to the established categories, 11
         were categorized as ‘Others’ and 650 were not categorized due to their lack of theme.
     (4) Optionally, trying to categorize datasets that have no theme manually, using other
         metadata such as keywords. This step can be carried out when the number of datasets
         without a theme is considered high enough to distort the value of the indicators. In
         our case, although the datasets without a theme represented less than 10
     (5) As a result of this process, 8949 datasets were adequately categorized and 11 were
         discarded due to their unclear fit.

2.4    Collecting data from GitHub to calculate indicators

In order to calculate the above-described indicators on the success of OSS projects that reuse
open data, we decided to collect data from GitHub. GitHub, as mentioned previously, is a
platform for collaborative development of software based on a Git repository. It is used by
individuals, communities and businesses alike to develop software projects. GitHub is free to
use for public and open source projects, and it is widely used in studies on Software
Engineering. Therefore, it offers useful data about open source software projects, including
information on whether they are using open data.
    GitHub has been used for collecting data and calculating indicators related to OSS success
in several works such as [3] [19], where GitHub allows researchers to collect several measures
regarding open source projects, for example, forks, stars, etc. GitHub has an API that is used
to collect all required data from an open source software project. More specifically, the data
can be acquired from repositories and from users. A repository is a kind of software project
folder that contains all the project files. Valuable data from a repository that can be
collected by using the API, apart from the code itself, are as follows: repository_id, user_id,
stargazers_count, watchers_count, language, forks_count, subscribers_count, network_count,
created_at, updated_at, pushed_at, total_contributors, total_contributions. GitHub user data
also provide interesting data to be considered, such as followers_user, following_user,
public_repos_user, location_user, updated_at_user, created_at_user. The indicators used in our
approach are based on these data. We established a process for identifying which OSS projects
were using open datasets from Socrata US Cities. Our process consists of the following steps
(it was implemented by using the GitHub API within a Pentaho Data Integration [5] process):

   (1) Searching for every eight-character code of the existing Socrata datasets belonging to
       US cities (obtained as described in Section 3.3.1) in the code of OSS repositories
       hosted on GitHub, in order to know which projects are reusing open data. When we
       performed this step, 350644 references were found from 2517 repositories to 5874 of the
       8949 categorized datasets.
   (2) Gathering required data from GitHub on the repositories that reference open datasets to
       make an estimation of the indicators. In our case we found that 2501 of the 2517
       repositories had all the needed data.

   After this process, we made an estimation of the indicators so that they could be used with
AHP. We defined a process consisting of the following steps:
   (1) Discarding repositories that do not have all the required data to make an estimation of
       the indicators. When we performed this step, only 2501 repositories remained.
   (2) Discarding all repeated references to a specific dataset from a specific repository.
       When we performed this step, 32551 unrepeated references from 2501 repositories
       remained.
   (3) Making an estimation of the indicators. When we performed this step, we applied the
       formulas previously presented in Table 1.
   (4) Normalizing the indicators in order to use the ideal mode of AHP. When we applied this
       step to our case, the indicator of each category was divided by the maximal value
       obtained by a category in the indicator. Thus, all the indicators of each category were
       normalized to a 0-1 range.

2.5    Use of AHP to weight indicators

The decision-making method on which our model is based is named Analytic Hierarchy Process,
hereinafter referred to as AHP [24]. It is a powerful and flexible tool for decision-making in
complex multi-criteria problem situations and is useful for comparing several alternatives when
several objectives need to be borne in mind at the same time.
   Following this method, the evaluator can directly assign a normalized weight to a criterion
that will indicate the importance which that criterion has with regard to the final objective.
Firstly, the AHP method compares the relative importance that each criterion has in relation to
all the others; this assessment enables the relative weights of the criteria to be calculated,
and finally the method normalizes the weights in order to obtain the measures for the existing
alternatives; for this reason, AHP constitutes one of the best options to assist multi-criteria
decision making. This method allows people to gather knowledge about a particular problem, to
quantify subjective opinions and to force the comparison of alternatives in relation to
established criteria. The method consists of the following steps:
   (1) Define the problem and the main objective in making the decision.
   (2) If required, build a hierarchy tree in this way: the root node is the objective of the
       problem, the intermediate levels are the criteria, and the lowest level contains the
       alternatives.
   (3) At each level, build a pairwise comparison matrix with the sibling nodes (children of
       the same node). The matrix contains the weights of pairwise comparisons between sibling
       nodes. This provides us with a pairwise comparison matrix (see a simple example in
       Table 5) for each parent node.
   (4) For each comparison matrix, an eigenvector must be calculated from the characteristic
       equation |A − λI| = 0, where A is the comparison matrix, I is the identity matrix and λ
       is an eigenvalue of A; the eigenvector associated with the largest eigenvalue gives the
       weights. This calculation must be performed for each level of the tree.
   (5) Rate each alternative (leaf nodes) with a previously calculated fixed value for every
       criterion. The scales for rating alternatives should be established and described in a
       precise way.
   (6) Determine the value of each alternative using a weighted addition formula, with the
       weights from the previous steps. These results ascend up the tree to calculate the final
       value of the objective (root). This final value is used to make a decision about the
       alternative to choose.
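
The AHP steps listed above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the comparison matrix, the ratings and the names `priority_vector`, `principal_eigenvalue`, `alternative_a` and `alternative_b` are hypothetical, and the eigenvector is approximated by power iteration rather than by solving the characteristic equation directly.

```python
def priority_vector(matrix, iterations=100, tol=1e-12):
    """Approximate the principal eigenvector of a pairwise comparison
    matrix by power iteration, normalized so that its entries sum to 1."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        updated = [value / total for value in product]
        converged = max(abs(a - b) for a, b in zip(updated, weights)) < tol
        weights = updated
        if converged:
            break
    return weights


def principal_eigenvalue(matrix, weights):
    """Estimate the largest eigenvalue as the mean of (A w)_i / w_i."""
    n = len(matrix)
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    return sum(product[i] / weights[i] for i in range(n)) / n


# Hypothetical judgments for three criteria: entry [i][j] states how much
# more important criterion i is than criterion j (reciprocal matrix).
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = priority_vector(comparisons)
lambda_max = principal_eigenvalue(comparisons, weights)

# Hypothetical ratings of two alternatives on the three criteria (0-1 scale).
ratings = {
    "alternative_a": [1.0, 0.5, 0.0],
    "alternative_b": [0.2, 1.0, 1.0],
}
# Weighted addition: the value of each alternative is the sum of its
# ratings multiplied by the criterion weights.
scores = {
    name: sum(w * r for w, r in zip(weights, values))
    for name, values in ratings.items()
}
best = max(scores, key=scores.get)
```

For a perfectly consistent matrix such as this one, the largest eigenvalue equals the number of criteria; larger values would indicate inconsistent judgments.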
   Using this method, as a final stage, we have created a Google Spreadsheet based on AHP that
uses the reuse indicators as criteria of the process. Specifically, this spreadsheet is composed
of three sheets:
    (1) ‘Indicators’. This sheet provides the normalized indicators that were calculated from
        GitHub in the previous step.
    (2) ‘AHP Criterion Pair Comparison’. This sheet allows assessing the relative importance
        between pairs of indicators using AHP. Thereby, a decision maker can weigh the
        importance of the indicators set out in the previous steps, taking into account the
        characteristics and objectives of the city. These weights can be assigned according to
        the institution’s strategic reuse objectives. Thus, different Smart Cities may have
        different objectives, strategies and target audiences when deciding which datasets
        should have priority of publication. Each city has its own idiosyncrasy defining what
        is most important or of particular interest, and it is unlikely that two cities share
        the same priorities with regard to their respective reuse objectives. Cities can be
        characterized by their size, the importance of their tourism sector, or their
        residential, commercial or industrial sectors, etc. Cities may also have different
        priorities for publishing datasets depending on the type of reuse they want to promote.
        The result of this step will be the eigenvectors of each matrix, meaning the relative
        importance of the established indicators.
    (3) Finally, the ‘AHP Direct Results’ sheet shows a suitability ranking of the dataset
        categories to publish according to the weights introduced in the second sheet and the
        indicators calculated from GitHub shown in the first sheet. That is, the value used to
        build this ranking is the result of multiplying the relative importance of each
        indicator, calculated in the second sheet, by the values of the indicators in the
        corresponding categories shown in the first sheet.
Thus, the use of this tool allows Smart Cities to prioritize datasets in a reasonable way based
on the data collected from well-known cities, the indicators taken into account and the open
data strategy of the city.

Figure 1: Simulated weights of a medium-sized town.

Figure 2: Medium-sized town rankings
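
As a rough illustration of the computation behind the ‘AHP Direct Results’ sheet, the following Python sketch normalizes raw indicators in the ideal mode (dividing by the maximum reached by any category) and ranks categories by weighted sum. All raw values and both weight profiles are invented for the example and are not the values obtained in this work; the function names `normalize_ideal` and `rank_categories` are likewise hypothetical.

```python
def normalize_ideal(indicators):
    """Divide each indicator by the maximum value reached by any category,
    so that every indicator falls in the 0-1 range."""
    names = list(next(iter(indicators.values())).keys())
    maxima = {n: max(values[n] for values in indicators.values()) for n in names}
    return {
        category: {n: (values[n] / maxima[n] if maxima[n] else 0.0) for n in names}
        for category, values in indicators.items()
    }


def rank_categories(normalized, weights):
    """Return (category, score) pairs sorted from best to worst, where the
    score is the weighted sum of the normalized indicators."""
    scores = {
        category: sum(weights[n] * value for n, value in values.items())
        for category, values in normalized.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented raw indicator values for three of the dataset categories.
raw = {
    "Geospatial": {"efficiency": 40.0, "reputation": 10.0, "community": 5.0, "maturity": 2.0},
    "Welfare": {"efficiency": 10.0, "reputation": 20.0, "community": 10.0, "maturity": 4.0},
    "Safety": {"efficiency": 20.0, "reputation": 5.0, "community": 2.0, "maturity": 1.0},
}
normalized = normalize_ideal(raw)

# Two invented weight profiles: one favouring efficiency, one favouring
# reputation, community size and maturity.
town_weights = {"efficiency": 0.7, "reputation": 0.1, "community": 0.1, "maturity": 0.1}
city_weights = {"efficiency": 0.1, "reputation": 0.3, "community": 0.3, "maturity": 0.3}

town_ranking = rank_categories(normalized, town_weights)
city_ranking = rank_categories(normalized, city_weights)
```

With these invented numbers the two profiles place different categories first, which is the kind of sensitivity to the weights that the tool is meant to expose.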

3    SIMULATING THE BEHAVIOUR OF THE TOOL ON STEREOTYPICAL CITIES
In order to check our proposal under different motivations in the weighting process, we have
simulated the behavior of the tool taking into account the different prospects of two
stereotypical cities. We asked three experts to agree on the importance assigned to the
indicators under the assumptions made for the two cities.
   On the one hand, a medium-sized town located in a rural region, with small software companies
in its zone rather than big ones, that is starting to develop its own open data portal. On the
other hand, a big city with a well-known open data portal and a lot of cutting-edge software
companies in its area of influence.
   In the first case, we have assumed that the town would be interested, mainly, in getting
reuses of its different datasets through the development of simple applications by small local
enterprises. Hence, the town would assign high weights to efficiency, whereas reputation, size
of the community and maturity would play a secondary role.
   The weights applied with this philosophy are shown in Figure 1, and the resulting ranking is
shown in Figure 2. The first position of ‘Geospatial’ does not change with respect to the
default ranking (same weights for all the indicators) shown in Figure 3, but the rest of the
ranking undergoes some variations.

Figure 3: Default ranking
   In the second case, we have conjectured that, because its portal is well known, the city does
not seek more reuses (that is, efficiency) but rather mature projects with a good reputation and
bigger communities behind them. The weights applied with this philosophy are shown in Figure 4.
   The ranking obtained with these weights is shown in Figure 5. Here, ‘Geospatial’ drops to
third position and ‘Welfare’ takes the first one. As can be seen, the indicators obtained from
GitHub cause some categories to keep a stable position in the ranking regardless of the weights
assigned with AHP; even so, different combinations of weights may change this ranking.

Figure 4: Simulated weights of a big-sized city

Figure 5: Big city ranking

4   RELATED WORK
This section gives a description of (i) some relevant studies about the use of GitHub to measure
different indicators about Open Source Software projects, (ii) applications of AHP in Smart
Cities, and (iii) the most relevant studies about how (local) governments publish open data.
   Firstly, GitHub is used by individuals, communities and businesses alike to develop software
projects. GitHub is free to use for public and OSS projects, and it is widely used in Software
Engineering studies related to OSS success. Thus, Bissyande et al. [3] use GitHub to study a
possible relation between programming languages and project success. Marlow et al. [19] analyze
the metadata of GitHub projects to find how its users decide whom and what to keep track of, or
where to contribute next. Sheoran et al. [25] investigate what kind of contributors the
“watchers” of GitHub can be. Jarczyk et al. [15] study the relation between the popularity of a
project in GitHub and its quality. Muthukumaran et al. [20] use GitHub to propose change metrics
that can predict possible bugs. As far as we know, this is the first time GitHub has been used
to estimate indicators related to the reuse of open data in OSS projects.
    Secondly, AHP is a multiple criteria decision making method that has been used in many
different applications related to decision making [31]. Some works specifically use AHP in Smart
Cities and e-government. In this context, Bartolozzi et al. [2] present a DSS which uses AHP for
supporting the decision-making process related to Smart City issues. Sultan et al. [29] suggest
the use of AHP to decide the most appropriate technology for the development of e-government
projects in Smart Cities. Boselli et al. [4] use AHP to rank the factors for innovating a
smart-mobility service in the city of Milan. A very interesting use of AHP to evaluate open data
portal quality can be found in Kubler et al. [18]. The authors propose considering different
dimensions (completeness, openness, addressability and retrievability) to assess the quality of
146 open data portals. Although there are several applications of AHP to the domains of Smart
Cities and e-government, they all aim at assessing Smart City strategies and the quality of open
data portals. Instead, our approach proposes AHP to recommend the most appropriate datasets to
be published.
    Finally, with respect to how (local) governments publish open data, Conradie & Choenni [6]
explain that data release by local governments is still a novel task, thus knowledge is lacking
as to its benefits and barriers. Therefore, they conduct a participatory action research
approach to get a better understanding of how internal processes of local governments influence
data release. The authors found that the following indicators needed to be addressed by local
governments to overcome barriers to releasing public sector information: (i) Data Storage, i.e.,
is data stored centrally, or is it decentralized?; (ii) Use of data, i.e., the way data is used
by the department; (iii) Source of data, i.e., how is a set of data obtained?; and (iv)
Suitability of data for release, i.e., are there rules and regulations that determine whether a
dataset may be released or not, such as privacy or copyright.
    However, these indicators are related to current data but do not address the actual use of
the data and its benefits. For example, Hossain et al. [13] show that benefits associated with
opening data are ill-understood. In their systematic review of open government data initiatives,
Attard et al. [1] explore open data initiatives of a large number of governments, as well as
existing tools and approaches. They found that while efforts have focused on developing tools
for helping data publishers to open data, there have been no initiatives related to strategies
for supporting decisions on which data to release. This means that public entities may end up
publishing data with no value, rather than focusing on the relevance of the data they are
publishing. Therefore, success in opening data is not a matter of the amount of data published,
but of understanding how data is reused. As highlighted by Zuiderwijk & Janssen [34], since
providers of open data are not concerned with the needs of open data users, they do not know how
their data are reused, and business related issues (such as creation of added-value services or
products based on open data) are not widely used as a decision criterion.
    Furthermore, Zuiderwijk et al. [36] argue that the publication of open data is often
cumbersome, so standard procedures and processes for opening data are required. They found a
series of barriers preventing easy and low-cost publication of open data, leading them to
propose a set of five design principles for improving the open data publishing process of public
organizations: (i) start thinking about the opening of data at the beginning of the process;
(ii) develop guidelines, especially about privacy and policy sensitivity of data; (iii) provide
decision support by integrating insights into the activities of other actors involved in the
publishing process; (iv) make data publication an integral, well-defined and standardized part
of daily procedures and routines; and (v) monitor how the published data are reused. Our
approach is related to principle (iii) since we provide a decision support framework based on
activities of data consumers. We also contribute to principle (v) since our approach is useful
for monitoring how datasets are being reused in OSS applications. Additionally, Jetzek et
al. [16] propose a framework to explain how value is generated from open data. This framework is
useful for governments to understand the value of their open data. Their framework is based on
assessing the impact of open data along two dimensions: (i) how openness generates value, and
(ii) how society as a whole can get value from openness. The authors identify four different
archetypical generative mechanisms (cause-effect relationships between open data and value) in
their framework: transparency (open data helps to improve visibility to ensure socially
responsible resource allocation), participation (open data as a mechanism for engaging
stakeholders who help in solving social problems), efficiency (open data to improve how
resources are used) and innovation (open data as a cornerstone for generating new ideas,
processes, services and products). The authors claim that their framework can help governments
in the development of their strategy for opening data by considering factors that can enable the
generation of value from open data through the mechanism of innovation.
    Furthermore, Zuiderwijk & Janssen [35] state that different types of users of open data are
often interested in different types of data; therefore, publication of data can be improved by
taking into account the preferences of certain open data users for certain types of data.
    Therefore, there are several methods that support opening data, but to the best of our
knowledge no approaches focus on supporting Smart Cities in selecting and prioritizing which
datasets should be opened according to their preferences and the context of the city they work
for. To fill this gap, we presented our approach based on obtaining useful indicators from
Socrata and GitHub and using them with AHP.

5    CONCLUSIONS
Smart Cities usually have a limited budget and insufficient time to release and maintain all
available open data. In this paper, we have presented an approach whose goal is to provide an
AHP tool that allows weighting different indicators of reuse, calculated using Socrata and
GitHub as sources of information, in order to combine them taking into account objective
criteria. This approach is characterized by:
    (1) A classification of 14 categories for Smart City open datasets based on the G8 Open Data
        Charter and the Smart City domain.
    (2) A definition of 4 indicators based on the reuse of datasets in OSS projects.
    (3) Almost 9000 open datasets found for many of the most important US cities.
    (4) A catalogue of these US city datasets classified according to the proposed categories.
    (5) Around 32000 distinct references from 2500 different GitHub projects referencing two
        thirds of the categorized datasets found, based on a search performed over all OSS
        projects in GitHub.
    (6) An estimation of the defined indicators of reuse of every Smart City dataset category.
    (7) An AHP-based Decision Support System to recommend Smart City dataset categories to
        prioritize, taking into account the estimated indicators and the importance of each
        indicator for the cities.

   This approach is completely functional and reproducible. We provide a public repository
containing the data obtained from Socrata and GitHub, the scripts to collect and analyze the
information and the AHP tool, so that users can use or modify these processes. Thus, Smart
Cities or any other type of public institution can reuse and adapt them to their concrete
requirements. In this sense, further alternative applications of our approach that can be
considered as a continuation of this research may include:

    (1) Searching and categorizing open datasets of different cities, regions, countries,
        companies or any other kind of institutions in order to get more data.
    (2) Developing semantic-based software tools for automatic classification of datasets.
    (3) Analyzing the reuse of open datasets in proprietary software projects, for instance, by
        developing a web repository of apps where developers could register their applications
        that use open data, indicating which particular datasets are reused.
    (4) Analyzing the impact of open datasets in mass media, social media, blogs, etc. by
        searching for references to the datasets in these sites.
    (5) Conducting a set of controlled experiments to demonstrate the effectiveness of our
        approach in different scenarios.

   In summary, a successful publication of open datasets should be based on the proper
combination of the objectives of the open data portal and the analysis of the impact of already
available open datasets. This approach provides a useful method for Smart City decision makers
to carry out this task in an objective and analytic way.

6    ACKNOWLEDGEMENTS
We would like to thank GitHub, which allowed us to use its API without limitations, and Socrata,
which provides a way to collect precisely all the datasets published using its tools. This work
has been developed with the support of (i) TIN2015-69957-R and TIN2016-78103-C2-2-R
(MINECO/ERDF, EU) project, (ii) POCTEP 4IE project (0045-4IE-4-P), (iii) Consejería de Economía
e Infraestructuras/Junta de Extremadura (Spain) - European Regional Development Fund (ERDF) -
GR15098 project and IB16055 project, and (iv) Consejería de Educación y Empleo/Junta de
Extremadura (Spain) - Becas de Movilidad al Personal Docente e Investigador Curso 2016/2017.

REFERENCES                                                                                     the Information Systems Perspective (EGOVIS 2014) 8650, 2014 (2014), 275–291.
 [1] Judie Attard, Fabrizio Orlandi, Simon Scerri, and Sören Auer. 2015. A sys-                https://doi.org/10.1007/978-3-319-10178-1_22
     tematic review of open government data initiatives. Government Information           [24] T.L. Saaty. 1980. The Analytic Hierarchy Process. McGraw-Hill, New York.
     Quarterly 32, 4 (2015), 399–418. https://doi.org/10.1016/j.giq.2015.07.006           [25] Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, Daniela Damian, and Jordan
 [2] Marco Bartolozzi, Pierfrancesco Bellini, Paolo Nesi, Gianni Pantaleo, and Luca            Ell. 2014. Understanding "watchers" on GitHub. In MSR 2014: Proceedings of
     Santi. 2015. A Smart Decision Support System for Smart City. In 2015 IEEE                 the 11th Working Conference on Mining Software Repositories. ACM Press, New
     International Conference on Smart City/SocialCom/SustainCom (SmartCity).                  York, New York, USA, 336–339. https://doi.org/10.1145/2597073.2597114
     IEEE, 117–122. https://doi.org/10.1109/SmartCity.2015.57                             [26] Socrata. 2018. Socrata: Data-driven innovation of government programs.
 [3] Tegawende F. Bissyande, Ferdian Thung, David Lo, Lingxiao Jiang, and                      (2018). https://www.socrata.com/
     Laurent Reveillere. 2013. Popularity, interoperability, and impact of pro-           [27] Katherine J. Stewart, Anthony P. Ammeter, and Likoebe M. Maruping. 2006.
     gramming languages in 100,000 open source projects. In Proceedings - In-                  Impacts of license choice and organizational sponsorship on user interest and
     ternational Computer Software and Applications Conference. IEEE, 303–312.                 development activity in open source software projects. Information Systems
     https://doi.org/10.1109/COMPSAC.2013.55                                                   Research 17, 2 (jun 2006), 126–144. https://doi.org/10.1287/isre.1060.0082
 [4] Roberto Boselli, Mirko Cesarini, Fabio Mercorio, and Mario Mezzanzan-                [28] Chandrasekar Subramaniam, Ravi Sen, and Matthew L. Nelson. 2009. Determi-
     ica. 2015. Applying the AHP to Smart Mobility Services: A Case Study.                     nants of open source software project success: A longitudinal study. Decision
     In Proceedings of 4th International Conference on Data Management Tech-                   Support Systems 46, 2 (jan 2009), 576–585. https://doi.org/10.1016/j.dss.2008.
     nologies and Applications - Volume 1: KomIS. SCITEPRESS, 354–361. https:                  10.005 arXiv:arXiv:cond-mat/0402594v3
     //doi.org/10.5220/0005580003540361                                                   [29] Abobakr Sultan, Khalid A. AlArfaj, and Ghassan A. AlKutbi. 2012. Analytic
 [5] Hitachi Vantara Community. 2018. Data Integration - Kettle. (2018). http:                 hierarchy process for the success of e-government. Business Strategy Series 13,
     //community.pentaho.com/projects/data-integration/                                        6 (nov 2012), 295–306. https://doi.org/10.1108/17515631211286146
 [6] Peter Conradie and Sunil Choenni. 2014. On the barriers for local government         [30] Jeffrey Thorsby, Genie N.L. Stowers, Kristen Wolslegel, and Ellie Tumbuan.
     releasing open data. Government Information Quarterly 31, SUPPL.1 (2014),                 2016. Understanding the content and features of open data portals in American
     S10–S17. https://doi.org/10.1016/j.giq.2014.01.003                                        cities. Government Information Quarterly 34, 1 (2016), 53–61. https://doi.org/
 [7] Carlo Daffara. 2012. Estimating the Economic Contribution of Open Source                  10.1016/j.giq.2016.07.001
     Software to the European Economy. In The First Openforum Academy Confer-             [31] Omkarprasad S. Vaidya and Sushil Kumar. 2006. Analytic hierarchy process:
     ence Proceedings. OpenForum Europe LTD, 11–14.                                            An overview of applications. European Journal of Operational Research 169, 1
 [8] Rishab Aiyer Ghosh. 2006. Economic impact of open source software on inno-                (2006), 1–29. https://doi.org/10.1016/j.ejor.2004.04.028
     vation and the competitiveness of the Information and Communication Tech-            [32] Nils Walravens, Jonas Breuer, and Pieter Ballon. 2014. Open Data as a Catalyst
     nologies (ICT) sector in the EU. Technical Report. Maastricht: UNU-MERIT.                 For The Smart City as a Local Innovation Platform. Communications & Strate-
     http://stuermer.ch/blog/documents/FLOSSImpactOnEU.pdf                                     gies 96, 4th quarter 2014 (2014), 15–33. https://ssrn.com/abstract=2636315
 [9] Github. 2018. Github: The world’s leading software development platform.             [33] Liguo Yu, Alok Mishra, and Deepti Mishra. 2014. An Empirical Study of
     (2018). https://www.github.com/                                                           the Dynamics of GitHub Repository and Its Impact on Distributed Software
[10] Group of Eight. 2013.           G8 Open Data Charter. (2013).               https:        Development. In Proceedings of the Confederated International Workshops
     //www.gov.uk/government/uploads/system/uploads/attachment_data/                           on On the Move to Meaningful Internet Systems: OTM 2014 Workshops - Vol-
     file/207772/Open_Data_Charter.pdf                                                         ume 8842. Springer-Verlag New York, Inc., 457–466. https://doi.org/10.1007/
[11] Jeffrey Hammond, Paul Santinelli, Jay Jay Billings, and Bill Ledingham. 2016.             978-3-662-45550-0_46
     The Tenth Annual Future of Open Source Survey. Technical Report. Black               [34] Anneke Zuiderwijk and Marijn Janssen. 2013. A Coordination Theory Per-
     Duck Software and North Bridge. https://www.blackducksoftware.com/                        spective to Improve the Use of Open Data in Policy-Making. In Proceedings
     2016-future-of-open-source                                                                of the 12th IFIP WG 8.5 International Conference on Electronic Government -
[12] Anders Hjalmarsson, Niklas Johansson, and Daniel Rudmark. 2015. Mind the                  Volume 8074. Springer-Verlag New York, Inc., 38–49. https://doi.org/10.1007/
     gap: Exploring stakeholders’ value with open data assessment. In Proceedings              978-3-642-40358-3_4
     of the Annual Hawaii International Conference on System Sciences. IEEE, 1314–        [35] Anneke Zuiderwijk and Marijn Janssen. 2014. Barriers and Development
     1323. https://doi.org/10.1109/HICSS.2015.160                                              Directions for the Publication and Usage of Open Data: A Socio-Technical
[13] Mohammad Alamgir Hossain, Yogesh K Dwivedi, and Nripendra P. Rana. 2016.                  View. In Open Government. Vol. 4. Springer New York, New York, NY, 115–135.
     State of the Art in Open Data Research: Insights from Existing Literature                 https://doi.org/10.1007/978-1-4614-9563-5_8 arXiv:arXiv:1011.1669v3
     and a Research Agenda. Journal of Organizational Computing and Electronic            [36] Anneke Zuiderwijk, Marijn Janssen, Sunil Choenni, and Ronald Meijer. 2014.
     Commerce 26, 1-2 (apr 2016), 14–40. https://doi.org/10.1080/10919392.2015.                Design principles for improving the process of publishing open data. Trans-
     1124007                                                                                   forming Government: People, Process and Policy 8, 2 (may 2014), 185–204.
[14] Marijn Janssen, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. Benefits,               https://doi.org/10.1108/TG-07-2013-0024
     Adoption Barriers and Myths of Open Data and Open Government. Informa-               [37] Anneke Zuiderwijk, Iryna Susha, Yannis Charalabidis, Peter Parycek, and
     tion Systems Management 29, 4 (sep 2012), 258–268. https://doi.org/10.1080/               Marijn Janssen. 2015. Open data disclosure and use : critical factors from a
     10580530.2012.716740 arXiv:arXiv:1011.1669v3                                              case study. In In: CeDEM 2015: Proceedings of the International Conference for
[15] Oskar Jarczyk, Blazej Gruszka, Szymon Jaroszewicz, and Leszek Bukowski.                   E-Democracy and Open Government 2015. Edition Donau-Universität Krems,
     2014. GitHub Projects. Quality Analysis of Open-Source Software. In SocInfo               197–208.
     2014: The 6th International Conference on Social Informatics. Springer, Cham,
     80–94. https://doi.org/10.1007/978-3-319-13734-6_6
[16] Thorhildur Jetzek, Michel Avital, and Niels Bjorn-Andersen. 2014. Data-
     driven innovation through open government data. Journal of Theoretical
     and Applied Electronic Commerce Research 9, 2 (aug 2014), 100–120. https:
     //doi.org/10.4067/S0718-18762014000200008
[17] Maxat Kassen. 2013. A promising phenomenon of open data: A case study of
     the Chicago open data project. Government Information Quarterly 30, 4 (2013),
     508–513. https://doi.org/10.1016/j.giq.2013.05.012
[18] Sylvain Kubler, Jérémy Robert, Yves Le Traon, Jürgen Umbrich, and Sebastian
     Neumaier. 2016. Open Data Portal Quality Comparison using AHP. In Pro-
     ceedings of the 17th International Digital Government Research Conference on
     Digital Government Research - dg.o ’16. ACM Press, New York, New York, USA,
     397–407. https://doi.org/10.1145/2912160.2912167
[19] Jennifer Marlow, Laura Dabbish, and Jim Herbsleb. 2013. Impression Formation
     in Online Peer Production : Activity Traces and Personal Profiles in GitHub.
     In 16th ACM Conference on Computer Supported Cooperative Work. ACM Press,
     New York, New York, USA, 117–128. https://doi.org/10.1145/2441776.2441792
[20] K. Muthukumaran, Abhinav Choudhary, and N.L. Bhanu Murthy. 2015. Mining
     GitHub for Novel Change Metrics to Predict Buggy Files in Software Systems.
     In 2015 International Conference on Computational Intelligence and Networks.
     IEEE, 15–20. https://doi.org/10.1109/CINE.2015.13
[21] Paolo Neirotti, Alberto De Marco, Anna Corinna Cagliano, Giulio Mangano,
     and Francesco Scorrano. 2014. Current trends in smart city initiatives: Some
     stylised facts. Cities 38 (2014), 25–36. https://doi.org/10.1016/j.cities.2013.12.
     010
[22] National League of Cities. 2018. National League of Cities. (2018). https:
     //www.nlc.org/
[23] Monica Palmirani, Michele Martoni, and Dino Girardi. 2014. Beyond Trans-
     parency Introduction : OGA Beyond Transparency. Electronic Government and