=Paper=
{{Paper
|id=Vol-2062/paper3
|storemode=property
|title=Supporting Open Dataset Publication Decisions Based on Open Source Software Reuse
|pdfUrl=https://ceur-ws.org/Vol-2062/paper03.pdf
|volume=Vol-2062
|authors=Alvaro E. Prieto,Jose-Norberto Mazón,Adolfo Lozano-Tello,Luis-Daniel Ibáñez
|dblpUrl=https://dblp.org/rec/conf/dolap/PrietoMTI18
}}
==Supporting Open Dataset Publication Decisions Based on Open Source Software Reuse==
Supporting open dataset publication decisions based on Open Source Software reuse
Alvaro E. Prieto, Universidad de Extremadura, Cáceres, Spain, aeprieto@unex.es
Jose-Norberto Mazón, Universidad de Alicante, San Vicente del Raspeig, Alicante, Spain, jnmazon@dlsi.ua.es
Adolfo Lozano-Tello, Universidad de Extremadura, Cáceres, Spain, alozano@unex.es
Luis-Daniel Ibáñez, University of Southampton, Southampton, United Kingdom, l.d.ibanez@southampton.ac.uk
ABSTRACT
Publishing and maintaining open data is a costly task for public institutions, and it becomes even more challenging in the context of Smart Cities, where large amounts of varied data are generated from different domains. To optimize resources, they should prioritize the publication and maintenance of the datasets most likely to generate social and economic impact. However, there is currently a lack of decision-support tools to help public sector data publishers evaluate datasets in the light of their particular reuse goals. In this paper, we propose to suggest to data publishers the dataset categories with most potential impact, based on the impact of already published datasets of the same category. To measure impact, we propose a set of indicators based on the amount and quality of Open Source Software projects that use datasets. To aggregate indicators according to specific reuse goals, we provide an Analytic-Hierarchy-Process based tool.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
One of the most important challenges faced by Smart Cities is creating an ecosystem of public and private actors that reuse open data in order to produce IT services and products that both (i) would improve citizens’ quality of life and (ii) would contribute to economic growth [32]. However, few open data portals in cities currently track data usage and consider the impact of data when deciding which datasets to maintain or which complementary datasets to publish. Cities are not even aware of what kinds of apps are developed, using what data, and how many there are. Answering these questions is a significant research issue [30] that would allow prioritizing which categories of data must be published and maintained with respect to the applications that use them (i.e., the impact that a category of open data generates).
To reverse this situation, publishing datasets as open data requires a decision support system to select those categories of datasets that offer higher potential to generate value [12]. Such a system must consider indicators about the impact of the already published open datasets, as well as the strategy of the Smart City. E.g., a small town could provide an open data portal with many high-quality datasets but the portal is rather unknown, and the technological fabric of the city is composed of small IT companies. Therefore, the goal of the city could be to extend the use of the open data portal by prioritizing those datasets that belong to categories that are likely to generate a large number of projects, though simpler ones that involve fewer people. On the other hand, a big city with consolidated open data portals may prefer opening datasets that could be used in complex and mature software applications that involve big teams, since it is more relevant to their specific technological industry context.
Unfortunately, to the best of our knowledge, Smart Cities lack such a decision support system, mainly because calculating the indicators that such a system would use is not a trivial task. According to Janssen et al. [14], “there is no way to predict and calculate the return of investment (ROI) in advance [...]. The main challenge is that open data has no value in itself; it only becomes valuable when used”. Therefore, the main problem is that data owners have limited understanding of how open data is reused, thus lacking knowledge about the impact generated by reusing the published open data.
More reasonable indicators of the use of open datasets could help to identify which categories of datasets have more possibilities of being reused and, in this way, generate some type of economic impact for people or enterprises. In this sense, good indicators could come from the reuse of datasets within the open source community. The Tenth Annual Future of Open Source Survey [11] reflects the increasing adoption of open source and highlights the abundance of organizations participating in the open source community. Concretely, this survey estimates that 65% of companies currently participate in open source projects. Open Source Software (hereafter OSS) is considered to encourage the creation of SMEs and jobs, by providing a skills development environment valued by employers and retaining a greater share of generated value locally [8]. Focusing on Europe, a study estimated that the contribution of OSS to its economy was 450 billion euro per year [7].
Based on these figures, an estimation of the use of the different categories of datasets by the OSS community could be a good indicator of their potential impact. Therefore, when Smart Cities make decisions on which data to publish, they could prioritize publication of data which allows a community of developers to generate impact and effectively release benefits of open data through OSS projects.
In this paper, we present an approach based on the estimation of indicators of the use of open datasets in OSS projects. The goal of this approach is to provide Smart Cities with a Decision Support System which provides an ordered list of categories of datasets most suitable to be published or maintained in their open data portal. To do so, we have carried out a set of actions aimed at estimating useful impact indicators related to the datasets of the same category already published by open data portals of other cities. Concretely, to calculate our proposed indicators we needed two kinds of data sources: (i) already published Smart City
datasets (and their metadata) and (ii) OSS projects (together with information about them) which referenced the gathered datasets; i.e., we needed to know which open datasets were being used in which OSS projects. To collect already published open datasets, we chose Socrata [26] because it is one of the most used open data repositories, notably by some of the most important US cities. We also measured the existence of potential reuses within a community in order to measure open data impact. To do this, we used GitHub [9], because it is the largest web-based distributed revision control and source code repository in the world, and the source of several empirical studies such as Yu et al. [33].
Using the indicators obtained from these sources, we provide an Analytic Hierarchy Process (hereafter AHP)-based [24] tool (footnote 1) that allows decision makers to weigh these indicators, taking into account the reuse objectives of the city, to offer an ordered list of categories of datasets recommended for publication.
This paper is structured as follows: Section 2 describes a new approach to select the most relevant categories of data to be published in a smart city open data portal. Section 3 presents toy samples of two different stereotypical smart cities using our approach and, to finish, Section 4 summarizes other work related to the publishing of open data in Smart Cities.

2 USING REUSE INDICATORS BASED ON DATA FROM OSS PROJECTS IN GITHUB FOR SELECTING DATASETS TO OPEN
This section describes the steps that have been carried out to get an AHP process that allows classifying categories of datasets based on the preferences of the decision-maker. These preferences are applied to a set of useful indicators obtained from data about their reuse in OSS projects of GitHub repositories. Concretely, these steps (footnote 2) are detailed in the following subsections and are summarized below:
(1) From GitHub repositories, studying the characteristics of OSS projects that use open datasets. This information was analyzed to establish a set of reuse indicators.
(2) Gathering datasets from 32 cities of the United States (such as San Francisco, Chicago or New York) which use Socrata as an open data repository. With respect to this point, it should also be noted that, although these cities are from the same country, the United States, they have different cultural, social and economic characteristics that lead us to consider the results obtained from their data sufficiently transferable to other Smart Cities located in different countries.
(3) Classifying the datasets according to a set of categories specifically designed for Smart Cities.
(4) Searching for references to the datasets obtained from Socrata in GitHub to calculate the indicators.
(5) With the reuse indicators established in step 1 as criteria, and the values from step 4, we have created a Google Spreadsheet [w3] based on AHP that allows decision makers to prioritize the most relevant categories of datasets that must be published in a smart city open data portal.

Footnote 1: https://goo.gl/HcUc1e
Footnote 2: A repository containing all the scripts and detailed instructions needed to carry out a functional application of our approach is available at GitHub https://goo.gl/TDp1xi

2.1 A proposal of indicators of reuse based on GitHub
Smart Cities should follow a strategy for opening data as described in [17]. This strategy should prioritize publication of data which allows a community of developers to generate impact and effectively release benefits of open data through OSS projects [37]. A Smart City could in fact prioritize publication of open data with more reuse potential depending on the category to which the data belong. However, due to the “open-data by default” idiosyncrasy [23], data is usually published without establishing specific goals and without imposing utilization or authentication restrictions on the infomediaries and end users. As a result, collecting the usage information and measuring the impact generated by open datasets may become very complex.
To overcome this situation, our approach is based on considering that the more an open dataset is used by OSS projects, the more impact is generated. Therefore, we borrowed some well-known indicators that measure the success of OSS projects and used them as a starting point to develop our indicators to measure such success when open data is reused. These indicators then allow Smart Cities to measure which categories of open data have more reuse potential and decide which data must be released according to the requirements of each city. The following indicators from existing research literature on OSS are considered [27] [28]. First of all, we included (i) the number of people who agree to receive information about the project because they find it interesting (subscribers), and (ii) the number of people who actually work on the OSS project (developers). On the one hand, subscribers to OSS choose to obtain information on the project and thus reveal a deeper interest in the OSS project. The subscriber indicator not only measures interest within the project but also the reputation of the project within the community and the dissemination of the project through the community. On the other hand, the number of developers working on a project is critical to its success, since survival of an OSS project depends on continued contribution from developers [28]. Another measure of the success of OSS projects [27] is (iii) the age of an active project, which is positively related to OSS progress toward completion, as well as to the experience of the community of developers.
Based on these three indicators described in the literature about success of OSS projects, we developed a set of three indicators that measure the success of open source projects that reuse open datasets (they are summarized in Table 1). The aim is to compare projects that use different categories of datasets and how successful they are. First of all, we define the reputation among a community of developers of OSS projects that reuse open data from a category. Some projects that reuse open data from some specific categories can be perceived by developers as being highly appealing projects. Smart Cities are interested in opening data that will be reused in these kinds of projects in view of creating a community around open data, thus allowing an open data portal to attract the attention of potential developers. Therefore, the reputation indicator measures how well-known projects reusing data from some specific category are (within the community of developers). Furthermore, the size of the community involved in projects that use data from a category is defined in terms of the size of the community of developers that use open data from a given category. A city needs to adapt the size of the community to the budget and available infrastructure. Finally, maturity of projects that use an open data category is
proposed. Maturity means that the community has been working on the project for some time without the project being abandoned. A Smart City may want to select the datasets that help in promoting fewer projects stretching over longer periods of time, rather than promoting a larger number of short-term projects.
An additional indicator has been developed in order to assess the impact of a dataset category, i.e. the likelihood of datasets from each category being used. To do so, we defined the efficiency of an open data category as the probability that datasets of one category are referenced by an OSS project. This indicator determines how relevant a category of datasets is. Smart Cities will use this indicator to know which categories of open data are most likely to be reused. Therefore, in a scenario where the Smart City has the chance of opening a large number of datasets, the efficiency indicator will become secondary to the publishing efforts regarding a wide variety of datasets.
As aforementioned, these indicators come from well-known indicators from the OSS community, being thus completely generalizable to any OSS repository. It is worth noting that our proposal of indicators is not set in stone; consequently, more indicators could be created and checked to be used by Smart Cities according to their requirements.

{| class="wikitable"
|+ Table 1: Proposed indicators and their definitions
! Indicator !! Description
|-
| Reputation || Average number of subscribers of each repository that references datasets of the category
|-
| Community size || Average number of contributors of every repository that references datasets of the category
|-
| Maturity || Average maturity of every repository referencing datasets of the category. Maturity is computed using 2 lifetimes, project lifetime (PL) and last update lifetime (LUL). Thus, the resulting formula is: PL/LUL
|-
| Efficiency || Proportion of datasets of each category referenced in GitHub
|}

2.2 Search of smart city datasets on Socrata
Once the impact measuring indicators have been established and defined, information should be gathered. This gathering of information focuses on datasets specifically related to smart cities so as to obtain a more accurate assessment of the collected data.
Socrata is a software company focused “exclusively on democratizing access to public sector data around the world”. It provides an Open Data Platform for allowing local, regional or national governments to release data. Socrata is a partner of the USA National League of Cities [22] for the development of open data strategies. Nowadays, the Socrata Open Data Platform is used by some of the most important US cities such as New York, Chicago, San Francisco or Los Angeles. In this respect, Socrata is very useful as a proof-of-concept of our approach, since it is possible to collect open dataset identifiers and their metadata precisely. In this sense, every Socrata dataset has its own endpoint and each is designated by a unique dataset identifier. Every Socrata open data portal provides a list of its published datasets containing the identifier of every dataset and useful metadata about it, such as the theme or the keyword of the dataset. These metadata of open datasets are important because they are needed to facilitate the categorization step that comes next. To collect the data from Socrata, we followed these steps:
(1) Retrieve data from Socrata on institutions which use its Open Data Platform. 106 institutions were recovered.
(2) Gather and filter the identifier and the minimal metadata needed to categorize them (theme or keyword) from every dataset published by US cities using Socrata. 8960 datasets from 32 different US cities met these conditions.

2.3 Categorization of datasets
In this step, we had to choose the taxonomy of dataset categories to be analyzed. There is no common agreement on the best way of classifying Smart City open datasets. However, a set of 14 high-value data categories is suggested by the G8 Open Data Charter [10]. These categories, together with example datasets for each one, are shown in Table 2.

{| class="wikitable"
|+ Table 2: G8 Open Data Categories
! Id !! Data Category !! Example Datasets
|-
| 1 || Companies || Company/business register
|-
| 2 || Crime and Justice || Crime statistics, safety
|-
| 3 || Earth observation || Meteorological/weather, agriculture, forestry, fishing, and hunting
|-
| 4 || Education || List of schools; performance of schools, digital skills
|-
| 5 || Energy and Environment || Pollution levels, energy consumption
|-
| 6 || Finance and contracts || Transaction spend, contracts let, call for tender, future tenders, local budget, national budget (planned and spent)
|-
| 7 || Geospatial || Topography, postcodes, national maps, local maps
|-
| 8 || Global Development || Aid, food security, extractives, land
|-
| 9 || Government Accountability and Democracy || Government contact points, election results, legislation and statutes, salaries (pay scales), hospitality/gifts
|-
| 10 || Health || Prescription data, performance data
|-
| 11 || Science and Research || Genome data, research activity, experiment results
|-
| 12 || Statistics || National Statistics, Census, infrastructure, wealth, skills
|-
| 13 || Social mobility and welfare || Housing, health insurance and unemployment benefits
|-
| 14 || Transport and Infrastructure || Public transport timetables, access points broadband penetration
|}

These categories seem to be a good way to classify Smart City datasets; however, some of them, such as Global Development and Science and Research, might not be used in the Smart City context. Thus, specific domains which can generate data within a Smart City must be taken into account. In this sense,
a survey [21] about Smart City initiatives proposes a classification divided into domains and subdomains, shown in Table 3.

{| class="wikitable"
|+ Table 3: G8 Open Data Categories
! Id !! Domain !! Subdomain
|-
| A || Natural resources and energy || 1.-Smart grids; 2.-Public lighting; 3.-Green/renewable energies; 4.-Waste management; 5.-Water management; 6.-Food and agriculture
|-
| B || Transport and mobility || 7.-City logistics; 8.-Info-mobility; 9.-People mobility
|-
| C || Buildings || 10.-Facility management; 11.-Building services; 12.-Housing quality
|-
| D || Living || 13.-Entertainment; 14.-Hospitality; 15.-Pollution control; 16.-Public safety; 17.-Healthcare; 18.-Welfare and social inclusion; 19.-Culture; 20.-Public spaces management
|-
| E || Government || 21.-E-government; 22.-E-democracy; 23.-Procurement; 24.-Transparency
|-
| F || Economy and people || 25.-Innovation and entrepreneurship; 26.-Cultural heritage management; 27.-Digital Education; 28.-Human capital management
|}

Establishing an exhaustive classification of open data categories for Smart Cities is beyond the scope of this paper. However, this work proposes an initial classification of open data categories for Smart Cities that aims to be as close as possible to the G8 Open Data Charter while incorporating modifications to encompass the aforementioned domains and subdomains proper to Smart Cities. This proposed classification is given in Table 4 together with example datasets for each category.

{| class="wikitable"
|+ Table 4: Proposal of Open Data categories for Smart Cities
! Id !! Data Category !! Example Datasets
|-
| 1 || Administration & Finance || Audits and Reports, City Finance and Budget, City Government, Fees, Liabilities and Assets, Purchasing, Revenue
|-
| 2 || Business || City Businesses, Community & Economic Development, Growing Economy, Regulated Industries
|-
| 3 || Demographics || Census, CitiStat, Forecasts, Neighborhoods, Statistics
|-
| 4 || Education || Schools, Youth
|-
| 5 || Ethics & Democracy || City Management and Ethics, Elections, Ethics, Expenditures, General Information, Governance, Government, Human Relations, Human Resources, Legislation, People, Permitting, Public Works, Taxes
|-
| 6 || Geospatial || Geographic Locations and Boundaries, Mapping, Location, GIS
|-
| 7 || Health || Public Health, Human Services, Social Services
|-
| 8 || Recreation & Culture || Arts and Culture, Events, Greenways, Historic Preservation, Library, Parks, Recreation, Tourism
|-
| 9 || Safety || Crime, Emergency, Fire, Police, Public Safety
|-
| 10 || Services || 311 Call Center, City Services, Community, Customer Service, Facilities, Government Buildings and Structures, Inspectional Services, Public Property, Public Services, Service Requests
|-
| 11 || Sustainability || Energy and Environment, Natural Resources, Sustainability, Waste Management, Food, Agriculture
|-
| 12 || Transport & Infrastructure || Airports, City Infrastructure, Transportation, Parking, Streetcar, Traffic
|-
| 13 || Urban Planning || Area Plans, Buildings, City Facilities, City Parks and Tree Data, Construction, Development, Housing, Land Use, Urban Planning
|-
| 14 || Welfare || Insurance, Life Enrichment, Quality of Life, Pension, Retirement, Sanitation, Social Services
|}

Once the categories were established, we had to classify the collected datasets according to them. Due to its characteristics, this step requires the participation of experts to execute it adequately. The research groups that have developed this approach include researchers working in related fields such as open data and knowledge representation. These researchers were responsible for classifying the datasets following the steps described below:
(1) Extracting different themes from US city datasets. In our case, 215 different themes were extracted.
(2) Mapping every theme to one of the available categories. Themes without a clear fit had to be classified as ‘Others’ in order to be discarded later. When we performed this step,
211 themes could be mapped to the established categories and 4 were classified as ‘Others’.
(3) Automatically classifying datasets with a theme according to the mapping in step 2. In our case, 8299 datasets were classified according to the established categories, 11 were categorized as ‘Others’ and 650 were not categorized due to their lack of theme.
(4) Optionally, trying to categorize datasets that have no theme manually, using other metadata such as keywords. This step can be carried out when the number of datasets without a theme is considered high enough to distort the value of the indicators. In our case, although the datasets without a theme represented less than 10
(5) As a result of this process, 8949 datasets were adequately categorized and 11 were discarded due to their unclear fit.

2.4 Collecting data from GitHub to calculate indicators
In order to calculate the above-described indicators on the success of OSS projects that reuse open data, we decided to collect data from GitHub. GitHub, as mentioned previously, is a platform for collaborative development of software based on Git repositories. It is used by individuals, communities and businesses alike to develop software projects. GitHub is free to use for public and open source projects, and it is widely used in studies on Software Engineering. Therefore, it offers useful data about open source software projects, including information on whether they are using open data.
GitHub has been used for collecting data and calculating indicators related to OSS success in several works such as [3] [19], where GitHub allows researchers to collect several measures regarding open source projects, for example, forks, stars, etc. GitHub has an API that is used to collect all required data from an open source software project. More specifically, the data can be acquired from repositories and from users. A repository is a kind of software project folder that contains all the project files. Valuable data from a repository that can be collected by using the API, apart from the code itself, are as follows: repository_id, user_id, stargazers_count, watchers_count, language, forks_count, subscribers_count, network_count, created_at, updated_at, pushed_at, total_contributors, total_contributions. GitHub user data also provide interesting data to be considered, such as followers_user, following_user, public_repos_user, location_user, updated_at_user, created_at_user. The indicators used in our approach are based on these data. We established a process for identifying which OSS projects were using open datasets from Socrata US Cities. Our process consists of the following steps (it was implemented by using the GitHub API within a Pentaho Data Integration [5] process):
(1) Searching for every eight-character code of the existing Socrata datasets belonging to USA cities (obtained as described in Section 2.2) in the code of OSS repositories hosted on GitHub in order to know which projects are reusing open data. When we performed this step, 350644 references were found from 2517 repositories to 5874 of the 8949 categorized datasets.
(2) Gathering required data from GitHub on the repositories that reference open datasets to make an estimation of the indicators. In our case we found that 2501 of the 2517 repositories had all the needed data.
After this process, we made an estimation of the indicators in order to be used with AHP. We defined a process consisting of the following steps:
(1) Discarding repositories that do not have all the required data to make an estimation of the indicators. When we performed this step, only 2501 repositories remained.
(2) Discarding all repeated references to a specific dataset from a specific repository. When we performed this step, 32551 unrepeated references from 2501 repositories remained.
(3) Making an estimation of the indicators. When we performed this step, we applied the formulas previously presented in Table 1.
(4) Normalizing the indicators in order to use the ideal mode of AHP. When we applied this step to our case, the indicator of each category was divided by the maximal value obtained by a category in the indicator. Thus, all the indicators of each category were normalized to a 0-1 range.

2.5 Use of AHP to weight indicators
The decision-making method on which our model is based is the Analytic Hierarchy Process, hereinafter referred to as AHP [25]. It is a powerful and flexible tool for decision-making in complex multi-criteria problem situations and is useful for comparing several alternatives when several objectives need to be borne in mind at the same time.
Following this method, the evaluator can directly assign a normalized weight to a criterion that will indicate the importance which that criterion has with regard to the final objective. Firstly, the AHP method compares the relative importance that each criterion has in relation to all the others; this assessment enables the relative weights of the criteria to be calculated, and finally the method normalizes the weights in order to obtain the measures for the existing alternatives; for this reason, AHP constitutes one of the best options to assist multi-criteria decision making. This method allows people to gather knowledge about a particular problem, to quantify subjective opinions and to force the comparison of alternatives in relation to established criteria. The method consists of the following steps:
(1) Define the problem and the main objective in making the decision.
(2) If required, build a hierarchy tree in this way: the root node is the objective of the problem, the intermediate levels are the criteria, and the lowest level contains the alternatives.
(3) At each level, build a pairwise comparison matrix with the siblings (children of the same node). The matrix contains the weights of pairwise comparisons between sibling nodes. This provides us with a pairwise comparison matrix (see a simple example in Table 5) for each parent node.
(4) For each comparison matrix, the principal eigenvector must be calculated. The eigenvalues λ are obtained from the characteristic equation |A − λI| = 0, where A is the comparison matrix and I is the identity matrix; the eigenvector associated with the largest eigenvalue gives the relative weights. This calculation must be performed for each level of the tree.
(5) Rate each alternative (leaf nodes) with a previously calculated fixed value for every criterion. The scales for rating alternatives should be established and described in a precise way.
(6) Determine the value of each alternative using a weighted addition formula, with the weights from the previous steps. These results ascend up the tree to calculate the final value of the objective (root). This final value is used to make a decision about the alternative to choose.
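As an illustration of step (4) of the AHP method, the weights of the criteria can be approximated from a pairwise comparison matrix by power iteration. The sketch below is only a minimal example under stated assumptions and is not the implementation used in the spreadsheet: the function names `ahp_weights` and `principal_eigenvalue`, the ordering of the indicators, and the pairwise judgments in `comparisons` are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def ahp_weights(matrix: Matrix, iterations: int = 100) -> List[float]:
    """Approximate the principal eigenvector of a pairwise comparison
    matrix by power iteration, normalized so that the weights sum to 1."""
    n = len(matrix)
    weights = [1.0 / n] * n  # start from equal weights
    for _ in range(iterations):
        # Multiply the matrix by the current weight vector.
        product = [sum(matrix[i][j] * weights[j] for j in range(n))
                   for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    return weights


def principal_eigenvalue(matrix: Matrix, weights: List[float]) -> float:
    """Estimate the largest eigenvalue as the mean of (A w)_i / w_i."""
    n = len(matrix)
    ratios = [sum(matrix[i][j] * weights[j] for j in range(n)) / weights[i]
              for i in range(n)]
    return sum(ratios) / n


# Hypothetical judgments (assumption, not taken from the paper):
# rows/columns are efficiency, reputation, community size, maturity.
# Efficiency is judged 3 times as important as each other indicator.
comparisons: Matrix = [
    [1.0, 3.0, 3.0, 3.0],
    [1 / 3, 1.0, 1.0, 1.0],
    [1 / 3, 1.0, 1.0, 1.0],
    [1 / 3, 1.0, 1.0, 1.0],
]

weights = ahp_weights(comparisons)
lambda_max = principal_eigenvalue(comparisons, weights)
# For this perfectly consistent matrix, efficiency receives a weight of
# approximately 0.5 and each of the other indicators approximately 1/6.
```

Power iteration is used here only to keep the sketch free of external dependencies; any eigenvector routine from a numerical library would serve the same purpose.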
Using this method, as final stage, we have created a Google
Spreadsheet based on AHP that uses the reuse indicators as cri-
teria of the process. Concretely, this spreadsheet is composed of
three sheets:
(1) ‘Indicators’. This sheet provides the normalized indicators
that were calculated from GitHub in the previous step.
(2) ‘AHP Criterion Pair Comparison’. This sheet allows assess-
ing the relative importance between pairs of indicators
using AHP. Thereby, a decision maker could weigh the
importance of the indicators set out in the previous steps,
taking into account the characteristics and objectives of
the city. These weights can be assigned according to the in-
stitution’s strategic reuse objectives. Thus, different Smart
Cities may have different objectives, strategies and target
audiences when deciding which datasets should have pri-
ority of publication. Each city has its own idiosyncrasy Figure 1: Simulated weights of a medium-sized town.
defining what is most important or of particular interest,
and it is unlikely two cities share the same priorities with
regard to their respective reuse objectives. Cities can be
characterized by their size, the importance of the tourism
sector, or its residential, commercial or industrial sectors,
etc. And also, cities may have different priorities for pub-
lishing datasets depending on the type of reuse they want
to promote. The result of this step will be the eigenvectors
of each matrix, meaning the relative importance of the
established indicators.
(3) Finally, the ‘AHP Direct Results’ shows a suitability rank-
ing list of dataset categories to publish according to the
weights introduced in the second sheet and the indicators
calculated from GitHub shown in the first sheet. That is,
the value used to elaborate such ranking is the result of
multiplying the relative importance of each indicator, cal-
culated in the second sheet, by the values of the indicators
in the corresponding categories shown in the first sheet. Figure 2: Medium-sized town rankings
Thus, the use of this tool allows Smart Cities to prioritize datasets
in a reasonable way based on the data collected from well-known
cities, the indicators taken into account and the open data strategy
of the city.
3 SIMULATING THE BEHAVIOUR OF THE TOOL ON STEREOTYPICAL CITIES
In order to check our proposal under different motivations in the
weighting process, we have simulated the behavior of the tool
taking into account the different prospects of two stereotypical
cities. We asked three experts to agree on the importance
assigned to the indicators under the assumptions made for the
two cities.
On the one hand, a medium-sized town located in a rural region,
with small software companies in its zone rather than big ones,
that is starting to develop its own open data portal. On the
other hand, a big city with a well-known open data portal and a
lot of cutting-edge software companies in its area of influence.
In the first case, we have assumed that the town would be
interested mainly in getting reuses of its different datasets
through the development of simple applications by small local
enterprises. Hence, the town would assign high weights to
efficiency, whereas reputation, size of the community and
maturity would play a secondary role.
The weights applied with this philosophy are shown in Figure 1,
and the resulting ranking is shown in Figure 2. The first
position of ‘Geospatial’ does not change with respect to the
default ranking (same weights for all the indicators) shown in
Figure 3, but the rest of the ranking undergoes some variations.
Figure 3: Default ranking
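The comparison between a weighted ranking and the default ranking can be expressed as a list of position changes. The following is a minimal sketch; the function name and the two example rankings are hypothetical and do not reproduce the rankings in the figures.

```python
def position_changes(baseline, alternative):
    """Return, for each category, how many positions it moved between two
    rankings: positive means it moved up in the alternative ranking,
    negative means it moved down, zero means it kept its position."""
    new_position = {category: index
                    for index, category in enumerate(alternative)}
    return {category: index - new_position[category]
            for index, category in enumerate(baseline)}


# Hypothetical rankings, ordered from most to least suitable.
default_ranking = ["Category A", "Category B", "Category C", "Category D"]
weighted_ranking = ["Category A", "Category C", "Category D", "Category B"]

changes = position_changes(default_ranking, weighted_ranking)
for category, shift in changes.items():
    print(f"{category}: {shift:+d}")
```

In this invented example the first category keeps its place while the others are reordered, which is the kind of pattern the simulation looks for.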
In the second case, we have assumed that, because its portal is
well known, the city does not seek more reuses (that is,
efficiency) but rather mature projects with good reputation and
bigger communities behind them. The weights applied with this
philosophy are shown in Figure 4.
The ranking obtained with these weights is shown in Figure 5.
Here, ‘Geospatial’ drops to third position and ‘Welfare’ takes
the first one. As can be seen, the indicators obtained from
GitHub cause some categories to keep a fairly stable position in
the ranking regardless of the weights assigned with AHP; even
so, different combinations of weights may change this ranking.
Figure 4: Simulated weights of a big-sized city
Figure 5: Big city ranking
4 RELATED WORK
This section gives a description of (i) some relevant studies
about the use of GitHub to measure different indicators about
Open Source Software projects, (ii) applications of AHP in Smart
Cities, and (iii) the most relevant studies about how (local)
governments publish open data.
Firstly, GitHub is used by individuals, communities and
businesses alike to develop software projects. GitHub is free to
use for public and OSS projects, and it is widely used in
Software Engineering studies related to OSS success. Thus,
Bissyande et al. [3] use GitHub to study a possible relation
between programming languages and project success. Marlow et
al. [19] analyze metadata of GitHub projects to find how its
users decide whom and what to keep track of, or where to
contribute next. Sheoran et al. [25] investigate what kind of
contributors can be the “watchers” of GitHub. Jarczyk et al.
[15] study the relation between the popularity of a project in
GitHub and its quality. Muthukumaran et al. [20] use GitHub to
propose change metrics that can predict possible bugs. As far as
we know, this is the first time GitHub has been used to estimate
indicators related to reuse of open data in OSS projects.
Secondly, AHP is a multiple criteria decision making method that
has been used in many different applications related to decision
making [31]. Some works specifically use AHP in Smart Cities and
e-government. In this context, Bartolozzi et al. [2] present a
DSS which uses AHP for supporting the decision-making process
related to Smart City issues. Sultan et al. [29] suggest the use
of AHP to decide the most appropriate technology for the
development of e-government projects in Smart Cities. Boselli et
al. [4] use AHP to rank the factors for innovating a
smart-mobility service in the city of Milan. A very interesting
use of AHP to evaluate open data portal quality can be found in
Kubler et al. [18]. The authors propose considering different
dimensions (completeness, openness, addressability and
retrievability) to assess the quality of 146 open data portals.
Although there are several applications of AHP to the domains of
Smart Cities and e-government, they all aim at assessing Smart
City strategies and the quality of open data portals. Instead,
our approach proposes AHP to recommend the most appropriate
datasets to be published.
Finally, with respect to how (local) governments publish open
data, Conradie & Choenni [6] explain that data release by local
governments is still a novel task, thus knowledge is lacking as
to its benefits and barriers. Therefore, they conduct a
participatory action research approach to get a better
understanding of how internal processes of local governments
influence data release. The authors found that the following
indicators needed to be addressed by local governments to
overcome barriers to releasing public sector information: (i)
Data Storage, i.e., is data stored centrally, or is it
decentralized?; (ii) Use of data, i.e., the way data is used by
the department; (iii) Source of data, i.e., how is a set of data
obtained?; and (iv) Suitability of data for release, i.e., are
there rules and regulations that determine whether a dataset may
be released or not, such as privacy or copyright?
Notwithstanding, these indicators are related to current data
but do not address the actual use of the data and its benefits.
For example, Hossain et al. [13] show that benefits associated
with opening data are ill-understood. In their systematic review
of open government data initiatives, Attard et al. [1] explore
open data initiatives of a large number of governments, as well
as existing tools and approaches. They found that while efforts
have focused on developing tools for helping data publishers to
open data, there have been no initiatives related to strategies
for supporting decisions on which data to release. This means
that public entities may end up publishing data with no value,
rather than focusing on the relevance of the data they are
publishing. Therefore, success in opening data is not a matter
of the amount of data published, but of understanding how data
is reused. As highlighted by Zuiderwijk & Janssen [34], since
providers of open data are not concerned with the needs of open
data users, they do not know how their data are reused, and
business related issues (such as creation of added-value
services or products based on open data) are not widely used as
a decision criterion.
Furthermore, Zuiderwijk et al. [36] argue that the publication
of open data is often cumbersome, so standard procedures and
processes for opening data are required. They found a series of
barriers preventing easy and low-cost publication of open data,
leading them to propose a set of five design principles for
improving the open data publishing process of public
organizations: (i) start thinking about the opening of data at
the beginning of the process; (ii) develop guidelines,
especially about privacy and policy sensitivity of data; (iii)
provide decision support by integrating insights into the
activities of other actors involved in the publishing process;
(iv) make data publication an integral, well-defined and
standardized part of daily procedures and routines; and (v)
monitor how the published data are reused. Our approach is
related to principle (iii) since we provide a decision support
framework based on activities of data consumers. We also
contribute to principle (v) since our approach is useful for
monitoring how datasets are being reused in OSS applications.
Additionally, Jetzek et al. [16] propose a framework to explain
how value is generated from open data. This framework is useful
for governments to understand the value of their open data.
Their framework assesses the impact of open data along two
dimensions: (i) how openness generates value, and (ii) how
society as a whole can get value from openness. The authors
identify four different archetypical generative mechanisms
(cause-effect relationship between open data and value) in their
framework: transparency (open data helps to improve visibility
to ensure socially responsible resource allocation),
participation (open data as a mechanism for engaging
stakeholders who help in solving social problems), efficiency
(open data to improve how resources are used) and innovation
(open data as a cornerstone for generating new ideas, processes,
services and products). The authors claim that their framework
can help governments in the development of their strategy for
opening data by considering factors that can enable the
generation of value from open data through the mechanism of
innovation.
Furthermore, Zuiderwijk & Janssen [35] state that different
types of users of open data are often interested in different
types of data; therefore, publication of data can be improved by
taking into account the preferences of certain data users for
certain types of data.
Therefore, there are several methods that support opening data,
but to the best of our knowledge no approaches focus on
supporting Smart Cities in selecting and prioritizing which
datasets should be opened according to their preferences and
the context of the city they work for. To fill this gap, we
presented our approach based on obtaining useful indicators from
Socrata and GitHub and using them with AHP.
5 CONCLUSIONS
Smart Cities usually have a limited budget and insufficient time
to release and maintain all available open data. In this paper,
we have presented an approach whose goal is to provide an AHP
tool that allows weighting different indicators of reuse,
calculated using Socrata and GitHub as sources of information,
in order to combine them taking into account objective criteria.
This approach is characterized by:
(1) A classification of 14 categories for Smart City open
datasets based on the G8 Open Data Charter and the Smart City
domain.
(2) A definition of 4 indicators based on the reuse of datasets
in OSS projects.
(3) Almost 9000 located open datasets from many of the most
important US cities.
(4) A catalogue of these US city datasets classified according
to the proposed categories.
(5) Around 32000 distinct references from 2500 different GitHub
projects referencing two thirds of the categorized datasets
found, based on a search performed over all OSS projects in
GitHub.
(6) An estimation of the defined indicators of reuse of every
Smart City dataset category.
(7) An AHP-based Decision Support System to recommend Smart City
dataset categories to prioritize, taking into account the
estimated indicators and the importance of each indicator for
the cities.
This approach is completely functional and reproducible. We
provide a public repository containing the data obtained from
Socrata and GitHub, the scripts to collect and analyze the
information and the AHP tool, so that users can use or modify
these processes. Thus, Smart Cities or any other public
institution can reuse and adapt them to their concrete
requirements. In this sense, further alternative applications of
our approach that can be considered as a continuation of this
research may include:
(1) Searching and categorizing open datasets of different
cities, regions, countries, companies or any other kind of
institutions in order to get more data.
(2) Developing semantic-based software tools for automatic
classification of datasets.
(3) Analyzing the reuse of open datasets in proprietary software
projects, for instance, by developing a web app repository where
developers could register their applications that use open data
and indicate which particular datasets are reused.
(4) Analyzing the impact of open datasets in mass media, social
media, blogs, etc. by searching for references to the open
datasets in these sites.
(5) Conducting a set of controlled experiments to demonstrate
the effectiveness of our approach in different scenarios.
In summary, a successful publication of open datasets should be
based on the proper combination of the objectives of the open
data portal and the analysis of the impact of already available
open datasets. This approach provides a useful method for Smart
City decision makers to carry out this task in an objective and
analytic way.
6 ACKNOWLEDGEMENTS
We would like to thank GitHub, which allowed us to use its API
without limitations, and Socrata, which provides a way to
collect precisely all the datasets published using its tools.
This work has been developed with the support of (i)
TIN2015-69957-R and TIN2016-78103-C2-2-R (MINECO/ERDF, EU)
projects, (ii) POCTEP 4IE project (0045-4IE-4-P), (iii)
Consejería de Economía e Infraestructuras/Junta de Extremadura
(Spain) - European Regional Development Fund (ERDF) - GR15098
project and IB16055 project, and (iv) Consejería de Educación y
Empleo/Junta de Extremadura (Spain) - Becas de Movilidad al
Personal Docente e Investigador Curso 2016/2017.
REFERENCES
[1] Judie Attard, Fabrizio Orlandi, Simon Scerri, and Sören Auer. 2015. A systematic review of open government data initiatives. Government Information Quarterly 32, 4 (2015), 399–418. https://doi.org/10.1016/j.giq.2015.07.006
[2] Marco Bartolozzi, Pierfrancesco Bellini, Paolo Nesi, Gianni Pantaleo, and Luca Santi. 2015. A Smart Decision Support System for Smart City. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). IEEE, 117–122. https://doi.org/10.1109/SmartCity.2015.57
[3] Tegawende F. Bissyande, Ferdian Thung, David Lo, Lingxiao Jiang, and Laurent Reveillere. 2013. Popularity, interoperability, and impact of programming languages in 100,000 open source projects. In Proceedings - International Computer Software and Applications Conference. IEEE, 303–312. https://doi.org/10.1109/COMPSAC.2013.55
[4] Roberto Boselli, Mirko Cesarini, Fabio Mercorio, and Mario Mezzanzanica. 2015. Applying the AHP to Smart Mobility Services: A Case Study. In Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: KomIS. SCITEPRESS, 354–361. https://doi.org/10.5220/0005580003540361
[5] Hitachi Vantara Community. 2018. Data Integration - Kettle. (2018). http://community.pentaho.com/projects/data-integration/
[6] Peter Conradie and Sunil Choenni. 2014. On the barriers for local government releasing open data. Government Information Quarterly 31, SUPPL.1 (2014), S10–S17. https://doi.org/10.1016/j.giq.2014.01.003
[7] Carlo Daffara. 2012. Estimating the Economic Contribution of Open Source Software to the European Economy. In The First Openforum Academy Conference Proceedings. OpenForum Europe LTD, 11–14.
[8] Rishab Aiyer Ghosh. 2006. Economic impact of open source software on innovation and the competitiveness of the Information and Communication Technologies (ICT) sector in the EU. Technical Report. Maastricht: UNU-MERIT. http://stuermer.ch/blog/documents/FLOSSImpactOnEU.pdf
[9] Github. 2018. Github: The world’s leading software development platform. (2018). https://www.github.com/
[10] Group of Eight. 2013. G8 Open Data Charter. (2013). https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/207772/Open_Data_Charter.pdf
[11] Jeffrey Hammond, Paul Santinelli, Jay Jay Billings, and Bill Ledingham. 2016. The Tenth Annual Future of Open Source Survey. Technical Report. Black Duck Software and North Bridge. https://www.blackducksoftware.com/2016-future-of-open-source
[12] Anders Hjalmarsson, Niklas Johansson, and Daniel Rudmark. 2015. Mind the gap: Exploring stakeholders’ value with open data assessment. In Proceedings of the Annual Hawaii International Conference on System Sciences. IEEE, 1314–1323. https://doi.org/10.1109/HICSS.2015.160
[13] Mohammad Alamgir Hossain, Yogesh K Dwivedi, and Nripendra P. Rana. 2016. State of the Art in Open Data Research: Insights from Existing Literature and a Research Agenda. Journal of Organizational Computing and Electronic Commerce 26, 1-2 (apr 2016), 14–40. https://doi.org/10.1080/10919392.2015.1124007
[14] Marijn Janssen, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. Benefits, Adoption Barriers and Myths of Open Data and Open Government. Information Systems Management 29, 4 (sep 2012), 258–268. https://doi.org/10.1080/10580530.2012.716740 arXiv:arXiv:1011.1669v3
[15] Oskar Jarczyk, Blazej Gruszka, Szymon Jaroszewicz, and Leszek Bukowski. 2014. GitHub Projects. Quality Analysis of Open-Source Software. In SocInfo 2014: The 6th International Conference on Social Informatics. Springer, Cham, 80–94. https://doi.org/10.1007/978-3-319-13734-6_6
[16] Thorhildur Jetzek, Michel Avital, and Niels Bjorn-Andersen. 2014. Data-driven innovation through open government data. Journal of Theoretical and Applied Electronic Commerce Research 9, 2 (aug 2014), 100–120. https://doi.org/10.4067/S0718-18762014000200008
[17] Maxat Kassen. 2013. A promising phenomenon of open data: A case study of the Chicago open data project. Government Information Quarterly 30, 4 (2013), 508–513. https://doi.org/10.1016/j.giq.2013.05.012
[18] Sylvain Kubler, Jérémy Robert, Yves Le Traon, Jürgen Umbrich, and Sebastian Neumaier. 2016. Open Data Portal Quality Comparison using AHP. In Proceedings of the 17th International Digital Government Research Conference on Digital Government Research - dg.o ’16. ACM Press, New York, New York, USA, 397–407. https://doi.org/10.1145/2912160.2912167
[19] Jennifer Marlow, Laura Dabbish, and Jim Herbsleb. 2013. Impression Formation in Online Peer Production: Activity Traces and Personal Profiles in GitHub. In 16th ACM Conference on Computer Supported Cooperative Work. ACM Press, New York, New York, USA, 117–128. https://doi.org/10.1145/2441776.2441792
[20] K. Muthukumaran, Abhinav Choudhary, and N.L. Bhanu Murthy. 2015. Mining GitHub for Novel Change Metrics to Predict Buggy Files in Software Systems. In 2015 International Conference on Computational Intelligence and Networks. IEEE, 15–20. https://doi.org/10.1109/CINE.2015.13
[21] Paolo Neirotti, Alberto De Marco, Anna Corinna Cagliano, Giulio Mangano, and Francesco Scorrano. 2014. Current trends in smart city initiatives: Some stylised facts. Cities 38 (2014), 25–36. https://doi.org/10.1016/j.cities.2013.12.010
[22] National League of Cities. 2018. National League of Cities. (2018). https://www.nlc.org/
[23] Monica Palmirani, Michele Martoni, and Dino Girardi. 2014. Beyond Transparency Introduction : OGA Beyond Transparency. Electronic Government and the Information Systems Perspective (EGOVIS 2014) 8650, 2014 (2014), 275–291. https://doi.org/10.1007/978-3-319-10178-1_22
[24] T.L. Saaty. 1980. The Analytic Hierarchy Process. McGraw-Hill, New York.
[25] Jyoti Sheoran, Kelly Blincoe, Eirini Kalliamvakou, Daniela Damian, and Jordan Ell. 2014. Understanding "watchers" on GitHub. In MSR 2014: Proceedings of the 11th Working Conference on Mining Software Repositories. ACM Press, New York, New York, USA, 336–339. https://doi.org/10.1145/2597073.2597114
[26] Socrata. 2018. Socrata: Data-driven innovation of government programs. (2018). https://www.socrata.com/
[27] Katherine J. Stewart, Anthony P. Ammeter, and Likoebe M. Maruping. 2006. Impacts of license choice and organizational sponsorship on user interest and development activity in open source software projects. Information Systems Research 17, 2 (jun 2006), 126–144. https://doi.org/10.1287/isre.1060.0082
[28] Chandrasekar Subramaniam, Ravi Sen, and Matthew L. Nelson. 2009. Determinants of open source software project success: A longitudinal study. Decision Support Systems 46, 2 (jan 2009), 576–585. https://doi.org/10.1016/j.dss.2008.10.005 arXiv:arXiv:cond-mat/0402594v3
[29] Abobakr Sultan, Khalid A. AlArfaj, and Ghassan A. AlKutbi. 2012. Analytic hierarchy process for the success of e-government. Business Strategy Series 13, 6 (nov 2012), 295–306. https://doi.org/10.1108/17515631211286146
[30] Jeffrey Thorsby, Genie N.L. Stowers, Kristen Wolslegel, and Ellie Tumbuan. 2016. Understanding the content and features of open data portals in American cities. Government Information Quarterly 34, 1 (2016), 53–61. https://doi.org/10.1016/j.giq.2016.07.001
[31] Omkarprasad S. Vaidya and Sushil Kumar. 2006. Analytic hierarchy process: An overview of applications. European Journal of Operational Research 169, 1 (2006), 1–29. https://doi.org/10.1016/j.ejor.2004.04.028
[32] Nils Walravens, Jonas Breuer, and Pieter Ballon. 2014. Open Data as a Catalyst For The Smart City as a Local Innovation Platform. Communications & Strategies 96, 4th quarter 2014 (2014), 15–33. https://ssrn.com/abstract=2636315
[33] Liguo Yu, Alok Mishra, and Deepti Mishra. 2014. An Empirical Study of the Dynamics of GitHub Repository and Its Impact on Distributed Software Development. In Proceedings of the Confederated International Workshops on On the Move to Meaningful Internet Systems: OTM 2014 Workshops - Volume 8842. Springer-Verlag New York, Inc., 457–466. https://doi.org/10.1007/978-3-662-45550-0_46
[34] Anneke Zuiderwijk and Marijn Janssen. 2013. A Coordination Theory Perspective to Improve the Use of Open Data in Policy-Making. In Proceedings of the 12th IFIP WG 8.5 International Conference on Electronic Government - Volume 8074. Springer-Verlag New York, Inc., 38–49. https://doi.org/10.1007/978-3-642-40358-3_4
[35] Anneke Zuiderwijk and Marijn Janssen. 2014. Barriers and Development Directions for the Publication and Usage of Open Data: A Socio-Technical View. In Open Government. Vol. 4. Springer New York, New York, NY, 115–135. https://doi.org/10.1007/978-1-4614-9563-5_8 arXiv:arXiv:1011.1669v3
[36] Anneke Zuiderwijk, Marijn Janssen, Sunil Choenni, and Ronald Meijer. 2014. Design principles for improving the process of publishing open data. Transforming Government: People, Process and Policy 8, 2 (may 2014), 185–204. https://doi.org/10.1108/TG-07-2013-0024
[37] Anneke Zuiderwijk, Iryna Susha, Yannis Charalabidis, Peter Parycek, and Marijn Janssen. 2015. Open data disclosure and use: critical factors from a case study. In CeDEM 2015: Proceedings of the International Conference for E-Democracy and Open Government 2015. Edition Donau-Universität Krems, 197–208.