=Paper= {{Paper |id=Vol-2323/SKI-Canada-2019-7-4-3 |storemode=property |title=Exploring GitHub Data for Geospatial Government Research: Understanding the Limitations of Inadequate Quality Control |pdfUrl=https://ceur-ws.org/Vol-2323/SKI-Canada-2019-7-4-3.pdf |volume=Vol-2323 |authors=Jaydeep Mistry }} ==Exploring GitHub Data for Geospatial Government Research: Understanding the Limitations of Inadequate Quality Control== https://ceur-ws.org/Vol-2323/SKI-Canada-2019-7-4-3.pdf
Spatial Knowledge and Information Canada, 2019, 7(4), 3



Exploring GitHub Data for Geospatial
Government Research: Understanding the
Limitations of Inadequate Quality Control
JAYDEEP MISTRY
Department of Geography
University of Waterloo
jaydeep.mistry@uwaterloo.ca

ABSTRACT

GitHub is an online platform that allows for open collaboration between government members and public contributors. Over the past few years, a growing number of governments have adopted GitHub to host their own software projects publicly because the platform is friendly to open-source projects. This brings a need to research how governments are using the platform, for what projects, and how their use differs spatially. To perform geospatial research on government use of GitHub, the data needs to be complete. It is found that the data automatically generated by the GitHub platform tends to be complete and accurate, while the data voluntarily provided by the governments is often missing some geospatial or contextual information.

1. Introduction

Over the past decade, there have been significant advancements in the realm of Open Data (Janssen et al., 2012), spatial analysis (Anselin, 2012), and collaboration (Palomino et al., 2017) for solving issues in various fields such as policy planning (Taeihagh, 2017), ecology (Steiniger & Geoffrey, 2009), web mapping (Neset et al., 2016), CyberGIS (Wang, et al., 2013), and more. Research has been conducted on the use of existing tools, and on emerging tools in the industry that can support multi-scale, multi-temporal, and multi-dimensional geospatial data management or analysis (Palomino et al., 2017). Although working together on a software project is not immediately thought of as a geospatial and temporal problem, any team developing a software solution needs to face these issues if its members want to work together on a project without being bound by any team member’s location or time of day. Open collaboration has brought new platforms that allow members inside and outside an organization to collaborate on software that can be shared on the web (Mergel, 2015). However, due to barriers in individual software expertise, or IT constraints on organization-level adoption, only select platforms are adopted by governments (Longo & Kelly, 2016).

One platform that has been adopted by governments in recent years is GitHub. It is a web-based hosting service for version-controlled software project repositories that allows users within and outside an organization to work together on projects, review changes, comment on issues, and more. Although it is possible to make GitHub accounts and projects private and visible only to approved users, it has been mostly used to host open-source
projects where all of the data is copyrighted under a public license, but anyone can contribute their changes to the project or use the code themselves. Because it is very friendly to open-source projects, it has become a useful and powerful tool for governments: it allows them to work together on projects that are accessible to the public while still controlling who gets to make changes (Longo & Kelly, 2016).

With the rise in the adoption of GitHub by governments for government-related uses, there is a need to research how those governments are using it, for what projects, and whether their use of the platform differs between governments of different regions, e.g. its use in North America versus Europe. Although previous studies have tried to analyze GitHub use by governments, only some have tried to analyze the quantitative data available from GitHub. Thus, the research goal of this paper is to use the GitHub data to explore how many governments are using the platform, and also to analyze the completeness of its spatial and contextual information for use in future geospatial research. It will do so by addressing the following research objectives:
     1. How many governments are using GitHub accounts?
     2. How many government GitHub accounts are geo-locatable?
     3. What is the completeness of the contextual government GitHub data?

2. Methods
All of the data was gathered using Python libraries in a Jupyter Notebook. Figure 1 illustrates the process of gathering the data. The first step involved using GitHub’s REST API to make hundreds of web requests for data on specific GitHub accounts which are listed as official government accounts on GitHub’s own website (government.github.com/community). As the GitHub API answered each request with the data on the GitHub accounts, the results were stored as a table using a Python library called Pandas. For each account, there was a field which listed the geographic location of the account. Since the location data on these accounts was just plain text, it had to be geocoded, meaning that it had to be converted to a latitude/longitude pair which could be placed accurately on a world map. After the Google Maps API was used to geocode each account, the geospatial dataset was ready and was stored in a Microsoft Excel spreadsheet for further data visualization.

Step 1: GitHub REST API
• Web scraping
• api.github.com
Step 2: Pandas Data Structures
• Data frames
• Data manipulation
Step 3: Google Maps API
• Geocoding addresses
Step 4: Microsoft Excel
• Storing as Spreadsheets
• Data Visualization
Figure 1: Spatial Data Gathering Workflow

3. Results

3.1 Research Objective 1
The first objective is to see how many governments are using GitHub accounts. After web scraping all 770 GitHub accounts
listed as official government accounts on GitHub’s own webpage, they were plotted based on their date of creation. Figure 2 shows the plot of those accounts from the early days of using GitHub in 2009, and up to the end of 2018. Adoption of GitHub by governments increased year over year until 2014, when that increase plateaued. It is also important to note that a single government organization could own multiple GitHub accounts.

# of Accounts, bar values in chart order (the text gives the range as 2009 to 2018):
3, 17, 49, 86, 142, 213, 133, 60, 48, 19
Figure 2: Number of GitHub Accounts by Year of Creation

However, the real-world government organization that owns a GitHub account is often not listed and is instead implied by the name of the account. For example, the account @thecityofcalgary is owned by the City of Calgary in Canada, but the @web-boew account is owned by the federal Government of Canada.

Year of update:   2014     2015     2016     2017     2018
% of accounts:    4.03%    18.31%   17.01%   13.51%   47.14%
Figure 3: Percentage of Government Accounts by Year of Update

Figure 3 shows that just over 60% of the accounts have been updated since the beginning of 2017, whereas nearly 40% of the accounts have not been updated since 2016. An account created some time ago may genuinely not have needed any update, so the dates of creation and last update are not the best indicators of account activity. It is possible to look at the performance of the individual repositories of each government account, but that would require analyzing over 27,000 repositories, which is beyond the scope of this paper.

3.2 Research Objective 2

Missing Location Data:    28%
Provided Location Data:   72%
Figure 4: Proportion of Accounts with Location Data

The second objective was to determine how many government accounts are geo-locatable. Figure 4 illustrates that 28% (217) of the government GitHub accounts were completely missing location data. The only way to tell what country those accounts belong to would be further web scraping of the GitHub webpage which lists these official government accounts, recording what country each account was listed under.

# of Words Used to Describe Location:   1     2     3     4     5     6     7     8     9     11    15    16
Geocoded Incorrectly (# of Accounts):   12    3     4     1
Geocoded Correctly (# of Accounts):     127   256   87    29    12    5     8     4     2     1     1     1
Figure 5: Number of Accounts Geocoded to Correct Country

One of the biggest issues with the location data, other than being empty, is that some accounts are geo-located to the incorrect country because their GitHub data listed only a few words, which were generic enough to match places in other countries. Figure 5 shows that there were 12 accounts that were geocoded incorrectly because only one word was used to describe their location. An example is the account for the Canterbury Regional Council, which only used the word ‘Canterbury’ in the location field. Since there are various places called Canterbury across the world, the Google Maps API geocoded the account to a place in the United States instead of New Zealand, which is where the account is actually from.

3.3 Research Objective 3
The third objective was to determine the completeness of the contextual information of the government GitHub data. For this paper, the contextual information being assessed was limited to four aspects of the government GitHub data available about the organization: name, description, email, and location. Figure 6 illustrates that only about 32% of the accounts have provided all 4 of the contextual information items. Almost 40% of the accounts are missing exactly one item, and over 28% are missing more than one item.

# of Contextual Items Provided:   0       1       2        3        4
% of Accounts:                    0.00%   4.84%   23.34%   39.43%   32.39%
Figure 6: Percentage of Accounts by the Amount of Contextual Information Provided

4. Conclusion

In conclusion, only with adequate information is it possible to use the government GitHub account data for geospatial research that might investigate who these accounts belong to, and when and where they come from. Data such as the dates of account creation and last update are accurate enough for analysis because they are automatically recorded by the platform as the changes happen. However, data
such as the contextual information that the governments can voluntarily add to their accounts is often missing some information, for example the geographic location of the organization.

There is a need for better quality control of the voluntarily provided government data because it is often incomplete. This should not be the case for a public-facing government resource that citizens may want to view or interact with. Having complete geospatial and contextual GitHub data on these government accounts can not only benefit the public, but also allow for future geospatial government research into various fields of GIScience, open collaboration, open source software, and much more.

Acknowledgements

I would like to acknowledge my graduate supervisor Dr. Peter A. Johnson for supporting me in my own Master’s studies. I would also like to thank him for funding me through scholarships awarded by the Social Sciences and Humanities Research Council of Canada (SSHRC).

References

Anselin, L. (2012). From SpaceStat to CyberGIS: Twenty years of spatial data analysis software. International Regional Science Review, 131-157.

Janssen, M., Charalabidis, Y., & Zuiderwijk, A. (2012). Benefits, adoption barriers and myths of open data and open government. Information Systems Management, 258-268.

Longo, J., & Kelly, T. M. (2016). GitHub use in public administration in Canada: Early experience with a new collaboration tool. Canadian Public Administration, 598-623.

Mergel, I. (2015). Open collaboration in the public sector: The case of social coding on GitHub. Government Information Quarterly, 32(4), 464-472.

Neset, T. S., Opach, T., Lion, P., Lilja, A., & Johansson, J. (2016). Map-based web tools supporting climate change adaptation. The Professional Geographer, 103-114.

Palomino, J., Muellerklein, O. C., & Kelly, M. (2017). A review of the emergent ecosystem of collaborative geospatial tools for addressing environmental challenges. Computers, Environment and Urban Systems, 79-92.

Steiniger, S., & Geoffrey, H. J. (2009). Free and open source geographic information tools for landscape ecology. Ecological Informatics, 183-195.

Taeihagh, A. (2017). Crowdsourcing: a new tool for policy-making? Policy Sciences, 629-647.

Wang, S., Anselin, L., Bhaduri, B., Crosby, C., Goodchild, M., Liu, Y., & Nyerges, T. L. (2013). CyberGIS software: a synthetic review and integration roadmap. International Journal of Geographical Information Science, 2122-2145.
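
The completeness measure of Section 3.3 and the word counts behind Figure 5 can be illustrated with a short sketch. This is not the notebook used for this paper, which relied on Pandas; it is a minimal pure-Python approximation under stated assumptions. It assumes each account has already been retrieved and is available as a dictionary with the fields name, description, email, and location. The function names and the sample records are hypothetical and exist only for illustration.

```python
from collections import Counter

# The four contextual fields assessed in Section 3.3.
CONTEXT_FIELDS = ("name", "description", "email", "location")


def count_contextual_items(account):
    """Return how many of the four contextual fields are non-empty."""
    return sum(
        1 for field in CONTEXT_FIELDS
        if (account.get(field) or "").strip()
    )


def completeness_distribution(accounts):
    """Map each item count (0 to 4) to the percentage of accounts with it."""
    total = len(accounts)
    if total == 0:
        return {}
    counts = Counter(count_contextual_items(a) for a in accounts)
    return {
        k: 100.0 * counts.get(k, 0) / total
        for k in range(len(CONTEXT_FIELDS) + 1)
    }


def location_word_count(account):
    """Number of whitespace-separated words in the free-text location."""
    return len((account.get("location") or "").split())


# Hypothetical records (assumption): not taken from the paper's dataset.
sample_accounts = [
    {"login": "example-city", "name": "Example City",
     "description": "Open data", "email": "it@example.org",
     "location": "Example City, Canada"},
    {"login": "example-council", "name": "Example Council",
     "description": None, "email": None, "location": "Canterbury"},
    {"login": "example-agency", "name": None,
     "description": "", "email": None, "location": None},
    {"login": "example-dept", "name": "Example Dept",
     "description": "Maps", "email": "gis@example.org", "location": ""},
]

distribution = completeness_distribution(sample_accounts)
one_word_locations = [
    a["login"] for a in sample_accounts if location_word_count(a) == 1
]
```

Treating empty strings and missing values alike is a design choice of this sketch: both count as "not provided". Accounts whose location is a single word, such as the ‘Canterbury’ case discussed in Section 3.2, can then be flagged for manual review before geocoding.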