=Paper=
{{Paper
|id=Vol-2323/SKI-Canada-2019-7-4-3
|storemode=property
|title=Exploring GitHub Data for Geospatial Government Research: Understanding the Limitations of Inadequate Quality Control
|pdfUrl=https://ceur-ws.org/Vol-2323/SKI-Canada-2019-7-4-3.pdf
|volume=Vol-2323
|authors=Jaydeep Mistry
}}
==Exploring GitHub Data for Geospatial Government Research: Understanding the Limitations of Inadequate Quality Control==
Spatial Knowledge and Information Canada, 2019, 7(4), 3

JAYDEEP MISTRY, Department of Geography, University of Waterloo, jaydeep.mistry@uwaterloo.ca

===Abstract===
GitHub is an online platform that allows for open collaboration between government members and public contributors. Over the past few years, a growing number of governments have adopted GitHub to host their own software projects publicly, because the platform is friendly to open source projects. This creates a need to research how governments are using the platform, for what projects, and how their use differs spatially. Geospatial research on government use of GitHub requires data that is complete and not missing information. It is found that the data automatically generated by the GitHub platform tends to be complete and accurate, while the data voluntarily provided by governments is often missing geospatial or contextual information.

===1. Introduction===
Over the past decade, there have been significant advancements in the realm of Open Data (Janssen et al., 2012), spatial analysis (Anselin, 2012), and collaboration (Palomino et al., 2017) for solving issues in various fields such as policy planning (Taeihagh, 2017), ecology (Steiniger & Geoffrey, 2009), web mapping (Neset et al., 2016), CyberGIS (Wang et al., 2013), and more. Research has been conducted on the use of existing tools, and on emerging tools in the industry, that can support multi-scale, multi-temporal, and multi-dimensional geospatial data management or analysis (Palomino et al., 2017). Although working together on a software project is not usually thought of as a geospatial and temporal problem, any team developing a software solution has to face these issues if its members want to work together without being bound by any member's location or time of day. Open collaboration has brought new platforms that allow members inside and outside an organization to develop software that can be shared over the web (Mergel, 2015). However, due to barriers in individual expertise with software, or IT constraints on organization-level adoption, only select platforms are adopted by governments (Longo & Kelly, 2016).

One platform that governments have adopted in recent years is GitHub. It is a web-based hosting service for version-controlled software project repositories that allows users within and outside an organization to work together on projects, review changes, comment on issues, and more. Although GitHub accounts and projects can be made private and visible only to approved users, the platform has mostly been used to host open-source projects, where all of the data is copyrighted under a public license but anyone can contribute changes to the project or use the code themselves. Because it is very friendly to open-source projects, GitHub has become a useful and powerful tool for governments: it allows them to work together on projects that remain accessible to the public while still controlling who gets to make changes (Longo & Kelly, 2016).

With the rise in the adoption of GitHub by governments for government-related uses, there is a need to research how those governments are using it, for what projects, and whether their use of the platform differs between governments of different regions, e.g. North America versus Europe. Although previous studies have analyzed GitHub use by governments, only some have analyzed the quantitative data available from GitHub. Thus, the research goal of this paper is to use the GitHub data to explore how many governments are using the platform, and to analyze the completeness of its spatial and contextual information for use in future geospatial research. It does so by answering the following research objectives:

1. How many governments are using GitHub accounts?

2. How many government GitHub accounts are geo-locatable?

3. What is the completeness of the contextual government GitHub data?

===2. Methods===
All of the data was gathered using Python libraries in a Jupyter Notebook. Figure 1 illustrates the process of gathering the data. The first step used GitHub's REST API to make hundreds of web requests for data on the specific GitHub accounts that are listed as official government accounts on GitHub's own website (government.github.com/community). The data returned by the GitHub API for these accounts was stored as a table using the Python library Pandas. Each account has a field listing its geographic location. Since this location data is plain text, it had to be geocoded, meaning that it had to be converted to a latitude/longitude pair that could be placed accurately on a world map. After each account was geocoded using the Google Maps API, the geospatial dataset was stored in a Microsoft Excel spreadsheet for further data visualization.

Figure 1: Spatial Data Gathering Workflow. GitHub Rest API (web scraping, api.github.com) → Pandas Data Structures (data frames, data manipulation) → Google Maps API (geocoding addresses) → Microsoft Excel (storing as spreadsheets, data visualization).

===3. Results===
====3.1 Research Objective 1====
The first objective is to see how many governments are using GitHub accounts. After web scraping all 770 GitHub accounts listed as official government accounts on GitHub's own webpage, they were plotted by their date of creation. Figure 2 shows those accounts from the early days of GitHub use in 2009 up to the end of 2018. Adoption of GitHub by governments increased year over year until 2014, where that increase plateaued. It is also important to note that a single government organization could own multiple GitHub accounts.

Figure 2: Number of GitHub Accounts by Year of Creation (bar chart, y-axis: # of accounts; bar labels as extracted, not matched to years: 213, 142, 133, 86, 60, 49, 48, 17, 19, 3).

However, the real-world government organization that owns a GitHub account is often not listed and is instead implied by the name of the account. For example, the account @thecityofcalgary is owned by the City of Calgary in Canada, but the @web-boew account is owned by the federal Government of Canada.

Figure 3: Percentage of Government Accounts by Year of Update (2014: 4.03%; 2015: 18.31%; 2016: 17.01%; 2017: 13.51%; 2018: 47.14%).

Figure 3 shows that just over 60% of the accounts (13.51% + 47.14% = 60.65%) have been updated since the beginning of 2017, whereas roughly 40% (39.35%) have not been updated since 2016. An account created long ago may genuinely not have needed any update, so the dates of creation and update are not the best indicators of account activity. It is possible to look at the performance of the individual repositories of each government account, but that would require analyzing over 27,000 repositories, which is beyond the scope of this paper.

====3.2 Research Objective 2====
Figure 4: Proportion of Accounts with Location Data (provided location data: 72%; missing location data: 28%).

The second objective was to determine how many government accounts are geo-locatable. Figure 4 illustrates that 28% (217) of the government GitHub accounts were completely missing location data. The only way to tell what country those accounts belong to would be to further web-scrape the GitHub webpage that lists these official government accounts and record the country each account is listed under.

Figure 5: Number of Accounts Geocoded to Correct Country (bar chart, x-axis: # of words used to describe location, y-axis: # of accounts). Geocoded correctly, by word count: 1: 127; 2: 256; 3: 87; 4: 29; 5: 12; 6: 5; 7: 8; 8: 4; 9: 2; 11: 1; 15: 1; 16: 1. Geocoded incorrectly: 12 accounts with one-word locations, plus bars of 3, 4, and 1 accounts whose word counts are not recoverable from the extracted text.

One of the biggest issues with the location data, other than being empty, is that some accounts are geo-located to the incorrect country because their GitHub data listed only a few words, which were generic enough to match places in other countries. Figure 5 shows that 12 accounts were geocoded incorrectly because only one word was used to describe their location. An example is the account for the Canterbury Regional Council, which used only the word 'Canterbury' in the location field. Since there are various places called Canterbury across the world, the Google Maps API geocoded the account to a place in the United States instead of New Zealand, which is where the account is actually from.

====3.3 Research Objective 3====
The third objective was to determine the completeness of the contextual information of the government GitHub data. For this paper, the contextual information being assessed was limited to four aspects of the government GitHub data available about the organization: name, description, email, and location. Figure 6 illustrates that only about 32% of the accounts have provided all four contextual information items. Almost 40% of the accounts are missing exactly one item (39.43% provided three of the four), and over 28% are missing more than one item (4.84% + 23.34% = 28.18%).

Figure 6: Percentage of Accounts by the Amounts of Contextual Information Provided (# of contextual items provided: 0: 0.00%; 1: 4.84%; 2: 23.34%; 3: 39.43%; 4: 32.39%).

===4. Conclusion===
In conclusion, only with adequate information is it possible to use the government GitHub account data for geospatial research that might investigate who these accounts belong to, when they were created, and where they are located. Data such as the date of account creation and last update are accurate enough for analysis because they are automatically recorded by the platform as the changes happen. However, data such as the contextual information that governments can voluntarily add to their accounts is often missing some information, for example the geographic location of the organization.

There is a need for better quality control of the voluntarily provided government data because it is often incomplete, which should not be the case for a public-facing government resource that citizens of the community may want to view or interact with. Having complete geospatial and contextual GitHub data on these government accounts can not only benefit the public, but also allow for future geospatial government research into various fields of GIScience, open collaboration, open source software, and much more.

===Acknowledgements===
I would like to acknowledge my graduate supervisor Dr. Peter A. Johnson for supporting me in my own Master's studies. I would also like to thank him for funding me through scholarships awarded by the Social Sciences and Humanities Research Council of Canada (SSHRC).

===References===
Anselin, L. (2012). From SpaceStat to CyberGIS: Twenty years of spatial data analysis software. International Regional Science Review, 131-157.

Janssen, M., Charalabidis, Y., & Zuiderwijk, A. (2012). Benefits, adoption barriers and myths of open data and open government. Information Systems Management, 258-268.

Longo, J., & Kelly, T. M. (2016). GitHub use in public administration in Canada: Early experience with a new collaboration tool. Canadian Public Administration, 598-623.

Mergel, I. (2015). Open collaboration in the public sector: The case of social coding on GitHub. Government Information Quarterly, 32(4), 464-472.

Neset, T. S., Opach, T., Lion, P., Lilja, A., & Johansson, J. (2016). Map-based web tools supporting climate change adaptation. The Professional Geographer, 103-114.

Palomino, J., Muellerklein, O. C., & Kelly, M. (2017). A review of the emergent ecosystem of collaborative geospatial tools for addressing environmental challenges. Computers, Environment and Urban Systems, 79-92.

Steiniger, S., & Geoffrey, H. J. (2009). Free and open source geographic information tools for landscape ecology. Ecological Informatics, 183-195.

Taeihagh, A. (2017). Crowdsourcing: a new tool for policy-making? Policy Sciences, 629-647.

Wang, S., Anselin, L., Bhaduri, B., Crosby, C., Goodchild, M., Liu, Y., & Nyerges, T. L. (2013). CyberGIS software: a synthetic review and integration roadmap. International Journal of Geographical Information Science, 2122-2145.
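
The tallies behind Figures 5 and 6 (Research Objectives 2 and 3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook code used for the paper: the keys `login`, `name`, `description`, `email`, and `location` follow the field names of GitHub REST API organization objects, while the function names, the `geocoded_country` and `listed_country` lookup dictionaries, and the sample records are hypothetical and exist only for this example.

```python
from collections import Counter

# The four contextual items assessed in Section 3.3.
CONTEXT_FIELDS = ("name", "description", "email", "location")


def items_provided(account):
    """Count how many contextual fields hold a non-empty value."""
    return sum(1 for field in CONTEXT_FIELDS if (account.get(field) or "").strip())


def completeness_distribution(accounts):
    """Map each possible item count (0 to 4) to the percentage of accounts."""
    total = len(accounts)
    if total == 0:
        return {}
    counts = Counter(items_provided(a) for a in accounts)
    return {k: round(100 * counts[k] / total, 2) for k in range(len(CONTEXT_FIELDS) + 1)}


def location_word_count(account):
    """Number of whitespace-separated words in the free-text location."""
    return len((account.get("location") or "").split())


def geocoding_outcomes(accounts, geocoded_country, listed_country):
    """Tally correct and incorrect geocodes by location word count.

    geocoded_country: login -> country returned by a geocoder (assumed input).
    listed_country:   login -> country the account is listed under (assumed input).
    Accounts with an empty location are skipped, since there is nothing to geocode.
    """
    tally = {}
    for account in accounts:
        words = location_word_count(account)
        if words == 0:
            continue
        login = account["login"]
        matched = geocoded_country[login] == listed_country[login]
        tally.setdefault(words, Counter())["correct" if matched else "incorrect"] += 1
    return tally


# Invented sample records, for illustration only.
sample_accounts = [
    {"login": "example-city", "name": "Example City", "description": "Open data",
     "email": "it@example.org", "location": "Example City, Canada"},
    {"login": "example-council", "name": "Example Council", "description": None,
     "email": None, "location": "Canterbury"},
    {"login": "example-agency", "name": "Example Agency", "description": "",
     "email": None, "location": None},
]
sample_geocoded = {"example-city": "Canada", "example-council": "United States"}
sample_listed = {"example-city": "Canada", "example-council": "New Zealand"}

print(completeness_distribution(sample_accounts))
print(geocoding_outcomes(sample_accounts, sample_geocoded, sample_listed))
```

Grouping geocoding outcomes by word count mirrors the Canterbury example: a one-word location is flagged as incorrect when the geocoded country disagrees with the country the account is listed under.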