=Paper=
{{Paper
|id=Vol-3293/paper30
|storemode=property
|title=Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow
|pdfUrl=https://ceur-ws.org/Vol-3293/paper30.pdf
|volume=Vol-3293
|authors=Himesha Wijekoon,Vojtěch Merunka
|dblpUrl=https://dblp.org/rec/conf/haicta/WijekoonM22
}}
==Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow==
Himesha Wijekoon (1) and Vojtěch Merunka (1, 2)
(1) Czech University of Life Sciences Prague, Prague, Czech Republic
(2) Czech Technical University in Prague, Prague, Czech Republic
Abstract
Among the many popular crowdsourcing platforms, the Question & Answer website Stack
Overflow, part of the Stack Exchange Network, is used daily by millions of software professionals
to share knowledge globally. Stack Overflow data can therefore reveal important patterns in
global crowdsourcing that are beneficial for the software industry. The aim of this study was to
perform data mining on Stack Overflow data to discover some of these patterns, focusing on the
global distribution and contribution of users. Big data analytic techniques were applied using
Apache Spark with the Python language. Oracle Data Visualization Desktop and the scikit-learn
Python library were used for visualization. The results show that although the largest numbers of
users are from the USA and India, the average contribution is higher in European countries.
Keywords
Stack Overflow, Data Mining, Big Data Analytics, Crowdsourcing, Software Engineering,
User Participation, User Contribution
1. Introduction
Crowdsourcing is a type of participative online activity in which a person or an organization uses
an open call to ask a loosely defined group of people (the crowd) to carry out tasks. The crowd
undertakes the tasks voluntarily, and its motivation is not necessarily financial [1]. The term
Crowdsourced Software Engineering has emerged to describe the now popular practice of using
crowdsourcing for various software engineering tasks [2].
Among the many popular crowdsourcing platforms used in software engineering, the Question &
Answer (Q&A) website Stack Overflow is used daily by millions of software professionals to share
knowledge globally. Stack Overflow data can therefore reveal important patterns that show how
software professionals share knowledge on a global scale. The findings can also help global
software companies and crowdsourcing platforms to formulate and re-evaluate their strategies and
incentive criteria. The aim of this study is to perform data mining on Stack Overflow data to
discover patterns of global user distribution and contribution.
2. Background
Stack Overflow caters to a wide range of computer programming topics. In 2015 it recorded
5.7 billion page views as the number of registered Stack Overflow users approached 5 million
[3]. Registered users can post questions and answers on the website, and all content is freely
available for the public to view. The site also uses a comprehensive reputation management
system, in line with the belief in community moderation that Atwood expressed in a 2009 blog
post [3][4].
Proceedings of HAICTA 2022, September 22–25, 2022, Athens, Greece
EMAIL: wijekoon@pef.czu.cz (A. 1); merunka@pef.czu.cz (A. 2)
ORCID: 0000-0002-2800-5693 (A. 1); 0000-0002-9056-1439 (A. 2)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
In 2013, Schenk and Lungu found that contribution is highest in Europe and North America,
followed by Asia, which is mostly represented by India. Oceania contributes less than Asia but
more than South America and Africa combined. However, their research is based on the transfer
of knowledge, specifically which country raises a question and which country answers it [5].
A comprehensive study of the distribution of users across the globe with respect to their
contribution and reputation would therefore also be beneficial.
Reputation can also be manipulated by users who exploit the gamification mechanisms of Stack
Overflow [6]. To mitigate this issue, this research also uses the number of questions and
answers posted to represent contribution.
When comparing these measurements across users, the figures need to be normalized by each
user's length of membership. For example, Morrison and Murphy-Hill used Reputation per Month
rather than raw Reputation in their research [7]. Similarly, the number of answers posted per
month and the number of questions posted per month can be used in this research in addition to
reputation.
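The per-month normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 30.44-day average month, the one-month lower bound and the use of the dump date as the reference date are all assumptions made for the example.

```python
from datetime import date


def months_of_membership(joined: date, snapshot: date) -> float:
    """Approximate membership length in months (assumes 30.44 days per month)."""
    days = (snapshot - joined).days
    # Assumed lower bound of one month so very new accounts do not get inflated rates.
    return max(days / 30.44, 1.0)


def per_month(total: float, joined: date, snapshot: date) -> float:
    """Normalize a cumulative figure (reputation, answers, questions) by membership length."""
    return total / months_of_membership(joined, snapshot)


# Hypothetical user who joined two years before the snapshot and holds 2,400 reputation.
snapshot = date(2017, 12, 8)
rate = per_month(2400, date(2015, 12, 8), snapshot)
print(round(rate, 1))
```

The same helper applies unchanged to the number of answers or questions, since only the numerator differs.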
3. Methodology
The methodology of this research is based on the following phases, specified by Fayyad et al. for
discovering knowledge in databases [8].
3.1. Selection
The public data dump of all user-contributed content on the Stack Exchange Network, shared on the
Internet Archive, is the main data source for this research. The following files from the Stack
Exchange data dump published on 8 December 2017 were downloaded from the Internet Archive for
this study.
• Users.xml (2.36 GB)
• Posts.xml (56.3 GB)
The structure of these XML files was then studied to select the most appropriate data items. The
Entity Relationship Diagram of the schema is shown in Figure 1.
3.2. Pre-processing
Data mining tasks could not be performed directly on the downloaded raw XML files because of
their large size, their flat structure and the difficulty of splitting them. The raw data
therefore had to be loaded into another format on which Apache Spark can use its in-memory
processing and parallelization power. A MySQL relational database was used for this purpose. A
Python script was written for each raw XML file and executed using the spark-submit script
located in Spark's bin directory. Table 1 shows the number of records loaded into the respective
MySQL tables.
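The parsing stage of such a loading script might look like the sketch below. It uses only the Python standard library and a tiny in-memory sample rather than Spark and MySQL, so it should be read as an illustration of streaming over the `<row .../>` elements of the dump, not as the script used in this study. The helper name `iter_user_rows` and the choice of attributes are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for Users.xml; the dump stores one <row .../> element per user.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="1" Reputation="120" CreationDate="2010-01-05T10:00:00.000" Location="Prague, Czech Republic" />
  <row Id="2" Reputation="1" CreationDate="2016-07-19T08:30:00.000" />
</users>
"""


def iter_user_rows(source):
    """Yield (id, reputation, creation_date, location) without reading the whole file at once."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield (
                int(elem.get("Id")),
                int(elem.get("Reputation", "0")),
                elem.get("CreationDate"),
                elem.get("Location"),  # None when the user gave no location
            )
        # Drop the element's attributes and children to limit memory use.
        elem.clear()


rows = list(iter_user_rows(io.BytesIO(SAMPLE)))
print(rows)
```

Each yielded tuple could then be written to a database table in batches; the batching and the database connection are left out here.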
3.3. Transformation
Some of the data had to be converted into appropriate forms before data mining could start, as
described below.
3.3.1. Extraction of Country Names
Since countries and locations are specified in different formats in the raw data, a dedicated
Python program was implemented to extract the country name accurately with the help of geodict
(https://github.com/petewarden/geodict), a free and open-source Python library. In the end, the
locations of 1,172,495 users were identified and saved in a new database table. This is 15.83% of
all users and 80.24% of the users who specified a location.
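To illustrate why free-text locations need normalization, the sketch below maps a few location strings to canonical country names. It does not use geodict and does not reproduce the program written for this study; the alias table and the function `extract_country` are hypothetical and deliberately small.

```python
from typing import Optional

# Hypothetical alias table; a real gazetteer is far larger.
COUNTRY_ALIASES = {
    "usa": "UNITED STATES",
    "united states": "UNITED STATES",
    "uk": "UK",
    "united kingdom": "UK",
    "india": "INDIA",
    "czech republic": "CZECH REPUBLIC",
}


def extract_country(location: Optional[str]) -> Optional[str]:
    """Return a canonical country name, or None if no part of the location matches."""
    if not location:
        return None
    # Check comma-separated parts from right to left, since the country usually comes last.
    for part in reversed(location.split(",")):
        key = part.strip().lower().rstrip(".")
        if key in COUNTRY_ALIASES:
            return COUNTRY_ALIASES[key]
    return None


print(extract_country("Austin, TX, USA"))
print(extract_country("somewhere on Earth"))
```

Locations that match no alias return None, which corresponds to the users whose country could not be identified.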
Figure 1: ER Diagram of the Original Schema.
Table 1
Number of Records Loaded into MySQL Tables
MySQL Table Name Number of Records
Users 7,408,959
Posts 38,360,000
3.3.2. Aggregation
Since the tables hold millions of records, Spark with its Python API was chosen to take advantage
of its partition-aware loading feature. The groupBy function and built-in Spark aggregate
functions such as count and avg were used. All the aggregated data required for the research were
generated by Python scripts executed on the Spark engine.
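The logic of a grouped count and average can be shown in plain Python as below. This is only a single-machine stand-in for Spark's groupBy with count and avg, written so that it runs without a cluster; the sample rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical (country, reputation) pairs standing in for rows of the users table.
users = [
    ("INDIA", 10),
    ("INDIA", 30),
    ("SWITZERLAND", 400),
    ("SWITZERLAND", 200),
    ("UK", 150),
]


def count_and_avg_by_key(pairs):
    """Group (key, value) pairs by key and return {key: (count, average)}."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for key, value in pairs:
        totals[key][0] += 1
        totals[key][1] += value
    return {key: (n, s / n) for key, (n, s) in totals.items()}


stats = count_and_avg_by_key(users)
print(stats)
```

On the real tables the same grouping would be expressed through Spark so that the work is spread across partitions.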
3.3.3. Merging
The aggregated data sometimes had to be merged prior to data mining. Spark's ability to join
RDDs was used for this purpose.
3.4. Data Mining
For the numerical data, descriptive summary statistics were used to understand the distribution
of the data, mainly through Spark's describe function.
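The kind of summary involved (count, mean, sample standard deviation, minimum and maximum) can be reproduced for a small list with the standard library, as in the sketch below. The helper `describe` and the sample values are hypothetical and serve only to show what each statistic means.

```python
import statistics


def describe(values):
    """Return count, mean, sample standard deviation, min and max of a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical per-user answer counts.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```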
3.5. Interpretation/Evaluation
The descriptive statistics and the graphs generated by the Oracle Data Visualization Desktop
(ODVD) tool and Matplotlib were used to interpret and evaluate the results.
4. Results and Discussion
Country names were identified for 1,172,495 Stack Overflow users (15.83% of all users), and 205
distinct countries appear in this subset. The top 50 countries, sorted in descending order of
user count, are presented in Table 2.
Table 2
Top 50 Countries with Users
Rank Country Count Cluster Rank Country Count Cluster
1 UNITED STATES 256470 5 26 VIET NAM 8359 2
2 INDIA 214574 5 27 ROMANIA 8012 2
3 UK 74955 4 28 BELGIUM 7683 2
4 GERMANY 39550 4 29 SWITZERLAND 7406 2
5 CANADA 37576 4 30 ARGENTINA 7277 2
6 FRANCE 30470 4 31 SINGAPORE 7168 2
7 CHINA 30164 4 32 PORTUGAL 7103 2
8 AUSTRALIA 22434 3 33 IRELAND 6906 2
9 RUSSIAN FEDERATION 22070 3 34 DENMARK 6846 2
10 BRAZIL 20070 3 35 SRI LANKA 6508 2
11 PAKISTAN 18661 3 36 JAPAN 6352 2
12 NETHERLANDS 18170 3 37 MEXICO 6327 2
13 INDONESIA 14055 3 38 NEW ZEALAND 6191 2
14 UKRAINE 13391 3 39 MALAYSIA 6179 2
15 POLAND 13027 3 40 TAIWAN 5693 2
16 BANGLADESH 12825 3 41 NORWAY 5475 2
17 SPAIN 12364 3 42 NIGERIA 5288 2
18 PHILIPPINES 12288 3 43 GREECE 5121 2
19 ITALY 12194 3 44 AUSTRIA 5070 2
20 SWEDEN 11928 3 45 COLOMBIA 4765 2
21 IRAN 11862 3 46 SOUTH KOREA 4708 2
22 SOUTH AFRICA 9198 2 47 CZECH REPUBLIC 4405 2
23 ISRAEL 9002 2 48 FINLAND 4251 2
24 TURKEY 8697 2 49 NEPAL 4148 2
25 EGYPT 8527 2 50 BULGARIA 4134 2
As observed, the United States and India have by far the highest numbers of users, more than
200,000 each. Together they account for about 40% of the identified users, and they form
Cluster 5. Cluster 4 countries have between 30,000 and 75,000 users; the UK, Germany, Canada,
France and China belong to this category. Even though China has the world's largest population,
its participation does not match its population, possibly because of language issues. The same
may apply to the Russian Federation. Cluster 3 countries in Table 2 have between 11,862 (Iran)
and 22,434 (Australia) users. Another notable observation is that only 78 countries have more
than 1,000 identified users. Cluster 2 represents countries with more than 3,000 users, only
some of which are in the top 50 list. Cluster 1 represents countries with fewer than 3,000
users, none of which appear in Table 2.
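The cluster labels in Table 2 can be approximated by simple cut points on the user count, as sketched below. The paper does not state how the clusters were derived, so this is only an assumed reconstruction: the 3,000, 30,000 and 200,000 boundaries follow the text, while the 10,000 boundary between Clusters 2 and 3 is a hypothetical value chosen to agree with Table 2.

```python
import bisect

# Lower bounds of Clusters 2 to 5. The value 10_000 is an assumption; the others follow the text.
CUT_POINTS = [3_000, 10_000, 30_000, 200_000]


def cluster_for(user_count: int) -> int:
    """Map a user count to a cluster label from 1 (fewest users) to 5 (most users)."""
    return bisect.bisect_right(CUT_POINTS, user_count) + 1


# Counts taken from Table 2.
for country, count in [("UNITED STATES", 256470), ("CHINA", 30164),
                       ("IRAN", 11862), ("SOUTH AFRICA", 9198)]:
    print(country, cluster_for(count))
```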
These data were merged with the world population data for 2015 published by the United Nations
Population Division [9]. The number of users per 1000 capita was then calculated for each country
for further analysis.
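The per capita figure is a simple ratio, sketched below for one country. The user count is taken from Table 2; the population value (about 319.9 million for the United States in 2015) is an approximate figure assumed here for illustration, and the function name is hypothetical.

```python
def users_per_1000_capita(user_count: int, population: int) -> float:
    """Identified users per 1000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return user_count / population * 1000


# 256,470 identified users (Table 2) and an assumed 2015 population of 319,929,000.
us_rate = users_per_1000_capita(256_470, 319_929_000)
print(round(us_rate, 3))
```

With these inputs the result is roughly 0.80 users per 1000 inhabitants.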
The map in Figure 2 shows how users per 1000 capita vary across the globe, and Table 3 presents
the top 50 countries by users per 1000 capita in descending order. The main difference from the
user count ranking is that the United States falls to 17th position while India does not even
reach the top 50. The UK, however, ranks high in both lists and is the most populous of the
countries with the highest participation. Iceland takes first place even though it does not have
enough users to appear in Table 2. The main conclusion is that most European countries have
higher participation per capita. New Zealand, Singapore, Israel, Canada and Australia are also
among the highly participating countries.
Figure 2: Users per 1000 Capita.
To compare the contribution levels of average users across countries, the average reputation per
user, the average number of questions posted per user and the average number of answers posted
per user were analyzed for each country. Table 4 summarizes the rankings of countries that fall
into the top 20 of at least one category and have more than 500 users, along with the Russian
Federation and India because of their significance. Cells with a blue background show ranks
within the top 20, while cells with a pink background show ranks greater than 20 for the
respective category.
Since the reputation and answer rankings both relate to knowledge sharing, it is notable that
Switzerland tops both rankings, closely followed by the UK and Germany. Sweden, Austria and
Israel are in the top 10 of both rankings, together with most of the other European countries.
New Zealand, Australia and Canada also contribute substantially.
However, India and the Russian Federation contribute less despite their large populations.
Another important observation is that most countries that have high reputation and are good
answer providers are also good at asking questions. Italy, Ireland, Latvia and Lebanon, however,
are mainly question askers rather than answer providers. Meanwhile, Finland, the Netherlands and
Bulgaria have high reputation and answer rankings but do not ask many questions.
Table 3
Top 50 Countries with users per 1000 Capita
Rank Country UsersPer1000Capita Rank Country UsersPer1000Capita
1 ICELAND 1.91677 26 CROATIA 0.537297
2 MALTA 1.585535 27 CYPRUS 0.484933
3 IRELAND 1.469328 28 GERMANY 0.484042
4 NEW ZEALAND 1.341631 29 FRANCE 0.472717
5 SINGAPORE 1.29497 30 HONG KONG 0.462205
6 SWEDEN 1.221685 31 GREECE 0.456507
7 DENMARK 1.203439 32 MACEDONIA 0.438127
8 UK 1.146152 33 ARMENIA 0.416531
9 ISRAEL 1.116244 34 CZECH REPUBLIC 0.415419
10 NETHERLANDS 1.072704 35 ROMANIA 0.403087
11 NORWAY 1.052918 36 BELARUS 0.395961
12 CANADA 1.045238 37 URUGUAY 0.37942
13 ESTONIA 1.008119 38 HUNGARY 0.372039
14 LUXEMBOURG 0.959874 39 SLOVAKIA 0.359604
15 AUSTRALIA 0.942623 40 POLAND 0.34044
16 SWITZERLAND 0.890169 41 GEORGIA 0.322154
17 UNITED STATES 0.801646 42 SRI LANKA 0.314183
18 FINLAND 0.775452 43 SERBIA 0.312271
19 LITHUANIA 0.718981 44 UNITED ARAB EMIRATES 0.299968
20 PORTUGAL 0.68177 45 UKRAINE 0.299859
21 LATVIA 0.6815 46 COSTA RICA 0.285575
22 BELGIUM 0.680638 47 SPAIN 0.266479
23 SLOVENIA 0.679106 48 BOSNIA AND HERZEGOVINA 0.257921
24 AUSTRIA 0.584192 49 TAIWAN 0.242402
25 BULGARIA 0.575975 50 ALBANIA 0.238767
In both user participation and contribution, European countries along with Israel, Australia,
Canada and New Zealand stand out from the rest of the world. These findings were cross-checked
against the ICT Development Index of countries published by the United Nations [10]. The major
difference found was the underperformance in crowdsourcing activity of countries such as South
Korea and Japan, which have good global ICT rankings. A comparison with the IMD World Digital
Competitiveness Ranking 2017 [11] further supports this observation. Although this must be
analyzed further, one reason may be the language barrier. The presence of popular alternatives
to Stack Overflow may be another reason. The underrepresentation of China and the Russian
Federation may have the same causes.
5. Conclusion
Stack Overflow data reveal important patterns in global crowdsourcing that are beneficial for the
software industry. The results on global user distribution and contribution clearly show that the
largest numbers of users are from the USA and India. However, in both participation and
contribution, European countries along with Australia, Canada and New Zealand rank higher. The
low rankings of Japan, South Korea, the Russian Federation, Brazil and China are also notable.
Since these countries represent a large portion of the world population, further studies should
be carried out to identify the factors behind this phenomenon.
Table 4
Country Rankings for Contribution
Country Reputation Rank Answer Rank Question Rank
SWITZERLAND 1 1 6
UK 2 4 5
GERMANY 3 3 14
SWEDEN 4 10 13
GUATEMALA 5 55 97
MALTA 6 15 3
ISRAEL 7 2 1
AUSTRIA 8 6 15
NORWAY 9 14 9
NETHERLANDS 10 5 21
AUSTRALIA 11 12 16
NEW ZEALAND 12 13 18
FINLAND 13 11 49
CZECH REPUBLIC 14 7 4
BULGARIA 15 8 38
DENMARK 16 18 7
UNITED STATES 17 22 35
SLOVENIA 18 16 2
CANADA 19 25 24
SLOVAKIA 20 9 20
POLAND 21 17 25
BELGIUM 22 19 10
LATVIA 23 28 17
IRELAND 24 30 11
ITALY 27 23 8
PERU 32 20 55
RUSSIAN FEDERATION 35 38 54
CYPRUS 44 36 19
LEBANON 53 50 12
INDIA 64 58 56
6. References
[1] Y. Zhao, Q. Zhu, 2014. Evaluation on crowdsourcing research: Current status and future
direction. Information Systems Frontiers. Vol. 16, no. 3, p. 417–434.
[2] K. Mao, L. Capra, M. Harman, Y. Jia, 2017. A survey of the use of crowdsourcing in software
engineering. Journal of Systems and Software. Vol. 126, p. 57–84.
[3] Stack Exchange Inc, 2018. About - Stack Exchange. URL: https://stackexchange.com/about.
[4] J. Atwood, 2009. A Theory of Moderation - Stack Overflow Blog. URL:
https://stackoverflow.blog/2009/05/18/a-theory-of-moderation/.
[5] D. Schenk, M. Lungu, 2013. Geo-Locating the Knowledge Transfer in Stack Overflow. In:
Proceedings of the 2013 International Workshop on Social Software Engineering. Saint
Petersburg, Russia: ACM. p. 2–5.
[6] T. Ahmed, A. Srivastava, 2017. Understanding and evaluating the behavior of technical users. A
study of developer interaction at StackOverflow. Human-centric Computing and Information
Sciences. Vol. 7, no. 1, p. 1–19.
[7] P. Morrison, E. Murphy-Hill, 2013. Is Programming Knowledge Related To Age?
People.Engr.Ncsu.Edu. p. 3–6.
[8] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, 1996. From Data Mining to Knowledge Discovery in
Databases. AI Magazine. Vol. 17, p. 37–54.
[9] United Nations Department of Economic and Social Affairs, Population Division, 2017. World
Population Prospects: The 2017 Revision.
[10] United Nations International Telecommunication Union, 2017. ITU | 2017 Global ICT
Development Index. URL: http://www.itu.int/net4/ITU-D/idi/2017/#idi2017rank-tab.
[11] IMD World Competitiveness Centre, 2017. IMD World Digital Competitiveness Ranking 2017.
URL: https://www.imd.org/globalassets/wcc/docs/release-2017/world_digital_competitiveness_yearbook_2017.pdf.