=Paper=
{{Paper
|id=Vol-3293/paper30
|storemode=property
|title=Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow
|pdfUrl=https://ceur-ws.org/Vol-3293/paper30.pdf
|volume=Vol-3293
|authors=Himesha Wijekoon,Vojtěch Merunka
|dblpUrl=https://dblp.org/rec/conf/haicta/WijekoonM22
}}
==Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow==
<pdf width="1500px">https://ceur-ws.org/Vol-3293/paper30.pdf</pdf>
<pre>
Patterns of User Participation and Contribution in Global
Crowdsourcing: A Data Mining Study of Stack Overflow
Himesha Wijekoon 1 and Vojtěch Merunka 1,2
1 Czech University of Life Sciences Prague, Prague, Czech Republic
2 Czech Technical University in Prague, Prague, Czech Republic


                 Abstract
                 Among many popular crowdsourcing platforms, the Question & Answer website Stack
                 Overflow in the Stack Exchange Network is used daily by millions of software professionals
                 to share knowledge globally. Stack Overflow data can therefore reveal patterns in global
                 crowdsourcing that are beneficial for the software industry. The aim of this study was to
                 perform data mining on Stack Overflow data to discover some of these patterns. The research
                 focused on analyzing global user distribution and contribution. Big data analytic techniques
                 were applied using Apache Spark with the Python language. Oracle Data Visualization Desktop
                 and the scikit-learn Python library were used for visualization. The results show that
                 although the majority of users are from the USA and India, the average contribution is
                 higher in European countries.

                 Keywords
                 Stack Overflow, Data Mining, Big Data Analytics, Crowdsourcing, Software Engineering,
                 User Participation, User Contribution

1. Introduction

    Crowdsourcing is a type of participative online activity in which a person or an organization
asks a loosely defined group of people (the crowd) to carry out tasks through an open call. The crowd
undertakes the tasks voluntarily, driven by motivation that is not necessarily financial [1]. The term
Crowdsourced Software Engineering has emerged to describe the now popular practice of using
crowdsourcing for various software engineering tasks [2].
    Among the many popular crowdsourcing platforms used in software engineering, the Question &
Answer (Q&A) website Stack Overflow is used daily by millions of software professionals to share
knowledge globally. Stack Overflow data can therefore reveal important patterns in how software
professionals share knowledge on a global scale. The findings can also help global software companies
and crowdsourcing platforms to formulate and re-evaluate their strategies and incentive criteria. The
aim of this study is to perform data mining on Stack Overflow data to discover patterns of global user
distribution and contribution.

2. Background

    Stack Overflow covers a wide range of computer programming topics. In 2015 it recorded 5.7 billion
page views as the number of registered Stack Overflow users was reaching 5 million [3]. Registered
users can post questions and answers on the website. All the content is freely available for the
public to view. The site also uses a comprehensive reputation management system; as Atwood stated in a
2009 blog post, he believes in community moderation [3][4].


Proceedings of HAICTA 2022, September 22–25, 2022, Athens, Greece
EMAIL: wijekoon@pef.czu.cz (A. 1); merunka@pef.czu.cz (A. 2)
ORCID: 0000-0002-2800-5693 (A. 1); 0000-0002-9056-1439 (A. 2)
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)

    Schenk and Lungu (2013) found that contribution is highest in Europe and North America, followed
by Asia, which is mostly represented by India; Oceania contributes less than Asia but more than South
America and Africa combined. However, their research focuses on the transfer of knowledge,
specifically which country raises a question and which country answers it [5]. A comprehensive study of
the user distribution across the globe with respect to contribution and reputation would therefore
also be beneficial.
    Reputation can also be manipulated by users who exploit the gamification mechanisms of Stack
Overflow [6]. To address this issue, the number of questions and answers posted is also used in this
research to represent contribution.
    When comparing these measurements across users, the figures need to be normalized by each user's
length of membership. For example, Morrison and Murphy-Hill used Reputation per Month rather than raw
Reputation as the measurement in their research [7]. Similarly, the number of answers posted per month
and the number of questions posted per month can be used in this research in addition to reputation.
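
    The per-month normalization described above can be sketched as follows. This is a minimal
illustration, not the authors' script: the function names, the 30.44-day average month, the one-month
floor and the sample user are all assumptions; only the snapshot date comes from the data dump used
in this study.

```python
from datetime import date

AVG_DAYS_PER_MONTH = 30.44  # assumption: average Gregorian month length


def months_of_membership(joined: date, snapshot: date) -> float:
    """Approximate membership length in months, floored at one month."""
    days = (snapshot - joined).days
    return max(days / AVG_DAYS_PER_MONTH, 1.0)


def per_month(total: float, joined: date, snapshot: date) -> float:
    """Normalize a cumulative figure (reputation, answers, questions) by membership length."""
    return total / months_of_membership(joined, snapshot)


# Hypothetical user with made-up values; the snapshot is the dump date (8 December 2017).
snapshot = date(2017, 12, 8)
rate = per_month(1200, date(2015, 12, 8), snapshot)
print(f"reputation per month: {rate:.2f}")
```

    The one-month floor keeps very recent accounts from producing inflated rates.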

3. Methodology
   The methodology of this research follows the phases specified by Fayyad et al. for knowledge
discovery in databases [8].

3.1. Selection

   The public data dump of all user-contributed content on the Stack Exchange Network, shared on The
Internet Archive, is the main data source for this research. The following files from the Stack
Exchange data dump published on 8th December 2017 were downloaded from The Internet Archive for this
study.
    • Users.xml (2.36 GB)
    • Posts.xml (56.3 GB)

   The structure of these XML files was then studied to select the most appropriate data items. The
Entity Relationship Diagram of the schema is shown in Figure 1.

3.2. Pre-processing

    Data mining tasks could not be performed directly on the downloaded raw XML files because of their
large size, their flat structure, and the difficulty of splitting them into independent parts.
Therefore, the raw data had to be loaded into another format on which Apache Spark could apply its
in-memory processing and parallelization power. A MySQL relational database was used for this purpose.
A Python script was written for each raw XML file and executed using the spark-submit script located
in Spark’s bin directory. Table 1 shows the number of records loaded into the respective MySQL tables.
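
    One way to read such large XML files without loading them into memory is to stream them element
by element. The sketch below is an illustration under assumptions, not the authors' loader: it assumes
each record is a row element carrying its fields as attributes, uses a tiny in-memory sample instead
of Users.xml, and omits the MySQL insert step. The helper name iter_rows is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, fields):
    """Yield one dict per <row> element while streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield {name: elem.attrib.get(name) for name in fields}
            elem.clear()  # drop the element's attributes and children once processed


# Tiny made-up sample standing in for the real dump file.
sample = io.BytesIO(
    b"<users>"
    b'<row Id="1" Reputation="101" Location="Prague, Czech Republic" />'
    b'<row Id="2" Reputation="1" />'
    b"</users>"
)
rows = list(iter_rows(sample, ["Id", "Reputation", "Location"]))
print(rows)
```

    In a real loader the yielded dictionaries would be written to the database in batches rather than
collected in a list.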

3.3. Transformation

   Some of the data had to be converted into appropriate forms before the data mining activities
could start, as described below.

3.3.1. Extraction of Country Names

   Since the names of countries and locations are specified in different formats in the raw data, a
dedicated Python program was implemented to extract the country name accurately with the help of a
free and open-source Python library named geodict (https://github.com/petewarden/geodict). In the end,
the locations of 1,172,495 users were identified and saved in a new database table. This is 15.83% of
all users and 80.24% of the users who have specified a location.
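
    The idea of mapping free-text locations to a canonical country name can be illustrated with a
small lookup. This sketch does not use or reproduce the geodict API; the alias table and the function
extract_country are hypothetical and far smaller than anything usable in practice.

```python
# Hypothetical alias table: lowercase spelling -> canonical country name.
ALIASES = {
    "usa": "UNITED STATES",
    "united states": "UNITED STATES",
    "uk": "UK",
    "united kingdom": "UK",
    "india": "INDIA",
    "czech republic": "CZECH REPUBLIC",
}


def extract_country(location):
    """Return a canonical country name for a free-text location, or None if unknown."""
    if not location:
        return None
    # "City, Region, Country" usually ends with the country, so scan parts from last to first.
    for part in reversed(location.split(",")):
        key = part.strip().lower()
        if key in ALIASES:
            return ALIASES[key]
    return None


print(extract_country("Prague, Czech Republic"))
print(extract_country("somewhere on Earth"))
```

    Locations that match no alias return None, which corresponds to users whose country could not be
identified.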


Figure 1: ER Diagram of the Original Schema.

Table 1
Number of Records Loaded into MySQL Tables
              MySQL Table Name                                     Number of Records
                    Users                                              7,408,959
                    Posts                                             38,360,000


3.3.2. Aggregation

   Since the tables contain millions of records, Spark with its Python API was chosen to take
advantage of its partition-aware loading feature. The groupBy function and built-in aggregate
functions such as count and avg were used. All the aggregated data required for the research were
generated by Python scripts executed on the Spark engine.
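
    The semantics of grouping by country and computing count and avg can be shown with a plain-Python
analogue. This is not Spark code and not the authors' script; the function group_count_avg and the
sample records are assumptions made for illustration only.

```python
from collections import defaultdict


def group_count_avg(records, key, value):
    """Group records by `key` and return the count and average of `value` per group."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, running sum]
    for rec in records:
        acc = totals[rec[key]]
        acc[0] += 1
        acc[1] += rec[value]
    return {k: {"count": n, "avg": s / n} for k, (n, s) in totals.items()}


# Made-up records standing in for rows of the users table.
users = [
    {"country": "A", "reputation": 10},
    {"country": "A", "reputation": 30},
    {"country": "B", "reputation": 5},
]
stats = group_count_avg(users, "country", "reputation")
print(stats)
```

    On the real data the same aggregation is distributed by Spark across partitions instead of being
run in a single loop.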

3.3.3. Merging

   The aggregated data sometimes had to be merged prior to data mining. Spark’s ability to join RDDs
was used for this purpose.

3.4.    Data Mining
   For the numerical data, descriptive summary statistics were used to understand the distribution of
the data. The Spark function describe was mainly used for this purpose.
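
    The kind of summary produced by such a function (count, mean, standard deviation, minimum and
maximum) can be sketched with the Python standard library. This is an analogue rather than Spark's
implementation; the use of the sample standard deviation and the input values are assumptions.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Made-up values for illustration.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```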


3.5.     Interpretation/Evaluation

  The descriptive statistics and the graphs generated by the Oracle Data Visualization Desktop (ODVD)
tool and Matplotlib were used to interpret and evaluate the results.

4. Results and Discussion

   Country names were identified for 1,172,495 Stack Overflow users (15.83% of all users), and 205
distinct countries were found in the subset under analysis. The top 50 countries, sorted in descending
order of user count, are presented in Table 2.

Table 2
Top 50 Countries with Users
               Country         Count      Cluster                 Country           Count      Cluster
    1       UNITED STATES     256470         5            26     VIET NAM           8359       2
    2           INDIA         214574         5            27     ROMANIA            8012       2
    3            UK            74955         4            28      BELGIUM           7683       2
    4         GERMANY          39550         4            29   SWITZERLAND          7406       2
    5         CANADA           37576         4            30    ARGENTINA           7277       2
    6          FRANCE          30470         4            31     SINGAPORE          7168       2
    7          CHINA           30164         4            32     PORTUGAL           7103       2
    8         AUSTRALIA        22434         3            33      IRELAND           6906       2
    9    RUSSIAN FEDERATION    22070         3            34     DENMARK            6846       2
    10          BRAZIL         20070         3            35     SRI LANKA          6508       2
    11        PAKISTAN         18661         3            36       JAPAN            6352       2
    12      NETHERLANDS        18170         3            37      MEXICO            6327       2
    13       INDONESIA         14055         3            38   NEW ZEALAND          6191       2
    14        UKRAINE          13391         3            39     MALAYSIA           6179       2
    15        POLAND           13027         3            40      TAIWAN            5693       2
    16      BANGLADESH         12825         3            41      NORWAY            5475       2
    17         SPAIN           12364         3            42      NIGERIA           5288       2
    18       PHILIPPINES       12288         3            43      GREECE            5121       2
    19          ITALY          12194         3            44      AUSTRIA           5070       2
    20        SWEDEN           11928         3            45     COLOMBIA           4765       2
    21          IRAN           11862         3            46   SOUTH KOREA          4708       2
    22      SOUTH AFRICA       9198          2            47   CZECH REPUBLIC       4405       2
    23         ISRAEL          9002          2            48      FINLAND           4251       2
    24         TURKEY          8697          2            49       NEPAL            4148       2
    25         EGYPT           8527          2            50     BULGARIA           4134       2


    As observed, the United States and India have by far the highest numbers of users, with more than
200,000 each. Together they account for about 40% of the identified users. They are categorized as
Cluster 5 countries. Cluster 4 countries have between 30,000 and 75,000 users; the UK, Germany,
Canada, France and China belong to this category. Even though China has the world’s largest
population, its participation does not match its population. This could be due to language issues, and
the same may hold for the Russian Federation. Another notable observation is that only 78 countries
have more than 1000 identified users. Cluster 2 represents countries with more than 3000 users, and
only some of them are in the top 50 list. Cluster 1 represents countries with fewer than 3000 users,
none of which is included in Table 2.
   The above data were merged with world population data for the year 2015 published by the United
Nations Population Division [9]. The number of users per 1000 capita was then calculated for each
country for further analysis.
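
    The merge and the per-capita calculation can be sketched as a join on country name followed by a
division. This is a minimal illustration, not the authors' code: the function users_per_1000 is
hypothetical, and the population value is a rounded placeholder rather than the United Nations
figure. The user counts are taken from Table 2.

```python
def users_per_1000(user_counts, populations):
    """Join the two mappings on country and compute users per 1000 inhabitants."""
    return {
        country: count / populations[country] * 1000
        for country, count in user_counts.items()
        if country in populations and populations[country] > 0
    }


user_counts = {"UNITED STATES": 256470, "INDIA": 214574}  # from Table 2
populations = {"UNITED STATES": 320_000_000}  # rounded placeholder, NOT the UN figure
rates = users_per_1000(user_counts, populations)
print(rates)
```

    Countries missing from the population data are dropped, which mirrors an inner join.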
   The map in Figure 2 displays how the number of users per 1000 capita varies across the globe, and
Table 3 presents the top 50 countries by users per 1000 capita in descending order. The main
difference from the user count ranking is that the United States falls to 17th position, while India
does not even qualify for the top 50. The UK, however, ranks high in both lists and is the most
populous country among those with the highest participation. Iceland ranks first even though it does
not have enough users to be listed in Table 2. The main conclusion is that most European countries
generally have higher participation per capita. Countries such as New Zealand, Singapore, Israel,
Canada, and Australia are also among the high-participation countries.


Figure 2: Users per 1000 Capita.

   To compare the contribution levels of average users across countries, user contribution was
analyzed in terms of average reputation per user, average number of questions posted per user and
average number of answers posted per user for each country. Table 4 summarizes the rankings of
countries that fall into the top 20 of at least one category and have more than 500 users, along with
the Russian Federation and India, which are included for their significance. Cells with a blue
background display ranks within the top 20, while cells with a pink background display ranks greater
than 20 for the respective category.
   The reputation and answer rankings both relate to knowledge sharing. Switzerland is the top
country in both rankings, closely followed by the UK and Germany. Sweden, Austria, and Israel are
among the top 10 of both rankings, alongside mostly European countries. New Zealand, Australia and
Canada contribute substantially as well.


   However, India and the Russian Federation contribute less despite their large populations.
Another important observation is that most countries with high reputation and answer rankings also
rank high in asking questions. However, Italy, Ireland, Latvia, and Lebanon are mainly question askers
rather than answer providers. Meanwhile, Finland, the Netherlands and Bulgaria have high reputation
and answer rankings but do not ask many questions.

Table 3
Top 50 Countries with users per 1000 Capita
             Country           UsersPer1000Capita                 Country             UsersPer1000Capita
     1       ICELAND                      1.91677     26          CROATIA                      0.537297
     2        MALTA                      1.585535     27          CYPRUS                       0.484933
     3       IRELAND                     1.469328     28         GERMANY                       0.484042
     4    NEW ZEALAND                    1.341631     29          FRANCE                       0.472717
     5      SINGAPORE                     1.29497     30        HONG KONG                      0.462205
     6       SWEDEN                      1.221685     31          GREECE                       0.456507
     7      DENMARK                      1.203439     32        MACEDONIA                      0.438127
     8          UK                       1.146152     33         ARMENIA                       0.416531
     9        ISRAEL                     1.116244     34      CZECH REPUBLIC                   0.415419
    10    NETHERLANDS                    1.072704     35         ROMANIA                       0.403087
    11       NORWAY                      1.052918     36          BELARUS                      0.395961
    12       CANADA                      1.045238     37         URUGUAY                        0.37942
    13       ESTONIA                     1.008119     38         HUNGARY                       0.372039
    14     LUXEMBOURG                    0.959874     39         SLOVAKIA                      0.359604
    15      AUSTRALIA                    0.942623     40          POLAND                        0.34044
    16     SWITZERLAND                   0.890169     41         GEORGIA                       0.322154
    17    UNITED STATES                  0.801646     42         SRI LANKA                     0.314183
    18       FINLAND                     0.775452     43           SERBIA                      0.312271
    19      LITHUANIA                    0.718981     44    UNITED ARAB EMIRATES               0.299968
    20      PORTUGAL                      0.68177     45          UKRAINE                      0.299859
    21        LATVIA                       0.6815     46        COSTA RICA                     0.285575
    22       BELGIUM                     0.680638     47           SPAIN                       0.266479
    23      SLOVENIA                     0.679106     48    BOSNIA AND HERZEGOVINA             0.257921
    24       AUSTRIA                     0.584192     49          TAIWAN                       0.242402
    25      BULGARIA                     0.575975     50          ALBANIA                      0.238767


    In both user participation and contribution, European countries along with Israel, Australia,
Canada, and New Zealand stand out from the rest of the world. These findings were cross-evaluated
against the ICT Development Index of countries provided by the United Nations [10]. The major
difference found was the underperformance in crowdsourcing activities of countries such as South Korea
and Japan, which have good global ICT rankings. This is further supported by comparing the findings
with the IMD World Digital Competitiveness Ranking 2017 [11]. Although this must be analyzed further,
one reason may be the language barrier. The presence of popular alternatives to Stack Overflow may be
another reason. The under-representation of China and the Russian Federation may also be due to these
factors.


5. Conclusion

   Stack Overflow data reveals important patterns in global crowdsourcing that are beneficial for the
software industry. The results on global user distribution and contribution clearly show that the
majority of users are from the USA and India. However, in both participation and contribution,
European countries along with Australia, Canada and New Zealand rank higher. The lower rankings of
Japan, South Korea, the Russian Federation, Brazil and China are also notable. Since these countries
represent a large portion of the world population, further studies should be carried out to identify
the factors behind this phenomenon.

Table 4
Country Rankings for Contribution
                    Country                          Reputation Rank   Answer Rank     Question Rank
                 SWITZERLAND                                1               1                6
                       UK                                   2               4                5
                   GERMANY                                  3                3              14
                    SWEDEN                                  4               10              13
                  GUATEMALA                                 5               55              97
                     MALTA                                  6               15               3
                     ISRAEL                                 7                2               1
                    AUSTRIA                                 8                6              15
                   NORWAY                                   9               14               9
                 NETHERLANDS                               10                5              21
                  AUSTRALIA                                11               12              16
                 NEW ZEALAND                               12               13              18
                   FINLAND                                 13               11              49
                CZECH REPUBLIC                             14                7               4
                   BULGARIA                                15                8              38
                   DENMARK                                 16               18               7
                 UNITED STATES                             17               22              35
                   SLOVENIA                                18               16               2
                    CANADA                                 19               25              24
                   SLOVAKIA                                20                9              20
                    POLAND                                 21               17              25
                   BELGIUM                                 22               19              10
                     LATVIA                                23               28              17
                    IRELAND                                24               30              11
                      ITALY                                27               23               8
                     PERU                                  32               20              55
              RUSSIAN FEDERATION                           35               38              54
                    CYPRUS                                 44               36              19
                   LEBANON                                 53               50              12
                     INDIA                                 64               58              56


6. References

[1] Y. Zhao, Q. Zhu, 2014, Evaluation on crowdsourcing research: Current status and future direction.
     Information Systems Frontiers. 2014. Vol. 16, no. 3, p. 417–434.
[2] K. Mao, L. Capra, M. Harman, Y. Jia, 2017. A survey of the use of crowdsourcing in software
     engineering. Journal of Systems and Software. 2017. Vol. 126, p. 57–84.
[3] Stack Exchange Inc, 2018. About - Stack Exchange. URL: https://stackexchange.com/about.
[4] J. Atwood, A Theory of Moderation - Stack Overflow Blog, 2009. URL:
     https://stackoverflow.blog/2009/05/18/a-theory-of-moderation/.
[5] D. Schenk, M. Lungu, 2013. Geo-Locating the Knowledge Transfer in Stack Overflow. In:
     Proceedings of the 2013 International Workshop on Social Software Engineering. Saint
     Petersburg, Russia: ACM. 2013. p. 2–5.
[6] T. Ahmed, A. Srivastava, 2017. Understanding and evaluating the behavior of technical users. A
     study of developer interaction at StackOverflow. Human-centric Computing and Information
     Sciences. 2017. Vol. 7, no. 1, p. 1–19.
[7] P. Morrison, E. Murphy-Hill, 2013. Is Programming Knowledge Related To Age?
     People.Engr.Ncsu.Edu, 2013. p. 3–6.
[8] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, 1996. From Data Mining to Knowledge Discovery in
     Databases. AI Magazine. 1996. Vol. 17, p. 37–54.
[9] United Nations Department of Social Affairs, Population Division, 2017, World Population
     Prospects: The 2017 Revision.
[10] United Nations International Telecommunication Union, 2017. ITU | 2017 Global ICT
     Development Index, 2017. URL: http://www.itu.int/net4/ITU-D/idi/2017/#idi2017rank-tab.
[11] IMD World Competitiveness Centre, 2017, IMD World Digital Competitiveness Ranking 2017, URL:
     https://www.imd.org/globalassets/wcc/docs/release-2017/world_digital_competitiveness_yearbook_2017.pdf.



</pre>