
Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow

Himesha Wijekoon (1), wijekoon@pef.czu.cz

Vojtěch Merunka (0, 1), merunka@pef.czu.cz

(0) Czech Technical University in Prague, Prague, Czech Republic

(1) Czech University of Life Sciences Prague, Prague, Czech Republic

Keywords: Stack Overflow, Data Mining, Big Data Analytics, Crowdsourcing, Software Engineering, User Participation, User Contribution

Among the many popular crowdsourcing platforms, the question-and-answer website Stack Overflow, part of the Stack Exchange Network, is used daily by millions of software professionals to share knowledge globally. Stack Overflow data can therefore reveal important patterns in global crowdsourcing that are beneficial to the software industry. The aim of this study was to perform data mining on Stack Overflow data to discover some of these patterns. The research focused on analyzing the global distribution and contribution of users. Big data analytics techniques were used for the data mining activities, using Apache Spark with the Python language. Oracle Data Visualization Desktop and the scikit-learn Python library were used for visualization. The results show that although the largest numbers of users are from the USA and India, the average contribution is higher in European countries.

2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

2. Background

[...] available for the public for viewing. It also uses a comprehensive reputation management system; as Atwood stated in a 2009 blog post, he believes in community moderation [3][4].

Schenk and Lungu (2013) found that contribution is highest in Europe and North America, followed by Asia, which is mostly represented by India. Oceania contributes less than Asia but more than South America and Africa combined. However, their research is based on the transfer of knowledge, specifically which country raises a question and which country answers it [5]. It would therefore also be beneficial to perform a comprehensive study of the distribution of users across the globe with respect to their contribution and reputation.

Reputation can also be manipulated by users who exploit the gamification mechanisms of Stack Overflow [6]. To address this issue, the number of questions and answers posted is also used in this research to represent contribution.

When these measurements are compared across users, the figures need to be normalized by each user's length of membership. For example, Morrison and Murphy-Hill used reputation per month rather than raw reputation as the measurement in their research [7]. Similarly, the number of answers posted per month and the number of questions posted per month can be used in this research in addition to reputation per month.
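The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, the cutoff date is assumed to be the publication date of the data dump used in this study, and counting whole months with a minimum of one is an assumption made here to avoid division by zero.

```python
from datetime import date

# Assumption: the cutoff is the publication date of the data dump used in
# this study (8 December 2017); any other fixed snapshot date would work.
SNAPSHOT = date(2017, 12, 8)


def months_of_membership(joined, snapshot=SNAPSHOT):
    """Whole months between joining and the snapshot, never less than 1."""
    months = (snapshot.year - joined.year) * 12 + (snapshot.month - joined.month)
    if snapshot.day < joined.day:
        months -= 1  # the last month is not complete yet
    return max(months, 1)  # avoid division by zero for brand-new accounts


def per_month(total, joined, snapshot=SNAPSHOT):
    """Normalize a cumulative figure (reputation, answers, or questions)."""
    return total / months_of_membership(joined, snapshot)


# Hypothetical user: 1,200 reputation points earned since 8 December 2016.
print(per_month(1200, date(2016, 12, 8)))  # 100.0
```

The same helper applies unchanged to the number of answers or questions, since all three figures are cumulative totals over the membership period.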

3. Methodology

The methodology of this research is based on the following phases specified by Fayyad et al. for discovering knowledge in databases [8].

3.1. Selection

The public data dump of all user-contributed content on the Stack Exchange Network, shared on The Internet Archive, is the main data source for this research. The following files from the Stack Exchange data dump published on 8 December 2017 were downloaded from The Internet Archive for this study.

• Users.xml (2.36 GB)
• Posts.xml (56.3 GB)

The structure of these XML files was then studied to select the most appropriate data items. The Entity Relationship Diagram of the schema is shown in Figure 1.

3.2. Pre-processing

Data mining tasks could not be performed directly on the downloaded raw XML files because of their large size, their flat structure, and because such files cannot easily be split into parts. The raw data therefore had to be loaded into another format on which Apache Spark can apply its in-memory processing and parallelization power. A MySQL relational database was used for this purpose. A Python script was written for each raw XML file and executed using the spark-submit script located in Spark's bin directory. Table 1 shows the number of records loaded into the respective MySQL tables.
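The loading scripts themselves are not reproduced in the paper. As a rough sketch of one way to read such a dump row by row without holding the whole file in memory, the example below streams `<row>` elements with the standard library's `iterparse`. The function name `iter_rows` and the sample records are invented for illustration, and the step that writes each record to MySQL is deliberately left out.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, wanted):
    """Yield one dict per <row> element, keeping only the wanted attributes."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event: start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield {name: elem.get(name) for name in wanted}
            root.clear()  # drop processed rows so memory use stays flat


# Invented sample in the flat, attribute-only layout of the dump files.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="1" Reputation="101" CreationDate="2008-07-31T14:22:31.287" Location="Prague, Czech Republic" />
  <row Id="2" Reputation="57" CreationDate="2009-01-05T09:10:11.000" />
</users>"""

rows = list(iter_rows(io.BytesIO(SAMPLE), ["Id", "Reputation", "Location"]))
for row in rows:
    print(row)  # a missing attribute such as Location is reported as None
```

In a real run the `source` argument would be the path to Users.xml or Posts.xml, and each yielded dictionary would be passed to a batched database insert.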

3.3. Transformation

Some of the data had to be converted into appropriate forms before the data mining activities could begin. These conversions are described below.

3.3.1. Extraction of Country Names

Since countries and locations are specified in different formats in the raw data, a dedicated Python program was implemented to extract the country name accurately with the help of geodict (https://github.com/petewarden/geodict), a free and open-source Python library. In the end, the countries of 1,172,495 users were identified and saved in a new database table. This is 15.83% of all users and 80.24% of the users who specified a location.
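To illustrate the kind of normalization involved, the sketch below maps free-text locations to country names. It is only a stand-in: it does not use geodict, whose API is not shown in the paper, and the alias table and the function name `extract_country` are hypothetical. A real alias table would need to cover far more spellings, cities, and regions.

```python
import re

# Hypothetical and deliberately tiny alias table, for illustration only.
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "india": "India",
    "germany": "Germany",
    "deutschland": "Germany",
}


def extract_country(location):
    """Return a normalized country name, or None if none is recognized."""
    if not location:
        return None
    parts = [p.strip().lower() for p in re.split(r"[,/;]", location) if p.strip()]
    # The country usually comes last ("Bangalore, India"), so scan backwards.
    for part in reversed(parts):
        if part in COUNTRY_ALIASES:
            return COUNTRY_ALIASES[part]
    return None


for text in ["Bangalore, India", "London, UK", "Munich / Deutschland", "somewhere on Earth"]:
    print(text, "->", extract_country(text))
```

Locations that yield `None` correspond to the users whose country could not be identified and who are therefore excluded from the per-country analysis.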

3.3.2. Aggregation

Table 1: Number of records loaded into the MySQL tables

Table    Number of Records
Users    7,408,959
Posts    38,360,000

Since the tables have millions of records, Spark with the Python API was chosen to leverage its partition-aware loading feature. The groupBy function and other built-in Spark aggregate functions such as count and avg were used. All the aggregated data required for the research were generated by Python scripts executed on the Spark engine.

3.3.3. Merging

The aggregated data sometimes needed to be merged prior to data mining. Spark's feature for joining RDDs was used for this purpose.

3.4. Data Mining

For the numerical data, descriptive summary statistics were used to understand the distribution of the data. Spark's describe function was mainly used for this purpose.
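The aggregation, merging, and summary steps of Sections 3.3.2 to 3.4 can be pictured with a small pure-Python analogue. This is not the Spark code used in the study: it only mirrors, on a few invented records, what a group-by with count and average, an inner join on country, and a describe-style summary compute. All function names and figures below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented (country, reputation) records standing in for the Users table.
users = [
    ("India", 10), ("India", 30),
    ("Switzerland", 200), ("Switzerland", 100),
    ("Iceland", 50),
]
# Invented population figures; Iceland is omitted to show the join's effect.
population = {
    "India": {"population": 1000},
    "Switzerland": {"population": 10},
}


def aggregate_by_country(rows):
    """Group by country, then count users and average their reputation."""
    groups = defaultdict(list)
    for country, reputation in rows:
        groups[country].append(reputation)
    return {c: {"users": len(v), "avg_reputation": mean(v)} for c, v in groups.items()}


def join_on_country(left, right):
    """Inner join of two dicts keyed by country name."""
    return {c: {**left[c], **right[c]} for c in left.keys() & right.keys()}


def describe(values):
    """Count, mean, sample standard deviation, min, and max of the values."""
    values = list(values)
    return {
        "count": len(values),
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else None,
        "min": min(values),
        "max": max(values),
    }


aggregated = aggregate_by_country(users)
merged = join_on_country(aggregated, population)
summary = describe(row["avg_reputation"] for row in merged.values())
print(merged)
print(summary)
```

Because the join is an inner join, a country without a population entry (Iceland in this toy data) drops out of the merged result, which is worth keeping in mind when country names come from two different sources.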

3.5. Interpretation/Evaluation

The descriptive statistics and the graphs generated by the Oracle Data Visualization Desktop (ODVD) tool and Matplotlib were used to interpret and evaluate the results.

4. Results and Discussion

Countries were identified for 1,172,495 Stack Overflow users (15.83% of all users), and 205 distinct countries were found in this subset. The top 50 countries, sorted in descending order of user count, are presented in Table 2.

As observed, the United States and India have by far the highest numbers of users, more than 200,000 each. Collectively they represent 40% of the total users. They are categorized as Cluster 5 countries. Cluster 4 countries have between 30,000 and 75,000 users; the UK, Germany, Canada, France, and China belong to this category. Even though China has the world's largest population, its participation does not match its population. This could be due to language issues, and the same may hold for the Russian Federation. Another notable observation is that only 78 countries have more than 1,000 identified users. Cluster 2 represents countries with more than 3,000 users, and only some of them are in the top 50 list. Cluster 1 represents countries with fewer than 3,000 users, none of which is included in Table 2.

These data were merged with the world population data for 2015 published by the United Nations Population Division [9]. The number of users per 1,000 capita was then calculated for each country for further analysis.
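The users per 1,000 capita figure is simply 1,000 times the user count divided by the population. The sketch below shows the calculation and the resulting ranking; the two countries and their figures are invented, chosen only to show how a small country can overtake a much larger one once population is taken into account.

```python
def users_per_1000_capita(user_count, population):
    """Number of identified users per 1,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * user_count / population


def rank_by_users_per_1000(user_counts, populations):
    """Countries sorted by users per 1,000 capita, highest first."""
    rates = {
        country: users_per_1000_capita(count, populations[country])
        for country, count in user_counts.items()
        if country in populations  # skip countries without population data
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Invented figures, not taken from the study's tables.
user_counts = {"Large": 200_000, "Small": 1_500}
populations = {"Large": 300_000_000, "Small": 300_000}

ranking = rank_by_users_per_1000(user_counts, populations)
print(ranking)  # "Small" ranks first with 5.0 users per 1,000 capita
```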

The map in Figure 2 displays how users per 1,000 capita vary across the globe, and Table 3 presents the top 50 countries by users per 1,000 capita in descending order. The main observation, compared with the user count ranking, is that the United States falls to 17th position while India does not even qualify for the top 50. The UK, however, is consistent in both rankings and is the most populous country among those with the highest participation. Iceland takes first place even though it does not have enough users to appear in the first list. The main conclusion is that most European countries generally have higher participation per capita. New Zealand, Singapore, Israel, Canada, and Australia are also among the high-participation countries.

To compare the contribution levels of average users across countries, user contributions were analyzed in terms of the average reputation per user, the average number of questions posted per user, and the average number of answers posted per user for each country. Table 4 summarizes the rankings of countries that fall into the top 20 of a category and have more than 500 users, along with the Russian Federation and India because of their significance. Cells with a blue background show ranks within the top 20, while cells with a pink background show ranks greater than 20 for the respective category.
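A ranking table of this kind can be derived from per-country averages roughly as shown below. This is a sketch under stated assumptions: the dictionary layout, the metric names, and the three sample countries are hypothetical, and only the threshold of more than 500 users comes from the text.

```python
METRICS = ("avg_reputation", "avg_questions", "avg_answers")


def rank_countries(stats, metric, min_users=500):
    """Map each eligible country to its rank (1 = highest) for one metric."""
    eligible = [c for c, s in stats.items() if s["users"] > min_users]
    ordered = sorted(eligible, key=lambda c: stats[c][metric], reverse=True)
    return {country: position for position, country in enumerate(ordered, start=1)}


# Invented per-country averages, not taken from Table 4.
stats = {
    "A": {"users": 900, "avg_reputation": 120.0, "avg_questions": 1.0, "avg_answers": 3.0},
    "B": {"users": 700, "avg_reputation": 80.0, "avg_questions": 2.5, "avg_answers": 1.0},
    "C": {"users": 300, "avg_reputation": 500.0, "avg_questions": 9.0, "avg_answers": 9.0},
}

rank_table = {metric: rank_countries(stats, metric) for metric in METRICS}
for metric, ranks in rank_table.items():
    print(metric, ranks)
```

Country "C" is excluded despite its high averages because it does not exceed 500 users, which is the reason for the threshold: averages over very few users are unstable. Comparing a country's rank across the three metrics is what separates question askers from answer providers.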

The reputation and answer rankings both relate to knowledge sharing. Switzerland is the top country in both rankings, closely followed by the UK and Germany. Sweden, Austria, and Israel are among the top 10 of both rankings, together with most of the other European countries. New Zealand, Austria, and Canada also contribute substantially.

However, India and the Russian Federation contribute less despite their large populations. Another important observation is that most countries with high reputation and answer rankings also rank well in asking questions. Italy, Ireland, Latvia, and Lebanon, however, are mainly question askers rather than answer providers. Meanwhile, Finland, the Netherlands, and Bulgaria have high reputation and answering rates but do not ask many questions.

In both user participation and contribution, European countries, along with Israel, Australia, Canada, and New Zealand, stand out from the rest of the world. These findings were cross-evaluated against the ICT Development Index of countries provided by the United Nations [10]. The major difference found was the underperformance in crowdsourcing activities of countries such as South Korea and Japan, which have good global ICT rankings. This observation is further supported by comparing the findings with the IMD World Digital Competitiveness Ranking 2017 [11]. Although this must be analyzed further, one reason may be the language barrier. The presence of popular alternatives to Stack Overflow may be another reason. The under-representation of China and the Russian Federation may also be due to these factors.

5. Conclusion

Stack Overflow data reveal important patterns in global crowdsourcing that are beneficial to the software industry. The results on global user distribution and contribution clearly show that the largest numbers of users are from the USA and India. However, in both participation and contribution, European countries, along with Australia, Canada, and New Zealand, rank higher. The lower rankings of Japan, South Korea, the Russian Federation, Brazil, and China are also notable. Since these countries represent a huge portion of the world population, further studies should be carried out to find the factors behind this phenomenon.

6. References

[1] Zhao, Zhu, 2014. Evaluation on crowdsourcing research: Current status and future direction. Information Systems Frontiers. 2014. Vol. 16, no. 3, p. 417-434.

[2] Mao, Capra, Harman, Jia, 2017. A survey of the use of crowdsourcing in software engineering. Journal of Systems and Software. 2017. Vol. 126, p. 57-84.

[3] Stack Exchange Inc, 2018. About - Stack Exchange. URL: https://stackexchange.com/about.

[4] Atwood, 2009. A Theory of Moderation - Stack Overflow Blog. URL: https://stackoverflow.blog/2009/05/18/a-theory-of-moderation/.

[5] Schenk, Lungu, 2013. Geo-Locating the Knowledge Transfer in Stack Overflow. In: Proceedings of the 2013 International Workshop on Social Software Engineering. Saint Petersburg, Russia: ACM. 2013. p. 2-5.

[6] Ahmed, Srivastava, 2017. Understanding and evaluating the behavior of technical users. A study of developer interaction at StackOverflow. Human-centric Computing and Information Sciences. 2017. Vol. 7, no. 1, p. 1-19.

[7] Morrison, Murphy-Hill, 2013. Is Programming Knowledge Related To Age? People.Engr.Ncsu.Edu. 2013. p. 3-6.

[8] Fayyad, Piatetsky-Shapiro, Smyth, 1996. From Data Mining to Knowledge Discovery in Databases. AI Magazine. 1996. Vol. 17, p. 37-54.

[9] United Nations Department of Social Affairs, Population Division, 2017. World Population Prospects: The 2017 Revision.

[10] United Nations International Telecommunication Union, 2017. ITU | 2017 Global ICT Development Index. URL: http://www.itu.int/net4/ITU-D/idi/2017/#idi2017rank-tab.

[11] IMD World Competitiveness Centre, 2017. IMD World Digital Competitiveness Ranking 2017. URL: https://www.imd.org/globalassets/wcc/docs/release2017/world_digital_competitiveness_yearbook_2017.pdf.