<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Patterns of User Participation and Contribution in Global Crowdsourcing: A Data Mining Study of Stack Overflow</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Himesha Wijekoon</string-name>
          <email>wijekoon@pef.czu.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtěch Merunka</string-name>
          <email>merunka@pef.czu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Czech Technical University in Prague</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Czech University of Life Sciences Prague</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Among the many popular crowdsourcing platforms, the Question &amp; Answer website Stack Overflow, part of the Stack Exchange Network, is used daily by millions of software professionals to share knowledge globally. Stack Overflow data can therefore reveal important patterns in global crowdsourcing that are beneficial for the software industry. The aim of this study was to perform data mining on Stack Overflow data to discover some of these patterns. The research focused on analyzing the global distribution and contribution of users. Big data analytic techniques were applied using Apache Spark with the Python language. Oracle Data Visualization Desktop and the scikit-learn Python library were used for visualization. The results show that although the majority of users are from the USA and India, the average contribution is higher in European countries.</p>
      </abstract>
      <kwd-group>
        <kwd>Stack Overflow</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Big Data Analytics</kwd>
        <kwd>Crowdsourcing</kwd>
        <kwd>Software Engineering</kwd>
        <kwd>User Participation</kwd>
        <kwd>User Contribution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        available for the public for viewing. It also uses a comprehensive reputation management system;
as Atwood stated in a 2009 blog post, he believes in community moderation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Schenk and Lungu found in their 2013 study that contribution is highest in Europe and North
America, followed by Asia, which is mostly represented by India. Oceania contributes less than Asia
but more than South America and Africa combined. However, their research is based on the transfer
of knowledge, specifically on which country raises a question and which country answers it [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
It would therefore also be beneficial to perform a comprehensive study of the user distribution across the
globe with respect to contribution and reputation.
      </p>
      <p>
        The reputation measure can also be manipulated by users who exploit the gamification
mechanisms of Stack Overflow [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To address this issue, this research also uses the number of questions and
answers posted to represent contribution.
      </p>
      <p>
        When these measurements are compared across users, the figures need to be normalized
by each user's length of membership. For example, Morrison and Murphy-Hill used
Reputation per Month rather than raw Reputation as the measurement in their research [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Similarly, the number of answers posted per month and the number of questions posted per month can be used
in this research in addition to the reputation.
      </p>
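      <p>The per-month normalization described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names, the 30.44-day average month, the one-month floor and the use of the dump publication date (8 December 2017) as the reference date are all assumptions.</p>
      <preformat>
```python
from datetime import date

AVG_DAYS_PER_MONTH = 30.44  # assumed average month length (365.25 / 12, rounded)
REFERENCE_DATE = date(2017, 12, 8)  # assumed reference date: publication of the data dump


def months_of_membership(joined, reference=REFERENCE_DATE):
    """Length of membership in months, never less than one month."""
    days = (reference - joined).days
    return max(days / AVG_DAYS_PER_MONTH, 1.0)


def per_month(total, joined, reference=REFERENCE_DATE):
    """Normalize a cumulative figure (reputation, answers, questions) by membership length."""
    return total / months_of_membership(joined, reference)


# Hypothetical user: joined two years before the reference date with 2400 reputation.
reputation_per_month = per_month(2400, date(2015, 12, 8))
```
      </preformat>
      <p>The same helper can be applied to the number of answers or questions posted, so that long-standing and recent members are compared on the same scale.</p>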
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        The methodology of this research follows the phases specified by Fayyad et al. for
discovering knowledge in databases [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: selection, pre-processing, transformation, data mining, and interpretation/evaluation.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Selection</title>
      <p>The public data dump of all user-contributed content on the Stack Exchange Network, shared on the
Internet Archive, is the main data source for this research. The following files from the Stack Exchange
data dump published on 8 December 2017 were downloaded from the Internet
Archive for this study:</p>
      <list list-type="bullet">
        <list-item><p>Users.xml (2.36 GB)</p></list-item>
        <list-item><p>Posts.xml (56.3 GB)</p></list-item>
      </list>
      <p>The structure of these XML files was then studied to select the most appropriate data items. The
Entity Relationship Diagram of the schema is shown in Figure 1.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Pre-processing</title>
      <p>Data mining tasks could not be performed directly on the downloaded raw XML files because of their
large size, their flat structure and the fact that they cannot be split into independent parts. Therefore, the raw data had to
be loaded into a format on which Apache Spark can apply its in-memory processing and
parallelization power. A MySQL relational database was used for this purpose. A Python script was
written for each raw XML file and executed with the spark-submit script located in
Spark’s bin directory. Table 1 shows the number of records loaded into the respective MySQL tables.</p>
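      <p>One way to read such large, flat XML files without holding them in memory is to stream them element by element. The sketch below is only an illustration under stated assumptions: it relies on the public dump layout of one "row" element per record, the selected attribute names and the batch size are illustrative choices, and the database write itself is left out. It is not the loading script used in the study.</p>
      <preformat>
```python
import xml.etree.ElementTree as ET

# Attributes assumed to be present on each "row" element of Users.xml;
# this selection is illustrative, not the schema used in the study.
FIELDS = ("Id", "Reputation", "CreationDate", "Location")


def iter_rows(source, fields=FIELDS):
    """Yield one dict per row element while streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield {name: elem.get(name) for name in fields}
            elem.clear()  # drop the attributes of the row once it has been consumed


def batches(rows, size=10000):
    """Group rows into lists of at most `size` items, e.g. for bulk inserts."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```
      </preformat>
      <p>Each batch could then be handed to a bulk insert into the target table; missing attributes, such as an unspecified location, come back as None.</p>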
    </sec>
    <sec id="sec-6">
      <title>3.3. Transformation</title>
      <p>Some of the data had to be converted into appropriate forms before the data mining
activities could start. These conversions are described below.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3.1. Extraction of Country Names</title>
      <p>Since the names of countries and locations are specified in different formats in the raw data, a dedicated
Python program was implemented to extract the country name accurately with the help of a free and
open-source Python library named geodict (https://github.com/petewarden/geodict). In the end, the
locations of 1,172,495 users were identified and saved in a new database table. This is 15.83% of all
users and 80.24% of the users who have specified a location.</p>
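      <p>The idea of mapping free-text locations to country names can be illustrated with a very small lookup. This sketch does not use geodict and is not the program written for the study: the alias table, the function names and the right-to-left matching rule are hypothetical simplifications.</p>
      <preformat>
```python
# Tiny illustrative alias table; a real gazetteer would be far larger.
COUNTRY_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "india": "India",
    "germany": "Germany",
    "czech republic": "Czech Republic",
}


def extract_country(location):
    """Return a canonical country name for a free-text location, or None if unknown."""
    if not location:
        return None
    # Check the comma-separated parts from right to left: "City, Region, Country".
    parts = [part.strip().lower() for part in location.split(",")]
    for part in reversed(parts):
        if part in COUNTRY_ALIASES:
            return COUNTRY_ALIASES[part]
    return None


def coverage(identified, total):
    """Share of `total` covered by `identified`, as a percentage with two decimals."""
    return round(100.0 * identified / total, 2)
```
      </preformat>
      <p>Locations that match no alias are left unresolved rather than guessed, which mirrors the fact that only part of the users with a specified location could be assigned to a country.</p>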
    </sec>
    <sec id="sec-8">
      <title>3.3.2. Aggregation</title>
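      <table-wrap>
        <label>Table 1</label>
        <caption><p>Number of records loaded into the MySQL tables</p></caption>
        <table>
          <thead>
            <tr><th>Source file</th><th>Number of Records</th></tr>
          </thead>
          <tbody>
            <tr><td>Users.xml</td><td>7,408,959</td></tr>
            <tr><td>Posts.xml</td><td>38,360,000</td></tr>
          </tbody>
        </table>
      </table-wrap>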
      <p>Since the tables hold millions of records, Spark with its Python API was chosen, leveraging its
partition-aware loading feature. The groupBy function and built-in aggregate functions such as count
and avg were used. All the aggregated data required for the research were generated by
Python scripts executed on the Spark engine.</p>
    </sec>
    <sec>
      <title>3.3.3. Merging</title>
      <p>The aggregated data sometimes had to be merged prior to data mining. Spark’s ability to
join RDDs was used for this purpose.</p>
    </sec>
    <sec id="sec-9">
      <title>3.4. Data Mining</title>
      <p>For the numerical data, descriptive summary statistics were used to understand the distribution of
the data, mainly through the Spark function describe.</p>
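      <p>The aggregation, merging and summary-statistics steps can be illustrated on a small scale in plain Python. The sketch below is a stand-in for the Spark operations named in the text (groupBy with count and avg, a join, and describe), not the Spark code used in the study; all function and field names are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_by_country(users):
    """Group (country, reputation) pairs and compute user count and average reputation."""
    groups = defaultdict(list)
    for country, reputation in users:
        groups[country].append(reputation)
    return {c: {"users": len(v), "avg_reputation": mean(v)} for c, v in groups.items()}


def inner_join(left, right):
    """Merge two per-country dicts, keeping only countries present in both."""
    return {k: {**left[k], **right[k]} for k in left if k in right}


def describe(values):
    """Count, mean, sample standard deviation, minimum and maximum of a numeric list."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }
```
      </preformat>
      <p>On the full tables the same logic would run in parallel across partitions, which is the reason a distributed engine was needed for data of this size.</p>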
    </sec>
    <sec id="sec-10">
      <title>3.5. Interpretation/Evaluation</title>
      <p>The descriptive statistics and the graphs generated with the Oracle Data Visualization Desktop (ODVD) tool and
Matplotlib were used to interpret and evaluate the results.</p>
    </sec>
    <sec id="sec-11">
      <title>4. Results and Discussion</title>
      <p>Country names were identified for 1,172,495 Stack Overflow users (15.83% of all users), and 205 distinct
countries were found in the subset under analysis. The top 50 countries, sorted in descending order of
user count, are presented in Table 2.</p>
      <p>As observed, the United States and India have by far the highest numbers of users, more than
200,000 each. Collectively they represent 40% of the total users. They are categorized as
Cluster 5 countries. Cluster 4 countries have between 30,000 and 75,000 users; the UK, Germany, Canada, France
and China belong to this category. Even though China has the world’s largest population, its
participation does not match its population. This could be due to language issues, and the same may hold
for the Russian Federation. Another notable observation is that only 78 countries have more than 1000
identified users. Cluster 2 represents countries with more than 3000 users, and only some of them are in the top
50 list. Cluster 1 represents countries with fewer than 3000 users, none of which are included in Table
2.</p>
      <p>
        The above data were merged with world population data for the year 2015 published by the United Nations
Population Division [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The number of users per 1000 capita was then calculated for each country for
further analysis.
      </p>
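      <p>The users per 1000 capita figure is the identified user count of a country divided by its population and multiplied by 1000. A minimal sketch of this calculation and of the resulting ranking is given below; the function names are hypothetical, and countries without a population entry are simply skipped, which is an assumption about how unmatched names are handled.</p>
      <preformat>
```python
def users_per_1000(user_counts, populations):
    """Users per 1000 inhabitants for every country present in both inputs."""
    return {
        country: 1000.0 * users / populations[country]
        for country, users in user_counts.items()
        if country in populations and populations[country] > 0
    }


def top_n(scores, n=50):
    """Countries with the highest scores, in descending order."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]
```
      </preformat>
      <p>Because the denominator is the whole population, small countries with a modest number of users can rank above large countries with many users.</p>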
      <p>The map in Figure 2 displays how the number of users per 1000 capita varies across the globe, and Table
3 presents the top 50 countries by users per 1000 capita in descending order. Compared with the
user count ranking, the main observation is that the United States falls to 17th position while India does not even
reach the top 50. The UK, however, is consistent across both rankings and is the most populous country
among those with the highest participation. Iceland ranks first even though it does not have
enough users to be listed in the first list. The main conclusion is that most
European countries generally have higher participation per capita. Countries such as New Zealand,
Singapore, Israel, Canada, and Australia are also among the highly participating countries.</p>
      <p>To compare the contribution levels of average users across countries, user contribution was analyzed in terms of
average reputation per user, average number of questions posted per user and average number of
answers posted per user for each country. Table 4 summarizes the rankings
of the countries that fall into the top 20 of any category and have more than 500 users, along with the Russian
Federation and India because of their significance. Cells with a blue background show ranks within the
top 20, while cells with a pink background show ranks greater than 20 for the respective category.</p>
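      <p>The construction of such a ranking table can be sketched as follows. The thresholds of 500 users and the top 20 come from the text; the function names, the field names and the choice to keep every country that reaches the top of at least one metric are assumptions made for illustration.</p>
      <preformat>
```python
def rank_by(stats, metric, min_users=500):
    """Rank countries with more than `min_users` users by `metric` (1 = highest value)."""
    eligible = {c: s for c, s in stats.items() if s["users"] > min_users}
    ordered = sorted(eligible, key=lambda c: eligible[c][metric], reverse=True)
    return {country: position for position, country in enumerate(ordered, start=1)}


def ranking_table(stats, metrics, top=20, min_users=500):
    """Ranks in every metric for countries that reach the top `top` in at least one."""
    ranks = {m: rank_by(stats, m, min_users) for m in metrics}
    countries = {c for m in metrics for c, r in ranks[m].items() if top >= r}
    return {c: {m: ranks[m][c] for m in metrics} for c in sorted(countries)}
```
      </preformat>
      <p>A country can therefore appear in the table with a rank above the cut-off in one metric as long as it is within the cut-off in another, which corresponds to the two cell colors described above.</p>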
      <p>The reputation and answer rankings both relate to knowledge sharing. Switzerland is the
top country in both rankings, closely followed by the UK and Germany. Sweden, Austria,
and Israel are among the top 10 of both rankings, together with most other European countries. New Zealand,
Australia and Canada also contribute substantially.</p>
      <p>However, India and the Russian Federation contribute less despite their large populations.
Another important observation is that most countries with high reputation and high answer rates
also ask many questions. Italy, Ireland, Latvia, and Lebanon, however, mainly ask questions
and provide few answers. Meanwhile, Finland, the Netherlands and Bulgaria have high reputation
and answering rates, but they do not ask many questions.</p>
      <p>
        In both user participation and contribution, European countries along with Israel, Australia, Canada,
and New Zealand stand out from the rest of the world. These findings were cross-checked
against the ICT Development Index of countries provided by the United Nations International Telecommunication Union [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The major
difference found was the underperformance in crowdsourcing activity of countries such as South Korea
and Japan, which have good global ICT rankings. This is further supported by comparing the
findings with the IMD World Digital Competitiveness Ranking 2017 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Although this must be
analyzed further, one reason may be the language barrier. The presence of popular alternatives
to Stack Overflow may be another reason. The under-representation of China and the Russian Federation
may have the same causes.
      </p>
    </sec>
    <sec id="sec-12">
      <title>5. Conclusion</title>
      <p>Stack Overflow data reveal important patterns in global crowdsourcing that are beneficial for the software
industry. The results on global user distribution and contribution clearly show that the majority of
users are from the USA and India. However, in both participation and contribution, European
countries along with Australia, Canada and New Zealand rank higher. The low
rankings of Japan, South Korea, the Russian Federation, Brazil and China are also notable. Since these countries represent
a large portion of the world population, further studies should be carried out to find the factors behind this
phenomenon.</p>
    </sec>
    <sec id="sec-13">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <year>2014</year>
          ,
          <article-title>Evaluation on crowdsourcing research: Current status and future direction</article-title>
          .
          <source>Information Systems Frontiers</source>
          .
          <year>2014</year>
          . Vol.
          <volume>16</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>417</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Capra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>A survey of the use of crowdsourcing in software engineering</article-title>
          .
          <source>Journal of Systems and Software</source>
          .
          <year>2017</year>
          . Vol.
          <volume>126</volume>
          , p.
          <fpage>57</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>Stack Exchange Inc</collab>
          ,
          <year>2018</year>
          . About - Stack Exchange. URL: https://stackexchange.com/about.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Atwood</surname>
          </string-name>
          ,
          <source>A Theory of Moderation - Stack Overflow Blog</source>
          ,
          <year>2009</year>
          . URL: https://stackoverflow.blog/2009/05/18/a-theory-of-moderation/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lungu</surname>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Geo-Locating the Knowledge Transfer in Stack Overflow</article-title>
          .
          <source>In: Proceedings of the 2013 International Workshop on Social Software Engineering. Saint Petersburg, Russia: ACM</source>
          .
          <year>2013</year>
          . p.
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Understanding and evaluating the behavior of technical users. A study of developer interaction at StackOverflow</article-title>
          .
          <source>Human-centric Computing and Information Sciences</source>
          .
          <year>2017</year>
          . Vol.
          <volume>7</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Murphy-Hill</surname>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Is Programming Knowledge Related To Age?</article-title>
          <source>People.Engr.Ncsu.Edu</source>
          ,
          <year>2013</year>
          . p.
          <fpage>3</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <year>1996</year>
          .
          <article-title>From Data Mining to Knowledge Discovery in Databases</article-title>
          .
          <source>AI Magazine</source>
          .
          <year>1996</year>
          . Vol.
          <volume>17</volume>
          , p.
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>United Nations Department of Social Affairs, Population Division</collab>
          ,
          <year>2017</year>
          ,
          <source>World Population Prospects: The 2017 Revision</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <collab>United Nations International Telecommunication Union</collab>
          ,
          <year>2017</year>
          . ITU | 2017
          <source>Global ICT Development Index</source>
          ,
          <year>2017</year>
          . URL: http://www.itu.int/net4/ITU-D/idi/2017/#idi2017rank-tab.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <collab>IMD World Competitiveness Centre</collab>
          ,
          <year>2017</year>
          ,
          <source>IMD World Digital Competitiveness Ranking 2017</source>
          , URL: https://www.imd.org/globalassets/wcc/docs/release2017/world_digital_competitiveness_yearbook_2017.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>