=Paper= {{Paper |id=Vol-3071/paper11 |storemode=property |title=Predictive Modelling of Socio-Technical Health in Evolving Software Packaging Ecosystems |pdfUrl=https://ceur-ws.org/Vol-3071/paper11.pdf |volume=Vol-3071 |authors=Pooya Rostami Mazrae |dblpUrl=https://dblp.org/rec/conf/benevol/Mazrae21 }} ==Predictive Modelling of Socio-Technical Health in Evolving Software Packaging Ecosystems== https://ceur-ws.org/Vol-3071/paper11.pdf
Predictive Modelling of Socio-Technical Health in
Evolving Software Packaging Ecosystems
Pooya Rostami Mazrae
University of Mons, Mons, Belgium


                                         Abstract
                                         Software health plays an important role in collaborative software development. My PhD research aims
                                         to analyse how the evolution of socio-technical characteristics in large open source software ecosystems
                                         affects the health of these ecosystems and their building blocks. To capture as many dimensions of
                                         software health as possible, I aim to combine the human (social) and technical aspects of collaborative
                                         software development activity. These dimensions will be integrated into computational machine learning
                                         models and recommendation models to enable the prediction of change trends in software health, and
                                         to improve future health based on historical analysis. I will focus primarily on the health of evolving
                                         software packaging ecosystems, as they are known to have large technical dependency networks, as
                                         well as strongly connected social collaboration networks. This extended abstract presents the research
                                         questions I will explore to reach the aforementioned research goals.

                                         Keywords
                                         Mining Software Repositories, Software Health, Open Source Software, Project Abandonment, Software
                                         Ecosystem, Predictive Modeling, Machine Learning




1. Introduction
The importance of open source software (OSS) development has increased significantly throughout
the last years, covering almost every application domain [1]. Today, over 80% of the software
in technological products or services is OSS, and this trend is still growing¹. In addition to
this, software ecosystems play an ever increasing role in collaborative software development
practices. A software ecosystem can be defined as a collection of software projects which are
developed and which co-evolve together in the same environment [2]. As software projects are
not usually developed in isolation, it is important to take into account the ecosystem of which
they are part to understand the bigger picture.
   Software package distributions can be considered as a specific kind of software ecosystem.
Nearly every popular programming language is accompanied by one or more package managers.
These package managers can be used by software developers to easily install reusable software
libraries. These so-called software packaging ecosystems contain a large number of package

BENEVOL’21: The 20th Belgium-Netherlands Software Evolution Workshop, December 07–08, 2021, ’s-Hertogenbosch
(virtual), NL
Email: pooya.rostamimazrae@umons.ac.be (P. R. Mazrae)
Website: https://pooya-rostami.github.io/ (P. R. Mazrae)
ORCID: 0000-0002-4859-1546 (P. R. Mazrae)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073




¹ https://www.linuxfoundation.org/blog/chaoss-project-creates-tools-to-analyze-software-development-and-measure-open-source-comm
releases, accompanied by their distributed version repositories. These repositories are updated
regularly and have many technical and social interdependencies, forming huge socio-technical
package dependency networks.
   The challenges related to the evolution of software packaging ecosystems can be seen from
two dimensions: the social dimension that focuses on problems related to persons who are
contributing to, and interacting with (parts of) the ecosystem, and the technical dimension that
addresses problems related to the technical artefacts (such as the source code, tests, documenta-
tion, or any other artefacts) being produced or maintained. Given that both dimensions cannot
be seen in isolation, socio-technical challenges will involve a combination of technical and social
issues.
   Health issues in the social dimension are manifold. For example, the so-called truck factor
(TF) aims to measure the risk that a project stops being maintained because too many of its
core developers abandon it (“run over by a truck”) [3, 4, 5], as well as the risk of having
heroes who are the only core developers who understand and know certain critical parts of a
system [6]. Knowing the reasons why developers are leaving [7] or taking a break [4] might
help in mitigating these risks, and finding good replacements for abandoners might further
reduce these risks [5].
   Health issues in the technical dimension have to do with how packages within an ecosystem
are interrelated through transitive dependencies that may not be updated when new package
releases become available, thereby affecting ecosystem health [8]. As an example, to alleviate
the problem of dependency hell, approaches like semantic versioning or using dependency
graphs [9] have been proposed. Nevertheless, the degree of adherence to semantic versioning
can differ significantly depending on the considered package distribution [10].
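To make the semantic versioning convention concrete, the following minimal Python sketch classifies an update between two releases and checks whether the convention promises backward compatibility. It assumes plain MAJOR.MINOR.PATCH strings without pre-release or build tags; the function names are illustrative and not taken from any cited tool.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def release_type(old: str, new: str) -> str:
    """Classify the update from `old` to `new` as 'major', 'minor' or 'patch'."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        raise ValueError(f"{new} is not newer than {old}")
    if new_parts[0] != old_parts[0]:
        return "major"
    if new_parts[1] != old_parts[1]:
        return "minor"
    return "patch"


def promises_compatibility(old: str, new: str) -> bool:
    """True if semantic versioning promises that `new` does not break `old`."""
    # Releases with major version 0 carry no compatibility promise.
    return parse_version(old)[0] >= 1 and release_type(old, new) != "major"
```

Whether a package distribution actually honours this promise is an empirical question, which is exactly what adherence studies such as [10] examine.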
   An important requirement for conducting empirical studies in this domain is having access
to a data source containing recent, reliable and sufficiently complete information about large
software ecosystems that contain evidence of socio-technical interaction and collaboration
patterns between their components (including contributors, projects and packages).
   Many studies gather historical project data from GitHub considering mainly popularity
metrics (e.g., number of stars or number of downloads) [6, 11, 3, 5, 7, 12, 1, 13]. While this can
be one of the factors to select projects for analysis, other factors need to be considered as well,
especially if not all relevant socio-technical information is recorded on GitHub. For example,
Avelino et al. found examples of contributions to the Linux kernel in which the entire release
development was pushed to Git in a single squashed commit, thereby attributing many individual
contributions to a single developer [14].
   Many empirical studies also concentrate on only one dimension of software health. To study
both the social and technical dimension, we will concentrate on software packaging ecosystems.
They contain metadata about the technical interdependencies between all packages. Also, the
full development history of these packages is available, from which the social collaboration
and interaction between contributors of those packages can be retrieved. By combining
these social and technical aspects in a single overarching socio-technical dependency network,
I will study the evolution of health problems at a new level and propose new ways to improve
the health of software packaging ecosystems. Work on this matter has already started. For
example, the Linux Foundation working group CHAOSS is creating metrics and associated tools
to help define and measure OSS project and community health [15].
   My PhD research therefore aims to empirically study and reduce socio-technical health
issues in evolving OSS packaging ecosystems, by determining the important features of, and
connections between, different packages that play an important role in health issues. Based on
this, I aim to provide recommendation models and prediction models to reduce these health
issues. Overall, my proposed thesis statement is: Machine learning techniques for predictive
modeling of socio-technical aspects can be used to gain information about, and improve the health
of, evolving software ecosystems. To validate this statement, I will explore the following research
questions:
    • RQ1: How to attract newcomers to projects within a software packaging ecosystem?
    • RQ2: How does the socio-technical activity and abandonment of project contributors affect
      project and ecosystem health?
    • RQ3: How to predict or recommend replacement of abandoning developers by analysing the
      socio-technical dependency network?
    • RQ4: How to rely on social media activity (e.g., Reddit, StackOverflow, Twitter) to improve
      prediction models?


2. Background
OSS research has studied a wide range of aspects, often focusing on only the technical or the
social dimension of software development. Research that combines the social and technical
aspects often leads to more enlightening results. Below, we discuss some of these works and
how they contribute to the domain of OSS health research.
   Conway’s law states that organizations tend to design systems that mimic the communication
structures of these organizations. Cataldo et al. [16] introduced the notion of socio-technical
congruence to reflect the close connection between the technical structure of a software project
and the social structure of the project members. Syeed et al. [17] studied such socio-technical
congruence in the context of the Ruby packaging ecosystem. Golzadeh [18] provided an
initial exploration of the socio-technical congruence in the Cargo packaging ecosystem. We
hypothesise that any socio-technical study of software ecosystem health is likely to be affected
by the socio-technical congruence phenomenon.
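As an illustration of how socio-technical congruence could be quantified in a packaging ecosystem, the sketch below computes the fraction of required coordination that actually takes place. It is a simplification in the spirit of Cataldo et al. [16], not their exact formulation: we assume that a dependency between two packages requires every contributor of one to coordinate with every contributor of the other, and that observed communication is given as a list of developer pairs. All names are hypothetical.

```python
from itertools import product


def coordination_requirements(contributors, dependencies):
    """Return the developer pairs that should coordinate.

    contributors: dict mapping a package name to the set of its developers.
    dependencies: iterable of (package, dependency) pairs.
    Assumption: a dependency requires all developers on both sides to coordinate.
    """
    required = set()
    for source, target in dependencies:
        for a, b in product(contributors.get(source, ()), contributors.get(target, ())):
            if a != b:
                required.add(frozenset((a, b)))
    return required


def congruence(required, actual):
    """Fraction of required coordination pairs that were actually observed."""
    if not required:
        return 1.0  # assumption: no requirements counts as fully congruent
    observed = {frozenset(pair) for pair in actual}
    return len(required & observed) / len(required)


contributors = {"app": {"alice"}, "lib": {"bob", "carol"}}
required = coordination_requirements(contributors, [("app", "lib")])
score = congruence(required, [("bob", "alice")])  # alice-carol never communicated
```

In this toy example only one of the two required pairs communicated, so the congruence score is 0.5.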
   Ricca et al. [6] tried to find the heroes in FLOSS projects, by implementing a tool to compute
the truck factors and identify the heroes. The proposed tool was based on the work of Zazworka
et al. [19]. Since finding a truck factor plays an important role in software health, Ferreira
et al. [3] compared 3 different algorithms for computing it: AVL [20], RIG [21] and CST [22].
Based on an evaluation on 35 open source projects they concluded that AVL is the most accurate
in predicting the truck factor and predicting developers responsible for that truck factor. In
an extension of their study, they found that one reason for poor predictions is the failure to take into
account social interactions such as code review, documentation, tests and supporting tools [12].
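The intuition behind truck factor estimation can be sketched with a simple greedy heuristic: repeatedly remove the developer who covers the most files until more than half of the files have no remaining author. This is only loosely inspired by the greedy strategy of AVL [20]; the real algorithm derives authorship from a degree-of-authorship measure, whereas here the author sets are assumed to be given, and the 50% threshold is exposed as a parameter.

```python
from collections import Counter


def truck_factor(file_authors: dict[str, set[str]], threshold: float = 0.5) -> int:
    """Greedy truck factor estimate from a file -> authors mapping."""
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    if total == 0:
        return 0
    removed = 0
    while True:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if orphaned / total > threshold:
            return removed
        coverage = Counter(dev for authors in remaining.values() for dev in authors)
        if not coverage:
            return removed  # nobody left to remove
        # Remove the developer covering the most files (ties: alphabetical order).
        top = max(sorted(coverage), key=coverage.__getitem__)
        for authors in remaining.values():
            authors.discard(top)
        removed += 1


files = {
    "a.py": {"alice"},
    "b.py": {"alice"},
    "c.py": {"alice", "bob"},
    "d.py": {"bob"},
}
estimate = truck_factor(files)
```

For the example mapping, removing alice orphans exactly half of the files, which is not yet above the threshold, so bob must be removed as well and the estimate is 2.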
   Researchers have also studied the reason for project failure. Coelho et al. [11] studied 5,000
GitHub repositories with the intent of finding the maintenance challenges. They identified nine
reasons why open source projects fail. In descending order of frequency, these are: usurped by
competitors, obsolescence, lack of time, lack of interest, outdated technologies, low maintainability,
conflict among developers, legal problems and acquisition. They also proposed a list of important
open source maintenance practices, including: the presence of a README file; the presence of
a separate project license file; the availability of a dedicated website to promote the project,
including examples and documentation; the use of a CI service; the presence of a specific file
with guidelines for repository contributors; the presence of an issue template and a pull request
template.
   In a study based on an analysis of 9,977 open-source npm libraries, Qiu et al. [23] showed
that, for contributors that want to engage in a new OSS project, features like GitHub stars,
recent commits, comprehensive README files and having issue or pull request templates play an
important role. While projects with a higher number of stars attract more first-time GitHub
contributors, the presence of contributing guidelines has a significantly negative effect, mostly
because it makes newcomers less comfortable about joining the project.
   Multiple OSS development studies aimed to determine why contributors disengage from open
source. This is important to know because around 80% of OSS project failures are related to issues
with contributor turnover [24]. Miller et al. [7] conducted a study to determine the reasons why
people give up working on OSS projects. They considered three types of reasons: occupational,
social and technical. The occupational reasons, in descending order of importance, are: having
a new job that doesn’t support OSS; change of role/project; left job where they contributed
to OSS; no time because of new job; no time because of existing job; used OSS in school but
new job doesn’t support OSS; too much code at work. The social reasons, in descending order
of importance, are: loss of interest; no time due to personal reasons; lack of peer support; no
time (unspecified reason). The technical reasons, in descending order of importance, are: issues
with GitHub or industry; individually moved to private repositories; changed platform; feature
complete project. A possible extension to this work could be to study whether the same reasons
apply for collaborating on packages in a software packaging ecosystem.
   To determine the state of OSS project developers, Iaffaldano et al. [4] considered three states:
Alive, Sleep and Dead. Based on interviews with different developers, they determined that
sleeping developers are those who do not contribute code but still show interest in the project
in other ways such as answering emails and participating in discussions. In contrast, dead
developers are those who not only have stopped providing code contributions for some time,
but also do not participate in any other community activity. Calefato et al. [13] further studied
this matter and observed that breaks are rather common, in that core members take frequent
breaks of varying length and type. They observed that all developers took at least one break,
97% of them transitioned to non-coding and 89% to inactive. They also analyzed the probability
of the transitions to/from the inactivity state and observed that core developers are more likely
to remain in the projects. However, if they transition to gone, they are less likely to come back
(54% probability on average). Extending this study to software package ecosystems could be
useful to understand the migration of developers between different software packages, as this
may constitute a valid reason for contributors to become inactive or change their state in a
given software package.
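The three developer states described above can be operationalised in a few lines. The sketch below is an assumption-laden illustration rather than the procedure of [4] or [13]: it treats a developer as alive if they committed recently, as sleeping if they only showed other activity recently (e.g., comments or emails), and as dead otherwise. The 180-day inactivity window is a hypothetical parameter.

```python
from datetime import date, timedelta
from typing import Optional


def developer_state(
    last_commit: Optional[date],
    last_other_activity: Optional[date],
    today: date,
    inactivity: timedelta = timedelta(days=180),  # hypothetical threshold
) -> str:
    """Classify a developer as 'alive', 'sleeping' or 'dead'."""
    codes = last_commit is not None and today - last_commit <= inactivity
    engaged = (
        last_other_activity is not None
        and today - last_other_activity <= inactivity
    )
    if codes:
        return "alive"
    if engaged:
        return "sleeping"  # no recent code, but still visible in the community
    return "dead"
```

Applied per package, such a classification would make it possible to observe a developer who is dead in one package while alive in another, which is the kind of migration signal discussed above.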
   Avelino et al. [5] studied abandonment and survival of OSS projects. They concentrated on
the truck factor developer detachments (TFDDs) and the replacements of such abandoning
developers. They found that 59% of the TFDDs happened in the first two years of project
development, but 71% of the projects with TFDDs now have between 4 and 7 years of development.
They reported that recovering from a TFDD is not uncommon, in that about 41% of the
projects survive. In 86% of these cases they do so by replacing the TF developer by a single new
contributor. This suggests that developers migrate between different projects of the same
ecosystem, and these migrations can be a good source of socio-technical health information,
such as the relations of developers with each other and with the technical aspects of projects. The
information extracted from these migrations can later be used in our predictive models of
health.
   Another aspect of software health is predicting the project maintenance activity, in order to
determine whether the project is going to be deprecated. Coelho et al. [1] gathered a dataset of
6,785 most starred GitHub projects with more than two years of software development data
available. They created a machine learning classifier based on the Random Forest algorithm
combining social and technical project data. After training, the model achieved 86% precision
in predicting project abandonment. The results were confirmed with 129 developers. The top
features of the model were commits, max days without commits in months 22 to 24 of the project,
max days without commits in months 10 to 12, max contributions by developer in months 16 to
18, and closed issues in months 1 to 3 of the last two years of the project. Being able to predict
project abandonment can help us predict the overall health of the encompassing software
packaging ecosystem. At the ecosystem level, this will help maintainers to predict possible
health issues in the projects they depend upon, or even warn them that their own project is on
the verge of becoming deprecated, so that they can prepare for the future.
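Features such as "max days without commits" in a given period can be derived directly from commit timestamps. The following sketch shows one plausible way to compute such a feature; the exact definition used by Coelho et al. [1] may differ, and in particular we assume here that the window boundaries themselves count as gap endpoints.

```python
from datetime import date


def max_days_without_commits(commit_dates, window_start: date, window_end: date) -> int:
    """Longest gap in days between consecutive commits inside a time window.

    Assumption: the start and end of the window count as gap boundaries,
    so a window without any commit yields the full window length.
    """
    inside = sorted(d for d in commit_dates if window_start <= d <= window_end)
    points = [window_start, *inside, window_end]
    return max((later - earlier).days for earlier, later in zip(points, points[1:]))


commits = [date(2021, 1, 10), date(2021, 2, 20), date(2021, 6, 1)]
longest_gap = max_days_without_commits(commits, date(2021, 1, 1), date(2021, 3, 31))
```

Computing this value over successive three-month windows would yield a feature vector of the kind listed above, which could then be fed to a classifier.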
   Decan et al. [25] proposed a probabilistic model and associated tool (called GAP) to predict
the risk of contributor abandonment, based on the previous commit activities of these contribu-
tors. The model was evaluated on GitHub repositories corresponding to development activity
for reusable software packages distributed through the Cargo package manager for the Rust
programming language. Studying migration patterns of developers within the ecosystem is one
of our goals and having a tool that predicts when a developer is going to stop contributing to a
project can help to signal and predict future developer migration patterns.


3. Data Gathering
This section presents the data gathering process to extract and prepare the socio-technical data
from evolving packaging ecosystems that will be required to respond to the Research Questions.
The first phase will consist of gathering all data required to construct the socio-technical package
dependency networks. This will be achieved by retrieving metadata through the APIs provided
by the package managers of the corresponding packaging ecosystems. Each package manager
(e.g., Cargo for Rust crates, npm for Node.js packages, and PyPI for Python packages) has its
own dedicated package registry and associated API. Examples of project-specific information
that can be retrieved in this way include the project’s homepage, repository link, owners and
maintainers, project classifier or category, project dependencies, version numbers and release
dates. To address RQ4 we will also gather the data related to the social media channels in
which the project or ecosystem contributors are involved (e.g., Reddit, StackOverflow, Twitter,
LinkedIn), through the dedicated APIs of these social media.
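As an example of the first phase, the sketch below extracts the fields mentioned above from a response of the PyPI JSON API (served at https://pypi.org/pypi/<package>/json). It operates on an already decoded response, so no network access is needed to try it. The repository detection is a heuristic of our own (the first project URL on GitHub or GitLab), and the summary layout is an assumption rather than a fixed schema; other registries such as crates.io or npm expose different fields.

```python
import re

# Leading distribution name of a requirement string such as "requests (>=2.0)".
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def pypi_metadata_url(package: str) -> str:
    """URL of the PyPI JSON API document for a package."""
    return f"https://pypi.org/pypi/{package}/json"


def summarise_pypi_metadata(payload: dict) -> dict:
    """Reduce a decoded PyPI JSON response to the fields needed for the network."""
    info = payload.get("info", {})
    project_urls = info.get("project_urls") or {}
    # Heuristic (assumption): the first URL on a known forge is the repository.
    repository = next(
        (url for url in project_urls.values()
         if "github.com" in url or "gitlab.com" in url),
        None,
    )
    dependencies = sorted(
        {match.group(0).lower()
         for requirement in info.get("requires_dist") or []
         if (match := _NAME.match(requirement))}
    )
    # Earliest upload time per version; versions without files are skipped.
    release_dates = {
        version: min(f["upload_time"] for f in files)
        for version, files in payload.get("releases", {}).items()
        if files
    }
    return {
        "name": info.get("name"),
        "homepage": info.get("home_page"),
        "repository": repository,
        "dependencies": dependencies,
        "release_dates": release_dates,
    }
```

A payload would typically be obtained by fetching `pypi_metadata_url("some-package")` and decoding the JSON body before passing it to `summarise_pypi_metadata`.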
   The second phase of the extraction process consists of retrieving more specific socio-technical
information from the development repositories linked to each package. These repositories
are typically hosted on social coding platforms such as GitHub. We will use the API of the
corresponding social coding platform to retrieve relevant information such as issues, commits,
pull requests, code reviews, comments, tags and releases, number of stars, forks, downloads,
and so on. Another way to gather this information is by using archives and research platforms
such as GHTorrent², Software Heritage³, and World of Code⁴. The main limitation of such
platforms is that they do not necessarily contain a complete and up-to-date archive of the data
of interest.
   The third phase will involve cleaning and transforming the collected socio-technical data
into a uniform dependency network structure that will enable the study of socio-technical issues in
an ecosystem-agnostic way. We will explore different graph structures to study these socio-
technical networks of projects and project contributors.
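One possible shape for such a uniform, ecosystem-agnostic structure is a graph with two kinds of edges: package-to-package dependencies and developer-to-package contributions. The class below is a minimal sketch under that assumption; its name and methods are hypothetical and it is only one of the graph structures that could be explored.

```python
from collections import defaultdict


class SocioTechnicalNetwork:
    """Packages linked by dependencies, developers linked to packages."""

    def __init__(self):
        self.depends_on = defaultdict(set)    # package -> direct dependencies
        self.contributors = defaultdict(set)  # package -> developers

    def add_dependency(self, package: str, dependency: str) -> None:
        self.depends_on[package].add(dependency)

    def add_contribution(self, developer: str, package: str) -> None:
        self.contributors[package].add(developer)

    def transitive_dependencies(self, package: str) -> set:
        """All packages reachable through dependency edges (cycle-safe)."""
        seen = set()
        stack = list(self.depends_on.get(package, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.depends_on.get(current, ()))
        return seen

    def upstream_developers(self, package: str) -> set:
        """Developers of any package that `package` transitively depends on."""
        developers = set()
        for dependency in self.transitive_dependencies(package):
            developers |= self.contributors.get(dependency, set())
        return developers


network = SocioTechnicalNetwork()
network.add_dependency("app", "lib")
network.add_dependency("lib", "core")
network.add_contribution("alice", "core")
network.add_contribution("bob", "app")
```

Here `network.upstream_developers("app")` contains alice only: her departure from `core` would affect `app` even though she never contributed to it directly.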


4. Research plan
As my PhD research project started in November 2021, I am currently
exploring the research domain of socio-technical health of evolving software packaging ecosystems.
During my research, I aim to gain a deeper understanding of the different health issues
in software packaging ecosystems and leverage this understanding to develop models to predict
the health of these ecosystems and recommend improvements to them. These models will be
validated following a mixed-methods research approach, combining quantitative analysis based
on software repository mining data with qualitative analysis based on surveys and interviews
with ecosystem contributors.
   I will start by concentrating on RQ1 by using project-specific socio-technical information to
determine the important factors that attract newcomers to participate in OSS projects. Later, by
using machine learning algorithms, I will develop prediction models of the overall attractiveness
of projects, and provide insights about the most important features that make the specific
ecosystem attractive to contributors.
   The next phase will focus on RQ2 and RQ3. The main purpose of RQ2 is to study the
social interaction between project contributors during technical activities and its effect on the
community that is working on the project. For RQ3, we will explore different graph structures
of the socio-technical networks of the ecosystem, in order to predict abandoning developers
and recommend replacements for them.
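To give a flavour of what a replacement recommender for RQ3 might look like, the sketch below ranks candidate developers by the Jaccard similarity between the set of packages they work on and the set of packages of the abandoning developer. This is a deliberately naive baseline that we assume purely for illustration; it is not the model the thesis will propose, and a realistic approach would also exploit the dependency and collaboration edges of the network.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_replacements(leaving: str, activity: dict[str, set[str]], top: int = 3):
    """Rank other developers by similarity of their packages to those of `leaving`.

    activity: dict mapping a developer to the set of packages they contribute to.
    Returns up to `top` (developer, score) pairs, best first; ties by name.
    """
    target = activity[leaving]
    scored = [
        (jaccard(target, packages), developer)
        for developer, packages in activity.items()
        if developer != leaving
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(developer, score) for score, developer in scored[:top]]


activity = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b"},
    "carol": {"c", "d"},
    "dave": {"x"},
}
ranking = rank_replacements("alice", activity)
```

In this toy data set bob is ranked first, since he already contributes to two of the three packages that alice maintains.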
   The third phase will focus on RQ4, by studying the social media activities of ecosystem contributors
and using such information as a new data source for improving software health prediction
and recommendation models.


Acknowledgments
This work is supported by Action de Recherche Concertée ARC-21/25 UMONS3 financée par le
Ministère de la Communauté française – Direction générale de l’Enseignement non obligatoire
et de la Recherche scientifique.

² https://ghtorrent.org/
³ https://www.softwareheritage.org/
⁴ https://worldofcode.org/


References
 [1] J. Coelho, M. T. Valente, L. Milen, L. L. Silva, Is this GitHub project maintained? Measuring
     the level of maintenance activity of open-source projects, Information and Software
     Technology 122 (2020). doi:10.1016/j.infsof.2020.106274.
 [2] K. Blincoe, F. Harrison, D. Damian, Ecosystems in GitHub and a method for ecosystem
     identification using reference coupling, in: Working Conference on Mining Software
     Repositories, IEEE, 2015, pp. 202–211. doi:10.1109/MSR.2015.26.
 [3] M. Ferreira, M. T. Valente, K. Ferreira, A comparison of three algorithms for computing
     truck factors, in: International Conference on Program Comprehension (ICPC), IEEE, 2017,
     pp. 207–217. doi:10.1109/ICPC.2017.35.
 [4] G. Iaffaldano, I. Steinmacher, F. Calefato, M. A. Gerosa, F. Lanubile, Why do developers take
     breaks from contributing to OSS projects? a preliminary analysis, in: 2nd International
     Workshop on Software Health, 2019, pp. 9–16. doi:10.1109/SoHeal.2019.00009.
 [5] G. Avelino, E. Constantinou, M. T. Valente, A. Serebrenik, On the abandonment and
     survival of open source projects: An empirical investigation, in: International Symposium
     on Empirical Software Engineering and Measurement (ESEM), IEEE, 2019, pp. 1–12.
     doi:10.1109/ESEM.2019.8870181.
 [6] F. Ricca, A. Marchetto, Are heroes common in floss projects?, in: ACM-IEEE International
     Symposium on Empirical Software Engineering and Measurement, 2010, pp. 1–4.
     doi:10.1145/1852786.1852856.
 [7] C. Miller, D. G. Widder, C. Kästner, B. Vasilescu, Why do people give up flossing? a study
     of contributor disengagement in open source, in: IFIP International Conference on Open
     Source Systems, Springer, 2019, pp. 116–129. doi:10.1007/978-3-030-20883-7_11.
 [8] A. Decan, T. Mens, P. Grosjean, An empirical comparison of dependency network evolution
     in seven software packaging ecosystems, Empirical Software Engineering 24 (2019) 381–416.
     doi:10.1007/s10664-017-9589-y.
 [9] G. Fan, C. Wang, R. Wu, X. Xiao, Q. Shi, C. Zhang, Escaping dependency hell: finding
     build dependency errors with the unified dependency graph, in: International Symposium
     on Software Testing and Analysis, 2020, pp. 463–474. doi:10.1145/3395363.3397388.
[10] A. Decan, T. Mens, What do package dependencies tell us about semantic versioning?,
     IEEE Transactions on Software Engineering (2019). doi:10.1109/TSE.2019.2918315.
[11] J. Coelho, M. T. Valente, Why modern open source projects fail, in: Joint meeting on
     foundations of software engineering, 2017, pp. 186–196. doi:10.1145/3106237.3106246.
[12] M. Ferreira, T. Mombach, M. T. Valente, K. Ferreira, Algorithms for estimating truck
     factors: a comparative study, Software Quality Journal 27 (2019) 1583–1617.
     doi:10.1007/s11219-019-09457-2.
[13] F. Calefato, M. A. Gerosa, G. Iaffaldano, F. Lanubile, I. Steinmacher, Will you come back to
     contribute? Investigating the inactivity of OSS core developers in GitHub, arXiv preprint
     arXiv:2103.04656 (2021). URL: https://arxiv.org/abs/2103.04656.
[14] G. Avelino, L. Passos, A. Hora, M. T. Valente, Measuring and analyzing code authorship
     in 1+118 open source projects, Science of Computer Programming 176 (2019) 14–32.
     doi:10.1016/j.scico.2019.03.001.
[15] S. Goggins, K. Lumbard, M. Germonprez, Open source community health: Analytical
     metrics and their corresponding narratives, in: 2021 IEEE/ACM 4th International Workshop
     on Software Health in Projects, Ecosystems and Communities (SoHeal), IEEE, 2021, pp.
     25–33.
[16] M. Cataldo, J. D. Herbsleb, K. M. Carley, Socio-technical congruence: a framework
     for assessing the impact of technical and work dependencies on software development
     productivity, in: ACM-IEEE international symposium on Empirical software engineering
     and measurement, 2008, pp. 2–11.
[17] M. M. Syeed, K. M. Hansen, I. Hammouda, K. Manikas, Socio-technical congruence in the
     ruby ecosystem, in: International Symposium on Open Collaboration, 2014, pp. 1–9.
[18] M. Golzadeh, Analysing socio-technical congruence in the package dependency network
     of Cargo, in: Joint Meeting on European Software Engineering Conference and Symposium
     on the Foundations of Software Engineering, 2019, pp. 1226–1228.
[19] N. Zazworka, K. Stapel, E. Knauss, F. Shull, V. R. Basili, K. Schneider, Are developers
     complying with the process: an xp study, in: International Symposium on Empirical
     Software Engineering and Measurement, 2010, pp. 1–10. doi:10.1145/1852786.1852805.
[20] G. Avelino, L. Passos, A. Hora, M. T. Valente, A novel approach for estimating truck
     factors, in: International Conference on Program Comprehension, IEEE, 2016, pp. 1–10.
     doi:10.1109/ICPC.2016.7503718.
[21] P. C. Rigby, Y. C. Zhu, S. M. Donadelli, A. Mockus, Quantifying and mitigating turnover-
     induced knowledge loss: case studies of Chrome and a project at Avaya, in: International
     Conference on Software Engineering, IEEE, 2016, pp. 1006–1016.
     doi:10.1145/2884781.2884851.
[22] V. Cosentino, J. L. C. Izquierdo, J. Cabot, Assessing the bus factor of git repositories, in:
     International Conference on Software Analysis, Evolution, and Reengineering, IEEE, 2015,
     pp. 499–503. doi:10.1109/SANER.2015.7081864.
[23] H. S. Qiu, Y. L. Li, S. Padala, A. Sarma, B. Vasilescu, The signals that potential contributors
     look for when choosing open-source projects, Proceedings of the ACM on Human-
     Computer Interaction 3 (2019) 1–29. doi:10.1145/3359224.
[24] A. Schilling, S. Laumer, T. Weitzel, Who will remain? an evaluation of actual person-job
     and person-team fit to predict developer retention in floss projects, in: Hawaii International
     Conference on System Sciences, IEEE, 2012, pp. 3446–3455. doi:10.1109/HICSS.2012.644.
[25] A. Decan, E. Constantinou, T. Mens, H. Rocha, Gap: Forecasting commit activity in git
     projects, Journal of Systems and Software 165 (2020) 110573.
     doi:10.1016/j.jss.2020.110573.