<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Repository data-based algorithm for selection of product teams of IT specialists</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexey Zhelepov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadezhda Yarushkina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Ulyanovsk State Technical University</institution>
          ,
          <addr-line>Ulyanovsk, Russia, ORCID: 0000-0002-5718-8732</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, Ulyanovsk State Technical University</institution>
          ,
          <addr-line>Ulyanovsk, Russia, ORCID: 0000-0003-1197-4401</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>166</fpage>
      <lpage>170</lpage>
      <abstract>
        <p>Due to the lack of qualified personnel in the IT sector, companies offer their employees the opportunity to work remotely. This allows even a small company to act as a global player on the market and to recruit new professionals from all over the world. Companies that develop product solutions are interested in hiring cohesive teams of developers who have worked together for a long time. To do so, the HR processes of the company have to be restructured and supplemented with tools that help to analyze the activity of an entire team and the artifacts it has created. The article gives a detailed description of an MVP that implements the search for and selection of project teams based on data from open-source code repositories and related artifacts. It also describes the algorithm for selecting the main team from the entire set of developers who took part in the development of a project.</p>
      </abstract>
      <kwd-group>
        <kwd>repository</kwd>
        <kwd>remote team</kwd>
        <kwd>metrics</kwd>
        <kwd>search</kwd>
        <kwd>filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Despite constant global growth, the Russian IT
sector still lacks highly qualified personnel [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. RUSSOFT,
the community of Russian software development companies, has prepared
a study showing that the problem is especially acute for IT companies
in the regions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Because of the shortage of employees,
companies are changing their work model and giving staff the
opportunity to work remotely. This transition makes it
possible to hire new developers on the global market.
      </p>
      <p>
        At the same time, product development
companies tend to look not only for
individual specialists but also for entire teams of professionals [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,
4, 5</xref>
        ]. In this article, such teams are called cohesive
teams. The concept implies
that the group of developers has already worked
together on projects and that its internal processes and relations
are established. The demand for such teams is
driven by the features of modern project development,
such as rapid hypothesis testing and MVP development.
      </p>
      <p>The article describes the data source used in the research,
which can also serve as a source for a global search of
product teams, the basic architecture of the system components, the
algorithm for filtering the main team, and the results
obtained with the algorithm.</p>
      <p>II. THE ARCHITECTURE OF THE PRODUCT TEAM SEARCH SYSTEM</p>
      <p>
        GitHub is widely used by developers from all over the world
to store their projects (the service is also
known as a social network for developers). It was used as the source of data for the
research. The resource hosts more than 37 million
users and 100 million project repositories. The interaction
layer between the developed system and the data source is
built upon the open GitHub API [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Fig. 1. Architecture of the project team search system. Labelled components: the query generator producing requests Q1 … Qn from the query parameters q1, q2, … qn; asynchronous execution of the API calls; the response handler that converts and saves the responses R1 … Rn to the database; derivation of the repository development metrics (I, D, CF) and management metrics (TC, RC); k-means cluster analysis on each set of metrics; saving and displaying of the cluster results; BI analysis with calculation of project, time and development metrics.</p>
      <p>The designed system consists of 4 modules:</p>
      <list list-type="bullet">
        <list-item><p>the query generator Q(q1, q2, … qn), which creates the API calls for data exchange between GitHub and the system. Here q1, q2, … qn are query parameters that describe the project technologies, the number of participants, etc.;</p></list-item>
        <list-item><p>the search for the main project team. GitHub repositories are publicly accessible, which makes them a natural field for the participation of side developers besides the core team. The main function of the module is to filter the contributors and select the main team;</p></list-item>
        <list-item><p>the BI module, which calculates teamwork performance. In this paper, the BI module is only mentioned as a part of the system; its algorithms and visualizations will be presented in future research results;</p></list-item>
        <list-item><p>the recommendation system for improving teamwork. The purpose of this module is to analyze the core team members and propose improvements to their work. This module is still under development.</p></list-item>
      </list>
      <p>The query module supports the following input parameters:</p>
      <list list-type="bullet">
        <list-item><p>q1 describes the technologies used in the development of the project. This parameter allows repositories to be partitioned by technology stack, which is of primary importance for an HR specialist looking for potential product teams. The parameter is a set of string values naming programming and DevOps technologies: q1 ∈ [.NET, Ruby on Rails, Kubernetes …];</p></list-item>
        <list-item><p>q2 is the “copy index” of the repository, that is, how often it has been copied by other developers or project teams. The parameter describes how the project is used by side teams. The range of values is [0; +∞);</p></list-item>
        <list-item><p>q3 is the popularity index of the project. It is a social parameter of the repository, because any developer can leave feedback on the project. The value lies within [0; +∞) and reflects the number of positive reviews left by side developers;</p></list-item>
        <list-item><p>q4 is the name of the repository being searched for;</p></list-item>
        <list-item><p>q5 is the name of the development team;</p></list-item>
        <list-item><p>q6 is the number of project participants; the value lies within [0; +∞).</p></list-item>
      </list>
      <p>In the course of further research, it is planned to expand
the search capabilities by adding further
parameters. The search is implemented in the Python
programming language. It interacts with GitHub services and
retrieves the necessary data.</p>
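      <p>The way the query generator turns the parameters q1 … q5 into an API call can be illustrated with a minimal sketch. The function name build_search_url and the mapping of each parameter to a GitHub search qualifier (language, forks, stars, in:name, org) are assumptions made for illustration and are not taken from the described system; q6 has no search qualifier and is left out. The sketch only builds the request URL and performs no network call.</p>
      <preformat>
```python
from urllib.parse import urlencode

# Public endpoint of the GitHub REST API for repository search.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(technology=None, min_forks=None, min_stars=None,
                     repo_name=None, owner=None):
    """Build a repository search URL from the query parameters q1..q5.

    The mapping of q1..q5 to search qualifiers is an assumption.
    q6 (number of participants) has no search qualifier and would have
    to be checked after the contributor list of each result is fetched.
    """
    terms = []
    if repo_name:                      # q4: repository name
        terms.append(f"{repo_name} in:name")
    if technology:                     # q1: technology, here a language
        terms.append(f"language:{technology}")
    if min_forks is not None:          # q2: "copy index" as fork count
        terms.append(f"forks:>={min_forks}")
    if min_stars is not None:          # q3: popularity as star count
        terms.append(f"stars:>={min_stars}")
    if owner:                          # q5: team as organization login
        terms.append(f"org:{owner}")
    # urlencode percent-encodes the qualifiers and joins words with "+".
    return SEARCH_URL + "?" + urlencode({"q": " ".join(terms)})


url = build_search_url(technology="php", min_forks=100, min_stars=1000)
print(url)
```
      </preformat>
      <p>Technologies that are not programming languages (for example Kubernetes) would need a different qualifier, such as a topic, which this sketch does not cover.</p>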
      <p>III. TEAMWORK AND REPOSITORY MODELS</p>
      <p>The project team has clearly distinguished roles:</p>
      <list list-type="bullet">
        <list-item><p>Analysts (A) are team members who identify the needs of the project's users and of developers from other communities. The analyst decides the direction of the project's development;</p></list-item>
        <list-item><p>Developers (D) are team members who solve the project's problems;</p></list-item>
        <list-item><p>QA specialists (Q) are team members who are responsible for the project's quality and workability;</p></list-item>
        <list-item><p>Team leaders (T) are team members who manage the development process and assess and distribute tasks among the developers.</p></list-item>
      </list>
      <p>It is assumed that one member of a team can fulfill
several roles. Thus, the model of the project team can be
represented as:</p>
      <p>M = {A, D, Q, T}
</p>
      <p>Each of the roles interacts with the project repository in a
certain way.</p>
      <p>The analyst role can be performed by side
developers or by the community interested in the project's
development. The interaction between them and the core
team can be tracked by:</p>
      <list list-type="bullet">
        <list-item><p>Tasks (T) created by non-core team members;</p></list-item>
        <list-item><p>Project changes (PR) proposed by community members to correct errors and improve the project;</p></list-item>
        <list-item><p>Social activity of participants, expressed in their comments on the tasks (C), in recognition of the significance of the project in the form of positive reviews (R), and in their own copies of the repository, represented by forks (F).</p></list-item>
      </list>
      <p>The developer role can be shared between the core
team and side developers who are interested in the project's
improvement. The repository describes the role's activities
via the following characteristics:</p>
      <list list-type="bullet">
        <list-item><p>Commits (Cmt) are contributions to the project in the form of lines of program code, documentation, etc.;</p></list-item>
        <list-item><p>Project branches (B), which reflect the execution of the assigned project tasks. Branches are created to simplify the synchronization of development.</p></list-item>
      </list>
      <p>The QA role is also partly assigned to the community. The
repository describes the role via the following characteristics:</p>
      <list list-type="bullet">
        <list-item><p>Tasks (T) created to fix errors in the project;</p></list-item>
        <list-item><p>Merged branches (MB). Before being added to the main branch of the project repository, a completed task should be tested for errors and for compliance with the requirements of the task.</p></list-item>
      </list>
      <p>The role of the leader can be performed only by a core
developer. The distinctive metrics are:</p>
      <list list-type="bullet">
        <list-item><p>Released versions (R). When creating a release, the team leader combines many branches (MB) of other developers, which is quite a complex task;</p></list-item>
        <list-item><p>Task assignment (T) for the project's developers.</p></list-item>
      </list>
      <p>Finally, the project repository model can be expressed
as a set of relationships between project roles and their artifacts:
Rep = { A → (T, PR, C, R, F); D → (Cmt, B); Q → (T,
MB); L → (R, T, MB) },
where L denotes the team leader role (written T in the team model M).</p>
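      <p>The model Rep can be read as a plain mapping from roles to artifact types. The following minimal sketch encodes it as a Python dictionary and adds a hypothetical helper, roles_for, that answers the inverse question of which roles a given artifact may point to. The helper is an illustration only and is not part of the described system.</p>
      <preformat>
```python
# Role-to-artifact mapping from the model Rep; keys and abbreviations
# follow the notation of the text.
REP = {
    "A": ("T", "PR", "C", "R", "F"),   # analyst
    "D": ("Cmt", "B"),                 # developer
    "Q": ("T", "MB"),                  # QA specialist
    "L": ("R", "T", "MB"),             # team leader
}


def roles_for(artifact):
    """Return the roles whose activity the given artifact may reflect."""
    return [role for role, artifacts in REP.items() if artifact in artifacts]


print(roles_for("T"))    # tasks are shared by analysts, QA and leaders
print(roles_for("Cmt"))  # commits point to the developer role only
```
      </preformat>
      <p>Because several roles share the same artifact type, a single artifact rarely identifies a role on its own, which is one reason to combine several metrics.</p>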
      <p>The model relates individual project artifacts to the roles of the
project team. Based on this model, an
algorithm for identifying the main project team was developed;
it is presented below.</p>
      <p>IV. THE ALGORITHM FOR SELECTING THE MAIN TEAM</p>
      <p>
        The selection algorithm searches for the
main development team among all developers who
have contributed to the project. The
algorithm operates on the data of the chosen repository
(commits, contributors, etc.). The search is based on the
k-means clustering method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], because it allows
the number of clusters to be set manually.
      </p>
      <p>
        Initially, the clusters were supposed to follow the
JMS model described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The clusters would
have been Junior, Middle and Senior, indicating
the skill level of the specialists.
      </p>
      <p>However, a thorough review showed that such a
strategy cannot be applied in this field. The repository
provides development statistics: timelines, code, commits,
etc. These data describe well only the developers who
actually make the final contributions. The work of
their leaders (communications, task and issue management,
release management, etc.) stays in the shadow. Consequently,
an analysis based only on quantitative metrics and the JMS
model would be seriously distorted.</p>
      <p>Considering the features above, the following clusters
were chosen:</p>
      <list list-type="bullet">
        <list-item><p>Contributor (C), a participant who makes a relatively small contribution to the project;</p></list-item>
        <list-item><p>Participant (P), a specialist who periodically takes part in project improvement;</p></list-item>
        <list-item><p>Prospective member of the team (PM), a developer who actively takes part in the project;</p></list-item>
        <list-item><p>Main developer (MD), a specialist who makes the greatest contributions to the project.</p></list-item>
      </list>
      <p>
        The clustering consists of two stages and is based on both
development and management metrics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Separate stages make it possible to cover not only the plain
development metrics related to code writing but also
the management processes.</p>
      <p>The development features chosen as metrics:</p>
      <list list-type="bullet">
        <list-item><p>Insertions and deletions (I, D). The metric describes the number of code lines added and removed by each project team member;</p></list-item>
        <list-item><p>Changed files (CF). This characteristic helps to cover the operational range of each team member.</p></list-item>
      </list>
      <p>The management features chosen as metrics:</p>
      <list list-type="bullet">
        <list-item><p>Task Count (TC). This value helps to analyze the ability of team members to correctly understand and determine the project development vector;</p></list-item>
        <list-item><p>Release Count (RC). It shows the number of released versions prepared by each developer. The metric describes the ability of team members to combine all contributions into a stable version of the software.</p></list-item>
      </list>
      <p>The algorithm consists of the following stages:</p>
      <list list-type="bullet">
        <list-item><p>[STAGE 1] pre-processes and prepares the input data;</p></list-item>
        <list-item><p>[STAGE 2] clusters the project contributors;</p></list-item>
        <list-item><p>[STAGE 3] determines the affiliation of each contributor to a particular cluster;</p></list-item>
        <list-item><p>[STAGE 4] visualizes the results.</p></list-item>
      </list>
      <p>Before clustering, the input data are pre-processed:
the developers are grouped by the number of
commits, changed files, created releases, and tasks.</p>
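      <p>The pre-processing step can be sketched as a simple aggregation of raw repository records per developer. The record layout, the user names and the function name aggregate below are hypothetical and serve only to illustrate the grouping. Summing the changed-file counts of individual commits is likewise an assumption, since the text does not say whether files touched several times are counted once.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical raw records, one per commit:
# (user_id, additions, deletions, changed_files)
commits = [
    ("alice", 120, 30, 4),
    ("bob", 5, 1, 1),
    ("alice", 40, 10, 2),
]
# Hypothetical management events: (user_id, kind),
# where kind is "task" or "release".
events = [("alice", "task"), ("alice", "release"), ("bob", "task")]


def aggregate(commits, events):
    """Group raw records by developer into the two metric vectors."""
    development = defaultdict(lambda: [0, 0, 0])   # [I, D, CF]
    management = defaultdict(lambda: [0, 0])       # [TC, RC]
    for user_id, additions, deletions, changed_files in commits:
        row = development[user_id]
        row[0] += additions
        row[1] += deletions
        row[2] += changed_files
    for user_id, kind in events:
        # Index 0 counts tasks, index 1 counts releases.
        management[user_id][0 if kind == "task" else 1] += 1
    return dict(development), dict(management)


development, management = aggregate(commits, events)
print(development)
print(management)
```
      </preformat>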
      <p>The input data can be represented via the structures that
are described in figure 2.</p>
      <preformat>struct label {
    user_id: string
}

struct management {
    task_count: int
    release_count: int
}

struct development {
    additions_count: int
    deletions_count: int
    changed_files_count: int
}</preformat>
      <p>Fig. 2. Structures that describe the input data.</p>
      <p>The label structure identifies the contributor to whom the
values are assigned. The management and development structures
contain the characteristics for the two clustering analyses
(based on management and on development metrics, respectively).</p>
      <p>The k-means clustering algorithm requires a
predetermined number of clusters. In this case the number is
4, according to the number of clusters introduced above: “C”, “P”,
“PM” and “MD”.</p>
      <p>Another parameter of the clustering algorithm is the
maximum number of iterations. Here the value
of 10,000 is chosen; it was obtained experimentally: starting from
about 1000 iterations, the centres of the clusters change
only slightly.</p>
      <p>The output of the clustering procedure is a
tuple of the form {label; cluster name}.</p>
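      <p>A minimal sketch of one clustering stage is given below. It uses a plain Lloyd-style k-means written in pure Python with 4 clusters and an iteration cap of 10,000, as stated in the text. The deterministic initialization, the rule that names clusters by the norm of their centres (smallest is “C”, largest is “MD”), the absence of feature scaling, and the sample data are all assumptions made for illustration; the text does not specify these details.</p>
      <preformat>
```python
import math

CLUSTER_NAMES = ["C", "P", "PM", "MD"]


def kmeans(points, k, max_iter=10_000):
    """Plain Lloyd's k-means; returns (centroids, assignment)."""
    # Assumption: deterministic start from points spread evenly over
    # the data sorted by vector norm (the init is not given in the text).
    ordered = sorted(points, key=lambda p: math.hypot(*p))
    step = (len(ordered) - 1) / (k - 1)
    centroids = [list(ordered[round(i * step)]) for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:      # converged early
            break
        assignment = new_assignment
        # Move every centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep old centre if empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment


def name_clusters(labels, points, k=4, max_iter=10_000):
    """Return {label: cluster name}; clusters ranked by centre norm."""
    centroids, assignment = kmeans(points, k, max_iter)
    ranking = sorted(range(k), key=lambda c: math.hypot(*centroids[c]))
    names = {c: CLUSTER_NAMES[rank] for rank, c in enumerate(ranking)}
    return {label: names[a] for label, a in zip(labels, assignment)}


# Hypothetical development metrics (I, D, CF) for eight contributors.
labels = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"]
points = [(5, 1, 1), (8, 2, 1), (200, 50, 10), (220, 60, 12),
          (2000, 500, 80), (2100, 550, 90),
          (20000, 5000, 700), (21000, 5200, 720)]
result = name_clusters(labels, points)
print(result)
```
      </preformat>
      <p>The same function could be applied to the management metrics (TC, RC) for the second stage; how the two results are combined is not covered by this sketch.</p>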
    </sec>
    <sec id="sec-2">
      <title>V. THE EXPERIMENT AND RESULTS</title>
      <p>The algorithm was tested on 10 repositories of the
following well-known projects:</p>
      <list list-type="bullet">
        <list-item><p>ClickHouse, a column-oriented database management system created by Yandex. The project's history has more than 30 000 contributions made by 350 developers, including the main team. The team is not remote and works in the same office;</p></list-item>
        <list-item><p>Yii 2 Framework, which is widely used by PHP developers all over the world. The number of commits is about 20 000 and the number of participants is 950. The international team is remote and works from all over the world;</p></list-item>
        <list-item><p>Albumentations and Catalyst. Most machine learning practitioners favor these frameworks despite their relatively young age. Albumentations is heavily used by X5 Retail Group in computer vision projects; Catalyst is mostly an initiative project of several developers;</p></list-item>
        <list-item><p>6 other projects of the international product team Evil Martians: PostCss, BrowsersList, AutoPrefixer, NanoId, Gon, ImgProxy. The repositories are widely used by web developers and have a high rating on GitHub.</p></list-item>
      </list>
      <p>These repositories were chosen because their
core developers took part in the IT conference Stachka, which one
of the authors of this article organized. This
made it possible to obtain a correct expert assessment of the
algorithm's results.</p>
      <p>The analysis of the projects is presented in Table 1. It
is based on both development and management
metrics.</p>
      <p>As Table 1 shows, the clusters obtained from management
metrics contain fewer project developers than those obtained from
development metrics. This can be explained by the fact that only a small
percentage of specialists are responsible for
crucial decisions in project development (creating tasks,
preparing releases, etc.). Given the nature of GitHub,
these developers are certainly members of the core teams,
who are most interested in the development of the project.
Fig. 3 shows an example of the visualization of clustering
based on development and management metrics.</p>
      <p>The core developers of ClickHouse, Yii 2, Catalyst and
Albumentations and members of the Evil Martians team
assessed the practical significance of the algorithm. The
intersection of the core members identified by the
algorithm (P1) and the real members (P2) was approved by the
experts and is shown in Tables 2 and 3. Gray marks
the cases in which the algorithm was correct.</p>
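      <p>The comparison of the team found by the algorithm (P1) with the real team (P2) amounts to simple set operations. The sketch below is an illustration only: the function name compare_teams and the logins are hypothetical and are not taken from the experiment.</p>
      <preformat>
```python
def compare_teams(p1, p2):
    """Compare the algorithm's core team (P1) with the real one (P2)."""
    p1, p2 = set(p1), set(p2)
    return {
        "confirmed": sorted(p1.intersection(p2)),  # found and real
        "missed": sorted(p2 - p1),                 # real but not found
        "extra": sorted(p1 - p2),                  # found but not real
    }


# Hypothetical logins, not taken from the experiment.
report = compare_teams(p1=["dev_a", "dev_b", "dev_d"],
                       p2=["dev_a", "dev_b", "dev_c"])
print(report)
```
      </preformat>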
      <p>It is no coincidence that the results are divided between
Tables 2 and 3. Table 2 contains projects of high complexity.
Table 3 contains results for simpler libraries that simplify the
development process rather than provide a full-scale solution
like the projects in Table 2. However, the algorithm did
not include some core developers. Therefore, further
research will take into account the social component of the
development process: soft skills. Its numerical metric will be
the number of messages and related sub-messages
in discussions of proposed solutions to repository issues.
Such discussions are commonly led by core developers.</p>
      <p>The algorithm did not identify some of the core team
members and put them into the C and P clusters. This occurred
because these developers do not make enough changes to
the project, being occupied with other kinds of activity: support
functions, community work and even contributions to other
teams' projects.</p>
    </sec>
    <sec id="sec-3">
      <title>VI. CONCLUSION</title>
      <p>This paper describes the architecture of the project team
search system, the model of the project team and the roles of its
members, and the algorithm that searches for
the core team.</p>
      <p>The practical significance of the approach is that it helps
to automate the search not only of specialists but of
development teams. The HR manager is able to analyze the
team's activity and make a decision on whether to hire the
found team or not.</p>
      <p>
        Further study will be associated with the development of
a recommendation module of the system, which is planned to
be built on the basis of the fuzzy logic paradigm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors thank the Russian Foundation for
Basic Research for supporting this work under grant №
18-47-730019 p_a.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yarushkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Afanaseva</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Shiniaeva</surname>
          </string-name>
          , “Research of ITCluster of Ulyanovsk District,” Ulyanovsk, UlSTU Publ., pp.
          <fpage>137</fpage>
          -
          <lpage>138</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Solovyova</surname>
          </string-name>
          , “
          <article-title>Russian IT-sphere expects the extreme employee shortage</article-title>
          ,” IT World,
          <year>2018</year>
          [Online]. URL: https://www.itworld.ru/it-news/it/140881.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. Sandy</given-names>
            <surname>Staples</surname>
          </string-name>
          , “
          <article-title>A Study of Remote Workers and Their Differences from Non-Remote Workers</article-title>
          ,”
          <source>Journal of Organizational and End User Computing</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2001</year>
          . DOI: 10.4018/joeuc.2001040101.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yarushkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guskov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dudarin</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stuchebnikov</surname>
          </string-name>
          , “
          <article-title>An Approach to Similar Software Projects Searching and Architecture Analysis Based on Artificial Intelligence Methods</article-title>
          ,”
          <source>Proceedings of the Third International Scientific Conference Intelligent Information Technologies for Industry (IITI). Advances in Intelligent Systems and Computing</source>
          . Springer, Cham, vol.
          <volume>1</volume>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>352</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gil</surname>
          </string-name>
          , “
          <source>High Growth Handbook: Scaling Startups from 10 to 10,000 People</source>
          ,” Stripe Press, pp.
          <fpage>95</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vasin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yasakov</surname>
          </string-name>
          , “
          <article-title>Distributed data management system for integrated handling of geolocation data,” Computer Optics</article-title>
          , vol.
          <volume>40</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>919</fpage>
          -
          <lpage>928</lpage>
          ,
          <year>2016</year>
          . DOI: 10.18287/2412-6179-2016-40-6-919-928.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Minhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehan</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kakkar</surname>
          </string-name>
          , “
          <article-title>Static and Dynamic Complexity Analysis of Software Metrics</article-title>
          ,”
          <source>World Academy of Science, Engineering and Technology International Journal of Computer, Electrical, Automation, Control and Information Engineering</source>
          , vol.
          <volume>3</volume>
          , no. 8, pp.
          <fpage>1936</fpage>
          -
          <lpage>1938</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Afanasyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhelepov</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Zagaichuk</surname>
          </string-name>
          , “
          <article-title>Framework for Accessing Professional Growth of Software Developers</article-title>
          ,”
          <source>ICCTA, 5th International Conference on Computer and Technology Applications</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , “
          <article-title>A Clustering Method Based on K-Means Algorithm</article-title>
          ,”
          <source>International Conference on Solid Devices and Materials Science</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yarushkina</surname>
          </string-name>
          , “
          <article-title>Methods for Fuzzy Expert Systems in Intellectual CAD-Systems,”</article-title>
          <string-name>
            <surname>Saratov</surname>
          </string-name>
          ,
          <year>1997</year>
          , 106 p.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>