Extracting and Comparing Concepts Emerging from Software Code, Documentation and Tests
Zaki Pauzi1, Andrea Capiluppi1
1 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence (University of Groningen), Nijenborgh 9, 9747 AG Groningen, The Netherlands

Abstract
Traceability in software engineering is the ability to connect different artifacts that have been built or designed at various points in time. Given the variety of tasks, tools and formats in the software lifecycle, an outstanding challenge for traceability studies is to deal with the heterogeneity of the artifacts, the links between them and the means to extract each. Using a unified approach for extracting keywords from textual information, this paper aims to compare the concepts extracted from three software artifacts of the same system: source code, documentation and tests. The objectives are to detect similarities in the concepts that emerge, and to show the degree of alignment and synchronisation that the artifacts possess. Using the components of three projects from the Apache Software Foundation, this paper extracts the concepts from 'base' source code, documentation, and tests (separated from the source code). The extraction is done based on the keywords present in each artifact: we then run multiple comparisons (through calculating cosine similarities on features extracted by word embeddings) in order to detect how the sets of concepts are similar or overlap. For similarities between code and tests, we discovered that using pre-trained language models (with increasing dimension and corpus size) correlates with an increase in the magnitude of the similarity scores, with higher averages and smaller ranges. FastText pre-trained embeddings scored the highest average of 97.33% with the lowest range of 21.8 across all projects. Also, our approach was able to quickly detect outliers, possibly indicating drifts in traceability within modules. For similarities involving documentation, there was a considerable drop in similarity score compared to the code/test similarity per module, dropping to below 5%.

Keywords
software traceability, natural language processing, information retrieval, textual analysis

1. Introduction
Software traceability is a fundamentally important task in software engineering: for some domains, traceability is even assessed by certifying bodies [1]. The need for automated traceability increases as projects become more complex and as the number of artifacts increases [2, 3, 4]. The underlying complexities of the logical relations between these artifacts have prompted a variety of empirical studies [5, 6, 7] and several areas of research, particularly in the inception of semantic domain knowledge [8, 9]. Above all, one of the most pressing challenges is linked to the heterogeneity of the artifacts, the links extracted between them, and the variety of formats and tools available for the different stages of traceability studies [10].

BENEVOL'21: The 20th Belgium-Netherlands Software Evolution Workshop, December 07–08, 2021, 's-Hertogenbosch (virtual), NL
a.z.bin.mohamad.pauzi@rug.nl (Z. Pauzi); a.capiluppi@rug.nl (A. Capiluppi)
ORCID: 0000-0003-4032-4766 (Z. Pauzi); 0000-0001-9469-6050 (A. Capiluppi)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Much has been done in re-establishing traceability links between software artifacts, but not particularly in the field of domain concepts through semantic modelling. Major advancements in NLP research in recent years have resulted in practical uses of language models [11], such as the introduction of deep and transfer learning, which were previously more common in computer vision. By applying these techniques to software artifacts, human language can be closely related to the concepts extracted from source code and tests, ultimately allowing a software system's identity to be comprehended through natural language. This paves the way for a variety of semantics-driven applications, such as automated software domain classification, taxonomy structures of domains and studies of software evolution.

The contribution of this paper is the notion of concept similarity, in the context of the traceability between source code, documentation and tests of a software unit (class, module or system). The dictionary definition of "concept" is a principle or an idea 1 that is abstract or generic in nature 2. In software engineering, we extend this definition to include the features of software components, where concepts are presently identified through source code analysis and manipulation.

For our dataset, we use the modules present in three actively managed Java open source projects from the Apache Software Foundation (ASF): Apache's Dubbo3, Skywalking4 and Flink5. Although the sample is small, our aim is to showcase the methodology behind the data extraction: the approach is straightforward and scalable, so it will be possible to analyse a larger sample with minimal effort. We extracted the concepts of each project from three sources: the source code, documentation and tests. The documentation was extracted from the README file, which serves as the first point of entry for project stakeholders. The tests of each project were identified through regular expressions on filenames, while the remaining Java files constitute the 'base' source code for these projects. We extracted the concepts emerging from the keywords used in each of these sources, and we ran multiple similarity measurements to determine the similarity of these artifacts, answering the following:

RQ1: How similar (syntactically and semantically) are the three software artifacts: source code, documentation and tests?
RQ2: How does feature extraction (through word embeddings) perform when comparing textual similarity between the source code, documentation and tests?

This paper is structured as follows: Section 2 summarises the related work, and Section 3 describes the unified approach used to extract the concepts. Section 4 shows the results and discusses the findings. Section 5 concludes.

2. Background Research
In [12], the authors present a roadmap of the research on software traceability, based on the topics that researchers had focused on up to that point.
1 https://dictionary.cambridge.org/dictionary/english/concept
2 https://www.merriam-webster.com/dictionary/concept
3 https://github.com/apache/dubbo
4 https://github.com/apache/skywalking
5 https://github.com/apache/flink

According to that roadmap, our work:
• is based on vertical traceability relations (i.e., it includes the relations between different artifacts);
• includes overlap relations (i.e., we study whether C1 and C2 6 refer to common concepts of a system);
• targets the automatic generation of traceability relations (as compared to manual or semi-automatic approaches);
• aims to analyse traceability for the purpose of software validation, verification and testing, insofar as 'traceability relations may be used to check the existence of appropriate test cases for verifying different requirements and to retrieve such cases for further testing' [12].

6 "C1" and "C2" are arbitrary concepts

The traceability field is most often associated with requirements traceability [13]: pre-packaged, automated tools like TraceLab [14] have often been preferred to trace different artifacts, or versions of the same artifact. Although it became clear over the years that traceability of software artifacts is essentially an information retrieval problem [15], combining software traceability with semantic information of software artifacts was shown to be a promising technique. Prospective traceability tools [16] have been developed with both the architecture and semantic modelling in mind. This empirical approach to recovering traceability links through topic modelling was later adapted and improved by integrating orthogonal information retrieval methods [17].

Whilst most of the traditional literature on traceability has focused on requirements, the research on traceability of open source systems has focused on other artifacts (source code, user documents, build management documents, etc. [18, 19]). When requirements are considered, they are not traditionally elicited through customer feedback, but just-in-time and termed feature requests [20], or elicited using various other artifacts (e.g., CVS/SVN commit logs and Bugzilla reports [21]). In the open source context, the extraction of requirements is often considered from a long-term view, for instance in the context of impact analysis [22].

In the context of deriving semantic value from software artifacts, extracting topics from source code has been previously presented in [23], demonstrating that the Latent Dirichlet Allocation (LDA) technique has strong applicability in the extraction of topics from classes, packages and overall systems. This technique was also used in a later paper [24], where experts were consulted to assign software systems to their application domains. The keyword terms derived from this technique were also compared in terms of text similarity using various word embeddings in [25].

3. Methodology
3.1. Definitions
The following are the definitions as used in this paper:
• Corpus keyword (term) – given the source code contained in a class, a token or term is any item that is contained in the source code. We do not consider as a term any of the Java-specific keywords (e.g., if, then, switch, etc.). Additionally, the camelCase or PascalCase notations are first decoupled into their components (e.g., the class constructor InvalidRequestTest produces the terms invalid, request and test). A minimal sketch of this splitting step is shown after this list.
• Topic – this refers to the clusters of keywords extracted with the Latent Dirichlet Allocation (LDA) technique, and weighted by their relative relevance. For this paper, we will concatenate all the keywords sans weights as topic keywords.
• Concept – the set of 'source code concepts' is the union of the (i) corpus keywords set and (ii) topics set, as extracted from the source code. These 'concepts' are derived from the lexicon used in the code. The 'test concepts' have a similar definition, using the sets from the test batch.
• 𝑥/𝑦 SIM – Concept similarity between 𝑥 and 𝑦, where 𝑥 and 𝑦 are software artifacts, and 𝑥 ≠ 𝑦.
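To make the InvalidRequestTest example concrete, the splitting step can be sketched in a few lines of Python. This is not the authors' extraction code: the regular expression and the (abbreviated) keyword list below are illustrative assumptions only.

```python
import re

# Abbreviated stop list for illustration; the full Java keyword list is linked in Section 3.3 (footnote 9).
JAVA_KEYWORDS = {"if", "else", "then", "switch", "case", "for", "while", "return",
                 "class", "public", "private", "static", "void", "new", "import", "package"}

def split_identifier(identifier: str) -> list[str]:
    """Decouple a camelCase/PascalCase identifier into lower-cased terms."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", identifier)
    return [part.lower() for part in re.split(r"[^A-Za-z]+", spaced) if part]

def to_corpus_terms(identifiers: list[str]) -> list[str]:
    """Turn raw identifiers into corpus keywords, dropping Java-specific keywords."""
    return [term
            for identifier in identifiers
            for term in split_identifier(identifier)
            if term not in JAVA_KEYWORDS]

print(to_corpus_terms(["InvalidRequestTest", "getServiceKey"]))
# ['invalid', 'request', 'test', 'get', 'service', 'key']
```

Class names, method and attribute identifiers and inline comment tokens would all pass through the same splitting before the lemmatisation step described in Section 3.3.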
3.2. Selection of Software Systems
Table 1 shows the details of our sample dataset. We are fully aware that the analysis of three systems (instead of hundreds, or thousands) does not allow us to draw conclusions about other systems. The empirical study that we present below focuses on top-rated systems (representing the quality of code developed by the ASF community, adhering strictly to established coding standards7), rather than on the breadth of representativeness. We further discuss the implications of our choice in the threats to external validity.

Table 1
Selection of Software Systems
ID | Project Name      | Modules | Stars
P1 | Apache Dubbo      | 14      | 36,323
P2 | Apache Skywalking | 4       | 17,994
P3 | Apache Flink      | 27      | 17,378

3.3. Concept Extraction
The extraction of the concepts and the similarity measurements were executed in Python via a Jupyter notebook8. The extraction was carried out to build the class corpus for each project. Results were then compared and analysed. For any project's source code and tests, we extracted all the class names and the identifiers that were used for methods and attributes. Additionally, inline comments were extracted as well, but the Java keywords were excluded9, along with the project names (such as apache and dubbo for P1), to provide a more accurate representation. This results in an extraction that comprehensively represents the semantic overview of concepts whilst minimising noise from code syntax. The final part of the extraction is the lemmatisation of the terms using SpaCy's10 token.lemma_. Lemmatisation derives the base form of each term (also called its dictionary form, or 'lemma'), thus enabling more matches when we compare across the different sources (e.g., best -> good). An excerpt of the complete corpus from the source code of P1's dubbo-common module is shown in Figure 1. Extending this to all the modules in the three projects, we look at the textual similarity of the corpus keywords extracted for (i) source code and (ii) tests per module.

7 https://directory.apache.org/fortress/coding-standards.html
8 https://github.com/zakipauzi/benevol2021/blob/main/concept_similarity_benevol.ipynb
9 https://en.wikipedia.org/wiki/List_of_Java_keywords
10 https://spacy.io

activate reference service metadata colon separator service key service version registry config config available serial...
Figure 1: Excerpt from the corpus of dubbo-common base code from P1

For the documentation extraction, we looked at the README file and ran it through a similar cleaning pipeline. All non-English characters were disregarded during the exercise by checking whether each character falls within the ASCII Latin range. Figure 2 shows a simplified diagram of our concept similarity (i.e., traceability) between artifacts.

Figure 2: Diagram of concept similarity between artifacts
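A minimal sketch of this extraction and cleaning pipeline is given below. It is an approximation, not the notebook linked in footnote 8: the helper names, the abbreviated keyword list, the Test filename pattern used to separate tests from base code, and the use of spaCy's en_core_web_md model are assumptions made for illustration.

```python
import re
from pathlib import Path

import spacy  # assumes `python -m spacy download en_core_web_md` has been run

nlp = spacy.load("en_core_web_md", disable=["parser", "ner"])  # lemmatiser is enough here

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
JAVA_KEYWORDS = {"if", "else", "switch", "case", "for", "while", "return", "class",
                 "public", "private", "static", "void", "new", "import", "package"}
PROJECT_NAMES = {"apache", "dubbo"}  # project-specific noise removed for P1 (Section 3.3)

def split_identifier(identifier: str) -> list[str]:
    """Decouple camelCase/PascalCase identifiers into lower-cased terms."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", identifier)
    return [p.lower() for p in re.split(r"[^A-Za-z]+", spaced) if p]

def lemmatised(terms: list[str]) -> str:
    """Join the terms and return their space-separated lemmas."""
    # Large modules may need nlp.max_length raised; kept simple for the sketch.
    return " ".join(token.lemma_ for token in nlp(" ".join(terms)))

def module_corpora(module_dir: str, test_pattern: str = r"Test") -> tuple[str, str]:
    """Return (base_corpus, test_corpus) for one module as lemmatised keyword strings."""
    base_terms: list[str] = []
    test_terms: list[str] = []
    for java_file in Path(module_dir).rglob("*.java"):
        raw = java_file.read_text(encoding="utf-8", errors="ignore")
        terms = [w for identifier in IDENTIFIER.findall(raw)
                 for w in split_identifier(identifier)
                 if w not in JAVA_KEYWORDS and w not in PROJECT_NAMES]
        target = test_terms if re.search(test_pattern, java_file.name) else base_terms
        target.extend(terms)
    return lemmatised(base_terms), lemmatised(test_terms)

def readme_corpus(readme_path: str) -> str:
    """Keep ASCII characters only, then lemmatise the alphabetic tokens of the README."""
    text = Path(readme_path).read_text(encoding="utf-8", errors="ignore")
    ascii_only = "".join(ch for ch in text if ord(ch) < 128)
    return " ".join(token.lemma_.lower() for token in nlp(ascii_only) if token.is_alpha)
```

The paper only states that tests were identified 'through regular expressions of filenames', so the Test pattern above is a placeholder rather than the projects' actual naming convention.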
3.4. Topic Modelling
For the CODE/DOC SIM and TEST/DOC SIM, we run topic modelling with Latent Dirichlet Allocation (LDA) due to the vast difference in token count between the README file and the code. We cluster the extracted class corpus into groups: an unsupervised learning approach that tags groups of terms to a topic based on their semantic context, and then identifies the overarching theme of each cluster through its topics. Using Gensim11 LDA, we identify the topic clusters present. Next, we concatenate all the topic keywords from all the modules, from code and tests respectively. Figure 3 shows an example of the topic keywords that emerge from the source code of P1's dubbo-common module. These are used to compare with the topic keywords emerging from the documentation. In Section 4, we will look at the proportion of overlap between the concepts extracted from code and documentation, and from tests and documentation.

11 https://radimrehurek.com/gensim

'config', 'logger', 'key', 'path', 'level', 'listener', 'msg', 'throwable', 'stream', 'provider', 'map', 'integer', 'object', ...
Figure 3: dubbo-common's source code topic keywords

3.5. Similarity Measures
We apply multiple vectorisation techniques to represent corpus keywords as vectors in a vector space, and then we run cosine similarity against these vectors to address RQ1 and RQ2. Cosine similarity is a "distance" metric that depends only on the orientation of the vectors, irrespective of their magnitude: the smaller the angle between the vectors, the higher the similarity. Table 2 shows the different vectorisation techniques used with cosine similarity for measurement.

Table 2
Vectorisation Techniques with Cosine Similarity
Vectorisation type | Pre-trained? | Dimensions | Source
TFIDF Vectorizer   | No           | -          | scikit-learn [26]
SO W2V             | Yes          | 200        | Online at [27]
SpaCy              | Yes          | 300        | Online at [28]
FastText           | Yes          | 300        | Online at [29]

A toolchain graph showing each part of the process discussed above (Sections 3.1 to 3.5) is shown in Figure 4.

Figure 4: Toolchain implemented throughout the paper
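A minimal sketch of this measurement, restricted to the TF-IDF baseline (scikit-learn) and the SpaCy pre-trained vectors from Table 2, is shown below; the per-module corpora are assumed to come from the extraction sketch in Section 3.3, and SO W2V or FastText vectors would be loaded analogously. This illustrates the approach rather than reproducing the authors' notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy

nlp = spacy.load("en_core_web_md")  # 300-dimensional pre-trained vectors (Table 2, [28])

def tfidf_sim(corpus_a: str, corpus_b: str) -> float:
    """Baseline: cosine similarity between the TF-IDF vectors of two corpora."""
    matrix = TfidfVectorizer().fit_transform([corpus_a, corpus_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

def spacy_sim(corpus_a: str, corpus_b: str) -> float:
    """Pre-trained embeddings: cosine similarity of the averaged word vectors."""
    return float(nlp(corpus_a).similarity(nlp(corpus_b)))

# Hypothetical usage with the per-module corpora built earlier:
# base, tests = module_corpora("dubbo/dubbo-common")
# print(tfidf_sim(base, tests), spacy_sim(base, tests))
```

spaCy's Doc.similarity is itself a cosine over averaged token vectors, so both functions report a comparable cosine score.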
4. Results
Due to space limitations, we have made available online the result summary of Code and Test similarity (CODE/TEST SIM)12, and the results of Code and Documentation similarity (CODE/DOC SIM) and Test and Documentation similarity (TEST/DOC SIM)13 across all modules in each project.

12 https://github.com/zakipauzi/benevol2021/blob/main/benevolcodetestsimsummary.csv
13 https://github.com/zakipauzi/benevol2021/blob/main/benevolcodetestdocsimsummary.csv

Figure 5 shows the box plot distribution of CODE/TEST SIM for the vectorisation techniques across all modules of P1, P2 and P3 (RQ2). Similarly to the trends found in [25], the analysis of the three Apache projects confirms that the baseline TF-IDF measurement has the widest ranges, of 0.62 to 0.85, and the lowest mean scores, of 0.34 to 0.66. At the same time, the role of pre-trained embeddings is central in deriving semantic context for concepts: contextual similarity (helped by the three trained datasets) is linked to a higher cosine similarity when comparing corpus keywords in software artifacts. This is expected, since the Java syntax closely mirrors the words already present in the pre-trained vector spaces. With pre-trained embeddings trained on a wide vector space (e.g., an English vocabulary across all domains), the range of similarity scores gets even narrower. We see this range decrease as we run CODE/TEST SIM with StackOverflow data (SO W2V), which expands the vocabulary beyond the scope of our current artifacts (TF-IDF), and decrease further with the SpaCy and FastText embeddings. In other words, the wider the vector space of the embeddings, the narrower the range and the higher the mean. Ultimately, we know that a high score does not necessarily denote a better similarity, but it is worth exploring further how we can balance the role of pre-trained embeddings with an accurate representation of domain-specific similarity, which ideally relates to our traceability reconstruction solution through concept similarities.

A key advantage of using this approach (particularly with pre-trained embeddings) is that the outliers can be easily detected in Figure 5, indicating drifts in traceability via similarity measurements. For P1, we can see outliers such as the modules dubbo-container and dubbo-filter, and P3's flink-streaming-scala: these show that the concepts emerging from their code and tests are vastly different and need looking into.

Figure 5: CODE/TEST similarity for the modules contained in each system

As for our other similarity measurements, CODE/DOC SIM and TEST/DOC SIM, comparing a module to the documentation in its entirety would not be accurate, thus we ran topic modelling to represent the hierarchical structure of the syntax more effectively, an approach similar to [24]. Topics emerging from corpus keywords represent clusters that are a level higher than the corpus keywords, bridging the gap between per-module syntax and the README corpus. The results for both similarities (using the Jaccard Index on topic keywords) are well below CODE/TEST SIM: under 5%. Further work will need to be done to establish the accuracy of this result, such as incorporating weights to topics and identifying traceability beyond the syntax of keywords. across P1 modules scored lower than baseline whereas it differs for some P2 modules.
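The CODE/DOC and TEST/DOC measurement described above can be approximated with the sketch below: Gensim LDA is fitted on a corpus, the topic keywords are collected without their weights (as per the Topic definition in Section 3.1), and the Jaccard Index is taken between two keyword sets. The number of topics, the top-n cut-off and the single-document corpus are illustrative assumptions, not the authors' settings.

```python
from gensim import corpora
from gensim.models import LdaModel

def topic_keywords(corpus_text: str, num_topics: int = 5, topn: int = 10) -> set[str]:
    """Fit LDA on a corpus and return the union of its topic keywords (weights dropped)."""
    docs = [corpus_text.split()]  # one bag of terms; the notebook may split documents differently
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary, random_state=0)
    return {word for topic_id in range(num_topics)
                 for word, _ in lda.show_topic(topic_id, topn=topn)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Proportion of overlap between two keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical usage with corpora from the earlier sketches:
# code_doc_sim = jaccard(topic_keywords(base_corpus), topic_keywords(readme_text))
# test_doc_sim = jaccard(topic_keywords(test_corpus), topic_keywords(readme_text))
```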
4.1. Threats to validity
The approach that we have shown has some threats and limitations that we have identified.
• External tests and documentation not included. Other than the basic unit and logic tests, there are other tests that may involve other software and systems, which are not included (i.e., integration tests).
• Construct: definition of concepts as a construct. Our definition looked at both facets – the corpus keywords and the derived topics. Moreover, our dataset rests on the notion that top-rated projects from an open source body with established coding standards (such as the ASF) represent good code quality in artifacts, and hence we expect that the concepts emerging from the sources are aligned, as in [25]. Our approach ensures that artifacts are treated similarly, and drifts between artifacts are clearly captured by outliers.
• Conclusion: non-uniform identifiers and code smells. From the analysis of the systems, we observed that the structure of the code has some degree of non-uniformity in the way identifiers are used to represent meaning.

5. Conclusion and Further Work
In this paper we explored the triangulation of textual similarity, via different techniques, between the concepts extracted from the source code, documentation and tests of Java software systems. The aim was to assess the traceability between the three sources, and to put it in the context of concept overlap. There is great potential in the results for further development and analysis in semantic traceability for software artifacts (e.g., establishing connections between topics derived outside of lexical intersection, exploring different metrics to detect traceability). We also plan to expand our dataset to include projects from various domains, languages and categories. Moving forward, we would extend this solution to include other artifacts such as architecture diagrams, bug reports and functional requirements. We would also want to look at ways to adapt this approach to a supervised, automated logic of domain categorisation.

References
[1] J. Guo, J. Cheng, J. Cleland-Huang, Semantically enhanced software traceability using deep learning techniques, in: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), IEEE, 2017, pp. 3–14.
[2] J. Cleland-Huang, B. Berenbach, S. Clark, R. Settimi, E. Romanova, Best practices for automated traceability, Computer 40 (2007) 27–35.
[3] C. Duan, P. Laurent, J. Cleland-Huang, C. Kwiatkowski, Towards automated requirements prioritization and triage, Requirements engineering 14 (2009) 73–89.
[4] J. Guo, M. Gibiec, J. Cleland-Huang, Tackling the term-mismatch problem in automated trace retrieval, Empirical Software Engineering 22 (2017) 1103–1142.
[5] J. I. Maletic, E. V. Munson, A. Marcus, T. N. Nguyen, Using a hypertext model for traceability link conformance analysis, in: Proc. of the Int. Workshop on Traceability in Emerging Forms of Software Engineering, 2003, pp. 47–54.
[6] H. Schwarz, J. Ebert, A. Winter, Graph-based traceability: a comprehensive approach, Software & Systems Modeling 9 (2010) 473–492.
[7] P. Mäder, R. Olivetto, A. Marcus, Empirical studies in software and systems traceability, Empirical Softw. Engg. 22 (2017) 963–966. URL: https://doi.org/10.1007/s10664-017-9509-1. doi:10.1007/s10664-017-9509-1.
[8] A. Marcus, J. I. Maletic, Recovering documentation-to-source-code traceability links using latent semantic indexing, in: 25th International Conference on Software Engineering, 2003. Proceedings., IEEE, 2003, pp. 125–135.
[9] T. Zhao, Q. Cao, Q. Sun, An improved approach to traceability recovery based on word embeddings, in: 2017 24th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2017, pp. 81–89.
[10] R. M. Parizi, S. P. Lee, M. Dabbagh, Achievements and challenges in state-of-the-art software traceability between test and code artifacts, IEEE Transactions on Reliability 63 (2014) 913–926.
[11] A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavvaf, E. Fox, Natural language processing advancements by deep learning: A survey, ArXiv abs/2003.01200 (2020).
[12] G. Spanoudakis, A. Zisman, Software traceability: a roadmap, in: Handbook Of Software Engineering And Knowledge Engineering: Vol 3: Recent Advances, World Scientific, 2005, pp. 395–428.
[13] R. Torkar, T. Gorschek, R. Feldt, M. Svahnberg, U. A. Raja, K. Kamran, Requirements traceability: a systematic review and industry case study, International Journal of Software Engineering and Knowledge Engineering 22 (2012) 385–433.
[14] E. Keenan, A. Czauderna, G. Leach, J. Cleland-Huang, Y. Shin, E. Moritz, M. Gethers, D. Poshyvanyk, J. Maletic, J. H. Hayes, et al., Tracelab: An experimental workbench for equipping researchers to innovate, synthesize, and comparatively evaluate traceability solutions, in: 2012 34th International Conference on Software Engineering (ICSE), IEEE, 2012, pp. 1375–1378.
[15] M. Borg, P. Runeson, A. Ardö, Recovering from a decade: a systematic mapping of information retrieval approaches to software traceability, Empirical Software Engineering 19 (2014) 1565–1616.
[16] H. U. Asuncion, A. U. Asuncion, R. N. Taylor, Software traceability with topic modeling, in: 2010 ACM/IEEE 32nd International Conference on Software Engineering, volume 1, 2010, pp. 95–104. doi:10.1145/1806799.1806817.
[17] M. Gethers, R. Oliveto, D. Poshyvanyk, A. D. Lucia, On integrating orthogonal information retrieval methods to improve traceability recovery, in: 2011 27th IEEE International Conference on Software Maintenance (ICSM), 2011, pp. 133–142. doi:10.1109/ICSM.2011.6080780.
[18] H. Kagdi, J. I. Maletic, B. Sharif, Mining software repositories for traceability links, in: 15th IEEE International Conference on Program Comprehension (ICPC'07), IEEE, 2007, pp. 145–154.
[19] H. Kagdi, J. Maletic, Software repositories: A source for traceability links, in: International Workshop on Traceability in Emerging Forms of Software Engineering (GCT/TEFSE07), 2007, pp. 32–39.
[20] P. Heck, A. Zaidman, Horizontal traceability for just-in-time requirements: the case for open source feature requests, Journal of Software: Evolution and Process 26 (2014) 1280–1296.
[21] N. Ali, Y.-G. Guéhéneuc, G. Antoniol, Trustrace: Mining software repositories to improve the accuracy of requirement traceability links, IEEE Transactions on Software Engineering 39 (2012) 725–741.
[22] M. Gethers, B. Dit, H. Kagdi, D. Poshyvanyk, Integrated impact analysis for managing software changes, in: 2012 34th International Conference on Software Engineering (ICSE), IEEE, 2012, pp. 430–440.
[23] A. Kuhn, S. Ducasse, T. Gírba, Semantic clustering: Identifying topics in source code, Information and Software Technology 49 (2007) 230–243.
[24] A. Capiluppi, N. Ajienka, N. Ali, M. Arzoky, S. Counsell, G. Destefanis, A. Miron, B. Nagaria, R. Neykova, M. Shepperd, et al., Using the lexicon from source code to determine application domain, in: Proceedings of the Evaluation and Assessment in Software Engineering, 2020, pp. 110–119.
[25] Z. Pauzi, A. Capiluppi, Text similarity between concepts extracted from source code and documentation, in: Intelligent Data Engineering and Automated Learning – IDEAL 2020 - 21st International Conference, 2020, Proceedings, 2020, pp. 124–135. doi:10.1007/978-3-030-62362-3_12.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[27] V. Efstathiou, C. Chatzilenas, D. Spinellis, Word embeddings for the software engineering domain, in: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), 2018, pp. 38–41.
[28] explosion, en_core_web_md, https://github.com/explosion/spacy-models/releases/tag/en_core_web_md-3.1.0, 2021.
[29] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, 2017. arXiv:1712.09405.