=Paper=
{{Paper
|id=Vol-1743/paper5
|storemode=property
|title=A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation
|pdfUrl=https://ceur-ws.org/Vol-1743/paper5.pdf
|volume=Vol-1743
|authors=Roxana Lisette Quintanilla Portugal,Hugo Roque,Julio Cesar Sampaio do Prado Leite
|dblpUrl=https://dblp.org/rec/conf/simbig/PortugalRL16
}}
==A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation==
A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation

Roxana Lisette Quintanilla Portugal, Hugo Roque, Julio Cesar Sampaio do Prado Leite
Departamento de Informática, PUC-Rio / Rio de Janeiro RJ 22451-9000, Brasil
{rportugal, julio}@inf.puc-rio.br, hugo.roque@aluno.puc-rio.br

Abstract

Requirements elicitation is an important task that can lead to cost reduction in the overall software process, as it avoids failures due to a lack of proper understanding of what to build. However, there is usually little time devoted to proper elicitation during software construction. We assume that information from similar projects is valuable knowledge for requirements engineers facing a new project in the same or a related domain, and that its acquisition can be sped up by knowing the main features of those projects. This information is usually located in the Readme documents of GitHub. We present a tool that helps handle this large amount of information by retrieving a corpus of Readme documents for a given domain-related query. We describe, in detail, how a corpus is created and stress the importance of having a quality corpus as a base for data mining or as input for qualitative data analysis tools.

1 Introduction

Imagine the following situation: a group of musicians wants to produce a music application; they believe it could be a hit. They contacted angel investors, who are willing to invest but need more details about the idea. They therefore decided to hire a requirements engineering company to organize their intentions before contracting a software development company to build the application. The musicians' overall idea is an application that lets the user get to know a city, a neighborhood, or a place such as a university through the music being listened to around it.

It happens that the requirements engineering company hired to do the job is not familiar with the domain and would have to quickly gain leverage on the contextual knowledge to collaborate better with the musicians, as well as to build proper requirements for future developers. This contextual knowledge must be related both to the client side and to the possible software ecology where the application will operate.

We depart from the assumption that requirements-related information can be elicited from Big Data; in this case we use the software repository GitHub, since this source holds, to date, more than 35 million projects (Metz, 2016). This assumption is founded on the evidence that projects on GitHub encode knowledge. Although this encoded knowledge is mainly represented in programming languages, there are annotations in natural language that describe a project's purpose. Of course, those projects differ in several ways, both in the quality of their contents and in the level of information provided in natural language. However, most of the projects we have retrieved from this repository do provide some natural language text, i.e., the Readme document of each project, which helps in understanding a project's purpose. Our work is contextualized in what (Markus, 2001) calls the secondary knowledge miner, defined as "people who seek to answer new questions or create new knowledge through analysis of records produced by other people for different purposes" and who "extract knowledge from records that were collected by others, possibly unknown to the reusers...". Markus also noted that this reuse is not limited to structured data: "Although most research on data mining has focused on structured data, this is data on databases or knowledge datasets, similar issues are likely to apply in the case of secondary reusers of documents".

A requirements elicitor could perform a manual revision of GitHub projects given a domain-related query; however, reading hundreds of projects may not be efficient in time-constrained settings. For instance, a work from the EMSE (Empirical Software Engineering) field mentions that researchers manually extracted data from 32 publications published in digital libraries, which took 80 hours for two tasks: (1) extraction and (2) analysis of data (Ekaputra et al., 2014). Nowadays both digital libraries and GitHub hold a plethora of data; on that ground we automate the document extraction task for projects hosted on GitHub, this time using their Readme perspective. Thus, a set of documents becomes ready for the analysis task, which can be performed manually, assisted by tools for qualitative data analysis such as Atlas.ti or NVivo, or automatically, using text-mining techniques.

The remainder of this paper is structured as follows. Section 2 provides a research baseline for motivation. Section 3 explains the rationale for selecting artifacts in GitHub. Section 4 details the design and construction of the tool. Section 5 describes the qualitative analysis conducted on the corpus of Readmes for the domain "music application". Section 6 concludes and points out future work.
2 Corpus of Documents

(Sinclair, 2005) states this principle for building a corpus: "The contents of a corpus should be selected regardless of their language, but according to their communicative function in the community in which they occur". In this respect, a previous work (Portugal et al., 2015) performed exploratory research to verify to what extent the Readme document is feasible for use in requirements engineering. In this regard, the Readme perspective of GitHub projects has the communicative function of describing the main features of a project. Similarly, the Issues perspective has the communicative function of tracking the evolution of software features.

Another aspect of the construction of a corpus is that it can be considered the first step towards building a web extraction tool, specifically a Natural Language Processing (NLP)-based wrapper (Laender et al., 2002). A similar approach to extracting data given a query can be found in the tool WebCorp (Renouf, 2003); however, this tool does not cope with our goal, as it mainly uses the Google API for gathering information, and as such this mechanism does not cover the internal documents of GitHub projects. Another project very similar to ours is GHTorrent (Gousios, 2013), which in fact can accomplish more than the retrieval of Readmes, making database dumps of all of the projects on GitHub. However, we found some technical barriers, especially for users not used to dealing with this kind of technology. For instance, to get the projects related to a query, the user may need to download a database dump (around 30GB in size, in 10 hours) and then, supported by a Database Management System (DBMS) such as MySQL, query the needed information in SQL. Instead, we are proposing a service to deal with another type of query; thinking in requirements terms, our specification would be:

Given a query, e.g. "music application", the user should be able to download a zip file of Readmes with extension .txt, numbered by order of the results' appearance, with each document named with the project and its owner name.
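As an illustration of this specification, the sketch below shows one way it could be realized with the GitHub API v3, which the tool relies on. It is a minimal Python sketch, not the published corpus-retrieval script: the function name fetch_readme_corpus and the file-naming scheme are our own illustrative choices. It searches repositories matching a query, downloads each project's Readme in raw form, writes it as a numbered .txt file named after the owner and project, and zips the result.

<pre>
import os
import shutil
import requests

GITHUB = "https://api.github.com"
SEARCH_HEADERS = {"Accept": "application/vnd.github.v3+json"}  # add an auth token for higher rate limits

def fetch_readme_corpus(query, out_dir="corpus", pages=2):
    """Save the Readme of each repository matching `query` as NNN_owner_project.txt."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for page in range(1, pages + 1):
        search = requests.get(f"{GITHUB}/search/repositories",
                              params={"q": query, "per_page": 100, "page": page},
                              headers=SEARCH_HEADERS).json()
        for repo in search.get("items", []):
            # GET /repos/{owner}/{repo}/readme returns the Readme stored at the project root
            readme = requests.get(f"{GITHUB}/repos/{repo['full_name']}/readme",
                                  headers={"Accept": "application/vnd.github.v3.raw"})
            if readme.status_code != 200:      # no Readme, or not located at the root
                continue
            count += 1
            owner, name = repo["full_name"].split("/")
            with open(os.path.join(out_dir, f"{count:04d}_{owner}_{name}.txt"),
                      "w", encoding="utf-8") as fh:
                fh.write(readme.text)
    shutil.make_archive(out_dir, "zip", out_dir)   # the zip package named in the specification

fetch_readme_corpus("music application")
</pre>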
Another barrier to accomplishing our goal with the (Gousios, 2013) work is that the schema of the GHTorrent database dump (http://ghtorrent.org/dblite/) does not contain a table related to Readme information. However, this tool would be useful for retrieving GitHub Issues.

Another constraint that motivated us to build a service is that Readmes can vary daily on GitHub, when new projects are created or when existing ones change their relevance. This relevance is given by the Forks, Stars, Pull-requests, number of Issues, and Comments a project receives.

3 GitHub Perspectives for Requirements Elicitation

Using the concepts of viewpoints and perspectives (do Prado Leite and Freeman, 1991), we, as requirements engineers, see GitHub in the following way. A project, specifically an application, can express a viewpoint, i.e. a way to address what the user needs in a certain domain. Each project, a viewpoint, may use several representations (perspectives) to describe itself. These GitHub perspectives are: Readmes, Issues, Issue's Comments, Commits, Commit's Comments, and Gits. We argue that each of these artifacts expresses a perspective of a particular viewpoint (project) because, in the Readme perspective, a user is able to see a summary of the features that the application implements. In the Issues perspective it is possible to get more specialized information about features (e.g. bugs or enhancements). Even more, it is possible to see the decisions (Comments perspective) taken about an issue before it is implemented.

3.1 The Readme Artifact

(Kupiec et al., 1995) state that "Abstracts are sometimes used as full document surrogates, for example as an input to text search systems, but they also speed access by providing an easily digested intermediate point between a document's title and its full text, that is useful for rapid relevance assessment". We judge that Readmes play the role of abstracts in the GitHub environment. Fig.1 shows the Readme document of the project android-node-music-sync from user benkaiser. This project was found with the query "music application android". Using the GitHub API v3.0 to access the data, we obtained its raw version (Fig.2). As we analyzed the raw data, we realized that in our ongoing research we will be facing the mining of documents of different natures, that is, structured data: source code; semi-structured data: documents with markup such as html, xml, and markdown (a lightweight markup language with plain text formatting syntax designed so that it can be converted to HTML; source: Wikipedia), among others; and unstructured data: free text in comments and other documents. This time, by using the Readme document, we are dealing with semi-structured texts, as most Readmes follow the predefined markdown format. For instance (see Fig.2), to indicate a url, they used [texto](url). In other exemplars we found ![alt text](image path) to indicate an image.

Figure 1: A Readme on GitHub
Figure 2: Raw Readme data
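As an illustration of how this markup can be handled when Readmes are normalized to plain text for the corpus, the following minimal Python sketch (illustrative only, not code from the published tool; the function name strip_basic_markdown is ours) removes the two markdown constructs mentioned above with regular expressions: image markup is dropped and only the link text is kept.

<pre>
import re

def strip_basic_markdown(text):
    """Remove the two markdown constructs discussed above from a raw Readme."""
    # ![alt text](image path): drop image markup entirely (must run before the link rule)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # [texto](url): keep only the link text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    return text

print(strip_basic_markdown("See [the docs](http://example.org) and ![logo](img/logo.png)"))
# -> "See the docs and "
</pre>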
3.2 The Requirement-Related Information

What we pursue with a corpus of Readmes is the finding of requirement-related information, that is, phrases that can be mined to give an idea of a project's purposes. Thus, the reader can reuse this knowledge for learning or for generating new ideas in requirements elicitation tasks. From the Readme in Fig.1, some candidate phrases to be mined would be:

"A syncing application for Android to sync playlist from Node Music Player to an Android phone"

"This app does not actually play the music on your phone, it just syncs the songs and the playlist across. You can use one of the following music players to play your music"

4 Working Towards the Tool

As we started to explore GitHub, we built a script to extract Readmes just for the query "Real Estate in:readme" (Portugal et al., 2015). This served our initial purpose of discovering ideas and finding domain-independent regularities (Arora et al., 2014; Ridao et al., 2001) that may allow us to find requirement-related information. Then, using SADT (Structured Analysis and Design Technique) (Chen, 1976), we modeled a process to address our approach. Fig.3 highlights one of its activities, Retrieve, which points to the construction of this tool.

Figure 3: SADT Model for Retrieval Process

4.1 The Retrieve Activity

The Retrieve activity describes the inputs: the domain-related query (search terms) and the GitHub open-source projects. With these, the projects that match the query are requested. The constraints: our process was designed to be suitable for any artifact with natural language descriptions, and the request limits of the GitHub API were considered (GitHub API v3: Rate Limiting, https://developer.github.com/v3/#rate-limiting). We took care of backward traceability; thus, once a Readme is in a corpus it is possible to locate its source on GitHub. A concern is the quantity of search results, limited to 1000; this fact made us think of a situation where project 1001 could be the interesting one for a requirements elicitor. Therefore, we created heuristics that take advantage of GitHub metadata to improve the recall of results. Finally, we had to deal with a variety of document extensions (.md, .rtf, .html, .doc, etc.) and normalize them to .txt before a Readme is inserted in the corpus. The only situation in which a Readme is not retrieved is when it is located outside the root of its project. The outputs: the corpus of Readmes in .txt format and the zipped package containing them.
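The rate-limit constraint and the backward-traceability concern can be sketched as follows. This is an illustrative fragment, assuming the standard X-RateLimit-* response headers of the GitHub API; the helper names are hypothetical and not taken from the published scripts. The first helper pauses when the request budget is exhausted, and the second records, for every Readme written to the corpus, the file name together with its project on GitHub.

<pre>
import csv
import time

def wait_for_rate_limit(response):
    """Sleep until the GitHub API quota resets when the remaining budget reaches zero."""
    remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
    if remaining == 0:
        reset_at = int(response.headers.get("X-RateLimit-Reset", time.time()))
        time.sleep(max(0, reset_at - time.time()) + 1)

def record_trace(index_path, txt_name, repo):
    """Append one corpus-file -> GitHub-project row, so every .txt can be traced back."""
    with open(index_path, "a", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow([txt_name, repo["full_name"], repo["html_url"]])
</pre>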
4.2 Heuristics

In order to bring back more than 1000 results through the search, our tool explored five of the possible sortings a user may apply: best match, most stars, fewest stars, most forks, and fewest forks. Each sorting becomes a new query. We combine them to surpass the 1000-result limitation (Fig.4), and with the current GitHub API we were able to perform those five queries in a single task.

Figure 4: Querying by combining sorting options

There is a possibility of leaving out many projects (see the cells in gray and red); that is because only the first 1000 results of any sorting query are shown, and after that the order of project relevance is not guaranteed. We had another concern, which is the project rating given by users, by giving a star or performing a fork, resulting in projects repeated across the sorting operations. Our heuristic uses a union operation in order to capture those intersections. Finally, we organize the corpus in the order shown in Fig.5. As a user would not be able to get this extra coverage through the GitHub website, we consider that we improved the recall of results.

Figure 5: Organizing GitHub results
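A minimal sketch of this union heuristic is shown below, assuming the Search API's sort and order parameters (stars/forks, ascending/descending) plus the default best-match ordering; the code is illustrative, not the published script. Results from the five sortings are merged into a dictionary keyed by the repository full name, so projects repeated across sortings are counted only once.

<pre>
import requests

GITHUB_SEARCH = "https://api.github.com/search/repositories"

# best match (default order) plus most/fewest stars and most/fewest forks
SORTINGS = [{},
            {"sort": "stars", "order": "desc"}, {"sort": "stars", "order": "asc"},
            {"sort": "forks", "order": "desc"}, {"sort": "forks", "order": "asc"}]

def union_of_sortings(query, pages_per_sorting=10):
    """Union the results of the five sorting options, keyed by repository full name."""
    merged = {}
    for sorting in SORTINGS:
        for page in range(1, pages_per_sorting + 1):   # 10 pages x 100 results = the 1000 cap
            params = {"q": query, "per_page": 100, "page": page, **sorting}
            items = requests.get(GITHUB_SEARCH, params=params).json().get("items", [])
            for repo in items:
                merged.setdefault(repo["full_name"], repo)   # repeated projects counted once
    return list(merged.values())
</pre>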
4.3 Tool Presentation

The tool presentation (Fig.6) is simple and serves only the purpose of retrieving a corpus of Readmes given a query; the tool is available at http://corpus-retrieval.herokuapp.com/. To continue with the GitHub spirit of providing open-source software (OSS) to the community, we made the script code available (https://github.com/nitanilla/corpus-retrieval), together with a complementary script developed to support the performance of the application in web browsers (https://github.com/nitanilla/github-proxy).

Figure 6: Web application for corpus retrieval

5 Analyzing the Raw Data Retrieved

Our assumption that a corpus of Readmes could be useful for finding requirements-related information is based on the knowledge reuse literature as well as on our own evidence. This premise can be questioned, as the Readme is a brief user documentation of a software project and may seem a hardly reliable way of obtaining requirements for one's own project. In fact, the best one can discover is what features these other projects offer for later reuse. The usefulness of Readme documents lies in the identification of relevant projects, for later exploration of other perspectives (Issues, Commits), which probably do contain more data with the stereotype of requirements. In this regard, we are looking for the type of reuse that (Goldin and Berry, 2015) describe: "Reuse can take place during any phase of a computed-based system development, including during proposal consideration and marketing analysis, requirements elicitation, requirements analysis, architecture design, code implementation, and testing. . . Thus, reusing requirements can be most beneficial, because if it leads to off-the-shelf reuse of the required product, resulting in greatest reduction of development effort and time to market".

We wanted to test two hypotheses:

Hypothesis 1: A corpus builder of Readmes permits the finding of features using a similar-projects-based approach.

Hypothesis 2: The mined requirements-related information is useful for reuse.

For this, we built a corpus for the "music application" query with 1206 Readmes and took a representative sample of 291 Readmes to be read manually with the aim of finding reusable knowledge. Once some phrases in context with high chances of being reused were identified, they were shown to a music application startup (Hear: https://www.facebook.com/apphear). It is worth noting that the selection of Readmes was conducted randomly with a script we created for future tests (https://github.com/nitanilla/Random-Readme).

5.1 Findings

A remarkable finding from our notes is the GitHub limitation of not supporting phrase queries (https://help.github.com/articles/searching-code/#considerations-for-code-search), which results in Readmes only vaguely related to the "music application" query. This happens because some Readmes contained just the word "music" and others only "application". This fact impacts the precision of filtering relevant projects within a created corpus.

As we are investigating patterns to anchor requirements, the manual reading allowed us to see some patterns, motivated by the work of (Arora et al., 2014). We identified six concurrent patterns (Table 1) and then mined them over the entire corpus (2016 Readmes) to obtain their frequency of appearance.

Table 1: Requirement Patterns in Readmes
  Requirement Pattern    # of Readmes
  to provide             77
  which can              57
  can be used to         43
  should be able to      38
  that allows you to     31
  allows users to        28

To answer hypotheses 1 and 2, we selected the pattern with the lowest rank, "allows users to". We grouped 861 Readmes with similar file size (0kb-1kb), and mining this pattern with the qdap package for R (Goodrich et al., 2016) resulted in 14 matching Readmes, from which a manual extraction of phrases in context was done.
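The mining itself was done with the qdap package for R; as a rough, hypothetical equivalent in Python (illustrative only, not the code we used), the sketch below counts with plain regular expressions how many corpus files contain each pattern of Table 1 and collects the sentences matching "allows users to" for manual inspection.

<pre>
import glob
import re

PATTERNS = ["to provide", "which can", "can be used to",
            "should be able to", "that allows you to", "allows users to"]

def pattern_frequencies(corpus_dir="corpus"):
    """Count in how many corpus files each requirement pattern appears; keep matches in context."""
    counts = {p: 0 for p in PATTERNS}
    in_context = []                         # (file, sentence) pairs for "allows users to"
    for path in glob.glob(f"{corpus_dir}/*.txt"):
        with open(path, encoding="utf-8", errors="ignore") as fh:
            text = fh.read()
        for p in PATTERNS:
            if re.search(re.escape(p), text, flags=re.IGNORECASE):
                counts[p] += 1
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if "allows users to" in sentence.lower():
                in_context.append((path, sentence.strip()))
    return counts, in_context
</pre>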
Being Conference on Knowledge Technologies and Data- built for the current release. driven Business, i-KNOW ’14, pages 13:1–13:8. ACM. Notes and Future work. We are testing the Leah Goldin and Daniel M. Berry. 2015. Reuse of phrases with other two members of the music ap- requirements reduced time to market at one indus- plication Startup, with the intention to perceive trial shop: a case study. Requirements Engineering, how much they differ in points of view as they 20(1):23–44. have different background profiles. We are also Bryan Goodrich, D Kurkiewicz, and Tyler Rinker. working on identify more requirement patterns 2016. Bridging the gap between qualitative data and and test them in different Corpus of Readmes. quantitative analysis. 6 Conclusion Georgios Gousios. 2013. The ghtorent dataset and tool suite. In Proceedings of the 10th Working Con- Mining existing information is becoming a strong ference on Mining Software Repositories, MSR ’13, ally in the process of requirements elicitation, pages 233–236, Piscataway, NJ, USA. IEEE Press. since more and more information is being stored Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. with open access in the web. GitHub as an open A trainable document summarizer. In Proceedings 53 of the 18th Annual International ACM SIGIR Con- ference on Research and Development in Informa- tion Retrieval, SIGIR ’95, pages 68–73, New York, NY, USA. ACM. Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Alti- gran S. da Silva, and Juliana S. Teixeira. 2002. A brief survey of web data extraction tools. SIGMOD Rec., 31(2):84–93. Lynne M. Markus. 2001. Toward a theory of knowl- edge reuse: Types of knowledge reuse situations and factors in reuse success. Journal of Management In- formation Systems, 18(1):57–93. Cade Metz. 2016. Triple play: Githubs code now lives in three places at once. Wired, Last accessed 08-14- 2016. Roxana L.Q. Portugal, Julio Cesar. S. do Prado Leite, and E. Almentero. 2015. Time-constrained require- ments elicitation: reusing github content. In Just- In-Time Requirements Engineering (JITRE), 2015 IEEE Workshop on, pages 5–8. IEEE. Antoinette Renouf. 2003. Webcorp: providing a re- newable data source for corpus linguists. Language and Computers, 48(1):39–58. M. Ridao, J. Doorn, and Julio Cesar. S. do Prado Leite. 2001. Domain independent regularities in scenar- ios. In Requirements Engineering, 2001. Proceed- ings. Fifth IEEE International Symposium on, pages 120–127. J. Sinclair. 2005. Corpus and Text - Basic Princi- ples in Developing Linguistic Corpora: a Guide to Good Practice. Appendix: How to build a Corpus. Oxford-Oxbow Books. 54