=Paper=
{{Paper
|id=Vol-1743/paper5
|storemode=property
|title=A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation
|pdfUrl=https://ceur-ws.org/Vol-1743/paper5.pdf
|volume=Vol-1743
|authors=Roxana Lisette Quintanilla Portugal,Hugo Roque,Julio Cesar Sampaio do Prado Leite
|dblpUrl=https://dblp.org/rec/conf/simbig/PortugalRL16
}}
==A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation==
A Corpus Builder: Retrieving Raw Data from GitHub for Knowledge Reuse In Requirements Elicitation Roxana Lisette Quintanilla Portugal, Hugo Roque, Julio Cesar Sampaio do Prado Leite Departamento de Informtica, PUC-Rio / Rio de Janeiro RJ 22451-9000, Brasil {rportugal, julio}@inf.puc-rio.br hugo.roque@aluno.puc-rio.br Abstract company, hired to do the job, is not familiar with the domain and would have to quickly gain lever- Requirement elicitation is an important age on the contextual knowledge to better collab- task, which can lead to cost reduction in orate with the musicians, as well as to build a the overall software process, as it avoids proper requirements for future developers. This failures due to lack of proper understand- contextual knowledge must be both related to the ing about what to build. However, usu- client side, but also to the possible software ecol- ally, there is a lack of time devoted to ogy where the application will operate. proper elicitation during software con- We depart from the assumption that struction. We assume information from requirements-related information can be elicited similar projects is a valuable knowledge from Big Data, in this case we use the software for requirements engineers when facing a repository GitHub, since this source owns, to date, new project in the same or related domain, more than 35 million of projects in its repository and its acquisition can be speeded up by (Metz, 2016). This assumption is founded on the knowing their main features. This infor- evidence that projects on GitHub has encoded mation is usually located in Readme doc- knowledge. Although this encoded knowledge is uments of GitHub. We present a tool that mainly represented in programming languages, helps in handle this large amount of infor- there are annotations in natural language that mation by retrieving a corpus of Readme describe the project purpose. Of course, those documents given a domain-related query. projects are different in several manners, either in It is described, in detail, how a corpus is quality of its contents, as well as in the level of in- created and stresses the importance of hav- formation provided in natural language. However, ing a quality corpus as base for data min- most of the projects we have retrieved from this ing, or as input for tools of qualitative data repository do provide some natural language texts analysis. i.e., the Readme document of each project, which helps in understanding a project purposes. Our 1 Introduction work is contextualized in what (Markus, 2001) Imagine the following situation: a group of mu- calls secondary knowledge miner which is defined sicians is looking to produce a music application; as “people who seek to answer new questions or they believe it could be a hit. They contacted angel create new knowledge through analysis of records investors, which are willing to invest, but needed produced by other people for different purposes” more details about the idea. As such, they decided and “extract knowledge from records that were to hire a requirements engineering company to or- collected by others, possibly unknown to the ganize the intentions, before contracting a soft- reusers. . . ”. Markus also noted that this reuse is ware developer company to build the application. not limited to structured data. “Although most The musicians overall idea is to have an applica- research on data mining has focused on structured tion user to know a city, a neighborhood, or a place data, this is data on databases or knowledge like a university, by the music that it is being lis- datasets, similar issues are likely to apply in the tened around. case of secondary reusers of documents”. It happens that the requirements engineering A requirements elicitor could perform a man- 48 ual revision of GitHub projects given a domain- use the Google API for gathering information; as related query; however, the reading of hundreds of such this mechanism does not cover the internal projects may not be efficient in time-constrained documents of GitHub projects. Another project, settings. For instance, a work from EMSE1 field very similar to ours is the GHTorent (Gousios, mentions that researchers manually extracted data 2013), which in fact, can accomplish more than from 32 publications published in digital libraries a retrieval of Readmes, making database dumps which it took 80 hours for two tasks: (1) extraction of all of the projects on GitHub. However, and (2) analysis of data (Ekaputra et al., 2014). we found some technical barriers, specially for Nowadays both Digital Libraries and GitHub own users not used to deal with this kind of technol- a plethora of data; on that ground we automate the ogy. For instance, to get projects related to a documents extraction task from projects hosted on query, the user may need to download a database GitHub, this time using its Readme perspective. dump (around 30GB size in 10 hours) and then Thus, a set of documents is ready for the analy- supported by a Database Management System sis task that can be performed manually assisted (DMBS) as MySQL, the user will be able to query by tools for qualitative data analysis. e.g., Atlas.ti in SQL format the information needed. Instead, or NVivo. Or to perform an automatic analysis by we are proposing a service to deal with other type using text-mining techniques. of queries; thinking in requirements, our specifi- The remainder of this paper is structured as fol- cation would be: lows. Section 2 provides a research baseline for Given a query e.g “music application” motivation. Section 3 explains the rationale for the user may be able to download a selecting artifacts in GitHub. Section 4 details the zip file of Readmes with extension .txt, design and construction of the tool. Section 5 de- numbered by order of result’s appear- scribes the qualitative analysis conducted in the ance, and each document should be corpus of Readmes for the domain “music appli- named with the project and its owner cation”. Section 6 concludes and points out future name. work. Another barrier to accomplish our goal with 2 Corpus of Documents (Gousios, 2013) work is that the schema of (Sinclair, 2005) states this principle when build- GHTorent database dump2 does not contain a ta- ing a Corpus: “The contents of a corpus should be ble related to Readmes information. However this selected regardless of their language, but accord- this tool would be useful when retrieving GitHub ing to their communicative function in the com- Issues. munity in which they occur”. In this respect, a Another constraint that motivated us to build a previous work (Portugal et al., 2015) performed service, is that Readmes can vary daily on GitHub, an exploratory research to verify to what extent is when new projects are created or when the existing feasible the Readme document for use in require- change its relevant. This relevance is given by the ments engineering. In this regard, the Readme per- Forks, Stars, Pull-request, number of Issues and spective of GitHub projects has the communica- Comments a project receives. tive function to describe the main features of a 3 GitHub Perspectives for Requirements project. Similarly, the Issues perspective has the communicative function of tracking the evolution Elicitation of software features. Using the concepts of viewpoints and perspectives Another aspect about the construction of a Cor- (do Prado Leite and Freeman, 1991) we, as re- pus is that it can be considered as the first step to- quirements engineers, see the GitHub in the fol- wards the building of a web extraction tool, specif- lowing way: A project, specifically an application, ically a Natural Language Processing NLP-based can express a viewpoint, i.e. a way to address what wrapper (Laender et al., 2002). A Similar ap- the user needs in certain domain. It happens that proach to extract data given a query can be found each project, viewpoint, may use several represen- in the tool Webcorp (Renouf, 2003) however, this tations (perspectives), to describe a project. These tool does not cope with our goal, as they mainly GitHub perspectives are: Readmes, Issues, Issue’s 1 2 Empirical Software Engineering http://ghtorrent.org/dblite/ 49 they also speed access by providing an easily di- gested intermediate point between a documents ti- tle and its full text, that is useful for rapid rel- evance assessment”. We judge that Readmes have the role of abstracts on GitHub environment. Fig.1 shows the Readme document of the project android-node-music-sync from user benkaiser. This project was found with the query “music ap- plication android”. Using the GitHub API v3.0 to access the data, we obtained its raw version Fig.2. As we analyzed the raw data, we figure it out that in our ongoing research, we will be facing the mining of documents of different na- ture, this is, structured data: source code, semi- structured data: documents with markups such us Figure 1: A Readme on GitHub html, xml, markdown3 among others, and unstruc- tured data: free texts in comments and other doc- uments. This time, by using the Readme docu- ment we are dealing with semi-structured texts, as most of the Readmes follow the predefined mark- down format. For instance (see Fig.2) to indicate an url, they used [texto](url). In other exemplars we found ![alt text](image path) to indicate an im- age. 3.2 The Requirement-Related Information What we pursue with a corpus of Readmes is the finding requirement-related information, which are phrases that can be mined to give an idea of the project purposes. Thus, the reader can reuse this knowledge for learning or generating new ideas in requirements elicitation tasks. From Readme (Fig.1) some candidate phrases to be mined would be: Figure 2: Raw Readme data “A syncing application for Android to sync playlist from Node Music Player to an Android phone” Comments, Commits, Commit’s Comments, and Gits. We argue that each of this artifacts express a “This app does not actually play the music on perspective of a particular viewpoint (project) be- your phone, it just syncs the songs and the cause, on the Readme perspective a user is able playlist across. You can use one of the following to see a summary of features that the application music players to play your music” implements. In the Issue perspective it is possible 4 Working Towards the Tool to get more specialized information about features (e.g. bugs or enhancements). Even more, it is pos- As we started to explore GitHub, we built a script sible to see the decisions (Comments perspective) to extract readmes just for the query “Real Estate taken about an issue before been implemented. in: readme” (Portugal et al., 2015). This serve us for our initial purpose of discover ideas and 3.1 The Readme Artifact find domain-independent regularities (Arora et al., (Kupiec et al., 1995) state that “Abstracts are 3 Markdown is a lightweight markup language with plain sometimes used as full document surrogates, for text formatting syntax designed so that it can be converted to example as an input to text search systems, but HTML. source: Wikipedia 50 Figure 4: Querying by combining sorting options Figure 5: Organizing GitHub results a Readme is inserted in the corpus. The unique Figure 3: SADT Model for Retrieval Process situation where a Readme is not retrieved, is when this is located out of the root of its project. The 2014), (Ridao et al., 2001) which may allow us to outputs: it is expected the corpus of Readmes find requirement-related information. Following, in .txt format and the package zipped containing using the SADT (Structured Analysis and Design them. Technique) (Chen, 1976) we modeled a process to address our approach. Fig.3 highlights one of the activities, Retrieve, which points the construction 4.2 Heuristics of this tool. In order to bring more than 1000 results through 4.1 The Retrieve Activity the search, our tool explored five of the possible The Retrieve describes the inputs: the domain- sorts a user may do: best match, most starts, related query (search terms) and the GitHub open- fewest stars, most forks, fewest forks. Each source projects. With this, it is requested the sorting becomes a new query. We combine them projects that match with the query. The con- to surpass the 1000 results limitation (Fig.4), and straints: our process was designed to be suit- with the current GitHub API we were able to per- able for any artifact with natural language descrip- form those five queries in a single task. tions. It was considered the request limits using There is a possibility of leaving out many the GitHub API4 . We took care in backward trace- projects (see the cells in gray and red), that is be- ability; thus, once a Readme is in a corpus it is cause, it is shown just the first 1000 results of any possible to locate its source on GitHub. A concern sorting query; after that is not guaranteed the or- is the quantity of search results limited to 1000, der of projects relevance. We had another con- this fact, made us to think in a situation where the cern, which is the project rating given by users, project 1001 could be the interesting one for an re- giving a star or performing a fork, resulting in quirements elicitor; therefore, we created heuris- projects repeated in any of the sorting operations. tics taking advantage of GitHub metadata to im- Our heuristic uses a union operation in order to prove the recall of results. Finally, we had to deal capture those intersections. Finally, we organize with a variety of document extensions (.md, .rtf, the corpus in the order shown in Fig.5. As a user .html, .doc, etc.) and normalize them to .txt before would not be able to get this extra through GitHub 4 GitHub API v3: Rate Limiting. website, we consider we improved the recall of re- https://developer.github.com/v3/#rate-limiting sults. 51 ing in greatest reduction of development effort and time to market”. We wanted to test two hypotheses: Hypothesis 1: A corpus builder of Readmes per- mits the finding of features using a similar-based projects approach. Hypothesis 2: The mined requirements-related- information is useful for reuse. For this, we built a corpus for the “music ap- plication” query with 1206 Readmes and took a representative sample of 291 Readmes to be read manually with the aim to find reusable knowledge. Figure 6: Web application for corpus retrieval Once some phrases in context with high chances of being re-used are identified, they were shown 4.3 Tool Presentation for a Music Aplication Startup. The tool5 presentation (Fig.6) is simple and just It is worth noting that the selection of Readmes for the purpose to retrieve a corpus of Readmes was conducted randomly with an script8 we cre- given a query. To continue with the GitHub ated for future test. spirit providing Open-source software (OSS) to 5.1 Findings the community, we made available the script6 code and also a complementary script7 developed to A remarkable finding from our notes is the GitHub support the performance of the application on web limitation9 which leads to not support a phrase browsers. query, resulting in Readmes vaguely related to the “music application” query. This happens because 5 Analyzing the Raw Data Retrieved some Readmes contained just the word “music” and others only with“application”. This fact im- Our assumption that a corpus of Readmes could be pacts the precision to filter relevant projects within useful for finding requirements-related informa- a corpus created. tion is based on knowledge reuse literature as well As we are investigating patterns to anchor re- as on our own evidence. This premise can be ques- quirements, the manual reading allowed us to see tionable, as the Readme is a brief user documen- some patterns motivated by the work of (Arora et tation of software projects and may seen hardly a al., 2014). We identified six concurrent patterns reliable way for obtaining requirements for your (Table 1) and then we mined them on the entire own project. In fact, The best one can discover corpus (2016 Readmes) to obtain its frequency of is what features these other projects offer for later appearance. reuse. The usefulness of Readme documents is the To answer hypothesis 1 and 2, we select the identification of relevant projects, for later explo- pattern with the lowest rank “allows users to”. ration of another perspectives (Issues, Commits) We grouped 861 Readmes with similar file size which probably does contain more data with the (0kb-1kb), and the mining of this pattern using stereotypical of requirements. In this regard, we the package Qdap (Goodrich et al., 2016) for R are looking for the type of reuse that (Goldin and project, resulted in 14 Readmes matching within Berry, 2015) state “Reuse can take place during these Readmes, then a manual extraction of any phase of a computed-based system develop- phrases in context was done. Below we show four ment, including during proposal consideration and phrases shown to a music application startup10 , marketing analysis, requirements elicitation, re- and in italic, the startup assessment. quirements analysis, architecture design, code im- plementation, and testing. . . Thus, reusing require- ments can be most beneficial, because if it leads to off-the-shelf reuse of the required product, result- 8 https://github.com/nitanilla/Random-Readme 9 5 http://corpus-retrieval.herokuapp.com/ https://help.github.com/articles/searching- 6 https://github.com/nitanilla/corpus-retrieval code/#considerations-for-code-search 7 10 https://github.com/nitanilla/github-proxy Hear: https://www.facebook.com/apphear 52 access repository for software projects is a strong Table 1: Requirement Patterns in Readmes candidate as an information source. However, Requirement Pattern In # of Readmes to use GitHub information, which is scattered in to provide 77 thousands of projects, there is a need to compose which can 57 a proper corpus, where mining heuristics could be can be used to 43 applied. should be able to 38 This work describes the challenges and what that allows you to 31 has been done to build a Requirements Engi- allows users to 28 neering oriented corpus taking GitHub project’s Readme as a source. 1.- Moment is a web application that stores all With the results so far, we are more close your special memories with the music you love. to build a way of reusing unstructured, semi- By tapping into Spotify’s Web API, Moment structured and structured information linked to allows users to bookmark music they enjoy in code, as to help the task of eliciting requirements- a journal format and navigate all their previous related information. As such, future work will fo- memories with music. A good idea to take into cus on mining heuristics and validation of their ap- consideration plication. 2.- SoundShack Leverages the Power of Broad- com’s Latest Wiced chip to Stream HD Audio References to Wifi enabled android devices. This Android C. Arora, M. Sabetzadeh, Briand L. C., and F. Zim- application allows users to syn-chronously stream mer. 2014. Requirement boilerplates: Transition music with no lag. User can also asynchronously from manually-enforced to automatically-verifiable natural language patterns. In Requirements Patterns manage speaker groupings, giving users the power (RePa), 2014 IEEE 4th International Workshop on, to listen to the football game on their living room pages 1–8. speakers and rock out to their patio speakers at Peter Pin-Shan Chen. 1976. The entity-relationship the same time. Being built for the current release. model—toward a unified view of data. ACM Trans. Database Syst., 1(1):9–36. 3.- Powerful web application that allows users to query music and instantly down-load that Julio Cesar. S. do Prado Leite and A. Freeman. 1991. Requirements validation through viewpoint resolu- music. The project was done using nodejs and tion. IEEE Transactions on Software Engineering, expressjs. A good idea to take into consideration. 17(12):1253–1269. 4.- This android application allows users to lis- Fajar J. Ekaputra, Estefanı́a Serral, and Stefan Biffl. 2014. Building an empirical software engineering ten to a music playlist tailored to them based on research knowledge base from heterogeneous data their mood. Their mood is extracted using senti- sources. In Proceedings of the 14th International ment analysis on diary en-tries in the app. Being Conference on Knowledge Technologies and Data- built for the current release. driven Business, i-KNOW ’14, pages 13:1–13:8. ACM. Notes and Future work. We are testing the Leah Goldin and Daniel M. Berry. 2015. Reuse of phrases with other two members of the music ap- requirements reduced time to market at one indus- plication Startup, with the intention to perceive trial shop: a case study. Requirements Engineering, how much they differ in points of view as they 20(1):23–44. have different background profiles. We are also Bryan Goodrich, D Kurkiewicz, and Tyler Rinker. working on identify more requirement patterns 2016. Bridging the gap between qualitative data and and test them in different Corpus of Readmes. quantitative analysis. 6 Conclusion Georgios Gousios. 2013. The ghtorent dataset and tool suite. In Proceedings of the 10th Working Con- Mining existing information is becoming a strong ference on Mining Software Repositories, MSR ’13, ally in the process of requirements elicitation, pages 233–236, Piscataway, NJ, USA. IEEE Press. since more and more information is being stored Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. with open access in the web. GitHub as an open A trainable document summarizer. In Proceedings 53 of the 18th Annual International ACM SIGIR Con- ference on Research and Development in Informa- tion Retrieval, SIGIR ’95, pages 68–73, New York, NY, USA. ACM. Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Alti- gran S. da Silva, and Juliana S. Teixeira. 2002. A brief survey of web data extraction tools. SIGMOD Rec., 31(2):84–93. Lynne M. Markus. 2001. Toward a theory of knowl- edge reuse: Types of knowledge reuse situations and factors in reuse success. Journal of Management In- formation Systems, 18(1):57–93. Cade Metz. 2016. Triple play: Githubs code now lives in three places at once. Wired, Last accessed 08-14- 2016. Roxana L.Q. Portugal, Julio Cesar. S. do Prado Leite, and E. Almentero. 2015. Time-constrained require- ments elicitation: reusing github content. In Just- In-Time Requirements Engineering (JITRE), 2015 IEEE Workshop on, pages 5–8. IEEE. Antoinette Renouf. 2003. Webcorp: providing a re- newable data source for corpus linguists. Language and Computers, 48(1):39–58. M. Ridao, J. Doorn, and Julio Cesar. S. do Prado Leite. 2001. Domain independent regularities in scenar- ios. In Requirements Engineering, 2001. Proceed- ings. Fifth IEEE International Symposium on, pages 120–127. J. Sinclair. 2005. Corpus and Text - Basic Princi- ples in Developing Linguistic Corpora: a Guide to Good Practice. Appendix: How to build a Corpus. Oxford-Oxbow Books. 54