Research on NLP for RE at the University of Hamburg: a Report

Davide Fucci, HITeC/University of Hamburg, Hamburg, Germany, fucci@informatik.uni-hamburg.de
Christoph Stanik, HITeC/University of Hamburg, Hamburg, Germany, stanik@informatik.uni-hamburg.de
Lloyd Montgomery, HITeC/University of Hamburg, Hamburg, Germany, montgomery@informatik.uni-hamburg.de
Zijad Kurtanović, University of Hamburg, Hamburg, Germany, kurtanovic@informatik.uni-hamburg.de
Timo Johann, University of Hamburg, Hamburg, Germany, johann@informatik.uni-hamburg.de
Walid Maalej, University of Hamburg, Hamburg, Germany, maaleej@informatik.uni-hamburg.de

Abstract

The Mobile Applied Software Technology (MAST) group at the University of Hamburg focuses its research on context-aware adaptive systems and the social side of software engineering. In the context of natural language processing for requirements engineering, the group has mostly focused on mining app store reviews. Currently, the group is involved in the OpenReq project, where natural language processing is used to recommend requirements from diverse sources (e.g., social media, issue trackers) and to improve the structural quality of existing requirements.

1 Research Group Overview

The MAST group at the University of Hamburg1 concentrates its research effort on context-aware adaptive systems (CAAS) and social software engineering (SSE), with a particular focus on the mobile service domain. CAAS observe their users and environments to create a context and automatically adjust and optimize their behavior to it. We are interested in context-aware recommender systems for supporting individuals as well as groups in accessing information, sharing information, and taking collective decisions in software engineering and management scenarios.
In our research, we aim to support different stakeholders—for example, requirements engineers, who can receive recommendations for release planning, or software developers, for whom recommender systems can suggest useful documentation.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.
1 https://mast.informatik.uni-hamburg.de/

SSE pertains to the social and human aspects of software engineering as well as to the engineering of social software. Within SSE, we recognize the importance of software socialness—the systematic involvement of end users and their communities in the software life cycle, from authoring documentation to development and integration tasks.

There are several synergies between these research topics and NLP for RE. With the advent of app stores, this is especially the case in the mobile services domain. Users produce large, complex, yet information-rich textual data on app stores, which can be analyzed using NLP approaches to extract requirements. At the same time, recommender systems leverage structured and semi-structured data to support the work of requirements engineers (e.g., requirements elicitation) together with other stakeholders (e.g., requirements negotiation).

2 Past Research on NLP for RE

This section summarizes, in ascending chronological order, the work done by the MAST research group that uses NLP to advance the state of the art in requirements engineering. Our investigations focus on user-driven requirements engineering. In particular, our NLP studies target user-generated textual content in review systems such as app stores (e.g., Google Play, Apple App Store, Amazon Appstore).

2.1 App Reviews

The data, scripts, and tools for the papers described in this subsection are available on the research group website2.

How Do Users Like this Feature?
A Fine Grained Sentiment Analysis of App Reviews

Guzman and Maalej [GM14] use NLP to extract app features from app reviews and to analyze the sentiment users express when discussing these features. For the feature extraction, they perform standard text preprocessing steps such as stop-word removal, lemmatization, and part-of-speech filtering. After the preprocessing, collocations are used to find app features in the reviews. The collocation process ignores word order, uses a window of three words, and only keeps a collocation if it appears in at least three reviews. Collocations containing similar words are then grouped, and the most frequent collocation within each group is selected as the representative name for that feature. In addition, a sentiment analysis is performed using SentiStrength [TBP+10]. This analysis shows how users express their opinion about specific features or in the review as a whole. SentiStrength calculates both a positive and a negative score for a given text, as both types of expressions can be part of a single text. As a result of this work, we can extract app features with an average f1-score of 55% and show how these features are perceived (e.g., positively or negatively) by the users.

On the automatic classification of app reviews

The paper by Maalej et al. [MKNS16] on the classification of app reviews is an extended version of the earlier work by Maalej and Nabil [MN15], which automatically classifies app reviews as bug reports, feature requests, user experience, or ratings. The paper approaches the classification problem by analyzing which classifier achieves better results and by trying different combinations of machine learning features. We consider both metadata and NLP-based information as machine learning features. The classification results are reported using only review metadata, only NLP-based machine learning features, or a combination of both.
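The combination of metadata and NLP-based machine learning features can be sketched as follows; the stop-word list and the metadata fields (rating, length) are illustrative and not the exact feature set used in the paper.

```python
import re

# Tiny illustrative stop-word list; real pipelines use larger ones.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "i"}

def text_features(review_text):
    """NLP-based features: bag-of-words counts after lowercasing
    and stop-word removal."""
    tokens = re.findall(r"[a-z']+", review_text.lower())
    feats = {}
    for tok in tokens:
        if tok not in STOPWORDS:
            feats["word=" + tok] = feats.get("word=" + tok, 0) + 1
    return feats

def metadata_features(rating, review_text):
    """Metadata features: star rating and review length in words."""
    return {"meta=rating": rating, "meta=length": len(review_text.split())}

def combined_features(rating, review_text):
    """Merge both feature groups into one dict for a classifier."""
    feats = text_features(review_text)
    feats.update(metadata_features(rating, review_text))
    return feats

feats = combined_features(1, "This app crashes when I open the camera")
```

A classifier can then be trained on the metadata features alone, the text features alone, or the merged dictionary, mirroring the three configurations compared in the paper.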
The data used in the approach are app reviews from the Google Play Store and the Apple App Store. The classification benchmark shows promising results, with f1-scores ranging from 89% to 99% for the four classes. Besides the classification, we developed a prototype of an analytics tool that aggregates the information retrieved from the classification. The tool shows, for example, how the number of bug reports evolved and how the four classes are distributed for an app across different app stores, and gives deeper insight by showing concrete reviews in each class. Finally, the tool was evaluated through interviews with nine practitioners, such as software developers and analysts. The interviews show that most practitioners need to filter out app reviews that do not contain useful information, such as "great app" or "I hate it".
2 https://mast.informatik.uni-hamburg.de/app-review-analysis/

SAFE: A Simple Approach for Feature Extraction from App Descriptions and App Reviews

In this paper, Johann et al. [JSM+17] describe a uniform approach (SAFE) that extracts app features from app descriptions and app reviews and matches the two sources. SAFE extracts app features without prior machine learning training, in order to analyze which features the app developers provide and to understand how users talk about them. To extract app features, we use NLP to analyze the structure of sentences. Through a qualitative analysis, we found 18 common part-of-speech patterns and four common sentence structures that describe app features. The extraction from app descriptions achieved an average f1-score of 46%, while the extraction from reviews achieved an average f1-score of 35%. After SAFE has extracted the app features from the app description and the reviews, the final step is to match the features mentioned in both sources. This information provides insights about the app, such as the identification of (un)popular features, feature requests, and bug reports.
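The pattern-based extraction idea can be sketched as follows. The two-entry pattern list and the hard-coded tagger lexicon are simplified stand-ins: SAFE itself relies on 18 part-of-speech patterns over text tagged by a real POS tagger.

```python
# Illustrative word-to-tag lexicon standing in for a real POS tagger.
TAGS = {"send": "VB", "share": "VB", "photo": "NN", "photos": "NN",
        "message": "NN", "messages": "NN", "video": "NN", "call": "NN"}

# Two sample patterns, e.g. verb-noun ("send messages"), noun-noun
# ("video call"); SAFE uses 18 such patterns.
PATTERNS = [("VB", "NN"), ("NN", "NN")]

def extract_features(sentence):
    """Return two-word app-feature candidates whose POS tags match
    one of the patterns."""
    tokens = sentence.lower().split()
    tagged = [(tok, TAGS.get(tok, "OTHER")) for tok in tokens]
    features = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if (t1, t2) in PATTERNS:
            features.append(w1 + " " + w2)
    return features

found = extract_features("You can send messages and share photos")
```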
The matching is performed in three steps. First, SAFE checks whether the terms contained in both sources (i.e., app description and app reviews) are identical. Second, we tackle language ambiguity by using WordNet to compare the synonyms of each word of the app feature. Third, SAFE computes the cosine similarity between the app features to find semantically similar matches. The matching procedure achieved an accuracy of 87%.

2.2 Mining User Rationale from Software Reviews

Kurtanović and Maalej [KM17b] introduce user rationale for requirements engineering. Motivated by the amount of data available in social media, user forums, and app stores, software vendors have started to pay increasing attention to these channels. Vendors want easy access to users' input to make better decisions about software design, development, and evolution. This work focuses on the identification of design and user rationale, which can be valuable for software and requirements engineering. Among other findings, we observed that rationale, alternatives, criteria, and decisions often co-occur in user comments and that in 21% to 70% of the cases they contain justifications. In this work, we studied 32,414 reviews of 52 software applications in the Amazon Store. To identify user rationale, we employ a supervised machine learning approach using textual, metadata, sentiment, and syntactic features and compare the results of three classification algorithms (Naive Bayes, Support Vector Machine, and Logistic Regression). The classification is tested with different configurations and predicts user rationale at the comment and sentence level. The precision and recall for all considered user rationale concepts range between 80% and 99% at the comment level and between 69% and 98% at the sentence level.

2.3 Other

In this section, we report our experience with topics beyond the application of NLP to app reviews that we deem interesting for the community.
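The third matching step above can be sketched as a cosine similarity between bag-of-words representations of two candidate features; the example phrases are made up, and the exact-term and WordNet-synonym steps are omitted.

```python
import math
from collections import Counter

def cosine_similarity(phrase_a, phrase_b):
    """Cosine similarity between two phrases represented as
    bag-of-words count vectors."""
    va = Counter(phrase_a.lower().split())
    vb = Counter(phrase_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sim = cosine_similarity("send photo message", "photo message sending")
```

Two features would then be matched when their similarity exceeds a chosen threshold.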
Toward Data-Driven Requirements Engineering

In this paper [MNJR16], we suggest a shift in the requirements engineering community toward including user feedback to enable user-centered, data-driven identification, prioritization, and management of software requirements. We show the importance of user feedback and explain what research has achieved so far. These achievements are scoped to the analytics of user feedback, such as classifying user feedback into bug reports and feature requests, classifying stakeholders, and summarizing user reviews. One primary focus of the paper is to show how these topics are addressed using NLP-based approaches.

Automatically Classifying Functional and Non-Functional Requirements Using Supervised Machine Learning

Kurtanović and Maalej [KM17a] use a Support Vector Machine classifier to automatically distinguish functional requirements (FRs) from non-functional requirements (NFRs) using metadata, lexical features, and syntactic features of the requirement text. We show how to classify fine-grained NFRs, such as Usability, Security, Operational, and Performance. From a methodological perspective, one contribution of this paper is the use of under- and over-sampling strategies to handle imbalanced data across the NFR classes. The classification of FRs and NFRs achieved an f1-score of up to 93%, while the classification of the more specific NFRs achieved f1-scores ranging between 51% and 82%.

3 Research Plan on NLP for RE

Currently, the group is involved in the H2020-funded project OpenReq.3 The goal of the project is to research, develop, and evaluate intelligent recommendation and decision technologies that will support communities and individual stakeholders in gathering and managing software requirements. In particular, OpenReq wants to bridge the gap between the development and the usage of software products and services.
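The over-sampling strategy mentioned above for the imbalanced NFR classes can be sketched as random over-sampling with a fixed seed; the class labels here are illustrative. Under-sampling would instead drop majority-class samples.

```python
import random

def oversample(samples):
    """samples: list of (text, label) pairs. Randomly duplicate
    minority-class samples until every class matches the size of
    the largest class."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    rng = random.Random(42)  # fixed seed for reproducibility
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # choices() samples with replacement; k=0 adds nothing.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

data = [("r1", "Usability"), ("r2", "Usability"), ("r3", "Usability"),
        ("r4", "Security")]
balanced = oversample(data)
```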
To that end, the project aims to take the user community into account as part of the innovation process and to continuously observe and involve stakeholders and end users in the decision-making process. OpenReq use cases cover open-source development, telecommunications, and railway bidding. In the context of the project, the group will apply NLP to two specific activities related to requirements engineering: i) deriving/improving requirements from unstructured text, and ii) improving the quality of existing requirements.

Activity i) is currently under development. It consists of collecting explicit user feedback from public channels, such as social media, review systems, ticketing systems, and discussion forums, and then aggregating and analyzing this large amount of data to facilitate stakeholders' understanding of users' needs and to help them react quickly. From an NLP perspective, we are using such data to tackle four tasks:

• provide features, based on statistical language processing (e.g., tf-idf, GloVe), for machine learning classifiers. Here, we want to differentiate between relevant and irrelevant feedback, as well as further categorize relevant feedback—e.g., understand whether it contains a request for a new feature, a complaint about an existing one, or both;
• perform sentiment analysis, to assess, for example, the user base's reception of a new feature and allow stakeholders to act accordingly;
• perform summarization, to facilitate access to this significant amount of data for stakeholders and decision makers; in this regard, we are interested in visualization techniques to support this task;
• perform named-entity recognition (NER) and topic recognition, to understand the specific areas in which the previous tasks can be applied.

These tasks are particularly interesting from a research point of view, as the language used in these texts is not only English but also Italian.
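The first task above, statistical features such as tf-idf for a feedback-relevance classifier, can be sketched in a few lines. This hand-rolled tf-idf is for illustration only; a real pipeline would use an off-the-shelf vectorizer.

```python
import math

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight}
    dict per document, using tf = count/len and idf = log(N/df)."""
    n = len(documents)
    df = {}  # number of documents each term appears in
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in documents:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        weighted.append({t: (c / len(doc)) * math.log(n / df[t])
                         for t, c in counts.items()})
    return weighted

docs = [["app", "crashes"], ["love", "the", "app"], ["crashes", "on", "login"]]
vectors = tf_idf(docs)
```

The resulting per-document weight vectors can feed any standard classifier separating relevant from irrelevant feedback.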
Moreover, since much of the data is collected from channels such as Twitter, the text tends to be short and colloquial.

Activity ii) is currently in a preliminary phase. Here, we will analyze requirements documents—either structured (e.g., user stories) or unstructured (e.g., free-form text). NLP techniques will be used to build a recommender system for improving structural properties of the requirements text. In particular, we expect to focus on the following tasks:

• word sense disambiguation and coreference resolution, to identify ambiguous passages in the requirement text and suggest corrective actions;
• chunking and relationship extraction, to assess (and eventually correct) conformance to templates, such as user stories;
• semantic role labeling and textual entailment, to assess the completeness of a requirement text with respect to several concerns (e.g., risk).

As these documents contain domain-specific knowledge, we are investigating the possibility of supporting the NLP approaches with ontologies and glossaries.

Acknowledgment

We would like to acknowledge the H2020 EU research project OpenReq (ID 732463).
3 http://openreq.eu/

References

[GM14] Emitza Guzman and Walid Maalej. How do users like this feature? A fine grained sentiment analysis of app reviews. In Requirements Engineering Conference (RE), 2014 IEEE 22nd International, pages 153–162. IEEE, 2014.
[JSM+17] Timo Johann, Christoph Stanik, Walid Maalej, et al. SAFE: A simple approach for feature extraction from app descriptions and app reviews. In Requirements Engineering Conference (RE), 2017 IEEE 25th International, pages 21–30. IEEE, 2017.
[KM17a] Zijad Kurtanović and Walid Maalej. Automatically classifying functional and non-functional requirements using supervised machine learning. In Requirements Engineering Conference (RE), 2017 IEEE 25th International, pages 490–495. IEEE, 2017.
[KM17b] Zijad Kurtanović and Walid Maalej. Mining user rationale from software reviews.
In Requirements Engineering Conference (RE), 2017 IEEE 25th International, pages 61–70. IEEE, 2017.
[MKNS16] Walid Maalej, Zijad Kurtanović, Hadeer Nabil, and Christoph Stanik. On the automatic classification of app reviews. Requirements Engineering, 21(3):311–331, 2016.
[MN15] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? On automatically classifying app reviews. In Requirements Engineering Conference (RE), 2015 IEEE 23rd International, pages 116–125. IEEE, 2015.
[MNJR16] Walid Maalej, Maleknaz Nayebi, Timo Johann, and Guenther Ruhe. Toward data-driven requirements engineering. IEEE Software, 33(1):48–54, 2016.
[TBP+10] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. Journal of the Association for Information Science and Technology, 61(12):2544–2558, 2010.