<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Taxonomy for User Feedback Classifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rubens Santos</string-name>
          <email>rubens.santos@iese-extern.fraunhofer.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard C. Groen</string-name>
          <email>eduard.groen@iese.fraunhofer.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karina Villela</string-name>
          <email>karina.villela@iese.fraunhofer.de</email>
        </contrib>
        <aff>Fraunhofer IESE, Kaiserslautern, Germany</aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Online user feedback contains information that is of interest to requirements engineering (RE). Natural language processing (NLP) techniques, especially classification algorithms, are a popular way of automatically classifying requirements-relevant contents. Research into this use of NLP in RE has sought to answer different research questions, often causing their classifications to be incompatible. Identifying and structuring these classifications is therefore urgently needed. We present a preliminary taxonomy that we constructed based on the findings from a systematic literature review, which places 78 classification categories for user feedback into four groups: Sentiment, Intention, User Experience, and Topic. The taxonomy reveals the purposes for which user feedback is analyzed in RE, provides an initial harmonization of the vocabulary in this research area, and may inspire researchers to investigate classifications they had previously not considered. This paper intends to foster discussions among NLP experts and to identify further improvements to the taxonomy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>summarization; [Ber17]). However, nearly all research has sought to provide more detailed sub-classifications of
requirements-relevant content (see Section 3 for a review). It is at the level of these predefined categories, or
classifications, that CrowdRE research diverges, by using classifications that are at best only modestly compatible
with those of other works analyzing user feedback from different RE-relevant perspectives. These
differences make it harder to find the best classification for a particular usage scenario.</p>
      <p>Previously, an ontology has been proposed for types of user feedback [MPG15], but we do not know of any
previous effort that combines classifications for RE in a comprehensive taxonomy in a way that would (1) help to
understand the purposes for which user feedback can be classified, and (2) contribute to an initial harmonization
of the focus and vocabulary of the research in this area. This is why in this paper we present a preliminary
taxonomy of classification categories based on an investigation of existing literature on this topic. We present our
taxonomy at this early stage in order to foster discussions among RE and NLP experts, and to get inspiration
for further improvements to the taxonomy. This contribution is of an analytic nature as it intends to introduce
some degree of order in the proliferation of classifications. It is not meant to impose a standardization governing
which classifications to use; on the contrary, we hope to inspire researchers and practitioners to use classifications
not previously considered. Through this work, we intend to answer the following research questions:</p>
      <p>RQ1: Which classification categories have research publications used to classify requirements-relevant
content?</p>
      <p>RQ2: How can the identified classification categories be structured into a taxonomy?</p>
      <p>RQ3: What are possible analysis purposes for which each category of the taxonomy can be used?</p>
      <p>In Section 2, we describe the methodology we applied to answer our research questions, followed by a
presentation of the resulting taxonomy in Section 3. Section 4 presents a discussion of possible uses of the taxonomy
in practice, and in Section 5, we conclude and provide perspectives on further developing the taxonomy.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>In this section, we first discuss the approach we employed to identify relevant literature (Section 2.1), followed
by a presentation of our methodology for deriving our taxonomy (Section 2.2).</p>
      <sec id="sec-2-1">
        <title>Systematic Literature Review</title>
        <p>Within the scope of a larger benchmarking study, we performed a systematic literature review (SLR) [KC07] to
obtain a comprehensive and broad overview of the literature on classifying user feedback. We used an SLR to
exclude any potential selection bias and prevent gaps in our research. As part of this effort, we noticed that
our set of systematically obtained literature works proposed and used many disjunct classification structures
and categories. This finding led us to launch an effort towards harmonizing these classification categories,
resulting in the taxonomy presented in this work.</p>
        <p>Our SLR protocol specifies the research questions, a search strategy including explicit inclusion and exclusion
criteria, and the information to be extracted from the primary research found (cf. [KC07]). We defined the
following research questions for the SLR:</p>
        <p>Overall Objective: What are the state-of-the-art automated approaches for assisting the task of
requirements extraction from user feedback acquired from the crowd, and which NLP techniques and features do
they use?</p>
        <p>Objective 1: Regarding requirements elicitation from user feedback acquired from the crowd, what are the
state-of-the-art automated approaches for classifying user feedback?</p>
        <p>Objective 2: How do such approaches classify user feedback?</p>
        <p>Objective 2.1: What are the different sets of categories in which user feedback is classified?</p>
        <p>Objective 2.2: Which automated techniques are used?</p>
        <p>Objective 2.3: What are the characteristics of the user feedback these approaches aim to classify?</p>
        <p>To perform our search, we composed a search string by defining search terms, many of which are common
terms from known literature, and tested these in different combinations. We also used previously identified
papers to verify whether the search string would correctly find these publications. The final search string was as
follows:</p>
        <p>(("CrowdRE" OR "Crowd RE") OR ((("User Review" OR "User Feedback" OR "App Review" OR
"Feature Requests" OR "User Opinions" OR "User Requirements")) AND (Classif* OR Framework OR
Tool OR "Text Analysis" OR Mining OR "Feature Extraction") AND "Requirements Engineering"))</p>
        <p>We selected papers according to the eight exclusion criteria (EC) and two inclusion criteria (IC) listed below.
A paper meeting one or more ECs was excluded from the selection, while a paper meeting one or more ICs and
no ECs was included in the selection.</p>
        <p>EC1: The paper is not written in English.</p>
        <p>EC2: The paper was published before 2013.1</p>
        <p>EC3: The work or study is not published in a peer-reviewed venue.</p>
        <p>EC4: The paper is not related to RE for software products and/or the title is clearly not related to the
research questions.</p>
        <p>EC5: The paper does not address the topic of requirements extraction from user feedback analysis.</p>
        <p>EC6: The paper proposes a tool or dataset that does not aim to assist a requirements extraction process
from online user reviews, or could not be used in this way; for example, recommender systems, information
retrieval for search engines, or approaches that link source code changes to bug fixes.</p>
        <p>EC7: The paper proposes an approach or tool that does not process textual user feedback. For example,
approaches that analyze implicit feedback, process requirements documents, or merely collect user feedback
instead of processing it.</p>
        <p>EC8: The paper proposes an approach that does not make use of automation because the user feedback
analysis is done entirely manually; for example, crowdsourced requirements elicitation.</p>
        <p>IC1: The paper proposes an approach for filtering out irrelevant user feedback from raw data, regardless of
whether or not this is done using classification techniques.</p>
        <p>IC2: The paper proposes an approach for classifying user feedback into default predetermined categories.</p>
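        <p>As an illustration, the selection rule above (a paper is included only if it meets at least one IC and no EC) can be sketched as follows; the two example predicates are simplified stand-ins for EC2 and IC2, and the field names are our own illustration rather than part of the SLR tooling:</p>

```python
# Sketch of the SLR selection rule: include a paper only if no exclusion
# criterion (EC) applies and at least one inclusion criterion (IC) applies.
def select(paper, exclusion_criteria, inclusion_criteria):
    if any(ec(paper) for ec in exclusion_criteria):
        return False
    return any(ic(paper) for ic in inclusion_criteria)

# Simplified stand-ins for EC2 (published before 2013) and IC2 (classifies
# user feedback into predetermined categories); field names are hypothetical.
def ec2(paper):
    return paper["year"] < 2013

def ic2(paper):
    return paper["classifies_user_feedback"]
```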
        <p>We first applied our search string to search for suitable papers in March 2018, using three prominent scientific
databases on software engineering research: ACM, Springer, and IEEE Xplore. Exclusion criteria EC1–EC3 were
applied directly through database filters. This query returned a combined result of 1,219 papers. After removing
duplicates and screening the title and abstract (to which EC4–EC8 and IC1–IC2 were applied), 146 papers
remained. These included papers for which the results of our title and abstract analysis were inconclusive, i.e.,
papers where we were uncertain whether they matched our selection criteria. Further
scanning of the introduction and conclusion sections, to which EC4–EC8 were re-applied, reduced the number
of papers for data extraction to a total of 40. This work was performed by the first author of this paper, and
the third author cross-checked a random subset to assure the quality of this work. Any disagreements were
discussed and resolved. We repeated the query on 18 December 2018 to include papers that had been added over
the course of 2018, which resulted in 14 new papers, of which 3 were relevant to our SLR, for a total number
of 43 analyzed papers. Due to space restrictions, we present the complete list of primary papers in a separate
document2, and reference them in this document with the notation Pn.</p>
        <p>Serving the overall goal of the SLR, i.e., to prepare a benchmarking study, we systematically extracted three
major groups of data from the selected papers:</p>
        <p>Dataset-related information, such as dataset size in number of entries, object granularity (sentence vs.
review), source (e.g., app stores, social media), and mean text object size.</p>
        <p>NLP techniques applied, such as algorithms, parsers, ML features, and text pre-processing techniques.</p>
        <p>Classification categories into which the tool was designed to classify user feedback, along with their
definitions, where available. We also paid specific attention to any explicit rationales behind design decisions
made for a tool to understand for which goal or under which circumstances specific categories are best used.</p>
        <p>The aggregated overview of the third group of data, "classification categories", revealed that a benchmarking
study would be impeded by the use of different categories. This finding led to our efforts to derive a taxonomy.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Taxonomy Derivation</title>
        <sec id="sec-2-2-1">
          <title>We established our taxonomy of user feedback classifications in five steps:</title>
          <p>1This year was chosen because prior to the introduction of CrowdRE in 2015 [GDA15], the analysis of online user feedback
via NLP for RE was not considered to be a serious source of requirements. We additionally considered six years as a technical
obsoleteness threshold to fit our paper selection efforts to our resource constraints.</p>
          <p>2Bibliography of primary studies: zenodo.org/record/2546422, doi:10.5281/zenodo.2546422</p>
          <p>Step 1: Collect and complete categories. Having gathered the various classification categories as part of our
SLR, we created an overview listing the categories used in each paper, along with their source. We then verified
that we had identified all the relevant information from each paper.</p>
          <p>Step 2: Merge similar categories. Many of the primary studies presented their own category definitions. To
organize them, we inspected their definitions in order to identify similar classification categories that intend to
filter the same type of text but have a different name. We then determined the most appropriate name and
description for this category. If a paper did not define or explain what the categories used should filter, we
assumed that they adopted the same definition as the papers they discussed in their related work section. If any
doubt persisted, we contacted the authors by email.</p>
          <p>Here is an example of how we merged categories: The category "Feature Request" received this name because
it was the most prevalent name in the literature, although it combines the categories "User Requirements" from
P20, "Functional Requirements" from P24, "Feature Requests" from P25, and "Request" from P6, all of which
we found to refer to texts containing requests for functional enhancements, either by implementing new features
or by enhancing existing ones. We then based the definition of this category on the definitions found in P1
and P31. For space reasons, we provide the complete overview of the 78 merged categories3, along with their
definitions and references to the papers in which they were found, in a separate document4. The names of all
classification categories are also shown in the taxonomy.</p>
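          <p>The merge step can be pictured as a simple synonym mapping; the entries below are taken from the "Feature Request" example above, while the mapping structure itself is only our illustration:</p>

```python
# Map paper-specific category names onto the merged canonical name chosen
# in Step 2 (example entries from the "Feature Request" merge).
SYNONYMS = {
    "User Requirements": "Feature Request",        # P20
    "Functional Requirements": "Feature Request",  # P24
    "Feature Requests": "Feature Request",         # P25
    "Request": "Feature Request",                  # P6
}

def canonical(category):
    # Categories without a merge entry keep their original name.
    return SYNONYMS.get(category, category)
```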
          <p>Step 3: Group related categories. The studies in P11 and P24 on quality-related aspects used the ISO 25010
software product quality characteristics [ISO10], while P2, P17, P26 and P27 based their work on notable
publications from user experience (UX) [BAH11, KR08, Nie93] and the ISO 25010 quality-in-use characteristics
[ISO10]. These served as the initial framework for clustering our categories because most other papers did not
compose the categories or their definitions systematically. Similarity in definitions or even names allowed us to
draw parallels to these standardized structures. However, we also took heed not to include characteristics that
cannot be found in user feedback according to research, such as "Maintainability" in software product quality, as
found in P11. Similarly, we omitted the ISO 25010 quality-in-use characteristics "Freedom from Risk" and
"Context Coverage" with their sub-characteristics because these were not found in the UX research, possibly because
it might be impossible to estimate them based on the opinion of users. Conversely, we included refinements of
these frameworks found in the literature. For example, "Battery" in P4 and P20 refines "Resource Utilization".
Relationships between papers, for example papers written by some of the same authors or referencing similar
works, were used as indicators that particular categories could be grouped. For example, the 21 categories on
UX were found in six papers that aimed to identify UX-related information in user reviews. This is how, in
addition to the aforementioned papers on UX, we identified that P26 focuses on a higher-level goal in which the
trait of UX is only one classification, while P8 and P25 identify UX traits without further distinguishing them.</p>
          <p>After having made attributions based on the standardized framework, we sought to identify patterns among
the remaining categories so that they could be organized into conceptually distinct groups. Sentiment-related
categories clearly stood out, even though we are aware that some works, such as P11 and P18, juxtapose them
with other classification categories. All other categories were initially sorted according to what they aim to filter
from the text. Moreover, two works suggested additional categorization structures: types of topics was suggested
in P25, which we found to be compatible with the ISO 25010 software product quality characteristics, and the
author's intention was suggested in P35, which we adopted and expanded through discussions with peers. In
this way, we came up with the four groupings in our taxonomy and succeeded in assigning all categorizations
to a single group, except for "Learnability", which appears twice in our taxonomy (under Topic and UX) due to
its proximity to concepts in both groups.</p>
          <p>Step 4: Identify subgroups. Once we had established the four main groups with their categories, we subdivided
them into logical subgroups to provide even more structure. For example, we found the group Topic to contain
all categories of user feedback that address topics regarding the software product, specifically general statements,
particular functions or qualities of the product, or aspects from its extended context.</p>
          <p>Step 5: Validate taxonomy. Finally, we performed an early validation of our taxonomy through individual
commenting sessions with five domain experts (three RE experts and two UX experts), three of whom have
experience in both academia and industry. Their feedback predominantly led to making clearer distinctions
or partially reorganizing some clusters of categorizations. The resulting preliminary taxonomy is presented in
Section 3.</p>
          <p>3"Learnability" appears twice in our taxonomy, but is counted once.</p>
          <p>4Table of classification categories: zenodo.org/record/2577863, doi:10.5281/zenodo.2577863</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Taxonomy</title>
      <p>By grouping the categories identified in the literature, we composed the taxonomy shown in Figure 1. Rounded
rectangles in the taxonomy represent classification categories from the literature. One-way arrows signify a
subset relationship between categories. Each category except for "Requirements-Irrelevant" is covered under
"Requirements-Relevant". Similarly, "Satisfaction" includes "Trust", "Pleasure", "Comfort", and "Utility".
Swimlanes within each group represent logical subgroups to further organize the classification categories.
Double-sided arrows show explicit antagonists, i.e., categories that cannot be assigned to the same text snippet as a
matter of principle.</p>
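      <p>The two relation types in the figure (one-way subset arrows and double-sided antagonist arrows) could be encoded as follows; this is a minimal sketch using only the categories named in the text, not the full taxonomy:</p>

```python
# Subset relations (one-way arrows) and antagonists (double-sided arrows)
# from the taxonomy figure; only relations named in the text are shown, and
# listing the four groups under "Requirements-Relevant" is a simplification.
SUBSET = {
    "Requirements-Relevant": {"Sentiment", "Intention", "User Experience", "Topic"},
    "Satisfaction": {"Trust", "Pleasure", "Comfort", "Utility"},
}
ANTAGONISTS = {("Requirements-Relevant", "Requirements-Irrelevant")}

def compatible(a, b):
    # Antagonist categories cannot label the same text snippet.
    return (a, b) not in ANTAGONISTS and (b, a) not in ANTAGONISTS
```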
      <p>The premise of this taxonomy builds on the distinction between "informative" content and other,
non-informative content that according to P3 does not contribute to RE purposes. We renamed this distinction
"Requirements-Relevant" and "Requirements-Irrelevant". The definition we use for "Requirements-Relevant"
does not significantly differ from "Informative", but we had to change the scope for the category
"Requirements-Irrelevant" because it includes several categories that have been used in the literature to discard certain types
of text: "Other", "Noisy", "Unclear", "Unrelated" from P13, "Non-Bug" from P19, and "Miscellaneous and
Spam" from P41.</p>
      <p>The primary papers proposed a wide range of different classification categories, such as 14 unique categories
in P26 and 23 unique categories in P2, while others classified requirements-relevant feedback into just three
overall categories, such as "Suggestions for New Features", "Bug Reports", and "Other" in P39. Our taxonomy
consists of four groups of user feedback classifications: "Sentiment", "Intention", "User Experience", and "Topic",
which we will describe separately in the following subsections. Note that these categorizations are not mutually
exclusive, but can also be used in combination, which we will further discuss in Section 4.</p>
      <sec id="sec-3-1">
        <title>Sentiment</title>
        <p>We found several papers on CrowdRE research, P1, P21, P23 and P33, that applied sentiment analysis: a
commonly applied NLP technique that determines the extent to which texts or elements of such texts are
positive or negative. Most sentiment analysis techniques search for predefined sentiment-related words cataloged
in dictionaries such as SentiWordNet or AFINN to assign a word-specific sentiment score on a bipolar scale
ranging from very negative (e.g., -2) to very positive (e.g., +2) to calculate a total score for a sentence or an
entire text, like in P33 and P42.</p>
        <p>Some techniques merely distinguish between "Positive", "Negative", and "None" (i.e., neutral), like in P1
and P42, treating sentiment analysis as a binary or ternary classification problem. In addition to determining
the polarity, some review classification tools used for CrowdRE have suggested classification categories such as
"Praise" and "General Complaint", as suggested in P14, which enable them to make a better assessment of how
users perceive the product even if only short user feedback is given.</p>
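        <p>As a minimal sketch of the dictionary-based scoring and ternary polarity decision described above, with a toy lexicon standing in for dictionaries such as SentiWordNet or AFINN:</p>

```python
# Toy sentiment lexicon with word scores on a bipolar scale (-2 .. +2);
# real tools look words up in dictionaries such as SentiWordNet or AFINN.
LEXICON = {"great": 2, "good": 1, "slow": -1, "terrible": -2, "crash": -2}

def sentence_score(sentence):
    # Sum the word-level scores; unknown words count as neutral (0).
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return sum(LEXICON.get(w, 0) for w in words)

def polarity(score):
    # Ternary classification: "Positive", "Negative", or "None" (neutral).
    if score > 0:
        return "Positive"
    if score == 0:
        return "None"
    return "Negative"
```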
        <p>According to P12, associating sentiment analysis with information from other classifications such as Topic
(see Section 3.4) can reveal user acceptance levels regarding specific aspects of the software, based on which the
aspects receiving the most criticism can be prioritized to be improved first.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Intention</title>
        <p>According to several publications by one research group, P9, P10, P35, and P36, understanding the motivation
or drive behind why a user provides feedback can help determine the requirements of this user. This notion
underlies the classification according to the user's intention or goal, which we could subdivide into requesting,
informing, and reporting intention.</p>
        <p>Informing user reviews typically seek to persuade or dissuade other crowd members to use the product or to
provide a justification for why a particular star rating was given. P13 asserts that users will often describe what
was poor or excellent about their interaction with the product. The category "Job Advertisement" may seem
unusual in this context, but was used in P14 to classify user feedback on Twitter regarding a job offering at a
software company that may be of interest to non-technical stakeholders such as marketing representatives, and
for the general public. When informing user feedback also addresses aspects of a product that are present or
absent, it may provide interesting topics (see Section 3.4).</p>
        <p>Reporting user reviews intend to inform the developer of the product of a problem or defect the user found,
which will often be a "Bug Report". Because bug reports are usually objective descriptions of problems and
quality issues found in already present features, according to P25 they are a popular user feedback classification
type for identifying possible functional or quality requirements. For this reason, P26 further subdivides them
into categories such as "Network Issue", "Unexpected Behavior", or "Crashing". These will often coincide with
classifications of quality aspects (see Section 3.4).</p>
        <p>Requesting user reviews harbor the type of user feedback classification found most frequently in the primary
studies we analyzed in our work, namely "Feature Requests", which according to P26 represent requests from
users to add new (or reintroduce previous) functional aspects to the product, or to remove, modify, or enhance
existing features. Users may also make requests to improve a particular quality, for example to make the product
faster, more reliable, or more compatible with other systems, or they may place a request to receive information
about the product.</p>
      </sec>
      <sec id="sec-3-3">
        <title>User Experience</title>
        <p>An important aspect of user feedback according to P11 is that it is written by users who report on their practical
experience with a software product. As a result, aspects of UX relate to user requirements because they reflect
the users' perceptions of the product or their response to the use or anticipated use of this product. This
is why UX and RE are often addressed together in development activities such as elicitation, prototyping, and
testing. We found several works, P2, P17, and P27, that sought to classify parts of texts according to UX-related
dimensions. According to the ISO 9241-210 standard, the opinions found in user reviews on UX are shaped by a
user's personal emotions, beliefs, preferences, perceptions, physical and psychological responses, behaviors, and
accomplishments that occur before, during, and after use [ISO09]. What distinguishes the classifications in this
group from others is that they are of a more subjective nature [MHBR02]. As a result, a classification regarding
a UX aspect does not primarily result in explicit suggestions, but rather provides information about the users'
emotions, motivation, and expectations. These may be indicative of problems (e.g., as a source of frustration)
or well-liked features (e.g., as a source of excitement).</p>
        <p>Several classifications in this group coincide with some of the ISO 25010 quality-in-use characteristics [ISO10],
specifically "Efficiency", "Effectiveness" (called "Errors and Effectiveness" in P2 and P17), and "Satisfaction"
with its four sub-characteristics "Trust", "Pleasure", "Comfort", and "Usefulness". The second and third
subgroups in this category sort the classifications into user-oriented perception, which involves emotional and
behavioral aspects of the user, and product-oriented perception, which are opinions that can be attributed to the
product or its context. Similar to sentiment analysis (see Section 3.1), the perception of users regarding the
UX can provide an indication of product acceptance because greater enjoyment with the product will increase
acceptance. Together, according to P2 these analyses can help determine how users react to individual features.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Topics</title>
        <p>The final group of classifications assesses the software product, its aspects, and its context as specific topics
on which users share their opinion in user feedback. This may reveal actual requirements if the user provided
sufficient information. The classifications in this group are distinguished into user opinion pertaining to the
functions, quality, or context of a product, and are described in P4, P9, P10, P11, P13, P26, P27, P35, and P36.</p>
        <p>Product quality aspects often involve variations of the ISO 25010 software product quality characteristics
[ISO10], such as in P4, P11, P26, and P40, with some classifications going into more detail than the standard
prescribes (e.g., specifically categorizing user feedback on "Battery" in P4). Product context aspects found in P9,
P10, P35 and P36 deal with the functional aspects of interoperability, with planned extensions mostly to other
software products, discussions of content created or accessible through the product, the behavior of the product
in a specific version, and general opinions; the latter do not specify which aspect of the product a user finds good
or bad and do not necessarily pertain to the product itself. Other product-related aspects include the
users' opinions on the pricing, the development company or team, and the quality of the service they provide,
as well as comparisons users make between the product and competitor products to describe what functionality
is missing or unique in the product. Classifications on topics may correlate with classifications on user intention
(see Section 3.2), especially when users address a product function to make a request, whereas bug reports often
address defects in product quality.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Practical Application</title>
      <p>A key finding of this work is that the types of classifications can be placed into four main groups: Sentiment,
Intention, UX, and Topic. The classifications in these four groups of our taxonomy are conceptually different.
This not only means that they will produce different results, but also that they may be better suited for some
purposes than for others, which we will explore in this section.</p>
      <p>From the primary papers and our own experience, we derived seven common RE activities that can benefit
from input from user feedback analysis. Most papers pursued only one kind of RE activity, except for some
exploratory works, such as P14. The activities are listed in Table 1, where we indicate which of the classification
groups are better suited than others. An example of how this overview can be read: The activity of eliciting
requirements from user feedback is most likely to benefit from a classification according to topic to identify user
feedback that addresses the quality, functions, or context of a product. Additionally, assessing user feedback
by its intention may reveal requesting user feedback and help to specifically find feature requests. Conversely,
the Sentiment and UX categories are less suited because they may only lead to requirements indirectly, usually
requiring a manual inspection to find them.</p>
      <p>Overall, we found that for each activity, two or three groups are suited. Moreover, each of the four classification
groups can serve multiple purposes within RE, showing that they are suitable for obtaining different kinds of
RE-relevant knowledge, provided users disclose this knowledge in their feedback. These findings can be interpreted
in three ways:</p>
      <p>The application of the classification categories of a particular group may be suitable for more purposes
than the ones to which they have been applied so far. For example, UX analysis has focused mainly on
product acceptance and usage context, but would also be useful for identifying unique selling propositions
and potential process improvements.</p>
      <p>One may choose to apply classification according to just one suitable group. This choice may depend on
a trade-off between the amount of work required to perform the analysis versus the quality of the results,
as some types of analyses are relatively easy to set up (e.g., sentiment analysis), while others can provide
deeper insights if more effort is spent on tailoring them to RE.</p>
      <p>This outcome also suggests that the classification groups are not mutually exclusive and can support each
other when used in combination. For example, Sentiment and Topic could provide sentiment scores and
extracted features, respectively, which could then be aggregated to obtain an overview of the best- and
worst-rated product functions.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we presented a practical auxiliary finding from an SLR into research on the classification of
user feedback in CrowdRE research, namely, a preliminary taxonomy of the classi cation categories found in
this research. We found a total of 78 unique classi cation categories, counting the duplicate occurrence of
\Learnability" once (RQ1). For space reasons, we had to make the table listing the classi cation categories
available as a separate le5, but their names are all shown in the taxonomy. Even though the number of unique
classi cation categories was higher than anticipated, it did con rm our suspicion that the lack of an existing
structure has caused a proliferation of classi cation categories in CrowdRE research, which became especially
evident from the di erent names being used for the same concepts. Moreover, several primary papers failed to
explain how their categories were chosen or to provide a clear de nition of these categories, suggesting that some
of the categories found were constructed rather freely. Conversely, only ve papers, P2, P11, P17, P24, and
P40, based their category de nitions entirely on a formal standard. To structure the large number of categories,
we took a systematic approach towards establishing a taxonomy (shown in Figure 1), which consists of four
main groups: Sentiment, Intention, UX, and Topic (RQ2), revealing the four predominant foci of identifying
information in user feedback that is of relevance to RE. Finally, we assessed how suitable the classi cations
of each group are for typical RE-related analyses, which revealed that for most purposes, classi cations from
di erent groups can be used (RQ3). The choice of classi cation will often depend on a trade-o between the
degree of detail required and the ease of con guring and performing the analysis.</p>
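As a minimal illustration of combining Sentiment and Topic classifications as described above, the following Python sketch averages per-feature sentiment scores to rank the best- and worst-rated product functions. The feature names and scores are invented example data, and the aggregation (a plain mean over scores in [-1, 1]) is one simple choice among many, not a method prescribed by the SLR.

```python
from collections import defaultdict

# Hypothetical classifier outputs: each user review yields an extracted
# feature (Topic group) and a sentiment score in [-1, 1] (Sentiment group).
# The feature names and scores below are invented example data.
classified_reviews = [
    ("search", -0.8), ("search", -0.4),
    ("login", 0.9), ("login", 0.7),
    ("export", 0.1),
]

def rank_features(pairs):
    """Average the sentiment score per feature and sort best-rated first."""
    scores = defaultdict(list)
    for feature, score in pairs:
        scores[feature].append(score)
    means = {f: sum(s) / len(s) for f, s in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_features(classified_reviews)
# ranking[0] is the best-rated product function, ranking[-1] the worst-rated
```

In practice the sentiment scores and feature labels would come from the respective classifiers rather than a hand-written list, but the aggregation step stays this small.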
      <p>One aspect that was revealed through the taxonomy is that its groups are not mutually exclusive, and that certain aspects can be identified through different classifications; most notably bug reports and feature requests. We argue that this is no contradiction to the way this classification is structured, but rather a logical result of the strong correlation between the categories. For example, although Sentiment is a category of its own, the degree to which user feedback is positive or negative often also underlies the other three groups. We also observe similar overlaps between categories of authoritative standards; for example, according to the ISO 25010 standard [ISO10], poor maintainability during development will likely affect reliability at runtime, with the distinction being the perspective taken. Our taxonomy does not seek to impose a standardization, but rather to be a constructive source of inspiration for research and industry applications. It is also intended as a first step towards introducing harmonization between the kinds of analysis performed and the naming used for the categories.</p>
      <p>The premise of this taxonomy was its bottom-up construction, in which we organized the existing classification categories used in the literature. Although it would be possible to theorize about including other potentially useful categorizations in our taxonomy, we present only those categories that research has confirmed to be appropriate for classifying user reviews, omitting those that have been shown or assumed to not be found in user feedback (see Section 2.2 for examples). Moreover, due to the nature of the research, we considered only those categories that have been applied in research studies; an assessment of the categories used in commercial tools available on the market may reveal additional categories. We intend to further validate this taxonomy with specialists in the fields of software quality assurance, RE, and UX, and to test its practical applicability as a framework for selecting appropriate classification categories depending on the goal of the user feedback analysis. Furthermore, we believe the taxonomy could be part of a quality framework with guidelines regarding best practices for using classification categories for RE. Such a framework could include metrics for evaluating the quality of classification tools, and the taxonomy could serve as a means for standardizing the classification categories in order to facilitate benchmarking with regard to the quality of the results produced by different tools.</p>
      <p>⁵Table of classification categories: zenodo.org/record/2577863, doi:10.5281/zenodo.2577863</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank the experts, including Dr. Fabiano Dalpiaz, Dr. Jörg Dörr, and Dr. Nash Mahmoud, for reviewing earlier versions of the taxonomy. We thank Sonnhild Namingha from Fraunhofer IESE for proofreading this article.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[BAH11] Javier A. Bargas-Avila and Kasper Hornbæk. Old wine in new bottles or novel challenges? A critical analysis of empirical studies of user experience. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), pages 2689–2698, 2011.</p>
      <p>[Ber17] Daniel Berry. Evaluation of tools for hairy requirements and software engineering tasks. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE) Workshops, pages 284–291, 2017.</p>
      <p>[CLH+14] Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, and Boshen Zhang. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering (ICSE), pages 767–778, 2014.</p>
      <p>[GDA15] Eduard C. Groen, Joerg Doerr, and Sebastian Adam. Towards crowd-based requirements engineering: A research preview. In Samuel A. Fricker and Kurt Schneider, editors, Requirements Engineering: Foundation for Software Quality, pages 247–253, Cham, 2015. Springer.</p>
      <p>[GIG17] Emitza Guzman, Mohamed Ibrahim, and Martin Glinz. A little bird told me: Mining tweets for requirements and software evolution. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE), pages 11–20, 2017.</p>
      <p>[GKH+17] Eduard C. Groen, Sylwia Kopczynska, Marc P. Hauer, Tobias D. Krafft, and Joerg Doerr. Users – The hidden software product quality experts? A study on how app users report quality aspects in online reviews. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE), pages 80–89, 2017.</p>
      <p>[GSA+17] Eduard C. Groen, Norbert Seyff, Raian Ali, Fabiano Dalpiaz, Joerg Doerr, Emitza Guzman, et al. The crowd in requirements engineering: The landscape and challenges. IEEE Software, 34(2):44–52, March/April 2017.</p>
      <p>[GSK+18] Eduard C. Groen, Jacqueline Schowalter, Sylwia Kopczynska, Svenja Polst, and Sadaf Alvani. Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual analysis. In Klaus Schmid and Paolo Spoletini, editors, Requirements Engineering: Foundation for Software Quality (REFSQ) Joint Proceedings of the Co-Located Events, CEUR Workshop Proceedings 2075, 2018.</p>
      <p>[IH13] C. Iacob and R. Harrison. Retrieving and analyzing mobile apps feature requests from online reviews. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pages 41–44, 2013.</p>
      <p>[ISO10] ISO/IEC. ISO/IEC 25010 – Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. Technical report, ISO/IEC, 2010.</p>
      <p>[JM17] Nishant Jha and Anas Mahmoud. Mining user requirements from application store reviews using frame semantics. In P. Grunbacher and A. Perini, editors, Requirements Engineering: Foundation for Software Quality (REFSQ), LNCS 10153, pages 273–287, Cham, 2017. Springer.</p>
      <p>[KC07] B. A. Kitchenham and S. Charters. Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, School of Computer Science and Mathematics, Keele University, 2007.</p>
      <p>[KR08] Pekka Ketola and Virpi Roto. Exploring user experience measurement needs. In Proceedings of the 5th COST294-MAUSE Open Workshop on Valid Useful User Experience Measurement (VUUM), pages 23–26, 2008.</p>
      <p>[LL17] Mengmeng Lu and Peng Liang. Automatic classification of non-functional requirements from augmented app user reviews. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering (EASE), pages 344–353, 2017.</p>
      <p>[MN15] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? On automatically classifying app reviews. In Proceedings of the IEEE 23rd International Requirements Engineering Conference (RE), pages 116–125, 2015.</p>
      <p>[MPG15] Itzel Morales-Ramirez, Anna Perini, and Renata Silva Souza Guizzardi. An ontology of online user feedback in software engineering. Applied Ontology, 10(3–4):297–330, 2015.</p>
      <p>[Nie93] Jakob Nielsen. Usability Engineering. Morgan Kaufmann, San Francisco, 1993.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>ISO. ISO</source>
          <volume>9241</volume>
          -210:
          <article-title>Ergonomics of human-system interaction { Part 210: Human-centred design for interactive systems</article-title>
          .
          <source>Technical report, ISO</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [WM17]
          <string-name>
            <given-names>Grant</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anas</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          .
          <article-title>Mining Twitter feeds for software user requirements</article-title>
          .
          <source>In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [WZL+18]
          <string-name>
            <given-names>Chong</given-names>
            <surname>Wang</surname>
          </string-name>
          , Fan Zhang, Peng Liang, Maya Daneva, and Marten van Sinderen.
          <article-title>Can app changelogs improve requirements classification from app reviews? An exploratory study</article-title>
          .
          <source>In Proceedings of the ACM/IEEE 12th International Symposium on Empirical Software Engineering and Measurement (ESEM)</source>
          ,
          <source>Article</source>
          <volume>43</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>