<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Taxonomy for User Feedback Classifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rubens Santos</string-name>
          <email>rubens.santos@iese-extern.fraunhofer.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard C. Groen</string-name>
          <email>eduard.groen@iese.fraunhofer.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karina Villela</string-name>
          <email>karina.villela@iese.fraunhofer.de</email>
        </contrib>
        <aff>Fraunhofer IESE, Kaiserslautern, Germany</aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Online user feedback contains information that is of interest to requirements engineering (RE). Natural language processing (NLP) techniques, especially classification algorithms, are a popular way of automatically classifying requirements-relevant contents. Research into this use of NLP in RE has sought to answer different research questions, often causing their classifications to be incompatible. Identifying and structuring these classifications is therefore urgently needed. We present a preliminary taxonomy that we constructed based on the findings from a systematic literature review, which places 78 classification categories for user feedback into four groups: Sentiment, Intention, User Experience, and Topic. The taxonomy reveals the purposes for which user feedback is analyzed in RE, provides an initial harmonization of the vocabulary in this research area, and may inspire researchers to investigate classifications they had previously not considered. This paper intends to foster discussions among NLP experts and to identify further improvements to the taxonomy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>summarization; [Ber17]). However, nearly all research has sought to provide more detailed sub-classifications of
requirements-relevant content (see Section 3 for a review). It is at the level of these predefined categories, or
classifications, that CrowdRE research diverges, by using classifications that are at best only modestly compatible
with those of other works analyzing user feedback from different RE-relevant perspectives. These
differences make it harder to find the best classification for a particular usage scenario.</p>
      <p>Previously, an ontology has been proposed for types of user feedback [MPG15], but we do not know of any
previous effort that combines classifications for RE in a comprehensive taxonomy in a way that would (1) help to
understand the purposes for which user feedback can be classified, and (2) contribute to an initial harmonization
of the focus and vocabulary of the research in this area. This is why in this paper we present a preliminary
taxonomy of classification categories based on an investigation of existing literature on this topic. We present our
taxonomy at this early stage in order to foster discussions among RE and NLP experts, and to get inspiration
for further improvements to the taxonomy. This contribution is of an analytic nature as it intends to introduce
some degree of order in the proliferation of classifications. It is not meant to impose a standardization governing
which classifications to use; on the contrary, we hope to inspire researchers and practitioners to use classifications
not previously considered. Through this work, we intend to answer the following research questions:</p>
      <p>RQ1: Which classification categories have research publications used to classify requirements-relevant
content?</p>
      <p>RQ2: How can the identified classification categories be structured into a taxonomy?</p>
      <p>RQ3: What are possible analysis purposes for which each category of the taxonomy can be used?</p>
      <p>In Section 2, we describe the methodology we applied to answer our research questions, followed by a
presentation of the resulting taxonomy in Section 3. Section 4 presents a discussion of possible uses of the taxonomy
in practice, and in Section 5, we conclude and provide perspectives on further developing the taxonomy.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>In this section, we first discuss the approach we employed to identify relevant literature (Section 2.1), followed
by a presentation of our methodology for deriving our taxonomy (Section 2.2).</p>
      <sec id="sec-2-1">
        <title>Systematic Literature Review</title>
        <p>Within the scope of a larger benchmarking study, we performed a systematic literature review (SLR) [KC07] to
obtain a comprehensive and broad overview of the literature on classifying user feedback. We used an SLR to
exclude any potential selection bias and prevent gaps in our research. As part of this effort, we noticed that
our set of systematically obtained literature works proposed and used many disjunct classification structures
and categories. This finding led us to launch an effort towards harmonizing these classification categories,
resulting in the taxonomy presented in this work.</p>
        <p>Our SLR protocol specifies the research questions, a search strategy including explicit inclusion and exclusion
criteria, and the information to be extracted from the primary research found (cf. [KC07]). We defined the
following research questions for the SLR:</p>
        <p>Overall Objective: What are the state-of-the-art automated approaches for assisting the task of
requirements extraction from user feedback acquired from the crowd, and which NLP techniques and features do
they use?</p>
        <p>Objective 1: Regarding requirements elicitation from user feedback acquired from the crowd, what are the
state-of-the-art automated approaches for classifying user feedback?</p>
        <p>Objective 2: How do such approaches classify user feedback?</p>
        <p>Objective 2.1: What are the different sets of categories in which user feedback is classified?</p>
        <p>Objective 2.2: Which automated techniques are used?</p>
        <p>Objective 2.3: What are the characteristics of the user feedback these approaches aim to classify?</p>
        <p>To perform our search, we composed a search string by defining search terms, many of which are common
terms from known literature, and tested these in different combinations. We also used previously identified
papers to verify whether the search string would correctly find these publications. The final search string was as
follows:</p>
        <p>(("CrowdRE" OR "Crowd RE") OR ((("User Review" OR "User Feedback" OR "App Review" OR
"Feature Requests" OR "User Opinions" OR "User Requirements")) AND (Classif* OR Framework OR
Tool OR "Text Analysis" OR Mining OR "Feature Extraction") AND "Requirements Engineering"))</p>
        <p>We selected papers according to the eight exclusion criteria (EC) and two inclusion criteria (IC) listed below.
A paper meeting one or more ECs was excluded from the selection, while a paper meeting one or more ICs and
no ECs was included in the selection.</p>
        <p>EC1: The paper is not written in English.</p>
        <p>EC2: The paper was published before 2013.1</p>
        <p>EC3: The work or study is not published in a peer-reviewed venue.</p>
        <p>EC4: The paper is not related to RE for software products and/or the title is clearly not related to the
research questions.</p>
        <p>EC5: The paper does not address the topic of requirements extraction from user feedback analysis.</p>
        <p>EC6: The paper proposes a tool or dataset that does not aim to assist a requirements extraction process
from online user reviews, or could not be used in this way; for example, recommender systems, information
retrieval for search engines, or approaches that link source code changes to bug fixes.</p>
        <p>EC7: The paper proposes an approach or tool that does not process textual user feedback. For example,
approaches that analyze implicit feedback, process requirements documents, or merely collect user feedback
instead of processing it.</p>
        <p>EC8: The paper proposes an approach that does not make use of automation because the user feedback
analysis is done entirely manually; for example, crowdsourced requirements elicitation.</p>
        <p>IC1: The paper proposes an approach for filtering out irrelevant user feedback from raw data, regardless of
whether or not this is done using classification techniques.</p>
        <p>IC2: The paper proposes an approach for classifying user feedback into default predetermined categories.</p>
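        <p>As an illustration, the selection rule above (a paper is included only if it meets at least one IC and no EC) can be sketched as follows; the two example predicates are simplified stand-ins for EC2 and IC2, and the field names are our own illustration rather than part of the SLR tooling:</p>

```python
# Sketch of the SLR selection rule: include a paper only if no exclusion
# criterion (EC) applies and at least one inclusion criterion (IC) applies.
def select(paper, exclusion_criteria, inclusion_criteria):
    if any(ec(paper) for ec in exclusion_criteria):
        return False
    return any(ic(paper) for ic in inclusion_criteria)

# Simplified stand-ins for EC2 (published before 2013) and IC2 (classifies
# user feedback into predetermined categories); field names are hypothetical.
def ec2(paper):
    return paper["year"] < 2013

def ic2(paper):
    return paper["classifies_user_feedback"]
```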
        <p>We first applied our search string to search for suitable papers in March 2018, using three prominent scientific
databases on software engineering research: ACM, Springer, and IEEE Xplore. Exclusion criteria EC1–EC3 were
applied directly through database filters. This query returned a combined result of 1,219 papers. After removing
duplicates and screening the title and abstract (to which EC4–EC8 and IC1–IC2 were applied), 146 papers
remained. These included papers for which the results of our title and abstract analysis were inconclusive, i.e.,
papers where we were uncertain whether they matched our selection criteria. Further
scanning of the introduction and conclusion sections, to which EC4–EC8 were re-applied, reduced the number
of papers for data extraction to a total of 40. This work was performed by the first author of this paper, and
the third author cross-checked a random subset to assure the quality of this work. Any disagreements were
discussed and resolved. We repeated the query on 18 December 2018 to include papers that had been added over
the course of 2018, which resulted in 14 new papers, of which 3 were relevant to our SLR, for a total number
of 43 analyzed papers. Due to space restrictions, we present the complete list of primary papers in a separate
document2, and reference them in this document with the notation Pn.</p>
        <p>Serving the overall goal of the SLR, i.e., to prepare a benchmarking study, we systematically extracted three
major groups of data from the selected papers:</p>
        <p>Dataset-related information, such as dataset size in number of entries, object granularity (sentence vs.
review), source (e.g., app stores, social media), and mean text object size.</p>
        <p>NLP techniques applied, such as algorithms, parsers, ML features, and text pre-processing techniques.</p>
        <p>Classification categories into which the tool was designed to classify user feedback, along with their
definitions, where available. We also paid specific attention to any explicit rationales behind design decisions
made for a tool to understand for which goal or under which circumstances specific categories are best used.</p>
        <p>The aggregated overview of the third group of data, "classification categories", revealed that a benchmarking
study would be impeded by the use of different categories. This finding led to our efforts to derive a taxonomy.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Taxonomy Derivation</title>
        <sec id="sec-2-2-1">
          <title>We established our taxonomy of user feedback classifications in five steps:</title>
          <p>1This year was chosen because prior to the introduction of CrowdRE in 2015 [GDA15], the analysis of online user feedback
via NLP for RE was not considered to be a serious source of requirements. We additionally considered six years as a technical
obsoleteness threshold to fit our paper selection efforts to our resource constraints.</p>
          <p>2Bibliography of primary studies: zenodo.org/record/2546422, doi:10.5281/zenodo.2546422</p>
          <p>Step 1: Collect and complete categories. Having gathered the various classification categories as part of our
SLR, we created an overview listing the categories used in each paper, along with their source. We then verified
that we had identified all the relevant information from each paper.</p>
          <p>Step 2: Merge similar categories. Many of the primary studies presented their own category definitions. To
organize them, we inspected their definitions in order to identify similar classification categories that intend to
filter the same type of text but have a different name. We then determined the most appropriate name and
description for this category. If a paper did not define or explain what the categories used should filter, we
assumed that they adopted the same definition as the papers they discussed in their related work section. If any
doubt persisted, we contacted the authors by email.</p>
          <p>Here is an example of how we merged categories: The category "Feature Request" received this name because
it was the most prevalent name in the literature, although it combines the categories "User Requirements" from
P20, "Functional Requirements" from P24, "Feature Requests" from P25, and "Request" from P6, all of which
we found to refer to texts containing requests for functional enhancements, either by implementing new features
or by enhancing existing ones. We then based the definition of this category on the definitions found in P1
and P31. For space reasons, we provide the complete overview of the 78 merged categories3, along with their
definitions and references to the papers in which they were found, in a separate document4. The names of all
classification categories are also shown in the taxonomy.</p>
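          <p>The merge step can be pictured as a simple synonym mapping; the entries below are taken from the "Feature Request" example above, while the mapping structure itself is only our illustration:</p>

```python
# Map paper-specific category names onto the merged canonical name chosen
# in Step 2 (example entries from the "Feature Request" merge).
SYNONYMS = {
    "User Requirements": "Feature Request",        # P20
    "Functional Requirements": "Feature Request",  # P24
    "Feature Requests": "Feature Request",         # P25
    "Request": "Feature Request",                  # P6
}

def canonical(category):
    # Categories without a merge entry keep their original name.
    return SYNONYMS.get(category, category)
```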
          <p>Step 3: Group related categories. The studies in P11 and P24 on quality-related aspects used the ISO 25010
software product quality characteristics [ISO10], while P2, P17, P26 and P27 based their work on notable
publications from user experience (UX) [BAH11, KR08, Nie93] and the ISO 25010 quality-in-use characteristics
[ISO10]. These served as the initial framework for clustering our categories because most other papers did not
compose the categories or their definitions systematically. Similarity in definitions or even names allowed us to
draw parallels to these standardized structures. However, we also took heed not to include characteristics that
cannot be found in user feedback according to research, such as "Maintainability" in software product quality, as
found in P11. Similarly, we omitted the ISO 25010 quality-in-use characteristics "Freedom from Risk" and
"Context Coverage" with their sub-characteristics because these were not found in the UX research, possibly because
it might be impossible to estimate them based on the opinion of users. Conversely, we included refinements of
these frameworks found in the literature. For example, "Battery" in P4 and P20 refines "Resource Utilization".
Relationships between papers, for example papers written by some of the same authors or referencing similar
works, were used as indicators that particular categories could be grouped. For example, the 21 categories on
UX were found in six papers that aimed to identify UX-related information in user reviews. This is how, in
addition to the aforementioned papers on UX, we identified that P26 focuses on a higher-level goal in which the
trait of UX is only one classification, while P8 and P25 identify UX traits without further distinguishing them.</p>
          <p>After having made attributions based on the standardized framework, we sought to identify patterns among
the remaining categories so that they could be organized into conceptually distinct groups. Sentiment-related
categories clearly stood out, even though we are aware that some works, such as P11 and P18, juxtapose them
with other classification categories. All other categories were initially sorted according to what they aim to filter
from the text. Moreover, two works suggested additional categorization structures: types of topics was suggested
in P25, which we found to be compatible with the ISO 25010 software product quality characteristics, and the
author's intention was suggested in P35, which we adopted and expanded through discussions with peers. In
this way, we came up with the four groupings in our taxonomy and succeeded in assigning all categorizations
to a single group, except for "Learnability", which appears twice in our taxonomy (under Topic and UX) due to
its proximity to concepts in both groups.</p>
          <p>Step 4: Identify subgroups. Once we had established the four main groups with their categories, we subdivided
them into logical subgroups to provide even more structure. For example, we found the group Topic to contain
all categories of user feedback that address topics regarding the software product, specifically general statements,
particular functions or qualities of the product, or aspects from its extended context.</p>
          <p>Step 5: Validate taxonomy. Finally, we performed an early validation of our taxonomy through individual
commenting sessions with five domain experts (three RE experts and two UX experts), three of whom have
experience in both academia and industry. Their feedback predominantly led to making clearer distinctions
or partially reorganizing some clusters of categorizations. The resulting preliminary taxonomy is presented in
Section 3.</p>
          <p>3"Learnability" appears twice in our taxonomy, but is counted once.</p>
          <p>4Table of classification categories: zenodo.org/record/2577863, doi:10.5281/zenodo.2577863</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Taxonomy</title>
      <p>By grouping the categories identified in the literature, we composed the taxonomy shown in Figure 1. Rounded
rectangles in the taxonomy represent classification categories from the literature. One-way arrows signify a
subset relationship between categories. Each category except for "Requirements-Irrelevant" is covered under
"Requirements-Relevant". Similarly, "Satisfaction" includes "Trust", "Pleasure", "Comfort", and "Utility".
Swimlanes within each group represent logical subgroups to further organize the classification categories.
Double-sided arrows show explicit antagonists, i.e., categories that cannot be assigned to the same text snippet as a
matter of principle.</p>
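      <p>The two relation types in the figure (one-way subset arrows and double-sided antagonist arrows) could be encoded as follows; this is a minimal sketch using only the categories named in the text, not the full taxonomy:</p>

```python
# Subset relations (one-way arrows) and antagonists (double-sided arrows)
# from the taxonomy figure; only relations named in the text are shown, and
# listing the four groups under "Requirements-Relevant" is a simplification.
SUBSET = {
    "Requirements-Relevant": {"Sentiment", "Intention", "User Experience", "Topic"},
    "Satisfaction": {"Trust", "Pleasure", "Comfort", "Utility"},
}
ANTAGONISTS = {("Requirements-Relevant", "Requirements-Irrelevant")}

def compatible(a, b):
    # Antagonist categories cannot label the same text snippet.
    return (a, b) not in ANTAGONISTS and (b, a) not in ANTAGONISTS
```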
      <p>The premise of this taxonomy builds on the distinction between "informative" content and other,
non-informative content that according to P3 does not contribute to RE purposes. We renamed this distinction
"Requirements-Relevant" and "Requirements-Irrelevant". The definition we use for "Requirements-Relevant"
does not significantly differ from "Informative", but we had to change the scope for the category
"Requirements-Irrelevant" because it includes several categories that have been used in the literature to discard certain types
of text: "Other", "Noisy", "Unclear", "Unrelated" from P13, "Non-Bug" from P19, and "Miscellaneous and
Spam" from P41.</p>
      <p>The primary papers proposed a wide range of different classification categories, such as 14 unique categories
in P26 and 23 unique categories in P2, while others classified requirements-relevant feedback into just three
overall categories, such as "Suggestions for New Features", "Bug Reports", and "Other" in P39. Our taxonomy
consists of four groups of user feedback classifications: "Sentiment", "Intention", "User Experience", and "Topic",
which we will describe separately in the following subsections. Note that these categorizations are not mutually
exclusive, but can also be used in combination, which we will further discuss in Section 4.</p>
      <sec id="sec-3-1">
        <title>Sentiment</title>
        <p>We found several papers on CrowdRE research, P1, P21, P23 and P33, that applied sentiment analysis: a
commonly applied NLP technique that determines the extent to which texts or elements of such texts are
positive or negative. Most sentiment analysis techniques search for predefined sentiment-related words cataloged
in dictionaries such as SentiWordNet or AFINN to assign a word-specific sentiment score on a bipolar scale
ranging from very negative (e.g., -2) to very positive (e.g., +2) to calculate a total score for a sentence or an
entire text, like in P33 and P42.</p>
        <p>Some techniques merely distinguish between "Positive", "Negative", and "None" (i.e., neutral), like in P1
and P42, treating sentiment analysis as a binary or ternary classification problem. In addition to determining
the polarity, some review classification tools used for CrowdRE have suggested classification categories such as
"Praise" and "General Complaint", as suggested in P14, which enable them to make a better assessment of how
users perceive the product even if only short user feedback is given.</p>
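        <p>As a minimal sketch of the dictionary-based scoring and ternary polarity decision described above, with a toy lexicon standing in for dictionaries such as SentiWordNet or AFINN:</p>

```python
# Toy sentiment lexicon with word scores on a bipolar scale (-2 .. +2);
# real tools look words up in dictionaries such as SentiWordNet or AFINN.
LEXICON = {"great": 2, "good": 1, "slow": -1, "terrible": -2, "crash": -2}

def sentence_score(sentence):
    # Sum the word-level scores; unknown words count as neutral (0).
    words = (w.strip(".,!?").lower() for w in sentence.split())
    return sum(LEXICON.get(w, 0) for w in words)

def polarity(score):
    # Ternary classification: "Positive", "Negative", or "None" (neutral).
    if score > 0:
        return "Positive"
    if score == 0:
        return "None"
    return "Negative"
```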
        <p>According to P12, associating sentiment analysis with information from other classifications such as Topic
(see Section 3.4) can reveal user acceptance levels regarding specific aspects of the software, based on which the
aspects receiving the most criticism can be prioritized to be improved first.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Intention</title>
        <p>According to several publications by one research group, P9, P10, P35, and P36, understanding the motivation
or drive behind why a user provides feedback can help determine the requirements of this user. This notion
underlies the classification according to the user's intention or goal, which we could subdivide into requesting,
informing, and reporting intention.</p>
        <p>Informing user reviews typically seek to persuade or dissuade other crowd members to use the product or to
provide a justification for why a particular star rating was given. P13 asserts that users will often describe what
was poor or excellent about their interaction with the product. The category "Job Advertisement" may seem
unusual in this context, but was used in P14 to classify user feedback on Twitter regarding a job offering at a
software company that may be of interest to non-technical stakeholders such as marketing representatives, and
for the general public. When informing user feedback also addresses aspects of a product that are present or
absent, it may provide interesting topics (see Section 3.4).</p>
        <p>Reporting user reviews intend to inform the developer of the product of a problem or defect the user found,
which will often be a "Bug Report". Because bug reports are usually objective descriptions of problems and
quality issues found in already present features, according to P25 they are a popular user feedback classification
type for identifying possible functional or quality requirements. For this reason, P26 further subdivides them
into categories such as "Network Issue", "Unexpected Behavior", or "Crashing". These will often coincide with
classifications of quality aspects (see Section 3.4).</p>
        <p>Requesting user reviews harbor the type of user feedback classification found most frequently in the primary
studies we analyzed in our work, namely "Feature Requests", which according to P26 represent requests from
users to add new (or reintroduce previous) functional aspects to the product, or to remove, modify, or enhance
existing features. Users may also make requests to improve a particular quality, for example to make the product
faster, more reliable, or more compatible with other systems, or they may place a request to receive information
about the product.</p>
      </sec>
      <sec id="sec-3-3">
        <title>User Experience</title>
        <p>An important aspect of user feedback according to P11 is that it is written by users who report on their practical
experience with a software product. As a result, aspects of UX relate to user requirements because they reflect
the users' perceptions of the product or their response to the use or anticipated use of this product. This
is why UX and RE are often addressed together in development activities such as elicitation, prototyping, and
testing. We found several works, P2, P17, and P27, that sought to classify parts of texts according to UX-related
dimensions. According to the ISO 9241-210 standard, the opinions found in user reviews on UX are shaped by a
user's personal emotions, beliefs, preferences, perceptions, physical and psychological responses, behaviors, and
accomplishments that occur before, during, and after use [ISO09]. What distinguishes the classifications in this
group from others is that they are of a more subjective nature [MHBR02]. As a result, a classification regarding
a UX aspect does not primarily result in explicit suggestions, but rather provides information about the users'
emotions, motivation, and expectations. These may be indicative of problems (e.g., as a source of frustration)
or well-liked features (e.g., as a source of excitement).</p>
        <p>Several classifications in this group coincide with some of the ISO 25010 quality-in-use characteristics [ISO10],
specifically "Efficiency", "Effectiveness" (called "Errors and Effectiveness" in P2 and P17), and "Satisfaction"
with its four sub-characteristics "Trust", "Pleasure", "Comfort", and "Usefulness". The second and third
subgroups in this category sort the classifications into user-oriented perception, which involves emotional and
behavioral aspects of the user, and product-oriented perception, which are opinions that can be attributed to the
product or its context. Similar to sentiment analysis (see Section 3.1), the perception of users regarding the
UX can provide an indication of product acceptance because greater enjoyment with the product will increase
acceptance. Together, according to P2 these analyses can help determine how users react to individual features.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Topics</title>
        <p>The final group of classifications assesses the software product, its aspects, and its context as specific topics
on which users share their opinion in user feedback. This may reveal actual requirements if the user provided
sufficient information. The classifications in this group are distinguished into user opinion pertaining to the
functions, quality, or context of a product, and are described in P4, P9, P10, P11, P13, P26, P27, P35, and P36.</p>
        <p>Product quality aspects often involve variations of the ISO 25010 software product quality characteristics
[ISO10], such as in P4, P11, P26, and P40, with some classifications going into more detail than the standard
prescribes (e.g., specifically categorizing user feedback on "Battery" in P4). Product context aspects found in P9,
P10, P35 and P36 deal with the functional aspects of interoperability, with planned extensions mostly to other
software products, discussions of content created or accessible through the product, the behavior of the product
in a specific version, and general opinions; the latter do not specify which aspect of the product a user finds good
or bad and do not necessarily pertain to the product itself. Other product-related aspects include the
users' opinions on the pricing, the development company or team, and the quality of the service they provide,
as well as comparisons users make between the product and competitor products to describe what functionality
is missing or unique in the product. Classifications on topics may correlate with classifications on user intention
(see Section 3.2), especially when users address a product function to make a request, whereas bug reports often
address defects in product quality.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Practical Application</title>
      <p>A key finding of this work is that the types of classifications can be placed into four main groups: Sentiment,
Intention, UX, and Topic. The classifications in these four groups of our taxonomy are conceptually different.
This not only means that they will produce different results, but also that they may be better suited for some
purposes than for others, which we will explore in this section.</p>
      <p>From the primary papers and our own experience, we derived seven common RE activities that can benefit
from input from user feedback analysis. Most papers pursued only one kind of RE activity, except for some
exploratory works, such as P14. The activities are listed in Table 1, where we indicate which of the classification
groups are better suited than others. An example of how this overview can be read: The activity of eliciting
requirements from user feedback is most likely to benefit from a classification according to topic to identify user
feedback that addresses the quality, functions, or context of a product. Additionally, assessing user feedback
by its intention may reveal requesting user feedback and help to specifically find feature requests. Conversely,
the Sentiment and UX categories are less suited because they may only lead to requirements indirectly, usually
requiring a manual inspection to find them.</p>
      <p>Overall, we found that for each activity, two or three groups are suited. Moreover, each of the four classification
groups can serve multiple purposes within RE, showing that they are suitable for obtaining different kinds of
RE-relevant knowledge, provided users disclose this knowledge in their feedback. These findings can be interpreted
in three ways:</p>
      <p>The application of the classification categories of a particular group may be suitable for more purposes
than the ones to which they have been applied so far. For example, UX analysis has focused mainly on
product acceptance and usage context, but would also be useful for identifying unique selling propositions
and potential process improvements.</p>
      <p>One may choose to apply classification according to just one suitable group. This choice may depend on
a trade-off between the amount of work required to perform the analysis versus the quality of the results,
as some types of analyses are relatively easy to set up (e.g., sentiment analysis), while others can provide
deeper insights if more effort is spent on tailoring them to RE.</p>
      <p>This outcome also suggests that the classification groups are not mutually exclusive and can support each
other when used in combination. For example, Sentiment and Topic could provide sentiment scores and
extracted features, respectively, which could then be aggregated to obtain an overview of the best- and
worst-rated product functions.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we presented a practical auxiliary finding from an SLR into research on the classification of
user feedback in CrowdRE research, namely, a preliminary taxonomy of the classi cation categories found in
this research. We found a total of 78 unique classi cation categories, counting the duplicate occurrence of
\Learnability" once (RQ1). For space reasons, we had to make the table listing the classi cation categories
available as a separate le5, but their names are all shown in the taxonomy. Even though the number of unique
classi cation categories was higher than anticipated, it did con rm our suspicion that the lack of an existing
structure has caused a proliferation of classi cation categories in CrowdRE research, which became especially
evident from the di erent names being used for the same concepts. Moreover, several primary papers failed to
explain how their categories were chosen or to provide a clear de nition of these categories, suggesting that some
of the categories found were constructed rather freely. Conversely, only ve papers, P2, P11, P17, P24, and
P40, based their category de nitions entirely on a formal standard. To structure the large number of categories,
we took a systematic approach towards establishing a taxonomy (shown in Figure 1), which consists of four
main groups: Sentiment, Intention, UX, and Topic (RQ2), revealing the four predominant foci of identifying
information in user feedback that is of relevance to RE. Finally, we assessed how suitable the classi cations
of each group are for typical RE-related analyses, which revealed that for most purposes, classi cations from
di erent groups can be used (RQ3). The choice of classi cation will often depend on a trade-o between the
degree of detail required and the ease of con guring and performing the analysis.</p>
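As a minimal illustration of combining Sentiment and Topic classifications as described above, the following Python sketch averages per-feature sentiment scores to rank the best- and worst-rated product functions. The feature names and scores are invented example data, and the aggregation (a plain mean over scores in [-1, 1]) is one simple choice among many, not a method prescribed by the SLR.

```python
from collections import defaultdict

# Hypothetical classifier outputs: each user review yields an extracted
# feature (Topic group) and a sentiment score in [-1, 1] (Sentiment group).
# The feature names and scores below are invented example data.
classified_reviews = [
    ("search", -0.8), ("search", -0.4),
    ("login", 0.9), ("login", 0.7),
    ("export", 0.1),
]

def rank_features(pairs):
    """Average the sentiment score per feature and sort best-rated first."""
    scores = defaultdict(list)
    for feature, score in pairs:
        scores[feature].append(score)
    means = {f: sum(s) / len(s) for f, s in scores.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_features(classified_reviews)
# ranking[0] is the best-rated product function, ranking[-1] the worst-rated
```

In practice the sentiment scores and feature labels would come from the respective classifiers rather than a hand-written list, but the aggregation step stays this small.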
      <p>One aspect that was revealed through the taxonomy is that its groups are not mutually exclusive, and that certain aspects can be identified through different classifications; most notably bug reports and feature requests. We argue that this is no contradiction to the way this classification is structured, but rather a logical result of the strong correlation between the categories. For example, although Sentiment is a category of its own, the degree to which user feedback is positive or negative often also underlies the other three groups. We also observe similar overlaps between categories of authoritative standards; for example, according to the ISO 25010 standard [ISO10], poor maintainability during development will likely affect reliability at runtime, with the distinction being the perspective taken. Our taxonomy does not seek to impose a standardization, but rather to be a constructive source of inspiration for research and industry applications. It is also intended as a first step towards introducing harmonization between the kinds of analysis performed and the naming used for the categories.</p>
      <p>The premise of this taxonomy was its bottom-up construction, in which we organized the existing classification categories used in the literature. Although it would be possible to theorize about including other potentially useful categorizations in our taxonomy, we present only those categories that research has confirmed to be appropriate for classifying user reviews, omitting those that have been shown or assumed to not be found in user feedback (see Section 2.2 for examples). Moreover, due to the nature of the research, we considered only those categories that have been applied in research studies; an assessment of the categories used in commercial tools available on the market may reveal additional categories. We intend to further validate this taxonomy with specialists in the fields of software quality assurance, RE, and UX, and to test its practical applicability as a framework for selecting appropriate classification categories depending on the goal of the user feedback analysis. Furthermore, we believe the taxonomy could be part of a quality framework with guidelines regarding best practices for using classification categories for RE. Such a framework could include metrics for evaluating the quality of classification tools, and the taxonomy could serve as a means for standardizing the classification categories in order to facilitate benchmarking with regard to the quality of the results produced by different tools.</p>
      <p>⁵Table of classification categories: zenodo.org/record/2577863, doi:10.5281/zenodo.2577863</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank the experts, including Dr. Fabiano Dalpiaz, Dr. Jörg Dörr, and Dr. Nash Mahmoud, for reviewing earlier versions of the taxonomy. We thank Sonnhild Namingha from Fraunhofer IESE for proofreading this article.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[BAH11] Javier A. Bargas-Avila and Kasper Hornbæk. Old wine in new bottles or novel challenges? A critical analysis of empirical studies of user experience. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), pages 2689–2698, 2011.</p>
      <p>[Ber17] Daniel Berry. Evaluation of tools for hairy requirements and software engineering tasks. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE) Workshops, pages 284–291, 2017.</p>
      <p>[CLH+14] Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, and Boshen Zhang. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering (ICSE), pages 767–778, 2014.</p>
      <p>[GDA15] Eduard C. Groen, Joerg Doerr, and Sebastian Adam. Towards crowd-based requirements engineering: A research preview. In Samuel A. Fricker and Kurt Schneider, editors, Requirements Engineering: Foundation for Software Quality, pages 247–253, Cham, 2015. Springer.</p>
      <p>[GIG17] Emitza Guzman, Mohamed Ibrahim, and Martin Glinz. A little bird told me: Mining tweets for requirements and software evolution. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE), pages 11–20, 2017.</p>
      <p>[GKH+17] Eduard C. Groen, Sylwia Kopczynska, Marc P. Hauer, Tobias D. Krafft, and Joerg Doerr. Users – The hidden software product quality experts? A study on how app users report quality aspects in online reviews. In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE), pages 80–89, 2017.</p>
      <p>[GSA+17] Eduard C. Groen, Norbert Seyff, Raian Ali, Fabiano Dalpiaz, Joerg Doerr, Emitza Guzman, et al. The crowd in requirements engineering: The landscape and challenges. IEEE Software, 34(2):44–52, March/April 2017.</p>
      <p>[GSK+18] Eduard C. Groen, Jacqueline Schowalter, Sylwia Kopczynska, Svenja Polst, and Sadaf Alvani. Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual analysis. In Klaus Schmid and Paolo Spoletini, editors, Requirements Engineering: Foundation for Software Quality (REFSQ) Joint Proceedings of the Co-Located Events, CEUR Workshop Proceedings 2075, 2018.</p>
      <p>[IH13] C. Iacob and R. Harrison. Retrieving and analyzing mobile apps feature requests from online reviews. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pages 41–44, 2013.</p>
      <p>[ISO10] ISO/IEC. ISO/IEC 25010 – Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. Technical report, ISO/IEC, 2010.</p>
      <p>[JM17] Nishant Jha and Anas Mahmoud. Mining user requirements from application store reviews using frame semantics. In P. Grunbacher and A. Perini, editors, Requirements Engineering: Foundation for Software Quality (REFSQ), LNCS 10153, pages 273–287, Cham, 2017. Springer.</p>
      <p>[KC07] B. A. Kitchenham and S. Charters. Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, School of Computer Science and Mathematics, Keele University, 2007.</p>
      <p>[KR08] Pekka Ketola and Virpi Roto. Exploring user experience measurement needs. In Proceedings of the 5th COST294-MAUSE Open Workshop on Valid Useful User Experience Measurement (VUUM), pages 23–26, 2008.</p>
      <p>[LL17] Mengmeng Lu and Peng Liang. Automatic classification of non-functional requirements from augmented app user reviews. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering (EASE), pages 344–353, 2017.</p>
      <p>[MN15] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? On automatically classifying app reviews. In Proceedings of the IEEE 23rd International Requirements Engineering Conference (RE), pages 116–125, 2015.</p>
      <p>[MPG15] Itzel Morales-Ramirez, Anna Perini, and Renata Silva Souza Guizzardi. An ontology of online user feedback in software engineering. Applied Ontology, 10(3–4):297–330, 2015.</p>
      <p>[Nie93] Jakob Nielsen. Usability Engineering. Morgan Kaufmann, San Francisco, 1993.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>ISO. ISO</source>
          <volume>9241</volume>
          -210:
          <article-title>Ergonomics of human-system interaction { Part 210: Human-centred design for interactive systems</article-title>
          .
          <source>Technical report, ISO</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [WM17]
          <string-name>
            <given-names>Grant</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anas</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          .
          <article-title>Mining Twitter feeds for software user requirements</article-title>
          .
          <source>In Proceedings of the IEEE 25th International Requirements Engineering Conference (RE)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [WZL+18]
          <string-name>
            <given-names>Chong</given-names>
            <surname>Wang</surname>
          </string-name>
          , Fan Zhang, Peng Liang, Maya Daneva, and Marten van Sinderen.
          <article-title>Can app changelogs improve requirements classification from app reviews? An exploratory study</article-title>
          .
          <source>In Proceedings of the ACM/IEEE 12th International Symposium on Empirical Software Engineering and Measurement (ESEM)</source>
          ,
          <source>Article</source>
          <volume>43</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>