<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Overview of User Feedback Classification Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rubens Santos</string-name>
          <email>rubens.santos@iese-extern.fraunhofer.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard C. Groen</string-name>
          <email>eduard.groen</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karina Villela</string-name>
          <email>karina.villelag@iese.fraunhofer.de</email>
        </contrib>
        <aff>Fraunhofer IESE, Kaiserslautern, Germany</aff>
      </contrib-group>
      <abstract>
        <p>Online user feedback about software products is a promising source of user requirements. To allow scaling analyses to large amounts of user feedback, research on Crowd-based Requirements Engineering (CrowdRE) seeks to tailor natural language processing (NLP) techniques to Requirements Engineering (RE). Various frameworks have been proposed, but it remains largely unclear why particular NLP techniques are better suited for CrowdRE than others, which makes it hard to make a well-founded choice for a technique. We found that CrowdRE research most often uses machine learning (ML) and has so far applied twelve clusters of ML algorithms and seven clusters of ML features. The prevalent algorithm–feature pair is Naïve Bayes with Bag of Words – Term Frequency (BOW-TF), followed by Support Vector Machines (SVM) with BOW-TF. An initial comparison of the reported precision and recall suggests that classifications in RE need further improvement. Our research presents a preliminary overview of the current landscape of automated classification techniques for RE whose results may inspire researchers to apply new strategies to advance research in this field, or to include ML models they had not considered previously in their benchmarks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Software products today typically have many geographically dispersed stakeholders who together form a large
heterogeneous group of (potential) end-users known as a "crowd" [GSA+17]. Using traditional requirements
engineering (RE) approaches, such as interviews, focus groups, questionnaires, and field observations to elicit
and validate requirements from a representative sample of such a crowd faces scalability problems. The discipline
within RE that investigates solutions for how to derive validated requirements from the crowd by motivating its
members to provide feedback, analyzing their feedback and behavior, and validating the outcomes is generally
known as Crowd-based RE (CrowdRE) [GSA+17].</p>
      <p>One promising approach within CrowdRE is the use of online user feedback about software products for RE.
This kind of feedback is widely and often publicly available on review platforms, social media, and tracking
systems and forms a valuable resource for useful information from which potential software improvements and
changed or new requirements can be derived [GSA+17]. However, because the crowd providing this feedback
is large, the amount of information they provide can quickly become too abundant to be analyzed manually
[GSK+18], causing similar scalability problems as those observed with traditional RE techniques.</p>
      <p>Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.</p>
      <p>This is why CrowdRE research is particularly interested in applying automation techniques to identify requirements-relevant
content in user feedback from the crowd in order to make this activity scalable. To this end, techniques from
the domain of natural language processing (NLP) are being considered, especially so-called "classifiers".
Classification approaches provide an automated solution for solving a text classification problem (in our case, the
classification of user feedback for RE) through algorithms that place text snippets into predetermined categories
with a certain degree of confidence. The algorithms used most often recently for text classification are supervised
machine learning (ML) algorithms, which can identify patterns in labeled data and generalize them to unlabeled
data with a certain degree of accuracy. However, because these algorithms were not originally designed for
classifying text, they need to be paired with a primary ML feature that represents the text as vectors from which
the algorithms can deduce patterns. We refer to the combination of an algorithm with a feature as an ML model.</p>
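      <p>As a concrete illustration of such an algorithm–feature pair, the following minimal sketch (not taken from any of the surveyed papers; the labels and example reviews are invented) pairs a multinomial Naïve Bayes algorithm with a bag-of-words feature:</p>
      <preformat>
```python
# A minimal, self-contained sketch (not any surveyed paper's implementation)
# of one "ML model": the multinomial Naive Bayes algorithm paired with a
# bag-of-words feature. Labels and example reviews are invented.
import math
from collections import Counter

def train(samples):
    """samples: list of (text, label) pairs. Returns priors, word counts, vocab."""
    priors, counts, vocab = Counter(), {}, set()
    for text, label in samples:
        priors[label] += 1
        words = text.lower().split()          # bag-of-words: word order is discarded
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return priors, counts, vocab

def predict(model, text):
    priors, counts, vocab = model
    total = sum(priors.values())
    best_label, best_lp = None, -math.inf
    for label in priors:
        lp = math.log(priors[label] / total)  # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / denom)   # Laplace smoothing
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

model = train([
    ("please add a dark mode feature", "feature_request"),
    ("would love an export to pdf option", "feature_request"),
    ("the app crashes on startup", "bug_report"),
    ("login fails with an error", "bug_report"),
])
print(predict(model, "please add an option to export"))   # feature_request
```
      </preformat>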
      <p>Currently, it is unclear what the most suitable approaches are for analyzing online user feedback. For one,
most research on classifying user feedback in RE stems from recent years, with many questions still left to be
answered. A further challenge is that the user feedback is typically obtained from sources that are not aimed at
collecting requirements, such as software distribution platforms, social media, and knowledge-sharing websites,
which makes the data inherently ambiguous for RE and the analysis very different from that of other NLP
applications in RE, such as the analysis of requirements specifications. Frameworks that have been proposed to
date for classifying user feedback for RE differ in both their focus and their choice of technique(s). For example,
AR-Miner filters out non-informative feedback using Expectation Maximization for Naïve Bayes (NB) [CLH+14]
and ALERTme models topics through multinomial NB to help select or rank the most informative user feedback
[GIG17]; others classify sentiments and extract topics through Latent Dirichlet Markov Allocation to identify
common concerns among users [KGM16], or apply Term Frequency – Inverse Document Frequency (TF-IDF) and
χ² to rank the most problematic software features for usability and user experience analysis [HS13]. The varying
performance of these frameworks in different contexts makes it difficult for both researchers and practitioners to
choose one that is suitable for their purpose. Moreover, the authors often do not explain why they selected a
particular technique.</p>
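      <p>To make the TF-IDF weighting mentioned above concrete, here is an illustrative sketch (the toy reviews and the tokenization are our own, not taken from the cited frameworks) that ranks the terms of one review by TF-IDF:</p>
      <preformat>
```python
# Illustrative sketch of TF-IDF term ranking; the toy reviews are invented
# and do not come from the cited frameworks.
import math

docs = [
    "battery drains fast battery",
    "login button broken",
    "battery gets hot while charging",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in tokenized if term in d)   # document frequency
    return tf * math.log(n / df)                  # idf dampens ubiquitous terms

scores = {t: tf_idf(t, tokenized[0]) for t in set(tokenized[0])}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[-1])   # battery  (appears in two docs, so its idf is lower)
```
      </preformat>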
      <p>In this paper, we therefore seek to present a preliminary comparison of the classification techniques used
to date in CrowdRE research. We are aware that, based on our results, conclusions about the suitability of a
technique for different purposes can only be drawn to a limited degree, but we assert that our work can provide
some initial insights into the current trends and help us assess how promising these techniques are from an RE
perspective. Our aggregated overview of classifiers in CrowdRE is intended as a first step towards understanding
why particular combinations of algorithms and features are better suited for RE than others. To achieve this,
we seek to answer the following research questions:</p>
      <p>RQ1: Which automated techniques have been used in research to classify requirements-relevant content in
user feedback?</p>
      <p>RQ2: Which classification algorithms and features have been used most often?</p>
      <p>RQ3: Which algorithm–feature pairs have yielded the highest precision and recall values?</p>
      <p>We first describe how we identified existing literature and assessed their use of NLP techniques in Section 2,
and then present the outcomes in Section 3. In Section 4, we discuss these results, concluding in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>Within the scope of a larger benchmarking study, we initiated a systematic literature review (SLR) [KC07] to
obtain a comprehensive and broad overview of the literature on classifying user feedback, while excluding any
potential selection bias and preventing gaps in our research. As part of this effort, we noticed that our set of
systematically obtained literature works proposed and used many disjunct classification structures and categories.
This finding led us to investigate, through a systematic mapping study [PFMM08], which approaches were used
and how effective they were, resulting in the preliminary comparison presented in this work. We defined the
following objectives for our mapping study:</p>
      <p>Overall Objective: What are the state-of-the-art automated approaches for assisting the task of
requirements extraction from user feedback acquired from the crowd, and which NLP techniques and features do
they use?
Objective 1: Regarding requirements elicitation from user feedback acquired from the crowd, what are the
state-of-the-art automated approaches for classifying user feedback?
Objective 2: How do such approaches classify user feedback?
– Objective 2.1: What are the different sets of categories in which user feedback is classified?
– Objective 2.2: Which automated techniques are used?
– Objective 2.3: What are the characteristics of the user feedback these approaches aim to classify?
To perform our search, we composed a search string by defining search terms, many of which are common
terms from known literature, and tested these in different combinations. We also used previously identified
papers to verify whether the search string would correctly find these publications. The final search string was as
follows:
We selected papers according to the eight exclusion criteria (EC) and two inclusion criteria (IC) listed below.
A paper meeting one or more ECs was excluded from the selection, while a paper meeting one or more ICs and
no ECs was included in the selection.</p>
      <p>EC1: The paper is not written in English.</p>
      <p>EC2: The paper was published before 2013.¹</p>
      <p>EC3: The work or study is not published in a peer-reviewed venue.</p>
      <p>EC4: The paper is not related to RE for software products and/or the title is clearly not related to the
research questions.</p>
      <p>EC5: The paper does not address the topic of requirements extraction from user feedback analysis.</p>
      <p>EC6: The paper proposes a tool or dataset that does not aim to assist a requirements extraction process
from online user reviews, or could not be used in this way; for example, recommender systems, information
retrieval for search engines, or approaches that link source code changes to bug fixes.</p>
      <p>EC7: The paper proposes an approach or tool that does not process textual user feedback; for example,
approaches that analyze implicit feedback, process requirements documents, or merely collect user feedback
instead of processing it.</p>
      <p>EC8: The paper proposes an approach that does not make use of automation because the user feedback
analysis is done entirely manually; for example, crowdsourced requirements elicitation.</p>
      <p>IC1: The paper proposes an approach for filtering out irrelevant user feedback from raw data, regardless of
whether or not this is done using classification techniques.</p>
      <p>IC2: The paper proposes an approach for classifying user feedback into default predetermined categories.</p>
      <p>We first applied our search string to search for suitable papers in March 2018, using three prominent scientific
databases on software engineering research: ACM, Springer, and IEEE Xplore. Exclusion criteria EC1–EC3 were
applied directly through database filters. This query returned a combined result of 1,219 papers. After removing
duplicates and screening the title and abstract (to which EC4–EC8 and IC1–IC2 were applied), 146 papers
remained. This number included papers for which the results of our title and abstract analysis were inconclusive,
i.e., where we were uncertain whether they matched our selection criteria. Further
scanning of the introduction and conclusion sections, to which EC4–EC8 were re-applied, reduced the number
of papers for data extraction to a total of 40. This work was performed by the first author of this paper, and
the third author cross-checked a random subset to assure the quality of this work. Any disagreements were
discussed and resolved. We repeated the query on 18 December 2018 to include papers that had been added over
the course of 2018, which resulted in 14 new papers, of which 3 were relevant to our SLR, for a total of 43
analyzed papers. Due to space restrictions, we present the complete list of primary papers in a separate
document², and reference them in this document with the notation Pn.</p>
      <p>To satisfy the overall goal of preparing a benchmarking study, we systematically extracted three major groups
of data from the selected papers:</p>
      <p>Dataset-related information, such as dataset size in number of entries, object granularity (sentence vs.
review), source (e.g., app stores, social media), and mean text object size.</p>
      <p>NLP techniques applied, such as algorithms, parsers, ML features, and text pre-processing techniques.</p>
      <p>Classification categories into which the tool was designed to classify user feedback, along with their
definitions, where available. We also paid special attention to any explicit rationales behind design decisions
made for a tool to understand for which goal or under which circumstances specific categories are best used.</p>
      <p>¹This year was chosen because prior to the introduction of CrowdRE in 2015 [GDA15], the analysis of online user feedback
via NLP for RE was not considered to be a serious source of requirements. We additionally considered six years as a technical
obsoleteness threshold to fit our paper selection efforts to our resource constraints.</p>
      <p>²Bibliography of primary studies: zenodo.org/record/2546422, doi:10.5281/zenodo.2546422</p>
      <p>The results from the second group of data, "NLP techniques applied", formed a list with many parameters. We
sought to structure these findings to find commonalities between the set-ups presented in the papers that can
be used as a basis for our future benchmarking study. To assess and map the NLP techniques applied, we based
a systematic mapping study on these SLR results. Because nearly all papers appeared to use a set-up using ML
algorithms, in the mapping process we categorized the parameters of each set-up according to the following five
aspects:</p>
      <p>ML algorithm, such as Nave Bayes, Logistic Regression, Support Vector Machines, or Decision Trees.
Primary ML features that represent the text according to word count or related methods, such as Bag of
Words (BOW), Term Frequency - Inverse Document Frequency (TF-IDF), and Bag of Frames (BOF).
Secondary ML features that represent speci c aspects of the text or metadata, such as length (found in
P14), sentiment score (found in P25), or star rating (found in P34). Secondary ML features yield low scores
when used on their own, but can help achieve greater e ciency and quality when used in combination with
primary features.</p>
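      <p>The idea of appending secondary features to a primary feature vector can be sketched as follows; the vocabulary, the chosen secondary features, and the sample review are illustrative assumptions, not taken from the surveyed papers:</p>
      <preformat>
```python
# Hedged sketch of combining a primary feature (bag-of-words counts) with
# secondary features (text length, star rating); vocabulary, features, and
# values are illustrative assumptions, not taken from the surveyed papers.
vocab = ["crash", "add", "please", "slow"]

def primary_bow(text):
    words = text.lower().split()
    return [words.count(v) for v in vocab]        # one count per vocabulary word

def with_secondary(text, star_rating):
    # append secondary features to the primary vector
    return primary_bow(text) + [len(text.split()), star_rating]

vec = with_secondary("please add offline mode please", star_rating=4)
print(vec)   # [0, 1, 2, 0, 5, 4]
```
      </preformat>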
      <p>Semi-supervised classification algorithms, such as Expectation-Maximization, Self-Training, or Rasco.</p>
      <p>Pre-processing techniques, such as stop words removal, synonym unification, stemming, lemmatization,
special characters removal, abbreviation transformation, and negation handling.</p>
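      <p>Several of the pre-processing steps listed above can be sketched in a few lines; this is a deliberately naive illustration (the stop-word list, stemming rule, and example sentence are our own), whereas real set-ups would rely on mature NLP toolkits:</p>
      <preformat>
```python
# A deliberately naive sketch of the pre-processing steps listed above
# (special-character removal, stop-word removal, stemming, negation
# handling); real set-ups would rely on mature NLP toolkits instead.
import re

STOP = {"the", "a", "an", "is", "it"}

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # special characters removal
    tokens, negate = [], False
    for w in text.split():
        if w in ("not", "no"):                      # crude negation handling
            negate = True
            continue
        if w in STOP:                               # stop words removal
            continue
        for suffix in ("ing", "ed", "s"):           # naive suffix stemming
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        tokens.append("NOT_" + w if negate else w)
        negate = False
    return tokens

print(preprocess("The app is not working!"))   # ['app', 'NOT_work']
```
      </preformat>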
      <p>The distinction between primary and secondary ML features is not always clear-cut and can sometimes involve
a specific (enhanced) variation on a primary ML feature. As a decision rule, we considered variations of particular
primary ML features independent of whether these were compared within the same paper. For example, n-Gram,
NLP Heuristics, and Augmented User Reviews BOW (AUR-BOW) can be interpreted as combinations of a
secondary feature with BOW, and the feature selection technique χ² is also applied in combination with BOW
in P24.</p>
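      <p>For instance, the n-Gram variation mentioned above replaces single tokens with contiguous token windows, carrying some word-order context into a bag-of-words set-up; a minimal sketch (tokens are illustrative):</p>
      <preformat>
```python
# Sketch of the n-Gram feature variation mentioned above: contiguous token
# windows that carry some word-order context into a bag-of-words set-up.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please add dark mode".split()
print(ngrams(tokens, 2))   # ['please add', 'add dark', 'dark mode']
```
      </preformat>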
      <p>We subsequently tabulated these results according to the ML algorithm, juxtaposed with the other aspect
that is always used in combination with an algorithm: a primary ML feature. In the cells, we then cited the
papers using this algorithm–feature pair. To each citation, we added footnotes to indicate additional details
of the set-ups from the remaining three categories because they can be used in many different combinations.
A set-up can include any number of pre-processing techniques and secondary ML features, or none, and can
optionally choose to make use of a semi-supervised classification algorithm. Combined, the aggregation resulted
in an intricate matrix that takes all these variations into account. We will leave a discussion of these variations
for later stages of our benchmarking study because it would make the presentation of our results too lengthy
and complex.</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In this section, we will begin with a presentation of the classification approaches found in research (Section 3.1),
followed by a comparison of the ML models applied to identify "Feature Requests" (Section 3.2).</p>
      <p>3.1 Prevalent Classification Approaches
Among the 43 papers that suggest or employ a type of user feedback classification approach, ML clearly forms
the predominant group of approaches, being applied in 37 (86%) of the primary studies analyzed. Only six
papers, P11, P18, P19, P21, P27, and P33, propose approaches based exclusively on a dictionary, regular
expressions, or a parser combined with heuristic rules. Table 1 presents a simplified version of the aggregated
table discussed in Section 2, from which details about the use of a semi-supervised classification algorithm,
pre-processing techniques, and secondary ML features in the set-up have been omitted, both for space reasons
and in order to present a comprehensive overview of results focusing on the key aspects of the ML models
themselves. In the table, we have broken the ML models down into their algorithm–feature pairs to better
understand which classifiers are used. In the literature, we found a total of 139 such pairs. The table clusters
algorithms and features that belong to the same category (e.g., the Decision Tree algorithm C4.5 includes its
Java implementation J48; NB includes both its binary and multinomial versions). Most papers included several
ML models (M = 3.8; Min = 1; Max = 14) because they reported on comparative experiments.</p>
      <p>
We found seven clusters of ML algorithms being applied: NB (found in 27 publications), Support Vector
Machines (SVM; 22), Decision Trees (DT; 17), Logistic Regression (LR; 12), k-Nearest Neighbors (k-NN; 3),
Bayesian Network (1), and Neural Networks (1). Decision Trees could be further subdivided into five distinct
algorithms, for a total of 12 algorithm clusters. We also found four main ML features among the studies:
BOW / Term Frequency (found in 23 publications), TF-IDF (12), NLP Heuristics (8), n-Gram (5), χ² (1,
in combination with three algorithms), Bag of Frames (BOF; 1, in combination with two algorithms), Binary
BOW (1, in combination with one algorithm), and AUR-BOW (1, in combination with 3 algorithms). Although
theoretically any algorithm–feature pair is possible, we found 34 in the literature, 12 of which were used only
once. The most frequently found pairs are: NB + BOW (found in 15 publications); SVM + BOW (
        <xref ref-type="bibr" rid="ref8">11</xref>
        ); NB +
TF-IDF (8), and LR + BOW (7).</p>
      <p>To compare the quality of the user feedback analysis approaches in spite of papers having different classification
purposes, we limited our comparison to those that used the most common classification category: "Feature
Request". In total, we found 51 reported precision and recall values and 10 F1-measures on 30 different ML
models, reported in 19 papers. Seven of these papers provided precision and recall or F1-measure values for
one ML model: P8, P12, P15, P25, P34, P38, and P39. The other twelve papers, P1, P7, P13, P14, P20, P24,
P26, P28, P31, P35, P40, and P41, provided values for two to twelve ML models. For 22 ML models, only one
measurement was available; the largest number of measurements was for NB + BOW, with nine measurements.
One paper, P24, compared twelve ML models and found AUR-BOW to yield a slightly better performance than
other ML features; in P35, BOW with NLP Heuristics was found to perform better than BOW alone across all
five algorithms these two features were paired with. Of the six papers that suggested non-ML approaches, two
papers also provided precision and recall values for "Feature Request": 0.95/0.87 in P18, and 0.91/0.89 in P19,
but these results could not be regarded as part of our comparison of ML models.
[Table residue: the extracted text listed precision/recall and F1 values per paper here (e.g., 0.72/0.90 in P8, 0.28/0.62 in P20, 0.71/0.79 in P25), but the table structure associating them with specific ML models was not recoverable.]
      </p>
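      <p>For reference, the metrics compared throughout this section are computed from a classifier's confusion counts for one category such as "Feature Request"; the counts in this sketch are invented for illustration:</p>
      <preformat>
```python
# Reference sketch of the metrics compared above, computed from the
# confusion counts of one category such as "Feature Request"; the counts
# below are invented for illustration.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return precision, recall

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

p, r = precision_recall(tp=72, fp=28, fn=8)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))   # 0.72 0.9 0.8
```
      </preformat>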
      <p>In order to perform comparative analyses, we needed the precision and recall values for all measurements, so
we requested these from the authors of the four papers that reported F1-measures; the authors of P40 provided
these values, the authors of P1 and P28 could not provide them, and the authors of P12 did not respond. The
resulting 54 precision–recall pairs and seven F1-measures are shown in Table 2; the F1-measures were omitted
from further analysis. This especially affected the measurements provided for the ML model n-Gram, with four
out of six measurements being excluded. Overall, moderate values were reported for both precision (M = 0.63;
SD = 0.15; Med = 0.67) and recall (M = 0.65; SD = 0.16; Med = 0.66). P28 provided a measurement on
NB + BOW where both the precision and recall values were outliers (beyond M ± 2SD), as well as an outlier value for
precision on DT–Single Tree + n-Gram. Further negative outliers were found for precision on NB + BOW in
P20 and for recall on NB + TF-IDF and LR + TF-IDF in P14.</p>
      <p>Due to the differences between the studies, we can only draw preliminary conclusions about the suitability
of the ML algorithms and primary ML features. In order to use an objective measurement for comparing the
performance of the classifiers, we calculated a weighted version of the F-measure, Fβ. This is a harmonic
average that accounts for the relative importance of recall (R) and precision (P) by using a variable β value that
is determined by the contextual characteristics of either the task or the tool (cf. [Ber17]). The β value in this
formula was determined separately for each paper. Because information about the tool's performance was not
available in the papers we assessed, we instead determined the task-based βT, calculated as 1/ρ, where ρ is the
fraction of identified "Feature Requests" in the dataset. This assumes that the less frequent the true positives
are in a document, the more challenging the task of finding them is for human actors [Ber17]. The average βT
was found to be 6.96 (SD = 4.57; Med = 5.36; Min = 2.10; Max = 15.64) across papers, which means that
greater emphasis was placed on the reported recall values overall. We then calculated the Fβ-measure for each
precision–recall pair using the following formula:</p>
      <p>F
= (1 +
2)</p>
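      <p>A small sketch of this weighted F-measure follows; note that our reading of the garbled source, βT = 1/ρ for a fraction ρ of true "Feature Requests", is an assumption that should be checked against [Ber17], and the sample precision/recall values are illustrative:</p>
      <preformat>
```python
# Sketch of the weighted F-measure used above: F_beta gives recall beta
# times the importance of precision. Our reading of the garbled source,
# beta_T = 1 / rho for a fraction rho of true "Feature Requests", is an
# assumption; the sample values are illustrative.
def f_beta(precision, recall, beta):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def beta_task(fraction_relevant):
    # rarer true positives make recall matter more
    return 1.0 / fraction_relevant

# With a large beta, F_beta is dominated by recall:
print(round(f_beta(0.72, 0.90, beta=7.0), 2))   # 0.9
```
      </preformat>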
      <p>This resulted in a total of 54 Fβ-measures (M = 0.64; SD = 0.15; Med = 0.65). The Fβ-measures and the
βT values on which they are based are presented in Table 3. The highest Fβ values were 0.92 for SVM + BOF in
P20, 0.89 for NB + BOW in P8, and 0.86 for DT–RF + BOW in P28. The lowest were 0.18, 0.26, and 0.33, for
various algorithms in combination with TF-IDF reported in P14. Although the two lowest values are outliers,
we chose to retain them in our analysis.</p>
      <p>The average Fβ values per ML algorithm showed that DT–RF (Fβ = 0.84; 2 measurements) and SVM
(0.68; 12) scored best, while Neural Networks (0.39; 1) and LR (0.40; 3) scored poorest. The ML features BOF
(0.81; 2) and NLP Heuristics (0.70; 7) had the highest average Fβ values, and TF-IDF (0.55; 17) and
AUR-BOW (0.63; 3) the lowest. The best and worst average Fβ scores per ML model (1–8 measurements) were all
based on single measurements; SVM + BOF (0.92 in P20), DT–RF + BOW (0.86 in P39), and DT–RF +
n-Gram (0.82 in P38) performed best, while LR + TF-IDF (0.26 in P14), NN + TF-IDF (0.39 in P14), and
LR + BOW (0.46 in P35) performed worst. This shows that the highest individual measurements found earlier
for ML algorithms and ML features were diminished by specific lower reported performances. This is
corroborated by the finding that the prevalent ML models, those for which we found three or more measurements
in the literature, all ranked between 7 and 25 (Fβ = 0.51–0.72) due to the large differences in their performance
across settings (e.g., NB + BOW: 8 measurements, Fβ ranging from 0.57 in P20 to 0.89 in P8, rank 7).</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>In recent years, research on CrowdRE has used various approaches to classify user feedback for RE. Some
publications suggest the use of dictionary-based approaches, regular expressions or parsing, but the vast majority
of approaches use ML models, consisting of ML algorithms paired with ML features (RQ1). Based on this
finding, we further investigated the specific classification algorithms and features. The most popular algorithms
were found to be NB, SVM, LR, and DT (especially Single Tree), while BOW and TF-IDF were almost always
used as ML features. The models NB + BOW, SVM + BOW, and NB + TF-IDF were found most often in
research (RQ2). A common denominator of the prevalent ML algorithms is that they provide a relatively large
degree of control over the supervised ML, while the ML features BOW and TF-IDF are likely popular because
of their versatile nature. They also share a similar degree of tool support, and are often already familiar to the
researcher.</p>
      <p>To assess the performance, we compared the ML models' Fβ-measures. We found that SVM + BOF,
NB + BOW, and DT–RF + BOW each achieved an Fβ-measure above 0.85 in a single study
(RQ3), but when other measurements were available, their averaged score would be lower than other
average Fβ-measures. This is in line with our finding that overall, the precision, recall, and Fβ metrics were
surprisingly moderate if we assume that the phenomenon of publication bias usually promotes the reporting of
positive results, paired with the best possible outcomes of the analyses performed. All of the popular ML
algorithms used can potentially yield good-quality results as well as poorer results. In this preliminary analysis
we did not investigate how particular study characteristics might have contributed to better or poorer
performance of the ML models, but the strong variance in the performance of the ML models does suggest that other
factors have a strong impact on the classification efficiency. These characteristics include the goals and problems
the respective study sought to address; the type, quality, and size of the dataset and gold standard; and the
choice of classification categories and additional ML techniques (e.g., semi-supervised classification algorithms,
pre-processing techniques, and secondary ML features; see Section 2). Moreover, the great diversity in the total
number of different algorithm–feature pairs found and the strong variance in study set-ups suggest that
research on user feedback analysis is still exploring appropriate ML models.</p>
      <p>We assert that the choice of appropriate ML algorithms and ML features may need to play a larger role in
future research on classifying user feedback for RE: Some works made a strong case for NLP Heuristics and
n-Grams that can introduce context information into the classi cation task, which under some circumstances
leads to better results. Moreover, the CrowdRE literature has proposed two adaptations of BOW with promising
results: BOF in P20 and AUR-BOW in P24. However, none of these directions have so far been pursued in
subsequent research, suggesting that this field could benefit from a more thorough exchange on the technologies
applied. Support for these features in ML solutions such as Weka [HFH+09] or Scikit-learn [PVG+11] could
facilitate wider use. Similarly, the use of a Bayesian Network as in P43, or the use of a Deep Learning approach
such as the Neural Networks in P14, could be replicated in further studies. In addition to ML approaches, the
results of several works on non-ML approaches for RE suggest that carefully designed heuristics may in some
cases also provide accurate results (i.e., high precision). However, we asserted in P11 that high recall values are
harder to achieve because it is challenging to create enough rules.</p>
      <p>In terms of threats to validity, the exclusion of other influencing study characteristics in our mapping may
have limited our ability to address confounding variables influencing the variance in the reported precision and
recall values. By following the procedure of a systematic mapping study [PFMM08], we claim to have achieved
strong internal validity and to have covered the publications in this domain well. Conclusion validity may
be affected by the heterogeneity of the publications found; by focusing on the precision and recall values for
classifications of "Feature Requests" using ML models, we were only able to compare 16 of 43 papers. In terms
of external validity, our work might not be generalizable to areas outside of user feedback analysis in RE because
we specifically investigated the classification techniques used within CrowdRE.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion &amp; Future Work</title>
      <p>In this work, we presented the preliminary results of a systematic mapping study regarding the classification
techniques used in CrowdRE studies that sought to classify user feedback for RE using NLP techniques. The field
is strongly dominated by ML, for which we identified the twelve clusters of ML algorithms and seven clusters
of ML features that have been used to date. A comparison of the reported results, however, shows a varied
outcome. It was not possible to draw decisive conclusions from the data about which ML models are most
suitable for classifying user feedback, because factors other than the ones we considered in this work appear to
have had a strong influence on the performance of the ML models. However, we did find that good results have
been attained with the most often used ML algorithms, especially when used in combination with appropriate
primary and secondary ML features.</p>
      <p>We have seen that the current landscape is still one of exploration into the most suitable techniques. However,
progress is hindered by a certain lack of cross-fertilization, in which researchers would pick up on promising findings in
other works to investigate whether these approaches work well in their context. Most notably, this pertains to
adaptations of ML features such as BOF and AUR-BOW, which were specifically adapted for CrowdRE research
but have not yet found any application in research by others. Similarly, several other ML features and even some
ML algorithms have shown promise in merely a single publication. For research on user feedback analysis for
RE to progress, it may even be indispensable to evaluate these techniques for different purposes in comparative
studies and benchmarks. On the other hand, the potential of non-ML approaches reported in some works should
not be ignored either when investigating new solutions for user feedback classi cation.</p>
      <p>This work was descriptive in nature and was limited to a comparison of only the ML algorithms and primary
ML features. Future work should assess which other study-specific aspects (e.g., dataset, classification
categories, other ML techniques; see Section 4) impact the performance of a classification approach when analyzing
user feedback. This could lead to more prescriptive results regarding the most suitable kinds of classification
techniques and the rationale for choosing a particular configuration. Although our work focused on the analysis
of user feedback, the performance of ML models in other contexts within and outside of RE may also reveal
which techniques are most appropriate. We found that most publications do not provide any explanation for
their choice of techniques, even though the motives could be an inspiration to others. Moreover, we see the need
for an elaborate comparison or benchmarking of classification techniques for the purposes of RE. A thorough
comparison of potential ML models is lacking, with only a few papers having compared up to ten ML models
for one particular purpose. Although the only study in this domain that we found to apply Deep Learning, P14,
did not yield convincing results, this may be a line of research that could be pursued further in coming years.</p>
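      <p>As a sketch of how such a benchmark could be set up, the following compares several ML models of the kinds surveyed here on macro F1 via cross-validation, using scikit-learn [PVG+11]. The model selection, toy dataset, and labels are invented assumptions for illustration; a real benchmark would use the labeled user-feedback corpora from the surveyed studies.</p>

```python
# Sketch of a small benchmark over several ML models of the kind surveyed here,
# comparing macro F1 via 2-fold cross-validation with scikit-learn.
# The dataset is a placeholder invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical user-feedback snippets (bug reports vs. feature requests).
feedback = [
    "The app crashes when I open the camera",
    "Crashes on startup since the last update",
    "App freezes and then crashes on login",
    "Constant crashes after the latest version",
    "Please add a dark mode option",
    "It would be great to have offline support",
    "Add the ability to export data as CSV",
    "A widget for the home screen would be nice",
]
labels = ["bug"] * 4 + ["feature"] * 4

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, clf in models.items():
    # Each model is paired with the same BOW-TF features for a fair comparison.
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, feedback, labels, cv=2, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```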
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Dr. Daniel M. Berry for his constructive feedback and support on using F-measures.
We thank Sonnhild Namingha from Fraunhofer IESE for proofreading this article.</p>
      <p>[Ber17] Daniel Berry. Evaluation of tools for hairy requirements and software engineering tasks. In
Proceedings of the IEEE 25th International Requirements Engineering Conference (RE) Workshops, pages
284–291, 2017.</p>
      <p>[GDA15] Eduard C. Groen, Joerg Doerr, and Sebastian Adam. Towards crowd-based requirements engineering:
A research preview. In Samuel A. Fricker and Kurt Schneider, editors, Requirements Engineering:
Foundation for Software Quality, pages 247–253, Cham, 2015. Springer.</p>
      <p>Emitza Guzman, Mohamed Ibrahim, and Martin Glinz. A little bird told me: Mining tweets for
requirements and software evolution. In Proceedings of the IEEE 25th International Requirements
Engineering Conference (RE), pages 11–20, 2017.</p>
      <p>[GSA+17] Eduard C. Groen, Norbert Seyff, Raian Ali, Fabiano Dalpiaz, Joerg Doerr, Emitza Guzman, et al.
The crowd in requirements engineering: The landscape and challenges. IEEE Software, 34(2):44–52,
March/April 2017.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [CLH+14]
          <string-name><given-names>Ning</given-names> <surname>Chen</surname></string-name>, Jialiu Lin,
          <string-name><given-names>Steven C. H.</given-names> <surname>Hoi</surname></string-name>, Xiaokui Xiao, and Boshen Zhang.
          <article-title>AR-Miner: Mining informative reviews for developers from mobile app marketplace</article-title>.
          In <source>Proceedings of the 36th International Conference on Software Engineering (ICSE)</source>, pages
          <fpage>767</fpage>–<lpage>778</lpage>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [GSK+18]
          <string-name><given-names>Eduard C.</given-names> <surname>Groen</surname></string-name>, Jacqueline Schowalter, Sylwia Kopczyńska, Svenja Polst, and
          <string-name><given-names>Sadaf</given-names> <surname>Alvani</surname></string-name>.
          <article-title>Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual analysis</article-title>.
          In Klaus Schmid and Paolo Spoletini, editors, <source>Requirements Engineering: Foundation for Software Quality (REFSQ) Joint Proceedings of the Co-Located Events</source>,
          <source>CEUR Workshop Proceedings</source> <volume>2075</volume>, <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [HFH+09]
          <string-name><given-names>Mark</given-names> <surname>Hall</surname></string-name>, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer,
          <string-name><given-names>Peter</given-names> <surname>Reutemann</surname></string-name>, and
          <string-name><given-names>Ian H.</given-names> <surname>Witten</surname></string-name>.
          <article-title>The WEKA data mining software: An update</article-title>.
          <source>SIGKDD Explorations</source>, <volume>11</volume>(<issue>1</issue>):<fpage>10</fpage>–<lpage>18</lpage>, <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [HS13]
          <string-name><given-names>Steffen</given-names> <surname>Hedegaard</surname></string-name> and
          <string-name><given-names>Jakob Grue</given-names> <surname>Simonsen</surname></string-name>.
          <article-title>Extracting usability and user experience information from online user reviews</article-title>.
          In <source>Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI)</source>, pages
          <fpage>2089</fpage>–<lpage>2098</lpage>, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [KC07]
          <string-name><given-names>B. A.</given-names> <surname>Kitchenham</surname></string-name> and
          <string-name><given-names>S.</given-names> <surname>Charters</surname></string-name>.
          <article-title>Guidelines for performing systematic literature reviews in software engineering</article-title>.
          <source>Technical Report EBSE-2007-01</source>, School of Computer Science and Mathematics, Keele University, <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [KGM16]
          <string-name><given-names>Preet Chandan</given-names> <surname>Kaur</surname></string-name>, Tushar Ghorpade, and
          <string-name><given-names>Vanita</given-names> <surname>Mane</surname></string-name>.
          <article-title>Topic extraction and sentiment classification by using latent Dirichlet Markov allocation and SentiWordNet</article-title>.
          In <source>Proceedings of the International Conference on Advances in Information Communication Technology &amp; Computing (AICTC)</source>,
          Article <volume>86</volume>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [PFMM08]
          <string-name><given-names>Kai</given-names> <surname>Petersen</surname></string-name>, Robert Feldt, Shahid Mujtaba, and
          <string-name><given-names>Michael</given-names> <surname>Mattsson</surname></string-name>.
          <article-title>Systematic mapping studies in software engineering</article-title>.
          In <source>Proceedings of the 12th International Conference on Evaluation and Assessment in Software Engineering (EASE'08)</source>, pages
          <fpage>68</fpage>–<lpage>77</lpage>, <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [PVG+11]
          <string-name><given-names>Fabian</given-names> <surname>Pedregosa</surname></string-name>, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,
          <string-name><given-names>Olivier</given-names> <surname>Grisel</surname></string-name>, et al.
          <article-title>Scikit-learn: Machine learning in Python</article-title>.
          <source>Journal of Machine Learning Research</source>, <volume>12</volume>(Oct):<fpage>2825</fpage>–<lpage>2830</lpage>, <year>2011</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>