Is there Really a Need for Using NLP to Elicit Requirements? A Benchmarking Study to Assess Scalability of Manual Analysis

Eduard C. Groen, Fraunhofer IESE, Kaiserslautern, Germany, eduard.groen@iese.fraunhofer.de
Jacqueline Schowalter, University of Kaiserslautern, Kaiserslautern, Germany, j.schowalt09@uni-kl.de
Sylwia Kopczyńska, Poznań University of Technology, Poznań, Poland, sylwia.kopczynska@cs.put.poznan.pl
Svenja Polst, Fraunhofer IESE, Kaiserslautern, Germany, svenja.polst@iese.fraunhofer.de
Sadaf Alvani, University of Kaiserslautern, Kaiserslautern, Germany, alvanis@rhrk.uni-kl.de

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

Abstract

The growing interest of the requirements engineering (RE) community in eliciting user requirements from the large amounts of online user feedback available for software-intensive products has established such data as a sensible source of user requirements. Several researchers have proposed automation approaches for extracting requirements from user reviews. Although it is commonly assumed that manually analyzing large amounts of user reviews is challenging, no benchmarking has yet been performed that compares the manual and the automated approaches concerning their efficiency. We performed an expert-based manual analysis of 4,006 sentences from typical user feedback contents and formats and measured the amount of time required for each step. Then, we conducted an automated analysis of the same dataset to identify the degree to which automation makes the analysis more scalable. We found that a manual analysis indeed does not scale well, and that an automated analysis is many times faster and scales well to increasing numbers of user reviews.

1 Introduction

Developers of consumer products face the challenge of receiving increasing amounts of user feedback about their products. This feedback has been shown to contain useful experience-based information from users about their functional and non-functional requirements [GKH+17], so it can help improve the products to gain market advantage and retain user satisfaction [GSA+17]. A product-developing company may choose to hire external consultants to do market research, or to try and save costs by assigning a team member to assess the user feedback, among others by reading user reviews. But as the amount of user feedback grows, it becomes more difficult to put in the effort required to assess all user reviews. Theoretically, this makes automating user feedback analysis a promising alternative.

A growing body of research in requirements engineering (RE) has been dedicated to understanding and classifying user feedback using natural language processing (NLP) techniques. The proposed methods and tools are usually evaluated by comparing the results of the NLP analysis to truth sets based on a manual analysis of a smaller subset of the user feedback (e.g., [IH13, PDG+15, WM17]). The categorization is often one-dimensional (e.g., "positive", "negative", and "feature request" [IH13]). These studies often report the time spent by each annotator on classifying the subset. That time is the basis for concluding that an analysis performed entirely manually is not sensible. However, we do not know of any research that has looked into determining the actual scalability of manual analysis. To determine the effort involved in performing a manual analysis of user reviews, we conducted an expert-based analysis of online user reviews of compact cameras.
Then, we compared the results with the effort needed for an automated analysis. Our study considered cameras, i.e., software-intensive systems in which the boundaries between software and hardware are blurring. We chose this domain because research to date has primarily focused on apps [PM13], and because of the rising popularity of smart systems. With the study, we sought to answer the following research questions:

RQ1: How well does the manual classification of requirements in user feedback scale in terms of required effort?

RQ2: How does automating some parts of the user feedback analysis process influence its scalability?

Section 2 of this paper discusses related work. Sections 3, 4, and 5 present the method, results, and discussion of our content analysis. Section 6 describes threats to validity, and Section 7 summarizes and suggests possible future work.

2 Related Work

RE can considerably benefit from NLP techniques. For example, quality defects such as inconsistencies can be identified in documented requirements. Additionally, specifications of very- and ultra-large-scale systems can be better managed [RSW08], for instance through the ability to find and categorize requirements. In recent years, a growing body of research has evaluated the use of NLP techniques for requirements elicitation. Most of these works aim at deriving requirements from various types of user feedback [GSA+17]. Our study extends this area by providing the perspective of the efficiency of eliciting requirements by extracting them from user feedback.

Earlier studies have analyzed larger samples of user feedback from an RE perspective by categorizing their contents. Pagano and Maalej [PM13] performed statistical analyses on app store reviews and manually analyzed 1,100 reviews, reporting that around 30% of all reviews could be categorized as "requirements", including sentences containing bug reports and shortcomings. Guzman and Maalej [GM14] analyzed 2,800 user reviews and found feature requests in 10%. Panichella et al. [PDG+15] tagged a sample of 1,421 sentences from app store reviews, from which non-informative reviews had already been omitted, and classified 97.82% as either "information seeking", "information giving", "feature request", or "problem discovery". Guzman, Alkadhi, and Seyff [GAS16] manually analyzed 1,000 tweets, and in requirements-relevant categories found 0.90% bug reports, 1.50% feature shortcomings, and 0.10% feature requests. Guzman, Ibrahim, and Glinz [GIG17] manually analyzed 1,350 tweets from the support accounts of three apps, of which 42% could be classified as improvement requests. Kurtanović and Maalej [KM17] coded 1,020 user reviews (6,268 sentences; 7,288 sentences including titles) on the user rationale concepts "issues", "alternatives", "criteria", "decisions", and "justifications", and found that 48% of the sentences and 86% of the reviews contained one or more of these concepts. Williams and Mahmoud [WM17] manually classified 4,000 tweets and found that 27% expressed "bug reports" and another 24% contained "user requirements".

Our study differs from the related work by analyzing results for software-intensive products, and by making a more fine-grained analysis that considers not only features and sentiments, but also quality aspects (cf. [GKH+17]), general sentences, and sentence topics.

3 Method

Our analysis, based on Neuendorf's content analysis method [Neu02], consisted of the following steps:
Step 1: Select and obtain user reviews. As online user reviews of apps have received much attention (e.g., [GW13, GM14, GIG17, JM17, PM13, SAH+15]), we chose to analyze user reviews of smart software-intensive systems, as we see an increasing trend towards these systems. We assumed a case in which a company would like to gain insights into what users want or need from currently popular pocket-sized compact cameras in the price range around 200 euros that have innovative or (software-based) smart functions. We selected 12 presently popular compact cameras. Ten of them were available on Amazon, from where we crawled the user reviews. To obtain an objective overview of all requirements for a particular camera, we intended to analyze all user reviews of each camera rather than choosing a sample. To reduce skewing, we omitted three cameras with deviating review counts and lengths; one camera had only 17 reviews with an average length of 12.59 sentences (versus 7.12 for all cameras), and the reviews about two other cameras provided about 2,000 sentences per camera, which would have caused half the dataset of nine cameras to consist of two cameras. As a result, our dataset consisted of the seven cameras presented in Table 1.

Table 1: Overview of analyzed cameras and number of user reviews available. The number of sentences is composed of the number of sentences in the title and in the contents.

Camera | User Review Count | Sentence Count | Avg. Sentences/Review
Panasonic Lumix DMC-7270 | 10 | 35 | 3.50
Olympus Stylus TG-8609 | 163 | 1,010 | 6.20
Canon PowerShot SX610 HS | 59 | 335 | 5.68
Canon PowerShot D30 | 149 | 1,043 | 7.00
Nikon Coolpix AW130 | 123 | 685 | 5.57
Nikon Coolpix S7000 | 82 | 532 | 6.49
Sony Cyber-shot DSC-HX60 | 38 | 366 | 9.63
Total or average | 624 | 4,006 | 6.42

Step 2: Structure and import user reviews. Because a user review usually covers several aspects (e.g., both positive and negative aspects), we split the reviews into separate sentences using a sentence splitter. Based on the punctuation used, the sentence splitter could also split compound sentences and sentences that include a listing into separate sentences. We then stored the list of sentences in a CSV file, which we loaded into a customized web-based tagging tool. Here, annotators could assign tags representing specific categories to which a sentence belongs, such as "positive" or "negative". This tool allowed the annotators to tag independently of one another.
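The paper does not prescribe a particular sentence splitter, so the following Python sketch only illustrates the kind of punctuation-based splitting and CSV export described in Step 2; all function names, field names, and file names are placeholders of our own.

```python
import csv
import re

def split_into_sentences(review: str) -> list[str]:
    """Split a review into sentence-like units on end punctuation and
    semicolons (a rough approximation of punctuation-based splitting)."""
    parts = re.split(r"(?<=[.!?;])\s+|\n+", review.strip())
    return [p.strip() for p in parts if p.strip()]

def export_sentences(reviews: list[dict], path: str = "sentences.csv") -> None:
    """Write one row per sentence so the list can be loaded into a tagging tool."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["camera", "review_id", "sentence"])
        for review in reviews:
            for sentence in split_into_sentences(review["text"]):
                writer.writerow([review["camera"], review["id"], sentence])

if __name__ == "__main__":
    sample = [{"camera": "Example camera", "id": 1,
               "text": "Great zoom; battery life is short. Would buy again!"}]
    export_sentences(sample)
```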
Step 3: Define tagging categories and a tagging schema. The exact tags that could be assigned in the tool were determined based on previous methods that determined the valence ("positive" and "negative") of a sentence, along with the option "feature request" [GM14, GW13, IH13, MN15]. Based on our findings [GKH+17] that sentences also contain information about product qualities (i.e., non-functional aspects) in addition to product features (i.e., functional aspects), we introduced a second tagging level (see Table 2).

Table 2: Overview of tags and their tagging criteria.

Tag | Criterion
Positive | This sentence contains a positive remark about the product
Negative | This sentence contains a negative remark about the product
Requested | This sentence contains a plea or wish regarding the product
None | This sentence is not relevant
Feature | This sentence is about a functional aspect (i.e., a feature or product characteristic)
(Quality characteristics) | Seven out of eight product quality characteristics as defined in the ISO 25010 standard [ISO11]
General | This sentence is neither about a feature (product characteristic) nor a quality (non-functional aspect)

The first level enabled the annotator to assign "positive", "negative", "requested", or "none" as the main tag. In this level, "feature request" was changed to "requested" to also enable combinations with quality requests and general requests. If an annotator picked an option other than "none", a sub-level tag was assigned. This sub-level contained the options "feature", seven of the eight software product characteristics from the ISO 25010 standard [ISO11] ("maintainability" was omitted because users do not have insight into the aspects of this development-specific quality [GKH+17]), and the category "general" for positive, negative, and requesting sentences that do not specifically concern a product characteristic or quality. The two levels together can provide combinations such as "feature requests" (requested × feature) or "negative sentences about usability" (negative × usability). For a common understanding of the meaning of the tags, a tagging schema was created based on an earlier study's schema with examples and signal words, and the annotators were trained on the ISO 25010 characteristics.

Step 4: Perform tagging. As the content was expected to contain ambiguities, each sentence was annotated separately by three of this paper's authors: one researcher and two computer science students. The annotators selected and read each sentence and assigned the tag that they found most suitable. More than one tag and sub-tag per sentence were allowed when a sentence could be clearly cut into several distinct parts (e.g., because it contained independent clauses separated by "but" or "and", or because the sentence splitter failed to split it due to missing punctuation). In order to obtain benchmarking results, the annotators were instructed to tag as precisely as possible but without excessive deliberation, and to log their time.

Step 5: Reconcile and process tagging results. We exported the tagging results to an Excel file and assigned each sentence the tag that scored the majority of votes (a minimal sketch of this majority-vote reconciliation is given at the end of this section). For the main tag, there was a majority agreement on 3,946 of the 4,006 sentences (98.50%). After omitting the sentences assigned "none", there was an agreement on the assigned sub-tag for 1,387 of the 1,844 remaining sentences (75.22%). Sentences with multiple tags by multiple annotators were split, because this indicated that different portions of the sentence should be assigned their own tags. Sentences on which there was disagreement were manually reconciled. To calculate inter-rater reliabilities, we used Krippendorff's alpha [HK07] because the sub-tags had deliberately missing data, as the annotators did not assign sub-tags if the main tag was "none". We considered the result our ground truth. An annotator then analyzed the sentences to identify the "topic" (i.e., a central keyword) and any attributes. For example, "I like the size of the display" has "display" as its topic and "size" as a positive attribute. Not all sentences have a topic. Similar topics (e.g., "screen" and "display") were merged. This was also done entirely manually to measure and assess the amount of effort involved in classifying data without the use of automation.

Step 6: Analyze data. The data resulting from the manual tagging were analyzed by calculating sums and frequencies of the source data. A more thorough analysis of the content analysis to determine the quality of the results is beyond the scope of this paper and will be discussed in another paper. We additionally calculated and analyzed the time measurements of all steps to estimate the scalability of a manual approach.

Step 7: Estimate automation potential. In addition to the steps suggested by Neuendorf [Neu02], we identified which steps can potentially be automated. Furthermore, we calculated the amount of time that could be saved by automation compared to the manual analysis.
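To make the reconciliation in Step 5 concrete, the sketch below assigns each sentence the main tag that received the majority of the three annotators' votes and flags ties for manual reconciliation. The tag names follow Table 2; the data structures and function names are our own illustration, and the Krippendorff's alpha computation [HK07] is not reproduced here.

```python
from collections import Counter

MAIN_TAGS = {"positive", "negative", "requested", "none"}

def reconcile(votes_per_sentence: dict[str, list[str]]) -> tuple[dict[str, str], list[str]]:
    """Majority vote over the three annotators' main tags.

    Returns the reconciled tag per sentence and the sentences that need
    manual reconciliation because no tag reached a majority."""
    reconciled, disputed = {}, []
    for sentence, votes in votes_per_sentence.items():
        tag, count = Counter(votes).most_common(1)[0]
        if count >= 2:           # at least two of three annotators agree
            reconciled[sentence] = tag
        else:                    # e.g. positive / negative / none
            disputed.append(sentence)
    return reconciled, disputed

if __name__ == "__main__":
    votes = {
        "I like the size of the display": ["positive", "positive", "none"],
        "Battery died after a week":      ["negative", "none", "requested"],
    }
    agreed, needs_review = reconcile(votes)
    print(agreed)        # {'I like the size of the display': 'positive'}
    print(needs_review)  # ['Battery died after a week']
```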
4 Results

4.1 Descriptive Statistics

The dataset consisted of 624 user reviews, providing a total of 4,006 sentences on seven compact cameras (see Table 1). Each product received on average 89.1 user reviews, and each user review consisted on average of 6.2 sentences. All three annotators considered 1,563 sentences (39.02%) as irrelevant. After reconciliation, a further 774 sentences (19.32% of the total) were considered irrelevant, and 27 sentences (e.g., compound sentences) were split into two, three, or four parts. This resulted in a final set of 1,708 sentences that were considered relevant from an RE perspective; these stemmed from 534 reviews (85.58% of all reviews). We expected low inter-rater reliability due to the inherent ambiguity of statements in user reviews (cf. [PDG+15]). We found moderate agreement on the main tags (α = 0.58), which was nearly substantial (values of 0.61 and up), and fair agreement on the sub-tags (α = 0.37), which approached moderate agreement (values of 0.41 and up).

4.2 Manual Labor (RQ1)

The annotators needed on average 12:46 hours to tag the 4,006 sentences (see Table 3). This time was not continuous, as they required ample breaks between sessions. Moreover, the speed of annotation varied, which is confirmed by the strong negative correlation between session length and average annotation speed (r = -0.88). This suggests that performance decreases as vigilance wanes. More specifically, it explains the large difference in performance between annotators 1 and 3.

Table 3: Overview of the performance of the various annotators.

Annotator | Total duration | Avg. session duration [h] | Avg. sentences/minute | Min; Max sentences/session
Annotator 1 | 08:37:00 | 00:16:41 | 7.92 | 45; 270
Annotator 2 | 13:18:00 | 00:36:16 | 4.84 | 43; 445
Annotator 3 | 16:23:00 | 01:10:13 | 4.08 | 110; 733
Average | 12:46:00 | 00:41:03 | 5.61 | 482.67

We determined the effort per process step (presented in Table 4) to assess the scalability of manual analysis. For each step, we identified fixed timespans (constant for the process) and variable time amounts that depend on how much user feedback is analyzed.

Table 4: Composition of time per process step in a manual analysis. The values of step 1a and step 4a present the combined duration of the work performed by three persons, each spending on average a third of that time. Steps 2b until 5a are not considered "automatable" as they would be redundant in an automated procedure.

Step | Duration [h] | Automatable | Variable
1a. Select products | 01:30 | No | Products
1b. Gather hyperlinks | 00:15 | No | Products
1c. Obtain (crawl) user reviews | 00:06 | Fully | User reviews
2a. Apply sentence splitter | 00:05 | Fully | Sentences
2b. Export file | 00:05 | N/A | Sentences
2c. Develop, deploy & test tagging tool & environment | 03:30 | N/A | N/A
2d. Configure tagging tool | 00:15 | N/A | N/A
2e. Import user reviews | 00:30 | N/A | N/A
3. Define tagging categories & tagging schema | 00:40 | N/A | Sentences
4a. Perform tagging | 38:18 | N/A | Sentences
4b. Adjust export criteria & export results | 02:00 | N/A | Sentences
5a. Reconcile results | 08:00 | N/A | Sentences
5b. Process results | 15:00 | Fully | Sentences
6. Analyze data | 18:00 | Partially | Visualizations
Total [hours] | 88:14:00 | |
The variability can be expressed as a relationship with the number of products, reviews, or sentences analyzed. To assess whether this manual work can be scaled to large amounts of user feedback, we had tracked the time needed to perform each step (see Table 4). For the sake of simplicity, we proposed to model the total time as a linear function of the data. One consequence of this approach is that we considered the combined annotation time from Table 3 as a continuous timespan, disregarding the need for frequent breaks and the decrease in annotation performance over time. In a joint discussion, we determined the share of fixed and variable timespans needed for completing each step based on our experience. For example, in step 1a, a group of three persons chose the products to be analyzed, amounting to on average 9 minutes per person, or 27 minutes in total. This was the fixed time, leaving 63 minutes of variable time to look up and decide upon the inclusion of seven products, which added a variable time of 9 minutes per product. Note that we subtracted the additional time spent to obtain the ten products we had initially identified. This results in the formula $T_{step1a} = 27 + 9N_P$, where $T$ is the time required in minutes and $N_P$ is the total number of products. We repeated this process for all steps. We then calculated the summed values for each step, which in the case of step 1 is the sum of steps 1a, 1b, and 1c. For the later steps, we also included the variables $N_R$ for the number of user reviews and $N_S$ for the number of sentences.

• $T_{step1} = \sum_{i=a}^{c} T_{step1,i} = 27 + 9N_P + 2.14286N_P + 0.00962N_R = 27 + 11.14286N_P + 0.00962N_R$
• $T_{step2} = \sum_i T_{step2,i} = 0.00125N_S + 2 + 0.00075N_S + 210 + 15 + 15 + 0.00374N_S = 242 + 0.00574N_S$
• $T_{step3} = 40$
• $T_{step4} = \sum_{i=a}^{b} T_{step4,i} = 0.57364N_S + 90 + 0.00749N_S = 90 + 0.58113N_S$
• $T_{step5} = \sum_{i=a}^{b} T_{step5,i} = 0.11982N_S + 0.22466N_S = 0.34448N_S$
• $T_{step6} = 1,000 + 0.01997N_S$

Together, they form the following formula:

$T_{total(manual)} = \sum_{i=1}^{6} T_{step(i)} = 1,399 + 11.14286N_P + 0.00962N_R + 0.95132N_S$

The number of products influences only the two short steps 1a and 1b, and the influence of the number of user reviews is negligible. But the number of sentences does affect the analysis duration. In addition, at least 1,399 minutes (23.32 hours, or 2.9 eight-hour person days) are needed for preparing and post-processing the manual analysis. Our analysis used 7 products, 624 user reviews, and 4,006 sentences. Applying the proposed formula gives $T_{total(manual)} = 1,399 + 11.14286 \times 7 + 0.00962 \times 624 + 0.95132 \times 4,006 = 1,399 + 78.00 + 6.00 + 3,810.99 = 5,294$ minutes (88:14 hours), reflecting the total in Table 4. It assumes that three annotators perform the tagging, because with one annotator the results can be biased, and with two annotators it is difficult to resolve conflicting results.

Based on the formula, we can determine that each additional sentence adds on average 57.08 seconds to the analysis process. Of this time, 34.42 s (60.30%) are spent on tagging, 13.48 s (23.62%) on processing the results, 7.19 s (12.60%) on reconciliation, and the remaining 1.99 s (3.49%) on data import, export, and analysis. This shows that the steps following tagging also still require a great deal of manual work.

To assess the actual scalability of manual work, we calculated the increase in time required for the manual analysis of sentences. To this end, we performed a correction for the increase in the number of products and reviews by assuming the average review lengths and the average number of reviews per product from our dataset. This shows that even a threefold number of sentences (12,000) would already require 217.78 hours of work (27.2 person days or 1.24 person months), or a workload of 41.3% FTE when evenly divided over three persons. This work is required for about 1,935 user reviews, which some apps receive on a daily basis.
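As a quick sanity check on the cost model above, the following Python sketch encodes $T_{total(manual)}$ with the coefficients reported in this subsection and reproduces the totals for our dataset; the function name and the output formatting are our own illustrative choices.

```python
def manual_minutes(products: float, reviews: float, sentences: float) -> float:
    """T_total(manual) in minutes: fixed effort plus per-product,
    per-review, and per-sentence variable effort (three annotators)."""
    return 1399 + 11.14286 * products + 0.00962 * reviews + 0.95132 * sentences

if __name__ == "__main__":
    total = manual_minutes(products=7, reviews=624, sentences=4006)
    print(f"{total:.0f} minutes = {total / 60:.1f} hours")  # ~5294 minutes = ~88.2 hours
    # Marginal cost of one extra sentence, in seconds:
    print(f"{0.95132 * 60:.2f} s per additional sentence")  # ~57.08 s
```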
4.3 Automation Benchmark (RQ2)

To compare the manual and the automated approach, we compared the duration of the steps required to conduct each one. We considered all steps of the processes, regardless of whether they are automated or performed manually. We also assumed that all the steps are taken when a process is already in place; we did not measure the time for preparation, such as developing and configuring tooling or defining the classification categories of the manual or automated approaches. The reason behind this assumption is that we could not reliably estimate the time those initial activities require, as they greatly depend on the context.

When automating the analysis process, we could determine that steps 2b until 5b are replaced by two steps: sentence matching, and post-processing to find topics and attributes. The steps and their duration are shown in Table 5.

Table 5: Composition of time per process step in the automated analysis. The numbering of the steps has been aligned with Table 4. Steps 1a and 1b are still performed manually, and step 6 is performed partially manually.

Step | Duration [h] | Automatable | Variable
1a. Select products | 01:30 | No | Products
1b. Gather hyperlinks | 00:15 | No | Products
1c. Obtain (crawl) user reviews | 00:06 | Fully | User reviews
2a. Apply sentence splitter | 00:05 | Fully | Sentences
X. Match language patterns | 01:00 | Fully | Sentences & pattern combinations
5b. Process results | 00:30 | Fully | Sentences
6. Analyze data | 18:00 | Partially | Visualizations
Total [hours] | 21:26:00 | |

The duration of the language pattern matching depends not only on the number of sentences matched, but also on the number of possible combinations of patterns to be matched against the sentences. An analysis with 41 simpler language patterns could be performed at 160 sentences per second. As the number and complexity of the language patterns grows, and with the results being written to a database, this speed is considerably reduced. We therefore chose to over-approximate the duration of pattern matching over 4,006 sentences as one hour (i.e., 1.11 sentences per second), which is well above the actual duration of this step. The processing of the results with part-of-speech tagging is quicker and lasts approximately 30 minutes for this dataset. Assuming the above estimations, the steps had the following durations:

• $T_{step1} = \sum_{i=a}^{c} T_{step1,i} = 27 + 9N_P + 2.14286N_P + 0.00962N_R = 27 + 11.14286N_P + 0.00962N_R$
• $T_{step2} = T_{step2,a} = 0.00125N_S$
• $T_{stepX} = 0.01498N_S$
• $T_{step5} = T_{step5,b} = 0.00749N_S$
• $T_{step6} = 1,000 + 0.01997N_S$

Together, they form the following formula:

$T_{total(automated)} = 1,027 + 11.14286N_P + 0.00962N_R + 0.04369N_S$

We can see that the process for 4,006 sentences is reduced from 88:14 to 21:26 hours, and that each additional sentence adds only 2.62 seconds to the total time. Of these 2.62 seconds, 1.19 s (45.71%) are spent on the data analysis step, 0.90 s (34.29%) on pattern matching, 0.45 s (17.14%) on processing the results, and a negligible part on sentence splitting (0.07 s; 2.86%).
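Analogously, a minimal sketch of $T_{total(automated)}$ with the coefficients from the per-step formulas above; again, the function name is ours and the script merely reproduces the figures reported in this subsection.

```python
def automated_minutes(products: float, reviews: float, sentences: float) -> float:
    """T_total(automated) in minutes: product selection and result
    interpretation stay manual; crawling, splitting, pattern matching,
    and result processing are automated."""
    return 1027 + 11.14286 * products + 0.00962 * reviews + 0.04369 * sentences

if __name__ == "__main__":
    total = automated_minutes(products=7, reviews=624, sentences=4006)
    print(f"{total:.0f} minutes = {total / 60:.1f} hours")  # ~1286 minutes = ~21.4 hours
    print(f"{0.04369 * 60:.2f} s per additional sentence")  # ~2.62 s
```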
Based on the formula for $T_{total(automated)}$, we can determine that analyzing a threefold number of sentences (12,000 sentences) increases the total time spent by only 8.61 hours to 30:03 hours, of which the largest portion (18 hours) is still spent on performing the largely manual interpretation of the analysis results.

5 Discussion

Our study showed that manual elicitation of requirements from user reviews is possible, but very time-intensive. For only 624 user reviews, classifying their 4,006 sentences took on average 12:46 hours of non-continuous work. This is in line with the effort reported in related work, in which an annotator spent 13:36 hours to classify topics of 327 reviews [GW13], between 8:00 and 12:30 hours per annotator to categorize 900 reviews [GM14], or an average of 13.6 hours to label 1,350 tweets [GIG17]. A manual analysis of even larger amounts will therefore be very tedious [FLL+13]. An essential finding of this study is that it is especially the number of sentences that determines the effort of the analysis. We observed that each additional sentence is worth about a minute of the analysis when manually performed by three persons.

We additionally found tagging time to strongly correlate with session length, suggesting that a human annotator gets exhausted after some time. In addition to this fatigue, which is a logical consequence of concentrated reading, user reviews contain many redundancies. Such work becomes tedious and uninteresting for a human annotator, which would likely lead to less precise work when performing the categorization and processing the results. For optimal performance, tagging sessions should be short with ample breaks, limiting the tagging activity to about two hours (about 20% FTE, or 600 sentences) on a given work day. On a monthly basis, this would translate to roughly 13,200 sentences or about 2,057 reviews per month (assuming no learning effect) that three persons would be able to handle efficiently. Moreover, the other process steps would require additional work time. The manual analysis of the reviews of an app such as the Amazon Video app, with on average 2,280 user reviews per month, would according to our calculation require 1.48 person months of intensive work. From this perspective, this number of user reviews seems too large to analyze sensibly manually.
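Purely as a back-of-the-envelope check of the monthly figures above, the following sketch reuses the Section 4.2 cost model; the conversion constants (22 working days per month, 6.42 sentences per review, roughly 173 working hours per person month, and a single product) are our own assumptions and only approximate the calculation in the text.

```python
SENTENCES_PER_REVIEW = 6.42   # average from Table 1 (assumption for the conversion)
WORKING_DAYS_PER_MONTH = 22   # assumed
HOURS_PER_PERSON_MONTH = 173  # assumed: ~8 h/day over ~21.7 days

def manual_minutes(products: float, reviews: float, sentences: float) -> float:
    """Same cost model as in Section 4.2."""
    return 1399 + 11.14286 * products + 0.00962 * reviews + 0.95132 * sentences

# Tagging capacity of three annotators at ~600 sentences per work day:
monthly_sentences = 600 * WORKING_DAYS_PER_MONTH
print(monthly_sentences, round(monthly_sentences / SENTENCES_PER_REVIEW))  # 13200, ~2056 reviews

# Amazon Video example: ~2,280 reviews per month for one product.
reviews = 2280
sentences = reviews * SENTENCES_PER_REVIEW
minutes = manual_minutes(products=1, reviews=reviews, sentences=sentences)
print(round(minutes / 60 / HOURS_PER_PERSON_MONTH, 2))  # ~1.48 person months
```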
The relationship between the amount of user feedback to be analyzed and the total time of the analysis can be expressed as a polynomial, which might suggest that manual analysis does scale well. However, from a business perspective, it must be taken into account that human work is not cheap. Our first argument is that the coefficients that describe the relationship have high values (one sentence is worth around a minute of work). Secondly, the analysis is a repetitive and cognitively demanding task, and the associated need for taking breaks strongly affects the practical scalability. Thus, the upper boundary for manual analyses to elicit requirements, according to our calculations, seems to be around 2,000 user reviews per month. For greater amounts, the potential to continue performing the analysis manually should seriously be called into question (RQ1).

Table 5 shows that when introducing automation into the analysis of user feedback, the steps to crawl, process, and annotate sentences become obsolete as manual activities, and especially the categorization of sentences is drastically shortened (RQ2). The product selection remains a manual task. The presentation and interpretation of the results are tool-supported activities that currently still involve quite a lot of manual work. With reliable measurements of the time required to implement tooling, it would be possible to determine a break-even point for the investment in tooling. The investment is likely to pay itself off very quickly as the number of analyzed sentences increases, even if the automation is still being fine-tuned in the process. The repetitive nature of the work makes it an unpleasant task for humans, but filtering out irrelevant and repetitive contents would help. However, machine learning will have some difficulty with finer nuances in language, including slang, expressions (e.g., "I got a lemon"), and sarcasm.

6 Threats to Validity

We discuss the validity threats to our study based on the guidelines by Wohlin et al. [WRH+12].

Construct validity. The opinions of users are inherently ambiguous, among others because natural language is informal, and the authors of user reviews can be computer laypeople, non-native English speakers, or people who simply write without much attention. This made tagging difficult and explains the considerable amount of time it took. Another threat is that the annotators had experience with compact cameras, but not as users of the analyzed models, which might have caused misinterpretations and occasionally required re-reading sentences for better understanding. To reduce bias and misinterpretation, we chose annotators who speak English fluently, who received training on the ISO 25010 quality characteristics, iteratively refined the tagging schema, and performed a pilot tagging. We have not compared the quality of the outcomes, although it can be assumed that especially the recall of the automated analysis will be lower than that of three human raters.

Internal validity. We selected the domain and type of product of the reviews arbitrarily. The goal behind this decision was to investigate a set of opinions regarding particular software-intensive systems. To mitigate the subjectivity of the experts in assigning tags, three reviewers performed the tagging. We omitted products from our sample that were not on Amazon or had extreme deviations from the mean, to reduce the differences in the numbers and lengths of reviews per camera. For each of the remaining cameras, we analyzed all of the reviews available at the time of crawling.

External validity. The user reviews were about software-intensive products and were taken exclusively from Amazon. We argue that this data source contains representative opinions about the cameras. Although the results may not be directly comparable to the tagging of user reviews of mobile apps, we did see commonalities in the time required to perform the tagging. However, reviews for these kinds of products may be longer, which may have caused our formula to somewhat underestimate the influence of the number of reviews on the analysis time. Next, the effort of our three annotators should be treated as an upper bound, as experienced annotators might work faster. However, the times we report are comparable to those in other scientific works. We have assessed the scalability of automation with only one tool, but have been generous in estimating the time it requires to perform an analysis, so we are confident that we have been able to cover the performance of many comparable algorithms and tools.

Conclusion validity. A potential threat relates to the size of our sample of data.
The set contained opinions about seven cameras with 624 reviews. We simplified the formula by assuming it to be linear, although in practice several aspects are not necessarily linear (e.g., human annotators not being able to perform their work continuously).

7 Conclusions & Future Work

In this paper, based on the manual analysis of online user reviews of smart compact cameras, we determined that it is possible to perform the process of deriving requirements from user feedback manually. However, due to the required effort and the fatigue that occurs, we found that manual analysis does not scale well. Additionally, manually analyzing 2,000 user reviews per month seems to be the upper limit when continuously monitoring the user feedback a product receives (e.g., to identify problems and ideas for enhancements).

The main contribution of this paper is twofold. Research on NLP has often expressed the assumption that manually analyzing large amounts of user feedback is hopeless work, but this claim was based on gut feeling. Our study is the first investigation we know of into the effort involved in performing such an analysis. We have provided tentative evidence in favor of this perception. The second contribution is that we have put an approximate number on the limits of manual analysis, along with a formula that distinguishes between the fixed and the variable effort. This might support a company in assessing the effort needed to analyze user feedback about their product(s) and in deciding whether humans should perform the work.

Hence, we have shown that NLP techniques and algorithms are indispensable for being able to monitor what users write about a product. By performing a benchmarking, we have shown that even with very cautious estimates, these techniques scale well to large amounts of user feedback. When introducing automation, manual steps regarding data import/export, tagging, and categorization can be omitted or receive an efficiency boost. However, some manual steps remain, and currently much effort is still needed to develop and fine-tune tool support. Although automated approaches are often criticized for missing aspects that are hard to classify without a better understanding of the context and the finesses of a language (i.e., resulting in a lower recall), human annotators can be unintentionally biased when reading sentences, possibly emphasizing less important topics. The question then remains whether the speed and convenience of using NLP techniques outweighs the loss of recall, which is currently often found to be around 0.70 (e.g., [WM17]).

One direction for future research could be to propose more accurate formulas modelling the relationship between the total effort and the amount of user feedback. However, we would rather suggest that future research focus on improving the results of the automated analysis, and on better deriving user requirements from these results. At best, a more thorough analysis could suggest a formula for determining the break-even point between automation and manual analysis, which would help practitioners determine whether they should use automation to elicit requirements from user feedback.

Acknowledgment

The research described in this paper was performed in the project Opti4Apps (grant no. 02K14A182) of the German Federal Ministry of Education and Research.

References

[FLL+13] B. Fu, J. Lin, L. Li, C. Faloutsos, J. Hong, N. Sadeh. Why People Hate Your App: Making Sense of User Feedback in a Mobile App Store.
19th ACM SIGKDD Intl Conference on Knowledge Discovery and Data Mining, pp. 1276–1284. ACM, 2013.

[GW13] L. V. Galvis Carreño, K. Winbladh. Analysis of User Comments: An Approach for Software Requirements Evolution. 35th Intl Conference on Software Engineering, pp. 582–591. IEEE Press, 2013.

[GKH+17] E. C. Groen, S. Kopczyńska, M. P. Hauer, T. D. Krafft, J. Doerr. Users–The Hidden Software Product Quality Experts? 25th Intl RE Conference, pp. 72–81. IEEE, 2017.

[GSA+17] E. C. Groen, N. Seyff, R. Ali, F. Dalpiaz, J. Doerr, E. Guzman, et al. The Crowd in Requirements Engineering: The Landscape and Challenges. IEEE Software, 34(2), pp. 44–52. IEEE, 2017.

[GM14] E. Guzman, W. Maalej. How Do Users Like This Feature? A Fine Grained Sentiment Analysis of App Reviews. 22nd Intl RE Conference, pp. 153–162. IEEE, 2014.

[GAS16] E. Guzman, R. Alkadhi, N. Seyff. A Needle in a Haystack: What Do Twitter Users Say about Software? 24th Intl RE Conference, pp. 96–105. IEEE, 2016.

[GIG17] E. Guzman, M. Ibrahim, M. Glinz. A Little Bird Told Me: Mining Tweets for Requirements and Software Evolution. 25th Intl RE Conference, pp. 11–20. IEEE, 2017.

[HK07] A. F. Hayes, K. Krippendorff. Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures, 1, pp. 77–89, 2007.

[IH13] C. Iacob, R. Harrison. Retrieving and Analyzing Mobile Apps Feature Requests from Online Reviews. 10th International Conference on Mining Software Repositories, pp. 41–44, 2013.

[ISO11] International Organization for Standardization. ISO/IEC 25010:2011 – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. ISO, 2011.

[JM17] N. Jha, A. Mahmoud. Mining User Requirements from Application Store Reviews Using Frame Semantics. In: P. Grünbacher, A. Perini (eds.), Requirements Engineering: Foundation for Software Quality, LNCS, vol. 10153, pp. 273–287. Springer, 2017.

[KM17] Z. Kurtanović, W. Maalej. Mining User Rationale from Software Reviews. 25th Intl RE Conference, pp. 53–62. IEEE, 2017.

[MN15] W. Maalej, H. Nabil. Bug Report, Feature Request, or Simply Praise? On Automatically Classifying App Reviews. 23rd Intl RE Conference, pp. 116–125. IEEE, 2015.

[Neu02] K. Neuendorf. The Content Analysis Guidebook. Sage Publications, 2002.

[PM13] D. Pagano, W. Maalej. User Feedback in the AppStore: An Empirical Study. 21st Intl RE Conference, pp. 125–134. IEEE, 2013.

[PDG+15] S. Panichella, A. Di Sorbo, E. Guzman, C. Visaggio, G. Canfora, H. Gall. How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution. 31st Intl Conference on Software Maintenance and Evolution, pp. 281–290, 2015.

[RSW08] B. Regnell, R. B. Svensson, K. Wnuk. Can We Beat the Complexity of Very Large-Scale Requirements Engineering? In: B. Paech, C. Rolland (eds.), Requirements Engineering: Foundation for Software Quality, LNCS, vol. 5025, pp. 123–128. Springer, 2008.

[SAH+15] F. Sarro, A. A. Al-Subaihin, M. Harman, Y. Jia, W. Martin, Y. Zhang. Feature Lifecycles as They Spread, Migrate, Remain, and Die in App Stores. 23rd Intl RE Conference, pp. 76–85. IEEE, 2015.

[WM17] G. Williams, A. Mahmoud. Mining Twitter Feeds for Software User Requirements. 25th Intl RE Conference, pp. 1–10. IEEE, 2017.

[WRH+12] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén. Experimentation in Software Engineering. Springer, 2012.