<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Mobile Application Review Classifier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nishant Jha</string-name>
          <email>njha1@lsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anas Mahmoud</string-name>
          <email>mahmoud@csc.lsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Division of Computer Science and Engineering Louisiana State University</institution>
          ,
          <addr-line>Baton Rouge, LA, 70803</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Mobile application stores enable end-users of software to directly express their needs and share their experience with mobile apps in the form of textual reviews. These reviews often contain important user feedback that app developers can leverage to understand their end-users' needs. However, such information is not readily available, and vetting individual reviews manually can be a tedious task. To alleviate this effort, we introduce MARC, a Mobile Application Review Classifier. MARC is a stand-alone automated solution that enables developers to extract and classify user reviews into fine-grained software maintenance requests, including bug reports and user requirements. MARC is equipped with a set of configuration features to enable practitioners and researchers to classify user reviews under different settings. A dataset of app reviews sampled from three apps is used to evaluate the performance of MARC. The results show that MARC achieves accuracy levels that can be adequate for practical and research applications. Copyright 2017 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        As of March 2015, the Apple App Store alone has reported around 2.25 million active apps, growing by over 1,000 apps per day. This scale of app production has resulted in an unprecedented level of competition in the app market, forcing software creators to look beyond traditional, time-consuming software engineering practices toward a new paradigm of methods that enable a more responsive software production process. Recent analysis of large datasets of app user reviews has revealed that almost one third of these reviews contain information that is useful to app developers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Users not only report technical bugs that they find in the apps they use, but also express features that they would like, or would not like, to see in newer versions of the application [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To help app developers effectively extract such information, in this paper we introduce MARC, a Mobile Application Review Classifier. MARC is a stand-alone tool that enables app developers to extract and classify the most recent user reviews from the iOS App Store into fine-grained software maintenance requests, including bug reports and user requirements.
      </p>
      <p>
        MARC (https://github.com/seelprojects/MARC) supports two classification approaches: Bag-of-Words (BOW) and Bag-of-Frames (BOF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The former uses the individual words of review sentences as classification features. The latter relies on the notion of semantic role labeling (SRL) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. SRL is used to generalize from raw text (individual words) to more abstract scenarios (contexts) by characterizing the lexical meaning of the words in the form of semantic units, or frames [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The main objective is to reduce the dimensionality of the classification data and, consequently, enhance the predictive capabilities of the classifier.
      </p>
      <p>MARC is also equipped with several text pre-processing features to allow both researchers and practitioners to classify app reviews under different configuration settings. The performance of MARC is evaluated using user reviews collected from three iOS apps from different application domains. In what follows, we describe MARC, its basic features, its classification engine, and our evaluation process in greater detail.</p>
    </sec>
    <sec id="sec-2">
      <title>MARC's Features</title>
      <p>MARC provides a set of features that enable users to automatically extract reviews from the iOS App Store and experiment with different classification settings. In what follows, we describe these features in greater detail.</p>
      <sec id="sec-2-1">
        <title>App Review Extraction</title>
        <p>MARC provides a feature for extracting recent user reviews from the iOS App Store. The user can select any app through its unique App Store ID. MARC then makes a web request to the App Store's RSS feed generator. The generated JSON pages are parsed to extract the selected app's reviews. The user can extract between 50 and 500 reviews at a time. For example, to get the list of the most recent Gmail app reviews, MARC makes a Web request to the App Store's RSS feed using the Gmail app's ID.</p>
      </sec>
      <sec id="sec-2-2">
<title>Text Pre-processing</title>
        <p>
          It is not uncommon in text classification tasks to use text reduction strategies to minimize the number of classification features (words). The objective is to keep only important words that have an actual impact on the predictive capabilities of the classifier [
          <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
          ]. The current release of MARC supports the following text pre-processing techniques:
– Stemming: Stemming reduces words to their morphological roots by removing derivational and inflectional suffixes. This leads to a reduction in the number of features (words) in text, as only one base form of each word is considered. MARC supports stemming through the Porter stemmer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
– Stop-word removal: MARC provides a feature for removing English words that are considered too generic (e.g., the, in, will). These words appear in most reviews and are highly unlikely to be distinctive to the classifier.
– Sentence extraction: A single user review might include a user requirement, a bug report, and some other irrelevant or unuseful feedback. Therefore, to help developers better extract information, MARC processes reviews one sentence at a time, relying on the punctuation available in the review's text [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9,10,11</xref>
          ].
        </p>
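        <p>The three techniques above compose into a simple pipeline. The sketch below is a minimal, stdlib-only illustration: the suffix-stripping rules are a toy stand-in for the full Porter stemmer, and the stop-word list is a small illustrative subset.</p>

```python
import re

# Toy pre-processing pipeline: sentence extraction, stop-word removal,
# and simplified stemming (a stand-in for the Porter stemmer MARC uses).

STOP_WORDS = {"the", "a", "an", "in", "on", "to", "is", "it", "will",
              "and", "or", "of", "i", "this", "that", "so", "when", "into"}

def split_sentences(review):
    """Split a review into sentences using its punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

def crude_stem(word):
    """Toy suffix stripping; the real Porter stemmer has many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    """Sentence-split, lowercase, drop stop words, then stem each token."""
    processed = []
    for sentence in split_sentences(review):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
        processed.append(tokens)
    return processed
```

        <p>Processing reviews sentence by sentence, as above, is what lets each bug report or feature request inside a mixed review be classified separately.</p>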
      </sec>
    </sec>
    <sec id="sec-3">
<title>MARC's Classification Engine</title>
      <p>The core feature of MARC is to classify technically informative user reviews into user requirements and bug reports. To facilitate this process, MARC supports two different classification techniques: Bag-of-Words and Bag-of-Frames. The following is a description of MARC's classification engine and its classification techniques.</p>
      <sec id="sec-3-1">
        <title>Frame Semantics</title>
        <p>
          In addition to the classical BOW classification approach, MARC supports a more semantically-aware form of classification, known as Bag-of-Frames (BOF). This approach relies on the notion of Semantic Role Labeling (SRL) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. SRL allows generalizing from raw text to more abstract scenarios called frames [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A semantic frame (or simply a frame) can be described as a schematic representation of a situation (events, actions) involving various elements. A frame element (FE) can be defined as a participant entity or a semantic role in the action described by the frame. Lexical units (LU) are basically the words that evoke different frame elements. For instance, the frame TRAVEL describes an event in which a traveler moves from a source location to a goal along a path, or within an area. This frame has the core frame elements traveler and goal. In the sentence "Lisa traveled to Germany.", the subject 'Lisa' evokes the frame element Traveler and the phrase 'to Germany' evokes the frame element Goal. This unique form of semantic annotation allows for a deeper understanding of the semantic information in individual user reviews. Using abstract general meanings of text rather than exact words helps to reduce the number of classification features, which in turn enables a more efficient classification process and reduces the risk of overfitting [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          To support frame semantics, MARC uses SEMAFOR (demo.ark.cs.cmu.edu/parse), a probabilistic frame-semantic parser, to parse each review [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. SEMAFOR automatically processes English sentences according to the form of semantic analysis in FrameNet [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. MARC makes a web request to the SEMAFOR parser to obtain the annotations for each sentence. The generated annotations are represented using JSON. A special parser is used to extract the frames of each annotated sentence from the JSON output. For example, the following review sentences are parsed as follows:
1) "It crashed when I zoomed into the page." evokes Cause_impact, Temporal_collocation, Contacting.
2) "Please add gif comments." evokes Stimulus_focus, Statement, Statement.
3) "I love the app so much." evokes Experiencer_focus, Relational_quantity.
        </p>
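        <p>The frame-extraction step can be sketched as follows. The JSON schema assumed here (a "frames" list whose items carry a "target" with a frame "name") is a simplification; SEMAFOR's real output may nest its annotations differently.</p>

```python
import json

# Hypothetical sketch of the "special parser" that turns SEMAFOR-style
# JSON annotations into Bag-of-Frames features. Schema is an assumption.

def frames_of(annotation):
    """Return the frame names evoked in one annotated sentence."""
    return [frame["target"]["name"] for frame in annotation.get("frames", [])]

def bag_of_frames(annotation_json):
    """Convert a JSON annotation string into a Bag-of-Frames feature list."""
    return frames_of(json.loads(annotation_json))
```

        <p>The resulting frame names then replace the raw words as classification features, which is what shrinks the feature space under the BOF approach.</p>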
      </sec>
      <sec id="sec-3-2">
<title>Classification Algorithms</title>
        <p>
          MARC uses Support Vector Machines (SVM) and Naive Bayes (NB) to classify app reviews. These two classifiers have been shown to be very effective in app review classification tasks, detecting different types of user reviews at decent levels of accuracy over multiple datasets [
          <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
          ].
        </p>
        <p>
          MARC uses the Weka API (http://www.cs.waikato.ac.nz/ml/weka/) as the core classifier. This API is used to convert the input review's text into a Weka-compatible file format (.arff). The filter StringToWordVector is applied to generate the word-by-document matrix for the input review to be classified. The classifier then uses term frequency (TF) to assign weights to words. MARC uses a default training dataset of manually classified reviews to train and test the underlying classification engine. This dataset was compiled from different sources, including two datasets collected from previous related research [
          <xref ref-type="bibr" rid="ref14 ref2">2,14</xref>
          ] and a dataset that was collected locally. The BOW and BOF representations of the data are available in external files that the end-user of MARC can edit to add more examples to the training data. The generated classification model is then used to classify each sentence in the input review individually.
        </p>
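        <p>To make the TF-plus-NB pipeline concrete, here is a stdlib-only stand-in for the Weka workflow described above: term-frequency counts feed a multinomial Naive Bayes classifier. MARC itself uses Weka's StringToWordVector filter and classifiers; the training sentences below are made up for illustration.</p>

```python
import math
from collections import Counter, defaultdict

# Illustrative stand-in for the Weka pipeline: TF features + multinomial NB.

class NaiveBayesTF:
    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)   # per-class word counts (TF)
        self.priors = Counter(labels)        # class frequencies
        self.vocab = set()
        for text, label in zip(sentences, labels):
            words = text.lower().split()
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, sentence):
        scores = {}
        for label, prior in self.priors.items():
            total = sum(self.counts[label].values())
            score = math.log(prior / sum(self.priors.values()))
            for w in sentence.lower().split():
                # Laplace smoothing keeps unseen words from zeroing a class.
                score += math.log((self.counts[label][w] + 1) /
                                  (total + len(self.vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical miniature training set (MARC ships a much larger default one).
clf = NaiveBayesTF().fit(
    ["the app crashed on launch", "it crashes when i zoom",
     "please add dark mode", "i would like an export feature"],
    ["bug", "bug", "requirement", "requirement"])
```

        <p>As in MARC, each sentence of an incoming review would be fed to `predict` individually, so a single review can yield both a bug report and a user requirement.</p>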
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Limitations</title>
      <p>The performance of MARC is evaluated using reviews collected from three apps selected from the iOS App Store: Adobe Acrobat (469337564), Chrome (535886823), and CNN (331786748). A total of 100 of the most recent reviews of these apps were extracted. Each sentence was then manually classified as a bug report, a user requirement, or other. In total, 39 bug reports, 18 user requirements, and 43 other miscellaneous reviews were identified.</p>
      <p>MARC is then used to automatically classify these reviews. The default dataset is used to train the classifier. For evaluation purposes, the reviews were classified using both the BOW and the BOF approaches. We further experimented with the text pre-processing features under the BOW approach. The results of the classification process are shown in Table 1. The performance of MARC is measured using precision and recall.</p>
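      <p>For clarity, precision and recall are computed per class from the manual labels, as sketched below; the example labels are made up for illustration.</p>

```python
# Per-class precision and recall over parallel predicted/actual label lists,
# matching the per-class evaluation described above.

def precision_recall(predicted, actual, positive):
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(a == positive != p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```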
      <p>
        The results show that under the BOF approach, MARC managed to achieve an average of 75% precision and 93% recall using SVM, and an average of 57% precision and 72% recall using NB. In comparison, under the BOW approach, MARC managed to achieve an average of 32% precision and 55% recall using SVM, and an average of 55% precision and 72% recall using NB. On average, the best results over our evaluation dataset were achieved using SVM under the BOF approach. A thorough evaluation of the BOW and the BOF approaches is available in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The current release of MARC is intended for both practical and research applications. App creators can use MARC to quickly access and classify the most recent reviews of their apps. Researchers, on the other hand, can use MARC to prepare and classify large datasets under different classification settings. However, MARC still suffers from performance limitations that need to be addressed in future releases. For instance, MARC currently takes around 200 seconds to train the classification model on a 3.30GHz CPU with 8.0GB of RAM. The current running time could be improved by implementing faster classification algorithms. Furthermore, classification results tend to be less accurate when classifying reviews from application domains that have never been classified before. Our expectation is that the classification accuracy could be significantly improved by implementing a feedback mechanism that keeps updating the training dataset with new instances.</p>
      <p>In this paper, we introduced MARC, a Mobile Application Review Classifier. MARC provides a set of features that enable users to extract reviews from the iOS App Store and classify them into actionable software maintenance requests, including bug reports and user requirements. MARC provides a set of text pre-processing features to allow users to classify input reviews under different configuration settings. The current release of MARC supports Bag-of-Words (BOW) and Bag-of-Frames (BOF) representations of the input text. Semantic frames are used to classify the input review sentences based on their context, or meaning, rather than relying on their individual words. MARC was evaluated using reviews collected from three sample applications. The results showed levels of accuracy that can be adequate for practical applications, with the BOF approach slightly outperforming the BOW approach. To enhance its practicality, future releases of MARC will support more application stores (e.g., Google Play and the Windows App Store), more classification algorithms, and other classification features (e.g., n-grams and POS tagging) to improve the accuracy and performance of the classification engine.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work was supported by the Louisiana Board of Regents Research
Competitiveness Subprogram, contract number: LEQSF(2015-18)-RD-A-07.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Pagano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maalej</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>User feedback in the AppStore: An empirical study</article-title>
          .
          <source>In: Requirements Engineering Conference</source>
          . (
          <year>2013</year>
          )
          <volume>125</volume>
          –
          <fpage>134</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Maalej</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nabil</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Bug report, feature request, or simply praise? On automatically classifying app reviews</article-title>
          . In: Requirements Engineering Conference. (
          <year>2015</year>
          )
          <volume>116</volume>
          –
          <fpage>125</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahmoud</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mining user requirements from application store reviews using frame semantics</article-title>
          . In:
          <source>Requirements Engineering: Foundation for Software Quality (REFSQ)</source>
          . (
          <year>2017</year>
          )
          <volume>1</volume>
          –
          <fpage>15</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fillmore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Frame semantics and the nature of language</article-title>
          .
          <source>In: Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech</source>
          . (
          <year>1976</year>
          )
          <volume>20</volume>
          –
          <fpage>32</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fillmore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The Berkeley Framenet project</article-title>
          .
          <source>In: International Conference on Computational Linguistics</source>
          . (
          <year>1998</year>
          )
          <volume>86</volume>
          –
          <fpage>90</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A comparative study on feature selection in text categorization</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . (
          <year>1997</year>
          )
          <volume>412</volume>
          –
          <fpage>420</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rogati</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>High-performing feature selection for text classification</article-title>
          .
          <source>In: International Conference on Information and Knowledge Management</source>
          . (
          <year>2002</year>
          )
          <volume>659</volume>
          –
          <fpage>661</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          <volume>14</volume>
          (
          <year>1980</year>
          )
          <volume>130</volume>
          –
          <fpage>137</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Carreno</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winbladh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Analysis of user comments: An approach for software requirements evolution</article-title>
          . In: International Conference on Software Engineering. (
          <year>2013</year>
          )
          <volume>582</volume>
          –
          <fpage>591</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Panichella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Sorbo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Visaggio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canfora</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gall</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>How can I improve my app? Classifying user reviews for software maintenance and evolution</article-title>
          .
          <source>In: International Conference on Software Maintenance and Evolution</source>
          . (
          <year>2015</year>
          )
          <volume>281</volume>
          –
          <fpage>290</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Guzman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maalej</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>How do users like this feature? A fine-grained sentiment analysis of app reviews</article-title>
          . In: Requirements Engineering Conference. (
          <year>2014</year>
          )
          <volume>153</volume>
          –
          <fpage>162</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Mitchell, T.: Machine Learning.
          <source>McGraw-Hill</source>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Probabilistic frame-semantic parsing</article-title>
          .
          <source>In: Human Language Technologies</source>
          . (
          <year>2010</year>
          )
          <volume>948</volume>
          –
          <fpage>956</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>AR-Miner: Mining informative reviews for developers from mobile app marketplace</article-title>
          .
          <source>In: International Conference on Software Engineering</source>
          . (
          <year>2014</year>
          )
          <volume>767</volume>
          –
          <fpage>778</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>